Journal articles

Sampling and empirical risk minimization

Abstract : In certain situations, which will undoubtedly become more and more common in the Big Data era, the available datasets are so massive that computing statistics over the full samples is hardly feasible, if not infeasible. A natural approach in this context consists of using survey schemes and substituting the ‘full data’ statistics with their counterparts based on the resulting random samples of manageable size. The main purpose of this paper is to investigate the impact of survey sampling on statistical learning methods based on empirical risk minimization, through the standard binary classification problem, considered here as a ‘case in point’. Precisely, we prove that, in the presence of auxiliary information, appropriate use of optimally coupled Poisson survey plans may not affect the learning rates much, while possibly reducing significantly the number of terms that must be averaged to compute the empirical risk functional with overwhelming probability. These striking results are next shown to extend to more general sampling schemes by means of a coupling technique originally introduced by Hájek [Asymptotic theory of rejective sampling with varying probabilities from a finite population. Ann Math Stat. 1964;35(4):1491–1523].
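
The substitution described in the abstract can be illustrated with a minimal sketch; it is not the authors' code. It assumes a 0-1 loss for binary classification, Poisson sampling (each observation is kept independently with its own inclusion probability), and a Horvitz-Thompson weighting of the sampled losses as the counterpart of the full-data empirical risk. The function names, the synthetic data and the way inclusion probabilities are derived from an auxiliary score are all hypothetical choices made for illustration.

```python
import random


def poisson_sample(inclusion_probs, rng):
    """Poisson sampling: keep unit i independently with probability p_i."""
    return [i for i, p in enumerate(inclusion_probs) if rng.random() < p]


def full_risk(losses):
    """Empirical risk on the full dataset: the plain average of all losses."""
    return sum(losses) / len(losses)


def ht_risk(losses, inclusion_probs, sampled):
    """Horvitz-Thompson estimate of the empirical risk.

    Only the sampled losses are read; each is reweighted by 1 / p_i,
    and the sum is still divided by the full population size.
    """
    n = len(losses)
    return sum(losses[i] / inclusion_probs[i] for i in sampled) / n


def demo(n=10000, seed=0):
    """Compare the full risk with its sampled counterpart on synthetic data."""
    rng = random.Random(seed)
    # Hypothetical auxiliary score in [0, 1), assumed known for every unit.
    aux = [rng.random() for _ in range(n)]
    # Synthetic 0-1 losses: errors are more likely when the score is high.
    losses = [1.0 if rng.random() < 0.1 + 0.4 * a else 0.0 for a in aux]
    # Assumed design: sample high-score units more often (never below 5%).
    probs = [0.05 + 0.25 * a for a in aux]
    sampled = poisson_sample(probs, rng)
    return full_risk(losses), ht_risk(losses, probs, sampled), len(sampled)


if __name__ == "__main__":
    full, estimate, m = demo()
    print(f"full risk: {full:.4f}, HT estimate: {estimate:.4f}, terms averaged: {m}")
```

In this sketch the estimate averages only the sampled terms, which is the reduction in computation the abstract refers to; how the inclusion probabilities should be chosen from the auxiliary information is the subject of the paper and is not reproduced here.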
Contributor : Stephan Clémençon
Submitted on : Tuesday, April 23, 2019 - 4:42:48 PM
Last modification on : Tuesday, October 19, 2021 - 11:14:14 AM



Stéphan Clémençon, Patrice Bertail, Emilie Chautru. Sampling and empirical risk minimization. Statistics, Taylor & Francis: STM, Behavioural Science and Public Health Titles, 2016, 51 (1), pp.30-42. ⟨10.1080/02331888.2016.1259810⟩. ⟨hal-02107516⟩


