Sampling and empirical risk minimization

Stéphan Clémençon; Patrice Bertail; Emilie Chautru

doi:10.1080/02331888.2016.1259810

Journal article, Statistics, Year: 2016


Stéphan Clémençon

Role: Author
PersonId : 174491
IdHAL : stephan-clemencon
ORCID : 0000-0002-5879-9500
IdRef : 08905203X

Laboratoire Traitement et Communication de l'Information

Département Images, Données, Signal

Patrice Bertail

Role: Author
PersonId : 17670
IdHAL : patrice-bertail
ORCID : 0000-0002-6011-3432
IdRef : 034681280

Modélisation aléatoire de Paris X

Centre de Recherche en Économie et Statistique

Emilie Chautru

Role: Author

Centre de Géosciences

Abstract

In certain situations, which will undoubtedly become more and more common in the Big Data era, the available datasets are so massive that computing statistics over the full sample is hardly feasible, if not infeasible. A natural approach in this context consists in using survey schemes and substituting the ‘full data’ statistics with their counterparts based on the resulting random samples, of manageable size. The main purpose of this paper is to investigate the impact of survey sampling on statistical learning methods based on empirical risk minimization, through the standard binary classification problem, considered here as a ‘case in point’. Precisely, we prove that, in the presence of auxiliary information, appropriate use of optimally coupled Poisson survey plans may not affect the learning rates much, while possibly reducing significantly, with overwhelming probability, the number of terms that must be averaged to compute the empirical risk functional. These striking results are then shown to extend to more general sampling schemes by means of a coupling technique originally introduced by Hájek [Asymptotic theory of rejective sampling with varying probabilities from a finite population. Ann Math Stat. 1964;35(4):1491–1523].
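To make the idea of replacing a 'full data' statistic with its survey-sample counterpart concrete, here is a minimal sketch of a Horvitz-Thompson estimate of the empirical 0-1 risk under a Poisson design. It is an illustration under stated assumptions, not the authors' procedure: the toy data, the threshold classifier, the function names and the inclusion probabilities are all hypothetical, and the optimally coupled plans and auxiliary information studied in the paper are not reproduced.

```python
import math
import random


def poisson_sample(inclusion_probs, rng):
    """Poisson design: unit i is kept independently with probability p_i."""
    return [i for i, p in enumerate(inclusion_probs) if rng.random() < p]


def horvitz_thompson_risk(losses, inclusion_probs, sample):
    """Horvitz-Thompson estimate of the full-data risk (1/n) * sum_i loss_i.

    Each sampled loss is reweighted by 1 / p_i, which makes the estimate
    unbiased for the full-data average when all p_i are positive.
    """
    n = len(losses)
    return sum(losses[i] / inclusion_probs[i] for i in sample) / n


def demo(n=100_000, seed=0):
    rng = random.Random(seed)
    # Hypothetical toy data: 1-D features, labels in {-1, +1}, 10% label noise.
    xs = [rng.gauss(0.0, 1.0) for _ in range(n)]
    ys = []
    for x in xs:
        y = 1 if x > 0.3 else -1
        if rng.random() < 0.1:
            y = -y
        ys.append(y)
    # 0-1 loss of one candidate classifier, sign(x).
    losses = [float((1 if x > 0.0 else -1) != y) for x, y in zip(xs, ys)]
    # Assumed inclusion probabilities (between 0.02 and 0.12), chosen only
    # for illustration; they are not the optimal ones from the paper.
    probs = [0.02 + 0.1 * math.exp(-abs(x)) for x in xs]
    sample = poisson_sample(probs, rng)
    full_risk = sum(losses) / n
    ht_risk = horvitz_thompson_risk(losses, probs, sample)
    return full_risk, ht_risk, len(sample)


if __name__ == "__main__":
    full_risk, ht_risk, m = demo()
    print(f"full-data risk: {full_risk:.4f}")
    print(f"Horvitz-Thompson risk: {ht_risk:.4f} (from {m} sampled terms)")
```

In this sketch the weighted average is computed over only the sampled units, a small fraction of the n terms, while remaining unbiased for the full-data risk; how close the two are in practice depends on the chosen inclusion probabilities.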

Keywords

Empirical process; empirical risk minimization; generalization bound; Horvitz-Thompson estimation; Poisson design; survey sampling

Domains

Mathematics [math]; Machine Learning [stat.ML]; Statistics [math.ST]; Probability [math.PR]; Theory [stat.TH]


https://telecom-paris.hal.science/hal-02107516

Submitted on: Tuesday, April 23, 2019, 16:42:48

Last modified on: Friday, April 26, 2024, 16:48:03

Dates and versions

hal-02107516 , version 1 (23-04-2019)

Identifiers

HAL Id : hal-02107516 , version 1
DOI : 10.1080/02331888.2016.1259810

Cite

Stéphan Clémençon, Patrice Bertail, Emilie Chautru. Sampling and empirical risk minimization. Statistics, 2016, 51 (1), pp.30-42. ⟨10.1080/02331888.2016.1259810⟩. ⟨hal-02107516⟩


Collections

X INSTITUT-TELECOM ENSMP GENES CNRS ENSAE INSMI ENSMP_GEOSCIENCES PARISTECH CREST ENSAI PSL MODALX X-CREST ENSMP_DR LTCI IDS S2A UNIV-PARIS-LUMIERES UNIV-PARIS-NANTERRE
