Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks

Melih Barsbey; Milad Sefidgaran; Murat A Erdogdu; Gael Richard; Umut Şimşekli

Conference paper, Year: 2021


Melih Barsbey

Role: Author
PersonId: 1115691

Boğaziçi University [Istanbul]

Milad Sefidgaran

Role: Author
PersonId: 1115692

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Institut Polytechnique de Paris

Murat A Erdogdu

Role: Author
PersonId: 1115693

University of Toronto

Vector Institute

Gael Richard

Role: Author
PersonId: 14146
IdHAL: gael-richard
IdRef: 094977208

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Institut Polytechnique de Paris

Umut Şimşekli

Role: Author
PersonId: 6757
IdHAL: umut-simsekli
IdRef: 250884003

Signal, Statistique et Apprentissage

Département Images, Données, Signal

Institut Polytechnique de Paris

Abstract

Neural network compression techniques have become increasingly popular as they can drastically reduce the storage and computation requirements for very large networks. Recent empirical studies have illustrated that even simple pruning strategies can be surprisingly effective, and several theoretical studies have shown that compressible networks (in specific senses) should achieve a low generalization error. Yet, a theoretical characterization of the underlying causes that make the networks amenable to such simple compression schemes is still missing. In this study, focusing our attention on stochastic gradient descent (SGD), our main contribution is to link compressibility to two recently established properties of SGD: (i) as the network size goes to infinity, the system can converge to a mean-field limit, where the network weights behave independently [DBDFŞ20], (ii) for a large step-size/batch-size ratio, the SGD iterates can converge to a heavy-tailed stationary distribution [HM20, GŞZ21]. Assuming that both of these phenomena occur simultaneously, we prove that the networks are guaranteed to be '$\ell_p$-compressible', and the compression errors of different pruning techniques (magnitude, singular value, or node pruning) become arbitrarily small as the network size increases. We further prove generalization bounds adapted to our theoretical framework, which are consistent with the observation that the generalization error will be lower for more compressible networks. Our theory and numerical study on various neural networks show that large step-size/batch-size ratios introduce heavy tails, which, in combination with overparametrization, result in compressibility.

35th Conference on Neural Information Processing Systems (NeurIPS 2021).
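As a rough illustration of the link the abstract draws between heavy tails and magnitude pruning, the sketch below compares the relative pruning error of two synthetic weight vectors. It is not the authors' experiment: the i.i.d. draws are only a stand-in for mean-field independent weights, the Cauchy distribution is an assumed example of a heavy-tailed law, and the helper names `magnitude_prune` and `relative_error` are hypothetical.

```python
import numpy as np

def magnitude_prune(w, keep_ratio):
    """Keep the largest-magnitude entries of w and set the rest to zero."""
    k = max(1, int(np.ceil(keep_ratio * w.size)))
    pruned = np.zeros_like(w)
    # Indices of the k entries with the largest absolute value.
    idx = np.argpartition(np.abs(w), -k)[-k:]
    pruned[idx] = w[idx]
    return pruned

def relative_error(w, keep_ratio, p=2):
    """Relative l_p error incurred by magnitude pruning."""
    residual = w - magnitude_prune(w, keep_ratio)
    return np.linalg.norm(residual, ord=p) / np.linalg.norm(w, ord=p)

rng = np.random.default_rng(0)
n = 100_000  # number of synthetic "weights" (assumed value)

light = rng.standard_normal(n)   # light-tailed reference
heavy = rng.standard_cauchy(n)   # heavy-tailed example (infinite variance)

err_light = relative_error(light, keep_ratio=0.1)
err_heavy = relative_error(heavy, keep_ratio=0.1)
print(f"Gaussian weights: relative l2 error = {err_light:.3f}")
print(f"Cauchy weights:   relative l2 error = {err_heavy:.3f}")
```

With 10% of the entries kept, the heavy-tailed vector loses far less of its norm than the Gaussian one, because a few very large entries carry most of the mass. This matches the qualitative behaviour described in the abstract, under the assumptions stated above.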

Domains

Machine Learning [stat.ML]

Main file

HT_and_Compressibility.pdf (1.34 MB)

Origin: Files produced by the author(s)

Contributor: Gaël RICHARD

https://telecom-paris.hal.science/hal-03413484

Submitted on: Wednesday, November 3, 2021, 18:28:56

Last modified on: Monday, October 9, 2023, 12:49:43

Dates and versions

hal-03413484, version 1 (03-11-2021)

Identifiers

HAL Id: hal-03413484, version 1

Cite

Melih Barsbey, Milad Sefidgaran, Murat A Erdogdu, Gael Richard, Umut Şimşekli. Heavy Tails in SGD and Compressibility of Overparametrized Neural Networks. 35th Conference on Neural Information Processing Systems (NeurIPS), Dec 2021, Online, United States. ⟨hal-03413484⟩

Collections

INSTITUT-TELECOM LTCI IDS S2A IP_PARIS ANR PRAIRIE-IA

124 views

96 downloads
