Deep learning
Novel view
What Can Neural Networks Reason About? ICLR 2020 Spotlight
Optimization
Course notes on Optimization for Machine Learning, by Gabriel Peyré
Visualizations
Neural Network Playground | TensorFlow
GAN Lab
GAN playground
OpenAI Microscope (CNN Convolution Layers)
Loss Landscape
Meta-learning
Regularizing Meta-Learning via Gradient Dropout (see Figures 4, 7, 8, and 9)
Generalization of NN
Deep Double Descent
The double descent phenomenon occurs in CNNs, ResNets, and transformers.
Performance first improves, then degrades, then improves again as model size, data size, or training time increases.
The effect can often be avoided through careful regularization.
Bayesian NN
How Good is the Bayes Posterior in Deep Neural Networks Really?
ICML 2020 Talk
SG-MCMC is accurate enough. Cold posteriors work. More work on priors for deep nets is needed.
Adversarial
Understanding and Mitigating the Tradeoff Between Robustness and Accuracy
ICML 2020 Talk
Infinite width NN
A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth
ICML 2020 Talk
Globally convex w.r.t. the distribution of parameters?
ResNet is an ensemble of small NNs.
Analysis of Wasserstein Gradient Flow
Interpretation
Are Neural Nets Modular? Inspecting Their Functionality Through Differentiable Weight Masks
Masking-based method for finding subnetworks.
Networks do discover functional modules, but the modules tend to resist sharing/reuse across related subtasks, which helps explain failures of systematic generalization.
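The masking idea above can be sketched in a few lines: freeze the trained weights and gate each one with a sigmoid of a learnable logit, so that optimizing only the logits per task exposes the responsible subnetwork. A toy single-layer sketch (names and shapes are mine, not the paper's code):

```python
import numpy as np

def masked_forward(W, mask_logits, x, temperature=1.0):
    """Forward pass through a linear layer whose weights are gated by a
    differentiable mask m = sigmoid(logits / T).  With W frozen, training
    the logits on one task recovers that task's subnetwork."""
    m = 1.0 / (1.0 + np.exp(-mask_logits / temperature))
    return (W * m) @ x

# Extreme logits recover a hard keep/drop decision:
W = np.array([[2.0, -1.0], [0.5, 3.0]])
logits = np.array([[50.0, -50.0], [-50.0, 50.0]])  # keep diagonal weights only
x = np.array([1.0, 1.0])
out = masked_forward(W, logits, x)  # ~ [2.0, 3.0]
```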
Student Specialization in Deep Rectified Networks With Finite Width and Input Dimension
They prove that when the gradient is small at every data sample, each teacher node is specialized by at least one student node at the lowest layer.
Their theory suggests that teacher nodes with large fan-out weights are specialized first, while the gradient is still large; the remaining teacher nodes are specialized later, with small gradients. This points to an inductive bias in training.
Supermasks in Superposition
During training, SupSup learns a separate supermask (subnetwork) for each task.
At inference time, SupSup can infer task identity by superimposing all supermasks, each weighted by a coefficient α_i, and using gradients on the α_i to maximize output confidence.
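The one-shot version of this inference step fits in a few lines for a single linear layer: superimpose the masks with equal α_i, compute the gradient of the output entropy w.r.t. each α_i in closed form, and pick the task whose α would decrease entropy the most. A sketch with hand-made masks, not the paper's trained supermasks:

```python
import numpy as np

def infer_task(W, masks, x):
    """One-shot SupSup-style task inference for one linear layer.

    z = sum_i alpha_i (W * M_i) x  with alpha_i = 1/K.  Using
    dH/dz_j = -p_j (log p_j + H), the gradient dH/d alpha_i is a dot
    product, and the predicted task is the most negative one."""
    per_task = np.array([(W * M) @ x for M in masks])  # (K, n_classes)
    z = per_task.mean(axis=0)                          # equal-alpha superposition
    p = np.exp(z - z.max()); p /= p.sum()              # softmax
    H = -(p * np.log(p)).sum()                         # output entropy
    dH_dz = -p * (np.log(p) + H)                       # closed-form entropy grad
    grads = per_task @ dH_dz                           # dH/d alpha_i
    return int(np.argmin(grads))

# Two "tasks": mask 0 keeps W[0,0], mask 1 keeps W[1,1].
W = np.array([[5.0, 0.0], [0.0, 5.0]])
masks = [np.array([[1, 0], [0, 0]]), np.array([[0, 0], [0, 1]])]
```

On this toy pair, an input aligned with task 0's kept weight is routed to mask 0, and symmetrically for task 1.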
NAS
Evolving Normalization-Activation Layers (2004.02967) (Hanxiao Liu, Andrew Brock, Karen Simonyan, Quoc V. Le)
The first effort to automatically co-design the activation function and the normalization layer as a unified computation graph.
Layer search instead of architecture search.
Meta-learning curiosity algorithms (2003.05325) (Ferran Alet, Martin F. Schneider, Tomas Lozano-Perez & Leslie Pack Kaelbling)
Make the search over programs feasible with relatively modest amounts of computation.
Do meta-learning in a rich, combinatorial space of programs rather than transferring neural network weights.
The authors' response is also interesting.
Old interesting deep-learning papers
Connection to classical methods
Deep kernel learning (AISTATS 2016) (Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, Eric P. Xing)
Combine the structural properties of deep architectures with the non-parametric flexibility of kernel methods.
Transform the inputs of a spectral mixture base kernel with a deep architecture, using local kernel interpolation, inducing points, and structure-exploiting (Kronecker and Toeplitz) algebra for a scalable kernel representation.
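The core construction k(x, x') = k_base(g(x), g(x')) is easy to sketch. Here an RBF base stands in for the paper's spectral mixture kernel and the MLP weights are random; in the real method both are learned jointly with the GP hyperparameters (and scaled with the interpolation/Kronecker machinery above):

```python
import numpy as np

def mlp_embed(X, weights):
    """g(.): feed inputs through a small tanh MLP."""
    h = X
    for Wl, bl in weights:
        h = np.tanh(h @ Wl + bl)
    return h

def deep_rbf_kernel(X1, X2, weights, lengthscale=1.0):
    """Deep kernel k(x, x') = k_RBF(g(x), g(x')): warp inputs with a deep
    net, then apply a standard base kernel on the embeddings."""
    Z1, Z2 = mlp_embed(X1, weights), mlp_embed(X2, weights)
    sq = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

rng = np.random.default_rng(0)
weights = [(rng.standard_normal((3, 8)), np.zeros(8)),
           (rng.standard_normal((8, 2)), np.zeros(2))]
X = rng.standard_normal((5, 3))
K = deep_rbf_kernel(X, X, weights)  # valid 5x5 Gram matrix
```

Because the warping is just a deterministic input transform, the result is still a valid (symmetric, positive semi-definite) kernel.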