Deep learning
Novel view
What Can Neural Networks Reason About? ICLR 2020 Spotlight
Optimization
Course notes on Optimization for Machine Learning, by Gabriel Peyré
Visualizations
Neural Network Playground | TensorFlow
GAN Lab
GAN playground
OpenAI Microscope (CNN Convolution Layers)
Loss Landscape
Meta-learning
Regularizing Meta-Learning via Gradient Dropout (see Figures 4, 7, 8, and 9)
Generalization of NN
Deep Double Descent
The double descent phenomenon occurs in CNNs, ResNets, and transformers.
Performance first improves, then degrades, then improves again as model size, data size, or training time increases.
The effect can often be avoided through careful regularization.
Bayesian NN
How Good is the Bayes Posterior in Deep Neural Networks Really?
ICML 2020 Talk
SG-MCMC is accurate enough. Cold posteriors work. More work on priors for deep nets is needed.
Adversarial
Understanding and Mitigating the Tradeoff Between Robustness and Accuracy
ICML 2020 Talk
Infinite width NN
A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth
ICML 2020 Talk
Globally convex w.r.t. the distribution of parameters?
ResNet is an ensemble of small NNs.
Analysis of Wasserstein Gradient Flow
Interpretation
Are Neural Nets Modular? Inspecting Their Functionality Through Differentiable Weight Masks
Masking-based method for finding subnetworks.
Networks do discover functional modules, but the modules tend to resist sharing/reuse across related subtasks, which helps explain failures of systematic generalization.
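The masking idea above can be sketched in a few lines: freeze the trained weights and gate each one with a sigmoid of a learnable logit, so that optimizing only the logits per task exposes the responsible subnetwork. A toy single-layer sketch (names and shapes are mine, not the paper's code):

```python
import numpy as np

def masked_forward(W, mask_logits, x, temperature=1.0):
    """Forward pass through a linear layer whose weights are gated by a
    differentiable mask m = sigmoid(logits / T).  With W frozen, training
    the logits on one task recovers that task's subnetwork."""
    m = 1.0 / (1.0 + np.exp(-mask_logits / temperature))
    return (W * m) @ x

# Extreme logits recover a hard keep/drop decision:
W = np.array([[2.0, -1.0], [0.5, 3.0]])
logits = np.array([[50.0, -50.0], [-50.0, 50.0]])  # keep diagonal weights only
x = np.array([1.0, 1.0])
out = masked_forward(W, logits, x)  # ~ [2.0, 3.0]
```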
Student Specialization in Deep Rectified Networks With Finite Width and Input Dimension
They prove that when the gradient is small at every data sample, each teacher node is specialized by at least one student node at the lowest layer.
Their theory suggests that teacher nodes with large fan-out weights are specialized first, while the gradient is still large; the remaining teacher nodes are specialized later, with small gradients. This points to an inductive bias in training.
Supermasks in Superposition
During training, SupSup learns a separate supermask (subnetwork) for each task.
At inference time, SupSup can infer task identity by superimposing all supermasks, each weighted by a coefficient α_i, and using gradients on the α_i to maximize output confidence.
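The one-shot version of this inference step fits in a few lines for a single linear layer: superimpose the masks with equal α_i, compute the gradient of the output entropy w.r.t. each α_i in closed form, and pick the task whose α would decrease entropy the most. A sketch with hand-made masks, not the paper's trained supermasks:

```python
import numpy as np

def infer_task(W, masks, x):
    """One-shot SupSup-style task inference for one linear layer.

    z = sum_i alpha_i (W * M_i) x  with alpha_i = 1/K.  Using
    dH/dz_j = -p_j (log p_j + H), the gradient dH/d alpha_i is a dot
    product, and the predicted task is the most negative one."""
    per_task = np.array([(W * M) @ x for M in masks])  # (K, n_classes)
    z = per_task.mean(axis=0)                          # equal-alpha superposition
    p = np.exp(z - z.max()); p /= p.sum()              # softmax
    H = -(p * np.log(p)).sum()                         # output entropy
    dH_dz = -p * (np.log(p) + H)                       # closed-form entropy grad
    grads = per_task @ dH_dz                           # dH/d alpha_i
    return int(np.argmin(grads))

# Two "tasks": mask 0 keeps W[0,0], mask 1 keeps W[1,1].
W = np.array([[5.0, 0.0], [0.0, 5.0]])
masks = [np.array([[1, 0], [0, 0]]), np.array([[0, 0], [0, 1]])]
```

On this toy pair, an input aligned with task 0's kept weight is routed to mask 0, and symmetrically for task 1.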
NAS
Evolving Normalization-Activation Layers (2004.02967) (Hanxiao Liu, Andrew Brock, Karen Simonyan, Quoc V. Le)
The first effort to automatically co-design the activation function and the normalization layer as a unified computation graph.
Layer search instead of architecture search.
Meta-learning curiosity algorithms (2003.05325) (Ferran Alet, Martin F. Schneider, Tomas Lozano-Perez & Leslie Pack Kaelbling)
Make the search over programs feasible with relatively modest amounts of computation.
Do meta-learning in a rich, combinatorial space of programs rather than transferring neural network weights.
The authors' response is also interesting.
Old interesting deep-learning papers
Connection to classical methods
Deep kernel learning (AISTATS 2016) (Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, Eric P. Xing)
Combine the structural properties of deep architectures with the non-parametric flexibility of kernel methods.
Transform the inputs of a spectral mixture base kernel with a deep architecture, using local kernel interpolation, inducing points, and structure-exploiting (Kronecker and Toeplitz) algebra for a scalable kernel representation.
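The core construction k(x, x') = k_base(g(x), g(x')) is easy to sketch. Here an RBF base stands in for the paper's spectral mixture kernel and the MLP weights are random; in the real method both are learned jointly with the GP hyperparameters (and scaled with the interpolation/Kronecker machinery above):

```python
import numpy as np

def mlp_embed(X, weights):
    """g(.): feed inputs through a small tanh MLP."""
    h = X
    for Wl, bl in weights:
        h = np.tanh(h @ Wl + bl)
    return h

def deep_rbf_kernel(X1, X2, weights, lengthscale=1.0):
    """Deep kernel k(x, x') = k_RBF(g(x), g(x')): warp inputs with a deep
    net, then apply a standard base kernel on the embeddings."""
    Z1, Z2 = mlp_embed(X1, weights), mlp_embed(X2, weights)
    sq = ((Z1[:, None, :] - Z2[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / lengthscale ** 2)

rng = np.random.default_rng(0)
weights = [(rng.standard_normal((3, 8)), np.zeros(8)),
           (rng.standard_normal((8, 2)), np.zeros(2))]
X = rng.standard_normal((5, 3))
K = deep_rbf_kernel(X, X, weights)  # valid 5x5 Gram matrix
```

Because the warping is just a deterministic input transform, the result is still a valid (symmetric, positive semi-definite) kernel.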