Deep learning

Novel view

  1. What Can Neural Networks Reason About? (ICLR 2020 Spotlight)

Optimization

  1. Course notes on Optimization for Machine Learning (Gabriel Peyré)

Visualizations

  1. Neural Network Playground | TensorFlow

  2. GAN Lab

  3. GAN playground

  4. OpenAI Microscope (CNN Convolution Layers)

  5. Loss Landscape

Meta-learning

  1. Regularizing Meta-Learning via Gradient Dropout (see Figures 4, 7, 8, and 9)

Generalization of NN

  1. Deep Double Descent

    1. The double descent phenomenon occurs in CNNs, ResNets, and transformers.

    2. Performance first improves, then gets worse, and then improves again as model size, data size, or training time increases (see the toy sketch after this list).

    3. This effect is often avoided through careful regularization.
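
A toy way to see the shape of the curve (not the paper's experiments): minimum-norm least squares on random ReLU features, sweeping the feature count past the interpolation threshold n_features ≈ n_train. Test error typically rises near that threshold and falls again beyond it; the data sizes, noise level, and feature counts below are my own choices, and the exact numbers depend on the seed.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d = 100, 1000, 10
w_true = rng.normal(size=d)

def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + 0.5 * rng.normal(size=n)
    return X, y

Xtr, ytr = make_data(n_train)
Xte, yte = make_data(n_test)

W = rng.normal(size=(d, 2000)) / np.sqrt(d)               # fixed pool of random ReLU features
for n_features in [10, 30, 70, 90, 100, 110, 150, 300, 1000]:
    Ptr = np.maximum(Xtr @ W[:, :n_features], 0)
    Pte = np.maximum(Xte @ W[:, :n_features], 0)
    coef, *_ = np.linalg.lstsq(Ptr, ytr, rcond=None)      # minimum-norm least-squares fit
    test_mse = np.mean((Pte @ coef - yte) ** 2)
    print(f"{n_features:5d} features  test MSE = {test_mse:.3f}")
```

Model-wise double descent should show up as a bump in test MSE around 100 features (the interpolation threshold) followed by a second descent as the model grows further.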

Bayesian NN

  1. How Good is the Bayes Posterior in Deep Neural Networks Really?

    1. ICML 2020 Talk

    2. SG-MCMC sampling is accurate enough for their study; cold (tempered, T < 1) posteriors outperform the true Bayes posterior; more work on priors for deep nets is needed (a tempered-SGLD sketch follows).
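
A cold posterior tempers the target to p(θ | D)^(1/T) with T < 1. Below is a minimal sketch of SGLD with a temperature knob on synthetic data; it is not the paper's SG-MCMC setup (they use a more careful sampler with preconditioning and cyclical schedules), and the model, step size, and prior here are my own choices.

```python
import torch

# Toy regression problem and a small MLP.
torch.manual_seed(0)
X = torch.randn(256, 4)
y = X @ torch.randn(4, 1) + 0.1 * torch.randn(256, 1)
model = torch.nn.Sequential(torch.nn.Linear(4, 16), torch.nn.ReLU(), torch.nn.Linear(16, 1))

N = X.shape[0]     # dataset size
T = 0.1            # temperature: T = 1 targets the Bayes posterior, T < 1 is a "cold" posterior
step = 1e-4        # SGLD step size
prior_var = 1.0    # isotropic Gaussian prior on all weights

for it in range(2000):
    idx = torch.randint(0, N, (32,))
    # Minibatch estimate of the negative log posterior U(theta); likelihood scale folded into MSE.
    nll = torch.nn.functional.mse_loss(model(X[idx]), y[idx]) * N
    prior = sum((p ** 2).sum() for p in model.parameters()) / (2 * prior_var)
    U = nll + prior
    model.zero_grad()
    U.backward()
    with torch.no_grad():
        for p in model.parameters():
            noise = torch.randn_like(p) * (2 * step * T) ** 0.5
            p.add_(-step * p.grad + noise)   # theta <- theta - eps * grad U + sqrt(2 * eps * T) * xi
```

With T = 1 the chain targets the full Bayes posterior; shrinking T concentrates ("cools") the posterior around its modes, which is the regime the paper finds to predict better.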

Adversarial

  1. Understanding and Mitigating the Tradeoff Between Robustness and Accuracy

    1. ICML 2020 Talk

Infinite width NN

  1. A Mean-field Analysis of Deep ResNet and Beyond: Towards Provable Optimization Via Overparameterization From Depth

    1. ICML 2020 Talk

    2. Is the loss globally convex w.r.t. the distribution of parameters in the mean-field limit?

    3. ResNet is an ensemble of small NNs.

    4. Analysis via Wasserstein gradient flow (a generic mean-field formulation is sketched below).
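
For reference, here is a generic mean-field formulation (closest to the two-layer case; the paper's depth-continuous ResNet analysis replaces the integral over neurons with a flow over depth, but the Wasserstein-gradient-flow structure is analogous). The notation is mine, not necessarily the paper's:

```latex
% Network output and risk as functionals of the parameter distribution \rho.
\[
  f_\rho(x) = \int \phi(x;\theta)\,\mathrm{d}\rho(\theta),
  \qquad
  F(\rho) = \mathbb{E}_{(x,y)}\!\left[\ell\big(f_\rho(x),\,y\big)\right].
\]
% Training corresponds to a Wasserstein-2 gradient flow on F:
\[
  \partial_t \rho_t
  = \nabla_\theta \cdot \Big(\rho_t\, \nabla_\theta \frac{\delta F}{\delta \rho}(\rho_t)\Big),
  \qquad
  \dot{\theta}_t = -\nabla_\theta \frac{\delta F}{\delta \rho}(\rho_t)(\theta_t)
  \quad \text{for each particle.}
\]
```

When ℓ is convex and f_ρ is linear in ρ (as above), F is convex in ρ, which is the sense in which the problem can become "globally convex w.r.t. the distribution of parameters"; how far this carries over to deep ResNets is what the paper analyzes.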

Interpretation

  1. Are Neural Nets Modular? Inspecting Their Functionality Through Differentiable Weight Masks

    1. Masking-based method: learn a differentiable binary mask over the (frozen) trained weights to find the subnetwork responsible for a given subtask.

    2. The masks discover modules, and these modules tend to resist sharing across subtasks.

    3. Generalization is because they

  2. Student Specialization in Deep Rectified Networks With Finite Width and Input Dimension

    1. They prove that when the gradient is small at every data sample, each teacher node is specialized by at least one student node at the lowest layer.

    2. Their theory suggests that teacher nodes with large fan-out weights get specialized first, while the gradient is still large; the remaining nodes are specialized later, with small gradients, which points to an inductive bias in training.

  3. Supermasks in Superposition

    1. During training, SupSup learns a separate supermask (subnetwork) over a fixed, randomly weighted backbone for each task.

    2. At inference time, SupSup can infer the task identity by superimposing all supermasks, each weighted by a coefficient α_i, and taking gradient steps on the α's to maximize the output confidence (a minimal supermask sketch follows this list).
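
A minimal sketch of the supermask building block (not the authors' code): a linear layer with frozen random weights and a learned score per weight, where the top-k scores define the binary mask via a straight-through estimator. The task-inference step that superimposes masks and optimizes the α's is only indicated in a comment; the class name MaskedLinear and the hyperparameters are mine.

```python
import torch
import torch.nn as nn

class MaskedLinear(nn.Module):
    """Linear layer with frozen random weights and a learned 'supermask'."""
    def __init__(self, in_features, out_features, sparsity=0.5):
        super().__init__()
        w = torch.randn(out_features, in_features) / in_features ** 0.5
        self.weight = nn.Parameter(w, requires_grad=False)    # backbone weights stay fixed
        self.scores = nn.Parameter(torch.randn_like(w))       # one learnable score per weight
        self.k = int((1 - sparsity) * w.numel())               # number of weights to keep

    def mask(self):
        flat = self.scores.flatten()
        hard = torch.zeros_like(flat)
        hard[flat.topk(self.k).indices] = 1.0                  # top-k scores -> binary mask
        hard = hard.view_as(self.scores)
        # Straight-through: forward uses the hard mask, gradients flow to the scores.
        return hard + self.scores - self.scores.detach()

    def forward(self, x):
        return nn.functional.linear(x, self.weight * self.mask())

# One supermask per task. At inference, SupSup superimposes the per-task masks weighted by
# alpha_i and takes gradient steps on alpha to maximize confidence (task inference, not shown).
layer = MaskedLinear(32, 10)
print(layer(torch.randn(4, 32)).shape)   # torch.Size([4, 10])
```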

NAS

  1. Evolving Normalization-Activation Layers (2004.02967) (Hanxiao Liu, Andrew Brock, Karen Simonyan, Quoc V. Le)

    1. The first effort to automatically co-design the activation function and the normalization layer as a unified computation graph.

    2. Layer search instead of architecture search. One of the discovered layers, EvoNorm-S0, is sketched after this list.

  2. Meta-learning curiosity algorithms (2003.05325) (Ferran Alet, Martin F. Schneider, Tomas Lozano-Perez & Leslie Pack Kaelbling)

    1. Make the search over programs feasible with relatively modest amounts of computation.

    2. Do meta-learning in a rich, combinatorial space of programs rather than transferring neural network weights.

    3. The authors' response is also interesting.
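
A rough sketch of one of the discovered layers, EvoNorm-S0, which fuses a Swish-like gate with a group-wise standard deviation; the exact grouping, epsilon, and initialization below are my assumptions rather than the paper's reference implementation.

```python
import torch
import torch.nn as nn

class EvoNormS0(nn.Module):
    """Sketch of EvoNorm-S0: y = x * sigmoid(v * x) / group_std(x) * gamma + beta."""
    def __init__(self, channels, groups=8, eps=1e-5):
        super().__init__()
        self.groups, self.eps = groups, eps
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))
        self.v = nn.Parameter(torch.ones(1, channels, 1, 1))

    def group_std(self, x):
        n, c, h, w = x.shape
        g = x.view(n, self.groups, c // self.groups, h, w)
        std = (g.var(dim=(2, 3, 4), keepdim=True) + self.eps).sqrt()
        return std.expand_as(g).reshape(n, c, h, w)

    def forward(self, x):
        # Normalization and activation fused into one expression (no separate BN + ReLU).
        return x * torch.sigmoid(self.v * x) / self.group_std(x) * self.gamma + self.beta

print(EvoNormS0(16)(torch.randn(2, 16, 8, 8)).shape)   # torch.Size([2, 16, 8, 8])
```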

Old interesting deep-learning papers

Connection to classical methods

  1. Deep kernel learning (AISTATS 2016) (Andrew Gordon Wilson, Zhiting Hu, Ruslan Salakhutdinov, Eric P. Xing)

    1. Combine the structural properties of deep architectures with the non-parametric flexibility of kernel methods.

    2. Transform the inputs of a spectral mixture base kernel with a deep architecture, using local kernel interpolation, inducing points, and structure-exploiting (Kronecker and Toeplitz) algebra for a scalable kernel representation (a minimal deep-kernel sketch follows).
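
The core construction is k_deep(x, x') = k_base(g_w(x), g_w(x')), with the network parameters w learned jointly with the kernel hyperparameters by maximizing the GP marginal likelihood. Below is a minimal sketch with an RBF base kernel on a toy 1-D problem; the paper's spectral mixture base kernel and KISS-GP-style scalability machinery are not reproduced, and the architecture and hyperparameter choices are mine.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.linspace(-3, 3, 60).unsqueeze(1)
y = torch.sin(2 * X).squeeze() + 0.1 * torch.randn(60)

net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 2))           # feature extractor g_w
log_ls, log_sf, log_sn = (nn.Parameter(torch.zeros(())) for _ in range(3))   # kernel hypers (log scale)

def deep_rbf_kernel(a, b):
    fa, fb = net(a), net(b)                       # k_deep(x, x') = k_RBF(g_w(x), g_w(x'))
    d2 = torch.cdist(fa, fb).pow(2)
    return log_sf.exp() ** 2 * torch.exp(-0.5 * d2 / log_ls.exp() ** 2)

opt = torch.optim.Adam(list(net.parameters()) + [log_ls, log_sf, log_sn], lr=1e-2)
for step in range(500):
    K = deep_rbf_kernel(X, X) + (log_sn.exp() ** 2 + 1e-5) * torch.eye(60)
    L = torch.linalg.cholesky(K)
    alpha = torch.cholesky_solve(y.unsqueeze(1), L)
    nlml = 0.5 * y @ alpha.squeeze() + torch.log(torch.diagonal(L)).sum()    # GP NLML up to a constant
    opt.zero_grad(); nlml.backward(); opt.step()
```

In practice one would likely use GPyTorch's deep kernel learning utilities rather than a hand-rolled marginal likelihood, but the jointly trained feature extractor plus base kernel is the whole idea.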