Review of NeurIPS (AABI) Workshop

I attended the [Symposium on Approximate Bayesian Inference](http://approximateinference.org/), [Black in AI](https://blackinai.github.io/), and several poster sessions during the two days I spent at [NeurIPS](https://nips.cc/) 2019 in Vancouver. I focused on the Bayesian workshop because it tends to emphasize fundamentals rather than bleeding-edge results. This is in line with my goal of understanding how things work at a deeper level, as such knowledge is likely to stand the test of time. The [Black in AI](https://blackinai.github.io/) workshop was informative, but it was largely a refresher on subjects I was already familiar with, so I won't summarize it here. Unfortunately, I missed a number of the presentations in the main NeurIPS conference due to information overload. My case is not hopeless, though: our reading groups in Vancouver will work through the interesting papers from the conference in the coming months.

The conference is the de facto venue for disseminating the latest knowledge in the field. For a work to be considered state of the art, it must answer one question without any ambiguity: are the improvements in the performance metrics due to the novelty of the method, the preprocessing steps, random effects, or sheer luck?

I attended the Symposium on Approximate Bayesian Inference without any formal training in mathematics and without preparing for the workshop. This was deliberate, to calibrate my understanding of Bayesian statistics and to engage with researchers during the poster sessions. Fortunately, the contents of some talks were already familiar to me. One keynote featured Kalman filters, which were part of a [well-received talk](https://kenluck2001.github.io/blog_post/pysmooth_a_time_series_library_from_first_principles.html) that I gave in Vancouver, and many of the concepts were similar to papers we had read in our advanced reading group. However, there were lots of acronyms, which I think make simple things appear complicated.

Surprisingly, I also found a widespread misunderstanding of "noise" at the workshop: some used the word to mean variance, bias, overfitting, or underfitting. The field needs to unify its conventions here; I can live with having one more acronym to memorize.

Furthermore, I observed a growing effort to unify the Bayesian world with the neural network world. One reason is that it is easier to perform uncertainty quantification when your model contains some form of Gaussian process, and several talks attempted to draw this connection. One of the clearest attempts was the Neural Tangents talk, whose premise was to answer a single question: can Gaussian processes be used as a building block for Bayesian deep learning? Neural Tangents is an easy-to-use library for creating finite-width and infinite-width neural networks, and it provides a way to analyze the training dynamics of a network. Thanks to its Bayesian origins, it can learn from small datasets. I heard the term "infinite-width neural network" for the first time at this conference; the details are not yet clear to me, but I found a description in a [paper](https://openreview.net/pdf?id=SkGT6sRcFX) published at ICLR 2019. A minimal sketch of how the library is used appears below.
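Here is a minimal sketch of the two-in-one idea as I understood it, based on the library's public examples. The network, toy data, and hyperparameters are my own illustrative choices, and the exact API may differ across versions of the `neural_tangents` package:

```python
# A minimal sketch, assuming the neural_tangents package (pip install neural-tangents).
import jax.numpy as jnp
from jax import random
import neural_tangents as nt
from neural_tangents import stax

# One network description yields both a finite-width model (init_fn/apply_fn)
# and the kernel of its infinite-width limit (kernel_fn).
init_fn, apply_fn, kernel_fn = stax.serial(
    stax.Dense(512), stax.Relu(),
    stax.Dense(512), stax.Relu(),
    stax.Dense(1)
)

# Toy 1-D regression data (hypothetical, for illustration only).
key = random.PRNGKey(0)
x_train = random.uniform(key, (20, 1), minval=-1.0, maxval=1.0)
y_train = jnp.sin(3.0 * x_train)
x_test = jnp.linspace(-1.0, 1.0, 50).reshape(-1, 1)

# Closed-form posterior of the infinite-width network trained to convergence.
predict_fn = nt.predict.gradient_descent_mse_ensemble(kernel_fn, x_train, y_train)
mean, cov = predict_fn(x_test=x_test, get='nngp', compute_cov=True)
std = jnp.sqrt(jnp.diag(cov))  # predictive uncertainty comes for free
```

The appeal is that the posterior mean and covariance are exact, with no stochastic training loop, which is presumably why it behaves well on small datasets.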
Let's now get into the main themes of the workshop. The symposium focused on the following topics: robustness, a better understanding of generalization, the difficulty of quantifying mutual information, and efficient computation.

A number of the talks focused on performing Bayesian computation even in the face of model misspecification, model collapse, and increased variance. One talk attempted to improve vanilla Optimisation Monte Carlo (OMC); the resulting method is named [Robust OMC](https://arxiv.org/pdf/1904.00670.pdf). The original OMC can fail when the likelihood is flat. The method favors conditioning on summary statistics rather than a single variable, and uses a single point to represent a region where the likelihood is nearly constant. ROMC provides a way of sampling while preventing model collapse by fixing the weights, which are unstable by default, through stabilization of the matrices; robustness is achieved by using a variable to switch off faulty weights in a scheme similar to dropout. Relatedly, another talk formulated a robust estimate of the likelihood using a [pseudo-likelihood](https://arxiv.org/abs/1909.13339) based on maximum mean discrepancy (MMD), which is resilient to issues that may arise from misspecification of the model (a small sketch of the MMD computation appears at the end of this section).

There was a talk that reduced the cost of Bayesian computation through clever parallelism. Sample efficiency is a measure of the discrepancy between observed and simulated data, which motivates a principled sequential Bayesian experimental design that selects the simulation locations maximizing sample efficiency. The work allows several such experiments to run at once, using [batch simulation](https://arxiv.org/abs/1910.06121) to reduce the time for Bayesian inference.

Another way of enhancing the robustness of models is to introduce sparsity into the approximation of Gaussian processes by using [inducing points](https://arxiv.org/abs/1910.10596). These inducing points lead to a more scalable algorithm, as no neural network or data augmentation is required.

Pre-training and transfer learning improve performance in the neural network world, and there are attempts to replicate the same feat in the Bayesian world. One work builds a [probabilistic map](https://openreview.net/pdf?id=BJgnty2NYr) for robotics by incrementally updating the model, finding the correspondence between model and data as a form of transfer learning. There was also a paper on a variant of the [Kalman filter](https://openreview.net/forum?id=HkxNKk2VKS) that makes use of two passes instead of the traditional single pass. Many two-pass algorithms tend to reproduce the process noise in the backward pass by using Brownian trees, because a Brownian tree is better able to capture the dynamics of the system. There were also talks on performing backpropagation through time (BPTT) where the choice of the truncation length k is adaptive, with theoretical guarantees for learning under concept drift.

The most impressive talk for me at the symposium was the use of normalizing flows for progressive image rendering, which provides a principled way to decompress images at multiple scales with varying quality. The details are not yet clear to me, but I intend to read more about the [work](https://arxiv.org/pdf/1905.07376.pdf); it has clear commercial applications and is a remarkable piece.
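Since normalizing flows were new to me as well, here is a minimal sketch of their basic building block, a RealNVP-style affine coupling layer. This is the generic idea only, not the progressive-rendering method from the paper, and the tiny stand-in "network" with random weights is purely illustrative:

```python
# A generic affine coupling layer: invertible, with a cheap log-determinant.
import numpy as np

rng = np.random.default_rng(0)
W_s, W_t = rng.normal(size=(2, 2)), rng.normal(size=(2, 2))

def shift_scale(x1):
    """A tiny stand-in 'network' producing log-scale s and shift t from x1."""
    return np.tanh(x1 @ W_s), x1 @ W_t

def forward(x):
    """x -> y: transform one half of x conditioned on the other half."""
    x1, x2 = np.split(x, 2, axis=-1)
    s, t = shift_scale(x1)
    y = np.concatenate([x1, x2 * np.exp(s) + t], axis=-1)
    return y, s.sum(axis=-1)  # log|det Jacobian| is just a sum

def inverse(y):
    """Exact inverse: recover x from y."""
    y1, y2 = np.split(y, 2, axis=-1)
    s, t = shift_scale(y1)
    return np.concatenate([y1, (y2 - t) * np.exp(-s)], axis=-1)

x = rng.normal(size=(5, 4))
y, log_det = forward(x)
assert np.allclose(inverse(y), x)  # invertibility enables exact likelihoods
```

The design point is that the transform is trivially invertible and its log-determinant is a simple sum, which is what makes exact likelihood computation cheap enough to stack many such layers.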
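Going back to the MMD-based pseudo-likelihood mentioned earlier: the core MMD computation is simple enough to sketch. This is the generic (biased) estimator with an RBF kernel, not the estimator from the paper, and the Gaussian toy data is my own:

```python
# MMD^2(x, y) = E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)], here with an RBF kernel.
import numpy as np

def rbf_kernel(a, b, bandwidth=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 * bandwidth^2)) for all pairs."""
    sq_dists = np.sum(a**2, 1)[:, None] + np.sum(b**2, 1)[None, :] - 2 * a @ b.T
    return np.exp(-sq_dists / (2 * bandwidth**2))

def mmd2(x, y, bandwidth=1.0):
    """Biased (V-statistic) estimate of the squared maximum mean discrepancy."""
    return (rbf_kernel(x, x, bandwidth).mean()
            + rbf_kernel(y, y, bandwidth).mean()
            - 2 * rbf_kernel(x, y, bandwidth).mean())

rng = np.random.default_rng(0)
observed = rng.normal(0.0, 1.0, size=(500, 1))        # "data"
simulated_good = rng.normal(0.1, 1.0, size=(500, 1))  # close model
simulated_bad = rng.normal(2.0, 1.0, size=(500, 1))   # misspecified model
print(mmd2(observed, simulated_good))  # small
print(mmd2(observed, simulated_bad))   # much larger
```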
There was also a talk that connected reinforcement learning to information theory. Known issues with current reinforcement learning algorithms include:

- Slight perturbations are likely to produce two different solutions.
- The need for a detailed reward scheme.
- Long training times.
- A lack of diversity-seeking exploration of the reward function.

The talk proposed a distribution-matching formulation of reinforcement learning that maximizes entropy over the distribution. Before committing probability mass to a region, we track the different possibilities from that state. It can be difficult to choose a policy that maximizes Q-values without getting stuck in suboptimal solutions; maximum-entropy RL instead keeps track of all states and does not commit probability mass until it is sure the choice is optimal. It essentially matches the distribution of states instead of rewards: optimal actions lead to an optimal future, so one can infer which action was taken given that the future was optimal. More information can be found in this [work](https://arxiv.org/pdf/1805.00909.pdf).

### Words of Wisdom

The recurring themes of the workshop were:

- MCMC wins on bias, while variational inference (VI) wins on variance and amortization. It is better to run MCMC without waiting for convergence, as reasonable results can be obtained even before convergence.
- VI provides faster convergence, but the results may be less reliable. Variational Bayes (VB) fails to capture heteroscedastic noise and fits the data with homoscedastic noise instead.
- The reparametrization trick was used in virtually every poster. I think it is probably the most useful technique in the Bayesian literature, as it allows differentiation through a sampling process, making gradient-based optimization possible (a tiny worked sketch appears at the end of this post).
- Stratonovich SDEs (stochastic differential equations) can be simulated more cheaply.
- The adjoint sensitivity method is a cheaper way of solving ODEs; combined with reverse-mode auto-differentiation, it is time-efficient and uses constant memory.
- A mixture of Gaussian processes is largely non-Gaussian.

It is well known that our perception of reality can be relativistic; as such, this is not an official summary of the symposium, but my recollection of the unfolding of events.

### Conclusions

I saw a poster on saliency maps that focused on current problems with the interpretability of the maps. I met the author to discuss how their work relates to interpreting attention maps, and we concluded that ablation-style intervention is a way to understand the underlying process in the network. Could the answer lie in causality?

I was excited to have a selfie with the legendary Professor Christopher Manning from Stanford University and with Jeff Dean from Google. It was a memorable day for me. The entire [proceedings](https://openreview.net/group?id=approximateinference.org/AABI/2019/Symposium) can be found here.
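Finally, as promised in the list of recurring themes, here is a tiny worked sketch of the reparametrization trick. The toy objective and hand-derived gradients are my own, not taken from any particular poster:

```python
# Reparametrization trick: instead of sampling z ~ N(mu, sigma^2) directly
# (which blocks gradients), sample eps ~ N(0, 1) and set z = mu + sigma * eps,
# so gradients flow through mu and sigma. Toy goal: minimize E[(z - 5)^2].
import numpy as np

rng = np.random.default_rng(0)
mu, log_sigma = 0.0, 0.0
lr = 0.05
for step in range(2000):
    eps = rng.standard_normal()
    sigma = np.exp(log_sigma)
    z = mu + sigma * eps  # deterministic in (mu, log_sigma) given eps
    # Hand-derived gradients of (z - 5)^2 through the reparametrized sample:
    grad_mu = 2.0 * (z - 5.0)
    grad_log_sigma = 2.0 * (z - 5.0) * sigma * eps
    mu -= lr * grad_mu
    log_sigma -= lr * grad_log_sigma
print(mu, np.exp(log_sigma))  # mu -> 5, sigma -> small
```

The point is that sampling is rewritten as a deterministic function of the parameters plus independent noise, so ordinary gradient descent applies.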
