I am Elan van Biljon.

MSc in Computer Science · Academic Writer and Reviewer · Silly and Thoughtful

Cover Letter

My name is Elan, and I would like to help LifeQ conduct research and continue improving its clients' lives. I have served as a reviewer for multiple world-class research journals, at times ranking in the top 5% of reviewers. In addition, I have co-authored ten machine learning research publications. Finally, I have four years of teaching experience. Thus, I can accurately represent research to diverse readers in a form with which they can engage.

My engineering and computer science degrees gave me strong technical skills: for example, I learned how to develop and validate effective mathematical models, algorithms, and code. My communication skills, however, were not as strong, so I took the initiative to improve them by working as a freelance science communicator, where I learned to convey complex ideas in an accessible way. That work and my master's studies also taught me to be self-directed and to manage my time, while my many research projects taught me to work on a team. These projects and my work as an academic reviewer also taught me to conduct literature- and data-driven research and to communicate findings concisely. These experiences have made me confident that I can do almost anything given the time to learn.

LifeQ's goal of empowering people to live healthier and happier lives by enabling them to make informed decisions resonates with me. I try to keep up to date with the latest health research and to learn about myself through my wearable. Despite this, I often feel that I lack the information or insight to make decisions that improve my health and wellness. Thus, I am excited by the products LifeQ offers. I think empowering people with tools to take charge of their own health is fantastic. Remote monitoring of patients, paired with insights from experts, is a clever and effective use of specialised resources. Detecting and preventing illness is resource-efficient and improves the lives of would-be patients. I also appreciate LifeQ's stances of "honesty through validation" and "privacy is paramount". I want to help LifeQ continue to improve people's lives.

Thank you very much for looking at my application. I hope I can contribute to LifeQ very soon. I would love to discuss further how I can help and what makes me qualified to do so. Please get in touch with me at elanvanbiljon@gmail.com if you want to know more.

Writing Samples

Research Papers

Critical initialisation for deep signal propagation in noisy rectifier neural networks

NeurIPS 2018

Stochastic regularisation is an important weapon in the arsenal of a deep learning practitioner. However, despite recent theoretical advances, our understanding of how noise influences signal propagation in deep neural networks remains limited. By extending recent work based on mean field theory, we develop a new framework for signal propagation in stochastically regularised neural networks.

Our noisy signal propagation theory can incorporate several common noise distributions, including additive and multiplicative Gaussian noise as well as dropout. We use this framework to investigate initialisation strategies for noisy ReLU networks. We show that no critical initialisation strategy exists when using additive noise: signal propagation explodes regardless of the selected noise distribution. For multiplicative noise (e.g. dropout), we identify alternative critical initialisation strategies that depend on the second moment of the noise distribution.

Simulations and experiments on real-world data confirm that our proposed initialisation stably propagates signals in deep networks, while an initialisation that disregards noise fails to do so. Furthermore, we analyse correlation dynamics between inputs: stronger noise regularisation is shown to reduce the depth to which discriminatory information about the inputs to a noisy ReLU network can propagate, even when initialised at criticality. We support our theoretical predictions for these trainable depths with simulations, as well as with experiments on MNIST and CIFAR-10.
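
To make the derived initialisation concrete, here is a minimal sketch of what the dropout case could look like for a fully connected ReLU layer. This is my own illustration under stated assumptions, not code from the paper: it assumes inverted dropout with keep probability p, whose noise has second moment E[eps^2] = 1/p, so the critical weight variance becomes 2p/fan_in rather than the noiseless He value of 2/fan_in.

    import numpy as np

    def critical_std(fan_in, keep_prob):
        """Weight standard deviation for a ReLU layer trained with dropout.

        The critical variance scales inversely with the second moment of the
        multiplicative noise; for inverted dropout that moment is 1/keep_prob.
        """
        noise_second_moment = 1.0 / keep_prob  # E[eps^2] for inverted dropout
        return np.sqrt(2.0 / (fan_in * noise_second_moment))

    # Hypothetical usage: a 784 -> 512 layer trained with dropout keep rate 0.8.
    W = np.random.randn(784, 512) * critical_std(784, keep_prob=0.8)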

If dropout limits trainable depth, does critical initialisation still matter? A large-scale statistical analysis on ReLU networks

Pattern Recognition Letters 2020

Recent work in signal propagation theory has shown that dropout limits the depth to which information can propagate through a neural network. In this paper, we investigate the effect of initialisation on training speed and generalisation for ReLU networks within this depth limit. We ask the following research question: given that critical initialisation is crucial for training at large depth, if dropout limits the depth at which networks are trainable, does initialising critically still matter?

We conduct a large-scale controlled experiment and perform a statistical analysis of over 12,000 trained networks. We find that (1) trainable networks show no statistically significant difference in performance over a wide range of non-critical initialisations; (2) for initialisations that show a statistically significant difference, the net effect on performance is small; (3) only extreme initialisations (very small or very large) perform worse than criticality.

These findings also apply to standard ReLU networks of moderate depth as a special case of zero dropout. Our results therefore suggest that, in the shallow-to-moderate depth setting, critical initialisation provides zero performance gains when compared to off-critical initialisations, and that searching for off-critical initialisations that might improve training speed or generalisation is likely to be a fruitless endeavour.

Initialisation of noise-regularised neural networks

Stellenbosch University 2021

Recently, proper initialisation and stochastic regularisation techniques have greatly improved the performance and ease of training of neural networks. Some research has investigated how the magnitude of the initial weights impacts optimisation, while other work has focused on how initialisation affects signal propagation. In terms of noise regularisation, dropout has allowed networks to train relatively quickly and has reduced overfitting. Much research has gone towards understanding why dropout improves the generalisation of networks. Two major theories are (i) that it prevents neurons from becoming too dependent on the outputs of other neurons and (ii) that dropout leads a network to optimise a smoother loss landscape.

Despite this, our theoretical understanding of the interaction between regularisation and initialisation remains sparse. The aim of this work was therefore to broaden our knowledge of how initialisation and stochastic regularisation interact and what impact this has on network training and performance. Because rectifier activation functions are widely used, we extended recent signal propagation theory to rectifier networks that may use stochastic regularisation. Our theory predicted a critical initialisation that allows for stable propagation of the pre-activation variance. However, it also indicated that stochastic regularisation reduces the depth to which correlation information can propagate in ReLU networks. We validated this theory and showed that it accurately predicts a boundary across which networks do not train effectively.

We then extended the investigation by conducting a large-scale randomised controlled trial, searching the region around the critical initialisation that conserves input signal, in the hope of finding initialisations that benefit training or generalisation. We compared the critical initialisation to 10 other initialisation schemes in a trial consisting of over 12,000 networks. We found that initialisations much larger than the critical initialisation perform extremely poorly, while initialisations close to the critical initialisation perform similarly to it. No initialisation clearly outperformed the critical one, so we recommend it as a safe default for practitioners.

A game-theoretic analysis of networked system control for common-pool resource management using multi-agent reinforcement learning

NeurIPS 2020

Multi-agent reinforcement learning has recently shown great promise as an approach to networked system control. Arguably, one of the most difficult and important tasks to which large-scale networked system control is applicable is common-pool resource management. Crucial common-pool resources include arable land, fresh water, wetlands, wildlife, fish stock, forests and the atmosphere, whose proper management relates to some of society's greatest challenges, such as food security, inequality and climate change.

Here we take inspiration from a recent research program investigating the game-theoretic incentives of humans in social dilemma situations such as the well-known tragedy of the commons. However, instead of focusing on biologically evolved human-like agents, our concern is rather to better understand the learning and operating behaviour of engineered networked systems comprising general-purpose reinforcement learning agents, subject only to non-biological constraints such as memory, computation and communication bandwidth.

Harnessing tools from empirical game-theoretic analysis, we analyse the differences in resulting solution concepts that stem from employing different information structures in the design of networked multi-agent systems. These information structures pertain to the type of information shared between agents as well as the employed communication protocol and network topology. Our analysis contributes new insights into the consequences associated with certain design choices and provides an additional dimension of comparison between systems beyond efficiency, robustness, scalability and mean control performance.

Learning to communicate through imagination with model-based deep multi-agent reinforcement learning

OpenReview.net 2021

The human imagination is an integral component of our intelligence. Furthermore, the core utility of our imagination is deeply coupled with communication. Language, argued to have developed through complex interaction within growing collective societies, serves as an instruction to the imagination, giving us the ability to share abstract mental representations and perform joint spatiotemporal planning.

In this paper, we explore communication through imagination with multi-agent reinforcement learning. Specifically, we develop a model-based approach where agents jointly plan through recurrent communication of their respective predictions of the future.

Each agent has access to a learned world model capable of producing model rollouts of future states and predicted rewards, conditioned on the actions sampled from the agent's policy. These rollouts are then encoded into messages and used to learn a communication protocol during training via differentiable message passing. We highlight the benefits of our model-based approach, compared to a set of strong baselines, through specialised experiments using both novel and well-known multi-agent environments.
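
As a rough sketch of the mechanism described above, one imagination step might look like the following. Every name, shape, and module here is my own simplification for illustration, not the paper's architecture:

    import torch
    import torch.nn as nn

    class ImaginationAgent(nn.Module):
        """Toy agent that encodes imagined world-model rollouts into a message."""

        def __init__(self, obs_dim, action_dim, msg_dim, horizon=3):
            super().__init__()
            self.horizon = horizon
            self.policy = nn.Linear(obs_dim + msg_dim, action_dim)
            # Learned world model: predicts (next state, reward) from (state, action).
            self.world_model = nn.Linear(obs_dim + action_dim, obs_dim + 1)
            # Encodes a flattened rollout into a fixed-size message.
            self.msg_encoder = nn.Linear(horizon * (obs_dim + 1), msg_dim)

        def imagine(self, obs, incoming_msg):
            """Roll the world model forward and encode the rollout as a message."""
            state, rollout = obs, []
            for _ in range(self.horizon):
                action = torch.softmax(
                    self.policy(torch.cat([state, incoming_msg], dim=-1)), dim=-1)
                pred = self.world_model(torch.cat([state, action], dim=-1))
                state, reward = pred[..., :-1], pred[..., -1:]
                rollout.append(torch.cat([state, reward], dim=-1))
            # The message is differentiable, so a communication protocol can be
            # learned end-to-end through message passing during training.
            return self.msg_encoder(torch.cat(rollout, dim=-1))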

Creating Intelligent Agents with Reinforcement Learning

Stellenbosch University 2017

Reinforcement learning is a relatively new and underexplored branch of machine learning that has recently become very popular. Even so, few understand what reinforcement learning is or what its possible applications are.

This project report gives an overview of reinforcement learning and explains some recently developed approaches, such as deep Q-learning. Throughout the report, we build up our understanding of reinforcement learning until we reach the level of deep Q-learning. We then apply a deep Q-network to a computer game, Code vs Zombies.

While our implementation stabilised on a suboptimal policy when playing the full game, it was able to find optimal policies for constrained versions. In the process, we experiment with and optimise some of the leading approaches in reinforcement learning.
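
For readers unfamiliar with deep Q-learning, its core is a one-step temporal-difference update. The sketch below is a generic illustration of that update, not the report's actual implementation:

    import torch
    import torch.nn.functional as F

    def dqn_loss(q_net, target_net, batch, gamma=0.99):
        """One-step TD loss for deep Q-learning with a frozen target network."""
        obs, actions, rewards, next_obs, done = batch
        # Q-values of the actions that were actually taken.
        q = q_net(obs).gather(1, actions.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            # Bootstrap from the greedy next action under the target network;
            # terminal transitions (done == 1) contribute only their reward.
            best_next = target_net(next_obs).max(dim=1).values
            target = rewards + gamma * (1 - done) * best_next
        return F.mse_loss(q, target)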

Unsupervised acoustic unit discovery for speech synthesis using discrete latent-variable neural networks

arXiv 2019

For our submission to the ZeroSpeech 2019 challenge, we apply discrete latent-variable neural networks to unlabelled speech and use the discovered units for speech synthesis. Unsupervised discrete subword modelling could be useful for studies of phonetic category learning in infants or in low-resource speech technology requiring symbolic input.

We use an autoencoder (AE) architecture with intermediate discretisation. We decouple acoustic unit discovery from speaker modelling by conditioning the AE's decoder on the training speaker identity.

At test time, unit discovery is performed on speech from an unseen speaker, followed by unit decoding conditioned on a known target speaker to obtain reconstructed filterbanks. This output is fed to a neural vocoder to synthesise speech in the target speaker's voice. For discretisation, categorical variational autoencoders (CatVAEs), vector-quantised VAEs (VQ-VAEs) and straight-through estimation are compared at different compression levels on two languages. Our final model uses convolutional encoding, VQ-VAE discretisation, deconvolutional decoding and an FFTNet vocoder.

We show that decoupled speaker conditioning intrinsically improves discrete acoustic representations, yielding competitive synthesis quality compared to the challenge baseline.
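
As a generic illustration of the vector-quantised discretisation with straight-through estimation described above (a minimal sketch following common VQ-VAE conventions, not the submission's code):

    import torch

    def vector_quantise(z_e, codebook):
        """Map encoder outputs to their nearest codebook vectors.

        z_e:      (batch, time, dim) continuous encoder outputs.
        codebook: (num_codes, dim) table of learned discrete units.
        """
        # Squared Euclidean distance from every frame to every codebook entry.
        dists = (z_e.unsqueeze(-2) - codebook).pow(2).sum(-1)
        codes = dists.argmin(dim=-1)   # discrete unit ids, shape (batch, time)
        z_q = codebook[codes]          # quantised vectors, same shape as z_e
        # Straight-through estimator: the forward pass uses z_q, but gradients
        # flow back to the encoder as if quantisation were the identity.
        z_q = z_e + (z_q - z_e).detach()
        return z_q, codes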

Participatory research for low-resourced machine translation: A case study in African languages

arXiv 2020

Research in NLP lacks geographic diversity, and the question of how NLP can be scaled to low-resourced languages has not yet been adequately solved. "Low-resourced"-ness is a complex problem going beyond data availability and reflects systemic problems in society.

In this paper, we focus on the task of Machine Translation (MT), which plays a crucial role in information accessibility and communication worldwide. Despite immense improvements in MT over the past decade, it remains centered around a few high-resourced languages. As MT researchers cannot solve the problem of low-resourcedness alone, we propose participatory research as a means to involve all the agents required in the MT development process.

We demonstrate the feasibility and scalability of participatory research with a case study on MT for African languages. Its implementation leads to a collection of novel translation datasets and MT benchmarks for over 30 languages, with human evaluations for a third of them, and enables participants without formal training to make a unique scientific contribution. Benchmarks, models, data, code, and evaluation results are released at https://github.com/masakhane-io/masakhane-mt.

Masakhane: Machine Translation for Africa

arXiv 2020

Africa has over 2000 languages. Despite this, African languages account for a small portion of available resources and publications in Natural Language Processing (NLP). This is due to multiple factors, including a lack of government focus and funding, poor discoverability, a lack of community, sheer language complexity, difficulty in reproducing papers, and an absence of benchmarks for comparing techniques.

To begin to address the identified problems, MASAKHANE, an open-source, continent-wide, distributed, online research effort for machine translation for African languages, was founded. In this paper, we discuss our methodology for building the community and spurring research from the African continent, as well as outline the success of the community in terms of addressing the identified problems affecting African NLP.

On optimal transformer depth for low-resource language translation

arXiv 2020

Transformers have shown great promise as an approach to Neural Machine Translation (NMT) for low-resource languages. However, at the same time, transformer models remain difficult to optimize and require careful tuning of hyper-parameters to be useful in this setting.

Many NMT toolkits come with a set of default hyper-parameters, which researchers and practitioners often adopt for the sake of convenience and avoiding tuning. These configurations, however, have been optimized for large-scale machine translation data sets with several millions of parallel sentences for European languages like English and French. In this work, we find that the current trend in the field to use very large models is detrimental for low-resource languages, since it makes training more difficult and hurts overall performance, confirming previous observations.

We see our work as complementary to the Masakhane project ("Masakhane" means "we build together" in isiZulu). In this spirit, low-resource NMT systems are now being built by the communities who need them the most. However, many in these communities still have very limited access to the type of computational resources required to build the extremely large models promoted by industrial research. Therefore, by showing that transformer models perform well (and often best) at low-to-moderate depth, we hope to convince fellow researchers to devote fewer computational resources, and less time, to exploring overly large models during the development of these systems.
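
As a hypothetical illustration of the kind of change involved (the numbers below are generic Transformer-base-style defaults, not the paper's exact settings):

    # Large-scale defaults shipped with many NMT toolkits.
    default_config = {
        "encoder_layers": 6, "decoder_layers": 6,
        "model_dim": 512, "ffn_dim": 2048, "heads": 8,
    }

    # A shallower variant of the kind that often trains more reliably,
    # and faster, on small parallel corpora for low-resource languages.
    low_resource_config = {**default_config,
                           "encoder_layers": 3, "decoder_layers": 3}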

About

Who am I?


I'm a curious and silly yet thoughtful person who loves to learn and connect with others. I try to communicate with others knowing that, although we perceive the world differently, we can still come together through discussion instead of pulling apart. I try to imagine the world and the people in it complexly, and as a consequence, I try to meet people where they are. I would much rather learn than be perfect. I know the individual things I do to tackle the world's big problems are entirely insufficient. Still, I do them anyway, and I try to find others to do them with, because the only way we get anywhere is how we got here: together.

Profile

I was born in Pretoria, Gauteng, South Africa (SA); grew up in Hermanus, Western Cape, SA; and currently live in Cape Town, Western Cape, SA.

Skills

My self-reported ratings of some of my professional skills are as follows:

  • Programming: 95%
  • Research: 90%
  • Reviewing: 90%
  • Communication: 85%
  • Writing: 80%


Code Samples


Resume

More of my credentials

My tertiary education, 10 academic publications, and work as a research engineer during this time have taught me how to learn and how to solve problems. In addition, these experiences gave me practice in researching, writing, editing, and fact-checking. Finally, I learnt to juggle multiple projects simultaneously, meet deadlines, and communicate and work effectively on a team.

I have spent time teaching, volunteering in places of learning, and serving as a reviewer for world-class academic journals. These experiences have taught me to understand students so that I can meet them where they are, developed my mentoring skills, let me spark curiosity in others, and given me ways to give back to my community.

Experience

Science Communicator

2022

Writer and Fact Checker

I am currently working as a freelance writer. My role is to read research papers and summarise them into a form the general public can more easily understand.

Graduate Student

7 Publications
Cum Laude

2018 - 2021


Foundational Theory of Machine Learning

My research focused on understanding the effect of different sets of initial parameters on training neural networks, especially those using noise to increase their robustness.

Reviewer

Ranked top 5%

2019 - 2020


International Machine Learning Journals

When reviewing submissions, I ensured that the cited facts were correct, the new theory and methodology were sound, the conclusions followed logically from the results, and the work was novel. I then worked with the authors and other reviewers to make the submission as good as possible, even if it was not yet ready for acceptance to that edition of the journal.

Research Engineer

2 Publications

2020

Research and Publication

I focused on conducting novel research and writing academic publications. My research focused on using machine learning to find optimal behaviours in systems with multiple agents.

Teaching Experience

2018 - 2019

Foundations of Machine Learning

For two years, I served as the teaching assistant for a foundations of machine learning course at my university. I also served as a teaching assistant at multiple conferences during this time. In addition, I headed the section on the optimisation of neural networks at the 2019 Deep Learning Indaba.

Undergraduate Student

1 Publication
Cum Laude

2014 - 2017

Electronic Engineering and Informatics

While most of my coursework focused on mathematics, physics, and engineering, I learnt more valuable skills along the way: this time thoroughly developed my ability to learn and to solve problems.

Teaching assistant

2015 - 2016

Mathematics

In these years, I discovered my love for inspiring curiosity in others. I found it exciting to uncover each student's understanding of a topic and to bridge the gap between that understanding and the information they needed.

Hobbies

What I do in my free time

I love to move my body, so when I'm not working I am usually rock climbing, hiking, trail running, or swimming. I love to take photos and videos of these things to share my experiences with others. In the rare moments when I'm not moving, you can find me reading a fantasy book or creating some art (sketches, digital painting, writing, etc.).

Where to find me

Cape Town
Western Cape
South Africa

Email Me At

elanvanbiljon@gmail.com

Call Me At

Mobile: (+27) 82 892 6482