Story of Neural Networks
1800s: Biological Neural Networks
The discovery and identification of biological neural networks paved a new path in human biology. Neurons are different from other body cells and even appear to violate ‘cell theory’.
A neuron, or nerve cell, is the structural and functional unit of the nervous system. It is similar to any other cell in the body, but differs from other cells in two ways:
- A neuron has branches or processes called axons and dendrites.
- Neurons do not have centrosomes. So, they cannot undergo division, unlike other cells.
These branches, the axon and dendrites, are known as nerve fibers and are responsible for signal transmission.
Based on their structure, neurons are classified into three types: unipolar, bipolar, and multipolar.
Types of neurons:
When an input signal (an electrical impulse) is received by the dendrites, it is transmitted to the cell body, which in turn fires an electrical impulse if the input signal exceeds a threshold. The fired impulse travels along the axon and reaches either a dendrite of another neuron or, directly, that neuron's axon.
Let’s get into details of a neuron and the process of information in a neuron.
Dendrites – the branched processes of a neuron, responsible for transmitting impulses towards the nerve cell body. They may be present or absent; if present, a single neuron may have one or many.
Axon – The long process of a neuron. In general, a neuron contains only one axon and transmits impulses away from the cell body. The axon provides a pathway for neurotransmitters that carry the information.
Nerve fibers have a lower threshold for excitation than other cells. When a nerve fiber is stimulated, depending on the strength of the stimulus, one of two responses develops:
- Action potential, or nerve impulse
- Electrotonic potential, or local potential, i.e. no information transfer
Another property of nerve fibers is that a nerve impulse can travel in either direction (forward and backward).
Neuroglia – supporting cells of the nervous system. They are non-excitable and do not transmit nerve impulses; instead they support neurons and regulate the transmission of information between them.
An impulse passes between two neurons through a junction called a synapse. This junction is formed by the axon of one neuron ending on the cell body, dendrite, or axon of the next neuron. A synapse either excites or inhibits the transmitted impulse.
This neural structure of the brain inspired researchers to build artificial neural networks that can handle tasks the way a human brain does.
Detailed representation of a biological neuron as an artificial neuron
The 1940s: Neural Networks
Warren McCulloch, a neurophysiologist, and Walter Pitts, a logician, presented the first sophisticated discussion of “neuro-logical networks” in their 1943 manifesto “A Logical Calculus of the Ideas Immanent in Nervous Activity”.
They combined the concepts of finite-state machines, linear threshold decision elements, and logical representations of various forms of behavior and memory.
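Their linear threshold decision element is simple enough to capture in a few lines. Below is a minimal sketch (our own illustration, not code from the 1943 paper): the unit fires when the sum of its binary inputs reaches a threshold, which is enough to express basic logic.

```python
# A McCulloch-Pitts threshold unit (illustrative sketch; the 1943
# paper states the model mathematically, not in code).
def mcp_neuron(inputs, threshold):
    """Fire (output 1) iff the sum of binary inputs reaches the threshold."""
    return 1 if sum(inputs) >= threshold else 0

# Logical operations expressed as threshold units:
def AND(a, b):
    return mcp_neuron([a, b], threshold=2)

def OR(a, b):
    return mcp_neuron([a, b], threshold=1)
```

Choosing the threshold is what selects the logical behavior: a threshold of 2 over two inputs yields AND, a threshold of 1 yields OR.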
In 1947 they published another manifesto, “How We Know Universals”, which describes how neural networks can recognize spatial patterns despite geometric transformations.
This led to a new era called “Cybernetics”, which combines the concepts from biology, psychology, mathematics, and engineering.
By the end of the 1940s, Donald Hebb's book “The Organization of Behavior” made the first attempt at a foundational theory of neural networks and how they learn, through the construction of internal representations that support neuron activity.
Since Hebb's theory was conjecture rather than proof, it wasn't taken seriously at the time.
The 1950s: Learning in Neural Networks
During the cybernetics era, researchers and scientists raced to build mind-like machines. This led to experiments with ‘reinforcement learning’, a concept from behavioristic psychology.
In 1951, Marvin Minsky built the first reinforcement-based network learning machine, SNARC (Stochastic Neural-Analog Reinforcement Calculator), the first neural network simulator, which consisted of 40 neurons. It successfully modeled the behavior of a rat searching for food in a maze.
A single neuron in SNARC | Source
In 1958 an astonishing thing happened: Frank Rosenblatt created the ‘Perceptron’, a neural network computer system that could mimic a brain. It was used to classify images of men and women.
Frank Rosenblatt ’50, Ph.D. ’56 works on the “perceptron” – what he described as the first machine “capable of having an original idea.”
The Perceptron is a single-layer neural network (inspired by the way neurons work together in the brain): an algorithm that classifies input into one of two possible categories.
An image of the perceptron from Rosenblatt’s “The Design of an Intelligent Automaton,” Summer 1958.
The Perceptron was built on simple linear functions, so it lacked the ability to handle complex, non-linearly-separable problems.
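A minimal perceptron sketch, assuming the classic error-driven learning rule (the class name and hyperparameters here are our own, not Rosenblatt's):

```python
import numpy as np

# Illustrative single-layer perceptron: a thresholded weighted sum,
# trained by nudging the weights with the prediction error.
class Perceptron:
    def __init__(self, n_inputs, lr=0.1):
        self.w = np.zeros(n_inputs)
        self.b = 0.0
        self.lr = lr

    def predict(self, x):
        # fire (class 1) iff the weighted sum exceeds zero
        return 1 if np.dot(self.w, x) + self.b > 0 else 0

    def train(self, X, y, epochs=10):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                err = yi - self.predict(xi)     # -1, 0, or +1
                self.w += self.lr * err * np.array(xi)
                self.b += self.lr * err

# Learn logical OR, which is linearly separable:
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [0, 1, 1, 1]
p = Perceptron(2)
p.train(X, y)
```

On a non-linearly-separable problem such as XOR, this update rule never converges, which is precisely the limitation that led to the perceptron's fall.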
The 1960s: Concepts of Backpropagation, Birth of Deep Learning and Fall of Perceptron
In 1961, Henry J. Kelley presented the first version of the backpropagation model in his paper ‘Gradient Theory of Optimal Flight Paths’. He introduced the ‘method of steepest descent’ to optimize a flight's performance, which was later used to optimize neural networks.
In 1962, Stuart Dreyfus's paper ‘The Numerical Solution of Variational Problems’ showed a backpropagation model based on the derivative chain rule.
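The chain rule at the heart of these backpropagation models can be shown with a toy one-weight network (our own illustration, not Dreyfus's formulation): compute the loss in a forward pass, then multiply local derivatives together on the way back.

```python
import math

# Toy network y = sigmoid(w * x) with squared loss L = (y - t)^2.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_and_grad(w, x, t):
    z = w * x                        # forward pass
    y = sigmoid(z)
    L = (y - t) ** 2
    # backward pass, chain rule: dL/dw = dL/dy * dy/dz * dz/dw
    dL_dy = 2.0 * (y - t)
    dy_dz = y * (1.0 - y)            # derivative of the sigmoid
    dz_dw = x
    return L, dL_dy * dy_dz * dz_dw

# Sanity check against a finite-difference approximation:
w, x, t = 0.5, 1.0, 1.0
_, grad = loss_and_grad(w, x, t)
eps = 1e-6
numeric = (loss_and_grad(w + eps, x, t)[0]
           - loss_and_grad(w - eps, x, t)[0]) / (2 * eps)
```

The analytic gradient and the numeric one agree to several decimal places, which is the standard check that a backward pass is implemented correctly.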
In 1965, Ukrainian mathematician Alexey Grigorevich Ivakhnenko(developed GMDH algorithm) and Valentin Grigorʹevich Lapa developed multiple layers of non-linear units which form a polynomial network, using polynomial activation functions. Here the neuron is a complex unit featuring a polynomial transfer function. It is an automatic or self-organized model designed to find the optimal solution.
In 1969, Kunihiko Fukushima wrote the paper ‘Visual feature extraction by a multilayered network of analog threshold elements’.
The 1970s: AI Winter, Evolution of Backpropagation and a Deep Neural Network
The first AI winter kicked in: a lack of funding limited AI research, but some individuals carried on their research regardless.
In 1970, Seppo Linnainmaa implemented the backpropagation algorithm in FORTRAN. He introduced the concept of reverse-mode automatic differentiation, without referring to neural networks.
In 1971, Alexey Grigorevich Ivakhnenko created an 8-layered neural network using the Group Method of Data Handling(GMDH).
In 1974, Paul Werbos proposed the use of backpropagation in artificial neural networks, describing how they could be trained effectively by applying it.
In 1975, Kunihiko Fukushima, continuing his work on feature extraction for pattern recognizers and curvilinear patterns in the visual system, wrote the paper ‘Cognitron: A self-organizing multilayered neural network’, based on findings that the development of the brain depends on the visual environment.
The 1980s: Revolution in Artificial Neural Networks, Birth of CNNs, RNNs and Reinforcement Learning, and Rise of Backpropagation
In 1980, Kunihiko Fukushima published the paper ‘Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position’, which paved a perfect path for Convolutional Neural Networks.
Neocognitron | Source
This neural network model is used to recognize visual patterns from an image.
In 1982, John Hopfield invented the Hopfield Network to store and retrieve memory like a human brain. The network is composed of binary threshold units with recurrent connections between them, which provided a basis for later RNN implementations. A Hopfield Network consists of a single layer containing one or more fully connected neurons.
Hopfield Network | Source
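The store-and-retrieve idea can be sketched in a few lines, assuming the standard Hebbian storage rule and a synchronous threshold update (our own simplification of the model):

```python
import numpy as np

# Store binary (+1/-1) patterns as Hebbian weights, then recover a
# stored pattern from a corrupted version by repeated thresholding.
def train_hopfield(patterns):
    n = len(patterns[0])
    W = np.zeros((n, n))
    for p in patterns:
        p = np.array(p, dtype=float)
        W += np.outer(p, p)       # Hebbian rule: strengthen co-active pairs
    np.fill_diagonal(W, 0)        # no self-connections
    return W

def recall(W, state, steps=5):
    s = np.array(state, dtype=float)
    for _ in range(steps):
        s = np.where(W @ s >= 0, 1, -1)   # synchronous threshold update
    return s

stored = [1, -1, 1, -1, 1, -1]
W = train_hopfield([stored])
noisy = [1, -1, 1, -1, 1, 1]      # one flipped bit
```

Running `recall(W, noisy)` drives the corrupted state back to the stored pattern, which is the content-addressable-memory behavior the network is known for.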
In 1984, Geoffrey Hinton, David Ackley, and Terrence Sejnowski created the Boltzmann Machine, a stochastic Hopfield net with hidden units. The Boltzmann machine is good at modeling binary data: for a given set of binary training vectors, the model assigns a probability to every possible binary vector, which is useful for deciding whether other binary vectors come from the same distribution. Applications of Boltzmann machines include retrieving relevant documents and detecting unusual behavior.
In 1986, David Rumelhart, Geoffrey Hinton, and Ronald Williams published the paper ‘Learning representations by back-propagating errors’, describing a new learning procedure for neural networks: back-propagation. The procedure adjusts the weights of the connections so as to minimize a measure of the difference between the output vector of the model and the desired output.
The Restricted Boltzmann Machine (RBM) was invented by Paul Smolensky in 1986. It is a simplified architecture with only one hidden layer and no connections between hidden units; the connectivity is restricted to make learning easier.
In 1987, Terrence Sejnowski created ‘NETtalk: a parallel network that learns to read aloud’, a backpropagation-trained, three-layer neural network that converts English text into speech.
Schematic drawing of NETtalk
In 1989, Yann LeCun proposed a type of artificial neural network called Convolutional Neural Network. He and his team built a handwritten digit recognizer using convolutions.
The early model had convolutions with stride and no pooling layers. It was used at AT&T, mostly to read handwritten cheques and zip codes.
Research on autonomous vehicles started at Carnegie Mellon in 1984, and production of the first vehicle, NAVLAB 1, started in 1986. Then something phenomenal happened in 1989: ALVINN (Autonomous Land Vehicle In a Neural Network) was tested on the roads.
A tweet from Dean Pomerleau, the man under the hood…
Christopher Watkins introduced the concept of Q-learning in his 1989 thesis ‘Learning from Delayed Rewards’, which paved a path for reinforcement learning in later years. It framed reinforcement learning as learning to control a Markov Decision Process by incremental dynamic programming.
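The incremental update at the core of Q-learning can be sketched with a tiny tabular example (the corridor environment and all hyperparameter values here are our own illustration): the estimate Q(s, a) is nudged towards the observed reward plus the discounted value of the best next action.

```python
import random

# A 1-D corridor of 5 states with a reward at the right end.
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)             # move left / move right
alpha, gamma, epsilon = 0.5, 0.9, 0.1
Q = {(s, a): 0.0 for s in range(N_STATES) for a in ACTIONS}

def step(s, a):
    s2 = min(max(s + a, 0), N_STATES - 1)
    return s2, (1.0 if s2 == GOAL else 0.0)

random.seed(0)
for _ in range(500):           # training episodes
    s = 0
    while s != GOAL:
        # epsilon-greedy action selection
        if random.random() < epsilon:
            act = random.choice(ACTIONS)
        else:
            act = max(ACTIONS, key=lambda a: Q[(s, a)])
        s2, r = step(s, act)
        # Q-learning update: move Q(s, a) towards r + gamma * max_a' Q(s', a')
        best_next = max(Q[(s2, a)] for a in ACTIONS)
        Q[(s, act)] += alpha * (r + gamma * best_next - Q[(s, act)])
        s = s2
```

After training, the greedy policy at every state is to move right, towards the reward, even though the reward is only ever observed at the goal; that propagation of delayed reward is Watkins's key contribution.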
1989 also brought another key finding: the paper ‘Multilayer feedforward networks are universal approximators’ mathematically proved that multiple layers allow neural networks to approximate, in theory, any function.
The 1990s: Birth of LSTM and A Computer defeated a World Chess Champion!
In 1997, Sepp Hochreiter and Jürgen Schmidhuber developed ‘Long Short-Term Memory (LSTM)’, which revolutionized deep learning in the years that followed. The LSTM was developed to overcome the vanishing gradient problem in recurrent neural networks.
In the 1980s, IBM began the development of Deep Blue, a chess-playing computer. In 1997, Deep Blue defeated Garry Kasparov, the world champion. It used a brute-force search approach, so it is fair to say that the Deep Blue of that time wasn't an AI implementation.
In 1998, Yann LeCun, Yoshua Bengio, and others published a paper ‘Gradient-Based Learning Applied to Document Recognition’. It states that ‘Gradient-Based Learning algorithms(stochastic gradient descent) can be used to synthesize a complex decision surface that can classify high-dimensional patterns such as handwritten character recognition’.
LeNet5: A typical convolutional neural network to recognize characters using the Stochastic Gradient Descent algorithm.
The concept of ‘unsupervised learning’ emerged during this period to overcome the limitations of backpropagation: backpropagation requires labeled data (which was hard to obtain at the time), and learning time doesn't scale well with multiple hidden layers.
The idea was to keep the efficiency and simplicity of stochastic mini-batch gradient descent for adjusting the weights, but to use it for modeling the structure of the input rather than the relation between input and output (as in supervised learning).
21st Century: Advancements in Deep Learning, and the AI Revolution Begins
In 2006, Geoffrey Hinton and others developed the Deep Belief Network by stacking multiple RBMs together to speed up training in neural networks. This popularized the use of deep learning worldwide. Belief nets are directed acyclic graphs whose sparsely connected nodes permit clever inference algorithms.
In 2008, GPUs began to be used to train neural networks on huge volumes of data. The growth of the world wide web, Facebook, and other platforms generated more and more of that data.
Fei-Fei Li created ImageNet in 2009, a large-scale visual database of 14 million images, which revolutionized the deep learning domain.
In 2011, IBM's Watson, powered by NLP and information-retrieval techniques, beat two Jeopardy! champions.
IBM’s DeepQA built for Question and Answering
In the same year, Yoshua Bengio and others published ‘Deep Sparse Rectifier Neural Networks’, showing that the ReLU activation function can avoid the vanishing gradient problem.
Sparse propagation of activations and gradients in a network of rectifier units
The paper concluded that ‘Sparsity and neurons operating mostly in a linear regime can be brought together in more biologically plausible deep neural networks. Rectifier units help to bridge the gap between unsupervised pre-training and no pre-training, which suggests that they may help in finding better minima during training. Furthermore, rectifier activation functions have shown to be remarkably adapted to sentiment analysis, a text-based task with a very large degree of data sparsity.’
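The key gradient property is easy to see in code (a minimal sketch of our own): the rectifier's derivative is exactly 1 wherever the unit is active, while the sigmoid's derivative is at most 0.25, so stacked sigmoid layers shrink gradients multiplicatively.

```python
import numpy as np

# The rectifier and its gradient, next to the sigmoid's gradient.
def relu(x):
    return np.maximum(0.0, x)

def relu_grad(x):
    # exactly 1 for active units, 0 otherwise: gradients pass unshrunk
    return (x > 0).astype(float)

def sigmoid_grad(x):
    s = 1.0 / (1.0 + np.exp(-x))
    return s * (1.0 - s)     # never larger than 0.25 (peak at x = 0)
```

Through ten active ReLU layers a gradient is multiplied by 1 ten times; through ten sigmoid layers it is multiplied by at most 0.25 ten times, shrinking it by a factor of about a million.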
In 2012, Google Brain led by Andrew Ng and Jeff Dean created a deep learning framework focused on pattern detection that can recognize cats by watching unlabeled images and videos. Their focus was to build high-level, class-specific feature detectors from unlabeled images using stochastic gradient descent algorithms.
The architecture and parameters in one layer of the network. The overall network replicates this structure three times.
In the same year, AlexNet won the ImageNet challenge. It used a convolutional neural network with the ReLU activation function, trained on GPUs.
The first convolutional layer filters the 224 × 224 × 3 input image with 96 kernels of size 11 × 11 × 3 with a stride of 4 pixels. The second convolutional layer takes as input the output of the first convolutional layer and filters it with 256 kernels of size 5 × 5 × 48. The third, fourth, and fifth convolutional layers are connected to one another without any intervening pooling or normalization layers. The third convolutional layer has 384 kernels of size 3 × 3 × 256 connected to the (normalized, pooled) outputs of the second convolutional layer. The fourth convolutional layer has 384 kernels of size 3 × 3 × 192, and the fifth convolutional layer has 256 kernels of size 3 × 3 × 192. The fully-connected layers have 4096 neurons each. The AlexNet applied two primary methods to avoid overfitting: Data Augmentation and Dropout.
Facebook created DeepFace in 2014, a deep neural network that can recognize human faces!
Outline of the DeepFace architecture. A front-end of single convolution-pooling-convolution filtering on the rectified input, followed by three locally-connected layers and two fully-connected layers. Colors illustrate feature maps produced at each layer. The net includes more than 120 million parameters, where more than 95% come from the local and fully connected layers.
This work demonstrated that coupling a 3D model-based alignment with large-capacity feedforward models can effectively learn from many examples, overcoming the drawbacks of previous methods, which mostly suffered under occlusion or low illumination.
Something marvelous also happened in 2014: the birth of the ‘Generative Adversarial Network (GAN)’, created by Ian Goodfellow. This opened a whole new level and approach of deep learning in science, art, fashion, and many other areas. It is a framework that trains two models: a generative model that captures the data distribution, and a discriminative model that estimates the probability that a sample came from the training data rather than from the generative model.
In 2015, a major advancement in CNNs was achieved through ResNet, proposed in ‘Deep Residual Learning for Image Recognition’. Beyond a certain depth, stacking additional layers onto feedforward convolutional networks results in higher training error; this degradation is commonly attributed to the vanishing gradient problem. ResNet takes a standard convolutional neural network and adds connections that skip a few convolution layers; each bypass gives a residual block.
Regular block vs Residual block | Source
Residual Block in ResNet
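The skip connection can be sketched with a fully-connected toy block (our own simplification; the paper uses convolutional layers): the block computes activation(F(x) + x), so even when the learned transformation F contributes nothing, the input passes through unchanged.

```python
import numpy as np

# A toy residual block: two linear transforms with a skip connection.
def residual_block(x, W1, W2):
    h = np.maximum(0.0, W1 @ x)       # first transform + ReLU
    fx = W2 @ h                        # second transform: F(x)
    return np.maximum(0.0, fx + x)     # skip connection adds the input back

# With zero weights the block reduces to the identity (for nonnegative x),
# which is why extra residual layers need not raise the training error.
x = np.array([1.0, 2.0, 3.0])
W1 = np.zeros((3, 3))
W2 = np.zeros((3, 3))
```

The identity shortcut also gives gradients a direct path backwards through the addition, which is how the bypasses counter the degradation seen in very deep plain networks.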
In 2016, DeepMind (started in 2010), a team of experts working on state-of-the-art AI applications, built AlphaGo, the first program to beat a professional player at Go, the world's most complex ancient game. A later version, AlphaGo Zero, was trained solely with reinforcement learning, without human data.
In the same year, Google's Neural Machine Translation, an end-to-end learning approach for automated translation, was launched. The model consists of a deep LSTM network with 8 encoder and 8 decoder layers, using attention and residual connections.
DeepMind built Wavenet, a deep generative model of raw audio waveforms. They showed that ‘WaveNets are able to generate speech which mimics any human voice and which sounds more natural than the best existing Text-to-Speech systems, reducing the gap with human performance by over 50%. And also demonstrated that the same network can be used to synthesize other audio signals such as music, and present some striking samples of automatically generated piano pieces.’
DeepMind later developed a lip-reading model: a machine that can lip-read, with applications ranging from dictating instructions or messages to a phone in a noisy environment to transcribing and re-dubbing archived movies and silent films.
In 2017, AI researchers published a paper on the Wasserstein GAN (WGAN), an improvement to the traditional GAN; later, BEGAN, CycleGAN, and Progressive GANs were developed.
2019 witnessed some of the most influential deep learning research, raising areas like GANs, reinforcement learning, auto-encoders, and attention mechanisms to a new level in the deep learning space.
OpenAI created the GPT series of autoregressive language models, built from transformer decoder blocks, which produce human-like text using deep learning and can even be adapted to generate images.
Google’s BERT(Bidirectional Encoder Representations from Transformers) revolutionized NLP. BERT is the first deeply bidirectional, unsupervised language representation, pre-trained using only a plain text corpus.
BERT vs GPT vs Elmo | Source
Most of those who use BERT will never need to pre-train their own models from scratch. The BERT models released at the time were English-only, with Google planning to release models trained in a variety of languages.
These are some of the achievements and breakthroughs in neural networks; there are still advancements that aren't covered in this article.
There is a lot more research going on, and state-of-the-art deep learning applications are being used in fields like autonomous vehicles, agriculture, finance, healthcare, the military, and many more. This demanded a change in hardware: Google deployed TPUs, and NVIDIA deployed Volta and Turing GPUs, to provide faster training and greater deep learning performance.