Tuesday, July 19, 2016

Summary of "Visualizing and Understanding Recurrent Networks"

Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) perform well in practice, but the source of their performance is not well understood.  The purpose of this paper is to analyze character-level LSTM models and, in particular, to trace the source of their ability to capture long-range structural dependencies.

For testing, two texts were used.  The first was Leo Tolstoy's War and Peace, which has minimal markup characters.  The other was the source code of the Linux kernel, which uses markup characters extensively.  Some general findings are that LSTM and Gated Recurrent Unit (GRU) models work much better than a simple RNN, and that the models work better with 2 hidden layers rather than 1; for the GRU, a third layer can sometimes give additional benefit.  I find it surprising that going beyond 2 hidden layers seems to increase the LSTM's cross entropy (although only slightly).

This figure from the paper shows the activity of cells in the best LSTMs.  Text color corresponds to tanh(c), where -1 is red and +1 is blue.  The first cell is sensitive to the position in a line; in a sense, it is counting how many characters are in the line.  Other cells very clearly activate inside quotes, if statements, or comments, or when predicting new lines.  This shows how some cells become highly specialized into niches.
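As a rough illustration of this kind of visualization, here is a minimal sketch that tints each character by tanh(c) for a single cell, red toward -1 and blue toward +1.  The cell values below are random stand-ins; in the paper they come from the trained LSTM's cell state as it reads the text.

    import numpy as np

    def colorize(text, cell_values):
        # Map tanh(c) in [-1, 1] to a red-to-blue background color per character
        # using 24-bit ANSI escape codes.
        out = []
        for ch, c in zip(text, np.tanh(cell_values)):
            t = (c + 1.0) / 2.0                      # 0 -> red, 1 -> blue
            r, b = int(255 * (1 - t)), int(255 * t)
            out.append(f"\x1b[48;2;{r};0;{b}m{ch}\x1b[0m")
        return "".join(out)

    text = 'if (x < 0) { return; }'
    cell_values = np.random.uniform(-2, 2, size=len(text))  # stand-in for real cell states
    print(colorize(text, cell_values))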

This figure shows the fraction of time each gate spends left-saturated (activation less than 0.1) and right-saturated (activation more than 0.9).  What I find most interesting about this figure is how different the gates in the first layer are from those in the second and third layers.  According to the authors, the number of right-saturated forget gates is especially interesting because it indicates cells that remember their values for long periods of time and thus act as nearly perfect integrators.
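Measuring this is straightforward once the gate activations have been recorded.  Below is a small sketch, with random numbers standing in for the real recorded activations, of how the saturation fractions could be computed:

    import numpy as np

    # Hypothetical recorded sigmoid gate values: one row per time step, one column per gate unit.
    gate_activations = np.random.rand(10000, 128)

    left_frac = (gate_activations < 0.1).mean(axis=0)    # fraction of steps left-saturated, per unit
    right_frac = (gate_activations > 0.9).mean(axis=0)   # fraction of steps right-saturated, per unit

    # Units with right_frac close to 1 keep their forget gate open almost all the time,
    # i.e. they behave like the nearly perfect integrators mentioned above.
    print(left_frac[:5], right_frac[:5])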

This figure shows how LSTMs consistently outperform 20-gram models at matching a closing bracket with the corresponding opening bracket, which provides strong evidence that they keep track of long-range interactions.  The LSTM shows improvement up to a distance of about 60 characters; beyond that it has more difficulty tracking the dependency, though it still performs well.

This last figure shows a breakdown of the types of errors the LSTM made.  This was done by, as the authors put it, "peeling the onion": starting at the top of the pie chart and going counterclockwise, the errors that a particular oracle model can fix are removed and counted at each step.  An error is defined as a character to which the LSTM assigns a probability below 0.5.

The first oracle is an n-gram model, which might be better at short-term dependencies.  A character error is removed if the n-gram model assigns more than 0.5 probability to the correct character.
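As a sketch of that first peeling step, assuming we have, for every position in the test text, the probability each model assigns to the correct next character:

    def peel_ngram_errors(lstm_probs, ngram_probs):
        # lstm_probs[i], ngram_probs[i]: probability each model assigns to the
        # *correct* character at position i (hypothetical inputs).
        errors = [i for i, p in enumerate(lstm_probs) if p < 0.5]     # LSTM errors
        ngram_fixed = [i for i in errors if ngram_probs[i] > 0.5]     # peeled off as n-gram errors
        remaining = [i for i in errors if ngram_probs[i] <= 0.5]      # passed to the next oracle
        return ngram_fixed, remaining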

The n-memory oracle eliminates instances where the LSTM makes the same error on a substring it already saw within the past n characters.  The LSTM consistently repeats these errors, despite the theoretical possibility that it could learn the pattern from the first occurrence.

Rare words present a problem because the LSTM may not have seen them enough times during training to give a correct answer.  This could be corrected with a larger training set or with better pretraining.  These errors are removed and broken down by how frequently the word occurred in the training set.

Many of the remaining errors occurred after a space or a quote, after a newline (which the LSTM needs to learn behaves like a space), or after punctuation.  The errors left after that had no apparent structure and were removed by simply adding 0.1, 0.2, 0.3, 0.4, or 0.5 to the probability of the correct character until all errors were accounted for.

Interestingly, increasing the number of layers and nodes of the LSTM dramatically reduced the n-gram errors, but the other types of errors seemed largely unaffected.  It is expected that scaling the model up even further will eliminate the n-gram errors, but fixing the other errors may require a different architecture.
----------
Link to the paper

Tuesday, July 12, 2016

RNN Weight Visualization

I wanted to look at the distribution of the weights inside a neural net to see what was going on.  I ran the code twice, trained with a different sample text each time.  The first sample text was the simple sentence "thebrownfoxjumpsoverthefence." repeated many times.  The second text was the short story The Last Question by Isaac Asimov.  These text samples and the code are included in a .zip file at the end of this post.

The "fox" sample text was trained with 3 hidden layers of 128 nodes each.  The code converged relatively quickly; see the output after 0 epochs:

and after 20 epochs:

The network is able to converge upon the correct answer with few errors.

I also plotted a histogram of the weights below.  At first, the weights are grouped together, presumably near their initialized values.  After several epochs, the weights become more normally distributed about 0.
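The histograms themselves are easy to reproduce; here is a minimal sketch, assuming the weight matrices have been exported as numpy arrays (the actual training code is in the .zip at the end of this post):

    import numpy as np
    import matplotlib.pyplot as plt

    def plot_weight_histogram(weight_matrices, title):
        # Flatten every weight matrix into one long vector and histogram it.
        flat = np.concatenate([w.ravel() for w in weight_matrices])
        plt.hist(flat, bins=100)
        plt.title(title)
        plt.xlabel("weight value")
        plt.ylabel("count")
        plt.show()

    # e.g. plot_weight_histogram(weights_epoch0, "epoch 0")
    #      plot_weight_histogram(weights_epoch20, "epoch 20")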

The more complicated sample text was run with 3 hidden layers of 512 nodes each.  See the output at epoch 0:

and epoch 20:

The network is clearly learning the formatting of the document (frequent line breaks) and the text is looking more like English, even though it doesn't quite become intelligible after only 20 epochs.  Despite not reaching convergence, the graphs of the weights look quite similar to those for the fox text.

The histogram is taller because the larger network has far more weights, but the distribution of the weights is already roughly normal despite the network not having converged yet.  A likely explanation is that the weights tend toward a normal distribution early on, but they still need to be shuffled into the right places in the network to produce the correct answer.

-------------
The code and training files

Sunday, July 10, 2016

Summary of "A Critical Review of Recurrent Neural Networks for Sequence Learning"

This paper gives an overview of recurrent neural networks (RNNs), bidirectional recurrent neural networks (BRNNs) and long short-term memory (LSTM).

The paper highlights some key features that make RNNs important.  They retain a state that can represent information from an arbitrarily long context window, and they can model inputs and outputs whose elements are not independent.  In other words, unlike conventional feedforward networks, the state of the network is not lost after each data point is processed.  These features are possible because RNNs include edges that span adjacent time steps, which introduces the notion of time into the model and allows the hidden nodes of the network to remember prior states.
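The recurrence itself is simple; here is a bare-bones sketch of one hidden-layer update (the weights are random placeholders and the dimensions are purely illustrative):

    import numpy as np

    def rnn_step(x_t, h_prev, W_xh, W_hh, b_h):
        # h_t = tanh(W_xh x_t + W_hh h_{t-1} + b_h): the W_hh term is the
        # recurrent edge that carries the previous state forward in time.
        return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

    W_xh, W_hh, b_h = np.random.randn(20, 10), np.random.randn(20, 20), np.zeros(20)
    h = np.zeros(20)
    for x_t in np.random.randn(5, 10):    # a toy sequence of 5 ten-dimensional inputs
        h = rnn_step(x_t, h, W_xh, W_hh, b_h)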

Training RNNs introduces some new challenges.  Vanishing and exploding gradients occur when errors are backpropagated across many time steps: as time passes, the state of the model at time 0 contributes exponentially less to the output compared to the current input.  The exploding gradient problem can be addressed with truncated backpropagation through time (TBPTT), which limits the number of time steps over which error can propagate, but this gives up the ability of the RNN to learn long-range dependencies.
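Here is a sketch of the truncation idea using PyTorch (which is not what the paper uses; detaching the hidden state is just one convenient way to stop gradients from flowing further back than the last k steps):

    import torch
    import torch.nn as nn

    k = 35                                        # truncation length
    rnn = nn.RNN(input_size=10, hidden_size=20, batch_first=True)
    opt = torch.optim.SGD(rnn.parameters(), lr=0.01)

    data = torch.randn(1, 1000, 10)               # toy input sequence
    h = torch.zeros(1, 1, 20)
    for start in range(0, data.size(1) - k + 1, k):
        chunk = data[:, start:start + k]
        out, h = rnn(chunk, h.detach())           # detach: error cannot propagate past this chunk
        loss = out.pow(2).mean()                  # placeholder loss
        opt.zero_grad()
        loss.backward()
        opt.step()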

Figure 1: Demonstration of the vanishing gradient problem.  When the weight along the recurrent edge is less than 1, the state of the hidden layer is diluted by the input as time passes.
The LSTM model is another way to eliminate the vanishing gradient problem.  Here, each node in the hidden layer of the RNN is replaced with a memory cell (Figure 2).  Each memory cell contains a node with a self-connected recurrent edge of weight 1, which prevents the gradient from vanishing or exploding.  The forget gate 'f' controls this recurrent edge, allowing the network to eventually forget information; the value of the forget gate is learned and depends on the input.  The internal state 's' is updated with the product of its value at the previous time step and the value of the forget gate.
Figure 2:  LSTM memory cell
The input node 'g' and the input gate 'i' multiply their outputs to form another term added to the internal state.  The input gate is a distinctive feature of the LSTM approach, and its purpose is to control the flow from the input node: a value of zero effectively cuts off the input, keeping the internal state the same, a value of one lets the input node's output through in full, and intermediate values are also possible.

The output gate 'o' multiplies with the output of the internal state to control the output of the hidden layer.  The internal state is commonly run through a tanh activation first, although a ReLU activation is apparently easier to train.

Each of these gates can learn when to let data in or out of the memory cell, thus improving the ability of the network to learn long-range dependencies compared to simple RNNs.
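Putting the pieces together, here is a compact sketch of one memory-cell update using the notation above (input node g, input gate i, forget gate f, output gate o, internal state s); the weight matrices are placeholders rather than anything trained:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def lstm_step(x_t, h_prev, s_prev, W, U, b):
        g = np.tanh(W["g"] @ x_t + U["g"] @ h_prev + b["g"])   # input node
        i = sigmoid(W["i"] @ x_t + U["i"] @ h_prev + b["i"])   # input gate
        f = sigmoid(W["f"] @ x_t + U["f"] @ h_prev + b["f"])   # forget gate
        o = sigmoid(W["o"] @ x_t + U["o"] @ h_prev + b["o"])   # output gate
        s_t = f * s_prev + i * g        # self-loop of weight 1, gated by the forget gate
        h_t = o * np.tanh(s_t)          # gated output of the cell
        return h_t, s_t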

The paper also has a relatively small section on bidirectional recurrent neural networks (BRNNs) (Figure 3).  BRNNs allow the network to receive information from both the 'future' and the past: when reading through a document, for example, the network has context from the words before and after the current word.  This type of network cannot be run continuously, because it needs endpoints in both the future and the past.  Despite this limitation, it is a powerful technique for applications such as part-of-speech tagging.
Figure 3: Bidirectional Recurrent Neural Net (BRNN)
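The bidirectional idea can be sketched as two independent recurrences over the same sequence, one forward and one backward, with their hidden states concatenated at each position.  Here step_fwd and step_bwd stand for any recurrence like the rnn_step sketched earlier, with its weights closed over:

    import numpy as np

    def brnn(inputs, step_fwd, step_bwd, h0_fwd, h0_bwd):
        h_f, h_b = h0_fwd, h0_bwd
        fwd, bwd = [], []
        for x in inputs:                  # forward pass: context from the past
            h_f = step_fwd(x, h_f)
            fwd.append(h_f)
        for x in reversed(inputs):        # backward pass: context from the 'future'
            h_b = step_bwd(x, h_b)
            bwd.append(h_b)
        bwd.reverse()
        # Each output sees both directions' hidden states for its time step.
        return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]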

--------------------
Link to the paper