Tuesday, July 19, 2016

Summary of "Visualizing and Understanding Recurrent Networks"

Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) perform well in practice, but the source of their performance is not well understood.  The purpose of this paper is to analyze character-level LSTM models, in particular their ability to capture long-range structural dependencies.

For testing, two texts were used.  The first was Leo Tolstoy's War and Peace, which has minimal markup characters.  The other was the source code of the Linux kernel, which uses markup characters extensively.  Some general findings are that LSTM and Gated Recurrent Unit (GRU) models work much better than a simple RNN.  The models work better with 2 hidden layers rather than 1, and for the GRU, a third layer can sometimes give additional benefit.  I find it surprising that more than 2 hidden layers seems to increase the LSTM's cross entropy (although only slightly).

This figure from the paper shows the activity of cells in the best LSTMs.  Text color corresponds to tanh(c), where -1 is red and +1 is blue.  The first cell is sensitive to the position in a line; in a sense, it is counting how many characters are on the line.  Other cells very clearly activate inside quotes, if statements, or comments, or for predicting new lines.  This shows how some cells become highly specialized for particular niches.

This figure shows the fraction of time a gate spends left saturated (activation less than 0.1) and right saturated (activation more than 0.9).  What I find most interesting about this figure is how different the gates in the first layer are from those in the second and the third.  According to the authors, the number of right-saturated forget gates is especially interesting because it indicates cells that remember their values for long periods of time and thus work as nearly perfect integrators.
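As a rough illustration of this measurement (my own sketch, not code from the paper), the saturation fractions could be computed from recorded gate activations like this, assuming gate_acts is a hypothetical array of shape (timesteps, num_cells) collected during a forward pass:

import numpy as np

def saturation_fractions(gate_acts, lo=0.1, hi=0.9):
    # Fraction of time steps each gate spends left- and right-saturated.
    left = (gate_acts < lo).mean(axis=0)    # fraction of steps with activation < 0.1
    right = (gate_acts > hi).mean(axis=0)   # fraction of steps with activation > 0.9
    return left, right

# Random activations standing in for real recordings of a forget gate
rng = np.random.RandomState(0)
fake_forget_gate = rng.uniform(0, 1, size=(1000, 128))
left_frac, right_frac = saturation_fractions(fake_forget_gate)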

This figure shows how LSTMs are able to consistently outperform 20-gram models in matching a closing bracket with the corresponding opening bracket.  This provides strong evidence that they are able to keep track of long-range interactions.  The LSTM shows improvement up to a distance of about 60 characters, then it has difficulty keeping track of the long-range dependencies, though it still performs well.

This last figure shows the breakdown of the types of errors that the LSTM made.  This was done by, as they say in the paper, "peeling the onion": they start with the errors at the top of the pie chart and work counterclockwise, at each step removing the errors that a particular reference model could fix and counting how many are removed.  An error is defined to occur whenever the model assigns the correct next character a probability below 0.5.
They start with an n-gram model that may be better at short-term dependencies.  An error is removed if the n-gram model assigns the correct character a probability greater than 0.5.
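To make the bookkeeping concrete, here is a minimal sketch of this first peeling step under my own assumptions: lstm_p and ngram_p are hypothetical arrays holding the probability each model assigned to the true next character at every position.

import numpy as np

# Hypothetical per-character probabilities assigned to the *correct* character
lstm_p = np.array([0.9, 0.3, 0.1, 0.7, 0.4])
ngram_p = np.array([0.8, 0.6, 0.2, 0.9, 0.45])

errors = lstm_p < 0.5                        # LSTM errors: correct char gets < 0.5
removed_by_ngram = errors & (ngram_p > 0.5)  # peeled off: the n-gram would have got it right
remaining = errors & ~removed_by_ngram       # passed on to the next error category

print(errors.sum(), removed_by_ngram.sum(), remaining.sum())  # 3 1 2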

The n-memory model eliminates instances where the LSTM makes an error on a substring that already appeared in the previous n characters.  The LSTM consistently makes these errors, despite the theoretical possibility that it could get the repeat correct after seeing the first occurrence.

Rare words present a problem because the LSTM might not have had enough opportunities to train on them to give a correct answer.  This could be corrected by a larger training set or better pretraining.  These errors are removed in bins based on how frequently the word occurred in the training set.

Many of the errors occurred after a space or a quote, after a newline (which the LSTM needs to learn acts the same as a space), or after punctuation.  The remaining errors had no apparent structure and were removed by simply adding 0.1, 0.2, 0.3, 0.4, or 0.5 to the probability of that character until all errors were accounted for.

Interestingly, increasing the number of layers and nodes of the LSTM dramatically reduced the n-gram errors, but the other types of errors seemed to be unaffected.  It is expected that scaling the model up even further will eliminate the n-gram errors, but fixing the other errors may require a different architecture.
----------
Link to the paper

Tuesday, July 12, 2016

RNN Weight Visualization

I wanted to look at the distribution of the weights inside a neural net to see what was going on.  I ran the code twice, training on a different sample text each time.  The first sample text was the simple sentence "thebrownfoxjumpsoverthefence." repeated many times.  The second text was the short story The Last Question by Isaac Asimov.  These text samples and the code are included in a .zip file at the end of this post.

The "fox" sample text was trained with 3 hidden layers with 128 nodes each.  The code converged relatively quickly, see the output after 0 epochs:




and after 20 epochs:



The network is able to converge upon the correct answer with few errors.

I also plotted a histogram of the weights below.  At first, the weights are grouped together, presumably near their initialized values.  After several epochs, the weights become more normally distributed about 0.
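For reference, here is a minimal sketch of how such a histogram can be produced in Keras (the model and layer sizes here are stand-ins for illustration, not my exact script):

import numpy as np
import matplotlib.pyplot as plt
from keras.models import Sequential
from keras.layers import LSTM, Dense

# Stand-in model: 3 hidden LSTM layers of 128 nodes, character-level input
model = Sequential()
model.add(LSTM(128, return_sequences=True, input_shape=(30, 60)))
model.add(LSTM(128, return_sequences=True))
model.add(LSTM(128))
model.add(Dense(60, activation='softmax'))

# Flatten every weight matrix in the network into one long vector
all_weights = np.concatenate([w.flatten() for w in model.get_weights()])

plt.hist(all_weights, bins=100)
plt.xlabel('weight value')
plt.ylabel('count')
plt.show()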

The more complicated sample text was run with 3 hidden layers of 512 nodes each.  See the output at epoch 0:







and epoch 20:






The network is clearly learning the formatting of the document (frequent line breaks), and the text is looking more like English, even though it doesn't quite become intelligible after only 20 epochs.  Despite not reaching convergence, the graphs of the weights look quite similar to how they did for the fox text.

The histogram is taller because there are far more weights in the larger network, but the distribution of the weights is still roughly normal, despite the network not having converged yet.  The likely explanation is that the overall distribution of weights settles into a normal shape early on, while the individual weights still need to be shuffled into the right places in the network to get the correct answer.

-------------
The code and training files

Sunday, July 10, 2016

Summary of "A Critical Review of Recurrent Neural Networks for Sequence Learning"

This paper gives an overview of recurrent neural networks (RNNs), bidirectional recurrent neural networks (BRNNs) and long short-term memory (LSTM).

The paper highlights some key features of RNNs that make them important.  They retain a state that can represent information from an arbitrarily long context window.  They are able to model input and output consisting of elements that are not independent.  What this means is that, unlike conventional neural networks, the state of the network is not lost after each data point is processed.  These features are possible because RNNs include edges spanning adjacent time steps, which introduces the notion of time to the model.  This allows the hidden nodes of the network to remember prior states.

Training RNNs introduces some new challenges.  Vanishing and exploding gradients occur when backpropagating errors across many time steps.  Essentially, as time passes, the state of the model at time 0 contributes less and less to the output compared to the current input, and this happens exponentially fast.  The exploding gradient problem can be addressed by truncated backpropagation through time (TBPTT).  TBPTT works by limiting the number of time steps through which the error can propagate, but this gives up the ability of the RNN to learn long-range dependencies.

Figure 1: Demonstration of the vanishing gradient problem.  When the weight along the recurrent edge is less than 1, the state of the hidden layer is diluted by the input as time passes.
The LSTM model is another way to eliminate the vanishing gradient problem.  Here, each node in the hidden layer of the RNN is replaced with a memory cell (Figure 2).  Each memory cell contains a node with a self-connected recurrent edge of weight 1, which prevents the gradient from vanishing or exploding.  The forget gate 'f' controls this recurrent edge, allowing the network to eventually forget information; the value of the forget gate is learned and depends on the input.  The internal state 's' is updated with the product of its value at the previous time step and the value of the forget gate.
Figure 2:  LSTM memory cell
The input node 'g' and the input gate 'i' multiply their outputs to form another term that is added into the internal state.  The input gate is a distinctive feature of the LSTM approach, and its purpose is to control the flow of the input node.  A value of zero effectively cuts off the input, keeping the internal state the same; a value of one allows the full flow of the input node; intermediate values are also possible.

The output gate 'o' multiplies with the output of the internal state to control the output of the hidden layer.  The internal state is commonly run through a tanh activation first, although a ReLU activation is apparently easier to train.

Each of these gates can learn when to let data in or out of the memory cell, thus improving the ability of the network to learn long-range dependencies compared to simple RNNs.
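As a summary of the gating described above, here is a minimal numpy sketch of a single LSTM time step, with the forget gate f, input gate i, input node g, output gate o, and internal state s as in Figure 2 (the weight shapes and names are my own choices, with biases folded in for brevity):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, s_prev, W, b):
    # One LSTM step; W[k] maps the concatenated [x, h_prev] to gate k.
    z = np.concatenate([x, h_prev])
    g = np.tanh(W['g'].dot(z) + b['g'])     # input node
    i = sigmoid(W['i'].dot(z) + b['i'])     # input gate: how much of g to let in
    f = sigmoid(W['f'].dot(z) + b['f'])     # forget gate: how much of s_prev to keep
    o = sigmoid(W['o'].dot(z) + b['o'])     # output gate: how much of the state to expose
    s = g * i + s_prev * f                  # internal state with self-recurrent edge
    h = np.tanh(s) * o                      # hidden output (tanh of the state, gated)
    return h, s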

The paper also has a relatively small section on bidirectional recurrent neural networks (BRNNs) (Figure 3).  BRNNs allow the network to receive information from both the 'future' and the past.  For example, when reading through a document, the network has context from words before and after the current word being read.  This type of network cannot be run continuously, because it needs endpoints in both the future and the past.  Despite this limitation, it is a powerful technique for some applications, such as part-of-speech tagging.
Figure 3: Bidirectional Recurrent Neural Net (BRNN)

--------------------
Link to the paper

Monday, June 27, 2016

Stateful Parameter in Keras

According to the Keras documentation, a stateful recurrent model is one for which the internal states (memories) obtained after processing a batch of samples are reused as initial states for the samples of the next batch.  This allows the model to process longer sequences while keeping computational complexity manageable.  To me, this is not completely clear, so I wanted to test this parameter to see how it affects a neural net.

I used lstm_text_generation.py as a baseline, and I modified it to test stateful=True and stateful=False.  In order to make the code run with stateful=True, I needed to make some changes.


First, I needed to use the parameter batch_input_shape instead of simply input_shape, which requires specifying the batch size, and the same batch size has to be passed when fitting, like this:
model.fit(X, y, batch_size=batch_size, nb_epoch=1, callbacks=[history])
Another requirement of using stateful=True is that the number of samples passed as input needs to be evenly divisible by the batch size.  This was problematic since it limited which batch sizes I could use, but I tried a batch size of 518, which is a factor of the number of samples in each iteration, 15022.

This solution didn't quite work, because I got another error:
ValueError: non-broadcastable output operand with shape (1,512) doesn't match the broadcast shape (518,512) 
512 refers to the number of nodes in each hidden layer of the neural net.  The error appears to be a shape mismatch in a matrix multiplication.  When the number of nodes in each layer was changed to 518 (from the original value of 512) to match the batch size, I received a similar error:
ValueError: non-broadcastable output operand with shape (1,518) doesn't match the broadcast shape (518,518) 
I could not figure out a proper solution to this problem, so I settled on a batch size of 1 so the matrix multiplication would work.  This dramatically increased the computation time, so in order to run the test in a matter of hours instead of days, I reduced the number of nodes in each hidden layer from 512 to 256.  I also ran only 10 iterations rather than the original 60.  My results are below:
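For reference, here is a minimal sketch of roughly how the stateful model ends up being declared with these settings (Keras 1 API, as used by lstm_text_generation.py; maxlen and the vocabulary size are stand-in values):

from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

batch_size = 1      # the batch size I eventually settled on
maxlen = 40         # length of each input sequence (stand-in value)
n_chars = 60        # size of the character vocabulary (stand-in value)

model = Sequential()
# With stateful=True, Keras needs the full batch shape up front,
# so batch_input_shape replaces input_shape.
model.add(LSTM(256, batch_input_shape=(batch_size, maxlen, n_chars),
               stateful=True, return_sequences=True))
model.add(LSTM(256, stateful=True))
model.add(Dense(n_chars))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')

When training a stateful model, model.reset_states() is typically called when starting a new pass over the text, so that state does not leak between unrelated sequences.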

Notice that for both graphs the loss is increasing.  For stateful=True, the loss increases monotonically, whereas with stateful=False there are some downward changes, but the overall trend is still increasing.  This is the opposite of what I would expect, because the loss should decrease as the network learns.  It's possible it needed more iterations, or that the smaller network kept it from learning, but I'm not yet sure of the reason for the increasing loss.

Monday, June 20, 2016

Summary of "The Unreasonable Effectiveness of Recurrent Neural Networks"

Link to the blog post 

Recurrent Neural Networks (RNNs) are a type of neural net that operates over sequences rather than only fixed-size vectors.  In Andrej's blog, he emphasizes that what is special about RNNs is that the input (and output) can be a sequence.  At each step, the input is combined with a state vector to produce a new state vector.  Even if the inputs are not naturally a sequence, the data can still be processed sequentially, which is more powerful than a conventional neural net.

Here is how Andrej describes the forward pass of a RNN:
rnn = RNN()
y = rnn.step(x) # x is an input vector, y is the RNN's output vector
class RNN:
  # ...
  def step(self, x):
    # update the hidden state
    self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
    # compute the output vector
    y = np.dot(self.W_hy, self.h)
    return y

The parameters of the RNN are W_hh, W_xh, and W_hy.  self.h is initialized with the zero vector.  According to Andrej, RNNs work monotonically better with more hidden layers and more nodes.
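To make the excerpt runnable, here is a minimal sketch that fills in the elided initialization (the sizes and the small random initialization are my own choices, not Andrej's):

import numpy as np

class RNN:
    def __init__(self, input_size, hidden_size, output_size):
        # small random weights; the hidden state starts as the zero vector
        self.W_hh = np.random.randn(hidden_size, hidden_size) * 0.01
        self.W_xh = np.random.randn(hidden_size, input_size) * 0.01
        self.W_hy = np.random.randn(output_size, hidden_size) * 0.01
        self.h = np.zeros(hidden_size)

    def step(self, x):
        # update the hidden state, then read the output off of it
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        return np.dot(self.W_hy, self.h)

rnn = RNN(input_size=50, hidden_size=100, output_size=50)
y = rnn.step(np.random.randn(50))  # x is an input vector, y is the output vector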

In his blog, he goes through several very cool examples of RNNs being trained on datasets, along with their outputs.  Training on text results in mostly nonsensical text with a format and a "sound" that seem to match the input text.  When given Shakespeare as input, the neural network produces samples that keep the format and verbiage you would expect from Shakespeare, but the story is nonsense and it loses its iambic pentameter.  Characters have a mix of dialogue and long monologues, which is again consistent with the style of the source material, but it lacks content.  Given all the machine is doing, this is quite impressive, if not especially useful.

Like many other models, a "temperature" parameter can be used to control the trade-off between fidelity to the training text and the diversity of the output.  A very low temperature is likely to get stuck at a peak of the likelihood and produce plain or repetitive output, while a high temperature is more likely to produce problems such as spelling mistakes.
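A minimal sketch of how such a temperature parameter is commonly applied when sampling the next character (this is the usual softmax reweighting, not code from the blog post):

import numpy as np

def sample_with_temperature(probs, temperature=1.0):
    # Sample an index from a probability vector after temperature scaling.
    # Low temperature sharpens the distribution (safer, more repetitive text);
    # high temperature flattens it (more diverse, more mistakes).
    probs = np.asarray(probs, dtype=np.float64)
    logits = np.log(probs + 1e-12) / temperature
    scaled = np.exp(logits - np.max(logits))
    scaled /= scaled.sum()
    return np.random.choice(len(scaled), p=scaled)

next_char_index = sample_with_temperature([0.6, 0.3, 0.1], temperature=0.5)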

When given Wikipedia pages as input, the neural net was able to produce output that, again, had proper formatting: it used the structured markup correctly to create headings and tables, cited sources (although incorrectly), and made up numbers for timestamps and ids.  It seems to have an understanding of the style of how the pages should look, if not the content.
The RNN was also able to work with LaTeX and other code, which didn't quite compile but looked the way code should look.  The network was good at getting the syntax right, even if the variable names were not meaningful.  It had difficulty with "long-term interactions", such as whether or not a return statement should be present in a function depending on the presence or absence of a void declaration.  Its context window may have been too short to relate the void in the function declaration to the absence of the return statement.

What I found most interesting was when he looked at individual neurons.  Initially, the neurons had random parameters, but after some training, they specialized into niches.  An individual neuron might be given an open parenthesis as an input.  If it gives a high score to, say, neuron 3, and a low score to the others, then neuron 3 is effectively being told to specialize in the stuff inside parentheses.  It will learn that a closing parenthesis should come at the end, and that the format inside may differ from normal text (such as a list of parameters in some code).  Effectively, the evolution of niches is an emergent property of the system!  These niches likely require a large neural network with several hidden layers and many neurons.

In summary, RNNs seem to do a very good job of mimicking the formatting, style, and overall feel of an input.  In the examples given, however, they are not able to give meaningful outputs, and they struggle with long-term interactions.  All of this raises the question: can these difficulties be overcome by bigger networks with more hidden layers and larger training sets, or is a totally different approach required to get more meaningful results?  Either way, RNNs seem to be unreasonably effective at getting so much right.

Monday, June 6, 2016

Gaussian Mixture Models

In trying to learn about Gaussian mixture models, I've had some difficulty in simply reading about them, but I've found some videos that give a good explanation.  I will summarize my understanding below, and link to the videos at the end.

Essentially, a Gaussian mixture model is a way to combine several Gaussian PDFs (probability density functions) of different shapes and sizes into a single distribution.  This is done by taking a linear combination of the individual Gaussian PDFs.
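Written out, the mixture density is just a weighted sum of K Gaussian components, with mixing weights that are non-negative and sum to one:

p(x) = \sum_{c=1}^{K} \pi_c \, \mathcal{N}(x \mid \mu_c, \Sigma_c), \qquad \sum_{c=1}^{K} \pi_c = 1, \quad \pi_c \ge 0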

Expectation-Maximization (EM) is a procedure that allows us to learn the parameters of the Gaussian mixture model.  These parameters are refined over several iterations.  The Expectation step (E-step) holds fixed the mean μ_c, covariance Σ_c, and size π_c of each Gaussian c.  The assignment probability r_ic that each data point i belongs to cluster c is then calculated.


In this example, the data point x is more likely to belong to distribution 2 (66% chance) than to distribution 1 (33% chance).
The Maximization step (M-step) keeps the assignment probabilities fixed while updating the parameters μ_c, Σ_c, and π_c.  The assignment probabilities r_ic are used to calculate the new parameters.  Here, m represents the total number of data points, whereas m_c represents the effective number of points assigned to cluster c.


This way, points with a large r_ic have more of an effect on the parameters than those with a small r_ic.  Each iteration of EM is guaranteed not to decrease the log-likelihood of the model, but it can still converge to a local optimum, so initialization is important.
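Here is a minimal numpy sketch of one EM iteration for a Gaussian mixture, following the steps above (my own code, written for clarity rather than numerical robustness):

import numpy as np
from scipy.stats import multivariate_normal

def em_step(X, pi, mu, Sigma):
    # One EM iteration. X: (m, d) data; pi: (K,); mu: (K, d); Sigma: (K, d, d).
    m, d = X.shape
    K = len(pi)

    # E-step: assignment probabilities r[i, c] with the parameters held fixed
    r = np.zeros((m, K))
    for c in range(K):
        r[:, c] = pi[c] * multivariate_normal.pdf(X, mean=mu[c], cov=Sigma[c])
    r /= r.sum(axis=1, keepdims=True)

    # M-step: update the parameters with the assignments held fixed
    m_c = r.sum(axis=0)                          # effective number of points per cluster
    pi = m_c / m
    mu = np.dot(r.T, X) / m_c[:, None]
    for c in range(K):
        diff = X - mu[c]
        Sigma[c] = np.dot((r[:, c, None] * diff).T, diff) / m_c[c]
    return pi, mu, Sigma, r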

Using Expectation-Maximization, the model can learn the parameters over time and refine a distribution given datapoints from overlapping distributions.

-----------------------------------
link to videos:
https://www.youtube.com/watch?v=Rkl30Fr2S38
https://www.youtube.com/watch?v=qMTuMa86NzU

Friday, June 3, 2016

Binary Classifier Error Testing

Using the Keras package, I wrote some code to make a binary classifier neural network.  Given the inputs of "age" and "LTI" (the Loan To Income ratio), the network should output a '1' if the individual is predicted to default on the loan, and a '0' if they are not.
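For context, here is a minimal sketch of the kind of model used here (Keras 1 style; the exact layer sizes, the column layout of creditset.csv, and the name of the label column are assumptions on my part):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# Assumed layout of creditset.csv: age, LTI, default (0 or 1)
data = np.genfromtxt('creditset.csv', delimiter=',', skip_header=1)
X, Y = data[:, 0:2], data[:, 2]

model = Sequential()
model.add(Dense(64, input_dim=2, activation='tanh'))   # hidden layer 1
model.add(Dense(64, activation='tanh'))                # hidden layer 2
model.add(Dense(1, activation='sigmoid'))              # output: probability of default
model.compile(loss='binary_crossentropy', optimizer='adam')

model.fit(X, Y, nb_epoch=20, batch_size=32)
y1 = (model.predict(X) > 0.5).astype(int).ravel()      # predicted 0/1 labels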

I tested using several different activation functions, and here is what I've found:

Wrong Approach

First, I defined the number of errors to be:

nb_errors = abs(sum(y1 - Y_test)),

where y1 is the output of the neural network, and Y_test is the actual results from the dataset.

This is the wrong approach because it means that false positives make my results look better.  If the network outputs a '1' where there should be a '0', that will effectively cancel out some of the false negatives, where the network outputs a '0' but there should be a '1'.  This resulted in some interesting behavior.
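A tiny example shows how the cancellation happens (made-up predictions, not my actual outputs):

import numpy as np

Y_test = np.array([1, 0, 0, 1, 0])   # true labels
y1     = np.array([0, 1, 0, 0, 0])   # predictions: two false negatives, one false positive

print(abs(sum(y1 - Y_test)))   # 1  -- the false positive cancels a false negative
print(sum(abs(y1 - Y_test)))   # 3  -- the actual number of errors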



With only 1 hidden layer, both networks performed much worse than with 2, even when the number of nodes was the same.  The number of errors saturates when there are few nodes with the 'relu' activation function, because the network is outputting an answer of all zeros.  In these instances it hasn't really learned anything yet, but the all-zeros answer still gets ~85% of the outputs right.

More interesting is the performance of 'tanh'.  It actually does better with fewer nodes so long as there are two hidden layers.  This is because 'tanh' seems to give more false positives than 'relu' for this dataset.

Right Approach

The number of errors should be calculated by:

nb_errors = sum(abs(y1 - Y_test)),

This way, false positives and false negatives are treated the same, so you can get the total number of errors by simply adding them up.  Here are the results from this approach:



'relu' and 'tanh' activation functions seemed to perform similarly for high numbers of nodes, but 'tanh' performed better overall, especially when there were fewer nodes.  More hidden layers add computation time, but make a big difference in accuracy.  Of the 1200 points tested, the network gets ~50 answers wrong with two layers of 64 nodes, which is not bad.

I also tested the 'sigmoid' activation function, and it performed poorly.  It was not able to break away from the all-zeros answer until the very end of the last simulation, and even then the results were poor.  Thus, sigmoid does not appear to be a good choice of activation function for the hidden layers.

Conclusion

'tanh' performs better than 'relu' for this binary classification problem, but it also has a higher frequency of false positives.  This is fine for the lending industry, which may be hesitant to give out risky loans, but if false positives are worse than false negatives, then the 'relu' activation function may be the better choice, so long as enough nodes are used to get a good answer.
---------------------------
download the data set: creditset.csv
download my: python code