Monday, June 20, 2016

Summary of "The Unreasonable Effectiveness of Recurrent Neural Networks"

Link to the blog post 

Recurrent Neural Networks (RNNs) are a type of neural net that can operate over sequences of vectors rather than a single fixed-size input.  In Andrej's blog, he emphasizes that this is what makes RNNs special: at each step, an input vector is combined with a state vector to produce a new state vector, so the network carries information forward across the sequence.  Even if the inputs are not naturally a sequence, the data can still be processed sequentially, which is more powerful than a conventional feed-forward neural net.

Here is how Andrej describes the forward pass of an RNN:

import numpy as np

class RNN:
  # ...
  def step(self, x):
    # update the hidden state
    self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
    # compute the output vector
    y = np.dot(self.W_hy, self.h)
    return y

rnn = RNN()
y = rnn.step(x) # x is an input vector, y is the RNN's output vector

The parameters of the RNN are the weight matrices W_hh, W_xh, and W_hy, and self.h is initialized to the zero vector.  According to Andrej, RNNs work monotonically better with more hidden layers and more nodes.
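To make the snippet above concrete, here is a minimal sketch of initializing those weight matrices and stepping the RNN over a short sequence.  The layer sizes, the 0.01 weight scale, and the one-hot inputs are illustrative assumptions of mine, not values from the post:

```python
import numpy as np

class RNN:
    def __init__(self, input_size, hidden_size, output_size, seed=0):
        rng = np.random.default_rng(seed)
        # small random initial weights; the 0.01 scale is an illustrative choice
        self.W_hh = rng.standard_normal((hidden_size, hidden_size)) * 0.01
        self.W_xh = rng.standard_normal((hidden_size, input_size)) * 0.01
        self.W_hy = rng.standard_normal((output_size, hidden_size)) * 0.01
        self.h = np.zeros(hidden_size)  # hidden state starts as the zero vector

    def step(self, x):
        # combine the input with the previous state to get the new state
        self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
        # compute the output vector from the new state
        return np.dot(self.W_hy, self.h)

rnn = RNN(input_size=5, hidden_size=8, output_size=5)
for t in range(3):              # feed a short sequence of one-hot input vectors
    x = np.zeros(5)
    x[t] = 1.0
    y = rnn.step(x)
print(y.shape)
```

Because step() feeds self.h back into itself, the output at each step depends on the whole sequence seen so far, which is the property the post is built around.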

In his blog, he goes through several very cool examples of RNNs trained on different datasets, along with their outputs.  Training on text produces mostly nonsensical output whose format and "sound" seem to match the input text.  Given Shakespeare as input, the network produces samples that keep the formatting and verbiage you would expect from Shakespeare, but the story is nonsense and the iambic pentameter is lost.  Characters have a mix of dialogue and long monologues, which is again consistent with the style of the source material, but it lacks content.  Given all the machine is doing, this is quite impressive, if not especially useful.

Like many other generative models, a "temperature" parameter can be used to control the trade-off between faithfulness to the training text and diversity of the samples.  A very low temperature is likely to get stuck at a peak of the likelihood and produce plain or repetitive output.  A high temperature is more likely to produce problems such as spelling mistakes.
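The usual way temperature works in character-level sampling is to divide the output scores by the temperature before applying a softmax.  Here is a sketch; the logit values below are made up for illustration, and only the softmax-with-temperature formula is the point:

```python
import numpy as np

def softmax_with_temperature(logits, temperature):
    scaled = logits / temperature
    scaled = scaled - scaled.max()      # subtract max for numerical stability
    e = np.exp(scaled)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])      # hypothetical scores for 3 characters
print(softmax_with_temperature(logits, 0.2))  # low T: sharply peaked, repetitive samples
print(softmax_with_temperature(logits, 5.0))  # high T: nearly flat, diverse but error-prone
```

At low temperature nearly all the probability mass lands on the top-scoring character, which is why samples become conservative and repetitive; at high temperature the distribution flattens and unlikely characters (including misspellings) get sampled more often.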

When given Wikipedia pages as input, the neural net was able to produce output that again had proper formatting: it used the structured markup correctly to create headings and tables, cited sources (although incorrect ones), and completely made up numbers for timestamps and id numbers.  It seems to have an understanding of the style of how the pages should look, if not the content.
The RNN was also able to work with LaTeX and other code, which didn't quite compile but looked the way code should look.  The network was good at getting the syntax right, even though the variable names were meaningless.  It had difficulty with "long-term interactions," such as whether or not a return statement should be present in a function depending on the absence or presence of a void declaration.  Its effective context may have been too short to relate the presence of void earlier in the code to the absence of the return statement.

What I found most interesting was when he looked at individual neurons.  Initially, the neurons have random parameters, but after some training they specialize into niches.  For example, when an open parenthesis arrives as input, the network might drive one particular neuron (say, neuron 3) to a high activation while the others stay low; neuron 3 is then effectively being told to specialize in tracking what happens inside parentheses.  It can learn that a closing parenthesis should come at the end, and that the format inside may differ from normal text (such as a list of parameters in some code).  Effectively, the evolution of niches is an emergent property of the system!  These niches should require a large neural network with several hidden layers and many neurons.
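The kind of inspection described above can be sketched by running a character sequence through the RNN and recording one hidden unit's activation at each step.  The weights here are random, so this trace only shows the mechanics; the specialization (a unit firing consistently inside parentheses) is something that emerges with training, not something this untrained sketch will exhibit:

```python
import numpy as np

text = "f(a, b)"
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}   # char -> one-hot index
V, H = len(chars), 16                       # vocabulary and hidden sizes (illustrative)

rng = np.random.default_rng(0)
W_hh = rng.standard_normal((H, H)) * 0.1    # random, untrained weights
W_xh = rng.standard_normal((H, V)) * 0.1
h = np.zeros(H)

trace = []
for c in text:
    x = np.zeros(V)
    x[idx[c]] = 1.0
    h = np.tanh(W_hh @ h + W_xh @ x)
    trace.append((c, h[3]))                 # watch hidden unit 3, as in the summary

for c, a in trace:
    print(f"{c!r}: {a:+.3f}")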

In summary, RNNs seem to do a very good job of mimicking the formatting, style, and overall feel of an input.  In the examples given, however, they are not able to produce meaningful output, and they struggle with long-term interactions.  All this raises the question: can these difficulties be overcome by bigger networks with more hidden layers and larger training sets, or is a totally different approach required to get more meaningful results?  Either way, RNNs seem to be unreasonably effective in getting so much right.
