Sunday, July 10, 2016

Summary of "A Critical Review of Recurrent Neural Networks for Sequence Learning"

This paper gives an overview of recurrent neural networks (RNNs), bidirectional recurrent neural networks (BRNNs), and long short-term memory (LSTM) networks.

The paper highlights some key features of RNNs that make them important.  They retain a state that can represent information from an arbitrarily long context window, which lets them model inputs and outputs whose elements are not independent of one another.  Unlike conventional feedforward networks, the state of the network is not lost after each data point is processed.  These features are possible because RNNs include edges that span adjacent time steps, which introduces the notion of time to the model and allows the hidden nodes of the network to remember prior states.
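To make the recurrence concrete, here is a minimal sketch of a single RNN time step in NumPy (my own illustration, not code from the paper; the dimensions, weight names, and initialization are arbitrary choices):

import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
W_xh = rng.normal(scale=0.1, size=(hidden_size, input_size))   # input -> hidden
W_hh = rng.normal(scale=0.1, size=(hidden_size, hidden_size))  # recurrent edge: hidden -> hidden
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    # The new hidden state depends on the current input and on the state
    # carried over from the previous time step (the recurrent edge).
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(hidden_size)                      # initial state
for x_t in rng.normal(size=(5, input_size)):   # a short toy sequence
    h = rnn_step(x_t, h)                       # the state is threaded through time
print(h.shape)                                 # (8,)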

Training RNNs introduces some new challenges.  Vanishing and exploding gradients occur when backpropagating errors across many time steps: as time passes, the state of the model at time 0 contributes less and less to the output compared with the current input, and this falloff happens exponentially fast.  The exploding gradient problem can be addressed by truncated backpropagation through time (TBPTT), which limits the number of time steps through which error can propagate, but this sacrifices the RNN's ability to learn long-range dependencies.
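As a rough sketch of truncated BPTT (my own illustration; the library, toy dimensions, and truncation length k are my choices, not the paper's), the hidden state can be detached from the computation graph every k steps so that gradients never flow further back than k time steps.  PyTorch is used here purely for its automatic differentiation:

import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)
readout = nn.Linear(16, 1)
optimizer = torch.optim.SGD(list(rnn.parameters()) + list(readout.parameters()), lr=0.01)
k = 20                                  # truncation length (an arbitrary choice)
x = torch.randn(1, 200, 8)              # toy input: 1 sequence, 200 time steps
y = torch.randn(1, 200, 1)              # toy targets
hidden = torch.zeros(1, 1, 16)
for start in range(0, x.size(1), k):
    hidden = hidden.detach()            # cut the graph: no gradient flows past this point
    output, hidden = rnn(x[:, start:start + k], hidden)
    loss = ((readout(output) - y[:, start:start + k]) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()                     # backpropagates at most k steps
    optimizer.step()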

Figure 1: Demonstration of the vanishing gradient problem.  When the weight along the recurrent edge is less than 1, the state of the hidden layer is diluted by the input as time passes.
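A quick back-of-the-envelope version of this (my own example, not from the paper): in a linear recurrence h_t = w * h_{t-1} + x_t with a single recurrent weight w, the influence of h_0 on h_T scales as w**T, which collapses toward zero when |w| < 1 and blows up when |w| > 1:

T = 100                                 # number of time steps
for w in (0.9, 1.0, 1.1):
    print(f"w = {w}: contribution of h_0 after {T} steps ~ {w ** T:.3e}")
# w = 0.9 -> ~2.7e-05 (vanishes), w = 1.0 -> 1.0, w = 1.1 -> ~1.4e+04 (explodes)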
The LSTM model is another way to address the vanishing gradient problem.  Here, each node in the hidden layer of the RNN is replaced with a memory cell (Figure 2).  Each memory cell contains a node with a self-connected recurrent edge of fixed weight 1, which prevents the gradient from vanishing or exploding.  The forget gate 'f' controls this recurrent edge, allowing the network to eventually forget information; the value of the forget gate is learned and depends on the input.  The internal state 's' carries forward its value from the previous time step multiplied by the value of the forget gate.
Figure 2:  LSTM memory cell
The input node 'g' and the input gate 'i' multiply their outputs, and the product is added to the internal state.  The input gate is a distinctive feature of the LSTM approach; its purpose is to control how much of the input node's output flows into the internal state.  A value of zero effectively cuts off the input, leaving the internal state unchanged, a value of one lets the input node's output through in full, and intermediate values are allowed as well.

The output gate 'o' is multiplied with the activated internal state to produce the output of the hidden layer.  The internal state is commonly passed through a tanh activation first, although a ReLU activation is apparently easier to train.

Each of these gates learns when to let information into or out of the memory cell, which improves the network's ability to learn long-range dependencies compared to simple RNNs.
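Pulling the pieces above together, here is a minimal sketch of one LSTM memory-cell update in NumPy, following the summary's g/i/f/o/s notation (my own illustration; the weight shapes, initialization, and toy dimensions are arbitrary, not prescribed by the paper):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, s_prev, W, b):
    z = np.concatenate([x_t, h_prev])     # current input plus previous hidden output
    g = np.tanh(W['g'] @ z + b['g'])      # input node
    i = sigmoid(W['i'] @ z + b['i'])      # input gate: how much of g is let in
    f = sigmoid(W['f'] @ z + b['f'])      # forget gate: how much of s_prev survives
    o = sigmoid(W['o'] @ z + b['o'])      # output gate: how much of the state is exposed
    s = g * i + s_prev * f                # internal state with its weight-1 recurrent edge
    h = np.tanh(s) * o                    # output of the memory cell
    return h, s

rng = np.random.default_rng(1)
input_size, hidden_size = 3, 5
W = {k: rng.normal(scale=0.1, size=(hidden_size, input_size + hidden_size)) for k in 'gifo'}
b = {k: np.zeros(hidden_size) for k in 'gifo'}
h, s = np.zeros(hidden_size), np.zeros(hidden_size)
for x_t in rng.normal(size=(4, input_size)):
    h, s = lstm_step(x_t, h, s, W, b)
print(h.shape, s.shape)                   # (5,) (5,)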

The paper also has a relatively small section on bidirectional recurrent neural networks (BRNNs) (Figure 3).  BRNNs allow the network to receive information from both the 'future' and the past: when reading through a document, for example, the network has context from the words before and after the current word.  This type of network cannot be run continuously, because it needs an endpoint in both the future and the past.  Despite this limitation, it is a powerful technique for applications such as part-of-speech tagging.
Figure 3: Bidirectional Recurrent Neural Net (BRNN)
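As a rough sketch of the bidirectional idea (again my own illustration with arbitrary dimensions, not code from the paper), one simple RNN reads the sequence from past to future, a second independently parameterized RNN reads it from future to past, and the two hidden states for each position are concatenated:

import numpy as np

rng = np.random.default_rng(2)
input_size, hidden_size, seq_len = 4, 6, 7

def make_rnn():
    # One direction's parameters: input -> hidden and hidden -> hidden weights.
    return (rng.normal(scale=0.1, size=(hidden_size, input_size)),
            rng.normal(scale=0.1, size=(hidden_size, hidden_size)))

def run_rnn(xs, params):
    # Run a simple RNN over the whole sequence and collect every hidden state.
    W_xh, W_hh = params
    h = np.zeros(hidden_size)
    states = []
    for x_t in xs:
        h = np.tanh(W_xh @ x_t + W_hh @ h)
        states.append(h)
    return np.stack(states)

xs = rng.normal(size=(seq_len, input_size))
forward = run_rnn(xs, make_rnn())               # reads past -> future
backward = run_rnn(xs[::-1], make_rnn())[::-1]  # reads future -> past, then re-aligned
bidirectional = np.concatenate([forward, backward], axis=1)
print(bidirectional.shape)                      # (7, 12): each position sees both contexts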

--------------------
Link to the paper
