This paper frequently uses the term "feature" to refer to an input such as a word, suffix, or part-of-speech tag. In general, this paper discusses these features being input as a vector into a neural network, which returns an output of another vector with a possibly different dimension. Features are represented as dense vectors. One important reason for the use of dense vectors is to ease computation by keeping the number of dimensions low. The main benefit of the dense representation is "generalization power". Learned vectors for two different but related features ('cat' and 'dog'), have similar features, and the model can take this into account. Figure 1b below shows how features can be represented as dense vectors, but the paper does not seem to explain how this works. These dense, trainable vectors are a fundamental part of how neural networks work.
The number of dimensions of each vector should be is important, because more dimensions mean better accuracy at the cost of more memory, and more processing time. Still, the paper states that there are no best-practices for the number of dimensions of different features. It is therefore recommended to experiment with different sizes and pick a dimension that strikes a good balance.
Features that contain the same words but in different orders can share the same vector, or be made into different vectors. A decision needs to be made to decide how important this context is when deciding whether to share vectors or not.
A detailed example is the Feed-forward Neural Network. Figure 2 below is an example of how the feed-forward neural network is structured. Each arrow on the network carries a weight to reflect its importance. The sigmoid shape inside the neurons in the hidden layer represent a non-linear function. This network can be described as mathematical operations with a matrix W of weights and a vector b of bias terms. These parameters will change as the network learns and attempts to minimize the "Loss function".
The paper gives several examples of non-linear functions that transform each value x into the range [0,1] or [-1,1]. These non-linear transformations seem to be fairly arbitrary, and the paper states that there is no real theory for which non-linearity to apply under which conditions.
In learning, the network is trying to minimize the loss function. The loss function can be defined several ways, but ultimately it is a comparison of the output and the intended output. Loss is small when the output is similar to the intended output.
Embedding vectors are initially randomized to random values and are tuned by training. "Supervised" pre-training is used to initialize rare features, such as features for individual words, but I wonder if this biases the end result. Unsupervised pre-training can be used when large amounts of text is available. This is useful for building vectors for a large vocabulary of words.
Training methods boil down to repeating the following steps: computing an error estimate over the dataset, computing the gradient with respect to the error, and moving the parameters in the direction of the gradient. Different models have different ways of calculating the error and how "moving in the direction of the gradient" is defined.
The basic algorithm for training is Stochastic Gradient Training (SGD). The error from this algorithm is calculated based on a single training example, and is thus only a rough estimate. The noise in this estimate can result in inaccurate gradients. This noise can be reduced by calculating the error and gradients on a sample of m examples, referred to as a minibatch. This improves accuracy of the gradients and can allow for parallel computing.
The loss function can get stuck in local minima or saddle points, so starting from different initial points can give different results. It is thus important to run several training runs with different initializations and picking the best one. I suspect this is another instance where the result could be biased.
In deep networks, gradients can often vanish or become exceedingly high as they move through the network. This can be mitigated by making the network shallower, stepwise training by training the layers independently, and making specialized architecture to assist the gradient flow.
Some transformations can become saturated, resulting in output values for a layer near 0 or 1. This results in very small gradients, which should be avoided because this slows learning. This can be controlled by scaling the range of input values or changing the learning rate. Normalization can be used, but this tends to be expensive in terms of computation.
Learning rates that are too large can prevent the network from converging, whereas too small can result in the outputs taking a long time to converge. The paper proposes experimenting with rates between [0,1] (0.001,0.01,0.1,1). Learning rate should be decreased if the network seems to get stuck in a region, and can be decreased dynamically according to a schedule.
Overfitting can be a problem in neural network models as the models have many parameters. This can be mitigated by regularization, such as placing a penalty on parameters with large values by adding a multiple of the L2 norm of the parameters to the function being minimized.
Model Cascading is a technique where large networks are built from smaller networks. Hidden layers of smaller networks can be used as input to a larger network. This results in a very deep network, which can cause the vanishing gradient problem. This can be solved by training the networks separately before combining them for further training.
It is often important to understand the context around a word and not just the word itself. For this reason, context windows can be used. The size of this window has a strong effect on vector similarities. For instance, larger windows tend to produce topical similarities ("dog","bark","leash"). Smaller windows tend to produce more functional similarities ("Poodle","Pitbull","Rottweiler"). Positional contexts can also be considered so that words far from the focus word and those near are factored in differently (context word of "the" becomes "the:+2" to indicate that it is two positions to the right of the focus word).
Convolutional Neural Networks (CNN) are also important in modeling context. A CNN identifies local predictors in a large structure and combine them to produce a vector representation of the structure, keeping track of the local aspects that are most informative. This works by taking a window of words and transforming it into a vector the captures the most important properties. Vectors from multiple windows are combined into a single vector by taking the average or max value of each dimension. This way, the most important features from the windows can be focused on regardless of their location. Each dimension of the vector should capture a different kind of information. Additional context information may be preserved by k-max pooling, where the 'k' best elements in each dimension are retained instead of only the best one.
Recurrent Neural Networks (RNN) allow arbitrarily-sized structured inputs to be represented as a fixed-size vector, while preserving the structured properties. The network is built by a recursively defined function that takes as an input a state vector si, and an input vector xi+1, and results in a new state vector si+1 and an output vector yi. See the figure below for a visual representation of this idea.
The RNN can be thought of as a very deep neural network, in which the same parameters are shared across many parts of the computation. A backpropagation algorithm can be used to compute the gradients with respect to the loss of the unrolled graph. This is known as backpropagation through time (BPTT).
Recursive Neural Networks (RecNN) is a generalization of the RNN from sequences to binary trees. The paper claims that this can be useful in language processing because it is often natural to work with tree structures, but I did not get a good sense of why this was important.