Tuesday, May 17, 2016

Summary of "A Primer on Neural Network Models for Natural Language Processing" Goldberg, 2015

This paper covers the use of Neural Networks for use in Natural Language Processing (NLP).  It is more detailed than the previous papers I've summarized.  It's explanations are a bit jumbled, in that it introduces topics without explaining them, only to elaborate on them several sections (and many pages) later.  This made it a more difficult read, but I will try to give an overview.















This paper frequently uses the term "feature" to refer to an input such as a word, suffix, or part-of-speech tag.  In general, this paper discusses these features being input as a vector into a neural network, which returns an output of another vector with a possibly different dimension.  Features are represented as dense vectors.  One important reason for the use of dense vectors is to ease computation by keeping the number of dimensions low.  The main benefit of the dense representation is "generalization power".  Learned vectors for two different but related features ('cat' and 'dog'), have similar features, and the model can take this into account.  Figure 1b below shows how features can be represented as dense vectors, but the paper does not seem to explain how this works.  These dense, trainable vectors are a fundamental part of how neural networks work.

The number of dimensions of each vector should be is important, because more dimensions mean better accuracy at the cost of more memory, and more processing time.  Still, the paper states that there are no best-practices for the number of dimensions of different features.  It is therefore recommended to experiment with different sizes and pick a dimension that strikes a good balance.

Features that contain the same words but in different orders can share the same vector, or be made into different vectors.  A decision needs to be made to decide how important this context is when deciding whether to share vectors or not.

A detailed example is the Feed-forward Neural Network.  Figure 2 below is an example of how the feed-forward neural network is structured.  Each arrow on the network carries a weight to reflect its importance.  The sigmoid shape inside the neurons in the hidden layer represent a non-linear function.  This network can be described as mathematical operations with a matrix W of weights and a vector b of bias terms.  These parameters will change as the network learns and attempts to minimize the "Loss function".
The paper gives several examples of non-linear functions that transform each value x into the range [0,1] or [-1,1].  These non-linear transformations seem to be fairly arbitrary, and the paper states that there is no real theory for which non-linearity to apply under which conditions.

In learning, the network is trying to minimize the loss function.  The loss function can be defined several ways, but ultimately it is a comparison of the output and the intended output.  Loss is small when the output is similar to the intended output.  

Embedding vectors are initially randomized to random values and are tuned by training.  "Supervised" pre-training is used to initialize rare features, such as features for individual words, but I wonder if this biases the end result.  Unsupervised pre-training can be used when large amounts of text is available.  This is useful for building vectors for a large vocabulary of words.  

Training methods boil down to repeating the following steps: computing an error estimate over the dataset, computing the gradient with respect to the error, and moving the parameters in the direction of the gradient.  Different models have different ways of calculating the error and how "moving in the direction of the gradient" is defined.

The basic algorithm for training is Stochastic Gradient Training (SGD). The error from this algorithm is calculated based on a single training example, and is thus only a rough estimate.  The noise in this estimate can result in inaccurate gradients.  This noise can be reduced by calculating the error and gradients on a sample of m examples, referred to as a minibatch.  This improves accuracy of the gradients and can allow for parallel computing.

The loss function can get stuck in local minima or saddle points, so starting from different initial points can give different results.  It is thus important to run several training runs with different initializations and picking the best one.  I suspect this is another instance where the result could be biased.  

In deep networks, gradients can often vanish or become exceedingly high as they move through the network.  This can be mitigated by making the network shallower, stepwise training by training the layers independently, and  making specialized architecture to assist the gradient flow.

Some transformations can become saturated, resulting in output values for a layer near 0 or 1.  This results in very small gradients, which should be avoided because this slows learning.  This can be controlled by scaling the range of input values or changing the learning rate.  Normalization can be used, but this tends to be expensive in terms of computation.

Learning rates that are too large can prevent the network from converging, whereas too small can result in the outputs taking a long time to converge.  The paper proposes experimenting with rates between [0,1] (0.001,0.01,0.1,1).  Learning rate should be decreased if the network seems to get stuck in a region, and can be decreased dynamically according to a schedule.

Overfitting can be a problem in neural network models as the models have many parameters.  This can be mitigated by regularization, such as placing a penalty on parameters with large values by adding a multiple of the L2 norm of the parameters to the function being minimized.  

Model Cascading is a technique where large networks are built from smaller networks.  Hidden layers of smaller networks can be used as input to a larger network.  This results in a very deep network, which can cause the vanishing gradient problem.  This can be solved by training the networks separately before combining them for further training.

It is often important to understand the context around a word and not just the word itself.  For this reason, context windows can be used.  The size of this window has a strong effect on vector similarities.  For instance, larger windows tend to produce topical similarities ("dog","bark","leash").  Smaller windows tend to produce more functional similarities ("Poodle","Pitbull","Rottweiler").  Positional contexts can also be considered so that words far from the focus word and those near are factored in differently (context word of "the" becomes "the:+2" to indicate that it is two positions to the right of the focus word).

Convolutional Neural Networks (CNN) are also important in modeling context.  A CNN identifies local predictors in a large structure and combine them to produce a vector representation of the structure, keeping track of the local aspects that are most informative.  This works by taking a window of words and transforming it into a vector the captures the most important properties.  Vectors from multiple windows are combined into a single vector by taking the average or max value of each dimension.  This way, the most important features from the windows can be focused on regardless of their location.  Each dimension of the vector should capture a different kind of information.  Additional context information may be preserved by k-max pooling, where the 'k' best elements in each dimension are retained instead of only the best one.

Recurrent Neural Networks (RNN) allow arbitrarily-sized structured inputs to be represented as a fixed-size vector, while preserving the structured properties.  The network is built by a recursively defined function that takes as an input a state vector si, and an input vector xi+1, and results in a new state vector si+1 and an output vector yi.  See the figure below for a visual representation of this idea.  
The RNN can be thought of as a very deep neural network, in which the same parameters are shared across many parts of the computation.  A backpropagation algorithm can be used to compute the gradients with respect to the loss of the unrolled graph.  This is known as backpropagation through time (BPTT).  

Recursive Neural Networks (RecNN) is a generalization of the RNN from sequences to binary trees.  The paper claims that this can be useful in language processing because it is often natural to work with tree structures, but I did not get a good sense of why this was important.

Monday, May 9, 2016

Summary of "From Frequency to Meaning: Vector Space Models of Semantics" P. Turney, P. Pantel, 2010

This paper describes using Vector Space Models (VSMs) as a way to understand meaning in human language.  The first example of a VSM discussed, is that of a search engine.  Each document can be considered a point in space.  Points close together are similar, while points far apart are semantically different.  A query can be considered a pseudo-document in this space, and the documents can be sorted in order of distance in this space.

The defining feature of a VSM is that the elements in the VSM are come from event frequencies, such as the number of times a word appears in context.  These elements can be compared for similarity using vector mathematics.  This idea is applied to three types of matrices; term-document, word-context, and pair-pattern.

In a term-document matrix, row vectors correspond to terms (words), and column vectors correspond to documents.  The value of an element in this matrix is the frequency of a particular word in a particular document.  This method only considers word frequency, not context.  This results in a very large sparse matrix,

The word-context matrix look at the immediate context around a word rather than a whole document.  The fundamental hypothesis of this technique is that words that occur in similar contexts tend to have similar meanings.  This method is inspired by looking at the rows of words in the term-document matrix, rather than looking at the columns, thus it finds attributional similarity between the words.

In pair-pattern matrices, row vectors correspond to pairs of words, and column vectors orrespond to the patterns that the pairs occur in, such as "X cuts Y".  The idea is that patterns that co-occur with similar pairs have similar meaning, so word pairs with similar row vectors tend to have similar semantic relations.  This method is useful for determining if one sentence is a paraphrase of another.  This matrix can be used for triple-patterns, or n-tuples of words, but these matrices become more and more sparse as the higher n patterns become more rare.  A denser pair-pattern matrix is often more useful.

Before the matrix can be generated, it is important to do some processing to the raw text.  This processing seems like it fairly specific to linguistic uses, but there are some aspects that may be more widely useful.  The raw text needs to be "tokenized" to decide what constitutes a term.  This includes dealing with punctuation, hyphenization, and contractions.  The text should also be normalized to convert similar words to the same form (Car vs car vs Cars can be normalized to car).  They should also be annotated by making different words that look the same.  Normalization has the effect of reducing precision but increasing the sensitivity (or "recall") of a system.  For a small corpus, excessive normalization may be necessary, whereas for a large corpus normalization may be able to be ignored to keep precision high.  Annotation is the opposite of normalization, so it decreases recall and increases precision.

After the text has been processed and the matrix has been generated, weights can be applied.  This is important because common words are often less informative than rarer words. Some words can be given a weight of zero and thus removed from the matrix entirely.  Words that are relatively common in the document, but relatively rare in the corpus can be weighted highly.  This is known as tf-idf weghting (term frequency × inverse document frequency).  This can be biased towards infrequent events, but this problem can be accounted for.

The matrix can also be smoothed to reduce the amount of random noise.  Words with a low weighting can be eliminated with little effect on precision.  This idea seems like it could be important for signal generation.  Singular Value Decomposition (SVD) can be used to decompose the sparse matrix into three smaller, denser matrices.  This makes computation easier by reducing the number of dimensions.  It also has a smoothing effect and can reveal when words show up in similar contexts.  The paper makes the claim that truncated SVD is able to simulate missing text and compensate for lack of data.  I find this hard to believe, but it makes some sense if I think of it in a context of signals and "filling in the gaps" or generating a line of best fit through data points.

Once a matrix is generated, weighted, and smoothed the similarity of the frequency vectors can be calculated.  The most common method is taking the cosine of the angle between the vectors by taking the inner product of the vectors after they are normalized to unit length.


The cosine ranges from -1 to 1, where values closer to 1 represent a high degree of similarity between the two vectors.  These calculations are slow O(nr2 nc) where nr is the number of rows and nc is the number of columns in the matrix.   The expense of calculating the similarity matrix can be reduced by only calculating the similarity when both vectors share a nonzero element.

Rather than using truncated SVD to lower the dimension of a matrix, high-dimension vectors can be randomly projected onto a lower-dimensional subspace without a large impact on the similarity scores, but I don't really understand how this process works.

The three matrix types discussed in this paper are not an exhaustive look at VSMs, but they demonstrate how computers could better understand language.  Most of the meaning in English text comes from word choice, but word order is still important, and is something that VSMs often ignore, but even this can be accounted for to some degree.

Link to the paper


Tuesday, May 3, 2016

Summary of "Two Decades of Statistial Language Modeling: Where Do We Go From Here?" R. Rosenfeld, 2000

I decided to start my literature summary by starting with the oldest paper, to get more of a chronological overview.

This paper argues for a Bayesian approach to statistical language models.  It gives an overview of techniques used in Statistical Language Modeling (SLM), but it is not yet obvious to me how much these techniques will carry over to machine learning for signal generation.  Many of the techniques discussed seem to have a very narrow scope and may not useful for my purposes.  Generally, the techniques have a huge number of parameters and require large amounts of training data.

Bigram, trigram, and n-gram models are discussed.  This is a new concept for me, but what it means is the models look at 2, 3, or 'n' adjacent letters or words at a time.  Trigrams for letters could be words like 'the' or 'and' or fragments like 'ing'.  An example of words used as trigrams is "The quick red" and "quick red fox".  Essentially, the computer is looking at a narrow window for comparison.  Raising he value of n is a trade-off between stability and bias.

The statistical language models in the paper all use Bayes Theorem to find the most likely sentence or classification based on a probability distribution.  Quality of a language modeling technique is measured in terms of cross-entropy and perplexity.  I don't have a good intuitive sense of cross-entropy, but it is proportional to the log-likelihood of the model distribution.
Perplexity = 2cross-entropy 
Perpelexity is a measure of the branching factor of the language according to the model.  It estimates both the complexity of the language and the quality of the model.  Lower perplexity suggests a better quality model.

Language models discussed in the paper are extremely sensitive to changes in the style, topic, or genre of the text they are trained on.  This seems obvious, but it is important to keep in mind, because I suspect this will be true for signal generation as well.  These models assume large amounts of independence, which as we know from language is not a great assumption.  In language, context is very important, but many models may only be considering nearby words and not words in more distant parts of the document. These false independence assumptions result in overly sharp posterior distributions.  I suspect that this would be a more serious problem for signal generation.

Decision tree models were described as models that partition the history of the words in the text by asking "arbitrary questions about the history".  The types of questions were not detailed in the paper, so I don't really know what kinds of questions they are referring to.  Questions are picked by greedily selecting the ones that have the greatest reduction in entropy.  This type of decision tree model is considered to be optimal in theory, but as of the writing of this paper, results were disappointing.

Context Free Grammar (CFG) is another model that was discussed.  Like most of the models, the computer has no real understanding of the language, it is simply following a set of rules that it builds itself.  This model mentions terminals and non-terminals, but the paper does not define them clearly and I did not get a good sense of what they meant.  The use of the transition rules in this method results in a lot of local maxima that fall short of the global maxima in the probability distribution and context is needed to account for this.

Data fragmentation is a problem for many of these models because each new parameter is being estimated with less data.  An exponential model can avoid this problem by using a model of the form:
It was not clear to me from the equation how this exponential model helps fragmentation, but it resulted in a huge drop in perplexity.

Dimensionality reduction was another method used to significantly reduce perplexity.  Vocabulary has a huge number of dimensions, and many models do not consider words like "Bank" and "Loan" to be any more related than "Bank" and "Brazil".  To manage this problem, a sparse matrix is generated with the first occurence of each vocabulary word in a document.  The matrix can then be reduced by SVD to a much lower dimension to give a much better correlation between words.  This results in another huge reduction in perplexity.  I suspect a similar approach may be useful for signal generation.

Interactive modeling is introduced at the end of the paper as a way to use human knowledge to generate a prior by assisting the model in grouping words.  This is an interesting approach, but it introduces a bias, and I don't think it would be a practical solution for the purposes of generating signals.