Monday, May 9, 2016

Summary of "From Frequency to Meaning: Vector Space Models of Semantics" P. Turney, P. Pantel, 2010

This paper describes using Vector Space Models (VSMs) as a way to understand meaning in human language.  The first example of a VSM discussed is that of a search engine.  Each document can be considered a point in space.  Points close together are semantically similar, while points far apart are semantically different.  A query can be considered a pseudo-document in this space, and the documents can be ranked by their distance to the query.

The defining feature of a VSM is that its elements are derived from event frequencies, such as the number of times a word appears in a given context.  These elements can be compared for similarity using vector mathematics.  This idea is applied to three types of matrices: term-document, word-context, and pair-pattern.

In a term-document matrix, row vectors correspond to terms (words), and column vectors correspond to documents.  The value of an element in this matrix is the frequency of a particular word in a particular document.  This method only considers word frequency, not context.  The result is a very large, sparse matrix, since most documents contain only a small fraction of the vocabulary.
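
As a rough sketch of how such a matrix might be built (the toy corpus and the plain-Python counting here are my own illustration, not from the paper):

    from collections import Counter

    # toy corpus: each document is just a string
    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs"]

    # vocabulary: every distinct term seen anywhere in the corpus
    vocab = sorted({term for doc in docs for term in doc.split()})

    # term-document matrix: rows are terms, columns are documents,
    # element [i][j] is the frequency of term i in document j
    counts = [Counter(doc.split()) for doc in docs]
    matrix = [[counts[j][term] for j in range(len(docs))] for term in vocab]

    for term, row in zip(vocab, matrix):
        print(term, row)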

The word-context matrix looks at the immediate context around a word rather than a whole document.  The fundamental hypothesis of this technique is that words that occur in similar contexts tend to have similar meanings.  This method comes from looking at the rows (words) of the term-document matrix rather than the columns (documents), so it measures attributional similarity between words.
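
A minimal sketch of the word-context idea, counting co-occurrences within a small sliding window (the window size and example sentence are my own choices):

    from collections import defaultdict

    text = "the quick brown fox jumps over the lazy dog".split()
    window = 2  # how many neighbours on each side count as "context"

    # cooc[word][context_word] = number of times context_word appears
    # within the window around an occurrence of word
    cooc = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(text):
        lo, hi = max(0, i - window), min(len(text), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                cooc[word][text[j]] += 1

    print(dict(cooc["the"]))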

In pair-pattern matrices, row vectors correspond to pairs of words, and column vectors correspond to the patterns that the pairs occur in, such as "X cuts Y".  The idea is that patterns that co-occur with similar pairs have similar meanings, so word pairs with similar row vectors tend to have similar semantic relations.  This method is useful for determining if one sentence is a paraphrase of another.  The approach can be extended to triples or n-tuples of words, but those matrices become sparser and sparser because higher-order patterns are rarer.  A denser pair-pattern matrix is often more useful.
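
A very rough sketch of collecting pair-pattern counts; the example sentences and the simplification of treating the words between a pair as the pattern are mine, while the paper mines patterns more carefully:

    from collections import defaultdict

    sentences = ["mason cuts stone with a saw",
                 "carpenter cuts wood with a saw"]

    # rows are (X, Y) word pairs, columns are the patterns joining them,
    # e.g. ("mason", "stone") occurs in the pattern "X cuts Y"
    pair_pattern = defaultdict(lambda: defaultdict(int))
    for s in sentences:
        words = s.split()
        for i in range(len(words)):
            for j in range(i + 2, len(words)):
                pattern = "X " + " ".join(words[i + 1:j]) + " Y"
                pair_pattern[(words[i], words[j])][pattern] += 1

    print(dict(pair_pattern[("mason", "stone")]))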

Before the matrix can be generated, it is important to do some processing on the raw text.  This processing seems fairly specific to linguistic uses, but there are some aspects that may be more widely useful.  The raw text needs to be "tokenized" to decide what constitutes a term.  This includes dealing with punctuation, hyphenation, and contractions.  The text should also be normalized to convert similar words to the same form (Car vs car vs Cars can be normalized to car).  The text can also be annotated to distinguish different words that happen to look the same (for example, marking whether "fly" is being used as a noun or a verb).  Normalization has the effect of reducing precision but increasing the sensitivity (or "recall") of a system.  For a small corpus, aggressive normalization may be necessary, whereas for a large corpus normalization can be skipped to keep precision high.  Annotation is the opposite of normalization, so it decreases recall and increases precision.
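
A toy sketch of tokenization and normalization (lowercasing plus a crude plural-stripping rule; a real system would use a proper tokenizer and stemmer):

    import re

    def normalize(raw_text):
        # tokenize: keep runs of letters, dropping punctuation and digits
        tokens = re.findall(r"[a-zA-Z]+", raw_text)
        terms = []
        for tok in tokens:
            term = tok.lower()          # "Car" and "car" become the same term
            if term.endswith("s") and len(term) > 3:
                term = term[:-1]        # crude stemming: "cars" -> "car"
            terms.append(term)
        return terms

    print(normalize("The Cars, the car's owner -- and other cars."))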

After the text has been processed and the matrix has been generated, weights can be applied.  This is important because common words are often less informative than rarer words.  Some words can be given a weight of zero and thus removed from the matrix entirely.  Words that are relatively common in a document, but relatively rare in the corpus, can be weighted highly.  This is known as tf-idf weighting (term frequency × inverse document frequency).  This weighting can be biased towards infrequent events, but that problem can be accounted for.
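
A minimal sketch of tf-idf under the usual definition, tf × log(N / df); the exact weighting variant used in practice can differ:

    import math
    from collections import Counter

    docs = [doc.split() for doc in
            ["the cat sat on the mat",
             "the dog sat on the log",
             "cats chase mice"]]

    N = len(docs)
    # document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))

    def tf_idf(term, doc):
        tf = doc.count(term)          # raw term frequency in this document
        idf = math.log(N / df[term])  # rare terms get a larger boost
        return tf * idf

    print(tf_idf("the", docs[0]))   # common word -> lower weight
    print(tf_idf("cat", docs[0]))   # rarer word  -> higher weight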

The matrix can also be smoothed to reduce the amount of random noise.  Words with a low weighting can be eliminated with little effect on precision.  This idea seems like it could be important for signal generation.  Singular Value Decomposition (SVD) can be used to decompose the sparse matrix into three smaller, denser matrices.  This makes computation easier by reducing the number of dimensions.  It also has a smoothing effect and can reveal when words show up in similar contexts.  The paper makes the claim that truncated SVD is able to simulate missing text and compensate for lack of data.  I find this hard to believe, but it makes some sense if I think of it in a context of signals and "filling in the gaps" or generating a line of best fit through data points.
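
A small sketch of truncated SVD with NumPy (the matrix and the rank k are arbitrary; the point is just that the product of the truncated factors is a denser, lower-rank approximation of the original):

    import numpy as np

    # a small, sparse term-document style matrix
    X = np.array([[2, 0, 1, 0],
                  [0, 3, 0, 1],
                  [1, 0, 2, 0],
                  [0, 1, 0, 2]], dtype=float)

    # full SVD: X = U * diag(s) * Vt
    U, s, Vt = np.linalg.svd(X, full_matrices=False)

    k = 2  # keep only the two largest singular values
    X_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

    print(np.round(X_k, 2))  # smoothed, lower-rank approximation of X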

Once a matrix is generated, weighted, and smoothed, the similarity of the frequency vectors can be calculated.  The most common method is the cosine of the angle between the vectors, computed as the inner product of the vectors after they have been normalized to unit length.
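
In symbols, cosine(x, y) = (x · y) / (||x|| ||y||).  A small sketch:

    import numpy as np

    def cosine(x, y):
        # inner product of the unit-length versions of x and y
        return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

    a = np.array([2.0, 0.0, 1.0])
    b = np.array([1.0, 0.0, 2.0])
    print(cosine(a, b))  # 0.8: similar direction, high similarity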


The cosine ranges from -1 to 1, where values closer to 1 represent a high degree of similarity between the two vectors.  These calculations are slow, O(nr² nc), where nr is the number of rows and nc is the number of columns in the matrix.  The expense of calculating the similarity matrix can be reduced by only calculating the similarity when both vectors share a nonzero element.

Rather than using truncated SVD to lower the dimension of a matrix, high-dimensional vectors can be randomly projected onto a lower-dimensional subspace without a large impact on the similarity scores, but I don't really understand how this process works.
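
A sketch of the random projection idea: multiply the data by a random matrix with far fewer columns, and pairwise similarities are roughly preserved (this is the Johnson-Lindenstrauss style of argument; the dimensions below are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)

    n_docs, high_dim, low_dim = 100, 10_000, 300
    X = rng.random((n_docs, high_dim))          # high-dimensional vectors

    # random projection matrix, scaled so lengths are roughly preserved
    R = rng.standard_normal((high_dim, low_dim)) / np.sqrt(low_dim)
    X_low = X @ R                               # now only low_dim columns

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # the two similarity scores come out close to each other
    print(cosine(X[0], X[1]), cosine(X_low[0], X_low[1]))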

The three matrix types discussed in this paper are not an exhaustive look at VSMs, but they demonstrate how computers could better understand language.  Most of the meaning in English text comes from word choice, but word order is still important.  VSMs often ignore word order, but even this can be accounted for to some degree.

Link to the paper

