Tuesday, May 3, 2016

Summary of "Two Decades of Statistial Language Modeling: Where Do We Go From Here?" R. Rosenfeld, 2000

I decided to begin my literature summary with the oldest paper, to get more of a chronological overview.

This paper argues for a Bayesian approach to statistical language models.  It gives an overview of techniques used in Statistical Language Modeling (SLM), but it is not yet obvious to me how much these techniques will carry over to machine learning for signal generation.  Many of the techniques discussed seem to have a very narrow scope and may not be useful for my purposes.  Generally, the techniques have a huge number of parameters and require large amounts of training data.

Bigram, trigram, and n-gram models are discussed.  This is a new concept for me, but what it means is that the models look at 2, 3, or n adjacent letters or words at a time.  Character trigrams could be whole words like 'the' or 'and', or fragments like 'ing'; word trigrams would be sequences like "The quick red" and "quick red fox".  Essentially, the computer is looking at a narrow window for comparison.  Raising the value of n is a trade-off between stability and bias.
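To make the n-gram idea concrete for myself, here is a quick Python sketch (my own toy example, not from the paper) that pulls word trigrams out of a sentence:

    def ngrams(tokens, n):
        # Slide a window of length n over the token sequence.
        return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

    words = "the quick red fox jumps over the lazy dog".split()
    print(ngrams(words, 3))
    # [('the', 'quick', 'red'), ('quick', 'red', 'fox'), ('red', 'fox', 'jumps'), ...]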

The statistical language models in the paper all use Bayes' Theorem to find the most likely sentence or classification based on a probability distribution.  Quality of a language modeling technique is measured in terms of cross-entropy and perplexity.  I don't have a good intuitive sense of cross-entropy yet, but it is essentially the average negative log-likelihood per word that the model assigns to test text.
Perplexity = 2^cross-entropy
Perplexity is a measure of the branching factor of the language according to the model.  It reflects both the complexity of the language and the quality of the model; lower perplexity suggests a better model.
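To build some intuition, here is a small Python sketch (my own, with made-up probabilities) showing how cross-entropy and perplexity relate:

    import math

    # Probabilities a hypothetical model assigns to each word of a short test sequence.
    # The numbers are made up purely for illustration.
    word_probs = [0.1, 0.25, 0.05, 0.2]

    # Cross-entropy: average negative log2 probability per word.
    cross_entropy = -sum(math.log2(p) for p in word_probs) / len(word_probs)

    # Perplexity is 2 raised to the cross-entropy: the effective branching factor.
    perplexity = 2 ** cross_entropy

    print(cross_entropy, perplexity)  # lower values would indicate a better model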

Language models discussed in the paper are extremely sensitive to changes in style, topic, or genre between the text they are trained on and the text they are applied to.  This seems obvious, but it is important to keep in mind, because I suspect the same will be true for signal generation.  These models also assume a lot of independence, which, as we know from language, is not a great assumption.  In language, context is very important, but many models only consider nearby words and ignore words in more distant parts of the document.  These false independence assumptions result in overly sharp posterior distributions.  I suspect this would be an even more serious problem for signal generation.

Decision tree models were described as models that partition the history of the words in the text by asking "arbitrary questions about the history".  The types of questions were not detailed in the paper, so I don't really know what kinds of questions they are referring to.  Questions are picked by greedily selecting the ones that have the greatest reduction in entropy.  This type of decision tree model is considered to be optimal in theory, but as of the writing of this paper, results were disappointing.
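My guess is that "greatest reduction in entropy" is the usual information-gain criterion from decision-tree learning.  A rough Python sketch of how a candidate question might be scored, using made-up histories and a hypothetical question (my own illustration, not the paper's algorithm):

    import math
    from collections import Counter

    def entropy(labels):
        # Shannon entropy (in bits) of a list of outcomes.
        if not labels:
            return 0.0
        total = len(labels)
        return -sum((c / total) * math.log2(c / total) for c in Counter(labels).values())

    def information_gain(histories, next_words, question):
        # Entropy reduction from splitting the training examples on a yes/no question.
        yes = [w for h, w in zip(histories, next_words) if question(h)]
        no = [w for h, w in zip(histories, next_words) if not question(h)]
        split = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(next_words)
        return entropy(next_words) - split

    # Made-up data: each history is the two preceding words, the label is the next word.
    histories = [("the", "quick"), ("a", "lazy"), ("the", "red"), ("one", "slow")]
    next_words = ["fox", "dog", "fox", "dog"]
    print(information_gain(histories, next_words, lambda h: h[0] == "the"))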

Context Free Grammar (CFG) is another model that was discussed.  Like most of the models, the computer has no real understanding of the language; it is simply following a set of rules that it builds itself.  This model involves terminals and non-terminals, but the paper does not define them clearly, and I did not get a good sense of what they meant.  The use of the transition rules in this method results in many local maxima that fall short of the global maximum of the probability distribution, and context is needed to account for this.
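From a quick look at other references, my current understanding is that non-terminals are abstract categories (sentence, noun phrase, verb) and terminals are the actual words.  A toy grammar, written out as a Python dictionary for my own notes (my example, not from the paper):

    # Toy context-free grammar.  The keys and uppercase symbols are non-terminals
    # (abstract categories); the lowercase strings are terminals (the actual words).
    grammar = {
        "S":   [["NP", "VP"]],
        "NP":  [["Det", "N"]],
        "VP":  [["V", "NP"]],
        "Det": [["the"]],
        "N":   [["fox"], ["dog"]],
        "V":   [["chased"]],
    }
    # Expanding S -> NP VP -> Det N V NP -> ... can derive "the fox chased the dog".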

Data fragmentation is a problem for many of these models because each new parameter is being estimated with less data.  An exponential model can avoid this problem by using a model of the form:
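If I am reading the paper right, the exponential (maximum entropy) model has roughly the form

P(w | h) = exp( sum_i lambda_i * f_i(h, w) ) / Z(h)

where h is the word history, the f_i are arbitrary feature functions of the history and the next word, the lambda_i are the trained parameters, and Z(h) normalizes the distribution over the vocabulary (my paraphrase; the paper's notation may differ).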
It was not clear to me from the equation how this exponential model helps fragmentation, but it resulted in a huge drop in perplexity.

Dimensionality reduction was another method used to significantly reduce perplexity.  Vocabulary has a huge number of dimensions, and many models do not consider words like "Bank" and "Loan" to be any more related than "Bank" and "Brazil".  To manage this problem, a sparse matrix is generated that records the occurrences of each vocabulary word in each document.  The matrix can then be reduced by SVD to a much lower dimension, which captures the correlations between related words.  This results in another huge reduction in perplexity.  I suspect a similar approach may be useful for signal generation.
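As a note to myself on what this might look like in practice, here is a minimal numpy sketch (my own made-up word-document counts, not data from the paper) of reducing a word-document matrix with SVD:

    import numpy as np

    # Made-up word-document count matrix: rows are vocabulary words,
    # columns are documents, entries are occurrence counts.
    counts = np.array([
        [3, 0, 1, 0],   # "bank"
        [2, 0, 2, 0],   # "loan"
        [0, 4, 0, 1],   # "brazil"
        [0, 3, 0, 2],   # "coffee"
    ], dtype=float)

    # Truncated SVD: keep only the top-k singular values/vectors.
    U, s, Vt = np.linalg.svd(counts, full_matrices=False)
    k = 2
    word_vectors = U[:, :k] * s[:k]   # each row is a word in the reduced space

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    # Words with similar usage patterns end up close together in the reduced space,
    # so "bank" and "loan" score higher than "bank" and "brazil".
    print(cosine(word_vectors[0], word_vectors[1]))
    print(cosine(word_vectors[0], word_vectors[2]))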

Interactive modeling is introduced at the end of the paper as a way to use human knowledge to generate a prior by assisting the model in grouping words.  This is an interesting approach, but it introduces a bias, and I don't think it would be a practical solution for the purposes of generating signals.
