Tuesday, July 19, 2016

Summary of "Visualizing and Understanding Recurrent Networks"

Recurrent Neural Networks (RNNs) with Long Short-Term Memory (LSTM) perform well in practice, but the source of their performance is not well understood.  The purpose of this paper is to analyze character-level LSTM language models, in particular how they capture long-range structural dependencies.

For testing, two texts were used.  The first was Leo Tolstoy's War and Peace, which has minimal markup characters.  The other was the source code of the Linux kernel, which uses markup characters extensively.  Some general findings are that LSTM and Gated Recurrent Unit (GRU) models work much better than a simple RNN.  The models work better with 2 hidden layers rather than 1, and for the GRU a third layer sometimes gives additional benefit.  I find it surprising that more than 2 hidden layers seems to slightly increase the LSTM cross-entropy.

This figure from the paper shows the activity of cells in the best LSTMs.  Text color corresponds to tanh(c), where c is the cell state: -1 is red and +1 is blue.  The first cell is sensitive to the position in a line; in a sense, it is counting how many characters the line contains.  Other cells very clearly activate inside quotes, if statements, or comments, or when predicting newlines.  This shows how some cells become highly specialized into niches.

This figure shows the fraction of time each gate spends left-saturated (activation less than 0.1) or right-saturated (activation more than 0.9).  What I find most interesting about this figure is how different the gates in the first layer are from those in the second and third layers.  According to the authors, the number of right-saturated forget gates is especially interesting because it indicates cells that remember their values for long periods of time and thus act as nearly perfect integrators.

This figure shows how LSTMs are able to consistently outperform 20-gram models in matching a closing bracket with the corresponding opening bracket.  This provides strong evidence that they are able to keep track of long-range interactions.  The LSTM shows improvement up to a distance of about 60 characters, then it has difficulty keeping track of the long-range dependencies, though it still performs well.

This last figure shows the breakdown of the types of errors the LSTM made.  This was done by, as the paper puts it, "peeling the onion": errors are peeled away starting at the top of the pie chart and moving counterclockwise, with each category defined by a particular model or heuristic that would have removed those errors.  An error is defined to occur when the probability assigned to the true next character is below 0.5.
They start with an n-gram model, which can be better at short-term dependencies.  A character error is removed if the n-gram model assigns more than 0.5 probability to the true character.

The n-memory model eliminates cases where the LSTM repeats an error on a substring it has already seen within the past n characters.  The LSTM consistently makes these errors, despite the theoretical possibility that it could have learned the correct answer from the first occurrence of the substring.

Rare words present a problem because the LSTM might not have had enough opportunities to train on them to give a correct answer.  This can be corrected by a larger training set or better pretraining.  These errors are removed, grouped by how frequently the word occurred in the training set.

Many of the errors occurred after a space or a quote, after a newline (which the LSTM needs to learn acts the same as a space), or after punctuation.  The remaining errors did not have an apparent structure and were removed by simply adding 0.1, 0.2, 0.3, 0.4, or 0.5 to the probability of the correct character until all errors were accounted for.

Interestingly, increasing the number of layers and nodes of the LSTM dramatically reduced the n-gram errors, but the other types of errors seemed to be unaffected.  It is expected that scaling the model up even further will eliminate the n-gram errors, but fixing the other error types may require a different architecture.
----------
Link to the paper

Tuesday, July 12, 2016

RNN Weight Visualization

I wanted to see the distribution of the weights inside a neural net to see what was going on.  I ran the code twice, trained with a different sample text each time. The first sample text was the simple sentence "thebrownfoxjumpsoverthefence." repeated many times.  The second text was the short story The Last Question by Isaac Asimov. These text samples and the code are included in a .zip file at the end of this post.

The "fox" sample text was trained with 3 hidden layers with 128 nodes each.  The code converged relatively quickly, see the output after 0 epochs:




and after 20 epochs:



The network is able to converge upon the correct answer with few errors.

I also plotted a histogram of the weights below.  At first, the weights are grouped together, presumably near their initialized values.  After several epochs, the weights become more normally distributed about 0.
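
As a minimal sketch of how a histogram like this can be made from a Keras model (the helper below is illustrative; the exact script I used is in the .zip linked at the end of this post):

import numpy as np
import matplotlib.pyplot as plt

def plot_weight_histogram(model, bins=100):
    # model.get_weights() returns a list of numpy arrays,
    # one per weight matrix / bias vector in the network
    all_weights = np.concatenate([w.flatten() for w in model.get_weights()])
    plt.hist(all_weights, bins=bins)
    plt.xlabel('weight value')
    plt.ylabel('count')
    plt.show()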

The more complicated sample text was run with 3 hidden layers of 512 nodes each.  See the output at epoch 0:







and epoch 20:






The network is clearly learning the formatting of the document (frequent line breaks), and the text is looking more like English, even though it doesn't quite become intelligible after only 20 epochs.  Despite not reaching convergence, the graphs of the weights look quite similar to how they did for the fox text.

The histogram is taller because there are far more weights in the larger network, but the distribution of the weights is still normal, despite not reaching convergence yet.  The reason for this must be that the weights tend towards a normal distribution, but those weights still need to be shuffled into the correct places in the network to get the correct answer.

-------------
The code and training files

Sunday, July 10, 2016

Summary of "A Critical Review of Recurrent Neural Networks for Sequence Learning"

This paper gives an overview of recurrent neural networks (RNNs), bidirectional recurrent neural networks (BRNNs) and long short-term memory (LSTM).

The paper highlights some key features of RNNs that make them important.  They retain a state that can represent information from an arbitrarily long context window.  They are able to model input and output consisting of elements that are not independent.  What this means is that, unlike conventional neural networks, the state of the network is not lost after each data point is processed.  These features are possible because RNNs include edges spanning adjacent time steps, which introduces the notion of time to the model.  This allows the hidden nodes of the network to remember prior states.

Training RNNs introduces some new challenges.  Vanishing and exploding gradients occur when backpropagating errors across many time steps.  Essentially, as time passes, the state of the model at time 0 contributes less and less to the output compared to the current input, and this happens exponentially fast.  The exploding gradient problem can be addressed by truncated backpropagation through time (TBPTT).  TBPTT works by limiting the number of time steps over which error can propagate, but this gives up the RNN's ability to learn long-range dependencies.

Figure 1: Demonstration of the vanishing gradient problem.  When the weight along the recurrent edge is less than 1, the state of the hidden layer is diluted by the input as time passes.
The LSTM model is another way to eliminate the vanishing gradient problem.  Here, each node in the hidden layer of the RNN is replaced with a memory cell (Figure 2).  Each memory cell contains a node with a self-connected recurrent edge of weight 1 to prevent vanishing or exploding gradients.  The forget gate 'f' controls this recurrent edge, allowing the network to eventually forget information; the value of the forget gate is learned and is affected by the input.  The internal state 's' is updated using the product of its value at the previous time step and the value of the forget gate.
Figure 2:  LSTM memory cell
The input node 'g' and the input gate 'i' multiply their outputs to form another input to the internal state.  The input gate is a distinctive feature of the LSTM approach, and its purpose is to control the flow from the input node.  A value of zero effectively cuts off the input, keeping the internal state the same; a value of one lets the input node's output flow in fully, and intermediate values are also possible.

The output gate 'o' multiplies with the output of the internal state to control the output of the hidden layer.  The internal state is commonly run through a tanh activation first, although a ReLU activation is apparently easier to train.

Each of these gates can learn when to let data in or out of the memory cell, thus improving the ability of the network to learn long-range dependencies compared to simple RNNs.
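
To make the gate arithmetic concrete, here is a minimal numpy sketch of a single memory-cell step using the f/i/g/o notation above (the weight names W, U, and b are my own, and details such as peephole connections are left out):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_cell_step(x, h_prev, s_prev, W, U, b):
    # W, U, b are dictionaries holding the weights and biases for each gate
    f = sigmoid(W['f'].dot(x) + U['f'].dot(h_prev) + b['f'])   # forget gate
    i = sigmoid(W['i'].dot(x) + U['i'].dot(h_prev) + b['i'])   # input gate
    g = np.tanh(W['g'].dot(x) + U['g'].dot(h_prev) + b['g'])   # input node
    s = f * s_prev + i * g                                     # internal state
    o = sigmoid(W['o'].dot(x) + U['o'].dot(h_prev) + b['o'])   # output gate
    h = o * np.tanh(s)                                         # hidden-layer output
    return h, s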

The paper also has a relatively small section on bidirectional recurrent neural networks (BRNNs)  (Figure 3).  BRNNs allow the network to receive information from the 'future' and the past.  For example, if reading through a document, the network has context from words before and after the current word being read.  This type of network cannot be run continuously, because it needs an endpoint for the future and the past.  Despite this limitation, it is a powerful technique for some applications such as part-of-speech tagging.
Figure 3: Bidirectional Recurrent Neural Net (BRNN)

--------------------
Link to the paper

Monday, June 27, 2016

Stateful Parameter in Keras

According to the Keras documentation, a stateful recurrent model is one for which the internal states (memories) obtained after processing a batch of samples are reused as initial states for the samples of the next batch.  This allows processing longer sequences while keeping computational complexity manageable.  To me, this is not completely clear, so I wanted to test this parameter to see how it affects a neural net.

I used lstm_text_generation.py as a baseline, and I modified it to test stateful=True and stateful=False.  In order to make the code run with stateful=True, I needed to make some changes.


First, I needed to specify batch_input_shape (which includes the batch size) in the first layer instead of simply input_shape, and pass the same batch size when fitting, like this:
model.fit(X, y, batch_size=batch_size, nb_epoch=1, callbacks=[history])
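
For context, the first layer then has to be declared roughly like this (a sketch rather than my exact script; maxlen and n_chars stand in for the window length and the size of the character vocabulary):

from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

batch_size = 518   # must evenly divide the number of training samples
maxlen = 40        # length of each character window (illustrative value)
n_chars = 60       # number of distinct characters (illustrative value)

model = Sequential()
# stateful layers require the full batch_input_shape instead of input_shape
model.add(LSTM(512, batch_input_shape=(batch_size, maxlen, n_chars),
               stateful=True, return_sequences=True))
model.add(LSTM(512, stateful=True))
model.add(Dense(n_chars))
model.add(Activation('softmax'))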
Another requirement of stateful=True is that the number of samples passed as input needs to be evenly divisible by the batch size.  This was problematic since it limited which batch sizes I could use, but I tried a batch size of 518, which is a factor of the number of samples in each iteration, 15022.

This solution didn't quite work, because I got another error:
ValueError: non-broadcastable output operand with shape (1,512) doesn't match the broadcast shape (518,512) 
512 refers to the number of nodes in each hidden layer of the neural net.  The error seems to be a matrix multiplication error.  When the number of nodes in each layer was made 518 (from the original value of 512) to match the batch size, I received a similar error.
ValueError: non-broadcastable output operand with shape (1,518) doesn't match the broadcast shape (518,518) 
I could not figure out a proper solution to this problem, so I settled on a batch size of 1 so the matrix multiplication would work.  This dramatically increased the computation time, so in order to run the test in a matter of hours instead of days, I reduced the number of nodes in the neural net from 512 to 256.  I also ran only 10 iterations rather than the original 60.  My results are below:

Notice that in both graphs the loss is increasing.  For stateful=True, the loss increases monotonically, whereas with stateful=False there are some downward changes, but the overall trend is still increasing.  This is the opposite of what I would expect, because the loss should decrease as the network learns.  It's possible it needed more iterations, or that the smaller network kept it from learning, but I'm not yet sure of the reason for the increase.

Monday, June 20, 2016

Summary of "The Unreasonable Effectiveness of Recurrent Neural Networks"

Link to the blog post 

Recurrent Neural Networks (RNNs) are a type of neural net that operates over sequences rather than fixed-size inputs.  In Andrej's blog, he emphasizes that what is special about RNNs is that the input (and the output) can be a sequence.  Each input is combined with a state vector to produce a new state vector.  Even if the inputs are not a sequence, the data can still be processed sequentially, which is more powerful than a conventional neural net.

Here is how Andrej describes the forward pass of an RNN:
import numpy as np

class RNN:
  # ... (W_hh, W_xh, W_hy, and h are set up elsewhere)
  def step(self, x):
    # update the hidden state
    self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
    # compute the output vector
    y = np.dot(self.W_hy, self.h)
    return y

rnn = RNN()
y = rnn.step(x) # x is an input vector, y is the RNN's output vector

The parameters of the RNN are W_hh, W_xh, and W_hy.  self.h is initialized with the zero vector.  According to Andrej, RNNs work monotonically better with more hidden layers and more nodes.

In his blog, he goes through several very cool examples of RNNs being trained on datasets, and their outputs.  Training on text results in mostly nonsensical output with a format and a "sound" that seem to match the input text.  When given Shakespeare as input, the network produces samples that keep the format and verbiage you would expect from Shakespeare, but the story is nonsense and it loses its iambic pentameter.  Characters have a mix of dialogue and long monologues, which is again consistent with the style of the source material, but it lacks content.  Given all the machine is doing, this is quite impressive, if not especially useful.

Like many other generative models, a "temperature" parameter can be used to control the trade-off between fidelity to the training text and diversity of the output.  A very low temperature tends to get stuck at a peak of the likelihood and produce plain or repetitive output.  A high temperature is more likely to have problems such as spelling mistakes.
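
As a sketch of how the temperature is typically applied when sampling the next character (probs is assumed to be the softmax output of the network):

import numpy as np

def sample_with_temperature(probs, temperature=1.0):
    # low temperature sharpens the distribution (safe, repetitive text),
    # high temperature flattens it (diverse text, more mistakes)
    logits = np.log(probs) / temperature
    scaled = np.exp(logits)
    scaled = scaled / np.sum(scaled)
    return np.random.choice(len(probs), p=scaled)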

When given Wikipedia pages as input, the neural net produced output that again had proper formatting, used the structured markup correctly to create headings and tables, cited sources (although incorrect ones), and completely made up numbers for timestamps and id numbers.  It seems to have an understanding of how the pages should look, if not of the content.
The RNN was also able to work with LaTeX and source code, which didn't quite compile but looked the way code should look.  The network was good at getting the syntax right, even if the variable names were not meaningful.  It had difficulty with "long-term interactions", such as whether or not a return statement should be present in a function depending on the presence or absence of a void declaration.  Its effective context may have been too short to relate the presence of void to the absence of the return statement.

What I found most interesting was when he looked at individual neurons.  Initially, the neurons had random parameters, but after some training, they specialized into niches.  An individual neuron might be given an open parenthesis as an input.  If it gives a high score to, say, neuron 3, and a low score to the others, then neuron 3 is basically being told to specialize for the text inside parentheses.  It will learn that a closing parenthesis should come at the end, and that the format inside may be different from normal text (such as a list of parameters in some code).  Effectively, evolution of niches is an emergent property of the system!  Developing these niches presumably requires a large neural network with several hidden layers and many neurons.

In summary, RNNs seem to do a very good job of mimicking the formatting, style, and overall feel of an input.  In the examples given, however, they are not able to produce meaningful output, and they struggle with long-term interactions.  All this raises the question: can these difficulties be overcome by bigger networks with more hidden layers and larger training sets, or is a totally different approach required to get more meaningful results?  Either way, RNNs seem to be unreasonably effective in getting so much right.

Monday, June 6, 2016

Gaussian Mixture Models

In trying to learn about Gaussian mixture models, I've had some difficulty in simply reading about them, but I've found some videos that give a good explanation.  I will summarize my understanding below, and link to the videos at the end.

Essentially, a Gaussian mixture model is a way to combine several Gaussian PDFs (probability density functions) of different shapes and sizes into a single distribution.  This is done by taking a linear combination of the individual Gaussian PDFs.

Expectation-Maximization (EM) is a procedure that allows us to learn the parameters of the Gaussian mixture model.  These parameters are refined over several iterations.  The Expectation step (E-step) keeps the mean μ_c, covariance Σ_c, and size π_c of each Gaussian c fixed.  The assignment probability r_ic, that each data point belongs to cluster c, is then calculated.
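In this notation, the E-step responsibility for data point x_i and cluster c takes the standard form

r_{ic} = \pi_c \, N(x_i \mid \mu_c, \Sigma_c) \;/\; \sum_{c'} \pi_{c'} \, N(x_i \mid \mu_{c'}, \Sigma_{c'})

where N(x | μ, Σ) denotes the Gaussian density.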


In this example, the data point x is more likely to belong to distribution 2 (66% chance) than to distribution 1 (33% chance)
The Maximization step (M-step) keeps the assignment probabilities the same while updating the parameters μ_c, Σ_c, and π_c.  The assignment probabilities r_ic are used to calculate the new parameters.  Here, m represents the total number of data points, whereas m_c represents the effective number of points assigned to a particular cluster.
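The standard M-step updates, using r_ic, m, and m_c as defined above, are

m_c = \sum_i r_{ic}, \quad \pi_c = m_c / m, \quad \mu_c = \frac{1}{m_c} \sum_i r_{ic} \, x_i, \quad \Sigma_c = \frac{1}{m_c} \sum_i r_{ic} (x_i - \mu_c)(x_i - \mu_c)^T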


This way, points with a large r_ic have more of an effect on the parameters than those with a small r_ic.  Each EM iteration increases the log-likelihood of the model.  This can result in the model becoming stuck in local optima, so initialization is important.

Using Expectation-Maximization, the model can learn the parameters over time and refine a distribution given datapoints from overlapping distributions.

-----------------------------------
link to videos:
https://www.youtube.com/watch?v=Rkl30Fr2S38
https://www.youtube.com/watch?v=qMTuMa86NzU

Friday, June 3, 2016

Binary Classifier Error Testing

Using the Keras package, I wrote some code to make a binary classifier neural network.  Given the inputs of "age" and "LTI" (the Loan To Income ratio), the network should output a '1' if the individual is predicted to default on the loan, and a '0' if they are not.
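
The model itself is a small fully-connected network; a minimal sketch of the kind of setup used (layer sizes and activations are exactly the things varied in the tests below, and my exact script is linked at the end of the post):

from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(64, input_dim=2, activation='tanh'))   # inputs: age and LTI
model.add(Dense(64, activation='tanh'))
model.add(Dense(1, activation='sigmoid'))              # output: probability of default
model.compile(loss='binary_crossentropy', optimizer='adam')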

I tested using several different activation functions, and here is what I've found:

Wrong Approach

First, I defined the number of errors to be:

nb_errors = abs(sum(y1 - Y_test)) ,

where y1 is the output of the neural network, and Y_test is the actual results from the dataset.

This is the wrong approach because it means that false positives make my data look better.  If the network outputs a '1' where there should be a '0', that false positive effectively cancels out a false negative, where the network outputs a '0' but there should be a '1'.  This resulted in some interesting behavior.
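
A tiny made-up example shows the cancellation:

import numpy as np

y1     = np.array([1, 0, 1, 0])   # network output (one false positive, one false negative)
Y_test = np.array([0, 1, 1, 0])   # actual labels

print(abs(sum(y1 - Y_test)))      # 0 -> the false positive cancels the false negative
print(sum(abs(y1 - Y_test)))      # 2 -> the actual number of errors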



With only 1 hidden layer, both networks performed much worse than with 2, even when the number of nodes was the same.  The number of errors saturates when there are few nodes with the 'relu' activation function, because the network outputs all zeros.  In these instances, it hasn't really learned anything yet, but that answer still gets ~85% of the outputs right.

More interesting is the performance of 'tanh'.  It actually does better with fewer nodes so long as there are two hidden layers.  This is because 'tanh' seems to give more false positives than 'relu' for this dataset.

Right Approach

The number of errors should be calculated by:

nb_errors = sum(abs(y1 - Y_test)) ,

This way, the false positives and false negatives are treated the same, so you can get the total number of errors by just adding them up.  Here are the results from this approach:



The 'relu' and 'tanh' activation functions seemed to perform similarly for high numbers of nodes, but 'tanh' performed better overall, especially when there were fewer nodes.  More hidden layers add computation time, but make a big difference in accuracy.  Of the 1200 points tested, the network gets ~50 wrong answers with two layers of 64 nodes, which is not bad.

I also tested the 'sigmoid' activation function, and it performed poorly.  It was not able to break away from the all-zeros answer until the very end of the last simulation, and even then it performed poorly.  Thus, 'sigmoid' does not appear to be a good activation function for the hidden layers.

Conclusion

'tanh' performs better than 'relu' for this binary classification problem, but it also has a higher frequency of false positives.  This is fine for the lending industry, which may be hesitant to give out risky loans, but if false positives are worse than false negatives, then the 'relu' activation function may be the better choice, so long as enough nodes are used to get a good answer.
---------------------------
download the data set: creditset.csv
download my: python code

Tuesday, May 17, 2016

Summary of "A Primer on Neural Network Models for Natural Language Processing" Goldberg, 2015

This paper covers the use of neural networks for Natural Language Processing (NLP).  It is more detailed than the previous papers I've summarized.  Its explanations are a bit jumbled, in that it introduces topics without explaining them, only to elaborate on them several sections (and many pages) later.  This made it a more difficult read, but I will try to give an overview.

This paper frequently uses the term "feature" to refer to an input such as a word, suffix, or part-of-speech tag.  In general, the paper discusses these features being input as a vector into a neural network, which returns another vector as output, possibly of a different dimension.  Features are represented as dense vectors.  One important reason for using dense vectors is to ease computation by keeping the number of dimensions low.  The main benefit of the dense representation is "generalization power": learned vectors for two different but related features ('cat' and 'dog') end up being similar, and the model can take this into account.  Figure 1b below shows how features can be represented as dense vectors, but the paper does not seem to explain how this works.  These dense, trainable vectors are a fundamental part of how neural networks work.

The number of dimensions of each vector is important, because more dimensions mean better accuracy at the cost of more memory and more processing time.  Still, the paper states that there are no best practices for the number of dimensions of different features.  It is therefore recommended to experiment with different sizes and pick a dimension that strikes a good balance.

Features that contain the same words but in different orders can share the same vector, or be made into different vectors.  A decision needs to be made about how important this ordering context is when choosing whether to share vectors or not.

A detailed example is the feed-forward neural network.  Figure 2 below is an example of how the feed-forward neural network is structured.  Each arrow on the network carries a weight to reflect its importance.  The sigmoid shape inside the neurons in the hidden layer represents a non-linear function.  This network can be described as mathematical operations with a matrix W of weights and a vector b of bias terms.  These parameters change as the network learns and attempts to minimize the loss function.
The paper gives several examples of non-linear functions that transform each value x into the range [0,1] or [-1,1].  These non-linear transformations seem to be fairly arbitrary, and the paper states that there is no real theory for which non-linearity to apply under which conditions.
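
Putting the pieces together, a network with one hidden layer can be written roughly as (following the paper's style of notation)

NN(x) = g(x W^1 + b^1) W^2 + b^2

where x is the input vector, W^1 and W^2 are weight matrices, b^1 and b^2 are bias vectors, and g is the non-linear function applied element-wise.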

In learning, the network is trying to minimize the loss function.  The loss function can be defined several ways, but ultimately it is a comparison of the output and the intended output.  Loss is small when the output is similar to the intended output.  

Embedding vectors are initialized to random values and tuned by training.  "Supervised" pre-training is used to initialize rare features, such as features for individual words, but I wonder if this biases the end result.  Unsupervised pre-training can be used when large amounts of text are available.  This is useful for building vectors for a large vocabulary of words.

Training methods boil down to repeating the following steps: computing an error estimate over the dataset, computing the gradient of the error with respect to the parameters, and moving the parameters against the gradient so as to reduce the error.  Different models have different ways of calculating the error and of defining how the parameters move with respect to the gradient.

The basic algorithm for training is stochastic gradient descent (SGD).  In this algorithm, the error is calculated from a single training example and is thus only a rough estimate.  The noise in this estimate can result in inaccurate gradients.  The noise can be reduced by calculating the error and gradients over a sample of m examples, referred to as a minibatch.  This improves the accuracy of the gradients and allows for parallel computation.
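
The minibatch procedure boils down to a loop like the following sketch (grad_fn is a hypothetical function returning the loss gradient averaged over a minibatch, and params is a dictionary of numpy arrays):

import numpy as np

def train_sgd(params, grad_fn, data, learning_rate=0.01, batch_size=32, epochs=10):
    for epoch in range(epochs):
        np.random.shuffle(data)                       # visit examples in a new order each epoch
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]    # a minibatch of m examples
            grads = grad_fn(params, batch)            # gradient estimate from the minibatch
            for name in params:
                params[name] -= learning_rate * grads[name]   # step against the gradient
    return params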

The loss function can get stuck in local minima or saddle points, so starting from different initial points can give different results.  It is thus important to run several training runs with different initializations and pick the best one.  I suspect this is another instance where the result could be biased.

In deep networks, gradients can often vanish or become exceedingly large as they move through the network.  This can be mitigated by making the network shallower, by stepwise training (training the layers independently), or by designing specialized architectures that assist the gradient flow.

Some transformations can become saturated, resulting in output values for a layer near 0 or 1.  This results in very small gradients, which should be avoided because this slows learning.  This can be controlled by scaling the range of input values or changing the learning rate.  Normalization can be used, but this tends to be expensive in terms of computation.

Learning rates that are too large can prevent the network from converging, whereas rates that are too small can result in the outputs taking a long time to converge.  The paper proposes experimenting with rates in the range [0, 1] (e.g., 0.001, 0.01, 0.1, 1).  The learning rate should be decreased if the network seems to get stuck in a region, and it can be decreased dynamically according to a schedule.

Overfitting can be a problem in neural network models as the models have many parameters.  This can be mitigated by regularization, such as placing a penalty on parameters with large values by adding a multiple of the L2 norm of the parameters to the function being minimized.  

Model Cascading is a technique where large networks are built from smaller networks.  Hidden layers of smaller networks can be used as input to a larger network.  This results in a very deep network, which can cause the vanishing gradient problem.  This can be solved by training the networks separately before combining them for further training.

It is often important to understand the context around a word and not just the word itself.  For this reason, context windows can be used.  The size of this window has a strong effect on vector similarities.  For instance, larger windows tend to produce topical similarities ("dog","bark","leash").  Smaller windows tend to produce more functional similarities ("Poodle","Pitbull","Rottweiler").  Positional contexts can also be considered so that words far from the focus word and those near are factored in differently (context word of "the" becomes "the:+2" to indicate that it is two positions to the right of the focus word).

Convolutional Neural Networks (CNNs) are also important in modeling context.  A CNN identifies local predictors in a large structure and combines them to produce a vector representation of the structure, keeping track of the local aspects that are most informative.  This works by taking a window of words and transforming it into a vector that captures its most important properties.  Vectors from multiple windows are combined into a single vector by taking the average or the max value of each dimension.  This way, the most important features from the windows can be focused on regardless of their location.  Each dimension of the vector should capture a different kind of information.  Additional context information may be preserved by k-max pooling, where the 'k' best elements in each dimension are retained instead of only the best one.
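
A small numpy sketch of the pooling step (the window vectors themselves would come from the convolution; here they are just random placeholders):

import numpy as np

# one row per window of words, d columns (values are placeholders)
window_vectors = np.random.randn(7, 50)

# max pooling: keep the strongest value in each dimension, regardless of window
pooled = window_vectors.max(axis=0)                # shape (50,)

# k-max pooling: keep the k best values per dimension instead of just one
k = 2
k_max = np.sort(window_vectors, axis=0)[-k:, :]    # shape (2, 50)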

Recurrent Neural Networks (RNNs) allow arbitrarily-sized structured inputs to be represented as a fixed-size vector, while preserving the structured properties.  The network is built from a recursively defined function that takes as input a state vector s_i and an input vector x_{i+1}, and produces a new state vector s_{i+1} and an output vector y_{i+1}.  See the figure below for a visual representation of this idea.
The RNN can be thought of as a very deep neural network, in which the same parameters are shared across many parts of the computation.  A backpropagation algorithm can be used to compute the gradients with respect to the loss of the unrolled graph.  This is known as backpropagation through time (BPTT).  

Recursive Neural Networks (RecNNs) are a generalization of the RNN from sequences to binary trees.  The paper claims that this can be useful in language processing because it is often natural to work with tree structures, but I did not get a good sense of why this is important.

Monday, May 9, 2016

Summary of "From Frequency to Meaning: Vector Space Models of Semantics" P. Turney, P. Pantel, 2010

This paper describes using Vector Space Models (VSMs) as a way to understand meaning in human language.  The first example of a VSM discussed is that of a search engine.  Each document can be considered a point in space.  Points close together are similar, while points far apart are semantically different.  A query can be considered a pseudo-document in this space, and the documents can be sorted in order of their distance from it.

The defining feature of a VSM is that its elements come from event frequencies, such as the number of times a word appears in a context.  These elements can be compared for similarity using vector mathematics.  This idea is applied to three types of matrices: term-document, word-context, and pair-pattern.

In a term-document matrix, row vectors correspond to terms (words), and column vectors correspond to documents.  The value of an element in this matrix is the frequency of a particular word in a particular document.  This method only considers word frequency, not context.  This results in a very large, sparse matrix.

The word-context matrix looks at the immediate context around a word rather than a whole document.  The fundamental hypothesis of this technique is that words that occur in similar contexts tend to have similar meanings.  This method is inspired by looking at the rows of the term-document matrix rather than the columns; thus it finds attributional similarity between words.

In pair-pattern matrices, row vectors correspond to pairs of words, and column vectors correspond to the patterns the pairs occur in, such as "X cuts Y".  The idea is that patterns that co-occur with similar pairs have similar meaning, so word pairs with similar row vectors tend to have similar semantic relations.  This method is useful for determining if one sentence is a paraphrase of another.  This approach can be extended to triple-patterns, or n-tuples of words, but such matrices become sparser and sparser as higher-n patterns become rarer.  A denser pair-pattern matrix is often more useful.

Before the matrix can be generated, it is important to do some processing on the raw text.  This processing seems fairly specific to linguistic uses, but there are some aspects that may be more widely useful.  The raw text needs to be "tokenized" to decide what constitutes a term.  This includes dealing with punctuation, hyphenation, and contractions.  The text should also be normalized to convert similar words to the same form (Car vs car vs Cars can be normalized to car).  The terms can also be annotated to distinguish different words that happen to look the same.  Normalization has the effect of reducing precision but increasing the sensitivity (or "recall") of a system.  For a small corpus, more aggressive normalization may be necessary, whereas for a large corpus normalization may be skipped to keep precision high.  Annotation is the opposite of normalization, so it decreases recall and increases precision.

After the text has been processed and the matrix has been generated, weights can be applied.  This is important because common words are often less informative than rarer words.  Some words can be given a weight of zero and thus removed from the matrix entirely.  Words that are relatively common in a document but relatively rare in the corpus can be weighted highly.  This is known as tf-idf weighting (term frequency × inverse document frequency).  This can be biased towards infrequent events, but that problem can be accounted for.
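
A common form of this weighting (the paper surveys several variants) is

w_{ij} = \mathrm{tf}_{ij} \times \log(N / \mathrm{df}_i)

where tf_ij is the frequency of term i in document j, df_i is the number of documents containing term i, and N is the total number of documents.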

The matrix can also be smoothed to reduce the amount of random noise.  Words with a low weighting can be eliminated with little effect on precision.  This idea seems like it could be important for signal generation.  Singular Value Decomposition (SVD) can be used to decompose the sparse matrix into three smaller, denser matrices.  This makes computation easier by reducing the number of dimensions.  It also has a smoothing effect and can reveal when words show up in similar contexts.  The paper makes the claim that truncated SVD is able to simulate missing text and compensate for lack of data.  I find this hard to believe, but it makes some sense if I think of it in a context of signals and "filling in the gaps" or generating a line of best fit through data points.

Once a matrix is generated, weighted, and smoothed the similarity of the frequency vectors can be calculated.  The most common method is taking the cosine of the angle between the vectors by taking the inner product of the vectors after they are normalized to unit length.
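
For two frequency vectors x and y, this is

\cos(\theta) = (x \cdot y) \;/\; (\lVert x \rVert \, \lVert y \rVert)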


The cosine ranges from -1 to 1, where values closer to 1 represent a high degree of similarity between the two vectors.  These calculations are slow, O(n_r^2 · n_c), where n_r is the number of rows and n_c is the number of columns in the matrix.  The expense of calculating the similarity matrix can be reduced by only calculating the similarity when both vectors share a nonzero element.

Rather than using truncated SVD to lower the dimension of a matrix, high-dimension vectors can be randomly projected onto a lower-dimensional subspace without a large impact on the similarity scores, but I don't really understand how this process works.

The three matrix types discussed in this paper are not an exhaustive look at VSMs, but they demonstrate how computers could better understand language.  Most of the meaning in English text comes from word choice, but word order is still important and is something that VSMs often ignore, though even this can be accounted for to some degree.

Link to the paper


Tuesday, May 3, 2016

Summary of "Two Decades of Statistial Language Modeling: Where Do We Go From Here?" R. Rosenfeld, 2000

I decided to start my literature summary by starting with the oldest paper, to get more of a chronological overview.

This paper argues for a Bayesian approach to statistical language models.  It gives an overview of techniques used in Statistical Language Modeling (SLM), but it is not yet obvious to me how much these techniques will carry over to machine learning for signal generation.  Many of the techniques discussed seem to have a very narrow scope and may not be useful for my purposes.  Generally, the techniques have a huge number of parameters and require large amounts of training data.

Bigram, trigram, and n-gram models are discussed.  This is a new concept for me, but what it means is that the models look at 2, 3, or 'n' adjacent letters or words at a time.  Trigrams for letters could be words like 'the' or 'and' or fragments like 'ing'.  Examples of trigrams over words are "The quick red" and "quick red fox".  Essentially, the computer is looking at a narrow window for comparison.  Raising the value of n is a trade-off between stability and bias.
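
For a word-level trigram model, this amounts to the approximation

P(w_i \mid w_1, \ldots, w_{i-1}) \approx P(w_i \mid w_{i-2}, w_{i-1})

so the probability of a sentence is the product of these conditional probabilities.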

The statistical language models in the paper all use Bayes' theorem to find the most likely sentence or classification based on a probability distribution.  The quality of a language modeling technique is measured in terms of cross-entropy and perplexity.  I don't have a good intuitive sense of cross-entropy, but it is proportional to the log-likelihood of the model distribution.
Perplexity = 2^(cross-entropy)
Perplexity is a measure of the branching factor of the language according to the model.  It reflects both the complexity of the language and the quality of the model.  Lower perplexity suggests a better model.
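
Concretely, for a test sequence of n words, the cross-entropy of a model P is

H = -\frac{1}{n} \sum_{i=1}^{n} \log_2 P(w_i \mid w_1, \ldots, w_{i-1}), \qquad \mathrm{Perplexity} = 2^H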

Language models discussed in the paper are extremely sensitive to changes in the style, topic, or genre of the text they are trained on.  This seems obvious, but it is important to keep in mind, because I suspect this will be true for signal generation as well.  These models assume large amounts of independence, which as we know from language is not a great assumption.  In language, context is very important, but many models may only be considering nearby words and not words in more distant parts of the document. These false independence assumptions result in overly sharp posterior distributions.  I suspect that this would be a more serious problem for signal generation.

Decision tree models were described as models that partition the history of the words in the text by asking "arbitrary questions about the history".  The types of questions were not detailed in the paper, so I don't really know what kinds of questions they are referring to.  Questions are picked by greedily selecting the ones that have the greatest reduction in entropy.  This type of decision tree model is considered to be optimal in theory, but as of the writing of this paper, results were disappointing.

Context Free Grammar (CFG) is another model that was discussed.  Like most of the models, the computer has no real understanding of the language, it is simply following a set of rules that it builds itself.  This model mentions terminals and non-terminals, but the paper does not define them clearly and I did not get a good sense of what they meant.  The use of the transition rules in this method results in a lot of local maxima that fall short of the global maxima in the probability distribution and context is needed to account for this.

Data fragmentation is a problem for many of these models because each new parameter is being estimated with less data.  An exponential model can avoid this problem by using a model of the form:
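Exponential (maximum-entropy) language models generally take a form like the following, where the f_i are feature functions of the history h and the next word w, the λ_i are learned weights, and Z(h) is a normalizing factor (this is the standard form; the paper's exact notation may differ):

P(w \mid h) = \frac{1}{Z(h)} \exp\!\big( \sum_i \lambda_i f_i(h, w) \big)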
It was not clear to me from the equation how this exponential model helps fragmentation, but it resulted in a huge drop in perplexity.

Dimensionality reduction was another method used to significantly reduce perplexity.  Vocabulary has a huge number of dimensions, and many models do not consider words like "Bank" and "Loan" to be any more related than "Bank" and "Brazil".  To manage this problem, a sparse matrix is generated with the first occurrence of each vocabulary word in a document.  The matrix can then be reduced by SVD to a much lower dimension to give a much better correlation between words.  This results in another huge reduction in perplexity.  I suspect a similar approach may be useful for signal generation.

Interactive modeling is introduced at the end of the paper as a way to use human knowledge to generate a prior by assisting the model in grouping words.  This is an interesting approach, but it introduces a bias, and I don't think it would be a practical solution for the purposes of generating signals.