Monday, June 27, 2016

Stateful Parameter in Keras

According to the Keras documentation, a stateful recurrent model is one for which the internal states (memories) obtained after processing a batch of samples are reused as initial states for the samples of the next batch. This allows the model to process longer sequences while keeping computational complexity manageable.  To me, this is not completely clear, so I wanted to test this parameter to see how it affects a neural net.

I used lstm_text_generation.py as a baseline and modified it to test both stateful=True and stateful=False.  In order to make the code run with stateful=True, I needed to make some changes.


First, I needed to use the batch_input_shape parameter instead of simply input_shape, which must be given the batch size, and the same batch size also has to be passed to fit, like this:
model.fit(X, y, batch_size=batch_size, nb_epoch=1, callbacks=[history])
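For reference, here is a rough sketch of what the stateful model definition could look like. This is my own reconstruction rather than the exact code I ran; the variable names maxlen and chars come from lstm_text_generation.py, and the two 512-node layers match the setup described below.

from keras.models import Sequential
from keras.layers import LSTM, Dense, Activation

batch_size = 518   # must evenly divide the number of training samples
maxlen = 40        # sequence length, as in lstm_text_generation.py
n_chars = 60       # vocabulary size (len(chars) in the original script); placeholder value

model = Sequential()
# stateful layers need the full batch_input_shape, not just input_shape
model.add(LSTM(512, batch_input_shape=(batch_size, maxlen, n_chars),
               return_sequences=True, stateful=True))
model.add(LSTM(512, stateful=True))
model.add(Dense(n_chars))
model.add(Activation('softmax'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')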
Another requirement of stateful=True is that the number of samples passed as input needs to be evenly divisible by the batch size.  This was problematic since it limited which batch sizes I could use, but I tried a batch size of 518, which is a factor of the number of samples in each iteration, 15022.
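As a quick sanity check (a throwaway snippet, not part of the experiment), the batch sizes that satisfy this constraint are exactly the divisors of 15022:

n_samples = 15022
valid_batch_sizes = [b for b in range(1, n_samples + 1) if n_samples % b == 0]
print(valid_batch_sizes)
# [1, 2, 7, 14, 29, 37, 58, 74, 203, 259, 406, 518, 1073, 2146, 7511, 15022]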

This solution didn't quite work, because I got another error:
ValueError: non-broadcastable output operand with shape (1,512) doesn't match the broadcast shape (518,512) 
512 refers to the number of nodes in each hidden layer of the neural net.  The error seems to come from a shape mismatch in the matrix operations.  When the number of nodes in each layer was changed to 518 (from the original value of 512) to match the batch size, I received a similar error:
ValueError: non-broadcastable output operand with shape (1,518) doesn't match the broadcast shape (518,518) 
I could not figure out a proper solution to this problem, so I settled on a batch size of 1 so the matrix multiplication would work.  This dramatically increased the computation time, so in order to run the test in a matter of hours instead of days, I reduced the number of nodes in the neural net from 512 to 256.  I also ran only 10 iterations rather than the original 60.  My results are below:

Notice that in both graphs the loss function is increasing.  For stateful=True, the loss increases monotonically, whereas with stateful=False there are some downward changes, but the overall trend is still increasing.  This is the opposite of what I would expect, because the loss should decrease as the network learns.  It's possible that more iterations were needed, or that the smaller network kept it from learning, but I'm not yet sure why the loss increases.

Monday, June 20, 2016

Summary of "The Unreasonable Effectiveness of Recurrent Neural Networks"

Link to the blog post 

Recurrent Neural Networks (RNNs) are a type of neural net that operates on sequences rather than on a single fixed-size input vector.  In Andrej's blog, he emphasizes that what is special about RNNs is that the input can be a sequence.  At each step, the input is combined with a state vector to produce a new state vector.  Even if the inputs are not naturally a sequence, the data can still be processed sequentially, which is more powerful than a conventional neural net.

Here is how Andrej describes the forward pass of an RNN:
rnn = RNN()
y = rnn.step(x) # x is an input vector, y is the RNN's output vector

class RNN:
  # ...
  def step(self, x):
    # update the hidden state
    self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
    # compute the output vector
    y = np.dot(self.W_hy, self.h)
    return y

The parameters of the RNN are W_hh, W_xh, and W_hy.  self.h is initialized with the zero vector.  According to Andrej, RNNs work monotonically better with more hidden layers and more nodes.
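The snippet above leaves out the weight initialization, so it will not run by itself.  Here is a minimal runnable sketch; the sizes and the random-initialization scale are my own placeholder choices, not from the post:

import numpy as np

class RNN:
  def __init__(self, input_size, hidden_size, output_size):
    # small random weight matrices; the hidden state starts as the zero vector
    self.W_xh = 0.01 * np.random.randn(hidden_size, input_size)
    self.W_hh = 0.01 * np.random.randn(hidden_size, hidden_size)
    self.W_hy = 0.01 * np.random.randn(output_size, hidden_size)
    self.h = np.zeros(hidden_size)

  def step(self, x):
    # update the hidden state, then map it to an output vector
    self.h = np.tanh(np.dot(self.W_hh, self.h) + np.dot(self.W_xh, x))
    return np.dot(self.W_hy, self.h)

rnn = RNN(input_size=5, hidden_size=100, output_size=5)
y = rnn.step(np.random.randn(5))  # one forward step on a random input vector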

In his blog, he goes through several very cool examples of RNNs being trained on datasets, and their outputs.  Training on text results in mostly nonsensical text with a format and a "sound" that seem to match the input text.  When given inputs from Shakespeare, the neural network produces samples that keep the format and verbiage you would expect from Shakespeare, but the story is nonsense and it loses its iambic pentameter.  Characters have a mix of dialogue and long monologues, which is again consistent with the style of the source material, but it lacks content.  Given all the machine is doing, this is quite impressive, if not especially useful.

Like many other generative models, a "temperature" parameter can be used to control the trade-off between fidelity to the training text and how diverse the output is.  A very low temperature is likely to get stuck at a peak of the likelihood and result in plain or repetitive outputs.  A high temperature is more likely to have problems such as spelling mistakes.
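As a rough illustration of how temperature works (a generic sketch, not code from the post): the output scores are divided by the temperature before the softmax, so a low temperature sharpens the distribution toward the most likely character while a high temperature flattens it.

import numpy as np

def sample_char(logits, temperature=1.0):
  # divide the scores by the temperature, then softmax and sample an index
  scaled = np.asarray(logits, dtype=np.float64) / temperature
  probs = np.exp(scaled - np.max(scaled))  # subtract the max for numerical stability
  probs /= probs.sum()
  return np.random.choice(len(probs), p=probs)

logits = [2.0, 1.0, 0.1]         # made-up scores for three characters
print(sample_char(logits, 0.2))  # low temperature: almost always index 0
print(sample_char(logits, 2.0))  # high temperature: more diverse picks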

When given Wikipedia pages as input, the neural net was able to produce output that again had proper formatting: it used the structured markdown correctly to create headings and tables, cited sources (although incorrectly), and completely made up numbers for timestamps and id numbers.  It seems to have an understanding of the style of how the pages should look, if not the content.
The RNN was also able to work with LaTeX and other code, which didn't quite compile but looked the way code should look.  The network was good at getting the syntax right, even if the variable names were nonsense.  It had difficulty with "long-term interactions," such as whether or not a return statement should be present in a function depending on the absence or presence of a void declaration.  Its memory of earlier context may have been too short to relate the presence of a void declaration to the absence of a return statement.

What I found most interesting was when he looked at individual neurons.  Initially, the neurons had random parameters, but after some training, they specialized into niches.  For example, if an open parenthesis in the input produces a high activation in, say, neuron 3 and low activations in the others, then neuron 3 is effectively being told to specialize in whatever appears inside parentheses.  It will learn that a closing parenthesis should come at the end, and that the format inside may differ from normal text (such as a list of parameters in some code).  Effectively, the evolution of niches is an emergent property of the system!  Such specialization presumably requires a large neural network with several hidden layers and many neurons.

In summary, RNNs seem to do a very good job of mimicking the formatting, style, and overall feel of an input.  In the examples given, however, they are not able to give meaningful outputs, and they struggle with long-term interactions.  All of this raises the question: can these difficulties be overcome by bigger networks with more hidden layers and larger training sets, or is a totally different approach required to get more meaningful results?  Either way, RNNs seem to be unreasonably effective in getting so much right.

Monday, June 6, 2016

Gaussian Mixture Models

In trying to learn about Gaussian mixture models, I've had some difficulty in simply reading about them, but I've found some videos that give a good explanation.  I will summarize my understanding below, and link to the videos at the end.

Essentially, a Gaussian mixture model is a way to combine several Gaussian PDFs (probability density functions) of different shapes and sizes into a single distribution.  This is done by taking a weighted linear combination of the individual Gaussian PDFs, with weights that sum to one.
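As a small illustration (my own example in 1-D with made-up parameters), the mixture density is just the weighted sum of the component densities:

import numpy as np
from scipy.stats import norm

# two 1-D Gaussian components with mixing weights that sum to 1
weights = [0.3, 0.7]
means = [0.0, 5.0]
stds = [1.0, 2.0]

def mixture_pdf(x):
  # weighted linear combination of the individual Gaussian PDFs
  return sum(w * norm.pdf(x, loc=m, scale=s)
             for w, m, s in zip(weights, means, stds))

print(mixture_pdf(1.0))  # density of the mixture at x = 1.0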

Expectation-Maximization (EM) is a procedure that allows us to learn the parameters of the Gaussian mixture model.  These parameters are refined over several iterations.  The Expectation step (E-step) keeps the mean μ_c, covariance Σ_c, and mixing weight π_c of each Gaussian c fixed.  The assignment probability r_ic, that each data point i belongs to cluster c, is then calculated.


In this example, the data point x is more likely to belong to distribution 2 (66% chance) than to distribution 1 (33% chance)
The Maximization step (M-step) keeps the assignment probabilities fixed while updating the parameters μ_c, Σ_c, and π_c.  The assignment probabilities r_ic are used to calculate the new parameters.  Here, m represents the total number of data points, whereas m_c represents the effective number of points assigned to cluster c.


This way, the points with a large r_ic have more of an effect on the parameters than those with a small r_ic.  Each iteration of EM increases (or at least does not decrease) the log-likelihood of the model, but because it only climbs, the model can become stuck in local optima, so initialization is important.
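Here is a minimal numpy sketch of one EM iteration for a 1-D, two-component mixture (my own illustration with made-up data, not taken from the videos):

import numpy as np
from scipy.stats import norm

x = np.array([0.2, -0.5, 0.9, 4.8, 5.3, 6.1])  # made-up 1-D data
mu = np.array([0.0, 5.0])      # means
sigma = np.array([1.0, 1.0])   # standard deviations
pi = np.array([0.5, 0.5])      # mixing weights

# E-step: compute responsibilities r_ic with the parameters held fixed
r = pi * norm.pdf(x[:, None], loc=mu, scale=sigma)
r /= r.sum(axis=1, keepdims=True)

# M-step: update the parameters with the responsibilities held fixed
m_c = r.sum(axis=0)                      # effective number of points per cluster
pi = m_c / len(x)
mu = (r * x[:, None]).sum(axis=0) / m_c
sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / m_c)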

Using Expectation-Maximization, the model can learn the parameters over time and refine a distribution given datapoints from overlapping distributions.

-----------------------------------
link to videos:
https://www.youtube.com/watch?v=Rkl30Fr2S38
https://www.youtube.com/watch?v=qMTuMa86NzU

Friday, June 3, 2016

Binary Classifier Error Testing

Using the Keras package, I wrote some code to make a binary classifier neural network.  Given the inputs of "age" and "LTI" (the Loan To Income ratio), the network should output a '1' if the individual is predicted to default on the loan, and a '0' if they are not.
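For context, a binary classifier of this kind can be set up in a few lines with Keras.  The sketch below is my own reconstruction of the general shape of such a network (layer sizes, activations, and the placeholder training data are assumptions), not my exact code:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# placeholder data standing in for the age and LTI columns of creditset.csv
X_train = np.random.rand(100, 2)
Y_train = np.random.randint(0, 2, size=(100, 1))

model = Sequential()
# two inputs (age, LTI); two hidden layers of 64 nodes each
model.add(Dense(64, input_dim=2, activation='tanh'))
model.add(Dense(64, activation='tanh'))
# single sigmoid output: the predicted probability of defaulting on the loan
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam')
model.fit(X_train, Y_train, nb_epoch=10, batch_size=32)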

I tested using several different activation functions, and here is what I've found:

Wrong Approach

First, I defined the number of errors to be:

nb_errors=abs(sum(y1-Y_test)) ,

where y1 is the output of the neural network, and Y_test is the actual results from the dataset.

This is the wrong approach because it means that false positives make my results look better.  If the network outputs a '1' where there should be a '0', that will effectively cancel out some of the false negatives, where the network outputs a '0' but there should be a '1'.  This resulted in some interesting behavior.



With only 1 hidden layer, both networks performed much worse than with 2, even when the number of nodes was the same.  The number of errors saturates when there are few nodes with the 'relu' activation function, because the network is outputting an answer of all zeros.  In these instances it hasn't really learned anything yet, but that answer still gets ~85% of the outputs right.

More interesting is the performance of 'tanh'.  It actually appears to do better with fewer nodes, so long as there are two hidden layers.  This is because 'tanh' seems to give more false positives than 'relu' for this dataset, and under this flawed metric the false positives cancel out false negatives.

Right Approach

The number of errors should be calculated by:

nb_errors=sum(abs(y1-Y_test)) ,
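To see the difference between the two definitions concretely, here is a tiny numpy example with made-up predictions (my own illustration, not data from creditset.csv):

import numpy as np

Y_test = np.array([0, 0, 1, 1, 0])  # made-up true labels
y1 = np.array([1, 0, 0, 1, 0])      # made-up predictions: 1 false positive, 1 false negative

print(abs(sum(y1 - Y_test)))  # wrong approach: prints 0, the two errors cancel out
print(sum(abs(y1 - Y_test)))  # right approach: prints 2, both errors are counted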

This way, the false positives and false negatives are treated the same, so you can get the total number of errors by just adding them up.  Here are the results from this approach:



'relu' and 'tanh' activation functions seemed to perform similarly for high numbers of nodes, but 'tanh' performed better overall, especially when there were fewer nodes.  More hidden layers add computation time but make a big difference in accuracy.  Of the 1200 points tested, the network gets ~50 wrong answers with two layers of 64 nodes, which is not bad.

I also tested the 'sigmoid' activation function, and it performed poorly.  It was not able to break away from the all-zeros answer until the very end of the last simulation, and even then its accuracy was poor.  Thus, 'sigmoid' does not appear to be a good activation function to use for the hidden layers.

Conclusion

'tanh' performs better than 'relu' for this binary classification problem, but it also has a higher frequency of false positives.  This is fine for the lending industry, which may be hesitant to give out risky loans, but if false positives are worse than false negatives, then the 'relu' activation function may be the better choice, so long as enough nodes are used to get a good answer.
---------------------------
download the data set: creditset.csv
download my: python code