Too real, looks just like our cat.
Word2Vec is cool. So is tsne. But trying to figure out how to train a model and reduce the vector space can feel really, really complicated. While working on a sprint-residency at Bell Labs, Cambridge last fall, which has morphed into a project where live wind data blows a text through Word2Vec space, I wrote a set of Python scripts to make using these tools easier.
This tutorial is not meant to cover the ins-and-outs of how Word2Vec and tsne work, or about machine learning more generally. Instead, it walks you through the basics of how to train a model and reduce its vector space so you can move on and make cool stuff with it. (If you do make something awesome from this tutorial, please let me know!)
Above: a Word2Vec model trained on a large language dataset, showing the telltale swirls and blobs from the tsne reduction.
Some WIP for an upcoming performance: 1,047 syllables from H.G. Wells’ Time Machine input to word2vec space, then reduced from 50 dimensions to two. View a much larger version here.
A detail of one of the syllable swirls.
Some progress, trying to caption images using sentences from novels. This first step is working ok, next will be to train a neural-network to do this automatically from my captioned set, then output new captions algorithmically.
Part of a short-term collaboration with the Social Dynamics group at Bell Labs, Cambridge.
Update! For El Capitan and users of newer version of OS X, you may run into issues installing Torch or Lua packages. A fix is included now.
There have been many recent examples of neural networks making interesting content after the algorithm has been fed input data and “learned” about it. Many of these, Google’s Deep Dream being the most well-covered, use and generate images, but what about text? This tutorial will show you how to install Torch-rnn, a set of recurrent neural network tools for character-based (ie: single letter) learning and output – it’s written by Justin Johnson, who deserves a huge “thanks!” for this tool.
The details about how all this works are complex and quite technical, but in short we train our neural network character-by-character, instead of with words like a Markov chain might. It learns what letters are most likely to come after others, and the text is generated the same way. One might think this would output random character soup, but the results are startlingly coherent, even more so than more traditional Markov output.
Torch-rnn is built on Torch, a set of scientific computing tools for the programming language Lua, which lets us take advantage of the GPU, using CUDA or OpenCL to accelerate the training process. Training can take a very long time, especially with large data sets, so the GPU acceleration is a big plus.