Word2Vec is cool. So is tsne. But trying to figure out how to train a model and reduce the vector space can feel really, really complicated. While working on a sprint-residency at Bell Labs, Cambridge last fall, which has morphed into a project where live wind data blows a text through Word2Vec space, I wrote a set of Python scripts to make using these tools easier.
This tutorial is not meant to cover the ins-and-outs of how Word2Vec and tsne work, or machine learning more generally. Instead, it walks you through the basics of how to train a model and reduce its vector space so you can move on and make cool stuff with it. (If you do make something awesome from this tutorial, please let me know!)
Above: a Word2Vec model trained on a large language dataset, showing the telltale swirls and blobs from the tsne reduction.
1. INSTALL REQUIRED LIBRARIES
First, you’ll need to install a few libraries to get things running. Luckily, unlike Torch or OpenCV, they’re really pretty easy to install using package managers like pip.
- gensim for Word2Vec
- sklearn (installed with pip under the name scikit-learn) for its tsne implementation
- numpy for handling the lists of vectors
- Optional: rasterfairy for tsne-to-grid layout, put in the lib/ folder
- Optional: pattern for part-of-speech tagging
- Optional: Wikipedia Extractor to strip Wiki tags (if you’re using a Wikipedia dump as your data source), put in the lib/ folder
You’ll also need the scripts used in this tutorial, available here.
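If you want a quick way to see what's still missing after running something like pip install gensim scikit-learn numpy, here is a small sketch that checks whether each library can be found. The REQUIRED and OPTIONAL lists and the missing_modules helper are names made up for this example, and it doesn't check anything you placed in lib/.

```python
import importlib.util

# Import names for the pip-installable libraries listed above. rasterfairy and
# the Wikipedia Extractor live in lib/ and are not checked here.
REQUIRED = ["gensim", "sklearn", "numpy"]
OPTIONAL = ["pattern"]


def missing_modules(names):
    """Return the names that cannot be found on the current Python path."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    for label, names in (("required", REQUIRED), ("optional", OPTIONAL)):
        missing = missing_modules(names)
        print(label, "missing:", ", ".join(missing) if missing else "none")
```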
2. SELECT A TRAINING FILE
To train your Word2Vec model, you’ll need some plain text input for it to learn from. Larger files, like a Wikipedia dump*, will produce a more robust model but will take way, way longer to train and reduce. A good place to start would be a novel downloaded from Internet Archive.
Keep in mind that misspellings get learned too, so a “clean” file can make a big difference. Also important to think about (though sadly out of the scope here) is that any bias present in your source text will get baked into your Word2Vec model as well. Gender relationships, connections between ideas – Word2Vec captures these from its input the same as any other connections between words. TLDR: it’s worth picking your source text carefully, and important not to think of a machine learning model as a pure representation of language.
You can put your source text anywhere, though I keep mine in the ModelsAndData/ folder to keep everything organized, and these scripts will save there too.
* If you do use Wikipedia, you’ll want to strip the wiki tags from the text. There are a few ways to do it, but I suggest Wikipedia Extractor, which is very reliable and makes it really easy. Why reinvent the wheel, right?
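As a rough idea of what a "clean" file can mean in practice, here is a minimal sketch that turns raw text into one lowercase, punctuation-free sentence per line. The clean_lines function is hypothetical (it is not one of the tutorial's scripts), the sentence splitting is deliberately naive, and it only keeps unaccented ASCII letters, so treat it as a starting point.

```python
import re


def clean_lines(raw_text):
    """Split raw text into lowercase sentences with punctuation removed."""
    # Naive sentence split: break after ., ! or ? when whitespace follows.
    sentences = re.split(r"(?<=[.!?])\s+", raw_text)
    cleaned = []
    for sentence in sentences:
        # Keep runs of ASCII letters, allowing one apostrophe inside a word
        # (so "don't" survives); everything else is dropped.
        words = re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())
        if words:
            cleaned.append(" ".join(words))
    return cleaned
```

Writing the returned lines to a text file, one per line, gives you a file where each line is a sentence of space-separated words.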
2a. OPTIONAL: TAG PARTS-OF-SPEECH
While this won’t be an issue for most projects, you may want finer-grained modeling of language, especially with words that are spelled the same but have different meanings (homonyms). For example, the word “box” can be an object (noun) or an action (verb). To preserve these differences, we can tag the words with their parts-of-speech. When training, box_NN will be seen as a separate entity from box_VB.
We can add POS with the help of the pattern library, which does all the heavy lifting, and TagTextForTraining.py which wraps it up and outputs a text file for training. Open the script and modify it to include your input text file and a new filename to save. Run it in the Terminal – this could take quite a long time, depending on the size of the input.
The resulting file should look something like this, in the format of <word>_<POS>:
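If you're curious how lines in that format could be built, here is a minimal sketch. It is not the contents of TagTextForTraining.py: format_tagged and tag_line are hypothetical names, and the sketch assumes pattern.en's tag function, which returns a list of (word, tag) tuples for a string.

```python
def format_tagged(pairs):
    """Join (word, POS) pairs into one line of word_POS tokens."""
    return " ".join(f"{word}_{pos}" for word, pos in pairs)


def tag_line(line):
    """Tag one line of text with pattern and return it in word_POS form."""
    # Imported here so format_tagged() works even without pattern installed.
    from pattern.en import tag

    return format_tagged(tag(line))
```

Calling tag_line on each line of your input and writing the results to a new file would give you a tagged file to train on.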
3. TRAIN YOUR MODEL
Open the TrainModel.py file in a text editor and make some modifications to suit your input. You will want to set:
- input_filename : the path to your input text file
- model_filename : the path and filename for the trained .model file
- skip_gram : optional and tweaks how Word2Vec is trained – leave as False as a default, or read more about it here
Once set, open the Terminal, navigate to your folder, and run the script with python TrainModel.py.
The script will build a vocabulary from the text file, train the model, and save it. This can take between 30 seconds and a few hours, depending on your source file and your computer.
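For a sense of what those steps can look like in gensim, here is a minimal sketch. It is an assumption about the approach, not a copy of TrainModel.py: the train and skip_gram_flag functions are made up for this example, and it expects an input file with one sentence per line and words separated by spaces.

```python
def skip_gram_flag(skip_gram):
    """Map a True/False skip_gram setting onto gensim's sg parameter."""
    # gensim uses sg=1 for skip-gram and sg=0 for CBOW (its default).
    return 1 if skip_gram else 0


def train(input_filename, model_filename, skip_gram=False):
    """Train a Word2Vec model on a text file and save it to disk."""
    # Imported here so the helper above can be used without gensim installed.
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    # LineSentence streams the file, treating each line as one sentence of
    # whitespace-separated words, so the whole text never sits in memory.
    sentences = LineSentence(input_filename)

    # Passing sentences to the constructor builds the vocabulary and trains.
    model = Word2Vec(sentences, sg=skip_gram_flag(skip_gram), workers=4)
    model.save(model_filename)
    return model
```

A call might look like train("ModelsAndData/my_text.txt", "ModelsAndData/my_text.model"), where both paths are placeholders for your own files. Keep in mind that gensim drops words that appear fewer than 5 times by default, so a very short text can end up with a tiny vocabulary.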
4. REDUCE VECTOR SPACE
The resulting vector space can be hundreds or thousands of dimensions – very detailed but impossible to visualize. Luckily, the tsne (t-distributed stochastic neighbor embedding) algorithm lets us efficiently reduce the vector space while preserving, as much as possible, local spatial relationships between words. It’s way out of the scope here to discuss how tsne works, so let’s call it magic, or you can read way more about it from its creator Laurens van der Maaten.
The reduction is done with the TwoStageReduce.py script – open it like before and modify the variables as needed.
- model_filename : the trained model from the last step
- model_name : used to format the name of several output files later
- num_dimensions : how many dimensions for the final reduction – 2 will let us visualize the model in an image, so let’s leave it at that
- run_init_reduction and init_dimensions : for large data sets, if we went straight to a 2D tsne our computer would run out of memory and choke; instead, we can do an initial reduction with incremental PCA (a more memory-friendly but less precise method) to make our vector space more manageable before running tsne – a setting of 20D seems about right on my machine
- only_most_common and num_common : we can also reduce our vector space by only keeping the most common words; this is loaded from a file in ModelsAndData/ and lets us specify how many words to keep – try 10k as a good starting point, then bump it up to 50k if you need it
- tagged_pos : set to True if you trained your model with parts-of-speech; if so, we have to strip the POS before matching to common words
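To make the only_most_common and tagged_pos settings a bit more concrete, here is a sketch of how that filtering could work. The strip_pos and keep_most_common functions are hypothetical, and the sketch assumes the common-words list is ordered from most to least frequent; the real script may do this differently.

```python
def strip_pos(token):
    """Remove a trailing _POS tag, e.g. 'box_NN' becomes 'box'."""
    return token.rsplit("_", 1)[0]


def keep_most_common(vocab, common_words, num_common, tagged_pos=False):
    """Return the vocabulary entries whose word is among the top num_common."""
    # common_words is assumed to be ordered from most to least frequent.
    allowed = set(common_words[:num_common])
    kept = []
    for token in vocab:
        # Tagged tokens are compared by their bare word, but kept with the tag.
        word = strip_pos(token) if tagged_pos else token
        if word in allowed:
            kept.append(token)
    return kept
```

Note that with tagged_pos set, both box_NN and box_VB are kept if "box" is a common word, so the tagged vocabulary can end up larger than num_common.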
When ready, run your script! This will take the longest of any step – I’ve had it take up to several hours. First it will load your model, then reduce the vocabulary as specified, do an initial reduction, a final reduction, and normalize the vectors to a range of -1 to 1. It will save each of these variations as csv files, making them easy to use for visualizations, etc.
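The reduce-then-normalize pipeline can be sketched with scikit-learn and numpy roughly like this. It is an illustration under assumptions, not the contents of TwoStageReduce.py: the function names are made up, tsne is left at its default settings, and normalizing each axis separately is my guess at what "a range of -1 to 1" means.

```python
import numpy as np


def normalize(vectors):
    """Rescale each column of a 2D array to the range -1 to 1."""
    vectors = np.asarray(vectors, dtype=float)
    low = vectors.min(axis=0)
    span = vectors.max(axis=0) - low
    span[span == 0] = 1.0  # avoid dividing by zero for constant columns
    return (vectors - low) / span * 2.0 - 1.0


def two_stage_reduce(vectors, init_dimensions=20, num_dimensions=2):
    """Reduce with incremental PCA first, then tsne, then normalize."""
    # Imported here so normalize() can be used without scikit-learn installed.
    from sklearn.decomposition import IncrementalPCA
    from sklearn.manifold import TSNE

    # Stage 1: memory-friendly PCA down to a manageable number of dimensions.
    coarse = IncrementalPCA(n_components=init_dimensions).fit_transform(vectors)
    # Stage 2: tsne from the intermediate space down to the final one.
    # (tsne's default perplexity needs more than a few dozen input rows.)
    fine = TSNE(n_components=num_dimensions).fit_transform(coarse)
    return normalize(fine)
```

The result is a plain numpy array, so something like np.savetxt("reduced.csv", reduced, delimiter=",") would write it out as csv (the filename is a placeholder, and the words themselves would need to be saved alongside).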
Here’s a sample from the normalized output:
Optionally, you may want to convert your vector space into a nice, even grid. This can be helpful for visualizing data that is clumped together, or for things like searching. The TsneToGrid.py script uses the rasterfairy module (installed in the lib/ folder, since it can’t be installed with pip). Change the input/output files and run it.
The script will print the output dimensions of the grid (such as 25×26 words) which you’ll want to note if you’re doing any kind of visualization or interactive project.
5. VISUALIZE THE RESULTS
Data is hard to read, so visualizing the vector space can be really helpful. The included Processing script will load your 2D csv file and output a png file, showing the characteristic tsne blobs and tails (or a grid, if you changed it in the previous step).
Open the sketch, change the input and output filenames, and any other settings you want to change.
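If you'd rather draw the points with your own code instead of the Processing sketch, the main step is mapping the normalized -1 to 1 coordinates onto pixels. Here is a minimal Python sketch of that step. The to_pixel and load_points functions are hypothetical, and the csv layout (word, x, y per row, no header) is an assumption you should check against your own output files.

```python
import csv


def to_pixel(x, y, width, height, margin=0):
    """Map normalized coordinates (-1 to 1) to pixel coordinates."""
    usable_w = width - 2 * margin
    usable_h = height - 2 * margin
    px = margin + (x + 1.0) / 2.0 * usable_w
    py = margin + (y + 1.0) / 2.0 * usable_h
    return px, py


def load_points(csv_filename, width, height, margin=0):
    """Read rows assumed to be word,x,y and return (word, px, py) tuples."""
    points = []
    with open(csv_filename, newline="") as f:
        for row in csv.reader(f):
            # Assumes no header row and exactly two coordinates per word.
            word, x, y = row[0], float(row[1]), float(row[2])
            points.append((word, *to_pixel(x, y, width, height, margin)))
    return points
```

The margin keeps words near the edges from being clipped when you draw their labels.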
Above: detail of Word2Vec space, trained on H.G. Wells’s “The Time Machine”.
Above: detail of the same space, converted to a grid with rasterfairy.
6. MAKE SOMETHING!
That’s it, go make something cool!