Word2Vec is cool. So is tsne. But trying to figure out how to train a model and reduce the vector space can feel really, really complicated. During a sprint-residency at Bell Labs, Cambridge last fall (which has since morphed into a project where live wind data blows a text through Word2Vec space), I wrote a set of Python scripts to make these tools easier to use.
This tutorial is not meant to cover the ins-and-outs of how Word2Vec and tsne work, or machine learning more generally. Instead, it walks you through the basics of how to train a model and reduce its vector space so you can move on and make cool stuff with it. (If you do make something awesome from this tutorial, please let me know!)
Above: a Word2Vec model trained on a large language dataset, showing the telltale swirls and blobs from the tsne reduction.
Continue reading “Using Word2Vec and TSNE”
Some WIP for an upcoming performance: 1,047 syllables from H.G. Wells’ The Time Machine fed into Word2Vec space, then reduced from 50 dimensions to two. View a much larger version here.
A detail of one of the syllable swirls.
UPDATE 9/14: A few things have changed for setting up a Twitter application since this tutorial was written. The main change is you will need a phone number to register your app. Most of this guide should be fairly close to the current system, though the screenshots may look a bit different.
Creating Twitter bots, automated text-generators that spew spam, poetry, and other things, can be a bit of a confusing process. This tutorial will hopefully get you through the tough bits and make bot-building possible!
For this tutorial I will be using Python, a language whose simplicity and natural syntax are great for working with text. However, this tutorial should be easily portable to your language of choice. I assume you know at least enough programming to write your own algorithmic text; if you need some help, I would suggest one of the myriad resources available, including Learn X in Y Minutes. Finally, this post is written from a Mac user’s perspective. If you use another OS and needed different steps or have suggestions, my apologies; let me know so I can add them.
If your programming is not up to snuff, you might consider using IFTTT to trigger a Tweet. While the range of possible text is much more limited, you can easily do things like post a Tweet when tomorrow’s weather is forecast to be nice or when you like a video on Vimeo! (You can also use this as a backup for storing your bot’s awesome Tweets.)
You can view the source files used here, screenshots, and other miscellany for this tutorial on GitHub.
Continue reading “Tutorial: Twitter Bots”
Having just wrapped up a long project, I’ve wasted much of this morning on a dumb little idea: compiling all file extensions that are also valid words in the English language. I used a Processing sketch to scrape the website filext.com, then a Python script running the Natural Language Toolkit to check each extension against the dictionary.
Not perfect (some acronyms made their way through) and could be better (separate files for parts of speech, making it easier to build texts).
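For the curious, the dictionary check boils down to something like the sketch below. This is not the script from the repository, just a minimal illustration: the function names are made up for this example, and the two tiny sample lists are stand-ins for the scraped extensions and the Natural Language Toolkit's word list.

```python
def normalize_extension(ext):
    """Lowercase an extension and drop any leading dot ('.TXT' -> 'txt')."""
    return ext.strip().lstrip(".").lower()


def extensions_that_are_words(extensions, dictionary):
    """Return the extensions found in the dictionary, sorted and de-duplicated."""
    words = {w.lower() for w in dictionary}
    candidates = {normalize_extension(e) for e in extensions}
    return sorted(c for c in candidates if c and c in words)


# Stand-in data for illustration only (assumption): the real lists are far larger.
SAMPLE_EXTENSIONS = [".DREAM", ".poem", ".xyzq", ".Tree", ".tree", ".jpg"]
SAMPLE_DICTIONARY = {"dream", "poem", "tree", "night", "video"}

if __name__ == "__main__":
    print(extensions_that_are_words(SAMPLE_EXTENSIONS, SAMPLE_DICTIONARY))
```

With the Natural Language Toolkit installed and its word corpus downloaded, the stand-in dictionary could probably be swapped for the list returned by nltk.corpus.words.words(). A plain membership test like this also explains why some acronyms slip through: anything in the word list counts as a word.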
Also included is a random poem builder – here’s a sample:
BD SETUP DREAM
al vat 100 works tb nob aim name press beacon xes sod code atm four arm
tao play hairy mob whiz medical ipod exs or
ews bh lxs session poem wax serial locked primer
ybs erasure rummy ascii tis hiv sparse driver spiff pic video 98 amos first
arp tree ad watch
wus ebs mo
clearance pip pro english ph idea messenger monday wmo ism
caps fat correct pub three blocks 110 more blue hdl saw value m start holly
fez tnf male chorus kvs kick vac frame nrc
night lsd resource arcane arch bks
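A builder that produces poems shaped like the sample above could look roughly like this. It is a guess based only on the sample (a three-word uppercase title, then ten lines of 3 to 16 words each), not the code from the repository; the function name, parameters, and short word list are all assumptions for illustration.

```python
import random


def build_poem(words, num_lines=10, min_words=3, max_words=16, seed=None):
    """Return (title, lines): an uppercase three-word title and random lines of words."""
    rng = random.Random(seed)  # a seed makes the poem repeatable
    title = " ".join(rng.choice(words) for _ in range(3)).upper()
    lines = []
    for _ in range(num_lines):
        length = rng.randint(min_words, max_words)  # inclusive on both ends
        lines.append(" ".join(rng.choice(words) for _ in range(length)))
    return title, lines


# A handful of words borrowed from the sample poem; the real list is much longer.
SAMPLE_WORDS = ["dream", "setup", "vat", "works", "aim", "name", "press", "beacon",
                "code", "four", "arm", "play", "poem", "wax", "tree", "watch",
                "night", "arcane"]

if __name__ == "__main__":
    title, lines = build_poem(SAMPLE_WORDS)
    print(title)
    print("\n".join(lines))
```

Passing a seed gives the same poem every time, which is handy if a particular one is worth keeping.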
Code and resulting data are available on GitHub; full CSV results after the break.
Continue reading “English Language File Extensions”