This month, the curatorial collaborative project Drift Station, which I’m a part of along with Angeles Cossio, released an online project titled Empty Apartments. We pulled nearly 125,000 photographs of apartments and houses for rent on Craigslist that were completely empty, and presented them as an interactive online exhibition. The project took nearly two years of work, and much of it was manual (Angeles triple-checking every single image by hand to remove ones that included common spaces or non-apartments), but we also used several automated processes and machine learning to sort the photos.
This post outlines some of the technical steps used to create the project. All the code and data used are available on GitHub.
TOOLS USED
Mainly, the technical parts of the project were accomplished using custom-written Python scripts (for scraping and the machine learning), some Processing sketches (to generate the map tiles), and Leaflet.js for the map interface.
TLDR, the important pieces were:
- Python
- BeautifulSoup
- numpy
- scikit-learn
- HDF5
- Multicore t-SNE
- Rasterfairy
- Processing
- ImageMagick
- Leaflet.js
SCRAPING
The first step was to download the images. Since Craigslist doesn’t have an API for this kind of thing (no surprise), I manually grabbed all the locations they serve in the U.S. and wrote a Python script to run a search for each one and download the images from a bunch of listings. Scraping is more kludge than science, so it was mostly a combination of BeautifulSoup and lots of regular expressions.
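As a rough illustration of that BeautifulSoup-plus-regex approach, here’s a minimal sketch – the location list, search path, link filter, and image-URL regex are all stand-ins rather than the exact ones the script used:

import re
import urllib.request

from bs4 import BeautifulSoup

# stand-in list – the real script used every US location Craigslist serves
locations = ['omaha', 'newyork', 'seattle']

for location in locations:
    search_url = 'https://' + location + '.craigslist.org/search/apa'
    html = urllib.request.urlopen(search_url).read()
    soup = BeautifulSoup(html, 'html.parser')

    # pull links to individual listings out of the search results page
    for link in soup.find_all('a', href=True):
        if not re.search(r'/apa/', link['href']):
            continue
        listing_url = link['href']
        # ...fetch listing_url, then grab image URLs with a regex like
        # r'https://images\.craigslist\.org/\S+\.jpg' and save each one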
Here are a few additions that made this work better, though:
- While it took more work to code, I built the script to take command-line arguments, such as where in the list of locations to start (letting me stop and resume partway through the massive download) and how many pages of listings to scrape.
- Initially, we discussed showing apartments listed at a rent that a family making the median income could afford. I pulled the income data by state from the U.S. Census Bureau site and used it to search for apartments with rents at or below 30% of that income. In the end, we scrapped this idea, but it was an interesting thing to try.
- Throughout the project, we tried to automate as much as possible. As a first pass, I used ImageMagick to compute the entropy of each image. This can be a good way to differentiate graphical images (low entropy, such as floorplans or ads) from photographic ones (high entropy, the ones we wanted to keep). The script would download images to a graphics folder if the entropy was below 3.9, a photos folder if above 4.4, and a maybe folder if in between. It wasn’t perfect, but it made Angeles’ job of manually checking much quicker (there’s a sketch of this step after the list below).
- I found that my IP was likely to get temporarily blocked when doing a lot of downloads. Doing it from the university where I teach, though, meant a dot-edu IP address and I didn’t get blocked at all :)
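To give a sense of how that entropy filter worked, here’s a minimal sketch. The project computed entropy with ImageMagick; this version computes a comparable Shannon entropy from the grayscale histogram in Python, so the values (and therefore the thresholds) may not match ImageMagick’s exactly, and the folder names are just placeholders:

import math
import shutil

from PIL import Image

def shannon_entropy(path):
    """Shannon entropy (in bits) of the image's grayscale histogram."""
    histogram = Image.open(path).convert('L').histogram()
    total = float(sum(histogram))
    entropy = 0.0
    for count in histogram:
        if count > 0:
            p = count / total
            entropy -= p * math.log(p, 2)
    return entropy

def sort_image(path):
    """File an image into graphics/photos/maybe based on its entropy."""
    e = shannon_entropy(path)
    if e < 3.9:
        shutil.move(path, 'graphics/')
    elif e > 4.4:
        shutil.move(path, 'photos/')
    else:
        shutil.move(path, 'maybe/')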
In this whole process, I tried to preserve the data as much as possible. Instead of dumping to a single folder, images were downloaded to a folder with the name of the location they came from. Original filenames were kept intact too, on the off-chance I wanted to find them again, or if the format of those names became important later in the project.
CULLING
Once we had the images, we took a look through and realized there were a ton that had escaped my graphics/photo filter, and many that were of common areas, the outside of the building, etc. We wanted the project to show just the insides of homes, so Angeles spent months (literally) looking through every image. In fact, she went over them several times to be sure, and each pass removed some she had missed earlier.
As a final sweep, I wrote another Python script to remove duplicate images – out of the total 130,000 images kept, nearly 4,500 were duplicates! This was accomplished using some code via this Real Python post, which computes a “difference hash” for each image. A kind of “locality-sensitive” hash, it works by reducing the data so that similar items map to the same hash with high probability. This is the opposite of cryptographic hashing, where tiny differences should result in a wildly different hash. The resulting values are saved as keys in a dictionary; any hash with more than one file is most likely a duplicate, and the redundant images were moved to another folder.
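A stripped-down version of that duplicate check might look like the sketch below; the hash function is a standard difference hash written out by hand (the Real Python post has the full treatment), and the photos folder glob is a placeholder:

import glob
from collections import defaultdict

from PIL import Image

def dhash(path, hash_size=8):
    """Difference hash: shrink, grayscale, compare neighboring pixels."""
    img = Image.open(path).convert('L').resize((hash_size + 1, hash_size))
    pixels = list(img.getdata())
    bits = ''
    for row in range(hash_size):
        for col in range(hash_size):
            left = pixels[row * (hash_size + 1) + col]
            right = pixels[row * (hash_size + 1) + col + 1]
            bits += '1' if left > right else '0'
    return int(bits, 2)

# bucket filenames by hash – any bucket holding more than one file
# almost certainly holds duplicates
hashes = defaultdict(list)
for path in glob.glob('photos/**/*.jpg', recursive=True):
    hashes[dhash(path)].append(path)

duplicates = [paths for paths in hashes.values() if len(paths) > 1]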
WHY MACHINE LEARNING?
While in the early stages of the project, we had lots of conversations about how best to present the images: grouped by location, randomly in a grid, by monthly rent. Drift Station is, in part, about curatorial experimentation, and after spending so much time looking at the images, we realized that patterns were emerging: color palettes and objects repeated, and an overall vernacular appeared. For this reason, using machine learning to sort the images seemed like a good way to go – it removed our hand a bit, and it could bring out these patterns across the entire data set.
(What follows isn’t meant to get into the nitty-gritty of machine learning or how to make it work best. Instead, it’s an overview of what we did and what worked for us.)
IMAGE PREP
Generally, machine learning systems don’t need (or want) full-resolution images, so before extracting features, the images were reduced to 64×64 pixels. Normal command-line calls choke on a list of files this big, but I found the find command along with ImageMagick worked well:
find . -name '*.jpg' -execdir mogrify -resize 64x64! {} \;
I also made sure the color-space was consistent, otherwise I had problems later when trying to make the map tiles in Processing – easier to just take care of that now:
find . -name '*.jpg' -execdir mogrify -colorspace RGB {} \;
The last step required the most trial-and-error. While there are a lot of ways to define features in an image, I didn’t have access to a crazy GPU setup or tons of cash to blow on AWS. What I found was an example by Kyle McDonald where an image is blurred at two different levels and stacked vertically with the original. The system doesn’t much care that this is an artificial representation, but it retains both large- and small-scale information. The blurring was done in ImageMagick as well, then combined with a Python script:
find . -name '*.jpg' -execdir mogrify -blur 0x3 {} \;
find . -name '*.jpg' -execdir mogrify -blur 0x8 {} \;
The exact blur levels may depend on your images, but I found settings of 3 and 8 worked well. Python’s PIL library handled combining the images in a reasonably fast way, though this could have been done in Processing too.
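Here’s roughly what that combining step looks like in PIL, assuming the un-blurred 64×64 images and the two blurred copies live in separate folders (the folder names here are placeholders):

import glob
import os

from PIL import Image

os.makedirs('stacked', exist_ok=True)

# placeholder folders: one copy of the 64x64 images left sharp,
# one blurred with 0x3, one blurred with 0x8
for path in glob.glob('resized/*.jpg'):
    name = os.path.basename(path)
    original = Image.open(path)
    blur_small = Image.open(os.path.join('blur3', name))
    blur_large = Image.open(os.path.join('blur8', name))

    # stack the three versions vertically into a single 64x192 image
    w, h = original.size
    stacked = Image.new('RGB', (w, h * 3))
    stacked.paste(original, (0, 0))
    stacked.paste(blur_small, (0, h))
    stacked.paste(blur_large, (0, h * 2))
    stacked.save(os.path.join('stacked', name))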
EXTRACTING FEATURES
While I did look at using a pre-trained network like the VGG16 weights that Gene Kogan suggests in his excellent Machine Learning For Artists tutorials, I found it really slow for a dataset of this size and with little payoff, since I didn’t need to identify objects. Instead, I just loaded all the images into numpy arrays and saved them to an HDF5 file, which allows the data to be compressed easily. For each image, the steps were:
- Create labels for the data (the filenames)
- Strip alpha channel from the image
- Convert to grayscale
- Add to the numpy array as float32 data (no need for super-high precision)
I then normalized the entire set of data to a range of 0–1, making it easier for later steps. The final HDF5 file was about 7GB.
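Sketched out with h5py (the folder and file names are placeholders – the real scripts are in the GitHub repo), those steps look something like this:

import glob

import h5py
import numpy as np
from PIL import Image

paths = sorted(glob.glob('stacked/*.jpg'))    # the stacked 64x192 images

features = np.zeros((len(paths), 64 * 192), dtype=np.float32)
labels = []
for i, path in enumerate(paths):
    img = Image.open(path).convert('L')       # grayscale also drops any alpha
    features[i] = np.asarray(img, dtype=np.float32).flatten()
    labels.append(path)

features /= 255.0                             # normalize to the range 0-1

with h5py.File('features.hdf5', 'w') as f:
    f.create_dataset('features', data=features, compression='gzip')
    f.create_dataset('labels', data=np.array(labels, dtype='S'))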
DIMENSIONALITY REDUCTION
As is often the case with machine learning, the number of dimensions in our final data is way too high to be useful. In this case, the three stacked 64×64-pixel images mean 12,288 dimensions of data (one grayscale value per pixel). Since the goal was to ultimately end up in a 2D layout, the number of dimensions needed to be reduced (a process called “decomposition”).
With previous projects, I had the best luck running an initial reduction using PCA. Loading the entire 7GB dataset into RAM wasn’t going to work very well on my laptop, but thankfully sklearn has a version called Incremental PCA that lets us feed in chunks of data from an HDF5 file, fitting as it goes.
Getting ideal settings for machine learning seems to be half art, half word of mouth. I ended up reducing the dimensions to 150 using PCA – leave too many and t-SNE will choke, go down to too few and the results get bad. Imanol Luengo suggests that a good starting point is the square root of the number of values in one datapoint (in this case sqrt(12288) ≈ 110). The data was fed in 1,000-image chunks and saved out as a new HDF5 file.
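A minimal sketch of that streaming step with sklearn’s IncrementalPCA (file names are placeholders; note that every chunk passed to partial_fit needs at least as many rows as there are components):

import h5py
from sklearn.decomposition import IncrementalPCA

chunk_size = 1000
num_components = 150

with h5py.File('features.hdf5', 'r') as f, \
     h5py.File('features_pca.hdf5', 'w') as out:
    features = f['features']
    n = features.shape[0]

    # first pass: fit the PCA on streamed chunks
    ipca = IncrementalPCA(n_components=num_components)
    for start in range(0, n, chunk_size):
        chunk = features[start:start + chunk_size]
        if len(chunk) >= num_components:   # partial_fit needs >= n_components rows
            ipca.partial_fit(chunk)

    # second pass: transform in chunks and write the reduced data back out
    reduced = out.create_dataset('features', (n, num_components), dtype='float32')
    for start in range(0, n, chunk_size):
        reduced[start:start + chunk_size] = ipca.transform(
            features[start:start + chunk_size])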
The final reduction was done using t-SNE, which gives much better results. Similarly, trying to do this all in RAM is difficult, but the MulticoreTSNE implementation, while really difficult to install, was a huge help. (See the tsne_Multicore.py notes for how I got it to work.) The overall settings don’t seem to matter too much, but I did write a program to visualize the results as a way of tuning everything; a sketch of the call follows the settings below.
- Number of cores = 8 (you can check how many your computer has, on a Mac at least, using sysctl -n hw.ncpu)
- Final dimensions = 2
- Number of iterations = 1000 (the default setting)
- Learning rate = 1000 (probably the most critical value)
- Perplexity = 30 (roughly, how many close neighbors the algorithm assumes each point has; larger datasets often require a higher perplexity)
- Angle = 0.2 (lower = more accurate, higher = faster)
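Put together, the call looks roughly like the sketch below. MulticoreTSNE mirrors sklearn’s TSNE constructor, though it’s worth double-checking the argument names against whatever version you manage to get installed; the file names are placeholders:

import h5py
import numpy as np
from MulticoreTSNE import MulticoreTSNE as TSNE

with h5py.File('features_pca.hdf5', 'r') as f:
    features = np.array(f['features'])

tsne = TSNE(n_jobs=8,            # number of cores
            n_components=2,      # final dimensions
            n_iter=1000,         # iterations (the default)
            learning_rate=1000,
            perplexity=30,
            angle=0.2)
embedding = tsne.fit_transform(features)

np.savetxt('tsne_2d.csv', embedding, delimiter=',')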
GRID LAYOUT
The final step in prepping the data was to get it into a grid (and avoid the t-SNE look that just screams “machine learning”). For that, Mario Klingemann’s rasterfairy code does a great job. I converted the HDF5 data to CSV for easier input (I found that normalizing the data made rasterfairy do weird things, which was surprising) and ran the grid layout. The results on data this large are good, but a machine with more RAM could take advantage of the multiple passes rasterfairy can apply, giving a better layout.
I ended up running this optimization on an Amazon Web Services EC2 instance – I’m working on a separate post on that, so stay tuned.
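A minimal sketch of the rasterfairy step (the CSV file names are placeholders, and it’s worth checking the exact return value of transformPointCloud2D against the rasterfairy docs – it should hand back one grid coordinate per input point, plus the grid dimensions):

import numpy as np
import rasterfairy

# the 2D t-SNE points saved out in the previous step (placeholder filename)
points = np.loadtxt('tsne_2d.csv', delimiter=',')

# snap the point cloud to a rectangular grid
result = rasterfairy.transformPointCloud2D(points)
grid_xy = result[0]    # one (x, y) grid cell per input point

np.savetxt('grid_2d.csv', grid_xy, delimiter=',')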
CREATE MAP TILES
The final step was to render out “slippy map” tiles, giving us the Google Maps-like zoom-and-pan interface. While there are tools made to do this, they’re not meant to take lots of randomly-sized images (surprise surprise), so I built my own in Processing.
Map tiles are always 256 pixels square, and though there’s probably a smarter way to do this, I wrote code to make giant rows of images, then slice them into squares. Files are output using a standard scheme that denotes their x/y position:
<outputFolder>/<zoomLevel>/x/y.png
These tiles can then be used to generate other zoom levels: combine four adjacent tiles and scale the result back down to 256 pixels, zooming out by a factor of 2. I also implemented JPG compression in Processing to keep the tile files small (it makes a huge difference over lossless PNGs with this many files).
Since the images were all different widths, determining the exact layout a priori was going to be difficult. Instead, all images are sized to the same height, and new ones are added to a row until it fills. This dynamic process took a bit of tuning, but worked well once set up.
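The actual tiling was done in Processing, but the zoom-out step described above is easy to picture in a few lines of Python/PIL – this sketch combines four child tiles into one parent tile using the <outputFolder>/<zoomLevel>/x/y.png naming scheme:

import os

from PIL import Image

TILE = 256

def tile_path(folder, zoom, x, y):
    """Standard slippy-map naming: <folder>/<zoom>/<x>/<y>.png"""
    return os.path.join(folder, str(zoom), str(x), str(y) + '.png')

def build_parent_tile(folder, zoom, x, y):
    """Combine four tiles from zoom level `zoom` into one tile at
    zoom level `zoom - 1`, scaled back down to 256 pixels."""
    combined = Image.new('RGB', (TILE * 2, TILE * 2))
    for dx in (0, 1):
        for dy in (0, 1):
            child = tile_path(folder, zoom, x * 2 + dx, y * 2 + dy)
            if os.path.exists(child):
                combined.paste(Image.open(child), (dx * TILE, dy * TILE))
    parent = combined.resize((TILE, TILE))
    out = tile_path(folder, zoom - 1, x, y)
    os.makedirs(os.path.dirname(out), exist_ok=True)
    parent.save(out)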