URLs are likely among the most common things we read and write in our daily lives. We might not be able to quote a line from Shakespeare, but most of us probably know the address www.google.com [1]. Whois makes available what appears to be a fairly comprehensive list of .com domain names [2]: a rich data set to mine.
The data itself is not in a tidy or easily accessible format, but after some clever command-line downloading and Python cleanup [3], I extracted 102,204,367,101 (102.2 billion) URLs. Many lines of inquiry could be traced through this data, but this post pursues two: the most common words used in web addresses, and an analysis of their first letters.
Extracting separate words from URLs is more difficult than it first seems. Words in a URL are not separated by spaces as in a book, and separating them with dashes is not very common, so most URLs read as many words smashed together (thisisasmashedwebaddress.com). While a human can easily separate the words, a computer is much less suited to the task.
A search for possible methods turned up Viterbi Segmentation and an implementation for exactly this purpose on Stack Overflow – perfect! The example used a large text as an input to generate a dictionary of words and probabilities, with which the algorithm can split the words. Rather than try to generate such a text from scratch (I think even a single novel wouldn't be diverse enough for this purpose), I found Peter Norvig's excellent website that includes a very large word list with probabilities already included! A little cleanup and modification produces this Python script to extract words from a URL:
import re

# EXTRACT WORDS FROM MERGED TEXT
def viterbi_segment(text):
    # Dynamic programming: probs[i] is the probability of the best
    # segmentation of text[:i]; lasts[i] is where its final word starts.
    probs, lasts = [1.0], [0]
    for i in range(1, len(text) + 1):
        prob_k, k = max((probs[j] * word_prob(text[j:i]), j)
                        for j in range(max(0, i - max_word_length), i))
        probs.append(prob_k)
        lasts.append(k)
    # Walk backwards through the breakpoints to recover the words.
    words = []
    i = len(text)
    while 0 < i:
        words.append(text[lasts[i]:i])
        i = lasts[i]
    words.reverse()
    return words, probs[-1]

def word_prob(word):
    return dictionary.get(word, 0) / total

def words(text):
    return re.findall('[a-z]+', text.lower())

# CREATE DICTIONARY OF WORDS TO COMPARE TO
dictionary = {}
with open('InputWordList.txt') as infile:
    for entry in infile:
        w = entry.split()
        dictionary[w[0]] = int(w[1])
max_word_length = max(map(len, dictionary))
total = float(sum(dictionary.values()))

# SPLIT URL
words, prob = viterbi_segment('thisisacombinedurl.com')
For example, myawesomewebsite.com can be split into my, awesome, and website. This is definitely not foolproof, but for scanning just over 100 billion URLs it was the only option I could easily find, and it is surprisingly fast.
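Since the dictionary contains only lowercase words, a raw domain probably needs its TLD and any non-letters stripped before segmenting. A minimal sketch, assuming the script above has already run (strip_tld is my own helper, not part of the original script):

# Hypothetical helper: reduce a raw domain to bare lowercase letters
# before segmenting, since the dictionary holds only a-z words.
def strip_tld(url):
    name = url.lower().split('.')[0]   # 'myawesomewebsite.com' -> 'myawesomewebsite'
    return re.sub('[^a-z]', '', name)  # drop digits, dashes, etc.

segmented, prob = viterbi_segment(strip_tld('myawesomewebsite.com'))
print(segmented)  # ['my', 'awesome', 'website'], given a suitable word list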
The resulting 40 most common words are shown above, with the, and, and online coming in as the words most used in URLs. The final count tops out at 294,594 unique words (download the full file with frequency counts here). There is likely a lot more analysis that could be done on this word list. (A project I have been wanting to tackle for a long time is to interpolate all remaining domains from the existing ones, into a sort of master list of possible addresses up to the 63-character limit.)
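For what it's worth, the tallying itself could be as simple as a Counter over the segmented output. This is a sketch of the idea, not the original analysis code; urls.txt is a placeholder name for the extracted domain list, and it assumes the viterbi_segment script and strip_tld helper above:

from collections import Counter

# Sketch: tally word frequencies across all segmented domains.
# Assumes one domain per line in 'urls.txt' (placeholder name).
counts = Counter()
with open('urls.txt') as f:
    for line in f:
        text = strip_tld(line.strip())
        if text:
            segmented, _ = viterbi_segment(text)
            counts.update(segmented)

# Print the 40 most common words with their frequencies.
for word, n in counts.most_common(40):
    print(word, n)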
With this domain name set, we can also analyze the distribution of URL starting letters. Above is a graph showing how many URLs, in billions, begin with each letter. The most common letter is "S", which begins 9,028,271,732 registered URLs; "Q" is the least used, starting only 442,794,746 domains.
Perhaps more interesting is to compare this with how frequently words starting with the same letter appear elsewhere in the English language [4]. For example, the letter "S" starts 8.83% of all the URLs analyzed, but starts 10.8% of all words in the English language. Compare that to the letter "J", which starts 2.46% of URLs but only 0.9% of English words.
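One way to make that comparison concrete: count first letters, convert to percentages, and line them up against the English-language figures. A hedged sketch, not the original code; only the two percentages quoted above are filled in, and the rest would come from the Wolfram Alpha queries mentioned in the notes:

import string
from collections import Counter

# Sketch: first-letter frequency of domains vs. English words.
# Only the percentages quoted in the post are filled in here.
english_pct = {'s': 10.8, 'j': 0.9}

first_letters = Counter()
with open('urls.txt') as f:            # placeholder file name
    for line in f:
        domain = line.strip().lower()
        if domain and domain[0] in string.ascii_lowercase:
            first_letters[domain[0]] += 1

total_urls = sum(first_letters.values())
for letter in sorted(first_letters):
    url_pct = 100.0 * first_letters[letter] / total_urls
    print(letter, round(url_pct, 2), english_pct.get(letter, '?'))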
What does this mean? Since a domain name can have a maximum of 63 characters, drawn (ignoring hyphens) from 26 letters and 10 digits, there are something like 36^63 ≈ 1.11444219e+98 possible URLs. Just because a URL starts with a letter like "J" doesn't mean there are not many domains left that start with the same letter. However, it does give us some insight into the popularity of certain linguistic choices for URLs over preferences in other uses of the English language.
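That figure checks out with a one-liner: 36 possible characters in each of the 63 positions. Allowing names shorter than the full 63 characters would only add a few percent more:

# 36 possible characters (a-z, 0-9) in each of 63 positions:
print(f"{36 ** 63:.8e}")  # 1.11444219e+98
# Including all shorter lengths adds only about 3% more:
print(f"{sum(36 ** k for k in range(1, 64)):.8e}")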
NOTES, ETC
- According to an Alexa report cited here, Google receives approximately 1,100,000,000 unique visitors per month. As of 2012, there are about 2,405,518,376 internet users in the world, or 34.3% of the world’s total population. This means that about 46% of all internet users access Google on a monthly basis! It is likely that significantly more users know of Google, but elect to use other services.
- To view the lists, take the following URL http://www.whois.ws/whois_index/index.a.php and change the letter “a” to any other letter, or “0-9” for domains starting with numbers.
- Whois embeds their data in many nested webpages. Using the command-line tool wget, I was able to extract all the pages of links using the following two commands:
wget http://www.whois.ws/whois_index/index.A.php
This downloads the main page for domains starting with the letter A; all the lists can then be downloaded using this page as an input:
wget -i index.A.php -FB http://www.whois.ws/whois_index/
From there, the data required several iterations of cleanup in Python to extract the links and do the analysis (a rough sketch of that step follows these notes). A link to the project GitHub repo to come shortly.
- It turns out that Wolfram Alpha is the easiest place to get a count of how many words start with each letter! Try the query "words that start with a" or any other letter.
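Until the repo is up, here is a rough sketch of what the link-extraction step might look like. The domain regex is a guess at the page markup, not the original cleanup code, and index.A.php is one of the files fetched by the wget commands above:

import re
import string

# The full set of Whois index pages described in the notes above.
base = 'http://www.whois.ws/whois_index/index.{}.php'
index_pages = [base.format(c) for c in list(string.ascii_uppercase) + ['0-9']]

# Sketch: pull .com domain names out of one downloaded index page.
# The pattern is a guess at the markup, not the original cleanup code.
def extract_domains(path):
    with open(path, errors='ignore') as f:
        html = f.read()
    return sorted(set(re.findall(r'[a-z0-9-]+\.com', html.lower())))

print(len(extract_domains('index.A.php')))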
This is fantastic Jeff. ‘Viterbi Segmentation’ is exactly the phrase I needed to hear. +10 for doing this in Python
Hi Jeff Thompson,
Can I find the indexed sites and the dark web with this technique (Analysis of Every Registered URL)?
@Hami – if you had a dataset of those sites, then most likely. I used a dataset from Whois, so my code doesn’t get URLs, it just parses them.
How can I fix this problem?
@Hami – it’s not really a problem, just what it does (and doesn’t) do. If you wanted to make that work, you’d have to find that dataset yourself.
I suggest using https://lookup.tools to get WHOIS data for all gTLDs and ccTLDs, and also free access to phone, NS, IP, and email-address lookups and reverse WHOIS.