In an earlier post, I linked to the Project Gutenberg Ngram data I had computed for e-books in all languages. If you are interested only in the English data, get these files instead.
These two files are splits of a compressed file which contains all of the Project Gutenberg English e-books downloaded about a week before the date of this post.
gutenberg_en_files.tar.bz2.0 (2.0GB)
gutenberg_en_files.tar.bz2.1 (1.4GB)
Unigrams along with frequency count from the text data above
gutenberg_en_unigrams.tar.gz (7.4MB)
Bi-grams and Tri-grams along with frequency count from the text data above
gutenberg_en_bi_tri_grams.tar.gz (493MB)
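For anyone curious what such counts look like in practice, here is a rough sketch of unigram and bigram counting with standard Unix tools. It is not the script used to build the files above. The function names and the tokenization (lowercase everything, treat any run of non-letters as a word break, let bigrams run across sentence boundaries) are assumptions made for illustration.

```shell
# Hypothetical helper: reads text on stdin and writes "count<TAB>word" lines,
# most frequent first. Assumed tokenization: lowercase, ASCII letters only.
unigram_counts() {
    tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | sed '/^$/d' | sort | uniq -c |
        awk '{ print $1 "\t" $2 }' | sort -k1,1nr -k2,2
}

# Hypothetical helper: same tokenization, but counts adjacent word pairs
# and writes "count<TAB>word1 word2" lines.
bigram_counts() {
    tr 'A-Z' 'a-z' | tr -cs 'a-z' '\n' | sed '/^$/d' |
        awk 'NR > 1 { print prev " " $0 } { prev = $0 }' | sort | uniq -c |
        awk '{ print $1 "\t" $2 " " $3 }' | sort -k1,1nr -k2
}

# Usage (book.txt is a placeholder file name):
#   unigram_counts < book.txt > unigrams.tsv
#   bigram_counts  < book.txt > bigrams.tsv
```

Sorting the whole token stream is fine for a single book. A corpus the size of the one above would need something less naive.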
I had to split the archive because my webserver cannot serve files larger than 2GB. After downloading both parts, join them like this:
mv gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2
cat gutenberg_en_files.tar.bz2.1 >> gutenberg_en_files.tar.bz2
rm gutenberg_en_files.tar.bz2.1
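If you would rather keep the downloaded parts until you know the joined archive is good, something along these lines should also work. This is only a sketch: join_parts and check_bz2 are made-up helper names, and the integrity check assumes bzip2 is installed.

```shell
# Hypothetical helper: join_parts OUTPUT PART...
# Concatenates the parts, in the order given, into OUTPUT.
# The part files themselves are left untouched.
join_parts() {
    out=$1
    shift
    cat "$@" > "$out"
}

# Hypothetical helper: check_bz2 FILE
# Succeeds only if FILE passes bzip2's built-in integrity test.
check_bz2() {
    bzip2 -t "$1"
}

# Usage:
#   join_parts gutenberg_en_files.tar.bz2 \
#       gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2.1
#   check_bz2 gutenberg_en_files.tar.bz2 && tar -xjf gutenberg_en_files.tar.bz2
```

This needs roughly twice the disk space of the approach above, since the parts are kept until you delete them yourself.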
If you find the data useful, I’d be delighted to hear the context in which you made use of it.
Permalink | No Comments
I’ve been working on Wordza.com, for which I needed Ngram data from a sufficiently large corpus. Initially, I thought of using the Wikipedia data I already have on disk, but decided on Project Gutenberg data as it is more representative of general usage of the English language.
Permalink | 2 Comments
I’d thought of making a word quizzer as a web application to improve my vocabulary when I took the GRE a couple of years back. I’d written one in Visual Basic 6 when I took the SAT :), but desktop applications are boring!
I got inspired to bring my long-standing idea to fruition and the outcome is Wordza.

Permalink | 7 Comments
What is Alexa rank?
Alexa collects statistics about visits by internet users to websites through the Alexa Toolbar. Based on the collected data, Alexa computes site ranking. By examining the Alexa rank of a site, you can get a rough idea of how popular the site is. Many argue that Alexa rank is misleading but it has its uses.
The Alexa rank script
You can find out the Alexa rank for any site by using this page. However, if you want to get the Alexa rank programmatically, you can do it using this script.
Permalink | 7 Comments
One of the trickiest and most enjoyable parts of starting something new (be it a website, project, or band) is naming it! Sometimes a good name can be quite elusive and cause more than its fair share of brain ache. Here is a list of automated services around the internet that will help you get name suggestions.
Let us name a domain!

http://www.domaintools.com/
DomainTools takes a concept as input and comes up with domain name suggestions. Let us say you are starting a website about “Vacations in Mexico”. Go to their website and type in “Mexico Vacations” in the text box and click on the button to get suggestions.
Permalink | 2 Comments
The MPEG-4 video encoding process makes use of block motion compensation to achieve compression. Motion compensation is used to produce the inter frames, which are the frames between keyframes. I’ve always been fascinated by this process and was delighted to find out that my favorite video player, mplayer, allows one to visualize it. I tried it and it is wonderful!
Permalink | 2 Comments
Generating thumbnails/screenshots of a video is useful in many ways. YouTube and many other video sites use this to show a preview of the video as a small thumbnail. Google Video captures a series of thumbnails from a video at various time intervals to show a better video preview.
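As a rough illustration of the "series of thumbnails at various time intervals" idea, here is a minimal sketch. It assumes ffmpeg is installed and that you already know the video's length in seconds. The function names, the even spacing, the 160-pixel width and the thumb_N.jpg file names are all choices made for this example, not how any of the sites above actually do it.

```shell
# Hypothetical helper: thumbnail_offsets DURATION_SECONDS COUNT
# Prints COUNT whole-second offsets spaced evenly inside the video,
# skipping the very start and the very end.
thumbnail_offsets() {
    duration=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        echo $(( duration * i / (count + 1) ))
        i=$(( i + 1 ))
    done
}

# Hypothetical helper: make_thumbnails VIDEO DURATION_SECONDS COUNT
# Grabs one frame per offset with ffmpeg, scaled to 160 pixels wide,
# and writes thumb_1.jpg, thumb_2.jpg, ... in the current directory.
make_thumbnails() {
    video=$1
    n=0
    for t in $(thumbnail_offsets "$2" "$3"); do
        n=$(( n + 1 ))
        ffmpeg -loglevel error -y -ss "$t" -i "$video" \
            -frames:v 1 -vf scale=160:-1 "thumb_$n.jpg" < /dev/null
    done
}

# Usage (video.avi is a placeholder, assumed to be 600 seconds long):
#   make_thumbnails video.avi 600 5
```

Putting -ss before -i makes ffmpeg seek in the input instead of decoding everything up to the offset, which keeps each grab quick even on long videos.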
Permalink | 19 Comments

Every once in a while, someone comes up with a way of doing things in an extremely obvious and simple way. When this happens, a zillion others say, “of course that’s the way to do it!”. Songza is a music search engine and jukebox that is dead simple to use. You should try it to really grok how simple the interface is.
Permalink | 11 Comments
I am a huge fan of the science fiction genre. Arthur C. Clarke is one of my favorite science fiction writers after Isaac Asimov. I was saddened to learn that he has passed away. Most people are reminded of “2001: A Space Odyssey” when they hear the name Arthur C. Clarke. I am reminded of “Rendezvous with Rama”, a brilliantly conceived novel that set my imagination on fire. For all the Clarke fans out there, “Rendezvous with Rama”.
Permalink | 2 Comments