Project Gutenberg Ngram data: English only

In my earlier post, I linked to the Project Gutenberg Ngram data I had computed for e-books in all languages. If you are interested only in the English data, get these files instead.

The following two files are parts of a single compressed archive containing all of the Project Gutenberg English e-books, downloaded about a week before the date of this post.

gutenberg_en_files.tar.bz2.0 (2.0GB)

gutenberg_en_files.tar.bz2.1 (1.4GB)

Unigrams with their frequency counts, computed from the text data above

gutenberg_en_unigrams.tar.gz (7.4MB)

Bigrams and trigrams with their frequency counts, computed from the text data above

gutenberg_en_bi_tri_grams.tar.gz (493MB)

I had to split the e-book archive because my webserver cannot serve files larger than 2GB. After downloading both parts, join them back into a single file like this:

mv gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2
cat gutenberg_en_files.tar.bz2.1 >> gutenberg_en_files.tar.bz2
rm gutenberg_en_files.tar.bz2.1
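If you would rather keep the downloaded parts until you know the joined archive is intact, here is a minimal sketch of an alternative. The `join_parts` function is a hypothetical helper written for this illustration, not something shipped with the data, and it assumes `bzip2` is installed. Unlike the commands above, it temporarily needs enough disk space for both the parts and the joined file.

```shell
# Hypothetical helper: join_parts OUTPUT PART...
# Concatenates the parts in the order given, then checks the result.
join_parts() {
    out="$1"
    shift
    # Write all parts, in order, into the output file.
    cat "$@" > "$out" || return 1
    # bzip2 -t tests the integrity of the compressed stream without extracting it.
    bzip2 -t "$out"
}

# Usage with the files from this post (uncomment to run):
# join_parts gutenberg_en_files.tar.bz2 \
#     gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2.1 \
#   && rm gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2.1
# tar -xjf gutenberg_en_files.tar.bz2
```

The parts are only deleted if the integrity check passes, so a truncated download can be fetched again without losing the good part.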


If you find the data useful, I’d be delighted to hear the context in which you made use of it.