( to ) ? be : ! be;

Project Gutenberg Ngram data: English only

In an earlier post, I shared links to the Project Gutenberg Ngram data I had computed for e-books in all languages. If you are interested only in the English data, get these files instead.

The following two files are parts of a single compressed archive, which contains all of the Project Gutenberg English e-books as downloaded about a week before the date of this post.

gutenberg_en_files.tar.bz2.0 (2.0GB)

gutenberg_en_files.tar.bz2.1 (1.4GB)

Unigrams with their frequency counts, computed from the text data above

gutenberg_en_unigrams.tar.gz (7.4MB)

Bi-grams and tri-grams with their frequency counts, computed from the text data above

gutenberg_en_bi_tri_grams.tar.gz (493MB)
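
If you want to see roughly how counts like these can be produced, here is a minimal sketch using standard Unix tools. It is not necessarily how the files above were generated: the tokenization (lowercase everything, keep only letters and apostrophes), the output layout (count, tab, n-gram) and the helper name count_ngrams are all assumptions made for illustration.

```shell
# count_ngrams N: read text on stdin, print "count<TAB>n-gram" lines,
# most frequent first. Hypothetical helper; tokenization is an assumption.
count_ngrams() {
    tr '[:upper:]' '[:lower:]' |      # fold case
    tr -cs "a-z'" '\n' |              # one token per line
    awk -v n="$1" '
        $0 == "" { next }             # skip the empty token a leading separator can produce
        {
            count++
            window[count % n] = $0    # ring buffer holding the last n tokens
            if (count >= n) {
                gram = ""
                for (i = count - n + 1; i <= count; i++)
                    gram = gram (gram == "" ? "" : " ") window[i % n]
                freq[gram]++
            }
        }
        END { for (g in freq) print freq[g] "\t" g }
    ' |
    sort -t "$(printf '\t')" -k1,1nr -k2,2   # by count descending, then n-gram
}

# Example usage (book.txt is a placeholder file name):
#   count_ngrams 1 < book.txt | head
#   count_ngrams 3 < book.txt | head
```

The ring buffer keeps memory use proportional to the number of distinct n-grams rather than to the length of the input, which matters once you run this over a whole book collection.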

I had to split the archive because my web server cannot serve files larger than 2GB. After downloading both parts, join them back into a single file like this:

mv gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2
cat gutenberg_en_files.tar.bz2.1 >> gutenberg_en_files.tar.bz2
rm gutenberg_en_files.tar.bz2.1
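
The commands above modify the downloaded parts in place. If you have the disk space (roughly another 3.4GB), a more cautious variant is to leave the parts untouched, check the joined archive, and only then delete them. The sketch below assumes bzip2 and tar are installed; join_parts and verify_and_extract are hypothetical helper names, not part of any tool.

```shell
# join_parts OUTPUT PART...: concatenate the parts, in the order given,
# into OUTPUT without modifying the parts themselves.
join_parts() {
    out="$1"
    shift
    cat "$@" > "$out"
}

# verify_and_extract ARCHIVE [DIR]: test the bzip2 stream first, then
# unpack into DIR (default: current directory).
verify_and_extract() {
    bzip2 -t "$1" && tar -xjf "$1" -C "${2:-.}"
}

# Example usage:
#   join_parts gutenberg_en_files.tar.bz2 \
#       gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2.1
#   verify_and_extract gutenberg_en_files.tar.bz2 &&
#       rm gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2.1
```

If bzip2 -t reports an error, one of the downloads is probably incomplete, and you still have both parts on disk to compare sizes against the ones listed above.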


If you find the data useful, I’d be delighted to hear the context in which you made use of it.


"Project Gutenberg Ngram data: English only" was published on May 13th, 2008 and is listed in data mining, linux, text processing.


Prashanth Ellina is powered by WordPress
