Project Gutenberg Ngram data: English only

In an earlier post, I linked to the Project Gutenberg Ngram data I had computed for e-books in all languages. If you are interested only in the English data, get these files instead.

The following two files are parts of a single compressed archive that contains all of the Project Gutenberg English e-books, downloaded about a week before the date of this post.

gutenberg_en_files.tar.bz2.0 (2.0GB)

gutenberg_en_files.tar.bz2.1 (1.4GB)

Unigrams with their frequency counts, computed from the text data above

gutenberg_en_unigrams.tar.gz (7.4MB)

Bigrams and trigrams with their frequency counts, computed from the text data above

gutenberg_en_bi_tri_grams.tar.gz (493MB)

I had to split the e-book archive into two parts because my web server cannot serve files larger than 2GB. After downloading both parts, join them back together like this:

mv gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2
cat gutenberg_en_files.tar.bz2.1 >> gutenberg_en_files.tar.bz2
rm gutenberg_en_files.tar.bz2.1
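
If you would rather leave the downloaded parts untouched until the result has been checked, the steps above can be sketched a little more defensively. This is only a minimal sketch: `join_parts` is a hypothetical helper name of my choosing, and the commented check and unpack commands assume that `bzip2` and a `tar` with `-j` support are installed.

```shell
# join_parts OUT PART...
# Hypothetical helper: concatenates the given parts, in the order given,
# into OUT. The parts themselves are left as they are.
join_parts() {
    out="$1"
    shift
    cat -- "$@" > "$out"
}

# Usage, assuming both parts are in the current directory:
#   join_parts gutenberg_en_files.tar.bz2 \
#       gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2.1
#   bzip2 -t gutenberg_en_files.tar.bz2    # integrity check, prints nothing on success
#   tar -xjf gutenberg_en_files.tar.bz2    # unpack the e-books
```

Unlike the mv/cat approach, this needs room for a second copy of the data (roughly 3.4GB) until you delete the parts.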


If you find the data useful, I’d be delighted to hear the context in which you made use of it.

"Project Gutenberg Ngram data: English only" was published on May 13th, 2008 and is listed in data mining, linux, text processing.

Prashanth Ellina is powered by WordPress
