In my earlier post, I linked to the Project Gutenberg n-gram data I had computed from e-books in all languages. If you are interested only in the English data, get these files instead.
These two files are the two parts of a single compressed archive containing all of the Project Gutenberg English e-books, downloaded about a week before the date of this post.
gutenberg_en_files.tar.bz2.0 (2.0GB)
gutenberg_en_files.tar.bz2.1 (1.4GB)
Unigrams with their frequency counts, computed from the text data above
gutenberg_en_unigrams.tar.gz (7.4MB)
Bigrams and trigrams with their frequency counts, computed from the text data above
gutenberg_en_bi_tri_grams.tar.gz (493MB)
I had to split the archive because my web server cannot serve files larger than 2GB. After downloading both parts, join them back into a single file:
mv gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2
cat gutenberg_en_files.tar.bz2.1 >> gutenberg_en_files.tar.bz2
rm gutenberg_en_files.tar.bz2.1
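If you would rather leave the downloaded parts untouched until you know the joined archive is good, something like the sketch below may help. It writes the parts into a new file instead of renaming and deleting them. The helper name join_parts is made up for illustration and is not part of the original instructions, and this approach assumes you have roughly 3.4GB of extra disk space for the joined copy.

```shell
# join_parts is a hypothetical helper: it concatenates the given parts,
# in the order listed, into a new output file and leaves the parts alone.
join_parts() {
    out="$1"
    shift
    cat "$@" > "$out"
}

# Example usage, once both parts are in the current directory:
#   join_parts gutenberg_en_files.tar.bz2 \
#       gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2.1
#   bzip2 -t gutenberg_en_files.tar.bz2   # check the joined archive for corruption
#   tar -xjf gutenberg_en_files.tar.bz2   # extract the e-books
```

Once bzip2 -t reports no errors, the two part files can be deleted by hand.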
If you find the data useful, I’d be delighted to hear the context in which you made use of it.
Tags: data mining, ngram, project gutenberg, text processing














