These two files are the two parts of a single compressed archive containing all of the Project Gutenberg English e-books, downloaded about a week before the date of this post.
Unigrams along with frequency counts from the text data above
Bigrams and trigrams along with frequency counts from the text data above
I had to split the archive because my webserver cannot serve files larger than 2GB. After downloading both parts, rejoin them like this:
mv gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2
cat gutenberg_en_files.tar.bz2.1 >> gutenberg_en_files.tar.bz2
rm gutenberg_en_files.tar.bz2.1
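If you would rather keep the downloaded parts until you know the rejoined archive is good, something like the sketch below might be safer. The `join_parts` function is a hypothetical helper of my own naming, not something shipped with the data; it assumes both parts sit in the current directory and that you have enough free disk space for a second copy.

```shell
# join_parts BASE
# Hypothetical helper: rebuild BASE from BASE.0 and BASE.1 without
# modifying or deleting the downloaded parts.
join_parts() {
    base="$1"

    # Refuse to write anything unless both parts are present.
    if [ ! -f "$base.0" ] || [ ! -f "$base.1" ]; then
        echo "missing $base.0 or $base.1" >&2
        return 1
    fi

    # Concatenate the parts in order into a new file.
    cat "$base.0" "$base.1" > "$base"
}
```

After running `join_parts gutenberg_en_files.tar.bz2`, you could check the result with `bzip2 -t gutenberg_en_files.tar.bz2` and delete the two parts only once that check passes.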
If you find the data useful, I’d be delighted to hear the context in which you made use of it.