I’ve been working on Wordza.com, for which I needed Ngram data from a sufficiently large corpus. Initially I thought of using the Wikipedia data I already have on disk, but decided on Project Gutenberg data instead, as it is more representative of general usage of the English language.
Get Project Gutenberg Ngram data
The Ngram data contains bi-grams and tri-grams for now; I plan to generate uni-grams soon. I’ve made the data available here so you can download and use it! The data covers all of the e-books hosted by Project Gutenberg (which means it contains English, French, German and other languages). If you want an English-only dataset, check back in a week or two; I am in the process of generating it.
The Ngram data, containing bi-grams and tri-grams. Each line is prepended with the occurrence count.
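Each line has the count-then-ngram layout produced by uniq -c (see the processing steps below). For illustration only, the lines look like this (these counts are made up, not taken from the actual data):

 120035 of the
  48712 it was
  10033 one of the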
gutenberg_ngrams.tar.bz2 (624 MB)
This is the compressed tarball of all the txt files in Project Gutenberg (as of a week before this blog post). Note that you don’t need this file unless you want to generate the Ngrams yourself using the scripts provided below.
gutenberg_files.tar.bz2.0, gutenberg_files.tar.bz2.1, gutenberg_files.tar.bz2.2 (5.3 GB total)
My webserver (Apache) has a problem serving files larger than 2 GB, so I had to split the file up. After you download the splits, join them like this:
mv gutenberg_files.tar.bz2.0 gutenberg_files.tar.bz2
cat gutenberg_files.tar.bz2.1 >> gutenberg_files.tar.bz2
cat gutenberg_files.tar.bz2.2 >> gutenberg_files.tar.bz2
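If you are on Windows and don’t have cat, a few lines of Python will do the same join. This is just a sketch; it assumes the three split names used above:

# Concatenate the three split files into one tarball.
# Assumes the split names used above; adjust if yours differ.
parts = ["gutenberg_files.tar.bz2.%d" % i for i in range(3)]
with open("gutenberg_files.tar.bz2", "wb") as out:
    for part in parts:
        with open(part, "rb") as f:
            while True:
                chunk = f.read(1024 * 1024)  # copy in 1 MB chunks
                if not chunk:
                    break
                out.write(chunk)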
To decompress the tarballs, you will need bunzip2 on *nix/Cygwin; with GNU tar, tar -xjf gutenberg_ngrams.tar.bz2 decompresses and extracts in one step. On Windows, use 7-Zip.
Generate the data yourself
If you want to generate the Ngrams yourself by processing the Project Gutenberg data files, follow these instructions. First, get the Project Gutenberg data files. The following command fetches all the English language files in txt format (-w 2 waits two seconds between requests to go easy on the server, and -m mirrors recursively):
mkdir gutenberg
cd gutenberg
wget -w 2 -m "http://www.gutenberg.org/robot/harvest?filetypes=txt&langs=en"
The txt files are compressed into archives ending with a .zip extension, and these zip files are spread across multiple directories. The following command moves the zip files into the “gutenberg” directory you created in the step above.
for i in `find . -name "*.zip"`; do mv $i . ; done;
Now that all the zip files are in the same directory, unzip them. Some zip files may contain files other than .txt’s; the following command extracts only the .txt’s from the zip files.
cd ..
mkdir gutenberg_txt
for i in `find gutenberg -name "*.zip"`; do unzip $i \*.txt -d gutenberg_txt/ ; done;
cd gutenberg_txt
for i in `find . -name "*.txt"`; do mv $i . ; done;
cd ..
The Gutenberg txt files have Project Gutenberg headers and footers, which should be removed lest they skew the Ngram frequencies. The script “remove_gutenberg_text.py” does exactly this. The “generate_ngrams.py” script creates uni-, bi- and tri-grams from whatever text is piped into it. The following command pipes all the txt files through both scripts to create the ngrams file.
for i in `find gutenberg_txt/ -name "*.txt"`; \
do cat $i | python remove_gutenberg_text.py | \
grep -i -v "project gutenberg" | \
python generate_ngrams.py >> gutenberg_ngrams; done;
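The real generate_ngrams.py is linked at the end of this post. As a rough idea of what it does, here is a minimal sketch; the tokenization and lowercasing are my assumptions, not necessarily what the actual script does:

# Minimal ngram generator sketch: reads text on stdin and prints
# every uni-, bi- and tri-gram, one per line.
import re
import sys

words = re.findall(r"[a-z']+", sys.stdin.read().lower())
for n in (1, 2, 3):
    for i in range(len(words) - n + 1):
        print(" ".join(words[i:i + n]))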
Now you have to count the number of times each ngram occurs. The following sequence of commands processes the ngrams file generated above and produces a file with the frequency counts of the ngrams (the mkdir gives sort a place to keep its temporary files). Note that the “-S 512K” option to sort is there because I had to run these scripts on my host, which kills processes that take too much memory. If you have a machine with a lot of memory, sorting can be significantly faster if you use a higher value, say “1G”.
mkdir tmp_sort
sort -S 512K -T tmp_sort/ gutenberg_ngrams > gutenberg_ngrams.sorted
uniq -c gutenberg_ngrams.sorted > gutenberg_ngrams.counted
sort -S 512K -T tmp_sort/ gutenberg_ngrams.counted > gutenberg_ngrams.counted.sorted
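Once you have the final file, loading it for lookups in your own code is straightforward. A sketch (the file name matches the last command above; the parsing assumes uniq -c’s count-then-ngram layout):

# Load "count ngram" lines into a dict keyed by ngram.
counts = {}
with open("gutenberg_ngrams.counted.sorted") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        count, ngram = line.split(None, 1)  # split on the first whitespace run
        counts[ngram] = int(count)

print(counts.get("of the", 0))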
Gutenberg data processing scripts
- remove_gutenberg_text.py — removes Project Gutenberg header and footer from txt files (a simplified sketch of the idea appears below)
- generate_ngrams.py — generates uni, bi and tri-grams for any text
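The header and footer markers vary somewhat across Project Gutenberg files, so the real remove_gutenberg_text.py has to be more forgiving. A simplified sketch of the idea, assuming the common “*** START OF …” and “*** END OF …” marker lines:

# Simplified sketch: print only the lines between the
# "*** START OF ..." and "*** END OF ..." markers on stdin.
import sys

in_body = False
for line in sys.stdin:
    upper = line.upper()
    if "*** START OF" in upper:
        in_body = True
        continue
    if "*** END OF" in upper:
        break
    if in_body:
        sys.stdout.write(line)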
Do get back
If you use this data, I would really appreciate it if you got back to me with details about how you used it in the context of your project.