N-gram data from Project Gutenberg

I’ve been working on Wordza.com, for which I needed Ngram data from a sufficiently large corpus. Initially, I thought of using the Wikipedia data I already have on disk, but decided on Project Gutenberg data, as it is more representative of general usage of the English language.

Get Project Gutenberg Ngram data

The Ngram data contains bi-grams and tri-grams for now; I plan to generate uni-grams soon. I’ve made the data available here so you can download and use it! The data covers all of the e-books hosted by Project Gutenberg (which means it contains English, French, German and other languages). If you want an English-only dataset, check back in a week or two; I am in the process of generating it.

The Ngram data, containing bi-grams and tri-grams. Each line is prefixed with the occurrence count.

gutenberg_ngrams.tar.bz2 (624 MB)


This is the compressed tarball of all the txt files in Project Gutenberg (as of a week before this blog post). Note that you don’t need this file unless you want to generate the Ngrams yourself using the scripts provided below.

gutenberg_files.tar.bz2.0,
gutenberg_files.tar.bz2.1,
gutenberg_files.tar.bz2.2 (5.3 GB)

My webserver (Apache) has trouble serving files larger than 2 GB, so I had to split the file. After you download the splits, join them like this:

mv gutenberg_files.tar.bz2.0 gutenberg_files.tar.bz2
cat gutenberg_files.tar.bz2.1 >> gutenberg_files.tar.bz2
cat gutenberg_files.tar.bz2.2 >> gutenberg_files.tar.bz2


To decompress the files, you will need bunzip2 on *nix/Cygwin. On Windows, use 7zip.
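
Once decompressed, the n-gram files have a simple layout: each line is an occurrence count followed by the n-gram itself (the uniq -c style produced by the commands further down). As a rough illustration of how you might query the data, here is a small Python 3 snippet that pulls out the most frequent phrases containing a given word; the file name, script name, encoding and exact behaviour are my assumptions, not part of the distribution.

import sys

def phrases_containing(path, word, limit=10):
    # Scan a count-prefixed n-gram file and collect the most frequent
    # n-grams that contain the given word.
    word = word.lower()
    matches = []
    with open(path, encoding="latin-1") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2 or not parts[0].isdigit():
                continue
            count, ngram = int(parts[0]), " ".join(parts[1:])
            if word in ngram.lower().split():
                matches.append((count, ngram))
    matches.sort(reverse=True)
    return matches[:limit]

if __name__ == "__main__":
    # e.g. python query_ngrams.py gutenberg_ngrams.counted.sorted abysmal
    for count, ngram in phrases_containing(sys.argv[1], sys.argv[2]):
        print(count, ngram)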

Generate the data yourself

If you want to generate the Ngrams yourself from the Project Gutenberg data files, follow these instructions. First you will have to get the Project Gutenberg files themselves. The following command fetches all the English-language files in txt format.

mkdir gutenberg
cd gutenberg
wget -w 2 -m "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"


The txt files are compressed and stored in files with a .zip extension, spread across multiple directories. The following command moves the zip files into the “gutenberg” directory created in the step above.

for i in `find . -name "*.zip"`; do mv $i . ; done;


Now that all the zip files are in the same directory, unzip them. Some zip files contain files other than .txt files; the following command extracts only the .txt files.

cd ..
mkdir gutenberg_txt
for i in `find gutenberg -name "*.zip"`; do unzip $i \*.txt -d gutenberg_txt/ ; done;
cd gutenberg_txt
for i in `find . -name "*.txt"`; do mv $i . ; done;
cd ..


The Gutenberg txt files carry Project Gutenberg headers and footers which should be removed, lest they skew the Ngram frequencies. The script “remove_gutenberg_text.py” does exactly this. The “generate_ngrams.py” script creates uni-, bi- and tri-grams from whatever text is piped into it. The following command pipes all the txt files through both scripts to create the ngrams file (a rough sketch of the generate_ngrams.py idea follows the command).

for i in `find gutenberg_txt/ -name "*.txt"`; \
do cat $i | python remove_gutenberg_text.py | \
grep -i -v "project gutenberg" |\
 python generate_ngrams.py >> gutenberg_ngrams; done;
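
The two scripts are available below under “Gutenberg data processing scripts”; remove_gutenberg_text.py presumably drops the licence boilerplate surrounding the actual book text. To give an idea of the second step, here is a minimal sketch of a generate_ngrams.py-style filter that reads text on stdin and writes one n-gram per line; the tokenization and output format are my assumptions and not necessarily what the real script does.

import re
import sys

def ngrams(tokens, n):
    # Yield successive windows of n tokens as space-joined strings.
    for i in range(len(tokens) - n + 1):
        yield " ".join(tokens[i:i + n])

def main():
    text = sys.stdin.read().lower()
    # Crude tokenization: keep runs of letters and apostrophes only.
    tokens = re.findall(r"[a-z']+", text)
    for n in (2, 3):  # add 1 here to emit uni-grams as well
        for gram in ngrams(tokens, n):
            sys.stdout.write(gram + "\n")

if __name__ == "__main__":
    main()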


Now you have to count the number of times each ngram occurs. The following sequence of commands processes the ngrams file generated above and produces a file with the frequency counts of the ngrams. Note that the “512K” option to sort is there because I had to run these scripts on my host, which kills processes that use too much memory. If you have a machine with a lot of memory, sorting can be significantly faster if you use a higher value, say “1G”.

mkdir -p tmp_sort
sort -S 512K -T tmp_sort/ gutenberg_ngrams > gutenberg_ngrams.sorted
uniq -c gutenberg_ngrams.sorted > gutenberg_ngrams.counted
sort -S 512K -T tmp_sort/ gutenberg_ngrams.counted > gutenberg_ngrams.counted.sorted
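
If your machine has plenty of memory, the same counting can be done in a single pass with Python’s collections.Counter instead of the external sort; the trade-off is that every distinct n-gram is held in memory at once, which is exactly what the sort -S 512K route avoids on a constrained host. A rough sketch, using the file names from the commands above:

from collections import Counter

counts = Counter()
with open("gutenberg_ngrams", encoding="latin-1") as f:
    for line in f:
        ngram = line.rstrip("\n")
        if ngram:
            counts[ngram] += 1

with open("gutenberg_ngrams.counted", "w", encoding="latin-1") as out:
    # most_common() yields (ngram, count) pairs, highest count first
    for ngram, count in counts.most_common():
        out.write("%d %s\n" % (count, ngram))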


Gutenberg data processing scripts

Do get back

If you use this data, I would really appreciate it if you got back to me with details about how you used it in your project.

4 Comments

  1. Som

    Hi Prashant,

    Thanks for the wonderful blog you do. I was curious to know what kind of problem you faced with Wikipedia data that is not there in the Gutenberg data. Maybe the related question is how you use the N-gram statistics in Wordza and why you think the statistics obtained from the Gutenberg data are better.

    Thanks,
    Som

    Posted May 5, 2008 at 9:08 am
  2. I am glad you enjoy my blog! I did not attempt Ngram generation on Wikipedia data, so I don’t have any hard numbers to support my argument. However, I’ll explain my reasoning. I need the Ngram data for a feature of Wordza where the user gets to see the most frequently used phrases containing a given word, e.g. “abysmal corruption” and “abysmal conditions” for the word abysmal. Wikipedia, being an encyclopedia, will tend to have a sanitized usage of English where many of the “non-frequent” words in English don’t even occur. In comparison, Gutenberg data is better suited because it is English literature. When I can, I’ll try to generate Ngram data for Wikipedia. It will be interesting to compare the results.

    Posted May 5, 2008 at 11:23 am
  3. sehugg

    I think there are cases for both: your site is more of an English reference, so it makes sense to use the sanitized Gutenberg data, whereas a site that needs up-to-date terminology would want to use Wikipedia.

    Posted August 27, 2009 at 8:59 pm
  4. Yes, indeed.

    Posted August 28, 2009 at 7:59 am