N-gram data from Project Gutenberg

I’ve been working on Wordza.com, for which I needed Ngram data from a sufficiently large corpus. Initially I thought of using the Wikipedia data that I already have on disk, but I decided on Project Gutenberg data because it is more representative of general English usage.

Get Project Gutenberg Ngram data

The Ngram data contains bi-grams and tri-grams for now; I plan to generate uni-grams soon. I’ve made the data available here so you can download and use it! The data covers all of the e-books hosted by Project Gutenberg, which means it contains English, French, German and other languages. If you want an English-only dataset, check back in a week or two; I am in the process of generating it.

The Ngram data (bi-grams and tri-grams). Each line is prefixed with its occurrence count.

gutenberg_ngrams.tar.bz2 (624 MB)
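
As a rough sketch of how such a file could be read: the snippet below assumes each line looks like the output of `uniq -c`, that is, optional leading spaces, the count, then the words of the Ngram. The function name `parse_counted_line` and the sample lines are made up for illustration and are not taken from the data itself.

```python
def parse_counted_line(line):
    """Split one 'count w1 w2 [w3]' line into (count, tuple_of_words)."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected a count followed by at least one word: %r" % line)
    return int(fields[0]), tuple(fields[1:])


# Illustrative lines only (assumed format, not real counts from the dataset).
sample = ["     12 abysmal conditions", "      3 of the people"]
for raw in sample:
    count, words = parse_counted_line(raw)
    print(count, " ".join(words))
```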


This is the compressed tarball of all the txt files in Project Gutenberg (as of a week before this blog post). Note that you don’t need this file unless you want to generate the Ngrams yourself using the scripts provided below.

gutenberg_files.tar.bz2.0,
gutenberg_files.tar.bz2.1,
gutenberg_files.tar.bz2.2 (5.3 GB)

My webserver (Apache) has a problem serving files bigger than 2 GB, so I had to split the file up. After you download the parts, join them like this:

mv gutenberg_files.tar.bz2.0 gutenberg_files.tar.bz2
cat gutenberg_files.tar.bz2.1 >> gutenberg_files.tar.bz2
cat gutenberg_files.tar.bz2.2 >> gutenberg_files.tar.bz2
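
If `cat` is not available (on Windows without Cygwin, for example), the same join can be sketched in Python. This is a minimal sketch, not something I ran against the real files; `join_parts` is a hypothetical helper name. Unlike the commands above, it writes a new output file and leaves the downloaded parts untouched.

```python
import shutil


def join_parts(part_paths, output_path):
    """Concatenate the split files, in the order given, into output_path."""
    with open(output_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                # Copy in chunks so the multi-gigabyte parts never sit in memory.
                shutil.copyfileobj(src, out)


# Example call, using the file names from this post:
# join_parts(
#     ["gutenberg_files.tar.bz2.%d" % i for i in range(3)],
#     "gutenberg_files.tar.bz2",
# )
```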


To decompress the files, you will need bunzip2 on *nix/Cygwin. On Windows, use 7-Zip.
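
Python's standard `tarfile` module can also unpack a .tar.bz2 archive on any platform. The sketch below assumes you trust the archive (extraction writes whatever paths the tarball contains); `extract_tarball` is a hypothetical helper name.

```python
import os
import tarfile


def extract_tarball(archive_path, dest_dir):
    """Unpack a bzip2-compressed tarball into dest_dir and return member names."""
    os.makedirs(dest_dir, exist_ok=True)
    with tarfile.open(archive_path, "r:bz2") as archive:
        names = archive.getnames()
        archive.extractall(dest_dir)
    return names


# Example call, using a file name from this post:
# extract_tarball("gutenberg_ngrams.tar.bz2", "gutenberg_ngrams_data")
```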

Generate the data yourself

In case you want to generate the Ngrams yourself by processing the Project Gutenberg data files, follow these instructions. First you have to get the Project Gutenberg data files. Use the following commands to get all the English-language files in txt format.

mkdir gutenberg
cd gutenberg
wget -w 2 -m "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&langs[]=en"


The txt files are compressed and stored in files ending with .zip extension. These zip files are spread across multiple directories. The following command will move the zip files into the “gutenberg” directory you created in the above step.

find . -mindepth 2 -name "*.zip" -exec mv {} . \;


Now that all the zip files are in the same directory, unzip them. Some zip files may contain files other than .txt’s. The following commands extract only the .txt’s from the zip files.

cd ..
mkdir gutenberg_txt
find gutenberg -name "*.zip" -exec unzip {} "*.txt" -d gutenberg_txt/ \;
cd gutenberg_txt
find . -mindepth 2 -name "*.txt" -exec mv {} . \;
cd ..
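
The same extraction step can be sketched with Python's standard `zipfile` module, which avoids depending on the `unzip` binary. This is only an illustration under my own assumptions: `extract_txt_members` is a hypothetical helper, it flattens any directories inside the archive, and a later file silently overwrites an earlier one with the same name.

```python
import os
import zipfile


def extract_txt_members(zip_path, dest_dir):
    """Extract only the .txt members of zip_path into dest_dir (flattened)."""
    os.makedirs(dest_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            # Drop any directory part; directory entries end up with an empty name.
            name = os.path.basename(info.filename)
            if not name.endswith(".txt"):
                continue
            target = os.path.join(dest_dir, name)
            with archive.open(info) as src, open(target, "wb") as dst:
                dst.write(src.read())
            extracted.append(target)
    return extracted


# Example call over a directory of downloaded zips:
# for file_name in os.listdir("gutenberg"):
#     if file_name.endswith(".zip"):
#         extract_txt_members(os.path.join("gutenberg", file_name), "gutenberg_txt")
```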


The Gutenberg txt files have Gutenberg headers and footers which should be removed lest they skew the frequency of Ngrams. The script “remove_gutenberg_text.py” does exactly this. The “generate_ngrams.py” script creates uni-, bi- and tri-grams of whatever text is piped into it. The following command pipes all the txt files through both scripts to create the ngrams file.

for i in `find gutenberg_txt/ -name "*.txt"`; \
do cat $i | python remove_gutenberg_text.py | \
grep -i -v "project gutenberg" |\
 python generate_ngrams.py >> gutenberg_ngrams; done;
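
To make the two steps concrete, here is a minimal sketch of what such scripts might do. It is not the code of “remove_gutenberg_text.py” or “generate_ngrams.py”; the function names, the `*** START OF` / `*** END OF` marker prefixes and the tokenization (lowercased runs of letters and apostrophes) are all my assumptions for illustration. Real Gutenberg files vary in how they mark the header and footer, so a robust version would need more patterns.

```python
import re

# Assumed marker prefixes; many, but not all, Gutenberg files use them.
START_MARKER = "*** START OF"
END_MARKER = "*** END OF"
WORD_RE = re.compile(r"[a-z']+")


def strip_gutenberg_boilerplate(lines):
    """Keep only the lines between the START and END marker lines.

    If no START marker is found, everything up to the END marker is kept.
    """
    lines = list(lines)
    start, end = 0, len(lines)
    for index, line in enumerate(lines):
        if line.startswith(START_MARKER):
            start = index + 1
        elif line.startswith(END_MARKER):
            end = index
            break
    return lines[start:end]


def ngrams(text, sizes=(1, 2, 3)):
    """Yield every n-gram of the given sizes as a space-joined string."""
    words = WORD_RE.findall(text.lower())
    for n in sizes:
        for i in range(len(words) - n + 1):
            yield " ".join(words[i:i + n])


# Small made-up document to show the two steps together.
sample = [
    "Some license header\n",
    "*** START OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n",
    "To be, or not to be\n",
    "*** END OF THIS PROJECT GUTENBERG EBOOK SAMPLE ***\n",
    "Some license footer\n",
]
body = "".join(strip_gutenberg_boilerplate(sample))
for gram in ngrams(body, sizes=(2,)):
    print(gram)
```

To use this as a filter in the pipeline above, pass `sys.stdin` to `strip_gutenberg_boilerplate` instead of the sample list and print each Ngram on its own line, so that sort and uniq can count them afterwards.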


Now you have to count the number of times each ngram occurs. The following sequence of commands processes the ngrams file generated above and produces a file with the frequency counts of the ngrams. I pass “-S 512K” to sort because I had to run these scripts on my host, which kills processes that take too much memory. If you have a machine with a lot of memory, sorting can be significantly faster with a higher value, say “1G”.

mkdir -p tmp_sort
sort -S 512K -T tmp_sort/ gutenberg_ngrams > gutenberg_ngrams.sorted
uniq -c gutenberg_ngrams.sorted > gutenberg_ngrams.counted
sort -n -S 512K -T tmp_sort/ gutenberg_ngrams.counted > gutenberg_ngrams.counted.sorted
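
For a small corpus, the same counting can be sketched in Python with `collections.Counter`. Note that this is a different technique from the commands above: it keeps every distinct ngram in memory, so it only suits inputs far smaller than the full Gutenberg set, where the disk-based sort is the safer choice. The helper names are hypothetical, and the output here is ordered most frequent first.

```python
from collections import Counter


def count_ngrams(lines):
    """Count identical lines, ignoring trailing newlines and blank lines."""
    counts = Counter()
    for line in lines:
        gram = line.rstrip("\n")
        if gram:
            counts[gram] += 1
    return counts


def format_counts(counts):
    """Return 'count ngram' lines, most frequent first, ties in alphabetical order."""
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ["%d %s" % (count, gram) for gram, count in ordered]


# Made-up input: one ngram per line, as the generation step would emit.
sample = ["to be\n", "be or\n", "to be\n"]
for row in format_counts(count_ngrams(sample)):
    print(row)
```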


Gutenberg data processing scripts

Do get back

If you use this data, I would really appreciate it if you got back to me with details about how you used it in your project.

"N-gram data from Project Gutenberg" was published on May 4th, 2008 and is listed in data mining, linux, programming, python, text processing.

Comments on "N-gram data from Project Gutenberg": 5 Comments

  1. Som wrote,

    Hi Prashant,

    Thanks for the wonderful blog. I was curious to know what kind of problem you faced with the Wikipedia data that is not present in the Gutenberg data. Maybe the related question is how you use the N-gram statistics in Wordza and why you think the statistics obtained from the Gutenberg data are better.

    Thanks,
    Som

  2. prashanthellina wrote,

    I am glad you enjoy my blog! I did not attempt Ngram generation on Wikipedia data, so I don’t have any hard numbers to support my argument. However, I’ll explain my reasoning. I need the Ngram data for a feature of Wordza where the user gets to see the most frequently used phrases containing a given word, e.g. “abysmal corruption” and “abysmal conditions” for the word abysmal. Wikipedia, being an encyclopedia, will tend to have a sanitized usage of English in which many of the “non-frequent” English words don’t even occur. In comparison, Gutenberg data is better suited because it is English literature. When I can, I’ll try to generate Ngram data for Wikipedia. It will be interesting to compare the results.

  3. links for 2009-08-13 « Blarney Fellow wrote,

    [...] N-gram data from Project Gutenberg | Prashanth Ellina (tags: nlp dataset linguistics) [...]

  4. sehugg wrote,

    I think there are cases for both: your site is more of an English reference, so it makes sense to use sanitized Gutenberg data, whereas a site that needs up-to-date terminology would want to use Wikipedia.

  5. prashanthellina wrote,

    Yes, indeed.
