In my earlier post, I posted links to the Project Gutenberg Ngram data I had computed for e-books in all languages. If you are interested only in the English data, get these files instead. These two files are splits of a compressed file which contains all of the Project Gutenberg English e-books downloaded about a [...]
Categories: data mining, linux, text processing
Tagged: data mining, ngram, project gutenberg, text processing
Published: May 13, 2008 – 9:54 pm
Author: prashanthellina
I’ve been working on Wordza.com, for which I needed Ngram data from a sufficiently large corpus. Initially, I thought of using the Wikipedia data that I already have on disk, but I decided to use Project Gutenberg data instead, as it is more representative of general English usage.
Categories: data mining, linux, programming, python, text processing
Tagged: gutenberg, ngrams, project gutenberg, python, text parsing
Published: May 4, 2008 – 9:58 pm
Author: prashanthellina