I’ve been working on Wordza.com, for which I needed n-gram data from a sufficiently large corpus. Initially I thought of using the Wikipedia data I already have on disk, but I decided on Project Gutenberg data instead, as it is more representative of general usage of the English language.
Categories: data mining, linux, programming, python, text processing
Tagged: gutenberg, ngrams, project gutenberg, python, text parsing