Monthly Archives: May 2008

Nose – TDD – Python 2

What, why: I’ve been reading up on TDD, and it has struck me as a particularly useful methodology for achieving “clean code that works”. TDD encourages writing unit tests that cover all the code (because, by definition, you write a test before you write the code it exercises). Because all your code is covered you are [...]
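To make the test-first idea concrete, here is a minimal sketch rather than code from the post itself. The function `word_count` and both test names are hypothetical; the only assumption about Nose is that it collects module-level functions whose names start with `test`.

```python
# Step 1 (red): the tests are written first and describe the behaviour we want.
# Running them before word_count exists would fail with a NameError.
def test_word_count_empty():
    assert word_count("") == 0


def test_word_count_simple():
    assert word_count("clean code that works") == 4


# Step 2 (green): the simplest implementation that makes both tests pass.
# word_count is a hypothetical example function, not part of any library.
def word_count(text):
    # str.split() with no arguments splits on any run of whitespace
    # and returns an empty list for an empty string.
    return len(text.split())
```

Saved as, say, `test_wordcount.py`, a file like this should be picked up by running `nosetests` in the same directory, since the test functions are plain functions that use bare `assert`.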

Project Gutenberg Ngram data: English only

In my earlier post, I posted links to the Project Gutenberg Ngram data I had computed for e-books in all languages. If you are interested only in the English data, get these files instead. These two files are splits of a compressed file which contains all of the Project Gutenberg English e-books downloaded about a [...]

N-gram data from Project Gutenberg

I’ve been working on Wordza.com, for which I needed Ngram data from a sufficiently large corpus. Initially, I thought of using the Wikipedia data I already have on disk, but I decided on Project Gutenberg data because it is more representative of general usage of the English language.
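For readers unfamiliar with the term, the sketch below shows one simple way to count Ngrams in a piece of text. It is an illustration only, not the script used to produce the Project Gutenberg data; the helper names `ngrams` and `count_ngrams` and the crude tokenizing regex are assumptions made for this example.

```python
import re
from collections import Counter


def ngrams(tokens, n):
    # Zip the token list against shifted copies of itself, so each
    # tuple holds n consecutive tokens.
    return zip(*(tokens[i:] for i in range(n)))


def count_ngrams(text, n):
    # Crude tokenizer (an assumption): lowercase the text and keep runs
    # of ASCII letters and apostrophes. A real corpus needs more care.
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(ngrams(tokens, n))


if __name__ == "__main__":
    counts = count_ngrams("The cat sat on the cat", 2)
    print(counts.most_common(3))
```

A `Counter` keyed by tuples keeps the example short; for a corpus the size of Project Gutenberg the counts would most likely need to be accumulated file by file and written to disk rather than held in memory.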