Tag Archives: text processing

Project Gutenberg Ngram data: English only

In my earlier post, I posted links to the Project Gutenberg Ngram data I had computed for e-books in all languages. If you are interested only in the English data, get these files instead. These two files are splits of a compressed file which contains all of the Project Gutenberg English e-books downloaded about a [...]

Topic extraction using Wikipedia data

In an earlier article, I mentioned that I was trying to use Wikipedia data to cluster news articles so that I could follow news feeds more easily. I have made some progress. I’ve written an algorithm to produce a list of Wikipedia articles relevant to the input text. Input text has to be in [...]

Ways to process and use Wikipedia dumps

Wikipedia is a superb resource for reference (taken with a pinch of salt, of course). I spend hours at a time spidering through its pages and always come away amazed at how much information it hosts. In my opinion, this ranks amongst the defining milestones of mankind’s advancement. Apart from being available through http://www.wikipedia.org, [...]