In an earlier post, I linked to the Project Gutenberg Ngram data I had computed for e-books in all languages. If you are interested only in the English data, get these files instead. These two files are splits of a compressed file which contains all of the Project Gutenberg English e-books downloaded about a [...]
Categories: data mining, linux, text processing
Tagged: data mining, ngram, project gutenberg, text processing
- Published:
- May 13, 2008 – 9:54 pm
- Author:
- By prashanthellina
In an earlier article, I mentioned that I was trying to use Wikipedia data to cluster news articles, to make it easier for me to follow news feeds. I have made some progress. I’ve written an algorithm to produce a list of Wikipedia articles relevant to the input text. Input text has to be in [...]
Categories: programming, wikipedia
Tagged: data mining, graph, graphviz, programming, python, semantic analysis, text processing, visualization, web, wikipedia
- Published:
- December 21, 2007 – 5:07 pm
- Author:
- By prashanthellina
Wikipedia is a superb resource for reference (taken with a pinch of salt, of course). I spend hours at a time spidering through its pages and always come away amazed at how much information it hosts. In my opinion, this ranks among the defining milestones of mankind’s advancement. Apart from being available through http://www.wikipedia.org, [...]
Categories: programming
Tagged: data mining, programming, python, text processing, wikipedia
- Published:
- October 17, 2007 – 10:02 pm
- Author:
- By prashanthellina