Category Archives: text processing

All your aliases are belong to you

I like setting up shortcuts to frequently used commands, whether I am on Windows or Linux. I use the terminal often and create these shortcuts with the “alias” feature of Bash, which has saved me considerable time in the past. However, I recently felt that if I could have a helper tool to monitor [...]

Extracting relevant text from HTML pages

Some time back I did some work on extracting topics from an arbitrary piece of text using Wikipedia data. Recently I thought of a way to put that algorithm to work. As part of this project, I need to extract the relevant text from an arbitrary HTML page. By relevant I mean the “meat” [...]

Clustering Data using Python

As part of a project I am working on, I had to cluster the URLs on a page. After some light googling I found python-cluster. Below is a simple Python script that illustrates how to use the python-cluster library.
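To give a feel for the idea of grouping similar URLs, here is a rough sketch that uses only the standard library. It does not use python-cluster and is not the script from the post: the function name `cluster_urls`, the 0.8 similarity threshold, and the example URLs are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return a string similarity score between 0.0 and 1.0."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_urls(urls, threshold=0.8):
    """Greedily group URLs: a URL joins the first cluster that already
    contains a member at least `threshold` similar to it, otherwise it
    starts a new cluster. The threshold is an illustrative assumption."""
    clusters = []
    for url in urls:
        for group in clusters:
            if any(similarity(url, member) >= threshold for member in group):
                group.append(url)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([url])
    return clusters


if __name__ == "__main__":
    # Hypothetical example URLs.
    sample = [
        "http://example.com/news/1",
        "http://example.com/news/2",
        "http://example.org/about-us/team",
    ]
    for group in cluster_urls(sample):
        print(group)
```

The greedy pass depends on input order and on the chosen threshold, so a proper hierarchical clustering library is likely a better fit for anything beyond a quick experiment.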

Microsoft Surface Unboxing

Today, we received the shipment from Microsoft at Veveo. If you have not heard of Microsoft Surface before, it is a touch-screen-based computer embedded in a table. The surface of the table is illuminated from underneath by a projector (rear-projection) and touch input is implemented by reflecting IR radiation off the fingers and then [...]

Project Gutenberg Ngram data: English only

In my earlier post, I’d posted links to the Project Gutenberg Ngram data I had computed for e-books of all languages. If you are interested in only the English data, get these files instead. These two files are splits of a compressed file which contains all of the Project Gutenberg English e-books downloaded about a [...]

N-gram data from Project Gutenberg

I’ve been working on Wordza.com, for which I needed N-gram data from a sufficiently large corpus. Initially, I thought of using the Wikipedia data I already have on disk, but I decided to use Project Gutenberg data instead, as it is more representative of general usage of the English language.
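For readers unfamiliar with N-gram data, the following is a minimal sketch of how word N-grams can be counted from plain text. It is not the code used to produce the Project Gutenberg data: the tokenizer (lowercased runs of letters and apostrophes) and the names `tokenize` and `count_ngrams` are assumptions for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z']+", text.lower())


def count_ngrams(text, n=2):
    """Count every run of n consecutive tokens in the text."""
    tokens = tokenize(text)
    return Counter(
        tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
    )


if __name__ == "__main__":
    sample = "the cat sat on the mat the cat"
    for ngram, count in count_ngrams(sample, n=2).most_common(3):
        print(" ".join(ngram), count)
```

For a corpus the size of Project Gutenberg, the counts would likely need to be accumulated file by file and merged, rather than built from a single in-memory string.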