<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Prashanth Ellina &#187; text processing</title>
	<atom:link href="http://blog.prashanthellina.com/category/text-processing/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.prashanthellina.com</link>
	<description>In Pursuit of Truth</description>
	<lastBuildDate>Sun, 28 Nov 2010 09:35:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>All your aliases are belong to you</title>
		<link>http://blog.prashanthellina.com/2009/08/28/all-your-aliases-are-belong-to-you/</link>
		<comments>http://blog.prashanthellina.com/2009/08/28/all-your-aliases-are-belong-to-you/#comments</comments>
		<pubDate>Fri, 28 Aug 2009 02:47:06 +0000</pubDate>
		<dc:creator>prashanthellina</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[text processing]]></category>
		<category><![CDATA[Uncategorized]]></category>
		<category><![CDATA[veveo]]></category>
		<category><![CDATA[bash]]></category>
		<category><![CDATA[productivity]]></category>
		<category><![CDATA[script]]></category>

		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=129</guid>
		<description><![CDATA[I like setting up shortcuts to frequently used commands whether I use Windows or Linux. I use the terminal often and create shortcuts to frequently used commands using the &#8220;alias&#8221; feature of Bash. This has saved me considerable time in the past. However, I recently felt that if I could have a helper tool to monitor [...]]]></description>
			<content:encoded><![CDATA[<p>I like setting up shortcuts to frequently used commands whether I use Windows or Linux. I use the terminal often and create shortcuts to frequently used commands using the &#8220;alias&#8221; feature of Bash. This has saved me considerable time in the past. However, I recently felt that if I could have a helper tool to monitor my usage of commands and automatically suggest candidates for aliasing, that would be useful. The result is Aliaser.</p>
<p>Aliaser works by monitoring your Bash history. It analyses command frequency, suggests candidates for aliasing, and manages the aliases you create. The feature I like most is that it reminds you to use those aliases by showing tips whenever you open a new terminal session.</p>
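<p>The frequency analysis can be pictured with a short sketch. This is not Aliaser's actual code: it assumes the history is already available as a list of command lines (for Bash, typically the contents of ~/.bash_history), and the function name and both thresholds are made up for illustration.</p>
<pre lang="python">
from collections import Counter

def suggest_alias_candidates(history_lines, min_count=3, min_length=8):
    '''
    Return (command, count) pairs for commands that are typed often
    and are long enough to be worth aliasing, most frequent first.
    '''
    # count every non-empty history line as one use of that command
    counts = Counter(line.strip() for line in history_lines if line.strip())
    candidates = [(cmd, n) for cmd, n in counts.most_common()
                  if n >= min_count and len(cmd) >= min_length]
    return candidates

# hypothetical history, only for demonstration
history = [
    'ls',
    'git status',
    'ssh deploy@staging.example.com',
    'git status',
    'ssh deploy@staging.example.com',
    'ls',
    'ssh deploy@staging.example.com',
    'git status',
    'ls',
]

for cmd, n in suggest_alias_candidates(history):
    print('%d  %s' % (n, cmd))
</pre>
<p>Short commands such as ls are skipped here because an alias would save almost no typing. A real tool would probably also want to ignore commands that are already aliased.</p>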
<p>Download Aliaser from <a href="http://aliaser.googlecode.com">http://aliaser.googlecode.com</a>.</p>
<p><a href="http://aliaser.googlecode.com"><br />
<img align="center" src="http://aliaser.googlecode.com/files/aliaser_tips.png"/><br />
</a></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.prashanthellina.com/2009/08/28/all-your-aliases-are-belong-to-you/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Extracting relevant text from HTML pages</title>
		<link>http://blog.prashanthellina.com/2009/07/27/extracting-relevant-text-from-html-pages/</link>
		<comments>http://blog.prashanthellina.com/2009/07/27/extracting-relevant-text-from-html-pages/#comments</comments>
		<pubDate>Mon, 27 Jul 2009 11:28:09 +0000</pubDate>
		<dc:creator>prashanthellina</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[text processing]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=99</guid>
		<description><![CDATA[Some time back I had done some work on extracting topics from an arbitrary piece of text using Wikipedia data. Recently I thought of a concept to put that algorithm to work. As a part of this project, I need to extract relevant text from an arbitrary HTML page. By relevant I mean the &#8220;meat&#8221; [...]]]></description>
			<content:encoded><![CDATA[<p>Some time back I had done some work on <a href="http://blog.prashanthellina.com/2007/12/21/topic-extraction-using-wikipedia-data/">extracting topics</a> from an arbitrary piece of text using Wikipedia data. Recently I thought of a concept to put that algorithm to work. As a part of this project, I need to extract <strong>relevant</strong> text from an arbitrary HTML page. By relevant I mean the &#8220;meat&#8221; of the page devoid of navigation links and side-content.</p>
<p>This algorithm has the following <strong>steps</strong>:</p>
<ul>
<li>make doc from html data (clean html)
<li>identify content nodes (nodes having substantial content)
<li>prune xml tree to remove irrelevant nodes
<li>get the most linked node from pruned tree (subtree contains relevant text)
<li>make the dot graph
</ul>
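<p>The pruning and &#8220;most linked&#8221; steps can be illustrated on a toy tree without any HTML at all. The sketch below is only an illustration: the tree, the node names and the helper names are invented, and a plain dict of child-to-parent links stands in for the lxml tree used by the real module.</p>
<pre lang="python">
from collections import Counter

# child -> parent links of a made-up page structure
PARENT = {
    'body': 'html',
    'nav': 'body',
    'main': 'body',
    'nav_link': 'nav',
    'p1': 'main',
    'p2': 'main',
    'p3': 'main',
    'footer': 'body',
}

def prune(content_nodes, parent):
    '''Keep only the links on the paths from content nodes up to the root.'''
    links = {}
    for node in content_nodes:
        while node in parent:
            links[node] = parent[node]
            node = parent[node]
    return links

def most_linked(links):
    '''Return the node that has the most kept children pointing at it.'''
    counts = Counter(links.values())
    return counts.most_common(1)[0][0]

# pretend only the three paragraphs passed the minimum text length test
links = prune(['p1', 'p2', 'p3'], PARENT)
print(most_linked(links))
</pre>
<p>The navigation and footer nodes drop out because nothing with substantial text sits below them, and the node with the most surviving children is taken as the root of the relevant content.</p>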
<p>I&#8217;ve pasted the relevant python module below for easy reading. However, if you want to download the code and hack it, you can get all the files from <a href="http://code.prashanthellina.com/code/content_extraction">here</a>.</p>
<p><strong>Code files</strong></p>
<ul>
<li>content_extract.py &#8211; actual work gets done here (file pasted below)
<li>cextract.py &#8211; CGI front-end which fetches the URL content and feeds it to the above script.
<li>cextract_config.py &#8211; cgi script configuration file. You have to adjust this to your environment.
</ul>
<p><strong>Try it right here and right now</strong></p>
<form action="http://www.prashanthellina.com/cgi-bin/cextract.py" method="GET">
url:<br />
<input type="text" name="url" size="60"/>
<input type="submit" value="extract text"/>
</form>
<p><strong>Some samples</strong></p>
<ul>
<li><a href="http://www.prashanthellina.com/cextract_data/510fbe51d89334aecb70d9b1d1635711.html">http://news.bbc.co.uk/sport2/hi/motorsport/formula_one/8169436.stm</a>
<li><a href="http://www.prashanthellina.com/cextract_data/640ca0b7c6e818b2b1bf952a206a6388.html">http://www.prashanthellina.com/cextract_data/640ca0b7c6e818b2b1bf952a206a6388.html</a>
<li><a href="http://www.prashanthellina.com/cextract_data/8408efd51b4f1f6b91650d4ea3ce8924.html">http://www.telegraph.co.uk/news/worldnews/europe/france/5913494/Nicolas-Sarkozy-to-slow-down-after-collapsing-while-jogging.html</a>
</ul>
<p>Please let me know if you find cases for which the algorithm does not work. Even better would be to download the code and hack it up and post back. I am eager to see what you can come up with.</p>
<pre lang="python">
#!/usr/bin/env python

import sys
from cStringIO import StringIO

from lxml import etree #http://codespeak.net/lxml/

IGNORABLE_TAGS = set(['script', 'a'])
MIN_TEXT_LEN = 50

def get_text(node):
    '''
    Given an XML node, extract the text directly inside it:
    its own text plus the tail text of each child
    (does not recurse into children).
    '''
    text = [node.text or '']
    for cnode in node.getchildren():
        tail = cnode.tail
        if tail is not None:
            text.append(cnode.tail)

    text = '\n'.join(text).strip()
    return text

def get_xml(node):
    '''
    Convert the sub-tree from node downwards
    into string XML representation.
    '''
    return etree.tostring(node)

def create_doc(data):
    '''
    Construct an XML tree data structure from an HTML string.
    '''
    parser = etree.HTMLParser()
    doc = etree.parse(StringIO(data), parser)
    return doc

def get_content_nodes(doc):
    '''
    Identify nodes in the XML document that
    have substantial text.
    '''
    nodes = []

    for n in doc.xpath('//*'):
        tag = n.tag

        if tag.lower() in IGNORABLE_TAGS:
            continue

        text = get_text(n)
        if not text:
            continue

        if len(text) < MIN_TEXT_LEN:
            continue

        nodes.append(n)

    return nodes

def make_pruned_tree(content_nodes):
    '''
    Prune the whole XML tree by removing nodes
    other than content nodes and their ancestors.
    '''
    nodes = {}
    links = {}

    for node in content_nodes:

        nodes[id(node)] = node

        parent = node.getparent()
        if parent is not None:
            links[id(node)] = id(parent)

        for anode in node.iterancestors():
            _id = id(anode)
            parent = anode.getparent()
            if parent is not None:
                links[_id] = id(parent)

            if _id not in nodes:
                nodes[_id] = anode

    return nodes, links

def get_inlink_counts(links):
    '''
    Given the inter-node links, find out which
    node has maximum number of links coming into it.
    '''
    counts = {}

    for from_id, to_id in links.iteritems():
        count = counts.setdefault(to_id, 0)
        counts[to_id] = count + 1

    return counts

def get_most_linked_node(nodes, links):
    '''
    Identify the node which is most linked.
    (i.e. the one with the most inlinks)
    '''
    inlink_counts = get_inlink_counts(links)

    mcount, mid = max([(count, _id) for _id, count in inlink_counts.iteritems()])
    node = nodes[mid]
    return node

def make_dot_graph(nodes, links, chosen_node, stream):
    '''
    Construct the dot format graph representation
    so that graphviz can render the tree for visualization.
    '''
    o = stream

    print >> o, "digraph G {"

    for _id, node in nodes.iteritems():

        tlen = len(get_text(node))
        tag = node.tag

        if tlen:
            text = '%s (%d)' % (tag, tlen)
        else:
            text = tag

        if _id == chosen_node:
            attrs = 'style=filled color=lightblue'
        else:
            attrs = ''

        print >> o, "%s [label=\"%s\" %s];" % (_id, text, attrs)

    for fid, tid in links.iteritems():
        print >> o, "%d -> %d;" % (fid, tid)

    print >> o, "}"

def main():
    # make doc from html data (cleans html)
    doc = create_doc(sys.stdin.read())

    # identify content nodes
    content_nodes = get_content_nodes(doc)

    # prune xml tree to remove irrelevant nodes
    nodes, links = make_pruned_tree(content_nodes)

    # get the most linked node from pruned tree
    mnode = get_most_linked_node(nodes, links)

    # make the dot graph
    make_dot_graph(nodes, links, id(mnode), sys.stdout)

if __name__ == '__main__':
    #Eg: wget "http://blog.prashanthellina.com" -O - | python thisscript.py | dot -Tpng -o /tmp/test.png ; eog /tmp/test.png
    main()
</pre>
]]></content:encoded>
			<wfw:commentRss>http://blog.prashanthellina.com/2009/07/27/extracting-relevant-text-from-html-pages/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Clustering Data using Python</title>
		<link>http://blog.prashanthellina.com/2009/07/25/clustering-data-using-python/</link>
		<comments>http://blog.prashanthellina.com/2009/07/25/clustering-data-using-python/#comments</comments>
		<pubDate>Sat, 25 Jul 2009 04:06:43 +0000</pubDate>
		<dc:creator>prashanthellina</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[text processing]]></category>
		<category><![CDATA[clustering]]></category>
		<category><![CDATA[script]]></category>

		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=93</guid>
		<description><![CDATA[As a part of a project I am working on, I had to cluster URLs on a page. After some light googling, I found python-cluster. Below is a simple Python script that illustrates the usage of the python-cluster library. Code import pprint from difflib import SequenceMatcher # http://python-cluster.sourceforge.net/ from cluster import HierarchicalClustering # input [...]]]></description>
			<content:encoded><![CDATA[<p>As a part of a project I am working on, I had to cluster URLs on a page. After some light googling, I found <a href="http://python-cluster.sourceforge.net/">python-cluster</a>. Below is a simple Python script that illustrates the usage of the python-cluster library.</p>
<p><strong>Code</strong></p>
<pre lang="python">
import pprint
from difflib import SequenceMatcher

# http://python-cluster.sourceforge.net/
from cluster import HierarchicalClustering

# input urls to be clustered
urls = [
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;cid=28814385',
    '#articles',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;cid=28814335',
    'http://yro.slashdot.org/~drDugan/',
    'http://web.sourceforge.com/privacy.php',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;cid=28815123',
    'http://slashdot.org//slashdot.org/~Darkness404',
    'http://slashdot.org//radio.slashdot.org',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;op=Reply&#038;threshold=1&#038;commentsort=0&#038;mode=thread&#038;pid=28814429',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;op=Reply&#038;threshold=1&#038;commentsort=0&#038;mode=thread&#038;pid=28814457',
    'http://slashdot.org//slashdot.org/article.pl?sid=09/07/24/1545238',
    'http://slashdot.org//slashdot.org/comments.pl?sid=09/07/24/1545238&#038;cid=28810581',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;cid=28815269',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;cid=28814657',
    'http://web.sourceforge.com/terms.php'  # note: no comma here, so Python joins this string with the next one (visible in the output)
    'http://slashdot.org//it.slashdot.org/search',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;cid=28814581',
    'http://xkcd.com/612/',
    'http://web.sourceforge.com/advertising',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;op=Reply&#038;threshold=1&#038;commentsort=0&#038;mode=thread&#038;pid=28814785',
]

# distance function compares two urls and finds the distance
# uses SequenceMatcher from python standard module difflib
def distance(url1, url2):
    ratio = SequenceMatcher(None, url1, url2).ratio()
    return 1.0 - ratio

# Perform clustering
hc = HierarchicalClustering(urls, distance)
clusters = hc.getlevel(0.2)

pprint.pprint(clusters)
</pre>
<p><br/></p>
<p><strong> Output </strong></p>
<pre lang="python">
[['#articles'],
 ['http://xkcd.com/612/'],
 ['http://web.sourceforge.com/privacy.php'],
 ['http://web.sourceforge.com/advertising'],
 ['http://web.sourceforge.com/terms.phphttp://slashdot.org//it.slashdot.org/search'],
 ['http://yro.slashdot.org/~drDugan/'],
 ['http://slashdot.org//slashdot.org/~Darkness404'],
 ['http://slashdot.org//radio.slashdot.org'],
 ['http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;op=Reply&#038;threshold=1&#038;commentsort=0&#038;mode=thread&#038;pid=28814785',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;op=Reply&#038;threshold=1&#038;commentsort=0&#038;mode=thread&#038;pid=28814429',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;op=Reply&#038;threshold=1&#038;commentsort=0&#038;mode=thread&#038;pid=28814457'],
 ['http://slashdot.org//slashdot.org/article.pl?sid=09/07/24/1545238',
  'http://slashdot.org//slashdot.org/comments.pl?sid=09/07/24/1545238&#038;cid=28810581',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;cid=28815123',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;cid=28815269',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;cid=28814385',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;cid=28814335',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;cid=28814657',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&#038;cid=28814581']]
</pre>
<p><br/></p>
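<p>To get a feel for why near-identical URLs end up in the same cluster, it helps to look at a few distance values directly. The sketch below reuses the same distance function and needs only the standard library; the three URLs are hypothetical and were chosen only to show the range of values.</p>
<pre lang="python">
from difflib import SequenceMatcher

def distance(url1, url2):
    # 0.0 means identical strings, 1.0 means nothing in common
    return 1.0 - SequenceMatcher(None, url1, url2).ratio()

# hypothetical urls, not taken from the page clustered above
a = 'http://example.com/comments?id=1001'
b = 'http://example.com/comments?id=1002'
c = 'http://xkcd.com/612/'

print(distance(a, a))
print(distance(a, b))
print(distance(a, c))
</pre>
<p>URLs that differ in only a character or two should come out well under the 0.2 level used above, while unrelated URLs land far above it.</p>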
]]></content:encoded>
			<wfw:commentRss>http://blog.prashanthellina.com/2009/07/25/clustering-data-using-python/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
		<item>
		<title>Microsoft Surface Unboxing</title>
		<link>http://blog.prashanthellina.com/2008/12/30/microsoft-surface-unboxing/</link>
		<comments>http://blog.prashanthellina.com/2008/12/30/microsoft-surface-unboxing/#comments</comments>
		<pubDate>Mon, 29 Dec 2008 23:58:11 +0000</pubDate>
		<dc:creator>prashanthellina</dc:creator>
				<category><![CDATA[computer hardware]]></category>
		<category><![CDATA[text processing]]></category>
		<category><![CDATA[veveo]]></category>
		<category><![CDATA[computer]]></category>
		<category><![CDATA[gadget]]></category>
		<category><![CDATA[interface]]></category>
		<category><![CDATA[microsoft]]></category>
		<category><![CDATA[surface]]></category>
		<category><![CDATA[touch]]></category>
		<category><![CDATA[unboxing]]></category>

		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=74</guid>
		<description><![CDATA[Today, we received the shipment from Microsoft at Veveo. If you have not heard of Microsoft Surface before, it is a touch-screen-based computer embedded in a table. The surface of the table is illuminated from underneath by a projector (rear-projection) and touch input is implemented by reflecting IR radiation off the fingers and then [...]]]></description>
			<content:encoded><![CDATA[<p>Today, we received the shipment from Microsoft at Veveo. If you have not heard of Microsoft Surface before, it is a touch-screen-based computer embedded in a table. The surface of the table is illuminated from underneath by a projector (rear-projection) and touch input is implemented by reflecting IR radiation off the fingers and then capturing it with five IR cameras hidden inside the unit.</p>
<p>To learn more about Microsoft Surface head over to:</p>
<ul>
<li> <a href="http://www.microsoft.com/SURFACE/index.html">Microsoft&#8217;s page on Surface</a>
<li> <a href="http://en.wikipedia.org/wiki/Microsoft_Surface">Wikipedia article on Microsoft Surface</a>
<li> <a href="http://www.youtube.com/watch?v=rP5y7yp06n0">Watch a Youtube video on Microsoft Surface</a>
</ul>
<p><strong>Unboxing Pictures</strong><br />
<center><iframe align="center" src="http://www.flickr.com/slideShow/index.gne?user_id=prashanthellina&#038;set_id=72157611858989460" frameBorder="0" width="500" scrolling="no" height="500"></iframe><br />
<a href="http://www.flickr.com/photos/prashanthellina/sets/72157611858989460/">flickr set on microsoft surface unboxing</a><br />
</center></p>
<p><strong>Some observations:</strong></p>
<ul>
<li> It is very heavy!
<li> and expensive (around $15,000)
<li> The power socket is hidden underneath and is very difficult to access. The power button is equally well hidden and difficult to find.
<li> Installation was non-trivial. The touch input did not work out of the box. We had to use the bundled mouse to complete the initial installation steps.
<li> The &#8220;surface shell&#8221; with the ripples in the water is a great way to understand the potential of this device. It feels like you are touching water, and your brain expects water to drip when you lift your fingers. I think it is more realistic than devices with smaller touch screens because of the size of the display and because it lies horizontally, which feels more natural.
<li> Since rear-projection is used for the display, the viewing angle is very wide (nearly 180 degrees)
<li> The matte finish on the touch surface has a good feel (almost like paper).
</ul>
]]></content:encoded>
			<wfw:commentRss>http://blog.prashanthellina.com/2008/12/30/microsoft-surface-unboxing/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Project Gutenberg Ngram data: English only</title>
		<link>http://blog.prashanthellina.com/2008/05/13/project-gutenberg-ngram-data-english-only/</link>
		<comments>http://blog.prashanthellina.com/2008/05/13/project-gutenberg-ngram-data-english-only/#comments</comments>
		<pubDate>Tue, 13 May 2008 16:36:05 +0000</pubDate>
		<dc:creator>prashanthellina</dc:creator>
				<category><![CDATA[data mining]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[text processing]]></category>
		<category><![CDATA[ngram]]></category>
		<category><![CDATA[project gutenberg]]></category>

		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=64</guid>
		<description><![CDATA[In my earlier post, I&#8217;d posted links to the Project Gutenberg Ngram data I had computed for e-books of all languages. If you are interested in only the English data, get these files instead. These two files are splits of a compressed file which contains all of the Project Gutenberg English e-books downloaded about a [...]]]></description>
			<content:encoded><![CDATA[<p>
In my <a href="/2008/05/04/n-gram-data-from-project-gutenberg/">earlier post</a>, I&#8217;d posted links to the <a href="http://www.gutenberg.org">Project Gutenberg</a> Ngram data I had computed for e-books of all languages. If you are interested in only the English data, get these files instead.
</p>
<p>
These two files are splits of a compressed file which contains all of the Project Gutenberg English e-books downloaded about a week before the date of this post.<br/><br />
<a href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_en_files.tar.bz2.0">gutenberg_en_files.tar.bz2.0</a> (<strong>2.0GB</strong>) <br/></p>
<p><a href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_en_files.tar.bz2.1">gutenberg_en_files.tar.bz2.1</a> (<strong>1.4GB</strong>) <br/></p>
<p>Unigrams along with frequency count from the text data above<br/><br />
<a href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_en_unigrams.tar.gz">gutenberg_en_unigrams.tar.gz</a> (<strong>7.4MB</strong>) <br/></p>
<p>Bi-grams and Tri-grams along with frequency count from the text data above<br/><br />
<a href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_en_bi_tri_grams.tar.gz">gutenberg_en_bi_tri_grams.tar.gz</a> (<strong>493MB</strong>) <br/>
</p>
<p>I had to split the files because my web server cannot serve files larger than 2GB. After downloading the files, join them like this:</p>
<pre lang="BASH">
mv gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2
cat gutenberg_en_files.tar.bz2.1 >> gutenberg_en_files.tar.bz2
rm gutenberg_en_files.tar.bz2.1
</pre>
<p><br/></p>
<p>If you find the data useful, I&#8217;d be delighted to hear the context in which you made use of it.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.prashanthellina.com/2008/05/13/project-gutenberg-ngram-data-english-only/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>N-gram data from Project Gutenberg</title>
		<link>http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-project-gutenberg/</link>
		<comments>http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-project-gutenberg/#comments</comments>
		<pubDate>Sun, 04 May 2008 16:40:14 +0000</pubDate>
		<dc:creator>prashanthellina</dc:creator>
				<category><![CDATA[data mining]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[text processing]]></category>
		<category><![CDATA[gutenberg]]></category>
		<category><![CDATA[ngrams]]></category>
		<category><![CDATA[project gutenberg]]></category>
		<category><![CDATA[text parsing]]></category>

		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=63</guid>
		<description><![CDATA[I&#8217;ve been working on Wordza.com for which I needed Ngram data from a sufficiently large corpus. Initially, I thought of using Wikipedia data which I already have on my disk, but decided on using Project Gutenberg data as it is more representative of the general usage of the English language. Get Project Gutenberg Ngram data The [...]]]></description>
			<content:encoded><![CDATA[<p>
  I&#8217;ve been working on <A href="http://www.wordza.com" name="Wordza">Wordza.com</A> for which I needed Ngram data from a sufficiently large corpus. Initially, I thought of using Wikipedia data which I already <A href="/2007/12/21/topic-extraction-using-wikipedia-data/">have on my disk</A>, but decided on using <A href="http://www.gutenberg.org">Project Gutenberg</A> data as it is more representative of the general usage of the English language.
</p>
<h2>Get Project Gutenberg Ngram data</h2>
<p>
The Ngram data contains bi-grams and tri-grams for now. I plan to generate uni-grams soon. I&#8217;ve made the data available here so you can download and use it! This data contains all of the e-books hosted by Project Gutenberg (which means the data contains English, French, German and other languages). If you want an English only dataset, check back in a week or two. I am in the process of generating the same.
</p>
<p>
  The Ngram data contains bi-grams and tri-grams. Each line is prefixed with the occurrence count.<br/><br />
<A href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_ngrams.tar.bz2">gutenberg_ngrams.tar.bz2</A> (<strong>624 MB</strong>)<br/></p>
<p><br/></p>
<p>This is the compressed tarball of all the txt files in Project Gutenberg (as of a week before this blog post). Note that you don&#8217;t need this file unless you want to generate the Ngrams yourself using the scripts provided below.<br/><br />
<A href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_files.tar.bz2.0">gutenberg_files.tar.bz2.0</A>,<br />
<A href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_files.tar.bz2.1">gutenberg_files.tar.bz2.1</A>,<br />
<A href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_files.tar.bz2.2">gutenberg_files.tar.bz2.2</A> (<strong>5.3 GB</strong>)<br/></p>
<p>My webserver (Apache) has a problem serving out files bigger than 2GB, so I had to split the file up. After you download the splits, you have to join them like this.</p>
<pre lang="BASH">
mv gutenberg_files.tar.bz2.0 gutenberg_files.tar.bz2
cat gutenberg_files.tar.bz2.1 >> gutenberg_files.tar.bz2
cat gutenberg_files.tar.bz2.2 >> gutenberg_files.tar.bz2
</pre>
<p><br/></p>
<p>To decompress the files, you will need bunzip2 on *nix/Cygwin. On Windows, use 7zip.
</p>
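<p>If cat is not available (for example on plain Windows), the splits can also be joined with a few lines of Python. This is a sketch and not one of the original scripts; join_splits is a name made up here, and the demo at the bottom uses small throwaway files rather than the real downloads.</p>
<pre lang="python">
import os
import shutil
import tempfile

def join_splits(part_paths, out_path):
    '''Concatenate the split files, in the order given, into out_path.'''
    with open(out_path, 'wb') as out:
        for path in part_paths:
            with open(path, 'rb') as part:
                # copy in chunks so multi-gigabyte parts never sit in memory
                shutil.copyfileobj(part, out)

# small demonstration with throwaway files
tmp = tempfile.mkdtemp()
parts = []
for i, data in enumerate([b'abc', b'def', b'ghi']):
    path = os.path.join(tmp, 'demo.bin.%d' % i)
    with open(path, 'wb') as f:
        f.write(data)
    parts.append(path)

joined = os.path.join(tmp, 'demo.bin')
join_splits(parts, joined)
with open(joined, 'rb') as f:
    print(f.read())
shutil.rmtree(tmp)
</pre>
<p>For the real data, the call would be join_splits with the three gutenberg_files.tar.bz2.N names in order and gutenberg_files.tar.bz2 as the output path.</p>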
<h2>Generate the data yourself</h2>
<p>
In case you want to generate the Ngrams yourself by processing the Project Gutenberg data files, follow these instructions. You will first have to get the Project Gutenberg data files. Use the following command to get all the English language files in txt format.</p>
<pre lang="bash">
mkdir gutenberg
cd gutenberg
wget -w 2 -m "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&#038;langs[]=en"
</pre>
<p><br/></p>
<p>The txt files are compressed and stored in files ending with .zip extension. These zip files are spread across multiple directories. The following command will move the zip files into the &#8220;gutenberg&#8221; directory you created in the above step.</p>
<pre lang="BASH">
for i in `find . -name "*.zip"`; do mv $i . ; done;
</pre>
<p><br/></p>
<p>Now that all the zip files are in the same directory, unzip them. Some zip files may contain files other than .txt&#8217;s. The following commands extract only the .txt&#8217;s from the zip files.</p>
<pre lang="BASH">
cd ..
mkdir gutenberg_txt
for i in `find gutenberg -name "*.zip"`; do unzip $i \*.txt -d gutenberg_txt/ ; done;
cd gutenberg_txt
for i in `find . -name "*.txt"`; do mv $i . ; done;
cd ..
</pre>
<p><br/></p>
<p>The gutenberg txt files have gutenberg headers and footers which should be removed lest they skew the frequency of Ngrams. The script &#8220;remove_gutenberg_text.py&#8221; does exactly this. The &#8220;generate_ngrams.py&#8221; script creates uni, bi and tri-grams of whatever text is piped into it. The following command pipes all the txt files through both the scripts to create the ngrams file.</p>
<pre lang="BASH">
for i in `find gutenberg_txt/ -name "*.txt"`; \
do cat $i | python remove_gutenberg_text.py | \
grep -i -v "project gutenberg" |\
 python generate_ngrams.py >> gutenberg_ngrams; done;
</pre>
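<p>The linked generate_ngrams.py is the script that was actually used. Purely to illustrate what &#8220;uni, bi and tri-grams&#8221; means, here is an independent sketch; the tokenisation rule and the function names are assumptions and may well differ from the real script.</p>
<pre lang="python">
import re

def ngrams(words, n):
    '''Return the list of n-word sequences found in words.'''
    return [' '.join(words[i:i + n]) for i in range(len(words) - n + 1)]

def generate_ngrams(text, sizes=(1, 2, 3)):
    # crude tokenisation: lowercase runs of letters and apostrophes
    words = re.findall(r"[a-z']+", text.lower())
    result = []
    for n in sizes:
        result.extend(ngrams(words, n))
    return result

for gram in generate_ngrams('The quick brown fox'):
    print(gram)
</pre>
<p>Writing one ngram per line, as this does, is what makes the sort and uniq based counting in the next step possible.</p>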
<p><br/></p>
<p>Now you have to count the number of times each ngram occurs. The following sequence of commands processes the ngrams file generated above and produces a file with the frequency counts of the ngrams. Note that the &#8220;512K&#8221; buffer size passed to sort is there because I had to run these scripts on my host, which kills processes that take too much memory. If you have a machine with a lot of memory, sorting can be significantly faster if you use a higher value, say &#8220;1G&#8221;.</p>
<pre lang="BASH">
mkdir -p tmp_sort
sort -S 512K -T tmp_sort/ gutenberg_ngrams > gutenberg_ngrams.sorted
uniq -c gutenberg_ngrams.sorted > gutenberg_ngrams.counted
sort -S 512K -T tmp_sort/ gutenberg_ngrams.counted > gutenberg_ngrams.counted.sorted
</pre>
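<p>For small inputs, the same counting can be done in memory with Python. This is only a sketch for experimenting with a sample: unlike sort and uniq, it keeps every distinct ngram in memory, so it is not a replacement for the commands above on the full corpus. The function name and the sample lines are made up.</p>
<pre lang="python">
from collections import Counter

def count_ngrams(lines):
    '''Count how often each line (one ngram per line) occurs.'''
    return Counter(line.rstrip('\n') for line in lines)

# hypothetical sample in the one-ngram-per-line format
sample = ['the quick\n', 'quick brown\n', 'the quick\n']

for gram, count in sorted(count_ngrams(sample).items()):
    # count first, then the ngram, similar in spirit to uniq -c
    print('%7d %s' % (count, gram))
</pre>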
<p><br/>
</p>
<h3>Gutenberg data processing scripts</h3>
<ul>
<li><A href="http://code.prashanthellina.com/code/remove_gutenberg_text.py">remove_gutenberg_text.py</A> &#8212; removes Project Gutenberg header and footer from txt files</li>
<li><A href="http://code.prashanthellina.com/code/generate_ngrams.py">generate_ngrams.py</A> &#8212; generate uni, bi and tri-grams for any text</li>
</ul>
<h2>Do get back</h2>
<p>If you use this data, I would really appreciate it if you got back to me with details about how you used it in your project.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-project-gutenberg/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

