<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Prashanth Ellina &#187; data mining</title>
	<atom:link href="http://blog.prashanthellina.com/category/data-mining/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.prashanthellina.com</link>
	<description>In Pursuit of Truth</description>
	<lastBuildDate>Sun, 28 Nov 2010 09:35:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Project Gutenberg Ngram data: English only</title>
		<link>http://blog.prashanthellina.com/2008/05/13/project-gutenberg-ngram-data-english-only/</link>
		<comments>http://blog.prashanthellina.com/2008/05/13/project-gutenberg-ngram-data-english-only/#comments</comments>
		<pubDate>Tue, 13 May 2008 16:36:05 +0000</pubDate>
		<dc:creator>prashanthellina</dc:creator>
				<category><![CDATA[data mining]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[text processing]]></category>
		<category><![CDATA[ngram]]></category>
		<category><![CDATA[project gutenberg]]></category>

		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=64</guid>
		<description><![CDATA[In my earlier post, I&#8217;d posted links to the Project Gutenberg Ngram data I had computed for e-books of all languages. If you are interested in only the English data, get these files instead. These two files are splits of a compressed file which contains all of the Project Gutenberg English e-books downloaded about a [...]]]></description>
			<content:encoded><![CDATA[<p>
In my <a href="/2008/05/04/n-gram-data-from-project-gutenberg/">earlier post</a>, I&#8217;d posted links to the <a href="http://www.gutenberg.org">Project Gutenberg</a> Ngram data I had computed for e-books of all languages. If you are interested in only the English data, get these files instead.
</p>
<p>
These two files are splits of a compressed file which contains all of the Project Gutenberg English e-books downloaded about a week before the date of this post.<br/><br />
<a href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_en_files.tar.bz2.0">gutenberg_en_files.tar.bz2.0</a> (<strong>2.0GB</strong>) <br/></p>
<p><a href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_en_files.tar.bz2.1">gutenberg_en_files.tar.bz2.1</a> (<strong>1.4GB</strong>) <br/></p>
<p>Unigrams along with frequency count from the text data above<br/><br />
<a href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_en_unigrams.tar.gz">gutenberg_en_unigrams.tar.gz</a> (<strong>7.4MB</strong>) <br/></p>
<p>Bi-grams and Tri-grams along with frequency count from the text data above<br/><br />
<a href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_en_bi_tri_grams.tar.gz">gutenberg_en_bi_tri_grams.tar.gz</a> (<strong>493MB</strong>) <br/>
</p>
<p>I had to split the compressed file because my web server cannot serve files larger than 2GB. After downloading both parts, join them like this:</p>
<pre lang="BASH">
mv gutenberg_en_files.tar.bz2.0 gutenberg_en_files.tar.bz2
cat gutenberg_en_files.tar.bz2.1 >> gutenberg_en_files.tar.bz2
rm gutenberg_en_files.tar.bz2.1
</pre>
<p><br/></p>
<p>If you find the data useful, I&#8217;d be delighted to hear the context in which you made use of it.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.prashanthellina.com/2008/05/13/project-gutenberg-ngram-data-english-only/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>N-gram data from Project Gutenberg</title>
		<link>http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-project-gutenberg/</link>
		<comments>http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-project-gutenberg/#comments</comments>
		<pubDate>Sun, 04 May 2008 16:40:14 +0000</pubDate>
		<dc:creator>prashanthellina</dc:creator>
				<category><![CDATA[data mining]]></category>
		<category><![CDATA[linux]]></category>
		<category><![CDATA[programming]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[text processing]]></category>
		<category><![CDATA[gutenberg]]></category>
		<category><![CDATA[ngrams]]></category>
		<category><![CDATA[project gutenberg]]></category>
		<category><![CDATA[text parsing]]></category>

		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=63</guid>
		<description><![CDATA[I&#8217;ve been working on Wordza.com for which I needed Ngram data from a sufficiently large corpus. Initially, I thought of using Wikipedia data which I already have on my disk, but decided on using Project Gutenberg data as it is more representative of the general usage of English language. Get Project Gutenberg Ngram data The [...]]]></description>
			<content:encoded><![CDATA[<p>
  I&#8217;ve been working on <A href="http://www.wordza.com" name="Wordza">Wordza.com</A>, for which I needed Ngram data from a sufficiently large corpus. Initially, I thought of using the Wikipedia data which I already <A href="/2007/12/21/topic-extraction-using-wikipedia-data/">have on my disk</A>, but decided on <A href="http://www.gutenberg.org">Project Gutenberg</A> data as it is more representative of the general usage of the English language.
</p>
<h2>Get Project Gutenberg Ngram data</h2>
<p>
The Ngram data contains bi-grams and tri-grams for now. I plan to generate uni-grams soon. I&#8217;ve made the data available here so you can download and use it! This data covers all of the e-books hosted by Project Gutenberg (which means it contains English, French, German and other languages). If you want an English-only dataset, check back in a week or two. I am in the process of generating it.
</p>
<p>
  The Ngram data, containing bi-grams and tri-grams. Each line is prefixed with the occurrence count.<br/><br />
<A href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_ngrams.tar.bz2">gutenberg_ngrams.tar.bz2</A> (<strong>624 MB</strong>)<br/></p>
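<p>As a rough illustration of reading such lines, here is a minimal Python 3 sketch. It assumes each line looks like the output of &#8220;uniq -c&#8221; (optional leading spaces, the count, then the words of the ngram). The function name &#8220;parse_counted_line&#8221; and the sample lines are made up for illustration, so check them against the actual file before relying on this.</p>
<pre lang="python">
def parse_counted_line(line):
    """Split one 'count ngram' line into (count, tuple_of_words)."""
    # Assumption: the count comes first, followed by whitespace and the ngram.
    count, ngram = line.strip().split(None, 1)
    return int(count), tuple(ngram.split())


# Hypothetical sample lines, not taken from the real data files.
sample = [
    "      3 of the",
    "      1 of the king",
]
for line in sample:
    print(parse_counted_line(line))
</pre>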
<p><br/></p>
<p>This is the compressed tarball of all the txt files in Project Gutenberg (as of a week before this blog post). Note that you don&#8217;t need this file unless you want to generate the Ngrams yourself using the scripts provided below.<br/><br />
<A href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_files.tar.bz2.0">gutenberg_files.tar.bz2.0</A>,<br />
<A href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_files.tar.bz2.1">gutenberg_files.tar.bz2.1</A>,<br />
<A href="http://www.prashanthellina.com/docs/gutenberg_data/gutenberg_files.tar.bz2.2">gutenberg_files.tar.bz2.2</A> (<strong>5.3 GB</strong>)<br/></p>
<p>My webserver (Apache) has a problem serving out files bigger than 2GB, so I had to split the file up. After you download the splits, you have to join them like this.</p>
<pre lang="BASH">
mv gutenberg_files.tar.bz2.0 gutenberg_files.tar.bz2
cat gutenberg_files.tar.bz2.1 >> gutenberg_files.tar.bz2
cat gutenberg_files.tar.bz2.2 >> gutenberg_files.tar.bz2
</pre>
<p><br/></p>
<p>To decompress the files, you will need bunzip2 on *nix/Cygwin. On Windows, use 7zip.
</p>
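<p>If mv and cat are not available (for instance on Windows without Cygwin), the same join can be sketched in Python 3. This is only an illustration: the function name &#8220;join_parts&#8221; is made up, and the parts must be passed in order.</p>
<pre lang="python">
import shutil


def join_parts(part_paths, out_path):
    """Concatenate the split parts, in the given order, into out_path."""
    with open(out_path, "wb") as out:
        for part in part_paths:
            with open(part, "rb") as src:
                # Copy in chunks so multi-gigabyte parts never sit in memory.
                shutil.copyfileobj(src, out)


# Example call, using the file names from this post:
# join_parts(
#     ["gutenberg_files.tar.bz2.0",
#      "gutenberg_files.tar.bz2.1",
#      "gutenberg_files.tar.bz2.2"],
#     "gutenberg_files.tar.bz2",
# )
</pre>
<p><br/></p>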
<h2>Generate the data yourself</h2>
<p>
If you want to generate the Ngrams yourself, you first need the Project Gutenberg data files. Use the following command to get all the English language files in txt format.</p>
<pre lang="bash">
mkdir gutenberg
cd gutenberg
wget -w 2 -m "http://www.gutenberg.org/robot/harvest?filetypes[]=txt&#038;langs[]=en"
</pre>
<p><br/></p>
<p>The txt files are compressed and stored in files ending with the .zip extension. These zip files are spread across multiple directories. The following command moves the zip files into the &#8220;gutenberg&#8221; directory you created in the above step.</p>
<pre lang="BASH">
# -mindepth 2 skips zip files that are already in the current directory
find . -mindepth 2 -name "*.zip" -exec mv {} . \;
</pre>
<p><br/></p>
<p>Now that all the zip files are in the same directory, unzip them. Some zip files may contain files other than .txt&#8217;s. The following commands extract only the .txt&#8217;s from the zip files.</p>
<pre lang="BASH">
cd ..
mkdir gutenberg_txt
for i in `find gutenberg -name "*.zip"`; do unzip $i \*.txt -d gutenberg_txt/ ; done;
cd gutenberg_txt
for i in `find . -name "*.txt"`; do mv $i . ; done;
cd ..
</pre>
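<p>Where unzip is not installed, the same &#8220;only the .txt&#8217;s&#8221; extraction can be sketched with the zipfile module from the Python 3 standard library. The function name &#8220;extract_txt&#8221; is made up for illustration. Like the commands above, it flattens everything into one directory, so files with the same name overwrite each other.</p>
<pre lang="python">
import os
import zipfile


def extract_txt(zip_path, out_dir):
    """Extract only the .txt members of zip_path, flattened into out_dir."""
    os.makedirs(out_dir, exist_ok=True)
    extracted = []
    with zipfile.ZipFile(zip_path) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".txt"):
                continue  # skip images, html and anything else
            # Drop any directory part so all files land directly in out_dir.
            target = os.path.join(out_dir, os.path.basename(name))
            with archive.open(name) as src, open(target, "wb") as dst:
                dst.write(src.read())
            extracted.append(target)
    return extracted
</pre>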
<p><br/></p>
<p>The gutenberg txt files have gutenberg headers and footers which should be removed lest they skew the frequency of Ngrams. The script &#8220;remove_gutenberg_text.py&#8221; does exactly this. The &#8220;generate_ngrams.py&#8221; script creates uni, bi and tri-grams of whatever text is piped into it. The following command pipes all the txt files through both the scripts to create the ngrams file.</p>
<pre lang="BASH">
for i in `find gutenberg_txt/ -name "*.txt"`; \
do cat $i | python remove_gutenberg_text.py | \
grep -i -v "project gutenberg" |\
 python generate_ngrams.py >> gutenberg_ngrams; done;
</pre>
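<p>To give an idea of what the header and footer removal involves, here is a minimal Python 3 sketch. It is not the remove_gutenberg_text.py script itself. It assumes the body of a book sits between marker lines beginning with &#8220;*** START OF&#8221; and &#8220;*** END OF&#8221;; older files use other markers, which this sketch does not handle. The function name is made up for illustration.</p>
<pre lang="python">
def strip_gutenberg_boilerplate(lines):
    """Return the lines between the START and END marker lines.

    Assumption: markers begin with '*** START OF' and '*** END OF'.
    If a marker is missing, nothing is stripped on that side.
    """
    start, end = 0, len(lines)
    for index, line in enumerate(lines):
        text = line.strip().upper()
        if text.startswith("*** START OF"):
            start = index + 1
        elif text.startswith("*** END OF"):
            end = index
            break
    return lines[start:end]
</pre>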
<p><br/></p>
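<p>For readers who want to see how uni, bi and tri-grams can be produced from a piece of text, here is a minimal Python 3 sketch. It is not the generate_ngrams.py script itself: the function names are made up, and lowercasing and splitting on whitespace are simplifying assumptions about tokenization.</p>
<pre lang="python">
def ngrams(words, n):
    """Yield every run of n consecutive words as a tuple."""
    for i in range(len(words) - n + 1):
        yield tuple(words[i:i + n])


def emit_ngrams(text, sizes=(1, 2, 3)):
    """Return one space-joined ngram per entry, for each size in sizes."""
    # Assumption: words are whatever lowercased whitespace splitting gives.
    words = text.lower().split()
    return [" ".join(gram) for n in sizes for gram in ngrams(words, n)]


# Example: print one ngram per line, ready to be sorted and counted.
# import sys
# for gram in emit_ngrams(sys.stdin.read()):
#     print(gram)
</pre>
<p><br/></p>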
<p>Now you have to count the number of times each ngram occurs. The following sequence of commands processes the ngrams file generated above and produces a file with the frequency counts of the ngrams. Note that I pass &#8220;-S 512K&#8221; to sort because I had to run these scripts on my host, which kills processes that take too much memory. If you have a machine with a lot of memory, sorting can be significantly faster with a larger buffer, say &#8220;-S 1G&#8221;.</p>
<pre lang="BASH">
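# sort needs its temporary directory to exist
mkdir -p tmp_sort
sort -S 512K -T tmp_sort/ gutenberg_ngrams > gutenberg_ngrams.sorted
uniq -c gutenberg_ngrams.sorted > gutenberg_ngrams.counted
# -n orders the lines by their numeric count
sort -n -S 512K -T tmp_sort/ gutenberg_ngrams.counted > gutenberg_ngrams.counted.sorted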
</pre>
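<p>For a small ngrams file that fits comfortably in memory, the same counting can be sketched in Python 3 with collections.Counter. This is only an illustration of what the sort and uniq steps compute; the function name &#8220;count_ngrams&#8221; is made up, and for the full Gutenberg data the disk-based commands above are the safer route.</p>
<pre lang="python">
from collections import Counter


def count_ngrams(lines):
    """Count identical lines and return (count, ngram) pairs, rarest first."""
    counts = Counter(line.strip() for line in lines if line.strip())
    return sorted((count, gram) for gram, count in counts.items())


# Example: count the ngrams in a (small) file, one ngram per line.
# with open("gutenberg_ngrams") as handle:
#     for count, gram in count_ngrams(handle):
#         print(count, gram)
</pre>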
<p><br/>
</p>
<h3>Gutenberg data processing scripts</h3>
<ul>
<li><A href="http://code.prashanthellina.com/code/remove_gutenberg_text.py">remove_gutenberg_text.py</A> &#8212; removes Project Gutenberg header and footer from txt files</li>
<li><A href="http://code.prashanthellina.com/code/generate_ngrams.py">generate_ngrams.py</A> &#8212; generate uni, bi and tri-grams for any text</li>
</ul>
<h2>Do get back</h2>
<p>If you use this data, I would really appreciate it if you got back to me with details of how you used it in your project.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-project-gutenberg/feed/</wfw:commentRss>
		<slash:comments>5</slash:comments>
		</item>
	</channel>
</rss>

