<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	>
<channel>
	<title>Comments on: N-gram data from Project Gutenberg</title>
	<atom:link href="http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-project-gutenberg/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-project-gutenberg/</link>
	<description>( to ) ? be : ! be;</description>
	<pubDate>Tue, 06 Jan 2009 23:09:14 +0000</pubDate>
	<generator>http://wordpress.org/?v=2.5</generator>
		<item>
		<title>By: prashanthellina</title>
		<link>http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-project-gutenberg/#comment-3814</link>
		<dc:creator>prashanthellina</dc:creator>
		<pubDate>Mon, 05 May 2008 06:05:57 +0000</pubDate>
		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=63#comment-3814</guid>
		<description>I am glad you enjoy my blog! I have not attempted N-gram generation on Wikipedia data, so I don't have any hard numbers to support my argument. However, I'll explain my reasoning. I need the N-gram data for a feature of Wordza where the user gets to see the most frequently used phrases containing a given word, e.g. "abysmal corruption" and "abysmal conditions" for the word "abysmal". Wikipedia, being an encyclopedia, will tend to have a sanitized usage of English where many of the "non-frequent" words in English don't even occur. In comparison, Gutenberg data is better suited because it is English literature. When I can, I'll try to generate N-gram data for Wikipedia. It will be interesting to compare the results.</description>
		<content:encoded><![CDATA[<p>I am glad you enjoy my blog! I have not attempted N-gram generation on Wikipedia data, so I don&#8217;t have any hard numbers to support my argument. However, I&#8217;ll explain my reasoning. I need the N-gram data for a feature of Wordza where the user gets to see the most frequently used phrases containing a given word, e.g. &#8220;abysmal corruption&#8221; and &#8220;abysmal conditions&#8221; for the word &#8220;abysmal&#8221;. Wikipedia, being an encyclopedia, will tend to have a sanitized usage of English where many of the &#8220;non-frequent&#8221; words in English don&#8217;t even occur. In comparison, Gutenberg data is better suited because it is English literature. When I can, I&#8217;ll try to generate N-gram data for Wikipedia. It will be interesting to compare the results.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Som</title>
		<link>http://blog.prashanthellina.com/2008/05/04/n-gram-data-from-project-gutenberg/#comment-3812</link>
		<dc:creator>Som</dc:creator>
		<pubDate>Mon, 05 May 2008 03:50:05 +0000</pubDate>
		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=63#comment-3812</guid>
		<description>Hi Prashant,

    Thanks for the wonderful blog you write. I was curious to know what kind of problem you faced with Wikipedia that is not present in the Gutenberg data. Maybe the related question is how you use the N-gram statistics in Wordza and why you think the statistics obtained from Gutenberg data are better.

Thanks,
Som</description>
		<content:encoded><![CDATA[<p>Hi Prashant,</p>
<p>    Thanks for the wonderful blog you write. I was curious to know what kind of problem you faced with Wikipedia that is not present in the Gutenberg data. Maybe the related question is how you use the N-gram statistics in Wordza and why you think the statistics obtained from Gutenberg data are better.</p>
<p>Thanks,<br />
Som</p>
]]></content:encoded>
	</item>
</channel>
</rss>
