<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Prashanth Ellina &#187; wikipedia</title>
	<atom:link href="http://blog.prashanthellina.com/category/wikipedia/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.prashanthellina.com</link>
	<description>In Pursuit of Truth</description>
	<lastBuildDate>Sun, 28 Nov 2010 09:35:27 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
		<item>
		<title>Query Wikipedia from your terminal</title>
		<link>http://blog.prashanthellina.com/2009/08/23/query-wikipedia-from-your-terminal/</link>
		<comments>http://blog.prashanthellina.com/2009/08/23/query-wikipedia-from-your-terminal/#comments</comments>
		<pubDate>Sun, 23 Aug 2009 05:34:23 +0000</pubDate>
		<dc:creator>prashanthellina</dc:creator>
				<category><![CDATA[linux]]></category>
		<category><![CDATA[web]]></category>
		<category><![CDATA[wikipedia]]></category>
		<category><![CDATA[bash]]></category>
		<category><![CDATA[code]]></category>
		<category><![CDATA[function]]></category>
		<category><![CDATA[productivity]]></category>
		<category><![CDATA[terminal]]></category>

		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=122</guid>
		<description><![CDATA[I refer to Wikipedia frequently. I use this BASH function to help me do that from the terminal. For an explanation of how this works, head over here. BASH function # wiki # eg: wiki India # wiki Apple_Inc # wiki Anglo_Saxon wiki() { dig +short txt $1.wp.dg.cx } Example usage prashanth@prashanth-desktop:~$ wiki India "India, officially the [...]]]></description>
			<content:encoded><![CDATA[<p>I refer to Wikipedia frequently. I use this BASH function to help me do that from the terminal. For an explanation of how this works, head over <a href="http://www.commandlinefu.com/commands/view/2829/query-wikipedia-via-console-over-dns">here</a>.</p>
<p><strong> BASH function </strong></p>
<pre lang="bash">
# wiki <page>
# eg: wiki India
#     wiki Apple_Inc
#     wiki Anglo_Saxon
wiki()
{
    # Ask DNS for the TXT record of the page; the quotes keep "$1" as a single word
    dig +short txt "$1.wp.dg.cx"
}
</pre>
<p><br/></p>
<p><strong> Example usage </strong></p>
<pre lang="bash">
prashanth@prashanth-desktop:~$ wiki India
"India, officially the Republic of India ( '\; see also other Indian languages), is a country in South Asia.
It is the seventh-largest country by geographical area, the second-most populous country, and the most
populous democracy in the world. Bounded by t" "he Indian Ocean on the south, the Arabian Sea on
the west, and the Bay of Bengal on the east, India has a coastline of ... http://a.vu/w:India"

prashanth@prashanth-desktop:~$ wiki Anglo_Saxon
"Anglo-Saxons (or Anglo-Saxon) is the term usually used to describe the invading tribes in the south
and east of Great Britain starting from the early 5th century AD, and their creation of the English
nation, lasting until the Norman conquest of 1066. The " "Benedictine monk, Bede, identified
them as the descendants of three Germanic tribes: http://a.vu/w:Anglo-Saxons"
</pre>
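<p>The output above arrives in several quoted pieces because a DNS TXT record is made up of strings of at most 255 bytes each, and dig prints every string in its own pair of quotes. Below is a minimal Python sketch for stitching one line of that output back together. The name <code>join_txt_chunks</code> is hypothetical, and the sketch only handles the simple backslash escapes visible above; it leaves dig's numeric escapes untouched.</p>

```python
import re

# One quoted chunk as printed by dig: "..." with backslash escapes allowed inside.
CHUNK = re.compile(r'"((?:[^"\\]|\\.)*)"')


def join_txt_chunks(dig_line):
    """Join the quoted chunks of one `dig +short txt` output line into one string."""
    chunks = CHUNK.findall(dig_line)
    joined = "".join(chunks)
    # Drop the backslash dig places before characters such as ; or "
    # (backslash followed by a digit is left alone).
    return re.sub(r'\\([^0-9])', r'\1', joined)


if __name__ == "__main__":
    print(join_txt_chunks('"Bounded by t" "he Indian Ocean on the south"'))
```

<p>The chunks are concatenated without a separator because the split can fall in the middle of a word, as it does in the India example.</p>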
]]></content:encoded>
			<wfw:commentRss>http://blog.prashanthellina.com/2009/08/23/query-wikipedia-from-your-terminal/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Topic extraction using Wikipedia data</title>
		<link>http://blog.prashanthellina.com/2007/12/21/topic-extraction-using-wikipedia-data/</link>
		<comments>http://blog.prashanthellina.com/2007/12/21/topic-extraction-using-wikipedia-data/#comments</comments>
		<pubDate>Fri, 21 Dec 2007 11:49:14 +0000</pubDate>
		<dc:creator>prashanthellina</dc:creator>
				<category><![CDATA[programming]]></category>
		<category><![CDATA[wikipedia]]></category>
		<category><![CDATA[data mining]]></category>
		<category><![CDATA[graph]]></category>
		<category><![CDATA[graphviz]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[semantic analysis]]></category>
		<category><![CDATA[text processing]]></category>
		<category><![CDATA[visualization]]></category>
		<category><![CDATA[web]]></category>

		<guid isPermaLink="false">http://blog.prashanthellina.com/2007/12/21/topic-extraction-using-wikipedia-data/</guid>
		<description><![CDATA[In an earlier article, I mentioned that I was trying to use Wikipedia data to do news article clustering to make it easier for me to follow news feeds. I have made some progress. I&#8217;ve written an algorithm to produce a list of Wikipedia articles relevant to the input text. Input text has to be in [...]]]></description>
			<content:encoded><![CDATA[<p><center><br />
    <img src="http://www.prashanthellina.com/images/wiki_topic_graph_header.png" alt="decorative graph header"/><br />
</center></p>
<p><br/></p>
<p>In an earlier <a href="/2007/10/17/ways-to-process-and-use-wikipedia-dumps/">article</a>, I mentioned that I was trying to use Wikipedia data to do <strong>news article clustering</strong> to make it easier for me to follow news feeds. I have made some progress. I&#8217;ve written an algorithm to produce a list of Wikipedia articles relevant to the input text. The input text has to be in English. The algorithm will not work well for very short pieces of text; it needs at least a paragraph or two of substantial text. The list of Wikipedia articles represents the &#8220;topic&#8221; of the input text.</p>
<h3>Test run</h3>
<p>To test the algorithm, I fed it the text of an earlier article from this blog (<a href="/2007/12/10/accessing-your-home-computer-from-the-internet/">Accessing your home computer from the internet</a>). The top Wikipedia articles in the output are:</p>
<ul>
<li>Internet</li>
<li>Domain Name System (DNS)</li>
<li>IP Address</li>
<li>Hypertext Transfer Protocol (HTTP)</li>
<li>Modem</li>
<li>World Wide Web (WWW)</li>
<li>Domain Name</li>
<li>Dynamic Host Configuration Protocol (DHCP)</li>
<li>Internet Service Provider</li>
<li>Network Address Translation (NAT)</li>
<li>Firewall</li>
</ul>
<h3>How it works</h3>
<p>The basis of the algorithm is to find Wikipedia article titles occurring in the input text. The &#8220;found&#8221; set of Wikipedia articles is then used to construct a sub-graph of the Wikipedia graph (formed by the links between Wikipedia pages). The most interconnected nodes turn out to be the relevant ones. However, as I have not applied any filtering to the input text, a lot of &#8220;junk&#8221; matches occur. For example, the word &#8220;let&#8221; is picked up because it matches a Wikipedia article of the same title, which redirects to Lashkar-e-Toiba. This is totally irrelevant to the input text. To remove such spurious matches, I dropped the least interconnected nodes and constructed a sub-graph from the remaining nodes. Within that sub-graph, I recomputed the interconnection of each node.</p>
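<p>The two phases described above can be sketched roughly as follows. This is not the original code, only a minimal Python illustration under stated assumptions: titles are matched as lowercase word sequences, &#8220;interconnection&#8221; is taken to be the number of links a node has to other found nodes, and the cut-off <code>min_score</code> as well as every function name and the toy data are hypothetical.</p>

```python
import re


def find_titles(text, titles):
    """Phase 0: return the known (lowercase) titles that occur in the text."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    longest = max((len(t.split()) for t in titles), default=0)
    found = set()
    for n in range(1, longest + 1):
        for i in range(len(words) - n + 1):
            phrase = " ".join(words[i:i + n])
            if phrase in titles:
                found.add(phrase)
    return found


def interconnection(nodes, links):
    """Score each node by its links to and from other nodes of the same set."""
    score = dict.fromkeys(nodes, 0)
    for src in nodes:
        for dst in links.get(src, ()):
            if dst in nodes and dst != src:
                score[src] += 1
                score[dst] += 1
    return score


def extract_topics(text, titles, links, min_score=1):
    """Phase 1: score all found nodes. Phase 2: drop weak nodes and rescore."""
    found = find_titles(text, titles)
    first = interconnection(found, links)
    kept = {node for node, s in first.items() if s >= min_score}
    second = interconnection(kept, links)
    # Most interconnected first; ties broken alphabetically.
    return sorted(second, key=lambda node: (-second[node], node))


if __name__ == "__main__":
    # Toy stand-in for the Wikipedia title list and link graph (assumed data).
    titles = {"internet", "ip address", "modem", "let"}
    links = {
        "internet": {"ip address", "modem"},
        "ip address": {"internet"},
        "modem": {"internet"},
        "let": {"lashkar-e-toiba"},
    }
    text = "Let us connect the modem to the Internet and note the IP address."
    print(extract_topics(text, titles, links))
```

<p>In this toy run the word &#8220;let&#8221; is matched in the first phase but has no links to the other found articles, so it is dropped before the second scoring pass.</p>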
<p>Below is the output of the first phase. This graph contains all nodes found from matching phrases in the input text. The <strong>nodes of darker blue are more relevant than lighter ones</strong>. The <strong>darker and thicker a link is, the more relevant</strong> it is.<br />
<a href="http://www.prashanthellina.com/images/wiki_topic_full_graph_big.png"><br />
    <img src="http://www.prashanthellina.com/images/wiki_topic_full_graph.png" alt="full graph with all found wikipedia titles"/><br />
</a><br />
Download the higher resolution image <a href="http://www.prashanthellina.com/images/wiki_topic_full_graph_big.png">here</a>. <strong>8.2MB</strong></p>
<p>Many of the extracted articles are not relevant to the input text. Some of these spurious nodes are totally <strong>disconnected from the main body</strong> of the graph.<br />
<img src="http://www.prashanthellina.com/images/wiki_topic_disconnected_nodes.png" alt="disconnected nodes in the full graph"/></p>
<p>This is a slightly higher resolution picture of a <strong>section of the full graph</strong> above.<br />
<img src="http://www.prashanthellina.com/images/wiki_topic_full_graph_section.png" alt="section of the full graph containing some relevant nodes"/></p>
<p>Below is the <strong>output of the second phase</strong> of the algorithm, where the relevant nodes are extracted and a <strong>sub-graph</strong> is computed. Node and edge relevances are recomputed within this set.<br />
<a href="http://www.prashanthellina.com/images/wiki_topic_sub_graph_big.png"><br />
    <img src="http://www.prashanthellina.com/images/wiki_topic_sub_graph.png" alt="sub graph containing only relevant nodes"/><br />
</a><br />
Download the higher resolution image <a href="http://www.prashanthellina.com/images/wiki_topic_sub_graph_big.png">here</a>. <strong>1.9MB</strong></p>
<p>All the graphs above were produced using <a href="http://www.graphviz.org">Graphviz</a>.</p>
<h3>What next</h3>
<p>I tried applying the logic to some sample input texts and the results look very encouraging. The next step towards news article clustering would be to apply the topic extraction algorithm to multiple news articles and look for common Wikipedia articles (maybe a plain intersection). I have not yet given much thought to this stage. Once I do, I will post back.</p>
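<p>As a rough idea of what the &#8220;plain intersection&#8221; could look like, here is a small Python sketch. It is only an assumption about one possible approach: the function name <code>shared_topics</code>, the <code>min_shared</code> threshold and the sample data are all made up for illustration.</p>

```python
from itertools import combinations


def shared_topics(article_topics, min_shared=2):
    """Map each pair of articles to the topics they have in common.

    article_topics maps an article id to its list of extracted topics.
    Only pairs sharing at least min_shared topics are returned.
    """
    pairs = {}
    for (a, topics_a), (b, topics_b) in combinations(sorted(article_topics.items()), 2):
        common = set(topics_a) & set(topics_b)
        if len(common) >= min_shared:
            pairs[(a, b)] = common
    return pairs


if __name__ == "__main__":
    sample = {
        "news-1": ["internet", "modem", "firewall"],
        "news-2": ["internet", "firewall", "dhcp"],
        "news-3": ["cricket"],
    }
    print(shared_topics(sample))
```

<p>Pairs that share enough topics would then be candidates for the same cluster; how to merge such pairs into larger groups is left open here.</p>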
<p>As I said before, Wikipedia amazes me every time I use it. The wealth of information (both as text and as interconnects) is astounding. As a token of appreciation, I&#8217;ve donated a small amount to the current Wikipedia donation round. If you like Wikipedia and have used it, do consider making a donation.</p>
]]></content:encoded>
			<wfw:commentRss>http://blog.prashanthellina.com/2007/12/21/topic-extraction-using-wikipedia-data/feed/</wfw:commentRss>
		<slash:comments>12</slash:comments>
		</item>
		<item>
		<title>Wikipedia Category Graph Generator</title>
		<link>http://blog.prashanthellina.com/2007/11/12/wikipedia-category-graph-generator/</link>
		<comments>http://blog.prashanthellina.com/2007/11/12/wikipedia-category-graph-generator/#comments</comments>
		<pubDate>Mon, 12 Nov 2007 15:50:30 +0000</pubDate>
		<dc:creator>prashanthellina</dc:creator>
				<category><![CDATA[wikipedia]]></category>
		<category><![CDATA[categories]]></category>
		<category><![CDATA[graph]]></category>

		<guid isPermaLink="false">http://blog.prashanthellina.com/2007/11/12/wikipedia-category-graph-generator/</guid>
		<description><![CDATA[I was in the process of trying to understand the classification schemes available in Wikipedia (categories, lists and navigation maps) when I came across this nifty tool. It is very useful for understanding the inter-relationships between Wikipedia categories. You can check it out here: http://tools.wikimedia.de/~dapete/catgraph/]]></description>
			<content:encoded><![CDATA[<p>I was in the process of trying to understand the classification schemes available in Wikipedia (categories, lists and navigation maps) when I came across this nifty tool. It is very useful for understanding the inter-relationships between Wikipedia categories.</p>
<p>You can check it out here: <a href="http://tools.wikimedia.de/~dapete/catgraph/">http://tools.wikimedia.de/~dapete/catgraph/</a></p>
<p><center><img src="http://www.prashanthellina.com/images/wiki_cat_graph.png"/></center></p>
]]></content:encoded>
			<wfw:commentRss>http://blog.prashanthellina.com/2007/11/12/wikipedia-category-graph-generator/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
	</channel>
</rss>

