<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Clustering Data using Python</title>
	<atom:link href="http://blog.prashanthellina.com/2009/07/25/clustering-data-using-python/feed/" rel="self" type="application/rss+xml" />
	<link>http://blog.prashanthellina.com/2009/07/25/clustering-data-using-python/</link>
	<description>In Pursuit of Truth</description>
	<lastBuildDate>Fri, 09 Apr 2010 10:27:43 +0000</lastBuildDate>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.2.1</generator>
	<item>
		<title>By: prashanthellina</title>
		<link>http://blog.prashanthellina.com/2009/07/25/clustering-data-using-python/comment-page-1/#comment-31388</link>
		<dc:creator>prashanthellina</dc:creator>
		<pubDate>Mon, 27 Jul 2009 02:48:02 +0000</pubDate>
		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=93#comment-31388</guid>
		<description>Ashish,

I am reposting your distance function code with indentation preserved.

&lt;pre lang=&quot;python&quot;&gt;
# Compares two URLs and returns the distance between them, taking the structure
# of the URL into account: each part of the URL gets its own weight, so that
# earlier parts (scheme, domain) take precedence over later ones (params, query).
# Uses SequenceMatcher from the Python standard library module difflib.
def distance(url1, url2):
    url1_frag = urlparse(url1)
    url2_frag = urlparse(url2)
    ratio = 0.0
    # Weights for now are [6, 5, 4, 3, 2, 1], one for each element of the urlparse tuple.
    # ratio sums (user-assigned weight of the part) * (1 - ratio returned by SequenceMatcher).
    for index in xrange(0, 6):
        ratio += (6 - index) * (1.0 - SequenceMatcher(None, url1_frag[index], url2_frag[index]).ratio())
    # 21.0 is the sum of the weights.
    return ratio / 21.0
&lt;/pre&gt;
&lt;br/&gt;

People, note that you will have to add this line to the top of the script to use this function.

&lt;pre lang=&quot;python&quot;&gt;
from urlparse import urlparse
&lt;/pre&gt;</description>
		<content:encoded><![CDATA[<p>Ashish,</p>
<p>I am reposting your distance function code with indentation preserved.</p>
<pre lang="python">
# Compares two URLs and returns the distance between them, taking the structure
# of the URL into account: each part of the URL gets its own weight, so that
# earlier parts (scheme, domain) take precedence over later ones (params, query).
# Uses SequenceMatcher from the Python standard library module difflib.
def distance(url1, url2):
    url1_frag = urlparse(url1)
    url2_frag = urlparse(url2)
    ratio = 0.0
    # Weights for now are [6, 5, 4, 3, 2, 1], one for each element of the urlparse tuple.
    # ratio sums (user-assigned weight of the part) * (1 - ratio returned by SequenceMatcher).
    for index in xrange(0, 6):
        ratio += (6 - index) * (1.0 - SequenceMatcher(None, url1_frag[index], url2_frag[index]).ratio())
    # 21.0 is the sum of the weights.
    return ratio / 21.0
</pre>
<p></p>
<p>People, note that you will have to add this line to the top of the script to use this function.</p>
<pre lang="python">
from urlparse import urlparse
</pre>
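<p>For readers on Python 3, here is a minimal sketch of the same weighted distance. It is a port made under assumptions, not the code posted in the comments: it imports <code>urlparse</code> from <code>urllib.parse</code>, iterates with <code>zip</code> instead of <code>xrange</code>, and introduces a hypothetical <code>WEIGHTS</code> tuple. The URLs in the demo are only illustrative.</p>

```python
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Hypothetical name: one weight per urlparse element, in the order
# (scheme, netloc, path, params, query, fragment).
WEIGHTS = (6, 5, 4, 3, 2, 1)


def distance(url1, url2):
    """Weighted dissimilarity of two URLs; 0.0 means the URLs are identical."""
    parts1 = urlparse(url1)
    parts2 = urlparse(url2)
    total = 0.0
    for weight, a, b in zip(WEIGHTS, parts1, parts2):
        # ratio() is a similarity in [0, 1]; 1 - ratio turns it into a distance.
        total += weight * (1.0 - SequenceMatcher(None, a, b).ratio())
    # Dividing by the sum of the weights (21) keeps the result in [0, 1].
    return total / sum(WEIGHTS)


if __name__ == "__main__":
    same_site = distance("http://xkcd.com/612/", "http://xkcd.com/613/")
    other_site = distance("http://xkcd.com/612/", "http://web.sourceforge.com/privacy.php")
    print(same_site, other_site)
```

<p>With these weights, two URLs that share a scheme and domain should come out much closer than URLs from different sites.</p>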
]]></content:encoded>
	</item>
	<item>
		<title>By: prashanthellina</title>
		<link>http://blog.prashanthellina.com/2009/07/25/clustering-data-using-python/comment-page-1/#comment-31387</link>
		<dc:creator>prashanthellina</dc:creator>
		<pubDate>Mon, 27 Jul 2009 02:46:22 +0000</pubDate>
		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=93#comment-31387</guid>
		<description>Wonderful suggestion, Ashish. I was planning something similar to enhance the clustering. You have saved me some work.

Here is the output of the above script after replacing its distance function with yours:
&lt;pre lang=&quot;python&quot;&gt;
[[&#039;#articles&#039;],
 [&#039;http://xkcd.com/612/&#039;],
 [&#039;http://web.sourceforge.com/terms.phphttp://slashdot.org//it.slashdot.org/search&#039;,
  &#039;http://web.sourceforge.com/privacy.php&#039;,
  &#039;http://web.sourceforge.com/advertising&#039;],
 [&#039;http://yro.slashdot.org/~drDugan/&#039;,
  &#039;http://slashdot.org//slashdot.org/article.pl?sid=09/07/24/1545238&#039;,
  &#039;http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;op=Reply&amp;threshold=1&amp;commentsort=0&amp;mode=thread&amp;pid=28814785&#039;,
  &#039;http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;op=Reply&amp;threshold=1&amp;commentsort=0&amp;mode=thread&amp;pid=28814429&#039;,
  &#039;http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;op=Reply&amp;threshold=1&amp;commentsort=0&amp;mode=thread&amp;pid=28814457&#039;,
  &#039;http://slashdot.org//slashdot.org/comments.pl?sid=09/07/24/1545238&amp;cid=28810581&#039;,
  &#039;http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;cid=28815123&#039;,
  &#039;http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;cid=28815269&#039;,
  &#039;http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;cid=28814385&#039;,
  &#039;http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;cid=28814335&#039;,
  &#039;http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;cid=28814657&#039;,
  &#039;http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;cid=28814581&#039;,
  &#039;http://slashdot.org//slashdot.org/~Darkness404&#039;,
  &#039;http://slashdot.org//radio.slashdot.org&#039;]]
&lt;/pre&gt;</description>
		<content:encoded><![CDATA[<p>Wonderful suggestion, Ashish. I was planning something similar to enhance the clustering. You have saved me some work.</p>
<p>Here is the output of the above script after replacing its distance function with yours:</p>
<pre lang="python">
[['#articles'],
 ['http://xkcd.com/612/'],
 ['http://web.sourceforge.com/terms.phphttp://slashdot.org//it.slashdot.org/search',
  'http://web.sourceforge.com/privacy.php',
  'http://web.sourceforge.com/advertising'],
 ['http://yro.slashdot.org/~drDugan/',
  'http://slashdot.org//slashdot.org/article.pl?sid=09/07/24/1545238',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;op=Reply&amp;threshold=1&amp;commentsort=0&amp;mode=thread&amp;pid=28814785',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;op=Reply&amp;threshold=1&amp;commentsort=0&amp;mode=thread&amp;pid=28814429',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;op=Reply&amp;threshold=1&amp;commentsort=0&amp;mode=thread&amp;pid=28814457',
  'http://slashdot.org//slashdot.org/comments.pl?sid=09/07/24/1545238&amp;cid=28810581',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;cid=28815123',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;cid=28815269',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;cid=28814385',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;cid=28814335',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;cid=28814657',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&amp;cid=28814581',
  'http://slashdot.org//slashdot.org/~Darkness404',
  'http://slashdot.org//radio.slashdot.org']]
</pre>
]]></content:encoded>
	</item>
	<item>
		<title>By: sriram</title>
		<link>http://blog.prashanthellina.com/2009/07/25/clustering-data-using-python/comment-page-1/#comment-31297</link>
		<dc:creator>sriram</dc:creator>
		<pubDate>Sun, 26 Jul 2009 02:39:02 +0000</pubDate>
		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=93#comment-31297</guid>
		<description>As the previous comment says, the distance function can be improved. Python rocks!</description>
		<content:encoded><![CDATA[<p>As the previous comment says, the distance function can be improved. Python rocks!</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ashish Yadav</title>
		<link>http://blog.prashanthellina.com/2009/07/25/clustering-data-using-python/comment-page-1/#comment-31250</link>
		<dc:creator>Ashish Yadav</dc:creator>
		<pubDate>Sat, 25 Jul 2009 09:47:58 +0000</pubDate>
		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=93#comment-31250</guid>
		<description>As the indentation got all messed up, you can see the code snippet at http://python.pastebin.com/fceb4031</description>
		<content:encoded><![CDATA[<p>As the indentation got all messed up, you can see the code snippet at <a href="http://python.pastebin.com/fceb4031" rel="nofollow">http://python.pastebin.com/fceb4031</a></p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Ashish Yadav</title>
		<link>http://blog.prashanthellina.com/2009/07/25/clustering-data-using-python/comment-page-1/#comment-31248</link>
		<dc:creator>Ashish Yadav</dc:creator>
		<pubDate>Sat, 25 Jul 2009 09:44:39 +0000</pubDate>
		<guid isPermaLink="false">http://blog.prashanthellina.com/?p=93#comment-31248</guid>
		<description>Hey prashant,
A new blog post after a long time indeed.
I think that taking the structure of the URL into account can make the clustering more accurate.
For example, two URLs from the same domain should be closer to each other than to anything else,
so try using this function for calculating the distance between two URLs.

# Compares two URLs and returns the distance between them, taking the structure
# of the URL into account: each part of the URL gets its own weight, so that
# earlier parts (scheme, domain) take precedence over later ones (params, query).
# Uses SequenceMatcher from the Python standard library module difflib.
def customDistance(url1, url2):
    url1_frag = urlparse(url1)
    url2_frag = urlparse(url2)
    ratio = 0.0
    # Weights for now are [6, 5, 4, 3, 2, 1], one for each element of the urlparse tuple:
    # 6 for the scheme, 5 for the domain, and then decreasing.
    # ratio sums (user-assigned weight of the part) * (1 - ratio returned by SequenceMatcher).
    # The denominator is the sum of the user-assigned weights.
    for index in xrange(0, 6):
        ratio += (6 - index) * (1 - SequenceMatcher(None, url1_frag[index], url2_frag[index]).ratio())
    return ratio / 21</description>
		<content:encoded><![CDATA[<p>Hey prashant,<br />
A new blog post after a long time indeed.<br />
I think that taking the structure of the URL into account can make the clustering more accurate.<br />
For example, two URLs from the same domain should be closer to each other than to anything else,<br />
so try using this function for calculating the distance between two URLs.</p>
<p># Compares two URLs and returns the distance between them, taking the structure<br />
# of the URL into account: each part of the URL gets its own weight, so that<br />
# earlier parts (scheme, domain) take precedence over later ones (params, query).<br />
# Uses SequenceMatcher from the Python standard library module difflib.<br />
def customDistance(url1, url2):<br />
    url1_frag = urlparse(url1)<br />
    url2_frag = urlparse(url2)<br />
    ratio = 0.0<br />
    # Weights for now are [6, 5, 4, 3, 2, 1], one for each element of the urlparse tuple:<br />
    # 6 for the scheme, 5 for the domain, and then decreasing.<br />
    # ratio sums (user-assigned weight of the part) * (1 - ratio returned by SequenceMatcher).<br />
    # The denominator is the sum of the user-assigned weights.<br />
    for index in xrange(0, 6):<br />
        ratio += (6 - index) * (1 - SequenceMatcher(None, url1_frag[index], url2_frag[index]).ratio())<br />
    return ratio / 21</p>
]]></content:encoded>
	</item>
</channel>
</rss>

