As a part of a project I am working on, I had to cluster urls on a page. After some light googling I found, python-cluster. You can find below a simple python script to illustrate the usage of python-cluster library.
Code
import pprint
from difflib import SequenceMatcher
# http://python-cluster.sourceforge.net/
from cluster import HierarchicalClustering
# input urls to be clustered
urls = [
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814385',
'#articles',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814335',
'http://yro.slashdot.org/~drDugan/',
'http://web.sourceforge.com/privacy.php',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815123',
'http://slashdot.org//slashdot.org/~Darkness404',
'http://slashdot.org//radio.slashdot.org',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814429',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814457',
'http://slashdot.org//slashdot.org/article.pl?sid=09/07/24/1545238',
'http://slashdot.org//slashdot.org/comments.pl?sid=09/07/24/1545238&cid=28810581',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815269',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814657',
'http://web.sourceforge.com/terms.php'
'http://slashdot.org//it.slashdot.org/search',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814581',
'http://xkcd.com/612/',
'http://web.sourceforge.com/advertising',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814785',
]
# distance function compares two urls and finds the distance
# uses SequenceMatcher from python standard module difflib
def distance(url1, url2):
ratio = SequenceMatcher(None, url1, url2).ratio()
return 1.0 - ratio
# Perform clustering
hc = HierarchicalClustering(urls, distance)
clusters = hc.getlevel(0.2)
pprint.pprint(clusters)
Output
[['#articles'],
['http://xkcd.com/612/'],
['http://web.sourceforge.com/privacy.php'],
['http://web.sourceforge.com/advertising'],
['http://web.sourceforge.com/terms.phphttp://slashdot.org//it.slashdot.org/search'],
['http://yro.slashdot.org/~drDugan/'],
['http://slashdot.org//slashdot.org/~Darkness404'],
['http://slashdot.org//radio.slashdot.org'],
['http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814785',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814429',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814457'],
['http://slashdot.org//slashdot.org/article.pl?sid=09/07/24/1545238',
'http://slashdot.org//slashdot.org/comments.pl?sid=09/07/24/1545238&cid=28810581',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815123',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815269',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814385',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814335',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814657',
'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814581']]
Tagged: clustering, python, script
No Trackbacks
5 Comments
Hey prashant,
Blog came after a long time indeed.
I think that taking into account the structure of url, can make clustering more accurate.
Like for example two urls from same domain should be closer to each other than anything else, etc,
so try using this function for calculating distance between two urls.
# distance function compares two urls and finds the distance ,also considering the structure of url
#into account and having different weights for different parts of url, as so to give precedence of
#domain over params and so on
# uses SequenceMatcher from python standard module difflib
def customDistance(url1, url2):
url1_frag = urlparse(url1)
url2_frag = urlparse(url2)
ratio = 0.0
# weights for now are [6,5,4,3,2,1] for each of the elements of urlparse tuple, maximum 6 for domain and then decreasing.
#ratio sums (user assigned weight of the urlparse part) * (ratio returned by Sequence matcher )
#denominator is sum of the user assigned weights,
for index in xrange(0,6):
ratio += (6-index) * (1 – SequenceMatcher(None, url1_frag[index],url2_frag[index] ).ratio())
return ratio / 21
As indentation is got all messed up, you can see the code snippet at http://python.pastebin.com/fceb4031
Like the previous comment says, the distance function can be improved. python rocks!
Wonderful suggestion Ashish. I was planning something similar to enhance the clustering. You have saved me some work.
Output of the above script after having replaced with your distance function.
Ashish,
I am reposting your distance function code with indentation preserved.
#Distance function compares two urls and finds the distance ,also considering the structure of url #into account and having different weights for different parts of url, as so to give precedence of #domain over params and so on # uses SequenceMatcher from python standard module difflib def distance(url1, url2): url1_frag = urlparse(url1) url2_frag = urlparse(url2) ratio = 0.0 # weights for now are [6,5,4,3,2,1] for each of the tuples of urlparse #ratio sums (user assigned weight of the urlparse part) * (ratio returned by Sequence matcher ) for index in xrange(0,6): ratio += (6-index) * (1.0 - SequenceMatcher(None, url1_frag[index],url2_frag[index] ).ratio()) return ratio / 21.0People, note that you will have to add this line to the top of the script to use this function.