Clustering Data using Python

As a part of a project I am working on, I had to cluster urls on a page. After some light googling I found, python-cluster. You can find below a simple python script to illustrate the usage of python-cluster library.

Code

import pprint
from difflib import SequenceMatcher

# http://python-cluster.sourceforge.net/
from cluster import HierarchicalClustering

# input urls to be clustered
urls = [
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814385',
    '#articles',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814335',
    'http://yro.slashdot.org/~drDugan/',
    'http://web.sourceforge.com/privacy.php',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815123',
    'http://slashdot.org//slashdot.org/~Darkness404',
    'http://slashdot.org//radio.slashdot.org',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814429',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814457',
    'http://slashdot.org//slashdot.org/article.pl?sid=09/07/24/1545238',
    'http://slashdot.org//slashdot.org/comments.pl?sid=09/07/24/1545238&cid=28810581',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815269',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814657',
    'http://web.sourceforge.com/terms.php'
    'http://slashdot.org//it.slashdot.org/search',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814581',
    'http://xkcd.com/612/',
    'http://web.sourceforge.com/advertising',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814785',
]

# distance function compares two urls and finds the distance
# uses SequenceMatcher from python standard module difflib
def distance(url1, url2):
    ratio = SequenceMatcher(None, url1, url2).ratio()
    return 1.0 - ratio

# Perform clustering
hc = HierarchicalClustering(urls, distance)
clusters = hc.getlevel(0.2)

pprint.pprint(clusters)


Output

[['#articles'],
 ['http://xkcd.com/612/'],
 ['http://web.sourceforge.com/privacy.php'],
 ['http://web.sourceforge.com/advertising'],
 ['http://web.sourceforge.com/terms.phphttp://slashdot.org//it.slashdot.org/search'],
 ['http://yro.slashdot.org/~drDugan/'],
 ['http://slashdot.org//slashdot.org/~Darkness404'],
 ['http://slashdot.org//radio.slashdot.org'],
 ['http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814785',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814429',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814457'],
 ['http://slashdot.org//slashdot.org/article.pl?sid=09/07/24/1545238',
  'http://slashdot.org//slashdot.org/comments.pl?sid=09/07/24/1545238&cid=28810581',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815123',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815269',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814385',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814335',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814657',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814581']]


No Trackbacks

5 Comments

  1. Hey prashant,
    Blog came after a long time indeed.
    I think that taking into account the structure of url, can make clustering more accurate.
    Like for example two urls from same domain should be closer to each other than anything else, etc,
    so try using this function for calculating distance between two urls.

    # distance function compares two urls and finds the distance ,also considering the structure of url
    #into account and having different weights for different parts of url, as so to give precedence of
    #domain over params and so on
    # uses SequenceMatcher from python standard module difflib
    def customDistance(url1, url2):
    url1_frag = urlparse(url1)
    url2_frag = urlparse(url2)
    ratio = 0.0
    # weights for now are [6,5,4,3,2,1] for each of the elements of urlparse tuple, maximum 6 for domain and then decreasing.
    #ratio sums (user assigned weight of the urlparse part) * (ratio returned by Sequence matcher )
    #denominator is sum of the user assigned weights,
    for index in xrange(0,6):
    ratio += (6-index) * (1 – SequenceMatcher(None, url1_frag[index],url2_frag[index] ).ratio())
    return ratio / 21

    Posted July 25, 2009 at 3:02 pm | Permalink
  2. As indentation is got all messed up, you can see the code snippet at http://python.pastebin.com/fceb4031

    Posted July 25, 2009 at 3:05 pm | Permalink
  3. Like the previous comment says, the distance function can be improved. python rocks!

    Posted July 26, 2009 at 7:57 am | Permalink
  4. Wonderful suggestion Ashish. I was planning something similar to enhance the clustering. You have saved me some work.

    Output of the above script after having replaced with your distance function.

    [['#articles'],
     ['http://xkcd.com/612/'],
     ['http://web.sourceforge.com/terms.phphttp://slashdot.org//it.slashdot.org/search',
      'http://web.sourceforge.com/privacy.php',
      'http://web.sourceforge.com/advertising'],
     ['http://yro.slashdot.org/~drDugan/',
      'http://slashdot.org//slashdot.org/article.pl?sid=09/07/24/1545238',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814785',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814429',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814457',
      'http://slashdot.org//slashdot.org/comments.pl?sid=09/07/24/1545238&cid=28810581',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815123',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815269',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814385',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814335',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814657',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814581',
      'http://slashdot.org//slashdot.org/~Darkness404',
      'http://slashdot.org//radio.slashdot.org']]
    
    Posted July 27, 2009 at 8:04 am | Permalink
  5. Ashish,

    I am reposting your distance function code with indentation preserved.

    #Distance function compares two urls and finds the distance ,also considering the structure of url
    #into account and having different weights for different parts of url, as so to give precedence of
    #domain over params and so on
    # uses SequenceMatcher from python standard module difflib
    def distance(url1, url2):
        url1_frag = urlparse(url1)
        url2_frag = urlparse(url2)
        ratio = 0.0
        #  weights for now are [6,5,4,3,2,1]  for each of the tuples of urlparse
        #ratio sums (user assigned weight of the urlparse part) * (ratio returned by Sequence matcher )
        for index in xrange(0,6):
            ratio += (6-index) * (1.0 - SequenceMatcher(None, url1_frag[index],url2_frag[index] ).ratio())
        return ratio / 21.0
    

    People, note that you will have to add this line to the top of the script to use this function.

    from urlparse import urlparse
    
    Posted July 27, 2009 at 8:06 am | Permalink