INCLUDE_DATA
( to ) ? be : ! be;
sidebar left sidebar right

Clustering Data using Python (1,148 views)

As a part of a project I am working on, I had to cluster urls on a page. After some light googling I found, python-cluster. You can find below a simple python script to illustrate the usage of python-cluster library.

Code

import pprint
from difflib import SequenceMatcher

# http://python-cluster.sourceforge.net/
from cluster import HierarchicalClustering

# input urls to be clustered
urls = [
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814385',
    '#articles',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814335',
    'http://yro.slashdot.org/~drDugan/',
    'http://web.sourceforge.com/privacy.php',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815123',
    'http://slashdot.org//slashdot.org/~Darkness404',
    'http://slashdot.org//radio.slashdot.org',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814429',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814457',
    'http://slashdot.org//slashdot.org/article.pl?sid=09/07/24/1545238',
    'http://slashdot.org//slashdot.org/comments.pl?sid=09/07/24/1545238&cid=28810581',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815269',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814657',
    'http://web.sourceforge.com/terms.php'
    'http://slashdot.org//it.slashdot.org/search',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814581',
    'http://xkcd.com/612/',
    'http://web.sourceforge.com/advertising',
    'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814785',
]

# distance function compares two urls and finds the distance
# uses SequenceMatcher from python standard module difflib
def distance(url1, url2):
    ratio = SequenceMatcher(None, url1, url2).ratio()
    return 1.0 - ratio

# Perform clustering
hc = HierarchicalClustering(urls, distance)
clusters = hc.getlevel(0.2)

pprint.pprint(clusters)


Output

[['#articles'],
 ['http://xkcd.com/612/'],
 ['http://web.sourceforge.com/privacy.php'],
 ['http://web.sourceforge.com/advertising'],
 ['http://web.sourceforge.com/terms.phphttp://slashdot.org//it.slashdot.org/search'],
 ['http://yro.slashdot.org/~drDugan/'],
 ['http://slashdot.org//slashdot.org/~Darkness404'],
 ['http://slashdot.org//radio.slashdot.org'],
 ['http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814785',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814429',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814457'],
 ['http://slashdot.org//slashdot.org/article.pl?sid=09/07/24/1545238',
  'http://slashdot.org//slashdot.org/comments.pl?sid=09/07/24/1545238&cid=28810581',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815123',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815269',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814385',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814335',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814657',
  'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814581']]


subscribe to feed

Tags: , ,

"Clustering Data using Python" was published on July 25th, 2009 and is listed in programming, python, text processing.

Comments on "Clustering Data using Python": 5 Comments

  1. Ashish Yadav wrote,

    Hey prashant,
    Blog came after a long time indeed.
    I think that taking into account the structure of url, can make clustering more accurate.
    Like for example two urls from same domain should be closer to each other than anything else, etc,
    so try using this function for calculating distance between two urls.

    # distance function compares two urls and finds the distance ,also considering the structure of url
    #into account and having different weights for different parts of url, as so to give precedence of
    #domain over params and so on
    # uses SequenceMatcher from python standard module difflib
    def customDistance(url1, url2):
    url1_frag = urlparse(url1)
    url2_frag = urlparse(url2)
    ratio = 0.0
    # weights for now are [6,5,4,3,2,1] for each of the elements of urlparse tuple, maximum 6 for domain and then decreasing.
    #ratio sums (user assigned weight of the urlparse part) * (ratio returned by Sequence matcher )
    #denominator is sum of the user assigned weights,
    for index in xrange(0,6):
    ratio += (6-index) * (1 – SequenceMatcher(None, url1_frag[index],url2_frag[index] ).ratio())
    return ratio / 21

  2. Ashish Yadav wrote,

    As indentation is got all messed up, you can see the code snippet at http://python.pastebin.com/fceb4031

  3. sriram wrote,

    Like the previous comment says, the distance function can be improved. python rocks!

  4. prashanthellina wrote,

    Wonderful suggestion Ashish. I was planning something similar to enhance the clustering. You have saved me some work.

    Output of the above script after having replaced with your distance function.

    [['#articles'],
     ['http://xkcd.com/612/'],
     ['http://web.sourceforge.com/terms.phphttp://slashdot.org//it.slashdot.org/search',
      'http://web.sourceforge.com/privacy.php',
      'http://web.sourceforge.com/advertising'],
     ['http://yro.slashdot.org/~drDugan/',
      'http://slashdot.org//slashdot.org/article.pl?sid=09/07/24/1545238',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814785',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814429',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&op=Reply&threshold=1&commentsort=0&mode=thread&pid=28814457',
      'http://slashdot.org//slashdot.org/comments.pl?sid=09/07/24/1545238&cid=28810581',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815123',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28815269',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814385',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814335',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814657',
      'http://slashdot.org//it.slashdot.org/comments.pl?sid=1314601&cid=28814581',
      'http://slashdot.org//slashdot.org/~Darkness404',
      'http://slashdot.org//radio.slashdot.org']]
    
  5. prashanthellina wrote,

    Ashish,

    I am reposting your distance function code with indentation preserved.

    #Distance function compares two urls and finds the distance ,also considering the structure of url
    #into account and having different weights for different parts of url, as so to give precedence of
    #domain over params and so on
    # uses SequenceMatcher from python standard module difflib
    def distance(url1, url2):
        url1_frag = urlparse(url1)
        url2_frag = urlparse(url2)
        ratio = 0.0
        #  weights for now are [6,5,4,3,2,1]  for each of the tuples of urlparse
        #ratio sums (user assigned weight of the urlparse part) * (ratio returned by Sequence matcher )
        for index in xrange(0,6):
            ratio += (6-index) * (1.0 - SequenceMatcher(None, url1_frag[index],url2_frag[index] ).ratio())
        return ratio / 21.0
    

    People, note that you will have to add this line to the top of the script to use this function.

    from urlparse import urlparse
    

558936 views

Prashanth Ellina is powered by WordPress

No Complaints Shifter Series Theme by Buzzdroid.com
Computers blogarama - the blog directory Blog Flux Directory Blog Directory & Search engine Computer Blogs - Blog Catalog Blog Directory Computers blogs Bloggeries Blog Directory blog directory Computers Blog Blog Search, Blog Directory p Listed in LS Blogs the Blog Directory and Blog Search Engine Blog Review Blog search - categorized blog directory Link With Us - Web Directory Find Blogs in the Blog
Directory Blog Directory