Topic extraction using Wikipedia data




In an earlier article, I mentioned that I was trying to use Wikipedia data to cluster news articles so that following news feeds becomes easier for me. I have made some progress. I have written an algorithm that produces a list of Wikipedia articles relevant to the input text. The input text has to be in English, and the algorithm will not work well for very short pieces of text; at least a paragraph or two of sizable text is required. The resulting list of Wikipedia articles represents the “topic” of the input text.

Test run

To test the algorithm, I fed it the text of an earlier article from this blog (Accessing your home computer from the internet). The top Wikipedia articles in the output are:

  • Internet
  • Domain Name System (DNS)
  • IP Address
  • Hypertext Transfer Protocol (HTTP)
  • Modem
  • World Wide Web (WWW)
  • Domain Name
  • Dynamic Host Configuration Protocol (DHCP)
  • Internet Service Provider
  • Network Address Translation (NAT)
  • Firewall

How it works

The basis of the algorithm is to find Wikipedia article titles occurring in the input text. The “found” set of Wikipedia articles is then used to construct a sub-graph of the Wikipedia graph (formed by the links between Wikipedia pages). The most interconnected nodes in this sub-graph turn out to be the relevant ones. However, since I have not applied any filtering to the input text, a lot of “junk” matches occur. For example, the word “let” is picked up and matches a Wikipedia article of the same title, which redirects to Lashkar-e-Toiba and is totally irrelevant to the input text. To remove such spurious matches, I dropped the least interconnected nodes, constructed a sub-graph from the remaining ones, and then recomputed node interconnection within that sub-graph.
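
To make this concrete, below is a minimal sketch in Python of how such a first phase could be wired up. It is illustrative rather than the actual code behind this post: it assumes a pre-built dictionary wiki_links mapping every Wikipedia article title to the set of titles it links to (extracted beforehand from a Wikipedia dump), uses naive substring matching to find titles in the text, and uses plain degree within the matched sub-graph as a stand-in for the interconnection measure.

```python
# Illustrative sketch of phase 1 (not the actual code behind this post).
# `wiki_links` is assumed to map each Wikipedia article title to the set of
# titles its article links to, built beforehand from a Wikipedia dump.
import networkx as nx

def find_title_matches(text, wiki_links):
    """Return the known Wikipedia titles that occur in the input text."""
    text_lower = text.lower()
    return {title for title in wiki_links if title.lower() in text_lower}

def build_match_graph(matches, wiki_links):
    """Build the sub-graph of the Wikipedia link graph induced by the matches."""
    graph = nx.Graph()
    graph.add_nodes_from(matches)
    for title in matches:
        for target in wiki_links.get(title, ()):
            if target in matches and target != title:
                graph.add_edge(title, target)
    return graph

def rank_by_interconnection(graph):
    """Rank nodes by how strongly they connect to the other matched nodes.

    Plain degree is used here as a stand-in for the interconnection measure.
    """
    return sorted(graph.degree, key=lambda node_deg: node_deg[1], reverse=True)
```

In a sketch like this, spurious matches such as “let” show up as weakly connected nodes, which is exactly what the second phase exploits.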

Below is the output of the first phase. This graph contains all the nodes found by matching phrases in the input text. Darker blue nodes are more relevant than lighter ones, and the darker and thicker a link is, the more relevant it is.

[Figure: full graph with all found Wikipedia titles]

Download the higher resolution image here (8.2 MB).

A lot of the extracted articles are not relevant to the input text. Some of these spurious nodes are totally disconnected from the main body of the graph.

[Figure: disconnected nodes in the full graph]

This is a slightly higher resolution picture of a section of the full graph above.
[Figure: a section of the full graph containing some relevant nodes]

Below is the output of the second phase of the algorithm, where the relevant nodes are extracted and a sub-graph is computed. Node and edge relevances are recomputed within this set.

[Figure: sub-graph containing only the relevant nodes]

Download the higher resolution image here (1.9 MB).
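
Continuing the illustrative sketch from the first phase (same assumed names), the filtering and recomputation described above might look roughly like this; the fraction of nodes to keep is a made-up parameter, not the threshold actually used for these graphs.

```python
# Illustrative sketch of phase 2: drop the least interconnected nodes and
# recompute relevance within the remaining sub-graph.
def extract_relevant_subgraph(graph, keep_fraction=0.5):
    ranked = rank_by_interconnection(graph)          # from the phase-1 sketch
    cutoff = max(1, int(len(ranked) * keep_fraction))
    keep = {node for node, _ in ranked[:cutoff]}     # the best-connected nodes
    subgraph = graph.subgraph(keep).copy()
    # Relevance is recomputed against the reduced node set only, so nodes that
    # were mostly connected to junk matches drop in the ranking here.
    return subgraph, rank_by_interconnection(subgraph)
```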

All the graphs above were produced using Graphviz.
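
For anyone wondering how the shading and edge thickness could be automated rather than set by hand, here is a hedged sketch of one way to emit a Graphviz DOT file from relevance scores: more saturated (darker) blue for more relevant nodes and thicker edges between more relevant pairs. The colour scheme and scaling are illustrative choices, not the ones used for the images above; scores is assumed to be a dict of node to relevance, for example dict(rank_by_interconnection(subgraph)) from the earlier sketches.

```python
# Illustrative: write a DOT file where node colour and edge width track the
# relevance scores. `scores` maps each node in `graph` to a numeric score.
def write_dot(graph, scores, path="topics.dot"):
    top = max(scores.values(), default=1) or 1
    with open(path, "w") as dot:
        dot.write("graph topics {\n")
        dot.write("  node [style=filled];\n")
        for node, score in scores.items():
            saturation = 0.2 + 0.8 * (score / top)    # darker blue = more relevant
            dot.write('  "%s" [fillcolor="0.61 %.2f 0.85"];\n' % (node, saturation))
        for a, b in graph.edges():
            weight = min(scores[a], scores[b]) / top   # edge follows its weaker end
            dot.write('  "%s" -- "%s" [penwidth=%.1f];\n' % (a, b, 1 + 3 * weight))
        dot.write("}\n")
```

A file written this way can then be rendered with any of the Graphviz layout engines, for example dot -Tpng topics.dot -o topics.png or neato -Tpng topics.dot -o topics.png.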

What next

I tried applying the logic to a few sample input texts and the results look very encouraging. The next step towards news article clustering would be to apply the topic extraction algorithm to multiple news articles and look for common Wikipedia articles (maybe a plain intersection). I have not given much thought to this stage yet; once I do, I will post back.
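
As a rough illustration of the plain-intersection idea (nothing more than a guess at this stage), two articles could be compared by the overlap of their extracted topic sets, for instance with a Jaccard score:

```python
# Illustrative only: compare two articles by the overlap of their topic sets.
def topic_overlap(topics_a, topics_b):
    """Jaccard similarity between two sets of extracted Wikipedia titles."""
    if not topics_a or not topics_b:
        return 0.0
    return len(topics_a & topics_b) / len(topics_a | topics_b)
```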

As I said before, Wikipedia amazes me every time I use it. The wealth of information (both as text and as interconnects) is astounding. As a token of appreciation, I’ve donated a small amount to the current Wikipedia donation round. If you like Wikipedia and have used it, do consider making a donation.


Comments

  1. brilliant!
    you amaze me, as ever.

    did you try using the global interconnect count for filtering?

    Posted December 21, 2007 at 6:04 pm | Permalink
  2. I did but it was pushing down relevant nodes for some sample texts.

    Posted December 21, 2007 at 6:17 pm | Permalink
  3. shanmuganandh

    cool…

    Posted December 27, 2007 at 7:49 am | Permalink
  4. Hey, Stephen North just posted a comment on my blog about your work, which could be of interest to you.

    Hi. I’d suggest at least trying neato or fdp for layouts of these networks, particularly, neato -Gmodel=subset

    Hope it helps.

    Posted January 1, 2008 at 7:05 pm | Permalink
  5. I did try neato and fdp, and in both cases the layout was not good: nodes were overlapping and edges would cut through nodes. I will however try out neato -Gmodel=subset and get back. Thanks for the tip.

    Posted January 1, 2008 at 7:23 pm | Permalink
  6. hi there, would have sent you an email, but haven’t noticed an obvious one after a very quick scan

    i’m really impressed and intrigued by your research here… your approach and algorithm seem very complementary to what i’ve been experimenting with…

    http://www.slideshare.net/guest2c797e/wikipedia-as-controlled-vocabulary/

    http://sells.welcomebackstage.com:5000/item/submit

    do let me know what you think. i wonder if there’s an easy way for me to try your node interconnectedness approach out as an additional filter in my approach?

    best–

    –chris sizemore

    Posted March 28, 2008 at 6:22 pm | Permalink
  7. Chris,

    Your work is interesting! I tried out the page you gave me for a paragraph. I believe a constrained tagging vocabulary for all kinds of content is a great way to “semanticize” it. You can take a look at this presentation (http://www.prashanthellina.com/docs/interesting_ways_of_using_wikipedia_data.pdf) to get more details about my algorithm. You can contact me at prashanthellina AT gmail.com.

    Posted March 31, 2008 at 10:10 pm | Permalink
  8. Hugo Leung

    Hi,

    I am a pharmacy student doing research on aspirin, and oddly enough, I could really use your expertise. I am studying the interconnectedness of papers on aspirin in order to visualize the most relevant papers that have studied people over 75 years old. I am currently using Graphviz to visualize this web of papers. I love the graph outputs you’ve created! Can you tell me how I might be able to automate things so that the relevant nodes are darker and relevant edges thicker, as you have done? As I am not good at programming, I was going to do this manually!

    Posted December 16, 2008 at 11:28 am | Permalink
  9. Jacob

    hey, I’d like to talk with you about your blog. please email me – thanks.

    Posted March 21, 2009 at 3:06 pm | Permalink