
In an earlier article, I mentioned that I was trying to use Wikipedia data to do news article clustering to make it easy for me follow news feeds. I have made some progress. I’ve written an algorithm to produce a list of Wikipedia articles relevant to the input text. Input text has to be in English. The algorithm will not work well for very short pieces of text. At least a paragraph or two with sizable text are required. The list of Wikipedia articles will represent the “topic” of the input text.
Test run
To test the algorithm, I gave the text from an earlier article from this blog (Accessing your home computer from the internet). The top Wikipedia articles in the output are
- Internet
- Domain Name System (DNS)
- IP Address
- Hypertext Transfer Protocol (HTTP)
- Modem
- World Wide Web (WWW)
- Domain Name
- Dynamic Host Configuration Protocol (DHCP)
- Internet Service Provider
- Network Address Translation (NAT)
- Firewall
How it works?
The basis of the algorithm is to find Wikipedia article titles occuring in the input text. The “found” set of Wikipedia articles are then used to construct a sub-graph from the Wikipedia graph (formed by linkages between Wikipedia pages). The most interconnected nodes happen to be relevant. However, as I have not applied any filtering on the input text, a lot of “junk” matches happened. For example, the word “let” is picked up and it matches a Wikipedia article by the same title which redirects to Lashkar-e-Toiba. This is totally irrelevant to the input text. To remove such spurious matches, I dropped all the least interconnected nodes and constructed a sub-graph with the remaining nodes. In the sub-graph, I did recomputation for node interconnection.
Below is the output of the first phase. This graph contains all nodes found from matching phrases in the input text. The nodes of darker blue are more relevant than lighter ones. The darker and thicker a link is, the more relevant it is.

Download higher resolution image here. 8.2MB
A lot of extracted articles are not relevant to the input text. Some of these spurious nodes are totally disconnected from the main body of the graph.

This is a slightly higher resolution picture of a section of the full graph above.

Below is the output of second phase of the algorithm where relevant nodes are extracted and a sub-graph computed. Node and edge relevances are recomputed within this set.

Download the higher resolution image here. 1.9MB
All the graphs above were produced using Graphviz.
What next
I tried applying the logic to some sample input texts and results look very encouraging. The next step towards news article clustering would be apply the topic extraction algorithm to multiple news articles and look for common Wikipedia articles (maybe plain intersection). I still have not given much thought to this stage. Once I do, I will post back.
As I said before, Wikipedia amazes me every time I use it. The wealth of information (both as text and as interconnects) is astounding. As a token of appreciation, I’ve donated a small amount to the current Wikipedia donation round. If you like Wikipedia and have used it, do consider making a donation.
Tags: data mining, graph, graphviz, programming, python, semantic analysis, text processing, visualization, web, wikipedia















venkat wrote,
brilliant!
you amaze me, as ever.
did you try using the global interconnect count for filtering?
Link | December 21st, 2007 at 6:04 pm
prashanthellina wrote,
I did but it was pushing down relevant nodes for some sample texts.
Link | December 21st, 2007 at 6:17 pm
shanmuganandh wrote,
cool…
Link | December 27th, 2007 at 7:49 am
social media and green horses » Topic extraction in wikipedia wrote,
[...] Ellina made a great work extracting topics using wikipedia data. Using the Graphviz program, he shows “…the wealth of information (both as text and as [...]
Link | December 30th, 2007 at 10:28 pm
robojiannis wrote,
Hey Stephen North just posted a comment on my blog about your work, which could be of interest for you.
“Hi. I’d suggest at least trying neato or fdp for layouts of these networks, particularly, neato -Gmodel=subset”
Hope it helps.
Link | January 1st, 2008 at 7:05 pm
prashanthellina wrote,
I did try neato and fdp and in both the cases the layout was not good as nodes were overlapping and edges would cut through nodes. I will however try out neato -Gmodel=subset and get back. Thanks for the tip.
Link | January 1st, 2008 at 7:23 pm
chris sizemore wrote,
hi there, would have sent you an email, but haven’t noticed an obvious one after a very quick scan
i’m really impressed and intrigued by your research here… your approach and algorithm seems very complementary to what i’ve been experimenting with…
http://www.slideshare.net/guest2c797e/wikipedia-as-controlled-vocabulary/
http://sells.welcomebackstage.com:5000/item/submit
do let me know what you think. i wonder if there’s an easy way for me to try your node interconnectedness approach out as an additional filter in my approach?
best–
–chris sizemore
Link | March 28th, 2008 at 6:22 pm
prashanthellina wrote,
Chris,
Your work is interesting! I tried out the page you gave me for a paragraph. I believe a constrained tagging vocabulary for all kinds of content is a great way to “semanticize” it. You can take a look at this presentation (http://www.prashanthellina.com/docs/interesting_ways_of_using_wikipedia_data.pdf) to get more details about my algorithm. You can contact me at prashanthellina AT gmail.com.
Link | March 31st, 2008 at 10:10 pm
Hugo Leung wrote,
Hi,
I am a pharmacy student doing research on Aspirin, and oddly enough, I could really use your expertise. I am studying the interconnectedness of papers on aspirin in order to visualize the most relevant papers which have studied people over 75 years old. I am currently using graphviz to visual this web of papers. I love the graph outputs you’ve created! Can you tell me how I might be able to automate things so that the relevant nodes are darker and relevant edges thicker as you have done? As I am not a good at programming, i was going to do this manually!
Link | December 16th, 2008 at 11:28 am
sewFoence wrote,
Nothing seems to be easier than seeing someone whom you can help but not helping.
I suggest we start giving it a try. Give love to the ones that need it.
God will appreciate it.
Link | January 22nd, 2009 at 8:11 am
Rosenrod wrote,
Thank you - I’ll be sure to check back later for more of your posts.
Link | March 16th, 2009 at 3:55 pm
Jacob wrote,
hey id like to talk with you about your blog. please email me - thanks.
Link | March 21st, 2009 at 3:06 pm
Tagz | "Topic extraction using Wikipedia data | Prashanth Ellina" | Comments wrote,
[...] [upmod] [downmod] Topic extraction using Wikipedia data | Prashanth Ellina (blog.prashanthellina.com) 1 points posted 3 weeks, 1 day ago by jeethu tags datamining [...]
Link | May 16th, 2009 at 10:20 pm
Schwalm wrote,
Hello Guru, what tempt you to post an article. This article was exceedingly fascinating, especially since I was searching for thoughts on this subject last Thursday.
Link | June 30th, 2009 at 12:45 am