
Topic extraction using Wikipedia data


In an earlier article, I mentioned that I was trying to use Wikipedia data to cluster news articles, to make it easier for me to follow news feeds. I have made some progress. I’ve written an algorithm that produces a list of Wikipedia articles relevant to the input text. The input text has to be in English, and the algorithm will not work well on very short pieces of text: at least a paragraph or two of sizable text is required. The resulting list of Wikipedia articles represents the “topic” of the input text.

Test run

To test the algorithm, I fed it the text of an earlier article from this blog (Accessing your home computer from the internet). The top Wikipedia articles in the output are:

  • Internet
  • Domain Name System (DNS)
  • IP Address
  • Hypertext Transfer Protocol (HTTP)
  • Modem
  • World Wide Web (WWW)
  • Domain Name
  • Dynamic Host Configuration Protocol (DHCP)
  • Internet Service Provider
  • Network Address Translation (NAT)
  • Firewall

How it works

The basis of the algorithm is to find Wikipedia article titles occurring in the input text. The “found” set of Wikipedia articles is then used to construct a sub-graph of the Wikipedia graph (formed by the links between Wikipedia pages). The most interconnected nodes tend to be the relevant ones. However, since I have not applied any filtering to the input text, a lot of “junk” matches occur. For example, the word “let” is picked up and matches a Wikipedia article of the same title, which redirects to Lashkar-e-Toiba; this is totally irrelevant to the input text. To remove such spurious matches, I drop the least interconnected nodes, construct a sub-graph from the remaining nodes, and then recompute node interconnection within that sub-graph.
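To make the two-phase idea concrete, here is a minimal sketch in Python. This is not my actual code: the `titles` set and the `links` edge list are assumed to have been prepared offline from a Wikipedia dump, networkx is used for illustration, the title matching is naive single-word matching, and the `keep_fraction` cutoff is an arbitrary placeholder.

    # A minimal sketch of the two-phase approach described above.
    # Assumes `titles` is a set of Wikipedia article titles and `links`
    # is an iterable of (title, title) pairs, both built offline from a
    # Wikipedia dump. Not the actual implementation.
    import networkx as nx

    def extract_topics(text, titles, links, keep_fraction=0.2):
        # Phase 1: find Wikipedia titles occurring in the input text.
        # (Single-word matching for brevity; real matching would also
        # look for multi-word phrases.)
        words = set(text.lower().split())
        found = {t for t in titles if t.lower() in words}

        # Build the sub-graph induced by the matched titles on the
        # Wikipedia link graph.
        wiki = nx.Graph(links)
        sub = wiki.subgraph(found)

        # Rank nodes by interconnection (degree within the sub-graph)
        # and drop the least connected ones to remove spurious matches
        # such as "let" -> Lashkar-e-Toiba.
        ranked = sorted(sub.degree(), key=lambda d: d[1], reverse=True)
        cutoff = max(1, int(len(ranked) * keep_fraction))
        keep = [node for node, _ in ranked[:cutoff]]

        # Phase 2: recompute interconnection within the pruned sub-graph.
        pruned = wiki.subgraph(keep)
        return sorted(pruned.degree(), key=lambda d: d[1], reverse=True)

Degree is just one possible measure of interconnection; something like PageRank over the sub-graph would be a natural alternative.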

Below is the output of the first phase. This graph contains all nodes found by matching phrases in the input text. Darker blue nodes are more relevant than lighter ones, and the darker and thicker a link is, the more relevant it is.

[Figure: full graph with all found Wikipedia titles] Download the higher-resolution image here (8.2 MB).

A lot of the extracted articles are not relevant to the input text. Some of these spurious nodes are totally disconnected from the main body of the graph.

[Figure: disconnected nodes in the full graph]

This is a slightly higher resolution picture of a section of the full graph above.

[Figure: section of the full graph containing some relevant nodes]

Below is the output of the second phase of the algorithm, where the relevant nodes are extracted and a sub-graph computed. Node and edge relevances are recomputed within this set.

[Figure: sub-graph containing only relevant nodes] Download the higher-resolution image here (1.9 MB).

All the graphs above were produced using Graphviz.
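For reference, here is a sketch of how such a graph could be emitted as Graphviz DOT, with darker blues for more relevant nodes and thicker edges for stronger links. The `node_relevance` and `edge_relevance` mappings (scores in [0, 1]) are hypothetical inputs, and this is not the script that produced the images above.

    # Emit a Graphviz DOT file where fill color encodes node relevance
    # and penwidth encodes edge relevance. `node_relevance` maps node ->
    # score and `edge_relevance` maps (node, node) -> score, both in
    # [0, 1]; these are assumed inputs for illustration.
    def write_dot(edges, node_relevance, edge_relevance, path="topics.dot"):
        def blue(score):
            # Interpolate from light blue to dark blue as relevance grows.
            level = int(0xE0 * (1.0 - score))
            return f"#{level:02x}{level:02x}ff"

        with open(path, "w") as f:
            f.write("graph topics {\n  node [style=filled];\n")
            for node, score in node_relevance.items():
                f.write(f'  "{node}" [fillcolor="{blue(score)}"];\n')
            for a, b in edges:
                # Edges are undirected, so check both key orders.
                score = edge_relevance.get((a, b), edge_relevance.get((b, a), 0.0))
                f.write(f'  "{a}" -- "{b}" [penwidth={0.5 + 3.0 * score:.1f}];\n')
            f.write("}\n")

The resulting file can be rendered with, for example, `dot -Tpng topics.dot -o topics.png`; for very large graphs, layouts like neato or sfdp tend to work better.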

What next

I tried applying the logic to some sample input texts and the results look very encouraging. The next step towards news article clustering would be to apply the topic extraction algorithm to multiple news articles and look for common Wikipedia articles (maybe a plain intersection). I still have not given much thought to this stage; once I do, I will post back.
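As a rough sketch of what that stage might look like, one could score the overlap between the topic sets of two articles and group articles whose overlap clears a threshold. The overlap measure here is a Jaccard-style ratio rather than a plain intersection, and the 0.3 threshold is an arbitrary placeholder; this is speculation, not a worked-out design.

    # Group articles whose extracted Wikipedia topic sets overlap.
    # `article_topics` maps article name -> set of Wikipedia titles
    # (e.g. the output of extract_topics above); hypothetical input.
    def topic_overlap(topics_a, topics_b):
        a, b = set(topics_a), set(topics_b)
        return len(a & b) / len(a | b) if a | b else 0.0

    def related_pairs(article_topics, threshold=0.3):
        # Yield pairs of articles sharing a sizable set of topics.
        items = list(article_topics.items())
        for i, (name_a, t_a) in enumerate(items):
            for name_b, t_b in items[i + 1:]:
                if topic_overlap(t_a, t_b) >= threshold:
                    yield name_a, name_b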

As I said before, Wikipedia amazes me every time I use it. The wealth of information (both as text and as interconnects) is astounding. As a token of appreciation, I’ve donated a small amount to the current Wikipedia donation round. If you like Wikipedia and have used it, do consider making a donation.
