In an earlier article, I mentioned that I was trying to use Wikipedia data for news article clustering, to make it easier for me to follow news feeds. I have made some progress. I've written an algorithm that produces a list of Wikipedia articles relevant to the input text. The input text has to be in English, and the algorithm does not work well on very short pieces of text; at least a paragraph or two of sizable text is required. The resulting list of Wikipedia articles represents the "topic" of the input text.
Test run
To test the algorithm, I gave it the text of an earlier article from this blog (Accessing your home computer from the internet). The top Wikipedia articles in the output are:
- Internet
- Domain Name System (DNS)
- IP Address
- Hypertext Transfer Protocol (HTTP)
- Modem
- World Wide Web (WWW)
- Domain Name
- Dynamic Host Configuration Protocol (DHCP)
- Internet Service Provider
- Network Address Translation (NAT)
- Firewall
How it works
The basis of the algorithm is to find Wikipedia article titles occurring in the input text. The "found" set of Wikipedia articles is then used to construct a sub-graph of the Wikipedia graph (formed by the links between Wikipedia pages). The most interconnected nodes tend to be the relevant ones. However, since I have not applied any filtering to the input text, a lot of "junk" matches turn up. For example, the word "let" is picked up and matches a Wikipedia article of the same title, which redirects to Lashkar-e-Toiba. This is totally irrelevant to the input text. To remove such spurious matches, I drop the least interconnected nodes and construct a sub-graph from the remaining nodes. Within this sub-graph, I recompute the node interconnections.
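For the curious, here is a minimal sketch of that two-phase idea in Python. It assumes the Wikipedia link graph is available as a dictionary mapping each article title to the set of titles it links to, and that title matching has already produced a list of matched titles. The names (`wiki_links`, `matched_titles`, `keep_fraction`) and the exact pruning rule are my own illustration, not the actual code.

```python
# Sketch of the two-phase topic extraction, under the assumptions above.

def score_nodes(nodes, wiki_links):
    """Score each node by how many links it has to other nodes in the set."""
    node_set = set(nodes)
    return {
        n: sum(1 for m in wiki_links.get(n, ()) if m in node_set)
        for n in nodes
    }

def extract_topics(matched_titles, wiki_links, keep_fraction=0.3):
    # Phase 1: score every matched title within the induced sub-graph.
    scores = score_nodes(matched_titles, wiki_links)

    # Drop the least interconnected nodes; spurious matches like "Let"
    # tend to be poorly connected to the rest of the sub-graph.
    # (keep_fraction is an arbitrary illustrative cutoff.)
    ranked = sorted(scores, key=scores.get, reverse=True)
    survivors = ranked[: max(1, int(len(ranked) * keep_fraction))]

    # Phase 2: recompute interconnection within the pruned sub-graph.
    final_scores = score_nodes(survivors, wiki_links)
    return sorted(final_scores, key=final_scores.get, reverse=True)
```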
Below is the output of the first phase. This graph contains all the nodes found by matching phrases in the input text. Nodes in darker blue are more relevant than lighter ones. The darker and thicker a link is, the more relevant it is.
Download a higher-resolution image here (8.2 MB).
Many of the extracted articles are not relevant to the input text. Some of these spurious nodes are totally disconnected from the main body of the graph.
This is a slightly higher resolution picture of a section of the full
graph above.
Below is the output of the second phase of the algorithm, where the relevant nodes are extracted and a sub-graph is computed. Node and edge relevances are recomputed within this set.
Download a higher-resolution image here (1.9 MB).
All the graphs above were produced using Graphviz.
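For reference, this is roughly how a scored node set can be turned into a Graphviz DOT file with darker blue for more relevant nodes and thicker edges for more relevant links, as in the pictures above. The function and variable names are my own illustration, not the actual code behind these graphs.

```python
# Sketch: emit a Graphviz DOT file from node scores and an edge list.
# `scores` maps article title -> relevance score; `edges` is a list of
# (title_a, title_b) pairs. Both are assumed to come from the extraction step.

def to_dot(scores, edges, path="topics.dot"):
    max_score = max(scores.values()) or 1
    with open(path, "w") as f:
        f.write("graph topics {\n")
        f.write("  node [style=filled, fontcolor=white];\n")
        for title, score in scores.items():
            # Map relevance to a blue of varying darkness via an HSV colour.
            darkness = 0.9 - 0.6 * (score / max_score)
            f.write(f'  "{title}" [fillcolor="0.60 0.80 {darkness:.2f}"];\n')
        for a, b in edges:
            weight = min(scores.get(a, 0), scores.get(b, 0)) / max_score
            f.write(f'  "{a}" -- "{b}" [penwidth={1 + 3 * weight:.1f}];\n')
        f.write("}\n")
```

The resulting file can then be rendered with something like `neato -Tpng topics.dot -o topics.png`.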
What next
I tried applying the logic to some sample input texts and the results look very encouraging. The next step towards news article clustering would be to apply the topic-extraction algorithm to multiple news articles and look for common Wikipedia articles (maybe a plain intersection). I still have not given much thought to this stage. Once I do, I will post back.
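To make the idea concrete, here is a rough sketch of what a plain-intersection grouping could look like. This is only one possible direction, not something I have implemented; the overlap threshold and the greedy grouping are purely illustrative.

```python
# Sketch: group news articles whose extracted Wikipedia topic sets overlap.

def related(topics_a, topics_b, threshold=0.3):
    """Jaccard-style overlap between two sets of Wikipedia article titles."""
    a, b = set(topics_a), set(topics_b)
    if not a or not b:
        return False
    return len(a & b) / len(a | b) >= threshold

def cluster(articles_topics, threshold=0.3):
    """Greedy grouping: each article joins the first cluster it overlaps with."""
    clusters = []
    for topics in articles_topics:
        for cluster_topics in clusters:
            if related(topics, cluster_topics, threshold):
                cluster_topics.update(topics)
                break
        else:
            clusters.append(set(topics))
    return clusters
```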
As I said before, Wikipedia amazes me every time I use it. The wealth of information (both as text and as interconnects) is astounding. As a token of appreciation, I’ve donated a small amount to the current Wikipedia donation round. If you like Wikipedia and have used it, do consider making a donation.