
Ways to process and use Wikipedia dumps

Wikipedia (http://en.wikipedia.org) is a superb reference resource (taken with a pinch of salt, of course). I spend hours at a time spidering through its pages and always come away amazed at how much information it hosts. In my opinion, it ranks amongst the defining milestones of mankind’s advancement.

Apart from being available through http://www.wikipedia.org, the data is provided for download so that you can create a mirror locally for quicker access. This is very convenient when you are not connected to the internet, say when you are on the move.

Setting up a local copy of Wikipedia

Windows If you have Windows installed, Webaroo is an easy way to get Wikipedia locally as a “web pack”. Another option on Windows is WikiFilter, which I tried and found to be a good one (it is open source, so you can tweak it). It takes up around 3 to 3.5 GB of disk space.

Linux This page has instructions for setting up a local copy on Linux. “Building a (fast) Wikipedia offline reader” is a good option too.

Any operating system Wikipedia provides static wiki dumps for download, which should work fine on any operating system with a decent web browser. Although I have not tried it, I have heard that the dumps can take up as much as 80 GB of disk space.

Windows Mobile, iPhone and BlackBerry To access Wikipedia from your mobile device, check out vTap from Veveo. I must tell you that I work for this company, but I am being objective in suggesting this service. A Java version is being developed and will be out soon. Since space on mobile devices is very limited, the data is hosted on vTap servers and network connectivity is required.

Other uses for Wikipedia data dumps

Being such a vast repository of knowledge, Wikipedia is useful in many other ways. I want to use Wikipedia’s data to handle the feeds I read every day. The same news article comes in from different sources, and multiple times from the same source, and I end up reading all of them. I am going to try to use Wikipedia to help me automatically pull these news articles together and cluster them around topics.

The first step in this experiment is to get the data dumps from Wikipedia and process them so that they can be loaded into a MySQL database. Once the data is in a database, things become much more manageable.

Getting the dumps

Wikipedia is huge, and this reflects in the data dumps. It took me about 40 hours on my home connection to get the articles XML alone. Put together, the relevant dump files come to about 5 GB. I got the dumps from the 20070908 snapshot at http://download.wikimedia.org/enwiki/; check the same location for more recent ones.

Files to get

  • pages-articles.xml.bz2 (2.8 GB) - XML containing page texts
  • redirect.sql.gz (10.5 MB) - SQL with redirect information
  • page.sql.gz (318.9 MB) - minimal page information (id, page title)
  • externallinks.sql.gz (439.6 MB) - links from pages to external sites
  • categorylinks.sql.gz (235.5 MB) - category membership
  • pagelinks.sql.gz (1.2 GB) - links between wiki pages

Since DreamHost (my web host) offers a lot of disk space and its network connection is much better than the one I have at home, I downloaded the files to my DreamHost space.
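
For the .sql.gz dumps, plain wget did the trick. Below is a rough sketch of that kind of download loop, wrapped in Python like the other scripts in this post; the URL prefix and file names follow the 20070908 dump naming and would need adjusting for newer snapshots.

get_sql_dumps.py

import os

# assumption: the prefix follows the 20070908 dump listing; newer dumps
# carry a different date
base_url = 'http://download.wikimedia.org/enwiki/20070908/enwiki-20070908-'
files = ['redirect.sql.gz', 'page.sql.gz', 'externallinks.sql.gz',
         'categorylinks.sql.gz', 'pagelinks.sql.gz']

for f in files:
        # -c lets wget resume a partial download if the connection drops
        os.system('wget -c "%s%s"' % (base_url, f))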

I was able to download all the files with wget except for “pages-articles.xml.bz2”. For some reason I could not figure out, wget kept bailing out after downloading a few bytes (this seems to be a DreamHost-specific issue). To work around it, I wrote this Python script:

get_wiki_file.py

import urllib2

url = "http://download.wikimedia.org/enwiki/20070908/enwiki-20070908-pages-articles.xml.bz2"
outfname = 'enwiki-20070908-pages-articles.xml.bz2'

o = open(outfname, 'wb')
r = urllib2.urlopen(url)
total_bytes = 0
counter = 0

while 1:
        # read the file in 10 KB chunks and write them straight to disk
        bytes = r.read(10240)
        total_bytes += len(bytes)
        o.write(bytes)
        # a short read means the download is complete
        if len(bytes) < 10240: break
        counter += 1
        if counter % 10 == 0:
                # report progress roughly every 100 KB
                counter = 0
                print "%.2f MB" % (total_bytes/1024.0/1024.0)
o.close()

Preparing dumps

The next step was to extract all the archives. I used bunzip2 for the .bz2 file and gunzip for the .gz files. I tried loading one of the large .sql files into the database and the process got killed (I guess DreamHost does not like resource-hungry processes running for a long time). To work around this, I had to split all the big .sql files into smaller chunks.

For example:

split -d -l 50 ../enwiki-20070908-page page.input.

The -l option tells split how many lines we want per chunk, and -d tells it to use numeric suffixes (which will come in handy soon).
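
Since every big .sql file needs the same treatment, the split command can be driven from a small script along these lines (a sketch: the file names assume the extracted 20070908 dumps, splits/ is the directory the loading script further down reads from, and redirect.sql is small enough to load directly, so it is left out).

split_sql_dumps.py

import os

sql_dumps = ['enwiki-20070908-page.sql',
             'enwiki-20070908-externallinks.sql',
             'enwiki-20070908-categorylinks.sql',
             'enwiki-20070908-pagelinks.sql']

for fname in sql_dumps:
        # enwiki-20070908-page.sql -> splits/page.input.00, page.input.01, ...
        prefix = fname.split('-')[-1].replace('.sql', '')
        os.system('split -d -l 50 "%s" "splits/%s.input."' % (fname, prefix))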

However, the pages-articles file is XML, not SQL. To load it into the database, I first had to convert it to a .sql dump. xml2sql is a handy program for doing this.

xml2sql -v -m pages-articles.xml

This command produces text.sql, page.sql and revision.sql. However, I ran into a problem here: xml2sql’s memory usage kept growing, and it had reached about 88 MB of resident memory by the time DreamHost killed the process. I tried running valgrind on it but could not find any leaks (the memory must be getting freed on exit).

This forced me to split the huge XML into manageable parts so that xml2sql would stay under 88 MB. I wrote this Python script for splitting the XML:

split_xml_dump.py

import os, os.path

fname = "enwiki-20070908-pages-articles.xml"
total_page_counter = cur_page_counter = 0
o = None
outf_counter = 0
outdir = 'pages_xml_splits/'

for line in open(fname):
        # drop the original root element; each chunk gets its own wrapper
        if line.startswith("<mediawiki") or line.startswith("</mediawiki"):
                continue

        # start a new chunk at the very beginning and then after every
        # 250,000 pages, always on a page boundary so no page is split
        # across two chunks
        if o is None or (cur_page_counter == 250000 and line == '  <page>\n'):
                cur_page_counter = 0
                outf_counter += 1
                outfname = 'pagexml.%d' % outf_counter
                outfname = os.path.join(outdir, outfname)
                print outfname

                if o:
                        o.write('</mediawiki>\n')
                        o.close()

                o = open(outfname, 'wb')
                o.write('<mediawiki>\n')

        # the dump indents every page start tag with two spaces
        if line == '  <page>\n':
                cur_page_counter += 1
                total_page_counter += 1
                if total_page_counter % 10000 == 0: print total_page_counter

        o.write(line)

if o:
        o.write('</mediawiki>\n')
        o.close()

print total_page_counter

Once the file was split into chunks of 250,000 articles each, I ran xml2sql on each chunk to get the corresponding text.sql.
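
Something along these lines does the chunk-by-chunk conversion. It is only a sketch: the chunk names come from split_xml_dump.py above, and parking each text.sql under splits/ is just one way of handing the output to the loading script below, since xml2sql rewrites text.sql, page.sql and revision.sql on every run.

convert_chunks.py

import os, glob

chunks = glob.glob('pages_xml_splits/pagexml.*')
# sort numerically on the suffix written by split_xml_dump.py
chunks.sort(key=lambda f: int(f.split('.')[-1]))

for fname in chunks:
        print 'converting %s' % fname
        os.system('xml2sql -v -m "%s"' % fname)
        # keep only text.sql and park it where the loading script expects it
        os.system('mv text.sql "splits/text.input.%s"' % fname.split('.')[-1])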

Loading the dumps into the database

All the input (a bunch of .sql files) was now ready to be loaded into the database. Before starting the load, I had to create a database to hold the tables and then create the “text” table.

mysql -h hostname -u username -ppassword
> create database wiki;
> use wiki;
> create table `text` (
  old_id int unsigned NOT NULL auto_increment,
  old_text mediumblob NOT NULL,
  old_flags tinyblob NOT NULL,
  PRIMARY KEY old_id (old_id)
) MAX_ROWS=10000000 AVG_ROW_LENGTH=10240;

The splitting took about 90 minutes to complete. The following script loads the .sql files into the database one after the other:

load_splits.py

import os
import os.path
import glob
import shutil

# keep picking the lowest-numbered split in splits/ and feeding it to mysql;
# loaded files are moved to processed_splits/ (which must already exist),
# and failures are moved there with an .error suffix so they can be retried
while 1:
        fnames = [os.path.basename(f) for f in glob.glob('splits/*.input.*')]
        # sort by the numeric suffix that split's -d option produced
        fnames = [(int(f.split('.')[2]), f) for f in fnames]
        fnames.sort()

        if len(fnames) == 0: break
        print "found %d files" % len(fnames)

        fname = os.path.join('splits/', fnames[0][1])
        to_fname = os.path.join('processed_splits/', fnames[0][1])
        error_fname = os.path.join('processed_splits/', fnames[0][1] + '.error')

        print "processing %s" % fname

        # hostname/username/password are placeholders for the real credentials
        cmd = 'mysql -h hostname -ppassword -u username wiki < "%s"' % fname
        result = os.system(cmd)
        if result != 0:
                shutil.move(fname, error_fname)
        else:
                shutil.move(fname, to_fname)

        print "processed %s" % fname

Loading the .sql files into the database takes a very long time. I started it yesterday morning and it is still running! While the data is loading, you can go through the Wikipedia database schema diagram to get familiar with the tables.
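
If you want to keep an eye on the load while it runs, a quick row count on the text table does the trick (hostname, username and password are the same placeholders as before).

check_progress.py

import os

# count the rows loaded into the text table so far
os.system('mysql -h hostname -u username -ppassword wiki '
          '-e "select count(*) from text"')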

In a follow-up to this article, I will write about how I use the Wikipedia database to streamline my news feeds.
