Ways to process and use Wikipedia dumps

http://en.wikipedia.org  
Wikipedia is a superb reference (taken with a pinch of salt, of course). I spend hours at a time spidering through its pages and always come away amazed at how much information it hosts. In my opinion it ranks among the defining milestones of mankind’s advancement.

Apart from being available through http://www.wikipedia.org, the data is provided for download so that you can create a mirror locally for quicker access. This is very convenient when you are not connected to the internet, say when you are on the move.

Setting up a local copy of Wikipedia

Windows
If you have Windows installed, Webaroo is an easy way to get Wikipedia locally as a “web pack”. Check out Webaroo here. Another option on Windows is WikiFilter. I tried WikiFilter and found it a good option (it is open source, so you can tweak it). It takes up around 3 to 3.5 GB of disk space.

Linux
This page has instructions for setting up a local copy on Linux. “Building a (fast) Wikipedia offline reader” is a good option too.

Any operating system
Wikipedia provides static wiki dumps for download, which should work fine on any operating system with a decent web browser. Although I have not tried it, I have heard that these dumps can take up as much as 80GB of disk space.

Windows Mobile, iPhone and Blackberry
To access Wikipedia from your mobile, check out vTap from Veveo. I must tell you that I work for this company but I am being very objective in suggesting this service to you. A Java version is being developed and will be out soon. Since the space on mobile devices is very limited, the data is hosted on vTap servers and network connectivity is required.

Other uses for Wikipedia data dumps

Being such a vast repository of knowledge, Wikipedia is useful in many other ways. I want to use Wikipedia’s data to handle the feeds I read every day. The same news article comes in from different sources, and multiple times from the same source, and I end up reading all of them. I am going to try to use Wikipedia to automatically pull these news articles together and cluster them around topics.

The first step in this experiment is to get the data dumps from Wikipedia and process them so that they can be loaded into a MySQL database. Once the data is in a database, things become much more manageable.

Getting the dumps

Wikipedia is huge, and this shows in the data dumps. It took me about 40 hours on my home connection to download the articles XML alone. Put together, the relevant dump files come to about 5 GB. I got the dumps from here. You can check for more recent ones here.

Files to get

  • pages-articles.xml.bz2 (2.8 GB) – XML containing the page texts
  • redirect.sql.gz (10.5 MB) – SQL with redirect information
  • page.sql.gz (318.9 MB) – minimal page information (id, page title)
  • externallinks.sql.gz (439.6 MB) – links from pages to external sites
  • categorylinks.sql.gz (235.5 MB) – categories
  • pagelinks.sql.gz (1.2 GB) – links between wiki pages

Since DreamHost (my web host) offers a lot of disk space and its network connection is much better than the one I have at home, I downloaded the files to my DreamHost space. (Read more about DreamHost here.)

I was able to download all the files using wget except for “pages-articles.xml.bz2”. For some reason I cannot understand, wget kept bailing out after downloading a few bytes (this seems to be a DreamHost-specific issue). To work around this issue, I wrote this Python script:

get_wiki_file.py

import urllib2

url = "http://download.wikimedia.org/enwiki/20070908/enwiki-20070908-pages-articles.xml.bz2"
outfname = 'enwiki-20070908-pages-articles.xml.bz2'

o = open(outfname, 'wb')
r = urllib2.urlopen(url)
total_bytes = 0
counter = 0

# Stream the file down in 10 KB chunks, printing progress every 10 chunks.
while 1:
    bytes = r.read(10240)
    total_bytes += len(bytes)
    o.write(bytes)
    if len(bytes) < 10240:
        # A short read means the download is complete.
        break
    counter += 1
    if counter % 10 == 0:
        print "%.2f MB" % (total_bytes / 1024.0 / 1024.0)
o.close()

Preparing dumps

The next step was to extract all the archives. I used bunzip2 for the .bz2 files and gunzip for the .gz files. I tried loading one of the large .sql files into the database and the process was killed (I guess DreamHost does not like resource-hungry processes running for a long time). To work around this, I had to split all the big .sql files into smaller chunks.

For example:

split -d -l 50 ../enwiki-20070908-page page.input.

The -l option tells split how many lines to put in each chunk, and -d tells it to use numeric suffixes (which will be useful soon).
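
If split is not available, or you want to keep everything in Python, the same chunking can be done in a few lines. This is only a rough sketch, not one of my original scripts: the input name enwiki-20070908-page.sql, the splits/ output directory and the page.input.NN chunk names are assumptions chosen to match the split command above and the loader script further down.

split_sql.py (hypothetical)

import os

# Assumed names; adjust to your decompressed dump and directories.
infname = 'enwiki-20070908-page.sql'
outdir = 'splits'
lines_per_chunk = 50    # each line of these dumps is one huge multi-row INSERT

if not os.path.isdir(outdir):
    os.makedirs(outdir)

o = None
chunk = 0
for i, line in enumerate(open(infname)):
    # Start a new chunk every lines_per_chunk lines, like `split -d -l 50`.
    if i % lines_per_chunk == 0:
        if o:
            o.close()
        o = open(os.path.join(outdir, 'page.input.%02d' % chunk), 'wb')
        chunk += 1
    o.write(line)

if o:
    o.close()
print "wrote %d chunks" % chunk

The numeric suffix matters because the loader script further down sorts the chunks by that number before feeding them to mysql.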

However, the pages-articles file is XML, not SQL. To load it into the database, I first had to convert it to a .sql dump. xml2sql is a handy program for doing this. You can get it here.

xml2sql -v -m pages-articles.xml 

This command produces text.sql, page.sql and revision.sql. However, I ran into a problem here: xml2sql was leaking memory and slowly rose to 88 MB of resident memory, at which point DreamHost killed the process. I tried running valgrind on it but could not find any leaks (they must be getting freed on exit).

This forced me to split the huge XML file into manageable parts so that xml2sql would stay within 88 MB. I wrote this Python script for splitting the XML:

split_xml_dump.py

import os, os.path

fname = "enwiki-20070908-pages-articles.xml"
total_page_counter = cur_page_counter = 0
o = None
outf_counter = 0
outdir = 'pages_xml_splits/'

# Make sure the output directory exists.
if not os.path.isdir(outdir):
    os.makedirs(outdir)

for line in open(fname):
    # Skip the <mediawiki ...> header and the </mediawiki> footer of the
    # original dump; each chunk gets its own pair written below.
    if line.startswith("<mediawiki") or line.startswith("</mediawiki"):
        continue

    # Start a new chunk at the beginning and after every 250,000 pages.
    if o is None or cur_page_counter == 250000:
        cur_page_counter = 0
        outf_counter += 1
        outfname = 'pagexml.%d' % outf_counter
        outfname = os.path.join(outdir, outfname)
        print outfname

        if o:
            o.write('</mediawiki>\n')
            o.close()

        o = open(outfname, 'wb')
        o.write('<mediawiki>\n')

    # Count closing page tags (indented by two spaces in the dump) so that
    # a chunk boundary never falls in the middle of a page.
    if line == '  </page>\n':
        cur_page_counter += 1
        total_page_counter += 1
        if total_page_counter % 10000 == 0: print total_page_counter

    o.write(line)

# Close off the last chunk.
if o:
    o.write('</mediawiki>\n')
    o.close()

print total_page_counter

Once the file was split into chunks of 250,000 articles each, I used xml2sql on each chunk to get the corresponding text.sql.
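
Here is a rough sketch of that loop. It is not one of my original scripts: it assumes the chunks sit in pages_xml_splits/ (as written by the split script above), that xml2sql is on the PATH and drops text.sql, page.sql and revision.sql into the current directory as it did for the single-file run, and it renames the output to splits/text.input.N so the loader script below will pick it up.

run_xml2sql.py (hypothetical)

import glob
import os

# Chunks produced by split_xml_dump.py: pagexml.1, pagexml.2, ...
chunks = glob.glob('pages_xml_splits/pagexml.*')
chunks.sort(key=lambda f: int(f.split('.')[-1]))

for i, chunk in enumerate(chunks):
    print "converting %s" % chunk
    result = os.system('xml2sql -v -m "%s"' % chunk)
    if result != 0:
        print "xml2sql failed on %s" % chunk
        break
    # Rename the output so the next chunk does not overwrite it and so the
    # loader script (which globs splits/*.input.*) can find it. page.sql and
    # revision.sql are also produced; rename or remove them the same way if
    # you need them.
    os.rename('text.sql', 'splits/text.input.%d' % (i + 1))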

Loading the dumps into the database

I had now readied all the input (a bunch of .sql files) and had to load them into the database. Before starting the load, I had to create a database to hold the tables and create the “text” table by hand.

mysql -h hostname -u username -ppassword
> create database wiki;
> use wiki;
> create table `text` (
  old_id int unsigned NOT NULL auto_increment,
  old_text mediumblob NOT NULL,
  old_flags tinyblob NOT NULL,
  PRIMARY KEY old_id (old_id)
) MAX_ROWS=10000000 AVG_ROW_LENGTH=10240;

The splitting took about 90 minutes to finish. The following script loads the .sql files into the database one after another.

load_splits.py

import os
import os.path
import glob
import shutil

# Repeatedly pick the lowest-numbered chunk from splits/, feed it to mysql,
# and move it to processed_splits/ (or mark it .error) so that the loop can
# be stopped and restarted without redoing work. Both directories are
# expected to exist already.
while 1:
    fnames = [os.path.basename(f) for f in glob.glob('splits/*.input.*')]
    fnames = [(int(f.split('.')[2]), f) for f in fnames]
    fnames.sort()

    if len(fnames) == 0: break
    print "found %d files" % len(fnames)

    fname = os.path.join('splits/', fnames[0][1])
    to_fname = os.path.join('processed_splits/', fnames[0][1])
    error_fname = os.path.join('processed_splits/', fnames[0][1] + '.error')

    print "processing %s" % fname

    cmd = 'mysql -h hostname -ppassword -u username wiki < "%s"' % fname
    result = os.system(cmd)
    if result != 0:
        # Keep failed chunks around (with a .error suffix) for inspection.
        shutil.move(fname, error_fname)
    else:
        shutil.move(fname, to_fname)

    print "processed %s" % fname

Loading the .sql files into the database takes a long, long time; I started it yesterday morning and it is still running! While the data loads, you can check out this Wikipedia database schema diagram.
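
Once the load does finish, it is worth a quick sanity check to confirm the tables are actually populated. Here is a minimal sketch using the MySQLdb module; the connection details are placeholders and the choice of tables simply mirrors the dumps loaded above.

check_load.py (hypothetical)

import MySQLdb

# Placeholder credentials; use the same host/user/password as the loader.
db = MySQLdb.connect(host='hostname', user='username',
                     passwd='password', db='wiki')
cur = db.cursor()

# Row counts for the tables loaded above.
for table in ('text', 'page', 'pagelinks', 'categorylinks'):
    cur.execute('SELECT COUNT(*) FROM `%s`' % table)
    print table, cur.fetchone()[0]

# Eyeball a few page titles to confirm the data looks sane.
cur.execute('SELECT page_title FROM page LIMIT 5')
for (title,) in cur.fetchall():
    print title

db.close()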

In a follow-up to this article, I will write about how I use the Wikipedia database to streamline my news feeds.

21 Comments

  1. Great article! I did something similar but on my local machine. wget bombed for me as well (this was Cygwin on Windows).

    Why didn't you do all this on your local machine (instead of on DreamHost)?

    Posted October 18, 2007 at 12:03 am | Permalink
  2. Interesting, I tried wget from my local machine and it was downloading all right.

    I want the processed data to be accessible through some kind of API so anybody can “query” the database over HTTP. Instead of downloading 5 GB to my disk and then uploading it back to DreamHost, I felt it was better to do it there. I have a 128 kbps “broadband” connection, so you can imagine the upload rate.

    Posted October 18, 2007 at 7:43 am | Permalink
  3. satheesh nair

    Prashanth, can we discuss a deal to develop a Wikipedia mirror for a project of mine? Please give me your contact details; I am in Bangalore.

    Posted October 24, 2007 at 7:42 pm | Permalink
  4. Hi Satheesh,

    My email is prashanthBLAHellina AT gmail DOTT com (remove BLAH).

    Posted October 24, 2007 at 8:13 pm | Permalink
  5. someone

    Hi

    I used your split script.._thanks_ a lot for it!
    but are you sure that it is accurate? does it need to be updated?

    Posted April 25, 2008 at 6:41 pm | Permalink
  6. Great! It worked for me so I assume it is correct. I have not run this against the latest xml though.

    Posted April 27, 2008 at 11:46 am | Permalink
  7. totic

    I also had the problem with wget; it seems to always fail when the files are incredibly large. My solution was to just use curl.

    example:

    curl http://download.wikimedia.org/enwiki/20070206/enwiki-20070206-pages-articles.xml.bz2 -o enwiki-20070206-pages-articles.xml.bz2

    Posted May 18, 2008 at 12:24 pm | Permalink
  8. Thanks for the info, Totic.

    Posted May 23, 2008 at 7:44 pm | Permalink
  9. Hello,

    I’ve sent you an add request on Gmail for a site we need done, integrated with MediaWiki and a Wikipedia database dump. Please accept the add request so that we can discuss it in detail.

    Posted June 14, 2008 at 8:53 pm | Permalink
  10. RK, I prefer communicating via email. Please mail me at the same address and we can have a discussion.

    Posted June 15, 2008 at 10:30 am | Permalink
  11. indrajeet

    Tell me how to dump the database after installing MediaWiki.
    Please tell me, I am waiting for your reply.

    thanks and regards
    Indrajeet Dhanjode

    Posted July 7, 2008 at 4:50 pm | Permalink
  12. rich

    THANK YOU!! This is extremely helpful. I downloaded one of the dumps and had no clue what to do with it. I figured I would just start and tinker with it along the way (that’s how I learn everything for computers, treat it like a puzzle to solve and I end up teaching myself just about anything) but when I saw the dump would decompress into massively large files and take a while to do it, I decided to take a step back and slow down before I blow up my computer in the process. This helped me more than you can know, thanks so much!!!!

    Posted July 8, 2008 at 6:31 pm | Permalink
  13. Mailed you but couldn’t get a reply. You are the only person I know who can get this done :)

    Posted July 8, 2008 at 11:05 pm | Permalink
  14. Rich, way to go! That’s a wonderful way to learn. Am glad I was of help. Enjoy!
    RK, will get back to you by mail shortly.

    Posted July 17, 2008 at 9:26 pm | Permalink
  15. Indrajeet, I did not understand your question. What are you trying to do?

    Posted July 17, 2008 at 9:27 pm | Permalink
  16. Does anyone know how to make all the links internal to my domain after we do the dump?

    Posted August 2, 2008 at 2:10 am | Permalink
  17. Jared, the intra-wiki links are internal automatically.

    Posted August 4, 2008 at 9:15 pm | Permalink
  18. Prashanth,
    Thanks for a great post. I found I had to make some small alterations for MySQL 5.0.21:

    CREATE DATABASE wiki;
    USE wiki;
    CREATE TABLE `text` (
      old_id int unsigned NOT NULL AUTO_INCREMENT,
      old_text mediumblob NOT NULL,
      old_flags tinyblob NOT NULL,
      PRIMARY KEY (old_id)
    ) MAX_ROWS=10000000 AVG_ROW_LENGTH=10240;

    Posted August 25, 2008 at 5:29 am | Permalink
  19. Andy Bailey

    Prashanth,
    I managed to load the SQL imports into the ‘text’ table. It now holds just over 10,000,000 records, but I can’t do anything with it. Every time I try a SELECT COUNT(*) or a SELECT * LIMIT 0,1 I get no return from MySQL. I may have to abandon it all and use a MediaWiki method instead. Did you have any trouble accessing the data? One thing I’ve noticed is that the primary key index (old_id) is taking up no disk space, which seems a little weird.

    Posted September 29, 2008 at 4:18 pm | Permalink
  20. Sir, I want to display Wikipedia content on my site. Is it possible? If yes, how? Regards.

    Posted November 7, 2008 at 12:31 pm | Permalink
  21. bhupinder

    Hi everybody. Could someone help me with the structure of the revision and page tables as well? I was able to get the three files named text, page and revision using the xml2sql tool. I found the “text” table structure here, but not the structure for the other two. If anybody can help me out with those as well, please respond to my query here.

    I will be thankful to that person. I know I can figure something out from the data itself, but if somebody already has something close to their structure, please let me know.

    I am still waiting for some kind of help.

    thanks,
    bhupinder

    Posted June 4, 2009 at 12:45 am | Permalink