( to ) ? be : ! be;

Ways to process and use Wikipedia dumps

Wikipedia is a superb resource for reference (taken with a pinch of salt of course). I spend hours at a time spidering through its pages and always come away amazed at how much information it hosts. In my opinion this ranks amongst the defining milestones of mankind’s advancement.

Apart from being available through http://www.wikipedia.org, the data is provided for download so that you can create a mirror locally for quicker access. This is very convenient when you are not connected to the internet, say when you are on the move.

Setting up a local copy of Wikipedia

Windows
If you have Windows installed, Webaroo is an easy way to get Wikipedia locally as a “web pack”. Check out Webaroo here. Another way on Windows is to use WikiFilter. I tried WikiFilter and found it a good option (it is open source, so you can tweak it). It takes up around 3 to 3.5 GB on your disk.

Linux
This page has instructions for setting it up on Linux. “Building a (fast) Wikipedia offline reader” is a good option too.

Any operating system
Wikipedia provides static wiki dumps for download which should work fine on any operating system that supports a decent web browser. Although I have not tried it, I have heard that the dumps can take up as much as 80GB of space on your disk.

Windows Mobile, iPhone and Blackberry
To access Wikipedia from your mobile, check out vTap from Veveo. I must tell you that I work for this company but I am being very objective in suggesting this service to you. A Java version is being developed and will be out soon. Since the space on mobile devices is very limited, the data is hosted on vTap servers and network connectivity is required.

Other uses for Wikipedia data dumps

Being such a vast repository of knowledge, Wikipedia is useful in many other ways. I want to use Wikipedia’s data to handle the feeds I read every day. The same news article comes in from different sources, and multiple times from the same source, and I end up reading all of them. I am going to try to use Wikipedia to help me automatically pull together these news articles and cluster them around topics.

The first step in this experiment is to get the data dumps from Wikipedia and process them so that they can be loaded into a MySQL database. Once the data is in a database, things become more manageable.

Getting the dumps

Wikipedia is huge, and this is reflected in the data dumps. It took me about 40 hours to get the articles XML alone on my home connection. Put together, all the relevant files of the Wikipedia dumps come to about 5 GB. I got the dumps from here. You can check for more recent ones here.

Files to get

  • pages-articles.xml.bz2 (2.8 GB) - xml containing page texts
  • redirect.sql.gz (10.5 MB) - sql for redirected pages info
  • page.sql.gz (318.9 MB) - minimal page information (id, page title)
  • externallinks.sql.gz (439.6 MB) - links from pages to external sites
  • categorylinks.sql.gz (235.5 MB) - categories
  • pagelinks.sql.gz (1.2 GB) - links between wiki pages
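As a quick check on the 5 GB figure, the sizes listed above can be added up. This is only a sketch: it assumes 1 GB = 1024 MB, and the sizes are the rounded values from the list, so the total is approximate.

```python
# Sizes as listed above, converted to MB (assumption: 1 GB = 1024 MB).
sizes_mb = {
    "pages-articles.xml.bz2": 2.8 * 1024,
    "redirect.sql.gz": 10.5,
    "page.sql.gz": 318.9,
    "externallinks.sql.gz": 439.6,
    "categorylinks.sql.gz": 235.5,
    "pagelinks.sql.gz": 1.2 * 1024,
}

total_mb = sum(sizes_mb.values())
total_gb = total_mb / 1024.0
print("%.2f GB" % total_gb)
```

The total comes out just under 5 GB, which matches the figure quoted earlier.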

Since DreamHost (my web host) offers a lot of disk space and the network connection is way better than the one I have at home, I downloaded the files to my DreamHost space. (Read more about DreamHost here.)

I was able to download all the files using wget except for “pages-articles.xml.bz2”. For some reason I cannot understand, wget was bailing out after downloading a few bytes (this seems to be a DreamHost-specific issue). To work around this issue, I wrote this Python script.

get_wiki_file.py

import urllib2

url = "http://download.wikimedia.org/enwiki/20070908/enwiki-20070908-pages-articles.xml.bz2"
outfname = 'enwiki-20070908-pages-articles.xml.bz2'
chunk_size = 10240  # read 10 KB at a time

o = open(outfname, 'wb')
r = urllib2.urlopen(url)
total_bytes = 0
counter = 0

while 1:
        chunk = r.read(chunk_size)
        # an empty read means the server has sent everything
        if not chunk: break
        total_bytes += len(chunk)
        o.write(chunk)
        counter += 1
        # report progress after every 10 chunks (about 100 KB)
        if counter % 10 == 0:
                counter = 0
                print "%.2f MB" % (total_bytes/1024.0/1024.0)
o.close()
r.close()
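The script above is written for Python 2, where urllib2 exists. For readers on Python 3, the same chunked copy could be sketched roughly as follows. The function names copy_stream and download are made up for this illustration, and the network part is kept separate so that the copy loop can be tried on any file-like object.

```python
import urllib.request


def copy_stream(src, dst, chunk_size=10240, report_every=10):
    """Copy src to dst in fixed-size chunks; return the number of bytes copied."""
    total_bytes = 0
    chunks = 0
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            # An empty read means the stream is exhausted.
            break
        dst.write(chunk)
        total_bytes += len(chunk)
        chunks += 1
        if chunks % report_every == 0:
            print("%.2f MB" % (total_bytes / 1024.0 / 1024.0))
    return total_bytes


def download(url, out_path):
    """Fetch url into out_path using copy_stream (hypothetical helper)."""
    with urllib.request.urlopen(url) as response, open(out_path, "wb") as out:
        return copy_stream(response, out)
```

Calling download(url, outfname) with the two values from get_wiki_file.py would do the same job as the original script.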

Preparing dumps

The next step was to extract all the archives. I used bunzip2 for .bz2 files and gunzip for the .gz files. I tried loading one of the large .sql files into the database and the process was killed (I guess this is because DreamHost does not like resource hungry processes running for a long time). To work around this I had to split all the big .sql files into smaller chunks.

For example:

split -d -l 50 ../enwiki-20070908-page page.input.

The -l option tells split how many lines to put in each piece, and -d tells it to use numeric suffixes (which will be useful soon).

However, the pages-articles file is XML, not SQL. To load it into the database, I first had to convert it to a .sql dump. xml2sql is a handy program for doing this. You can get it here.
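If split is not available, the same idea can be sketched in Python. This is an illustration rather than what I ran: split_lines is a hypothetical helper, it reads a .gz file directly as a convenience, and past 99 pieces its suffixes simply grow to three digits, which is not how GNU split names them.

```python
import gzip


def split_lines(src_path, out_prefix, lines_per_piece=50):
    """Write src_path out as numbered pieces of at most lines_per_piece lines.

    Pieces are named out_prefix + "00", "01", ... in the style of `split -d`.
    A source whose name ends in .gz is decompressed on the fly.
    Returns the list of piece paths in the order they were written.
    """
    opener = gzip.open if src_path.endswith(".gz") else open
    pieces = []
    out = None
    with opener(src_path, "rb") as src:
        for line_no, line in enumerate(src):
            if line_no % lines_per_piece == 0:
                # Time to start the next piece.
                if out is not None:
                    out.close()
                piece_path = "%s%02d" % (out_prefix, len(pieces))
                pieces.append(piece_path)
                out = open(piece_path, "wb")
            out.write(line)
    if out is not None:
        out.close()
    return pieces
```

For instance, split_lines("page.sql", "page.input.", 50) would produce page.input.00, page.input.01 and so on (the input file name here is only an example).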

xml2sql -v -m pages-articles.xml

This command will produce text.sql, page.sql and revision.sql. However, I ran into a problem here: xml2sql's memory use kept growing, and once it reached about 88 MB of resident memory it was killed by DreamHost's process monitor. I tried running valgrind on it but could not find any leaks (the memory must be getting freed on exit).

This forced me to split the huge XML file into manageable parts so that xml2sql would stay within 88 MB. I wrote this Python script for splitting the XML.

split_xml_dump.py

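import os, os.path

fname = "enwiki-20070908-pages-articles.xml"
pages_per_file = 250000
total_page_counter = cur_page_counter = 0
o = None
outf_counter = 0
outdir = 'pages_xml_splits/'

# the output directory has to exist before the first chunk is opened
if not os.path.isdir(outdir):
        os.makedirs(outdir)

for line in open(fname):
        # drop the root tags of the dump; each chunk gets its own pair
        if line.startswith("<mediawiki") or line.startswith("</mediawiki>"):
                continue

        # start a new chunk at the beginning and after every full set of pages
        if o is None or cur_page_counter == pages_per_file:
                cur_page_counter = 0
                outf_counter += 1
                outfname = 'pagexml.%d' % outf_counter
                outfname = os.path.join(outdir, outfname)
                print outfname

                # close the previous chunk with its own root end tag
                if o:
                        o.write('</mediawiki>\n')
                        o.close()

                o = open(outfname, 'wb')
                o.write('<mediawiki>\n')

        # a page ends on a line holding only its closing tag
        if line == '  </page>\n':
                cur_page_counter += 1
                total_page_counter += 1
                if total_page_counter % 10000 == 0: print total_page_counter

        o.write(line)

# finish the last chunk
if o:
        o.write('</mediawiki>\n')
        o.close()

print total_page_counter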

Once the file was split into chunks of 250,000 articles each, I used xml2sql on each chunk to get the corresponding text.sql.
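Before running xml2sql, it may be worth checking that the chunks add up. The sketch below (Python 3) counts the closing page tags in every pagexml.* file. It relies on the same assumption as split_xml_dump.py, namely that each closing page tag sits on a line of its own; count_pages and check_chunks are hypothetical helper names.

```python
import glob
import os


def count_pages(path):
    """Count the lines that close a <page> element in one XML file."""
    count = 0
    with open(path, "rb") as f:
        for line in f:
            if line.strip() == b"</page>":
                count += 1
    return count


def check_chunks(chunk_dir, pages_per_chunk=250000):
    """Return (total_pages, oversized) for the pagexml.* files in chunk_dir.

    oversized lists the chunk files holding more pages than pages_per_chunk.
    """
    total = 0
    oversized = []
    for path in sorted(glob.glob(os.path.join(chunk_dir, "pagexml.*"))):
        n = count_pages(path)
        total += n
        if n > pages_per_chunk:
            oversized.append(path)
    return total, oversized
```

The total returned by check_chunks("pages_xml_splits/") should equal the last number printed by split_xml_dump.py, and the oversized list should be empty.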

Loading the dumps into the database

I had now readied all the input (a bunch of .sql files) and had to load them into the database. Before I started the load, I had to create a database to hold the tables and the “text” table.

mysql -h hostname -u username -ppassword
> CREATE DATABASE wiki;
> USE wiki;
> CREATE TABLE `text` (
  old_id int UNSIGNED NOT NULL AUTO_INCREMENT,
  old_text mediumblob NOT NULL,
  old_flags tinyblob NOT NULL,
  PRIMARY KEY old_id (old_id)
) MAX_ROWS=10000000 AVG_ROW_LENGTH=10240;
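The MAX_ROWS and AVG_ROW_LENGTH options in the statement above are size hints, and multiplying them shows how much data the table is being planned for. The sketch below only does that arithmetic. How MySQL uses the hints depends on the storage engine, which this post does not specify; my understanding is that MyISAM uses the product to size its row pointers, but treat that as an assumption.

```python
MAX_ROWS = 10000000
AVG_ROW_LENGTH = 10240  # bytes per row, as given in the CREATE TABLE statement


def planned_table_bytes(max_rows, avg_row_length):
    """Size in bytes that the two table options together describe."""
    return max_rows * avg_row_length


size = planned_table_bytes(MAX_ROWS, AVG_ROW_LENGTH)
print("%.1f GB" % (size / 1e9))        # decimal gigabytes
print("%.1f GiB" % (size / 2.0 ** 30))  # binary gigabytes
```

The product is roughly 100 GB, far more than the dumps need, so the hints leave plenty of headroom.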

The splitting took about 90 minutes to finish. The following script loads the .sql files into the database one after the other.

load_splits.py

import os
import os.path
import glob
import shutil

# both splits/ and processed_splits/ must exist before this is run
while 1:
        # list the remaining splits, e.g. page.input.07 -> (7, 'page.input.07')
        fnames = [os.path.basename(f) for f in glob.glob('splits/*.input.*')]
        fnames = [(int(f.split('.')[2]), f) for f in fnames]
        # lowest numeric suffix first
        fnames.sort()

        if len(fnames) == 0: break
        print "found %d files" % len(fnames)

        fname = os.path.join('splits/', fnames[0][1])
        to_fname = os.path.join('processed_splits/', fnames[0][1])
        error_fname = os.path.join('processed_splits/', fnames[0][1] + '.error')

        print "processing %s" % fname

        cmd = 'mysql -h hostname -ppassword -u username wiki < "%s"' % fname
        result = os.system(cmd)
        # a non-zero exit status means mysql failed on this file
        if result != 0:
                shutil.move(fname, error_fname)
        else:
                shutil.move(fname, to_fname)

        print "processed %s" % fname
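load_splits.py orders the files by their numeric suffix alone, so pieces of different tables get interleaved. A variation, sketched here in Python 3 and not what the script above does, is to order by table prefix first and numeric suffix second. split_sort_key and load_order are hypothetical names, and the sketch assumes every file name ends in a purely numeric suffix.

```python
import os


def split_sort_key(path):
    """Sort key: (everything before the last dot, numeric suffix)."""
    name = os.path.basename(path)
    prefix, _, suffix = name.rpartition(".")
    return (prefix, int(suffix))


def load_order(paths):
    """Return paths grouped by table, each group in numeric order."""
    return sorted(paths, key=split_sort_key)


if __name__ == "__main__":
    example = ["splits/redirect.input.0", "splits/page.input.10", "splits/page.input.2"]
    for path in load_order(example):
        print(path)
```

Comparing the suffixes as numbers also keeps the order right if they are not zero-padded, which a plain string sort would get wrong (it would put 10 before 2).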

Loading the .sql files into the database will take a long, long time. I started it yesterday morning and it is still running! While the data is loading, you can check out this Wikipedia database schema diagram.

In a follow-up to this article, I will write about how I use the Wikipedia database to streamline my news feeds.


"Ways to process and use Wikipedia dumps" was published on October 17th, 2007 and is listed in programming.


Comments on "Ways to process and use Wikipedia dumps": 16 Comments

  1. Sriram Krishnan wrote,

    Great article! I did something similar but on my local machine. wget bombed for me as well (this was Cygwin on Windows).

    Why didnt you do all this on your local machine (instead of on Dreamhost)?

  2. prashanthellina wrote,

    Interesting, I tried wget from my local machine and It was downloading alright.

    I want to get the processed data to be accessible through some kind of API so anybody can “query” the database over HTTP. Instead of getting 5GB to my disk and then uploading back to Dreamhost, I felt it better to do it there itself. I have a 128kbps “broadband” connection, so you can imagine the upload rate.

  3. senddesks » Ways to process and use Wikipedia dumps wrote,

    […] here for full […]

  4. satheesh nair wrote,

    Prashanth, Can we discuss a deal to develop a wikipedia mirror for a project of mine please. Please give me your contact details, I am in bangalore

  5. prashanthellina wrote,

    Hi Satheesh,

    My email is prashanthBLAHellina AT gmail DOTT com (remove BLAH).

  6. someone wrote,

    Hi

    I used your split script.._thanks_ a lot for it!
    but are you sure that it is accurate? does it need to be updated?

  7. prashanthellina wrote,

    Great! It worked for me so I assume it is correct. I have not run this against the latest xml though.

  8. totic wrote,

    I also had the problem with wget, it seems to always fail when the files are incredible large, my solution was to just
    use curl

    example:

    curl http://download.wikimedia.org/enwiki/20070206/enwiki-20070206-pages-articles.xml.bz2 -o enwiki-20070206-pages-articles.xml.bz2

  9. prashanthellina wrote,

    Thanks for the info, Totic.

  10. RK wrote,

    Hello,

    I’ve sent you an add request on gmail for a site we need done integrated with mediawiki and wikipedia database dump. PLease accept the add request so that we can discuss it in detail.

  11. prashanthellina wrote,

    RK, I prefer communicating via email. Please mail me at the same address and we can have a discussion.

  12. indrajeet wrote,

    tell me, how to dump the database after installing mediawiki
    please tell me i am waiting for ur reply.

    thanks and regards
    Indrajeet Dhanjode

  13. rich wrote,

    THANK YOU!! This is extremely helpful. I downloaded one of the dumps and had no clue what to do with it. I figured I would just start and tinker with it along the way (that’s how I learn everything for computers, treat it like a puzzle to solve and I end up teaching myself just about anything) but when I saw the dump would decompress into massively large files and take a while to do it, I decided to take a step back and slow down before I blow up my computer in the process. This helped me more than you can know, thanks so much!!!!

  14. RK wrote,

    Mailed you but couldnt get a revert. You are the only person i know who can get this done :)

  15. prashanthellina wrote,

    Rich, way to go! That’s a wonderful way to learn. Am glad I was of help. Enjoy!
    RK, will get back to you by mail shortly.

  16. prashanthellina wrote,

    Indrajeet, I did not understand your question. What are you trying to do?
