Wikipedia is a superb resource for reference (taken with a pinch of salt of course). I spend hours at a time spidering through its pages and always come away amazed at how much information it hosts. In my opinion this ranks amongst the defining milestones of mankind’s advancement.
Apart from being available through http://www.wikipedia.org, the data is provided for download so that you can create a mirror locally for quicker access. This is very convenient when you are not connected to the internet, say when you are on the move.
Setting up a local copy of Wikipedia
Windows
If you have Windows installed, Webaroo is an easy way to get Wikipedia locally as a “web pack”. Check out Webaroo here. Another way on Windows is to use WikiFilter. I tried WikiFilter and found it a good option (It is open source, so you can tweak it). It takes up around 3 to 3.5GB on your disk.
Linux
This page has instructions for setting it up on Linux. “Building a (fast) Wikipedia offline reader” is a good option too.
Any operating system
Wikipedia provides static wiki dumps for download which should work fine on any operating system that supports a decent web browser. Although I have not tried it, I have heard that the dumps can take up as much as 80GB of space on your disk.
Windows Mobile, iPhone and Blackberry
To access Wikipedia from your mobile, check out vTap from Veveo. I must tell you that I work for this company but I am being very objective in suggesting this service to you. A Java version is being developed and will be out soon. Since the space on mobile devices is very limited, the data is hosted on vTap servers and network connectivity is required.
Other uses for Wikipedia data dumps
Being such a vast repository of knowledge, Wikipedia is useful in many other ways. I want to use Wikipedia’s data to handle the feeds I read every day. The same news article comes in from different sources, and multiple times from the same source, and I end up reading all of them. I am going to try to use Wikipedia to help me automatically pull these news articles together and cluster them around topics.
The first step in this experiment is to get the data dumps from Wikipedia and process them so they can be loaded into a MySQL database. Once the data is in a database, things become more manageable.
Getting the dumps
Wikipedia is huge and this is reflected in the data dumps. It took me about 40 hours to get the articles XML alone on my home connection. Put together, all the relevant dump files come to about 5 GB. I got the dumps from here. You can check for more recent ones here.
Files to get
- pages-articles.xml.bz2 (2.8 GB) - xml containing page texts
- redirect.sql.gz (10.5 MB) - sql for redirected pages info
- page.sql.gz (318.9 MB) - minimal page information (id, page title)
- externallinks.sql.gz (439.6 MB) - links from pages to external sites
- categorylinks.sql.gz (235.5 MB) - categories
- pagelinks.sql.gz (1.2 GB) - links between wiki pages
Since DreamHost (my web host) offers a lot of disk space and its network connection is way better than the one I have at home, I downloaded the files to my DreamHost space. (read more about DreamHost here)
I was able to download all the files using wget except for “pages-articles.xml.bz2”. For some reason I cannot understand, wget was bailing out after downloading a few bytes (this seems to be a DreamHost-specific issue). To work around this, I wrote this Python script:
get_wiki_file.py
import urllib2, sys

url = "http://download.wikimedia.org/enwiki/20070908/enwiki-20070908-pages-articles.xml.bz2"
outfname = 'enwiki-20070908-pages-articles.xml.bz2'

o = open(outfname, 'wb')
r = urllib2.urlopen(url)

total_bytes = 0
counter = 0

while 1:
    bytes = r.read(10240)
    total_bytes += len(bytes)
    o.write(bytes)
    if len(bytes) < 10240:
        break
    counter += 1
    if counter % 10 == 0:
        counter = 0
        print "%.2f MB" % (total_bytes/1024.0/1024.0)

o.close()
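For anyone on Python 3, where urllib2 has become urllib.request, the same download loop might look roughly like the sketch below. The function name download and its parameters are my own choices for illustration, not part of the original script. The one deliberate difference is that it stops only on an empty read, whereas the script above stops at the first read that comes back shorter than 10240 bytes.

```python
import urllib.request

CHUNK_SIZE = 10240  # same read size as the script above


def download(url, out_path, chunk_size=CHUNK_SIZE, report_every=10):
    """Stream url to out_path and return the number of bytes written.

    The name and parameters are illustrative, not from the original script.
    """
    total_bytes = 0
    reads = 0
    with urllib.request.urlopen(url) as response, open(out_path, "wb") as out:
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                # A short read is not treated as end of file; only an
                # empty read ends the loop.
                break
            out.write(chunk)
            total_bytes += len(chunk)
            reads += 1
            if reads % report_every == 0:
                print("%.2f MB" % (total_bytes / 1024.0 / 1024.0))
    return total_bytes
```

It would be called with the same URL and output file name as above, for example download(url, 'enwiki-20070908-pages-articles.xml.bz2').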
Preparing dumps
The next step was to extract all the archives. I used bunzip2 for .bz2 files and gunzip for the .gz files. I tried loading one of the large .sql files into the database and the process was killed (I guess this is because DreamHost does not like resource hungry processes running for a long time). To work around this I had to split all the big .sql files into smaller chunks.
For example:
split -d -l 50 ../enwiki-20070908-page page.input.
The -l option tells split how many lines to put in each chunk, and -d tells it to use numeric suffixes (which will be useful soon).
However, the pages-articles file is XML, not SQL. To load it into the database, I first had to convert it to a .sql dump. xml2sql is a handy program for doing this. You can get it here.
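If split is not available (on Windows, say), the same chunking can be sketched in a few lines of Python 3. This is only an approximation of what split -d -l does: the function name split_by_lines and its parameters are made up for this example, and the two-digit suffix is an assumption that mirrors names like page.input.00.

```python
def split_by_lines(src_path, prefix, lines_per_chunk=50, suffix_width=2):
    """Write consecutive groups of lines to prefix00, prefix01, ...

    Hypothetical helper for illustration. Returns the chunk file names
    in the order they were written.
    """
    written = []
    out = None
    with open(src_path, "rb") as src:
        for line_number, line in enumerate(src):
            if line_number % lines_per_chunk == 0:
                # Start a new chunk every lines_per_chunk lines.
                if out is not None:
                    out.close()
                name = "%s%0*d" % (prefix, suffix_width, len(written))
                out = open(name, "wb")
                written.append(name)
            out.write(line)
    if out is not None:
        out.close()
    return written
```

Lines are copied one at a time, so a chunk never has to fit in memory, which matters because a single line in these .sql dumps can be very long.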
xml2sql -v -m pages-articles.xml
This command will produce text.sql, page.sql and revision.sql. However, I ran into a problem here: xml2sql's memory use crept up to 88 MB of resident memory, at which point DreamHost killed the process. I ran it under valgrind but could not find any leaks (the memory must be getting freed on exit).
This forced me to split the huge XML file into manageable parts so that xml2sql would stay under 88 MB. I wrote this Python script to split it.
split_xml_dump.py
import os, os.path

fname = "enwiki-20070908-pages-articles.xml"

total_page_counter = cur_page_counter = 0
o = None
outf_counter = 0
outdir = 'pages_xml_splits/'

for line in open(fname):
    if line.startswith("<mediawiki") or line.startswith("</mediawiki>"):
        continue

    if o is None or cur_page_counter == 250000:
        cur_page_counter = 0
        outf_counter += 1
        outfname = 'pagexml.%d' % outf_counter
        outfname = os.path.join(outdir, outfname)
        print outfname
        if o:
            o.write('</mediawiki>\n')
            o.close()
        o = open(outfname, 'wb')
        o.write('<mediawiki>\n')

    if line == ' </page>\n':
        cur_page_counter += 1
        total_page_counter += 1
        if total_page_counter % 10000 == 0:
            print total_page_counter

    o.write(line)

if o:
    o.write('</mediawiki>\n')
    o.close()

print total_page_counter
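A Python 3 variation on the splitter above is sketched below. It is not the script I ran. The function name split_pages and its parameters are invented for the example, and it assumes each closing page tag sits on a line of its own. It compares the stripped line against the closing tag, so the exact indentation in the dump does not matter.

```python
import os


def split_pages(src_path, out_dir, pages_per_chunk=250000):
    """Split a MediaWiki XML dump into files of at most pages_per_chunk pages.

    Hypothetical helper for illustration. Returns the number of pages seen.
    """
    os.makedirs(out_dir, exist_ok=True)
    out = None
    chunk_index = 0
    pages_in_chunk = 0
    total_pages = 0
    with open(src_path, "r", encoding="utf-8") as src:
        for line in src:
            stripped = line.strip()
            # Drop the original root tags; every chunk gets its own pair.
            if stripped.startswith("<mediawiki") or stripped.startswith("</mediawiki>"):
                continue
            # Open the next chunk lazily, so no empty file is left at the end.
            if out is None or pages_in_chunk == pages_per_chunk:
                if out is not None:
                    out.write("</mediawiki>\n")
                    out.close()
                chunk_index += 1
                pages_in_chunk = 0
                chunk_path = os.path.join(out_dir, "pagexml.%d" % chunk_index)
                out = open(chunk_path, "w", encoding="utf-8")
                out.write("<mediawiki>\n")
            out.write(line)
            if stripped == "</page>":
                pages_in_chunk += 1
                total_pages += 1
    if out is not None:
        out.write("</mediawiki>\n")
        out.close()
    return total_pages
```

For example, split_pages("enwiki-20070908-pages-articles.xml", "pages_xml_splits") would produce pagexml.1, pagexml.2 and so on, each wrapped in its own mediawiki element.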
Once the file was split into chunks of 250,000 articles each, I ran xml2sql on each chunk to get the corresponding text.sql.
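Running xml2sql by hand on every chunk gets tedious, so the loop can be scripted. The sketch below rests on an assumption I have not verified: that xml2sql writes text.sql, page.sql and revision.sql into the current working directory. If it does, each chunk needs its own directory so that the outputs do not overwrite each other. The function name convert_chunks, the sql_root directory and the run parameter are all hypothetical.

```python
import glob
import os
import subprocess


def convert_chunks(split_dir, sql_root, run=subprocess.call):
    """Run xml2sql once per chunk, each in its own output directory.

    Hypothetical helper for illustration. Returns the chunks that failed.
    """
    failed = []
    chunks = sorted(
        glob.glob(os.path.join(split_dir, "pagexml.*")),
        key=lambda path: int(path.rsplit(".", 1)[1]),  # numeric chunk order
    )
    for chunk in chunks:
        out_dir = os.path.join(sql_root, os.path.basename(chunk))
        os.makedirs(out_dir, exist_ok=True)
        # Same -v -m flags as the command shown earlier. Setting cwd is
        # what keeps the outputs apart, assuming xml2sql writes to the
        # current directory.
        status = run(["xml2sql", "-v", "-m", os.path.abspath(chunk)], cwd=out_dir)
        if status != 0:
            failed.append(chunk)
    return failed
```

Something like convert_chunks("pages_xml_splits", "sql_chunks") would then leave one directory of .sql files per chunk and return the chunks whose conversion failed.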
Loading the dumps into the database
I had now readied all the input (a bunch of .sql files) and had to load them into the database. Before I started the load, I had to create the database that holds the tables, along with the “text” table itself.
mysql -h hostname -u username -ppassword
> CREATE DATABASE wiki;
> USE wiki;
> CREATE TABLE `text` (
    old_id int UNSIGNED NOT NULL AUTO_INCREMENT,
    old_text mediumblob NOT NULL,
    old_flags tinyblob NOT NULL,
    PRIMARY KEY old_id (old_id)
  ) MAX_ROWS=10000000 AVG_ROW_LENGTH=10240;
The splitting took about 90 minutes to finish. This script loads the .sql files into the database one after the other, moving each file out of the way once it has been processed.
load_splits.py
import os
import os.path
import glob
import shutil

while 1:
    fnames = [os.path.basename(f) for f in glob.glob('splits/*.input.*')]
    fnames = [(int(f.split('.')[2]), f) for f in fnames]
    fnames.sort()

    if len(fnames) == 0:
        break

    print "found %d files" % len(fnames)

    fname = os.path.join('splits/', fnames[0][1])
    to_fname = os.path.join('processed_splits/', fnames[0][1])
    error_fname = os.path.join('processed_splits/', fnames[0][1] + '.error')

    print "processing %s" % fname
    cmd = 'mysql -h hostname -ppassword -u username wiki < "%s"' % fname
    result = os.system(cmd)

    if result != 0:
        shutil.move(fname, error_fname)
    else:
        shutil.move(fname, to_fname)

    print "processed %s" % fname
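The loader above can also be written for Python 3 with subprocess instead of os.system. The version below is a sketch rather than the script I ran: the names split_number and load_splits are made up, and the database command is passed in as a list so that it can be swapped out. Feeding each file to the command on stdin does the job of the shell redirect in the original.

```python
import os
import shutil
import subprocess


def split_number(name):
    """Numeric suffix of a name such as 'page.input.07'."""
    return int(name.rsplit(".", 1)[1])


def load_splits(split_dir, done_dir, command):
    """Feed each split file to command on stdin, in numeric order.

    Hypothetical helper for illustration. Returns (loaded, failed)
    lists of file names.
    """
    os.makedirs(done_dir, exist_ok=True)
    loaded, failed = [], []
    names = [n for n in os.listdir(split_dir) if ".input." in n]
    for name in sorted(names, key=lambda n: (split_number(n), n)):
        path = os.path.join(split_dir, name)
        with open(path, "rb") as sql:
            status = subprocess.call(command, stdin=sql)
        if status == 0:
            shutil.move(path, os.path.join(done_dir, name))
            loaded.append(name)
        else:
            # Keep failed files around under a .error name, as above.
            shutil.move(path, os.path.join(done_dir, name + ".error"))
            failed.append(name)
    return loaded, failed
```

With the placeholder credentials used earlier, the call would be load_splits("splits", "processed_splits", ["mysql", "-h", "hostname", "-u", "username", "-ppassword", "wiki"]). Passing the command as a list avoids shell quoting problems with odd file names.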
Loading the .sql files into the database will take a long long time. I started it yesterday morning and it is still running! As the data is loading, you can check out this Wikipedia database schema diagram.
In a follow-up to this article, I will write about how I use the Wikipedia database to streamline my news feeds.
Tags: data mining, programming, python, text processing, wikipedia















Sriram Krishnan wrote,
Great article! I did something similar but on my local machine. wget bombed for me as well (this was Cygwin on Windows).
Why didnt you do all this on your local machine (instead of on Dreamhost)?
Link | October 18th, 2007 at 12:03 am
prashanthellina wrote,
Interesting, I tried wget from my local machine and It was downloading alright.
I want to get the processed data to be accessible through some kind of API so anybody can “query” the database over HTTP. Instead of getting 5GB to my disk and then uploading back to Dreamhost, I felt it better to do it there itself. I have a 128kbps “broadband” connection, so you can imagine the upload rate.
Link | October 18th, 2007 at 7:43 am
satheesh nair wrote,
Prashanth, Can we discuss a deal to develop a wikipedia mirror for a project of mine please. Please give me your contact details, I am in bangalore
Link | October 24th, 2007 at 7:42 pm
prashanthellina wrote,
Hi Satheesh,
My email is prashanthBLAHellina AT gmail DOTT com (remove BLAH).
Link | October 24th, 2007 at 8:13 pm
someone wrote,
Hi
I used your split script.._thanks_ a lot for it!
but are you sure that it is accurate? does it need to be updated?
Link | April 25th, 2008 at 6:41 pm
prashanthellina wrote,
Great! It worked for me so I assume it is correct. I have not run this against the latest xml though.
Link | April 27th, 2008 at 11:46 am
totic wrote,
I also had the problem with wget, it seems to always fail when the files are incredible large, my solution was to just
use curl
example:
curl http://download.wikimedia.org/enwiki/20070206/enwiki-20070206-pages-articles.xml.bz2 -o enwiki-20070206-pages-articles.xml.bz2
Link | May 18th, 2008 at 12:24 pm
prashanthellina wrote,
Thanks for the info, Totic.
Link | May 23rd, 2008 at 7:44 pm
RK wrote,
Hello,
I’ve sent you an add request on gmail for a site we need done integrated with mediawiki and wikipedia database dump. PLease accept the add request so that we can discuss it in detail.
Link | June 14th, 2008 at 8:53 pm
prashanthellina wrote,
RK, I prefer communicating via email. Please mail me at the same address and we can have a discussion.
Link | June 15th, 2008 at 10:30 am
indrajeet wrote,
tell me, how to dump the database after installing mediawiki
please tell me i am waiting for ur reply.
thanks and regards
Indrajeet Dhanjode
Link | July 7th, 2008 at 4:50 pm
rich wrote,
THANK YOU!! This is extremely helpful. I downloaded one of the dumps and had no clue what to do with it. I figured I would just start and tinker with it along the way (that’s how I learn everything for computers, treat it like a puzzle to solve and I end up teaching myself just about anything) but when I saw the dump would decompress into massively large files and take a while to do it, I decided to take a step back and slow down before I blow up my computer in the process. This helped me more than you can know, thanks so much!!!!
Link | July 8th, 2008 at 6:31 pm
RK wrote,
Mailed you but couldnt get a revert. You are the only person i know who can get this done
Link | July 8th, 2008 at 11:05 pm
prashanthellina wrote,
Rich, way to go! That’s a wonderful way to learn. Am glad I was of help. Enjoy!
RK, will get back to you by mail shortly.
Link | July 17th, 2008 at 9:26 pm
prashanthellina wrote,
Indrajeet, I did not understand your question. What are you trying to do?
Link | July 17th, 2008 at 9:27 pm