Wikipedia is a superb reference resource (taken with a pinch of salt, of course). I spend hours at a time spidering through its pages and always come away amazed at how much information it hosts. In my opinion, it ranks amongst the defining milestones of mankind's advancement.
Apart from being available through http://www.wikipedia.org, the data is provided for download so that you can create a mirror locally for quicker access. This is very convenient when you are not connected to the internet, say when you are on the move.
Setting up a local copy of Wikipedia
Windows
If you have Windows installed, Webaroo is an easy way to get Wikipedia locally as a "web pack". Check out Webaroo here. Another way on Windows is to use WikiFilter, which I tried and found to be a good option (it is open source, so you can tweak it). It takes up around 3 to 3.5 GB on your disk.
Linux
This page has instructions for setting up a local copy on Linux. "Building a (fast) Wikipedia offline reader" is a good option too.
Any operating system
Wikipedia provides static wiki dumps for download, which should work fine on any operating system with a decent web browser. Although I have not tried it, I have heard that these dumps can take up as much as 80 GB of disk space.
Windows Mobile, iPhone and BlackBerry
To access Wikipedia from your mobile, check out vTap from Veveo. I should mention that I work for this company, but I am being objective in suggesting this service to you. A Java version is being developed and will be out soon. Since space on mobile devices is very limited, the data is hosted on vTap's servers and network connectivity is required.
Other uses for Wikipedia data dumps
Being such a vast repository of knowledge, Wikipedia is useful in many other ways. I want to use Wikipedia's data to handle the feeds I read every day. The same news article comes in from different sources, and multiple times from the same source, and I end up reading all of them. I am going to try to use Wikipedia to automatically pull these news articles together and cluster them around topics.
The first step in this experiment is to get the data dumps from Wikipedia, process them and load them into a MySQL database. Once the data is in a database, things become much more manageable.
Getting the dumps
Wikipedia is huge, and this is reflected in the data dumps. It took me about 40 hours to download the articles XML alone on my home connection. Put together, the relevant dump files come to about 5 GB. I got the dumps from here. You can check for more recent ones here.
Files to get
- pages-articles.xml.bz2 (2.8 GB) - XML containing the page texts
- redirect.sql.gz (10.5 MB) - SQL with redirect information
- page.sql.gz (318.9 MB) - minimal page information (id, page title)
- externallinks.sql.gz (439.6 MB) - links from pages to external sites
- categorylinks.sql.gz (235.5 MB) - category membership links
- pagelinks.sql.gz (1.2 GB) - internal links between wiki pages
Since DreamHost (my web host) offers a lot of disk space and its network connection is much better than the one I have at home, I downloaded the files to my DreamHost space. (Read more about DreamHost here.)
I was able to download all the files with wget except "pages-articles.xml.bz2". For a reason I could not figure out, wget kept bailing out after downloading a few bytes (this seems to be a DreamHost-specific issue). To work around it, I wrote this Python script:
get_wiki_file.py
import urllib2

url = "http://download.wikimedia.org/enwiki/20070908/enwiki-20070908-pages-articles.xml.bz2"
outfname = 'enwiki-20070908-pages-articles.xml.bz2'

o = open(outfname, 'wb')
r = urllib2.urlopen(url)
total_bytes = 0
counter = 0
while 1:
    # read the file in 10 KB chunks, printing progress every 100 KB
    bytes = r.read(10240)
    total_bytes += len(bytes)
    o.write(bytes)
    if len(bytes) < 10240: break
    counter += 1
    if counter % 10 == 0:
        counter = 0
        print "%.2f MB" % (total_bytes / 1024.0 / 1024.0)
o.close()
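The same approach extends to fetching the remaining dump files in one go, in case wget is not an option for them either. Here is a rough sketch along the same lines; the file names below are my assumption of how the dump directory names them (following the enwiki-20070908- prefix used above), so verify them against the directory listing first.

import urllib2

base_url = "http://download.wikimedia.org/enwiki/20070908/"
# assumed file names -- check them against the dump directory listing
files = [
    "enwiki-20070908-redirect.sql.gz",
    "enwiki-20070908-page.sql.gz",
    "enwiki-20070908-externallinks.sql.gz",
    "enwiki-20070908-categorylinks.sql.gz",
    "enwiki-20070908-pagelinks.sql.gz",
]

for fname in files:
    print "downloading", fname
    r = urllib2.urlopen(base_url + fname)
    o = open(fname, 'wb')
    while 1:
        chunk = r.read(1024 * 1024)  # 1 MB at a time
        if not chunk: break
        o.write(chunk)
    o.close()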
Preparing dumps
The next step was to extract all the archives. I used bunzip2 for the .bz2 file and gunzip for the .gz files. I then tried loading one of the large .sql files into the database, and the process was killed (I guess DreamHost does not like resource-hungry processes running for a long time). To work around this, I had to split the big .sql files into smaller chunks.
For example:
split -d -l 50 ../enwiki-20070908-page page.input.
The -l option tells split how many lines we want per chunk, and -d tells it to use numeric suffixes (which will come in handy soon).
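Rather than run split by hand on each file, a small wrapper can take care of all the big .sql dumps. This is just a sketch: the file names (and the 50-line chunk size) are assumptions based on the example above, so adjust them to match your extracted files.

import os

# assumed names of the extracted .sql dumps -- adjust to match your files
big_dumps = [
    'enwiki-20070908-page',
    'enwiki-20070908-pagelinks',
    'enwiki-20070908-categorylinks',
    'enwiki-20070908-externallinks',
]

# run this from inside the splits/ directory, like the example above
for dump in big_dumps:
    prefix = dump.split('-')[-1]  # e.g. "page" -> page.input.00, page.input.01, ...
    cmd = 'split -d -l 50 "../%s" "%s.input."' % (dump, prefix)
    print cmd
    os.system(cmd)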
However, the pages-articles file is XML, not SQL. To load it into the database, I first had to convert it into a .sql dump. xml2sql is a handy program for doing this; you can get it here.
xml2sql -v -m pages-articles.xml
This command produces text.sql, page.sql and revision.sql. However, I ran into a problem here: xml2sql's memory usage kept growing and had reached about 88 MB of resident memory when it was killed by DreamHost. I tried running valgrind on it but could not find any leaks (the memory must be getting freed on exit).
This forced me to split the huge XML file into manageable parts so that xml2sql would stay under 88 MB. I wrote this Python script to split the XML:
split_xml_dump.py
import os, os.path

fname = "enwiki-20070908-pages-articles.xml"
total_page_counter = cur_page_counter = 0
o = None
outf_counter = 0
outdir = 'pages_xml_splits/'

for line in open(fname):
    # skip the dump's own <mediawiki> wrapper lines; each split gets its own
    if line.startswith('<mediawiki') or line.startswith('</mediawiki'):
        continue
    # start a new output file every 250,000 pages
    if o is None or cur_page_counter == 250000:
        cur_page_counter = 0
        outf_counter += 1
        outfname = 'pagexml.%d' % outf_counter
        outfname = os.path.join(outdir, outfname)
        print outfname
        if o:
            o.write('</mediawiki>\n')
            o.close()
        o = open(outfname, 'wb')
        o.write('<mediawiki>\n')
    if line == '  <page>\n':
        cur_page_counter += 1
        total_page_counter += 1
        if total_page_counter % 10000 == 0: print total_page_counter
    o.write(line)

if o:
    o.write('</mediawiki>\n')
    o.close()
print total_page_counter
Once the file was split into chunks of 250,000 pages each, I ran xml2sql on each chunk to get the corresponding text.sql.
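This step is easy to script as well. Below is a sketch of the kind of loop I mean; it assumes the chunks are in pages_xml_splits/ as written by the script above, keeps only text.sql (the page information comes from the downloaded page.sql.gz anyway), and renames each chunk to text.input.N so that the load script in the next section picks it up. That naming is purely my own convention.

import os, glob, shutil

chunks = glob.glob('pages_xml_splits/pagexml.*')
chunks.sort(key=lambda f: int(f.split('.')[-1]))  # pagexml.1, pagexml.2, ...

for chunk in chunks:
    n = chunk.split('.')[-1]
    print "converting", chunk
    if os.system('xml2sql -v -m "%s"' % chunk) != 0:
        print "xml2sql failed on", chunk
        break
    # keep only the text table; rename it so the load script finds it in splits/
    shutil.move('text.sql', 'splits/text.input.%s' % n)
    for extra in ('page.sql', 'revision.sql'):
        if os.path.exists(extra):
            os.remove(extra)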
Loading the dumps into the database
I had now readied all the input (a bunch of .sql files) and had to load it into the database. Before starting the load, I had to create a database to hold the tables and create the "text" table itself:
mysql -h hostname -u username -ppassword
> create database wiki;
> use wiki;
> create table `text` (
    old_id int unsigned NOT NULL auto_increment,
    old_text mediumblob NOT NULL,
    old_flags tinyblob NOT NULL,
    PRIMARY KEY old_id (old_id)
  ) MAX_ROWS=10000000 AVG_ROW_LENGTH=10240;
The splitting took about 90 minutes to finish. The following script loads the .sql files into the database one after the other.
load_splits.py
import os
import os.path
import glob
import shutil

while 1:
    # pick the lowest-numbered split still waiting in splits/
    fnames = [os.path.basename(f) for f in glob.glob('splits/*.input.*')]
    fnames = [(int(f.split('.')[2]), f) for f in fnames]
    fnames.sort()
    if len(fnames) == 0: break
    print "found %d files" % len(fnames)
    fname = os.path.join('splits/', fnames[0][1])
    to_fname = os.path.join('processed_splits/', fnames[0][1])
    error_fname = os.path.join('processed_splits/', fnames[0][1] + '.error')
    print "processing %s" % fname
    cmd = 'mysql -h hostname -ppassword -u username wiki < "%s"' % fname
    result = os.system(cmd)
    if result != 0:
        # keep failed splits around with a .error suffix so they can be retried
        shutil.move(fname, error_fname)
    else:
        shutil.move(fname, to_fname)
    print "processed %s" % fname
Loading the .sql files into the database takes a long, long time. I started it yesterday morning and it is still running! While the data loads, you can check out this Wikipedia database schema diagram.
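If you want a rough idea of how far the load has got, you can periodically count the rows that have made it into the text table. A minimal sketch, using the same mysql client invocation as the load script (host and credentials are placeholders):

import os, time

query = 'select count(*) from text'
cmd = 'mysql -h hostname -u username -ppassword -N -e "%s" wiki' % query

# prints the row count every 10 minutes; stop it with Ctrl-C
while 1:
    rows = os.popen(cmd).read().strip()
    print "rows loaded into text:", rows
    time.sleep(600)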
In a follow-up to this article, I will write about how I use the Wikipedia database to streamline my news feeds.