Let's Take Wikipedia Offline!

Well, not literally. But what if you could look up Wikipedia in an instant, without even requiring a working internet connection?
Call it doomsday prepping or just a precaution, but let’s see if we can have our own copy of Wikipedia running offline, without having too much trouble along the way.

As some of you might know, up until the global situation went sideways I travelled quite a lot and hence had to deal with all sorts of things that people working from a fixed location usually don’t experience. One of these things is bad internet. Depending on where in the world I was, I either had superb internet speeds and stable connections (<3 Seoul) or, well, not so much.
Back then I took care to have documentation stored offline on my machine, so that I wasn’t dependent on connectivity to be able to work.

However, I often ended up in situations in which I needed to look up some documentation on real life, for which I use Wikipedia most of the time. Crawling Wikipedia’s HTML for offline use is not feasible, and it’s not even necessary, as Wikipedia offers database dumps for everyone to download for free. Unfortunately, these database dumps aren’t exactly browsable the way they’re offered by Wikipedia. Luckily, there are ready-to-use apps (like Kiwix, Minipedia, XOWA, and many more) that try to offer an offline version of Wikipedia, either based on these dumps or through other means, but they’re all quite cumbersome to use and some have pretty terrible prerequisites, such as Java.

I was looking for a more lightweight approach that integrates well into my workflow (which is terminal-based) and doesn’t end up eating more storage than the actual dump itself, which at the time of writing is 81GB in total size (uncompressed).

A year ago I tried this experiment and used a tool called dumpster-dive to load the Wikipedia dump into MongoDB and access it using uveira, my own command line tool for that. While that solution was pretty good, I ended up with a 250GB database, which had to be stored somewhere. At some point, it just became too impractical to deal with.
So today I thought it might be a great day to try a different approach.

Downloading Wikipedia

First, let’s download the latest XML dump of the English Wikipedia. We’re going to use wget -c here, so that we can continue a partial download, just in case our internet connection drops.

wget -c \
  'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2'

Now that we’ve downloaded the archive, we can decompress it. Although the converter that we’re going to use can handle bzip2, I wanted to be able to compare the size of the uncompressed file to the database that we’ll be using later on.

bzip2 -d ./enwiki-latest-pages-articles.xml.bz2

This took about half an hour on my workstation.
Next, let’s install the converter. It is a script that comes with the gensim Python package, so you’ll need that on your system. Ideally you should be working in a virtualenv.

pip install gensim

Update: You might need to downgrade scipy to the last version supporting triu: pip install "scipy<1.13"

The gensim package includes a script named segment_wiki that allows us to easily transform the XML to JSON. The -i flag tells it to include each article’s interlinks in the output.

python \
  -m gensim.scripts.segment_wiki \
  -i \
  -f enwiki-latest-pages-articles.xml \
  -o enwiki-latest-pages-articles.json

On my machine I got roughly 50,000 articles per minute. I kept the default for -w (the number of workers), which is the number of CPU cores minus one, so 31 on my machine. At the time of writing, Wikipedia consists of 6,424,367 articles, meaning it should take around two hours to convert all articles to JSON. However, it ended up taking only about 80 minutes, as part of the content (redirects, stubs and ignored namespaces) was skipped. I didn’t bother to check in-depth, but 5,792,326 out of 6,424,367 didn’t sound too bad at all.

2021-12-19 11:49:36,808 - segment_wiki - INFO - running /home/mrus/.virtualenvs/local3.9/lib/python3.9/site-packages/gensim/scripts/segment_wiki.py -i -f enwiki-latest-pages-articles.xml -o enwiki-latest-pages-articles.json
2021-12-19 11:51:34,063 - segment_wiki - INFO - processed #100000 articles (at 'Maquiladora' now)
2021-12-19 11:53:26,704 - segment_wiki - INFO - processed #200000 articles (at 'Kiso Mountains' now)
2021-12-19 11:55:09,739 - segment_wiki - INFO - processed #300000 articles (at 'Yuncheng University' now)
2021-12-19 11:56:38,493 - segment_wiki - INFO - processed #400000 articles (at 'Georgia College & State University' now)
2021-12-19 11:58:02,725 - segment_wiki - INFO - processed #500000 articles (at 'Stephen Crabb' now)
2021-12-19 11:59:26,795 - segment_wiki - INFO - processed #600000 articles (at 'Ann Jillian' now)
2021-12-19 12:00:43,973 - segment_wiki - INFO - processed #700000 articles (at 'Patti Deutsch' now)
2021-12-19 12:02:04,450 - segment_wiki - INFO - processed #800000 articles (at 'George J. Hatfield' now)
2021-12-19 12:03:19,111 - segment_wiki - INFO - processed #900000 articles (at 'Baya Rahouli' now)
2021-12-19 12:04:34,628 - segment_wiki - INFO - processed #1000000 articles (at 'National Institute of Technology Agartala' now)
2021-12-19 12:05:47,267 - segment_wiki - INFO - processed #1100000 articles (at 'The Cost (The Wire)' now)
2021-12-19 12:06:53,764 - segment_wiki - INFO - processed #1200000 articles (at '1026 Ingrid' now)
2021-12-19 12:08:05,158 - segment_wiki - INFO - processed #1300000 articles (at '85th Street (Manhattan)' now)
2021-12-19 12:09:19,355 - segment_wiki - INFO - processed #1400000 articles (at '1914 Penn State Nittany Lions football team' now)
2021-12-19 12:10:31,633 - segment_wiki - INFO - processed #1500000 articles (at 'Dave Hanner' now)
2021-12-19 12:11:44,928 - segment_wiki - INFO - processed #1600000 articles (at 'Swampwater' now)
2021-12-19 12:12:49,631 - segment_wiki - INFO - processed #1700000 articles (at 'Sri Lanka Army Medical Corps' now)
2021-12-19 12:14:02,378 - segment_wiki - INFO - processed #1800000 articles (at 'Yannima Tommy Watson' now)
2021-12-19 12:15:17,821 - segment_wiki - INFO - processed #1900000 articles (at 'What Rhymes with Cars and Girls' now)
2021-12-19 12:16:39,472 - segment_wiki - INFO - processed #2000000 articles (at 'Dark Is the Night for All' now)
2021-12-19 12:17:56,125 - segment_wiki - INFO - processed #2100000 articles (at 'Russ Young' now)
2021-12-19 12:19:03,726 - segment_wiki - INFO - processed #2200000 articles (at 'Tyczyn, Łódź Voivodeship' now)
2021-12-19 12:20:17,750 - segment_wiki - INFO - processed #2300000 articles (at 'Sahara (House of Lords album)' now)
2021-12-19 12:21:27,071 - segment_wiki - INFO - processed #2400000 articles (at 'Limburg-Styrum-Gemen' now)
2021-12-19 12:22:30,234 - segment_wiki - INFO - processed #2500000 articles (at 'Bogoriella' now)
2021-12-19 12:23:51,045 - segment_wiki - INFO - processed #2600000 articles (at 'Laurel and Michigan Avenues Row' now)
2021-12-19 12:25:05,416 - segment_wiki - INFO - processed #2700000 articles (at 'Kessleria' now)
2021-12-19 12:26:20,186 - segment_wiki - INFO - processed #2800000 articles (at 'EuroLeague Awards' now)
2021-12-19 12:27:34,021 - segment_wiki - INFO - processed #2900000 articles (at 'A.K.O.O. Clothing' now)
2021-12-19 12:28:58,561 - segment_wiki - INFO - processed #3000000 articles (at 'Česukai' now)
2021-12-19 12:30:24,397 - segment_wiki - INFO - processed #3100000 articles (at 'Program 973' now)
2021-12-19 12:31:43,178 - segment_wiki - INFO - processed #3200000 articles (at 'Dingden railway station' now)
2021-12-19 12:33:04,632 - segment_wiki - INFO - processed #3300000 articles (at 'Nagareboshi' now)
2021-12-19 12:34:23,664 - segment_wiki - INFO - processed #3400000 articles (at 'Anton Lang (biologist)' now)
2021-12-19 12:35:45,825 - segment_wiki - INFO - processed #3500000 articles (at 'Opera (Super Junior song)' now)
2021-12-19 12:36:58,564 - segment_wiki - INFO - processed #3600000 articles (at 'Mycena sublucens' now)
2021-12-19 12:38:22,892 - segment_wiki - INFO - processed #3700000 articles (at 'Man Controlling Trade' now)
2021-12-19 12:39:47,713 - segment_wiki - INFO - processed #3800000 articles (at 'Marwan Issa' now)
2021-12-19 12:41:07,354 - segment_wiki - INFO - processed #3900000 articles (at 'Anita Willets Burnham Log House' now)
2021-12-19 12:42:22,563 - segment_wiki - INFO - processed #4000000 articles (at 'Robert Bresson bibliography' now)
2021-12-19 12:43:44,812 - segment_wiki - INFO - processed #4100000 articles (at 'Ainsworth House' now)
2021-12-19 12:45:07,713 - segment_wiki - INFO - processed #4200000 articles (at 'Gohar Rasheed' now)
2021-12-19 12:46:31,470 - segment_wiki - INFO - processed #4300000 articles (at 'C1orf131' now)
2021-12-19 12:47:53,227 - segment_wiki - INFO - processed #4400000 articles (at 'Commatica cyanorrhoa' now)
2021-12-19 12:49:17,669 - segment_wiki - INFO - processed #4500000 articles (at 'Personal horizon' now)
2021-12-19 12:50:54,496 - segment_wiki - INFO - processed #4600000 articles (at 'Berteling Building' now)
2021-12-19 12:52:17,359 - segment_wiki - INFO - processed #4700000 articles (at 'Nyamakala' now)
2021-12-19 12:53:41,792 - segment_wiki - INFO - processed #4800000 articles (at "2017 European Judo Championships – Men's 81 kg" now)
2021-12-19 12:55:16,781 - segment_wiki - INFO - processed #4900000 articles (at 'Matt Walwyn' now)
2021-12-19 12:56:45,081 - segment_wiki - INFO - processed #5000000 articles (at 'Ralph Richardson (geologist)' now)
2021-12-19 12:58:23,606 - segment_wiki - INFO - processed #5100000 articles (at 'Here Tonight (Brett Young song)' now)
2021-12-19 13:00:06,069 - segment_wiki - INFO - processed #5200000 articles (at 'Jacob Merrill Manning' now)
2021-12-19 13:01:33,116 - segment_wiki - INFO - processed #5300000 articles (at 'Bog of Beasts' now)
2021-12-19 13:03:05,347 - segment_wiki - INFO - processed #5400000 articles (at 'Nixon Jew count' now)
2021-12-19 13:04:37,976 - segment_wiki - INFO - processed #5500000 articles (at 'Rod Smith (American football coach)' now)
2021-12-19 13:06:11,536 - segment_wiki - INFO - processed #5600000 articles (at 'Stant' now)
2021-12-19 13:07:43,201 - segment_wiki - INFO - processed #5700000 articles (at 'Mitsui Outlet Park Tainan' now)
2021-12-19 13:09:14,378 - segment_wiki - INFO - finished processing 5792326 articles with 28506397 sections (skipped 9832469 redirects, 624207 stubs, 5419438 ignored namespaces)
2021-12-19 13:09:14,399 - segment_wiki - INFO - finished running /home/mrus/.virtualenvs/local3.9/lib/python3.9/site-packages/gensim/scripts/segment_wiki.py
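
The timestamps in the log make it easy to double-check the numbers. The following is just a small sketch of that arithmetic, using the first and last timestamps and the article counts quoted above; the variable names are made up for illustration.

```python
from datetime import datetime

# Timestamps and counts copied from the first and last log lines above.
LOG_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
started = datetime.strptime("2021-12-19 11:49:36,808", LOG_FORMAT)
finished = datetime.strptime("2021-12-19 13:09:14,378", LOG_FORMAT)
processed_articles = 5_792_326

elapsed_minutes = (finished - started).total_seconds() / 60
articles_per_minute = processed_articles / elapsed_minutes

# Up-front estimate: every article at the initial rate of ~50,000 per minute.
estimated_minutes = 6_424_367 / 50_000

print(f"elapsed:   {elapsed_minutes:.1f} min")
print(f"average:   {articles_per_minute:,.0f} articles/min")
print(f"estimated: {estimated_minutes:.0f} min")
```

The run took a little under 80 minutes, which works out to an average of roughly 73,000 articles per minute, noticeably faster than the rate seen in the first few minutes.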

The XML dump that I used was 81GB in size (uncompressed) and I ended up with a JSON file that was only around 31GB. Apart from skipped content, a significant portion of these savings is probably attributable to the change in format.

-rw-r--r-- 1 mrus mrus 31G Dec 19 13:09 ./enwiki-latest-pages-articles.json
-rw-r--r-- 1 mrus mrus 81G Dec  2 01:53 ./enwiki-latest-pages-articles.xml
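
Before indexing, it can help to peek at what the converter produced. The sketch below assumes the output is one JSON object per line with the keys title, section_titles, section_texts and interlinks (the same field names used in the index configuration further down). The sample article is made up so that the snippet runs without the 31GB file, and the function names are only illustrative.

```python
import io
import json


def iter_articles(stream):
    """Yield one article dict per non-empty line of a JSON-lines stream."""
    for line in stream:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(article):
    """Return the article title and its number of sections."""
    return article["title"], len(article["section_titles"])


# Made-up stand-in for a single line of enwiki-latest-pages-articles.json.
sample = io.StringIO(
    '{"title": "Example", "section_titles": ["Introduction", "History"], '
    '"section_texts": ["Intro text.", "History text."], "interlinks": []}\n'
)

summaries = [summarize(article) for article in iter_articles(sample)]
print(summaries)
```

For the real file, passing open("enwiki-latest-pages-articles.json", encoding="utf-8") instead of the sample should work the same way, since the file is read line by line and never loaded into memory as a whole.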

Making 31GB of JSON usable

Unfortunately, we won’t be able to efficiently query a 31GB JSON file just like that. What we need is a tool that can ingest such large amounts of data and make them searchable. The dumpster-dive solution used MongoDB for this purpose, which I found was not an ideal way to solve this problem. And since we don’t actually need to work with the data, a database offers little benefit for us.
Instead, a search engine makes a lot more sense.

A while ago I stumbled upon quickwit and found it an interesting project. At that time I had no use case that would allow me to test it – but this experiment seems like a great playground to give it a go!

Installation is fairly easy, even though it’s not available via cargo install. Simply clone the git repo and run cargo build --release --features release-feature-vendored-set. You’ll end up with the quickwit binary inside the target/release/ directory.

By default, quickwit will phone home, but you can disable that using an environment variable.

export DISABLE_QUICKWIT_TELEMETRY=1

Now, let’s create the required configuration. At the time of writing, the official quickwit documentation was out of date, at least unless we’d be using the 0.1.0 release, which was over half a year old. Hence the configuration as well as the commands that I’ll be showing here won’t match the documentation. However, if you’ve compiled quickwit from git master like I did (144074d18e9b40615dacfd6c3908bcecb6b7ea3b) everything should work just fine.

{
  "version": 0,
  "index_id": "wikipedia",
  "index_uri": "file://YOUR_PATH_HERE/wikipedia",
  "search_settings": {
    "default_search_fields": ["title", "section_texts"]
  },
  "doc_mapping": {
    "store_source": true,
    "field_mappings": [
      {
        "name": "title",
        "type": "text",
        "record": "position"
      },
      {
        "name": "section_titles",
        "type": "array<text>"
      },
      {
        "name": "section_texts",
        "type": "array<text>"
      },
      {
        "name": "interlinks",
        "type": "array<text>",
        "indexed": false,
        "stored": false
      }
    ]
  }
}

Note: You have to manually replace YOUR_PATH_HERE in index_uri with the actual path to your metastore folder!
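
If you’d rather not edit the path by hand, the configuration can also be generated. This is only a sketch: it reproduces the field mappings from the configuration above and fills in index_uri from a directory you pass in. The function name build_index_config is made up, and the version 0 format is assumed to match the quickwit commit mentioned earlier.

```python
import json
from pathlib import Path


def build_index_config(base_dir):
    """Return the index config from above with index_uri pointing into base_dir."""
    base = Path(base_dir).resolve()
    return {
        "version": 0,
        "index_id": "wikipedia",
        "index_uri": f"file://{base.as_posix()}/wikipedia",
        "search_settings": {"default_search_fields": ["title", "section_texts"]},
        "doc_mapping": {
            "store_source": True,
            "field_mappings": [
                {"name": "title", "type": "text", "record": "position"},
                {"name": "section_titles", "type": "array<text>"},
                {"name": "section_texts", "type": "array<text>"},
                {
                    "name": "interlinks",
                    "type": "array<text>",
                    "indexed": False,
                    "stored": False,
                },
            ],
        },
    }


# "." stands for the folder that should hold the metastore.
config = build_index_config(".")
config_json = json.dumps(config, indent=2)
print(config_json)
# To save it next to the index: Path("config.json").write_text(config_json)
```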

Next, let’s cd into the directory that we’ve previously set in the config.json (YOUR_PATH_HERE) and create the index using that exact same configuration (which I’m assuming is located in the same folder).

quickwit index create \
  --metastore-uri file://$(pwd)/wikipedia \
  --index-config-uri $(pwd)/config.json

After that we have to import the actual JSON data into the newly created index. This will take some time, depending on your machine’s performance.

quickwit index ingest \
  --index-id wikipedia \
  --metastore-uri file://$(pwd)/wikipedia \
  --data-dir-path $(pwd)/wikipedia-data \
  --input-path enwiki-latest-pages-articles.json

After around 10 minutes quickwit exited successfully with this output:

Indexed 5792326 documents in 10.42min.
Now, you can query the index with the following command:
quickwit index search --index-id wikipedia --metastore-uri file://$(pwd)/wikipedia --query "my query"

I noticed that the number it reported (5792326) was the same as the one previously reported by the segment_wiki.py script, so I’m optimistically assuming that all data was imported successfully. What surprised me was that, unlike with the dumpster-dive setup that I mentioned before, quickwit’s database didn’t grow the data but instead shrank it even further, down to only 21GB. At this size, having all of Wikipedia’s text articles available offline suddenly isn’t a PITA anymore.

Let’s try querying some data to see if it works.

quickwit index search \
  --index-id wikipedia \
  --metastore-uri file://$(pwd)/wikipedia \
  --query 'title:apollo AND 11' \
  | jq '.hits[].title[]'
"Apollo"
"Apollo 11"
"Apollo 8"
"Apollo program"
"Apollo 13"
"Apollo 7"
"Apollo 9"
"Apollo 1"
"Apollo 10"
"Apollo 12"
"Apollo 14"
"Apollo 15"
"Apollo 16"
"Apollo 17"
"List of Apollo astronauts"
"Apollo, Pennsylvania"
"Apollo 13 (film)"
"Apollo Lunar Module"
"Apollo Guidance Computer"
"Apollo 4"

Looks like quickwit found what we were searching for. But since the article is literally named Apollo 11 we should be able to perform what (according to quickwit’s documentation) seems to be an exact search to get the Apollo 11 article we’re interested in.

quickwit index search \
  --index-id wikipedia \
  --metastore-uri file://$(pwd)/wikipedia \
  --query 'title:"Apollo 11"' \
  | jq '.hits[].title[]'
"Apollo 11"
"Apollo 11 (disambiguation)"
"Apollo 11 in popular culture"
"Apollo 11 missing tapes"
"Apollo 11 goodwill messages"
"British television Apollo 11 coverage"
"Apollo 11 (1996 film)"
"Apollo 11 lunar sample display"
"Apollo 11 Cave"
"Moonshot: The Flight Of Apollo 11"
"Apollo 11 50th Anniversary commemorative coins"
"Apollo 11 anniversaries"
"Apollo 11 (2019 film)"

While it returns more than one match, my tests have shown that it’s safe to simply pick the first result when using exact matching, as it will return the most exact match first.
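
Picking the first hit is easy to script. The sketch below is not the helper I use, just an illustration: it builds the same exact-match command as above and extracts the first title from the JSON output. The output shape (a hits list whose entries carry title as a list) is inferred from the jq filter used above, the sample output is made up, and titles containing double quotes are not handled.

```python
import json


def build_exact_title_search(title, metastore_uri, index_id="wikipedia"):
    """Build the argv list for the exact title search shown above."""
    return [
        "quickwit", "index", "search",
        "--index-id", index_id,
        "--metastore-uri", metastore_uri,
        "--query", f'title:"{title}"',
    ]


def first_title(search_output):
    """Return the title of the first hit, or None if there are no hits."""
    hits = json.loads(search_output).get("hits", [])
    if not hits:
        return None
    return hits[0]["title"][0]


command = build_exact_title_search("Apollo 11", "file:///YOUR_PATH_HERE/wikipedia")

# Made-up stand-in for the CLI output, reduced to the field the jq filter uses.
sample_output = json.dumps(
    {"hits": [{"title": ["Apollo 11"]}, {"title": ["Apollo 11 (disambiguation)"]}]}
)
print(first_title(sample_output))
```

With quickwit installed, subprocess.run(command, capture_output=True, text=True, check=True).stdout would supply the real output in place of the sample.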

Considering that we’re going through a very large set of data, the query speed is top-notch, with around 16004µs (about 16ms) for a title query. Querying the actual content isn’t much slower either at only around 27158µs (about 27ms).

Now, quickwit is designed to be able to run as a standalone service and hence also offers an HTTP endpoint for querying. However, since I don’t need it to be running continuously, because I’m not looking up stuff on Wikipedia all the time, I prefer its CLI for the purpose of finding articles when I need them. Running it as a service might improve performance, though.

In order to simplify things, I wrote a helper function in my .zshrc. You can basically copy-paste it and would only need to adjust the WIKIPEDIA_* exports. However, you have to have jq, fzf, pandoc and glow installed for this to work.
I might extend this tool and eventually make it a standalone script, as soon as it gets too big. Depending on how well this solution performs over time, I might also try to build something similar for Dash docsets.

Published on 2021-12-19 and updated on 2024-03-18 in make and tagged with open-source infrastructure
