Build a Corpus for NLP Models from a Wikipedia Dump File

Yulia Nudelman
Dec 31, 2020

All NLP (Natural Language Processing) tasks need text data for training. One of the largest sources of text data is Wikipedia, which offers free copies of all of its content in many languages as dump files.

In this article, I will download a Wikipedia dump file, extract its text, and split the text into sentences. Finally, I will write all the data to a single text file, one sentence per line.

Download a Wikipedia dump file

I will download the Hebrew language Wikipedia dump file (hewiki-latest-pages-articles-multistream.xml.bz2) and build a Hebrew corpus from Wikipedia articles.

Extract text from the Wikipedia dump file

To easily extract text from the Wikipedia dump file, I use WikiExtractor.py. I develop on Windows, so I need to install the Ubuntu terminal from the Microsoft Store (Windows has poor support for StringIO in the Python implementation).

In the Ubuntu terminal, I download WikiExtractor with the git clone command:

sudo git clone https://github.com/attardi/wikiextractor.git

The next step is to change the current directory to wikiextractor and run WikiExtractor.py with the --json flag:

cd wikiextractor
python3 wikiextractor/WikiExtractor.py --json hewiki-20201220-pages-articles-multistream.xml.bz2

After the run completes, wikiextractor creates a text folder with JSON files in it. Each file contains several documents, one JSON object per line, with the following structure:

{"id": "", "revid": "", "url": "", "title": "", "text": "..."}

I am going to extract only the text field from all of these JSON files.

Build a corpus for a tokenizer

To build a corpus that can easily be used with a tokenizer, I use the BlingFire library and write one sentence per line to a single text file:

import glob
import json

from blingfire import text_to_sentences

wiki_dump_file_out = 'he_wiki.txt'
wiki_dump_folder_in = 'wikiextractor/text/**/*'

with open(wiki_dump_file_out, 'w', encoding='utf-8') as out_f:
    for filename in glob.glob(wiki_dump_folder_in):
        filename = filename.replace("\\", "/")
        # Each line of a wikiextractor output file is one JSON document
        with open(filename, 'r', encoding='utf-8') as in_f:
            articles = [json.loads(line) for line in in_f]
        for article in articles:
            # text_to_sentences returns the sentences joined by newlines
            sentences = text_to_sentences(article['text'])
            out_f.write(sentences + '\n')

This code goes through each JSON file (wiki_**), extracts the article's text, splits it into sentences, and writes each sentence on its own line. At the end of the process, we get he_wiki.txt (~86.4GB).
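Since the corpus stores one sentence per line, loading it back for tokenizer training is just a matter of reading lines. A small sketch using an invented sample file (sample_corpus.txt and its contents are made up for illustration):

```python
# Write a tiny sample corpus in the same one-sentence-per-line layout
sample = ['First sentence.', 'Second sentence.', 'Third sentence.']
with open('sample_corpus.txt', 'w', encoding='utf-8') as f:
    f.write('\n'.join(sample) + '\n')

# Read it back: one sentence per line, skipping any blank lines
with open('sample_corpus.txt', encoding='utf-8') as f:
    sentences = [line.strip() for line in f if line.strip()]

print(len(sentences))  # → 3
```

The same loop works unchanged on he_wiki.txt; for a file that large you would typically stream the lines into the tokenizer instead of collecting them in a list.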

That’s it. Now you can use the he_wiki.txt file to train your own tokenizer.
