Build a Corpus for NLP Models from a Wikipedia dump file


Most NLP (Natural Language Processing) models need text data for training. One of the largest sources of text data is Wikipedia, which offers free copies of all its available content in many languages as dump files.

In this article, I download a Wikipedia dump file, extract its text, and split that text into sentences. Finally, I write all the data to a single text file with one sentence per line.

Download a Wikipedia dump file
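
As a minimal sketch of this step, the dump can be fetched with Python's standard library. It assumes the usual dumps.wikimedia.org URL layout and reuses the dump name that appears in the extraction command later in this article. The helper names dump_url and download_dump are hypothetical, and older dump dates may no longer be hosted.

```python
import shutil
import urllib.request


def dump_url(wiki: str, date: str) -> str:
    """Build the URL of a pages-articles-multistream dump.

    Assumes the usual dumps.wikimedia.org layout (an assumption, not
    something stated in this article).
    """
    name = f"{wiki}-{date}-pages-articles-multistream.xml.bz2"
    return f"https://dumps.wikimedia.org/{wiki}/{date}/{name}"


def download_dump(wiki: str, date: str, dest: str = "") -> str:
    """Stream the dump to disk without loading it into memory."""
    url = dump_url(wiki, date)
    dest = dest or url.rsplit("/", 1)[-1]
    with urllib.request.urlopen(url) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

Calling download_dump("hewiki", "20201220") would save the file under its original name in the current directory, provided that dump date is still available.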

Extract text from Wikipedia dump file

In an Ubuntu terminal, I download WikiExtractor with the git clone command:

sudo git clone

The next step is to change the current directory to wikiextractor and run the extractor with the --json flag:

cd wikiextractor
python3 wikiextractor/ --json hewiki-20201220-pages-articles-multistream.xml.bz2

When it finishes, WikiExtractor has created a text folder with JSON files in it. Each file contains several documents formatted as JSON objects, one per line, with the following structure:

{"id": "", "revid": "", "url": "", "title": "", "text": "..."}

I am going to extract only the text field from all these JSON files.
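
To make the structure concrete, here is a small sketch that parses one such line and keeps only the text field. The sample record and the helper name extract_text are made up for illustration.

```python
import json


def extract_text(json_line: str) -> str:
    """Parse one line of WikiExtractor's JSON output and return its text."""
    return json.loads(json_line)["text"]


# A made-up record with the same fields as the structure shown above.
sample = (
    '{"id": "1", "revid": "2", "url": "https://example.org", '
    '"title": "Example", "text": "First sentence. Second sentence."}'
)
print(extract_text(sample))
```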

Build a Corpus for Tokenizer

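import glob
import json

from blingfire import text_to_sentences

# Assumed paths: adjust them to where WikiExtractor wrote its output.
wiki_dump_folder_in = 'text/*/wiki_*'
wiki_dump_file_out = 'he_wiki.txt'

with open(wiki_dump_file_out, 'w', encoding='utf-8') as out_f:
    for filename in glob.glob(wiki_dump_folder_in):
        # Each line of a wiki_* file is one JSON-encoded article.
        with open(filename, 'r', encoding='utf-8') as in_f:
            articles = [json.loads(line) for line in in_f]
        for article in articles:
            # text_to_sentences returns the sentences joined by '\n'.
            sentences = text_to_sentences(article['text'])
            out_f.write(sentences + '\n')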

This code goes through each JSON file (wiki_*), extracts each article's text, splits it into sentences, and writes every sentence on its own line. At the end of the process, we get he_wiki.txt (~86.4GB).
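
Because the output file is large, any sanity check should stream it rather than load it into memory. The sketch below shows one way to do that. The helper names preview_corpus and count_sentences are hypothetical, and it assumes the one-sentence-per-line layout described above.

```python
from itertools import islice


def preview_corpus(path: str, n: int = 5) -> list:
    """Return the first n lines of the corpus without reading the whole file."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


def count_sentences(path: str) -> int:
    """Count non-empty lines by streaming the file line by line."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for line in f if line.strip())
```

For example, preview_corpus('he_wiki.txt', 3) would return the first three lines of the corpus as a list of strings.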

That’s it. Now you can use the he_wiki.txt file to train your own tokenizer.

