Build a Corpus for NLP Models from a Wikipedia Dump File
All NLP (Natural Language Processing) tasks need text data for training. One of the largest text data sources is Wikipedia, which offers free copies of all its content in many languages as dump files.
In this article, I will download a Wikipedia dump file, extract the text from it, and split that text into sentences. Finally, I will write everything into a single text file, one sentence per line.
Download a Wikipedia dump file
I will download the Hebrew language Wikipedia dump file (hewiki-latest-pages-articles-multistream.xml.bz2) from https://dumps.wikimedia.org and build a Hebrew corpus from the Wikipedia articles.
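If you prefer to script the download instead of using a browser or wget, here is a minimal Python sketch; the exact URL follows the standard dumps.wikimedia.org layout, and the file name and size are assumptions about the current Hebrew dump, so adjust them for your language and dump date.
import urllib.request
# Assumed location of the official Wikipedia dumps; change the wiki code
# ("hewiki") and file name for a different language or dump date.
dump_url = ('https://dumps.wikimedia.org/hewiki/latest/'
            'hewiki-latest-pages-articles-multistream.xml.bz2')
# Downloads the compressed dump (several hundred megabytes) into the current directory.
urllib.request.urlretrieve(dump_url, 'hewiki-latest-pages-articles-multistream.xml.bz2')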
Extract text from Wikipedia dump file
To easily extract text from the Wikipedia dump file, I use WikiExtractor.py. I develop on Windows, so I need to install the Ubuntu terminal from the Microsoft Store (Windows has poor support for StringIO in its Python implementation).
In the Ubuntu terminal, I download WikiExtractor with the git clone command:
sudo git clone https://github.com/attardi/wikiextractor.git
The next step is to change the current directory to wikiextractor and run WikiExtractor.py with the --json flag:
cd wikiextractor
python3 wikiextractor/WikiExtractor.py --json hewiki-20201220-pages-articles-multistream.xml.bz2
When the run completes, wikiextractor has created a text folder with JSON files in it. Each file contains several documents formatted as JSON objects, one per line, with the following structure:
{"id": "", "revid": "", "url": "", "title": "", "text": "..."}
I am going to extract only the text field from all these JSON files.
Build a Corpus for Tokenizer
To build a corpus that can easily be used with a tokenizer, I will use the BlingFire library to split the extracted articles into sentences and write them to a single text file, one sentence per line:
import glob
import json

from blingfire import text_to_sentences

wiki_dump_file_out = 'he_wiki.txt'
wiki_dump_folder_in = 'wikiextractor/text/**/*'

with open(wiki_dump_file_out, 'w', encoding='utf-8') as out_f:
    for filename in glob.glob(wiki_dump_folder_in):
        filename = filename.replace("\\", "/")
        # Each extracted file is a JSON-lines file: one article per line.
        articles = []
        with open(filename, 'r', encoding='utf-8') as in_f:
            for line in in_f:
                articles.append(json.loads(line))
        for article in articles:
            # text_to_sentences returns the article's sentences separated by newlines.
            sentences = text_to_sentences(article['text'])
            out_f.write(sentences + '\n')
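Note that BlingFire's text_to_sentences returns a single string with the detected sentences separated by newlines, which is why writing sentences + '\n' already yields one sentence per line. A quick sanity check (using an English sample for readability):
from blingfire import text_to_sentences

sample = "This is the first sentence. And this is the second one."
print(text_to_sentences(sample))
# Prints the two sentences on separate lines:
# This is the first sentence.
# And this is the second one.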
This code goes through each extracted JSON file (wiki_**), extracts each article's text, splits it into sentences, and writes every sentence on its own line. At the end of the process, we get he_wiki.txt of about 86.4GB.
That's it. Now you can use the he_wiki.txt file to train your own tokenizer.
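For example, here is a minimal sketch of training a WordPiece tokenizer on the corpus with the Hugging Face tokenizers library; the library choice, vocabulary size, and output directory are my own assumptions and not part of the original setup.
import os

from tokenizers import BertWordPieceTokenizer

# Train a WordPiece vocabulary on the one-sentence-per-line corpus.
tokenizer = BertWordPieceTokenizer()
tokenizer.train(
    files=['he_wiki.txt'],
    vocab_size=30000,   # assumed vocabulary size
    min_frequency=2,
)

# Save the trained vocabulary (vocab.txt) into a local directory.
os.makedirs('he_tokenizer', exist_ok=True)
tokenizer.save_model('he_tokenizer')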