How to Build a Named Entity Training Dataset for an NER Task (Part 1)
Named entity recognition (NER) is a natural language processing task that performs two actions:
- Identifying named entities in a text
- Classifying them into categories
CoNLL: Named entities are phrases that contain the names of persons, organizations, and locations.
Most Named Entity Recognition tasks require large amounts of labeled data for training. We can find quite a bit of Named Entity labeled data in English, French, and Russian. However, certain languages, such as Hebrew, have no labeled data available.
In this article, I explain how to automatically classify Wikipedia article titles into entity classes using the DBpedia API. This is one of the steps in building Named Entity training data described by Joel Nothman, James R. Curran, and Tara Murphy.
Wikipedia is a free-content, multilingual online encyclopedia written collaboratively by volunteers. Because of its high volume and variety, Wikipedia is a good source of data.
At the time of writing, the Hebrew Wikipedia contains 301,203 articles.
DBpedia is a dataset whose entities correspond to Wikipedia article titles. With DBpedia, we can get entity-type information about Wikipedia article titles. We are going to extract information about each entity:
Named Entity Classification Pipeline
With keras.utils, we download the file hewiki-latest-all-titles-in-ns0.gz from a URL:
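A minimal sketch of the download step, assuming the dump lives at the standard Wikimedia dumps location (the URL below is an assumption, not given in the original):

```python
from tensorflow.keras.utils import get_file

# Assumed dump location: Wikimedia publishes per-wiki title dumps
# under dumps.wikimedia.org/<wiki>/latest/.
URL = "https://dumps.wikimedia.org/hewiki/latest/hewiki-latest-all-titles-in-ns0.gz"

# get_file caches the download (by default under ~/.keras/datasets/)
# and returns the local file path.
path = get_file("hewiki-latest-all-titles-in-ns0.gz", origin=URL)
print(path)
```

get_file skips the download if the file is already cached, which is convenient when re-running the pipeline.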
With pandas, we read the downloaded file into a DataFrame:
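One way to read the dump, sketched here with a tiny fabricated sample so the example is self-contained (the real dump is plain text with one article title per line, gzip-compressed):

```python
import gzip
import pandas as pd

# Stand-in for the real dump file: fabricate a small gzip sample
# with one title per line, as in hewiki-latest-all-titles-in-ns0.gz.
with gzip.open("titles-sample.gz", "wt", encoding="utf-8") as f:
    f.write("ירושלים\nתל_אביב\nחיפה\n")

# read_csv detects the gzip compression from the .gz extension;
# the file has no header row, so we name the single column ourselves.
titles = pd.read_csv("titles-sample.gz", header=None, names=["title"])
print(len(titles))  # 3
```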
With the MediaWiki API, we can acquire meta-information about a wiki. All Wikimedia wikis have endpoints in the format http://example.org/w/api.php. To work with the Hebrew Wikipedia, we use https://www.wikidata.org/w/api.php?action=wbgetentities&sites=hewiki. The action parameter in the query string tells the API which action to perform. To get the English title for a Hebrew article title:
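A sketch of this lookup, assuming we use Wikidata's wbgetentities action with the sitelinks property to map a hewiki title to its enwiki counterpart (the helper name is illustrative):

```python
import requests

def hebrew_to_english_title(he_title):
    """Map a Hebrew Wikipedia title to its English one via Wikidata sitelinks."""
    params = {
        "action": "wbgetentities",
        "sites": "hewiki",        # interpret `titles` as hewiki page titles
        "titles": he_title,
        "props": "sitelinks",     # we only need the cross-wiki links
        "format": "json",
    }
    data = requests.get("https://www.wikidata.org/w/api.php", params=params).json()
    for entity in data.get("entities", {}).values():
        sitelink = entity.get("sitelinks", {}).get("enwiki")
        if sitelink:
            return sitelink["title"]
    return None  # no English article linked to this title

print(hebrew_to_english_title("ירושלים"))
```

Titles with no linked English article come back without an enwiki sitelink, so the function returns None and the pipeline can skip them.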
With the DBpedia API, we can extract entity classes. We send the enwiki (English) title as input and get back a list of entity classes. If one of the classes includes Location, we save it:
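One possible way to query DBpedia for an entity's classes is its JSON data endpoint, sketched below; the rdf:type values of the resource serve as the class list (the helper names are illustrative, and checking for Place alongside Location is an assumption, since Place is its equivalent in the DBpedia ontology):

```python
import requests

RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"

def entity_classes(en_title):
    """Fetch the RDF classes of the DBpedia resource for an English title."""
    name = en_title.replace(" ", "_")
    data = requests.get(f"https://dbpedia.org/data/{name}.json").json()
    resource = f"http://dbpedia.org/resource/{name}"
    types = data.get(resource, {}).get(RDF_TYPE, [])
    # Keep only the class name at the end of each type URI.
    return [t["value"].rsplit("/", 1)[-1] for t in types]

def is_location(en_title):
    """True if any class mentions Location (or Place, its DBpedia equivalent)."""
    return any("Location" in c or "Place" in c for c in entity_classes(en_title))

print(is_location("Jerusalem"))
```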
Now we are ready to iterate through the title lines and identify each article title's entity type. At the end, we save all titles and entity types to a CSV file:
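A sketch of the final step; the `classified` list below stands in for the (title, entity_type) pairs produced by the lookup and classification steps, and the output filename is illustrative:

```python
import csv

# Placeholder results standing in for the real pipeline output.
classified = [
    ("ירושלים", "Location"),
    ("תל_אביב", "Location"),
]

# Write one row per title with its entity type.
with open("hebrew_ner_titles.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["title", "entity_type"])
    writer.writerows(classified)
```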