How to Build a Named Entity Training Dataset for an NER Task (Part 1)

Building a Named Entity Classification Pipeline

Yulia Nudelman

Aug 26, 2021

Introduction

Named entity recognition (NER) is a natural language processing task that performs two actions:

Identifying named entities in a text
Classifying them into categories

CoNLL: Named entities are phrases that contain the names of persons, organizations and locations

Most Named Entity Recognition tasks require large amounts of labeled data for training. We can find quite a bit of Named Entity labeled data in English, French, and Russian. However, certain languages like Hebrew have no labeled data available.

In this article, I explain how to automatically classify Wikipedia article titles into entity classes using the DBpedia API. This is one of the steps in building Named Entity training data described by Joel Nothman, James R. Curran, and Tara Murphy.

Wikipedia

Wikipedia is a free-content, multilingual online encyclopedia written collaboratively by volunteers. Because of its high volume and variety, Wikipedia is a good source of data.

At the time of writing, the Hebrew Wikipedia contains 301,203 articles.

DBpedia

DBpedia is a dataset that consists of Wikipedia article titles as entities. With DBpedia, we can get entity-type information about Wikipedia article titles. For each entity, we are going to check whether it belongs to one of three classes:

Person
Organization
Location

Named Entity Classification Pipeline

Download file

With keras.utils, we download the file hewiki-latest-all-titles-in-ns0.gz from a URL:
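A minimal sketch of the download step. To avoid the Keras dependency, it swaps in the standard library's urllib.request.urlretrieve rather than keras.utils. The dump URL and the helper name download_file are assumptions for illustration.

```python
import os
import urllib.request

# Assumed location of the Hebrew Wikipedia title dump on the Wikimedia dumps server.
TITLES_URL = "https://dumps.wikimedia.org/hewiki/latest/hewiki-latest-all-titles-in-ns0.gz"


def download_file(url: str, dest_dir: str = ".") -> str:
    """Download url into dest_dir (hypothetical helper) and return the local path.

    The download is skipped when the file already exists, which mimics
    the caching behaviour one usually wants for a large dump.
    """
    os.makedirs(dest_dir, exist_ok=True)
    local_path = os.path.join(dest_dir, os.path.basename(url))
    if not os.path.exists(local_path):
        urllib.request.urlretrieve(url, local_path)
    return local_path


# Example call (performs a real network request):
# titles_path = download_file(TITLES_URL, "data")
```

The file stays gzip-compressed on disk; the reading step can decompress it on the fly.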

Read file

With pandas, we read hewiki-latest-all-titles-in-ns0.gz into a DataFrame:
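A sketch of how the read might look, assuming the dump is a gzip-compressed text file with one title per line and a single header line. The column name title and the function name read_titles are my own choices for illustration.

```python
import csv

import pandas as pd


def read_titles(path: str) -> pd.DataFrame:
    """Read a gzip-compressed list of titles (one per line) into a DataFrame."""
    return pd.read_csv(
        path,
        compression="gzip",
        sep="\t",                # titles contain no tabs, so each line is one field
        header=0,                # assumed: the first line is a header row
        names=["title"],         # replace the header with our own column name
        dtype=str,               # keep numeric-looking titles as strings
        quoting=csv.QUOTE_NONE,  # titles may contain literal quote characters
        keep_default_na=False,   # keep titles such as "NA" instead of turning them into NaN
        encoding="utf-8",
    )


# Example call:
# titles = read_titles("hewiki-latest-all-titles-in-ns0.gz")
```

Disabling quoting and NaN detection matters here because article titles are free text: a title can start with a quotation mark or be the literal string "NA".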

Query Wikipedia

With the MediaWiki API, we can acquire meta-information about a wiki. All Wikimedia wikis have endpoints in the format http://example.org/w/api.php. To work with Hebrew Wikipedia titles, we use the Wikidata endpoint https://www.wikidata.org/w/api.php?action=wbgetentities&sites=hewiki . The action parameter in the query string of the URL tells the API which action to perform, and sites=hewiki marks the titles we send as Hebrew Wikipedia pages. To get the English title for a Hebrew article title:

Wikipedia API
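A minimal sketch of this lookup, using only the standard library. It assumes we ask wbgetentities for the sitelinks of the entity behind a Hebrew title and read the enwiki entry from the JSON reply. The function names and the User-Agent string are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

WIKIDATA_API = "https://www.wikidata.org/w/api.php"


def build_query_url(hebrew_title: str) -> str:
    """Build the wbgetentities URL for one Hebrew Wikipedia title."""
    params = {
        "action": "wbgetentities",
        "sites": "hewiki",       # the title belongs to the Hebrew Wikipedia
        "titles": hebrew_title,
        "props": "sitelinks",    # we only need links to other wikis
        "sitefilter": "enwiki",  # and of those, only the English one
        "format": "json",
    }
    return WIKIDATA_API + "?" + urllib.parse.urlencode(params)


def extract_english_title(response: dict) -> Optional[str]:
    """Return the enwiki title from a wbgetentities reply, or None if absent."""
    for entity in response.get("entities", {}).values():
        enwiki = entity.get("sitelinks", {}).get("enwiki")
        if enwiki:
            return enwiki["title"]
    return None


def get_english_title(hebrew_title: str) -> Optional[str]:
    """Query Wikidata (real network request) and return the English title."""
    request = urllib.request.Request(
        build_query_url(hebrew_title),
        headers={"User-Agent": "ner-dataset-sketch/0.1"},  # hypothetical client name
    )
    with urllib.request.urlopen(request, timeout=30) as reply:
        return extract_english_title(json.load(reply))
```

Keeping URL building and reply parsing apart from the network call makes both easy to check without hitting the API.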

Query DBpedia

With the DBpedia API, we can extract entity classes. We send the enwiki (English) title as input and get a list of entity classes as output.

DBpedia API
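DBpedia can be queried in several ways. The sketch below assumes the public SPARQL endpoint at https://dbpedia.org/sparql and asks for the rdf:type values of the resource behind an English title. The function names are hypothetical, and the title-to-URI mapping is deliberately naive: it only replaces spaces with underscores.

```python
import json
import urllib.parse
import urllib.request
from typing import List

DBPEDIA_SPARQL = "https://dbpedia.org/sparql"  # assumed endpoint


def build_types_query(english_title: str) -> str:
    """Build a SPARQL query listing the classes of one DBpedia resource."""
    # Simplification: titles with characters that are illegal inside <...> need extra escaping.
    resource = "http://dbpedia.org/resource/" + english_title.replace(" ", "_")
    return f"SELECT ?type WHERE {{ <{resource}> a ?type }}"


def extract_types(response: dict) -> List[str]:
    """Pull the class URIs out of a SPARQL JSON results document."""
    bindings = response.get("results", {}).get("bindings", [])
    return [binding["type"]["value"] for binding in bindings]


def get_entity_classes(english_title: str) -> List[str]:
    """Query DBpedia (real network request) and return the class URIs."""
    params = urllib.parse.urlencode({
        "query": build_types_query(english_title),
        "format": "application/sparql-results+json",
    })
    with urllib.request.urlopen(f"{DBPEDIA_SPARQL}?{params}", timeout=30) as reply:
        return extract_types(json.load(reply))
```

The result is a list of class URIs, which is what the filtering step for Person, Organization, and Location needs.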

If one of the classes includes Person, Organization, or Location, we will save it:
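One way to sketch this check, assuming the classes arrive as URIs such as http://dbpedia.org/ontology/Person and that we compare only the last part of each URI. Treating the British spelling Organisation as Organization is my assumption, and the names ENTITY_CLASSES and classify are hypothetical.

```python
from typing import Iterable, Optional

# Local class names we care about, mapped to the label we store.
# "Organisation" is included on the assumption that some ontologies use that spelling.
ENTITY_CLASSES = {
    "Person": "Person",
    "Organization": "Organization",
    "Organisation": "Organization",
    "Location": "Location",
}


def classify(class_uris: Iterable[str]) -> Optional[str]:
    """Return the first matching entity label among the class URIs, else None."""
    for uri in class_uris:
        # Keep only the part after the last "/" or "#", e.g. ".../ontology/Person" -> "Person".
        local_name = uri.rstrip("/").rsplit("/", 1)[-1].rsplit("#", 1)[-1]
        if local_name in ENTITY_CLASSES:
            return ENTITY_CLASSES[local_name]
    return None
```

An exact match on the local name is stricter than a substring test, which avoids false hits on class names that merely contain the word Person.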

Now we are ready to iterate through the titles and identify the entity type of each article title. At the end, we save all titles and entity types in a CSV file:
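A sketch of the final loop. To keep it self-contained, the per-title lookup is passed in as a function (lookup_label, a hypothetical name) that takes a Hebrew title and returns an entity type or None. The CSV column names are also my own choice.

```python
import csv
from typing import Callable, Iterable, Optional


def label_titles(
    titles: Iterable[str],
    lookup_label: Callable[[str], Optional[str]],
    out_path: str,
) -> int:
    """Write every title with its entity type to a CSV file.

    Returns the number of titles that received an entity type.
    Titles without a type are written with an empty second column.
    """
    labeled = 0
    with open(out_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["title", "entity_type"])
        for title in titles:
            try:
                label = lookup_label(title)
            except Exception:
                label = None  # a failed lookup should not stop the whole run
            if label is not None:
                labeled += 1
            writer.writerow([title, label or ""])
    return labeled


# Example call with a stand-in lookup:
# label_titles(["תל אביב"], lambda title: "Location", "entities.csv")
```

In a real run, lookup_label would chain the English-title lookup, the DBpedia query, and the class check. With hundreds of thousands of titles, it is worth adding a pause between requests so the public APIs are not overloaded.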

Enjoy!