Named Entity Recognition with Flair

Yulia Nudelman
4 min read · Aug 22, 2022


Train a custom Named Entity Recognition (NER) model with the Flair NLP framework.

What is Named Entity Recognition

Named entity recognition (NER) is an NLP task that identifies named entities in a text and tags them with their corresponding categories. Named entities are real-world objects such as:

  • Persons
  • Locations
  • Organizations
  • Time

What is Flair

Flair is a powerful Natural Language Processing framework, developed and open-sourced by Zalando Research, that provides a simple, easy-to-use interface.

Flair offers state-of-the-art models for NLP tasks such as sequence tagging (POS, NER), text classification (sentiment analysis), and word sense disambiguation. Sequence tagging, a task at which Flair excels, assigns tags to tokens or other units of text. When the tags are named entities (PERSON, ORGANIZATION, or LOCATION), the task is called Named Entity Recognition (NER).
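To make the idea of sequence tagging concrete, here is a minimal sketch in plain Python, independent of Flair. It assumes the common BIO scheme (B- begins an entity, I- continues it, O marks a token outside any entity). The example sentence, the tag names PER and LOC, and the helper name bio_to_spans are illustrative assumptions.

```python
def bio_to_spans(tokens, tags):
    """Group BIO-tagged tokens into (entity text, entity type) pairs."""
    spans = []
    current = None  # (list of tokens, entity type) of the entity being built

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts: close the previous one, if any.
            if current:
                spans.append(current)
            current = ([token], tag[2:])
        elif tag.startswith("I-") and current and current[1] == tag[2:]:
            # The token continues the entity that is currently open.
            current[0].append(token)
        else:
            # "O" or an inconsistent I- tag: close any open entity.
            if current:
                spans.append(current)
            current = None

    if current:
        spans.append(current)
    return [(" ".join(words), entity_type) for words, entity_type in spans]


tokens = ["George", "Washington", "went", "to", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
entities = bio_to_spans(tokens, tags)
print(entities)
```

Each token gets exactly one tag, and consecutive tagged tokens are merged into one entity.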

Motivation

Most sequence taggers are trained on a large corpus that represents the general use of a language. If we want to solve a domain-specific problem, we need a custom NER tagger trained on a domain-specific corpus.

Furthermore, most state-of-the-art NLP models are built for popular languages like English. If we want a NER model for a language like Hebrew, we must train our own tagger.

This article shows how to train a Hebrew Named Entity Recognizer from scratch with the Flair NLP framework.

Hebrew NER Model with Flair

Prepare Python Environment

Create a virtual environment with the venv tool, which is part of the Python Standard Library:

mkdir flair-ner
cd flair-ner
python -m venv flair-ner
.\flair-ner\Scripts\activate.bat

Install the flair library with pip:

pip install flair

Verify the installation:

pip show flair

Install and run a Jupyter notebook:

pip install notebook
jupyter notebook

Create a new Python 3 notebook and import the flair library:

import flair
print(flair.__version__)

Prepare Dataset

I train my model on two NER-annotated Hebrew corpora.

To load the annotated corpus, I use the Corpus object, which represents a dataset for training a model. A Corpus consists of a list of train sentences, a list of dev sentences, and a list of test sentences, which correspond to the training, validation, and testing splits used during model training.

In the dataset files, the first column is the word itself and the second is its BIO-annotated NER tag. An empty line separates sentences. To read such a dataset, define the column structure as a dictionary and instantiate a ColumnCorpus:

from flair.data import Corpus
from flair.datasets import ColumnCorpus

# column 0 holds the token, column 1 holds its NER tag
columns = {0: 'text', 1: 'ner'}

data_folder = 'data/'

corpus: Corpus = ColumnCorpus(
    data_folder,
    columns,
    train_file='train.txt',
    test_file='test.txt',
    dev_file='dev.txt',
)
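For reference, the two-column layout described above looks like the sample in the sketch below. The code parses it in plain Python only to show the file structure. The sample tokens, the tag names, and the helper read_column_text are assumptions for illustration; ColumnCorpus does the real parsing.

```python
# Two sentences in column format: token, whitespace, BIO tag.
# An empty line marks the end of a sentence.
SAMPLE = """\
George B-PER
Washington I-PER
visited O
Boston B-LOC

He O
left O
"""


def read_column_text(text):
    """Parse column-formatted text into a list of sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Blank line: the current sentence is complete.
            if current:
                sentences.append(current)
                current = []
            continue
        token, tag = line.split()[:2]
        current.append((token, tag))
    if current:
        sentences.append(current)
    return sentences


sentences = read_column_text(SAMPLE)
print(len(sentences))
print(sentences[0])
```

The train.txt, dev.txt, and test.txt files are expected to follow this same layout.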

Load Tag Dictionary

A tag dictionary is a set of all tags in the corpus. Get all possible tags from the corpus:

# the label type to predict; it matches the 'ner' column of the corpus
tag_type = 'ner'

tag_dictionary = corpus.make_label_dictionary(label_type=tag_type)
print(tag_dictionary)
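Conceptually, a tag dictionary maps every distinct tag to an integer index that the model can predict. The sketch below illustrates that idea in plain Python. It is not Flair's implementation, and the sample sentences and the helper build_tag_index are hypothetical.

```python
def build_tag_index(tagged_sentences):
    """Collect the distinct tags in a corpus and assign each one an index."""
    tags = sorted({tag for sentence in tagged_sentences for _, tag in sentence})
    return {tag: index for index, tag in enumerate(tags)}


corpus_sample = [
    [("George", "B-PER"), ("Washington", "I-PER"), ("visited", "O"), ("Boston", "B-LOC")],
    [("He", "O"), ("left", "O")],
]

tag_index = build_tag_index(corpus_sample)
print(tag_index)
```

The dictionary that Flair builds may contain extra entries or a different ordering, so treat this only as a mental model.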

Initialize Transformer

Flair supports various Transformer-based architectures like BERT. I use AlephBERT, a Hebrew language model, for my tagger:

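from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    model='onlplab/alephbert-base',
    layers='-1',               # use only the last transformer layer
    subtoken_pooling='first',  # represent each word by its first subtoken
    fine_tune=True,            # update the transformer weights during training
    use_context=True,          # include surrounding sentences as context
)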
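The subtoken_pooling='first' setting matters because a transformer tokenizer may split one word into several subtokens, while the tagger needs exactly one vector per word. Here is a rough sketch of the idea with toy vectors. The subtoken counts and the helper pool_first are assumptions for illustration, not Flair's internal code.

```python
def pool_first(subtoken_vectors, word_lengths):
    """Return one vector per word: the vector of that word's first subtoken.

    word_lengths[i] is the number of subtokens that word i was split into.
    """
    pooled, position = [], 0
    for length in word_lengths:
        pooled.append(subtoken_vectors[position])
        position += length  # jump to the first subtoken of the next word
    return pooled


# Toy example: 3 words split into 1, 3, and 2 subtokens (6 vectors in total).
subtoken_vectors = [
    [0.1, 0.2],
    [0.3, 0.4], [0.5, 0.6], [0.7, 0.8],
    [0.9, 1.0], [1.1, 1.2],
]
word_vectors = pool_first(subtoken_vectors, [1, 3, 2])
print(word_vectors)
```

Six subtoken vectors go in and three word vectors come out, one per word.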

Initialize Sequence Tagger

A sequence tagger is a tool that receives text as input and returns a list of words with tag names.

Initialize a SequenceTagger object on top of the pre-trained transformer embeddings:

from flair.models import SequenceTagger

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type=tag_type,
)

Train Model

To train the sequence tagger, I initialize a ModelTrainer object with the sequence tagger and the tagged corpus.

from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus)

trainer.train(
    'models/flair_ner',
    learning_rate=0.1,
    mini_batch_size=32,
    max_epochs=200,
)

The training process runs for up to 200 epochs and saves the final model in the models/flair_ner directory as final-model.pt.

When the training process is finished, the final model is evaluated on the test data.
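NER taggers are commonly scored with span-level precision, recall, and F1, where a predicted entity counts as correct only if both its boundaries and its type match the gold annotation. Below is a minimal sketch of that calculation with made-up spans. The helper span_f1 is hypothetical, and the exact figures Flair reports come from its own evaluation code.

```python
def span_f1(gold, predicted):
    """Compute span-level precision, recall, and F1 from two collections of spans."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)  # exact boundary and type matches
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Each span is (start token index, end token index, entity type).
gold = {(0, 2, "PER"), (4, 6, "LOC"), (8, 9, "ORG")}
predicted = {(0, 2, "PER"), (4, 6, "ORG")}  # second span has the wrong type

precision, recall, f1 = span_f1(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Only one of the two predictions matches a gold span exactly, so the right boundaries with the wrong type earn no credit.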

Load and Run Hebrew Flair Model

To load the final-model.pt model into Flair, pass the path to the model file as the first argument to the SequenceTagger.load method:

from flair.models import SequenceTagger

tagger = SequenceTagger.load('models/flair_ner/final-model.pt')

And run:

from flair.data import Sentence

sentence = Sentence('נסעתי מירושלים לתל אביב')
tagger.predict(sentence)
print(sentence)
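Printing the sentence shows the tagged text on one line. To list the recognized entities one by one, a small helper like the one below may be handy. It assumes a recent Flair release in which Sentence.get_spans(label_type) returns spans that expose text and get_label(label_type), whose result has value and score. The helper name list_entities is hypothetical.

```python
def list_entities(sentence, label_type="ner"):
    """Return (text, tag, confidence) for each predicted span of a tagged sentence.

    Assumes the Flair-style interface described above; check it against
    the Flair version you have installed.
    """
    rows = []
    for span in sentence.get_spans(label_type):
        label = span.get_label(label_type)
        rows.append((span.text, label.value, round(label.score, 3)))
    return rows
```

After tagger.predict(sentence), calling list_entities(sentence) should return one tuple per recognized entity.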

Thanks for reading, and I hope this article will help you to use Flair NLP to train your custom models that solve real-world problems.
