Named Entity Recognition with Flair
Train a custom Named Entity Recognition (NER) model with the Flair NLP framework.
What is Named Entity Recognition
Named entity recognition (NER) is an NLP task that identifies named entities in a text and tags them with their corresponding categories. Named entities are real-world objects, such as:
- Persons
- Locations
- Organizations
- Time
- …
What is Flair
Flair is a powerful Natural Language Processing framework with a simple, easy-to-use interface, developed and open-sourced by Zalando Research.
Flair provides state-of-the-art models for NLP tasks like sequence tagging (POS, NER), text classification (sentiment analysis), and word sense disambiguation. One of the tasks where Flair excels is sequence tagging, which assigns a tag to each token or unit of text. When the tags are named entities (PERSON, ORGANIZATION, or LOCATION), the task is called Named Entity Recognition (NER).
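To make the idea of sequence tagging concrete, here is a minimal sketch in plain Python (no Flair involved) that groups per-token tags into entities. It assumes the BIO scheme, which the dataset later in this article also uses; the function name bio_to_spans and the English sample sentence are made up for illustration.

```python
from typing import List, Tuple


def bio_to_spans(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    spans = []
    current_tokens: List[str] = []
    current_type = None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A B- tag always opens a new entity, closing any open one.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_tokens and tag[2:] == current_type:
            # An I- tag of the same type continues the open entity.
            current_tokens.append(token)
        else:
            # "O" (or an inconsistent I- tag) is treated as outside any entity.
            if current_tokens:
                spans.append((" ".join(current_tokens), current_type))
            current_tokens, current_type = [], None
    if current_tokens:
        spans.append((" ".join(current_tokens), current_type))
    return spans


tokens = ["George", "Washington", "went", "to", "New", "York"]
tags = ["B-PER", "I-PER", "O", "O", "B-LOC", "I-LOC"]
print(bio_to_spans(tokens, tags))
# [('George Washington', 'PER'), ('New York', 'LOC')]
```

A tagger predicts one tag per token; the grouping step is what turns those tags into the entities a reader actually cares about.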
Motivation
Most sequence taggers are trained on a large corpus representing the general use of a language. If we want to solve a domain-specific problem, we need a custom NER tagger trained on a domain-specific corpus.
Furthermore, most state-of-the-art NLP models are built for popular languages like English. If we want a NER model for a language like Hebrew, we must train our own tagger.
This article shows how to train a Hebrew named entity recognizer with the Flair NLP framework.
Hebrew NER Model with Flair
Prepare Python Environment
Create a virtual environment with venv, a tool that is part of the Python Standard Library:
mkdir flair-ner
cd flair-ner
python -m venv flair-ner
.\flair-ner\Scripts\activate.bat
Install the flair library with pip:
pip install flair
Verify the installation:
pip show flair
Install and run a Jupyter notebook:
pip install notebook
jupyter notebook
Create a new Python 3 notebook and load the flair library:
import flair
print(flair.__version__)
Prepare Dataset
I train my model on two NER-annotated Hebrew corpora.
To load the annotated corpus, I use the Corpus object, which represents a dataset for training a model. The Corpus object consists of a list of train sentences, a list of dev sentences, and a list of test sentences, which correspond to the training, validation, and testing splits used during model training. The first column in the dataset is the word itself, and the second holds the BIO-annotated NER tag. An empty line separates sentences. To read such a dataset, define the column structure as a dictionary and instantiate a ColumnCorpus:
from flair.data import Corpus
from flair.datasets import ColumnCorpus

# Column 0 holds the token, column 1 holds its NER tag
columns = {0: 'text', 1: 'ner'}
data_folder = 'data/'
corpus: Corpus = ColumnCorpus(data_folder, columns,
                              train_file='train.txt',
                              test_file='test.txt',
                              dev_file='dev.txt')
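For readers who have not seen the column format before, here is a minimal plain-Python sketch of how such a file is laid out and how it breaks into sentences. It only illustrates the format and is not how Flair reads the files internally; the sample text, the whitespace separator, and the helper name read_column_data are assumptions made for the example.

```python
from typing import List, Tuple

# Made-up sample in the two-column format: token, then BIO tag.
# A blank line marks the end of a sentence.
SAMPLE = """\
George B-PER
Washington I-PER
went O
to O
New B-LOC
York I-LOC

He O
stayed O
"""


def read_column_data(text: str) -> List[List[Tuple[str, str]]]:
    """Split column-formatted text into sentences of (token, tag) pairs."""
    sentences, current = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            # Blank line: close the sentence collected so far.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split()
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


sentences = read_column_data(SAMPLE)
print(len(sentences))  # 2
print(sentences[1])    # [('He', 'O'), ('stayed', 'O')]
```

Checking a few sentences this way before training is a cheap way to catch a missing blank line or a swapped column.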
Load Tag Dictionary
A tag dictionary is a set of all tags in the corpus. Get all possible tags from the corpus:
tag_type = 'ner'  # the label column name defined in the columns mapping
tag_dictionary = corpus.make_label_dictionary(tag_type)
print(tag_dictionary)
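Conceptually, building a tag dictionary amounts to collecting the distinct labels that occur in the data. The sketch below shows that idea in plain Python on a made-up, already-parsed sample; it is not Flair's implementation, and the entries Flair prints may look different depending on the version.

```python
from collections import Counter

# Made-up sentences as lists of (token, tag) pairs.
sentences = [
    [("George", "B-PER"), ("Washington", "I-PER"), ("went", "O")],
    [("New", "B-LOC"), ("York", "I-LOC")],
]

# Count every tag, then list the distinct tags in sorted order.
tag_counts = Counter(tag for sentence in sentences for _, tag in sentence)
tag_set = sorted(tag_counts)

print(tag_set)          # ['B-LOC', 'B-PER', 'I-LOC', 'I-PER', 'O']
print(tag_counts["O"])  # 1
```

The counts are worth a glance as well: a label that appears only a handful of times is unlikely to be learned well.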
Initialize Transformer
Flair supports various Transformer-based architectures like BERT. I am going to use AlephBERT, a Hebrew language model, for my tagger:
from flair.embeddings import TransformerWordEmbeddings

embeddings = TransformerWordEmbeddings(
    model='onlplab/alephbert-base',
    layers='-1',
    subtoken_pooling='first',
    fine_tune=True,
    use_context=True)
Initialize Sequence Tagger
A sequence tagger is a tool that receives text as input and returns a list of words with tag names.
Initialize a SequenceTagger object on top of the pre-trained embeddings:
from flair.models import SequenceTagger

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type=tag_type)
Train Model
To train the sequence tagger, I initialize a ModelTrainer object with the sequence tagger and the tagged corpus.
from flair.trainers import ModelTrainer

trainer = ModelTrainer(tagger, corpus)

trainer.train(
    'models/flair_ner',
    learning_rate=0.1,
    mini_batch_size=32,
    max_epochs=200)
The training process runs for up to 200 epochs and saves the trained model in the models/flair_ner directory as final-model.pt.
When the training process is finished, the final model is evaluated on the test data.
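NER evaluation is usually reported as precision, recall, and F1 over entity spans. The following plain-Python sketch shows how those numbers can be computed, assuming exact-match scoring where a predicted span counts only if its boundaries and type both match a gold span. The function name span_prf and the sample spans are made up, and this is not the evaluation code Flair runs.

```python
from typing import Set, Tuple

# A span is (start_token, end_token_exclusive, entity_type).
Span = Tuple[int, int, str]


def span_prf(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over entity spans."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


gold = {(0, 2, "PER"), (4, 6, "LOC")}
predicted = {(0, 2, "PER"), (4, 5, "LOC")}  # second span has a wrong boundary
print(span_prf(gold, predicted))
# (0.5, 0.5, 0.5)
```

Note that the partially correct LOC span earns no credit under exact matching, which is why boundary errors hurt NER scores as much as wrong labels do.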
Load and Run Hebrew Flair Model
To load the final-model.pt model into Flair, pass the path to the model file as the first argument to the SequenceTagger.load method:
from flair.models import SequenceTagger

tagger = SequenceTagger.load('models/flair_ner/final-model.pt')
And run:
from flair.data import Sentence

sentence = Sentence('נסעתי מירושלים לתל אביב')
tagger.predict(sentence)
print(sentence)
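Printing the sentence shows the tags inline. To work with the predictions programmatically, the entity spans can be pulled out of the sentence. The helper below is a minimal sketch: entities_as_tuples is a made-up name, and it assumes a Flair version in which Sentence.get_spans(label_type) returns spans that expose text and get_label(label_type) with value and score attributes. Check this against your installed version, since these accessors have changed between releases.

```python
def entities_as_tuples(sentence, label_type="ner"):
    """Return (text, tag, score) for every predicted span in a sentence.

    `sentence` is expected to be a flair.data.Sentence that has already
    been passed to tagger.predict(); the attribute names used here are
    an assumption about the installed Flair version.
    """
    results = []
    for span in sentence.get_spans(label_type):
        label = span.get_label(label_type)
        results.append((span.text, label.value, round(label.score, 3)))
    return results
```

After tagger.predict(sentence), calling entities_as_tuples(sentence) would give a list that is easy to filter by tag or by confidence score.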
Thanks for reading, and I hope this article will help you to use Flair NLP to train your custom models that solve real-world problems.