Build a RoBERTa Model from Scratch
In this article, we will build and pre-train a transformer model, FashionBERT, from scratch using the Hugging Face Transformers library.
Goal
The goal is to train a tokenizer and the transformer model, save the model and test it.
Data
The dataset is a collection of 87K clothing product descriptions in Hebrew. You can download it from here.
FashionBERT Model
FashionBERT is a RoBERTa transformer model trained from scratch. FashionBERT loads fashion.txt as its dataset, trains the tokenizer, builds the merges.txt and vocab.json files, and uses these files during the pre-training process.
Install HuggingFace Transformers
Hugging Face Transformers is a package that provides pre-trained models for performing NLP tasks.
To install transformers with pip:
pip install transformers
Train a Tokenizer
The main goal of a tokenizer is to prepare the input for a model by splitting text into tokens and converting (encoding) them to integers.
We will train Hugging Face’s ByteLevelBPETokenizer(). Byte-level byte pair encoding (BBPE) breaks words into minimal byte-level components and merges the most frequently occurring ones. This ensures that the most common words are represented in the vocabulary as a single token, while less common words are broken down into two or more subword tokens. Check the FloydHub Blog for a more in-depth explanation.
ByteLevelBPETokenizer() is trained with the following parameters:
files - the path to the dataset.
vocab_size - the vocabulary size.
min_frequency - the minimum frequency threshold.
special_tokens - the list of special tokens.
from tokenizers import ByteLevelBPETokenizer

tokenizer = ByteLevelBPETokenizer()

paths = ["fashion.txt"]

tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
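As a quick sanity check (not part of the original steps), you can encode a sample string with the freshly trained tokenizer and inspect the resulting subword tokens; the text below is just a made-up Hebrew product description:

# Sanity check: the trained tokenizer splits text into byte-level subword tokens.
encoding = tokenizer.encode("שמלת ערב שחורה")  # made-up example ("a black evening dress")
print(encoding.tokens)  # the subword tokens
print(encoding.ids)     # the corresponding integer ids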
Save the Files
The tokenizer generates two files during training:
merges.txt - the merged tokenized sub-strings.
vocab.json - the indices of the tokenized sub-strings.
To save merges.txt and vocab.json, we will create the FashionBERT directory:
import os

token_dir = '/FashionBERT'
if not os.path.exists(token_dir):
    os.makedirs(token_dir)

tokenizer.save_model(directory=token_dir)
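As a quick check (not in the original walkthrough), the directory should now contain the two tokenizer files:

print(os.listdir(token_dir))  # expected to list 'vocab.json' and 'merges.txt'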
Define the Configuration of the Model
We will pre-train a RoBERTa-base model with 12 encoder layers and 12 attention heads.
RobertaConfig() gets the following parameters:
vocab_size - the number of different tokens.
max_position_embeddings - the maximum sequence length.
num_attention_heads - the number of attention heads for each attention layer in the Transformer encoder.
num_hidden_layers - the number of hidden layers in the Transformer encoder.
type_vocab_size - the vocabulary size of the token_type_ids.
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=1,
)
Load the Tokenizer
To load the trained tokenizer, we will use RobertaTokenizer.from_pretrained():
from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('/FashionBERT', max_length=512)
Initialize the Model
from transformers import RobertaForMaskedLM
model = RobertaForMaskedLM(config=config).cuda()
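Note that .cuda() assumes a GPU is available. As an optional check (not part of the original steps), you can print the model size; the exact count depends on the configuration above:

print(model.num_parameters())  # roughly 125M trainable parameters for this 12-layer configuration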
Build the Dataset
We will use LineByLineTextDataset() for a dataset in which every example is on its own line:
from transformers import LineByLineTextDataset

dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path='fashion.txt',
    block_size=128,
)
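Building the dataset tokenizes every line of fashion.txt up front, so it can take a while. A quick way to confirm it loaded (not in the original article) is to check its length:

print(len(dataset))  # number of encoded product descriptions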
Define the Data Collator
A data collator prepares the dataset for the model: it takes samples from the dataset and collates them into batches.
DataCollatorForLanguageModeling() gets the following parameters:
tokenizer - the trained tokenizer.
mlm - set True for Masked Language Modeling (MLM).
mlm_probability - the fraction of tokens to mask during training.
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
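To see what the collator produces, here is a minimal sketch (not part of the original article) that collates two samples from the dataset into a masked batch:

# input_ids contains the dynamically masked tokens; labels holds the original
# ids at the masked positions and -100 everywhere else.
batch = data_collator([dataset[0], dataset[1]])
print(batch["input_ids"].shape)
print(batch["labels"][0])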
Train the Model
After all these preparations, we are ready to configure and run the trainer.
TrainingArguments() gets the following parameters:
output_dir - the output directory where the model predictions and checkpoints will be written.
overwrite_output_dir - set True to overwrite the content of the output directory.
num_train_epochs - the number of training epochs to perform.
per_device_train_batch_size - the batch size per GPU/TPU core/CPU for training.
save_steps - the number of update steps between two checkpoint saves.
save_total_limit - the maximum number of checkpoints to keep.
Trainer() gets the following parameters:
model - the model to train, evaluate, or use for predictions.
args - the TrainingArguments().
data_collator - the DataCollatorForLanguageModeling().
train_dataset - the dataset to use for training.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='/FashionBERT',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
Save the Model
We will save the model and its configuration in the FashionBERT directory:
trainer.save_model('/FashionBERT')
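The tokenizer files were already written to /FashionBERT in an earlier step; as an extra, optional step (my assumption, not part of the original article), you can also save the loaded tokenizer alongside the model so the directory carries its full tokenizer configuration:

tokenizer.save_pretrained('/FashionBERT')  # also writes tokenizer_config.json and the special tokens map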
Test the Model
To test the pre-trained model and tokenizer on a language modeling task, we load a fill-mask pipeline with:
model - the pre-trained FashionBERT model.
tokenizer - the pre-trained tokenizer.
from transformers import pipeline

fill_mask = pipeline(
    'fill-mask',
    model='/FashionBERT',
    tokenizer='/FashionBERT'
)

fill_mask('שמלת<mask>')
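fill_mask() returns a list of candidate completions, each with a filled-in sequence and a score ('שמלת' is Hebrew for 'dress of'). A minimal sketch, assuming the variables defined above, to print the predictions:

for prediction in fill_mask('שמלת<mask>'):
    print(prediction['sequence'], prediction['score'])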