Build a RoBERTa Model from Scratch

Yulia Nudelman
Feb 18, 2021



In this article, we will pre-train a transformer model, FashionBERT, from scratch using the Hugging Face libraries.

Goal

The goal is to train a tokenizer, pre-train the transformer model, save it, and test it.

Data

The dataset is a collection of 87K clothing product descriptions in Hebrew. You can download it from here.

FashionBERT Model

FashionBERT is a RoBERTa transformer model trained from scratch. It loads fashion.txt as the dataset, trains a tokenizer, builds the merges.txt and vocab.json files, and uses these files during the pre-training process.

Install HuggingFace Transformers

Hugging Face Transformers is a package that provides pre-trained models and training utilities for NLP tasks.

To install transformers with pip:

pip install transformers

Train a Tokenizer

The main goal of a tokenizer is to prepare the input for a model by splitting text into tokens and converting (encoding) them to integers.

We will train Hugging Face’s ByteLevelBPETokenizer(), which implements byte-level byte-pair encoding (BBPE). BBPE breaks text into minimal byte-level units and iteratively merges the most frequent pairs. This ensures that the most common words are represented in the vocabulary as a single token, while less common words are broken down into two or more subword tokens. Check the FloydHub Blog for a more in-depth explanation.

The tokenizer’s train() method takes the following parameters:

  • files- the path to the dataset.
  • vocab_size- the vocabulary size.
  • min_frequency- the minimum frequency threshold.
  • special_tokens- the list of special tokens to add to the vocabulary.
from tokenizers import ByteLevelBPETokenizer

# Initialize an untrained byte-level BPE tokenizer
tokenizer = ByteLevelBPETokenizer()

# Train it on the fashion dataset
paths = ["fashion.txt"]
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
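
As a quick sanity check, we can encode a sample description with the freshly trained tokenizer and inspect the resulting subword tokens and ids (the Hebrew phrase below is an illustrative example, not taken from the dataset):

# Encode an illustrative phrase ("black evening dress") with the trained tokenizer
encoding = tokenizer.encode("שמלת ערב שחורה")
print(encoding.tokens)  # the subword tokens
print(encoding.ids)     # the corresponding integer ids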

Save the Files

The trained tokenizer is defined by two files:

  • merges.txt- the learned BPE merge rules.
  • vocab.json- the mapping from subword tokens to their integer indices.

To save merges.txt and vocab.json, we will create the FashionBERT directory:

import os

# Create the output directory if it does not exist
token_dir = '/FashionBERT'
if not os.path.exists(token_dir):
    os.makedirs(token_dir)

# Write merges.txt and vocab.json to the directory
tokenizer.save_model(directory=token_dir)
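
If we ever want to reuse the raw tokenizer outside of transformers, we can rebuild it directly from these two files (a minimal sketch, assuming the files were saved to /FashionBERT as above):

from tokenizers import ByteLevelBPETokenizer

# Rebuild the tokenizer from the saved vocabulary and merge rules
reloaded = ByteLevelBPETokenizer(
    "/FashionBERT/vocab.json",
    "/FashionBERT/merges.txt",
)
print(reloaded.get_vocab_size())  # close to the 52,000 requested during training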

Define the Model Configuration

We will pre-train a RoBERTa-base-sized model with 12 encoder layers and 12 attention heads.

RobertaConfig() gets the following parameters:

  • vocab_size- the number of different tokens.
  • max_position_embeddings- the maximum sequence length.
  • num_attention_heads- the number of attention heads for each attention layer in the Transformer encoder.
  • num_hidden_layers- the number of hidden layers in the Transformer encoder.
  • type_vocab_size- the vocabulary size of the token_type_ids.
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=1,
)

Load the Tokenizer

To load the trained tokenizer, we will use RobertaTokenizer.from_pretrained(), pointing it at the directory where we saved merges.txt and vocab.json:

from transformers import RobertaTokenizer

# Load the tokenizer from the directory where merges.txt and vocab.json were saved
tokenizer = RobertaTokenizer.from_pretrained('/FashionBERT', max_length=512)
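
To confirm that the tokenizer wraps inputs with the RoBERTa special tokens, we can encode a short phrase and convert the ids back (the phrase is an arbitrary illustration, not from the dataset):

# The tokenizer adds <s> ... </s> around the encoded phrase
encoded = tokenizer("שמלת ערב")
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))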

Initialize the Model

from transformers import RobertaForMaskedLM

# Build a RoBERTa model with a masked-language-modeling head from our configuration
model = RobertaForMaskedLM(config=config).cuda()
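
A quick way to verify the configuration is to count the model’s parameters with the standard num_parameters() helper:

# A base-sized RoBERTa with a 52K vocabulary has on the order of 100M+ parameters
print(model.num_parameters())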

Build the Dataset

We will use the LineByLineTextDataset() class, which treats each line of the file as a separate training example:

from transformers import LineByLineTextDataset

# Read fashion.txt line by line and tokenize each description (up to 128 tokens)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path='fashion.txt',
    block_size=128,
)
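
We can sanity-check the dataset by looking at its size and first example (LineByLineTextDataset stores the tokenized input_ids of each non-empty line):

print(len(dataset))  # number of non-empty lines in fashion.txt
print(dataset[0])    # token ids of the first product description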

Define the Data Collator

A data collator prepares the dataset for the model: it takes samples from the dataset and collates them into batches.

DataCollatorForLanguageModeling() gets the following parameters:

  • tokenizer- the trained tokenizer.
  • mlm- set True for masked language modeling (MLM).
  • mlm_probability- the fraction of tokens to mask during training.
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
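
To see the masking in action, we can collate a couple of examples by hand; this is only a sketch for inspection, since the Trainer calls the collator internally:

# Collate two examples into a padded batch; ~15% of tokens are replaced by <mask>
batch = data_collator([dataset[i] for i in range(2)])
print(batch["input_ids"])  # token ids with some positions masked
print(batch["labels"])     # original ids at masked positions, -100 elsewhere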

Train the Model

With everything prepared, we can now set up the trainer.

TrainingArguments() gets the following parameters:

  • output_dir- the output directory where the model predictions and checkpoints will be written.
  • overwrite_output_dir- set True to overwrite the content of the output directory.
  • num_train_epochs- the number of training epochs to perform.
  • per_device_train_batch_size- The batch size per GPU/TPU core/CPU for training.
  • save_steps- the number of update steps between two checkpoint saves.
  • save_total_limit- the maximum number of checkpoints to keep; older ones are deleted.

Trainer() gets the following parameters:

  • model- the model to train, evaluate or use for predictions.
  • args- the TrainingArguments().
  • data_collator- the DataCollatorForLanguageModeling().
  • train_dataset- the dataset to use for training.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='/FashionBERT',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
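
Pre-training can take a while. If it is interrupted, recent transformers releases let you resume from the last checkpoint saved in output_dir (the argument below assumes a version that supports it):

# Resume from the most recent checkpoint in /FashionBERT
trainer.train(resume_from_checkpoint=True)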

Save the Model

We will save the model weights and configuration in the FashionBERT directory.

trainer.save_model('/FashionBERT')
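
trainer.save_model() writes the weights and config only. The fill-mask pipeline below also needs the tokenizer files in the same directory; saving the loaded RobertaTokenizer there as well (an extra step, not part of the original walkthrough) ensures its configuration and special tokens are stored alongside the model:

# Write vocab.json, merges.txt and the tokenizer config next to the model weights
tokenizer.save_pretrained('/FashionBERT')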

Test the Model

To test the pre-trained model and tokenizer on a masked language modeling task, we load the fill-mask pipeline with:

  • model- the pre-trained FashionBERT model.
  • tokenizer- the pre-trained tokenizer.
from transformers import pipeline

fill_mask = pipeline(
    'fill-mask',
    model='/FashionBERT',
    tokenizer='/FashionBERT'
)

# "שמלת" is the Hebrew construct form of "dress"; the model fills in a plausible continuation
fill_mask('שמלת<mask>')
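The fill_mask output is a list of candidate completions, each with the filled-in sequence, a score, and the predicted token; the exact predictions depend on your training run, so none are reproduced here. A minimal way to inspect them:

# Each prediction is a dict with 'sequence', 'score', 'token' and 'token_str'
for prediction in fill_mask('שמלת<mask>'):
    print(prediction['token_str'], round(prediction['score'], 3))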
