Build a RoBERTa Model from Scratch

Yulia Nudelman
Feb 18, 2021



In this article, we will pre-train a transformer model, FashionBERT, from scratch using the Hugging Face libraries.

Goal

The goal is to train a tokenizer, pre-train the transformer model, save it, and test it.

Data

The dataset is a collection of 87K clothing product descriptions in Hebrew. You can download it from here.

FashionBERT Model

FashionBERT is a RoBERTa transformer model trained from scratch. It loads fashion.txt as the dataset, trains a tokenizer, builds the merges.txt and vocab.json files, and uses these files during the pre-training process.

Install HuggingFace Transformers

Hugging Face Transformers is a package that provides pre-trained models and training utilities for NLP tasks.

To install transformers with pip:

pip install transformers

Train a Tokenizer

The main goal of a tokenizer is to prepare the input for a model by splitting text into tokens and converting (encoding) them to integers.

We will train Hugging Face’s ByteLevelBPETokenizer(), which implements byte-level byte-pair encoding (BBPE). BBPE breaks text into minimal byte-level units and iteratively merges the most frequent pairs. This ensures that the most common words are represented in the vocabulary as a single token, while less common words are broken down into two or more subword tokens. Check the FloydHub Blog for a more in-depth explanation.

The tokenizer’s train() method takes the following parameters:

  • files- the path to the dataset.
  • vocab_size- the vocabulary size.
  • min_frequency- the minimum frequency threshold.
  • special_tokens- the list of special tokens to add to the vocabulary.
from tokenizers import ByteLevelBPETokenizer

# Initialize an untrained byte-level BPE tokenizer
tokenizer = ByteLevelBPETokenizer()

# Train it on the fashion dataset
paths = ["fashion.txt"]
tokenizer.train(files=paths, vocab_size=52_000, min_frequency=2, special_tokens=[
    "<s>",
    "<pad>",
    "</s>",
    "<unk>",
    "<mask>",
])
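
As a quick sanity check, we can encode a sample description with the freshly trained tokenizer and inspect the resulting subword tokens and ids (the Hebrew phrase below is an illustrative example, not taken from the dataset):

# Encode an illustrative phrase ("black evening dress") with the trained tokenizer
encoding = tokenizer.encode("שמלת ערב שחורה")
print(encoding.tokens)  # the subword tokens
print(encoding.ids)     # the corresponding integer ids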

Save the Files

The trained tokenizer is defined by two files:

  • merges.txt- the learned BPE merge rules.
  • vocab.json- the mapping from subword tokens to their integer indices.

To save merges.txt and vocab.json, we will create the FashionBERT directory:

import os

# Create the output directory if it does not exist
token_dir = '/FashionBERT'
if not os.path.exists(token_dir):
    os.makedirs(token_dir)

# Write merges.txt and vocab.json to the directory
tokenizer.save_model(directory=token_dir)
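
If we ever want to reuse the raw tokenizer outside of transformers, we can rebuild it directly from these two files (a minimal sketch, assuming the files were saved to /FashionBERT as above):

from tokenizers import ByteLevelBPETokenizer

# Rebuild the tokenizer from the saved vocabulary and merge rules
reloaded = ByteLevelBPETokenizer(
    "/FashionBERT/vocab.json",
    "/FashionBERT/merges.txt",
)
print(reloaded.get_vocab_size())  # close to the 52,000 requested during training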

Define the Model Configuration

We will pre-train a RoBERTa-base-sized model with 12 encoder layers and 12 attention heads.

RobertaConfig() gets the following parameters:

  • vocab_size- the number of different tokens.
  • max_position_embeddings- the maximum sequence length.
  • num_attention_heads- the number of attention heads for each attention layer in the Transformer encoder.
  • num_hidden_layers- the number of hidden layers in the Transformer encoder.
  • type_vocab_size- the vocabulary size of the token_type_ids.
from transformers import RobertaConfig

config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=12,
    type_vocab_size=1,
)

Load the Tokenizer

To load the trained tokenizer, we will use RobertaTokenizer.from_pretrained(), pointing it at the directory where we saved merges.txt and vocab.json:

from transformers import RobertaTokenizer

# Load the tokenizer from the directory where merges.txt and vocab.json were saved
tokenizer = RobertaTokenizer.from_pretrained('/FashionBERT', max_length=512)
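
To confirm that the tokenizer wraps inputs with the RoBERTa special tokens, we can encode a short phrase and convert the ids back (the phrase is an arbitrary illustration, not from the dataset):

# The tokenizer adds <s> ... </s> around the encoded phrase
encoded = tokenizer("שמלת ערב")
print(encoded["input_ids"])
print(tokenizer.convert_ids_to_tokens(encoded["input_ids"]))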

Initialize the Model

from transformers import RobertaForMaskedLM

# Build a RoBERTa model with a masked-language-modeling head from our configuration
model = RobertaForMaskedLM(config=config).cuda()
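
A quick way to verify the configuration is to count the model’s parameters with the standard num_parameters() helper:

# A base-sized RoBERTa with a 52K vocabulary has on the order of 100M+ parameters
print(model.num_parameters())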

Build the Dataset

We will use the LineByLineTextDataset() class, which treats each line of the file as a separate training example:

from transformers import LineByLineTextDataset

# Read fashion.txt line by line and tokenize each description (up to 128 tokens)
dataset = LineByLineTextDataset(
    tokenizer=tokenizer,
    file_path='fashion.txt',
    block_size=128,
)
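
We can sanity-check the dataset by looking at its size and first example (LineByLineTextDataset stores the tokenized input_ids of each non-empty line):

print(len(dataset))  # number of non-empty lines in fashion.txt
print(dataset[0])    # token ids of the first product description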

Define the Data Collator

A data collator prepares the dataset for the model: it takes samples from the dataset and collates them into batches.

DataCollatorForLanguageModeling() gets the following parameters:

  • tokenizer- the trained tokenizer.
  • mlm- set True for masked language modeling (MLM).
  • mlm_probability- the fraction of tokens to mask during training.
from transformers import DataCollatorForLanguageModeling

data_collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)
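
To see the masking in action, we can collate a couple of examples by hand; this is only a sketch for inspection, since the Trainer calls the collator internally:

# Collate two examples into a padded batch; ~15% of tokens are replaced by <mask>
batch = data_collator([dataset[i] for i in range(2)])
print(batch["input_ids"])  # token ids with some positions masked
print(batch["labels"])     # original ids at masked positions, -100 elsewhere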

Train the Model

With everything prepared, we can now set up the trainer.

TrainingArguments() gets the following parameters:

  • output_dir- the output directory where the model predictions and checkpoints will be written.
  • overwrite_output_dir- set True to overwrite the content of the output directory.
  • num_train_epochs- the number of training epochs to perform.
  • per_device_train_batch_size- The batch size per GPU/TPU core/CPU for training.
  • save_steps- the number of update steps between two checkpoint saves.
  • save_total_limit- the maximum number of checkpoints to keep; older ones are deleted.

Trainer() gets the following parameters:

  • model- the model to train, evaluate or use for predictions.
  • args- the TrainingArguments().
  • data_collator- the DataCollatorForLanguageModeling().
  • train_dataset- the dataset to use for training.
from transformers import Trainer, TrainingArguments

training_args = TrainingArguments(
    output_dir='/FashionBERT',
    overwrite_output_dir=True,
    num_train_epochs=3,
    per_device_train_batch_size=64,
    save_steps=10_000,
    save_total_limit=2,
)

trainer = Trainer(
    model=model,
    args=training_args,
    data_collator=data_collator,
    train_dataset=dataset,
)

trainer.train()
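
Pre-training can take a while. If it is interrupted, recent transformers releases let you resume from the last checkpoint saved in output_dir (the argument below assumes a version that supports it):

# Resume from the most recent checkpoint in /FashionBERT
trainer.train(resume_from_checkpoint=True)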

Save the Model

We will save the model weights and configuration in the FashionBERT directory.

trainer.save_model('/FashionBERT')
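
trainer.save_model() writes the weights and config only. The fill-mask pipeline below also needs the tokenizer files in the same directory; saving the loaded RobertaTokenizer there as well (an extra step, not part of the original walkthrough) ensures its configuration and special tokens are stored alongside the model:

# Write vocab.json, merges.txt and the tokenizer config next to the model weights
tokenizer.save_pretrained('/FashionBERT')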

Test the Model

To test the pre-trained model and tokenizer on a masked language modeling task, we load the fill-mask pipeline with:

  • model- the pre-trained FashionBERT model.
  • tokenizer- the pre-trained tokenizer.
from transformers import pipeline

fill_mask = pipeline(
    'fill-mask',
    model='/FashionBERT',
    tokenizer='/FashionBERT'
)

# "שמלת" is the Hebrew construct form of "dress"; the model fills in a plausible continuation
fill_mask('שמלת<mask>')
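The fill_mask output is a list of candidate completions, each with the filled-in sequence, a score, and the predicted token; the exact predictions depend on your training run, so none are reproduced here. A minimal way to inspect them:

# Each prediction is a dict with 'sequence', 'score', 'token' and 'token_str'
for prediction in fill_mask('שמלת<mask>'):
    print(prediction['token_str'], round(prediction['score'], 3))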
