Photo by Samule Sun on Unsplash

In this article, we will build and pre-train a transformer model, FashionBERT, using the Hugging Face libraries.


The goal is to train a tokenizer and the transformer model, save the model and test it.


The dataset is a collection of 87K clothing product descriptions in Hebrew. You can download it from here.

FashionBERT Model

FashionBERT is a RoBERTa transformer model trained from scratch. FashionBERT loads fashion.txt as its dataset, trains the tokenizer, builds the merges.txt and vocab.json files, and uses these files during the pre-training process.
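The tokenizer-training step can be sketched with the Hugging Face tokenizers library. This is a minimal sketch, not the article's exact code: the tiny stand-in corpus (written only if fashion.txt is missing), the vocabulary size, and the output directory name are all illustrative assumptions.

```python
import os
from tokenizers import ByteLevelBPETokenizer

# Tiny stand-in corpus so the sketch runs even without the real fashion.txt.
if not os.path.exists("fashion.txt"):
    with open("fashion.txt", "w", encoding="utf-8") as f:
        f.write("שמלת ערב שחורה\nחולצת טריקו לבנה\n")

# Train a byte-level BPE tokenizer on the corpus file.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["fashion.txt"],
    vocab_size=1000,        # illustrative; a full run would use a larger value
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# save_model writes vocab.json and merges.txt into the target directory,
# which the RoBERTa pre-training step then consumes.
os.makedirs("FashionBERT", exist_ok=True)
tokenizer.save_model("FashionBERT")
```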

Hugging Face Transformers is a package that provides pre-trained models for performing NLP tasks.

To install transformers with pip:
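The standard PyPI install is:

```shell
pip install transformers
```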


Find and extract tables from a PDF file with the pdfplumber library.

Photo by Mika Baumeister on Unsplash

Python provides several libraries for PDF table extraction. Libraries like camelot, tabula-py, and excalibur-py can easily find and extract well-defined tables. But all of these powerful libraries can fail when you try to extract poorly formatted tables.

pdfplumber is a Python library for text and table extraction.

pdfplumber finds:

  • explicitly defined lines
  • intersections of those lines
  • cells that use these intersections

And groups bordering cells into tables.

Photo by Luke Chesser on Unsplash

All NLP (Natural Language Processing) tasks need text data for training. One of the largest text data sources is Wikipedia, which offers free copies of all available content in many languages as dump files.

In this article, I will download a Wikipedia dump file, extract its text data, and split it into sentences. Finally, I will write all the data into a single text file, one sentence per line.

Download a Wikipedia dump file

I will download the Hebrew language Wikipedia dump file (hewiki-latest-pages-articles-multistream.xml.bz2) and build a Hebrew corpus from Wikipedia articles.
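Assuming the standard Wikimedia dumps mirror layout, the file can be fetched directly:

```shell
wget https://dumps.wikimedia.org/hewiki/latest/hewiki-latest-pages-articles-multistream.xml.bz2
```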

Extract text from Wikipedia dump file

For easy extraction of text from the Wikipedia dump file, I use … I develop on Windows, so I…
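Whatever tool does the extraction, the final one-sentence-per-line step can be sketched in plain Python. This is a naive regex splitter under stated assumptions: the sample text and output filename are illustrative, and a real pipeline would likely use a proper sentence tokenizer, since Hebrew offers no capitalization cues.

```python
import re

def to_sentence_per_line(text):
    # Naive splitter: break on ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return [s for s in sentences if s]

# Illustrative sample text standing in for extracted Wikipedia articles.
text = "זהו משפט ראשון. זהו משפט שני! וזה השלישי?"
with open("corpus.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(to_sentence_per_line(text)))
```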

How to build a simple search engine dashboard with Docker, Elasticsearch, and Plotly.

In this article, I will build full-text search functionality that finds relevant articles by searching for a specific word or phrase across thousands of news articles.


  • Docker
  • Elasticsearch
  • Plotly

Docker is a platform that packages an application and all its dependencies together in a container.

Docker is like magic ✨ in the box.

Elasticsearch is a distributed, RESTful search and analytics engine, one of the open-source products from Elastic. It is a schema-free, document-oriented data store.

Elasticsearch is an awesome search engine for performing a…
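A minimal full-text search against Elasticsearch with the official Python client might look like the sketch below. The node URL and index name are assumptions rather than values from the article, and the client calls are commented out because they require a running cluster (and their exact keyword arguments vary between client versions).

```python
# A simple full-text query: match documents whose "content" field
# contains the given words.
query = {"query": {"match": {"content": "climate change"}}}

# The calls below assume a local Elasticsearch node and an "articles" index.
# from elasticsearch import Elasticsearch
# es = Elasticsearch("http://localhost:9200")
# es.index(index="articles", document={"title": "...", "content": "..."})
# resp = es.search(index="articles", body=query)
# for hit in resp["hits"]["hits"]:
#     print(hit["_score"], hit["_source"]["title"])
```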

Photo by Burgess Milner on Unsplash

In this article, we will classify women’s clothing product descriptions into 13 predefined classes. All descriptions are in Hebrew.

We will use two Scikit-Learn classifiers, Naive Bayes and Logistic Regression, as multi-class machine learning algorithms.


To predict the class of the product given its description.


The dataset is a collection of 24K manually labeled women’s clothing product descriptions. I scraped all the data from popular Israeli online fashion websites, and you can download it from here. We are going to use the raw text directly rather than a preprocessed text dataset.
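The two-classifier setup on raw text can be sketched as a TF-IDF pipeline. This is a minimal sketch, not the article's code: the four-example corpus and the two class names stand in for the real 24K descriptions and 13 classes.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in data; the real dataset has 24K descriptions in 13 classes.
texts = [
    "שמלת ערב שחורה",
    "שמלת מקסי פרחונית",
    "ג'ינס סקיני כחול",
    "ג'ינס בגזרה גבוהה",
]
labels = ["dress", "dress", "jeans", "jeans"]

for clf in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    # TF-IDF vectorization works on the raw text directly,
    # so no separate preprocessing step is needed.
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(texts, labels)
    print(type(clf).__name__, model.predict(["שמלת קיץ"]))
```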

Let’s take a snapshot of our data:


Yulia Nudelman

Data Scientist
