In this article, we will build FashionBERT, a pre-trained transformer model, using Hugging Face.
The goal is to train a tokenizer and the transformer model, save the model, and test it.
The dataset is a collection of 87K clothing product descriptions in Hebrew. You can download it from here.
FashionBERT is a RoBERTa transformer model trained from scratch.
FashionBERT will load
fashion.txt as its dataset, train the tokenizer, build the vocabulary files (vocab.json and
merges.txt), and use these files during the pre-training process.
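The tokenizer-training step can be sketched with the Hugging Face tokenizers library. This is a minimal illustration, not the article's exact code: the tiny in-memory corpus and the vocabulary size are stand-ins for the real fashion.txt and its tuned settings.

```python
from pathlib import Path
from tokenizers import ByteLevelBPETokenizer

# Tiny stand-in corpus; the real input is fashion.txt (87K descriptions).
corpus = Path("fashion_sample.txt")
corpus.write_text("long red evening dress\nblue denim jacket\n", encoding="utf-8")

tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=[str(corpus)],
    vocab_size=30_000,  # hypothetical size; tune for the real corpus
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model(".")  # writes vocab.json and merges.txt
```

These saved files are what the pre-training step later loads as the model's vocabulary.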
Hugging Face Transformers is a package that provides pre-trained models for performing NLP tasks.
Find and extract tables from PDF files with Python
Python provides several libraries for PDF table extraction. Libraries like
excalibur-py can easily find and extract well-defined tables. But sometimes all of these powerful libraries fail when you try to extract non-formatted tables.
pdfplumber is a Python library for text and table extraction; it groups bordering cells into tables.
All NLP (Natural Language Processing) tasks need text data for training. One of the largest text data sources is Wikipedia, which offers free copies of all available content in many languages as dump files.
In this article, I will download a Wikipedia dump file, extract its text, and split the text into sentences. Finally, I will write all the data into a single text file, one sentence per line.
To easily extract text from a Wikipedia dump file, I use WikiExtractor.py. I develop on Windows, so I…
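The one-sentence-per-line step can be sketched with a naive regex splitter. This is only an illustration of the idea; real pipelines typically use a proper sentence tokenizer such as NLTK or spaCy.

```python
import re

def split_sentences(text):
    # Naive splitter: break after ., !, or ? followed by whitespace.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

paragraph = "Wikipedia offers free dumps. They cover many languages. Each dump is XML."
for sentence in split_sentences(paragraph):
    print(sentence)  # one sentence per line, ready to append to the output file
```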
How to build a simple search engine dashboard with Docker, Elasticsearch, and Plotly.
In this article, I will build full-text search functionality that allows finding relevant articles by searching for a specific word or phrase across thousands of news articles.
Docker is a platform that packages an application and all its dependencies together in a container.
Docker is like magic ✨ in the box.
Elasticsearch is an awesome search engine for performing a…
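The heart of such a dashboard is the full-text query sent to Elasticsearch. Here is a hedged sketch of a phrase query body for a hypothetical "news" index with a "content" field; with the official Python client this dictionary would be passed to es.search().

```python
import json

# Hypothetical query body: find documents whose "content" field
# contains the exact phrase "supply chain", returning up to 10 hits.
query = {
    "query": {
        "match_phrase": {"content": "supply chain"}
    },
    "size": 10,
}
print(json.dumps(query, indent=2))

# With the official elasticsearch-py client this would be sent as:
#   es = Elasticsearch("http://localhost:9200")
#   response = es.search(index="news", body=query)
```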
In this article, we will classify women’s clothing product descriptions into 13 predefined classes. All descriptions are in Hebrew.
We will use two scikit-learn classifiers, Naive Bayes and Logistic Regression, as multi-class machine learning algorithms to predict the class of a product given its description.
The dataset is a collection of 24K manually labeled women’s clothing product descriptions. I have scraped all the data from popular Israeli online fashion websites, and you can download it from here. We are going to use the raw text directly rather than a preprocessed text dataset.
Let’s take a snapshot of our data:
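The overall approach can be sketched with a scikit-learn pipeline: a TF-IDF vectorizer feeding each classifier. The tiny English corpus below is a stand-in for the 24K Hebrew descriptions, purely for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny stand-in corpus; the real data is 24K labeled Hebrew descriptions.
texts = ["long red evening dress", "blue denim jacket",
         "silk evening gown", "leather biker jacket"]
labels = ["dress", "jacket", "dress", "jacket"]

for clf in (MultinomialNB(), LogisticRegression(max_iter=1000)):
    model = make_pipeline(TfidfVectorizer(), clf)
    model.fit(texts, labels)  # raw text in, class labels out
    print(type(clf).__name__, model.predict(["black cocktail dress"])[0])
```

The same pipeline shape scales to 13 classes; only the label set changes.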