
In this article, we will build and pre-train a transformer model, FashionBERT, using the Hugging Face libraries.

Goal

The goal is to train a tokenizer and the transformer model, save the model and test it.

Data

The dataset is a collection of 87K clothing product descriptions in Hebrew. You can download it from here.

FashionBERT Model

FashionBERT is a RoBERTa transformer model trained from scratch. FashionBERT loads fashion.txt as its dataset, trains the tokenizer, builds the merges.txt and vocab.json files, and uses these files during the pre-training process.

Hugging Face Transformers is a package that provides pre-trained models for performing NLP tasks.

To install transformers with pip:

pip…
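
As a rough sketch of this pipeline, assuming the transformers and tokenizers packages are installed: the file name fashion.txt matches the article, but the vocabulary size and model dimensions below are illustrative choices, not necessarily the article's exact settings.

import os
from tokenizers import ByteLevelBPETokenizer
from transformers import RobertaConfig, RobertaForMaskedLM

# Train a byte-level BPE tokenizer on the fashion corpus
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["fashion.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)

# save_model() writes the merges.txt and vocab.json files
# used later during pre-training
os.makedirs("fashionbert", exist_ok=True)
tokenizer.save_model("fashionbert")

# Illustrative RoBERTa configuration for training from scratch
config = RobertaConfig(
    vocab_size=52_000,
    max_position_embeddings=514,
    num_attention_heads=12,
    num_hidden_layers=6,
    type_vocab_size=1,
)
model = RobertaForMaskedLM(config=config)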

Find and extract tables from PDF files with the pdfplumber library.


Python provides several libraries for PDF table extraction. Libraries like camelot, tabula-py, and excalibur-py can easily find and extract well-defined tables. But sometimes all of these powerful libraries fail when you try to extract non-formatted tables.

pdfplumber is a Python library for text and table extraction.

pdfplumber finds:

  • explicitly defined lines
  • intersections of those lines
  • cells that use these intersections

and then groups bordering cells into tables.
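
A minimal sketch of table extraction with pdfplumber; the file name report.pdf and the page index are placeholders, not a file from the article.

import pdfplumber

# Open the PDF and extract the table detected on the first page
with pdfplumber.open("report.pdf") as pdf:
    page = pdf.pages[0]
    table = page.extract_table()

# extract_table() returns rows as lists of cell strings, or None
# if no table was found on the page
if table:
    for row in table:
        print(row)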



All NLP (Natural Language Processing) tasks need text data for training. One of the largest sources of text data is Wikipedia, which offers free copies of all available content in many languages as dump files.

In this article, I will download a Wikipedia dump file, extract its text data, and split it into sentences. Finally, I will write all the data into a single text file, one sentence per line.

Download a Wikipedia dump file

I will download the Hebrew language Wikipedia dump file (hewiki-latest-pages-articles-multistream.xml.bz2) and build a Hebrew corpus from Wikipedia articles.
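
For reference, a minimal download sketch in Python; the URL follows Wikimedia's standard layout for the latest Hebrew dump on dumps.wikimedia.org.

import urllib.request

# Latest Hebrew Wikipedia articles dump (multistream, bzip2-compressed)
url = ("https://dumps.wikimedia.org/hewiki/latest/"
       "hewiki-latest-pages-articles-multistream.xml.bz2")
urllib.request.urlretrieve(url, "hewiki-latest-pages-articles-multistream.xml.bz2")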

Extract text from Wikipedia dump file

To easily extract text from the Wikipedia dump file, I use WikiExtractor.py. I develop on Windows, so I…
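
Once WikiExtractor.py has written its output (assumed here to be a directory named extracted, with the tool's usual wiki_* file naming), a rough sketch of collecting the text into a one-sentence-per-line corpus could look like this; the regex-based sentence split is a naive placeholder, not necessarily the article's method.

import glob
import re

# Assumed output directory from WikiExtractor.py and a target file name
with open("hewiki_corpus.txt", "w", encoding="utf-8") as out:
    for path in glob.glob("extracted/**/wiki_*", recursive=True):
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                # Skip WikiExtractor's <doc> markup and empty lines
                if not line or line.startswith("<"):
                    continue
                # Naive split on sentence-ending punctuation,
                # one sentence per output line
                for sentence in re.split(r"(?<=[.!?])\s+", line):
                    if sentence:
                        out.write(sentence + "\n")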


How to build a simple search engine dashboard with Docker, Elasticsearch, and Plotly.

In this article, I will build full-text search functionality that lets you find relevant articles by searching for a specific word or phrase across thousands of news articles.

Prerequisites

  • Docker
  • Elasticsearch
  • Plotly

Docker is a platform that packages an application and all its dependencies together in a container.

Docker is like magic ✨ in the box.

Elasticsearch is a distributed, RESTful search and analytics engine, one of the open-source products from Elastic. It is a schema-free, document-oriented data store.

Elasticsearch is an awesome search engine for performing a…
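
As a minimal sketch, assuming Elasticsearch is reachable on localhost:9200 (for example, from the official Docker image) and using an illustrative index name, articles, with the 8.x Python client:

from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

# Index a news article document
es.index(index="articles", document={
    "title": "Example headline",
    "body": "Full article text goes here.",
})

# Make the new document visible to search immediately
es.indices.refresh(index="articles")

# Full-text search for a word or phrase in the body field
resp = es.search(index="articles", query={"match": {"body": "article"}})
for hit in resp["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["title"])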


Photo by Burgess Milner on Unsplash

In this article, we will classify women’s clothing product descriptions into 13 predefined classes. All descriptions are in Hebrew.

We will use two Scikit-Learn classifiers, Naive Bayes and Logistic Regression, as multi-class machine learning algorithms.

Goal

To predict the class of the product given its description.

Data

The dataset is a collection of 24K manually labeled women’s clothing product descriptions. I scraped all the data from popular Israeli online fashion websites, and you can download it from here. We are going to use the raw text directly rather than a preprocessed text dataset.

Let’s take a snapshot of our data:

import…
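
A rough sketch of the two classifiers on TF-IDF features; the file and column names below are placeholders, not the article's actual ones.

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Placeholder file and column names for the labeled descriptions
df = pd.read_csv("products.csv")
X_train, X_test, y_train, y_test = train_test_split(
    df["description"], df["label"], test_size=0.2, random_state=42
)

# Fit and evaluate both classifiers on the same TF-IDF features
for name, clf in [("Naive Bayes", MultinomialNB()),
                  ("Logistic Regression", LogisticRegression(max_iter=1000))]:
    pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", clf)])
    pipe.fit(X_train, y_train)
    print(name, accuracy_score(y_test, pipe.predict(X_test)))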

Yulia Nudelman

Data Scientist
