Multi-Class Hebrew Text Classification with Scikit-Learn


In this article, we will classify women’s clothing product descriptions into 13 predefined classes. All descriptions are in Hebrew.

We will use two Scikit-Learn classifiers that support multi-class problems: Multinomial Naive Bayes and Logistic Regression.

Goal

To predict the class of the product given its description.

Data

The dataset is a collection of 24K manually labeled women’s clothing product descriptions. I scraped all the data from popular Israeli online fashion websites, and you can download it from here. We are going to use the raw text directly rather than a preprocessed text dataset.

Load Data

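Let’s take a snapshot of our data:

import pandas as pd

# Load the scraped data; the first CSV column holds the row index
df = pd.read_csv("fashion_data.csv", index_col=[0], header=[0])
# Drop rows with missing values and exact duplicates, then renumber the rows
df = df.dropna()
df = df.drop_duplicates()
df = df.reset_index(drop=True)
df.info()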
Figure 1
df.sample(10)
Table 1

Explore Data

Before we begin training machine learning models, we need to check the class distribution and the number of descriptions in each class:

df_group = df.groupby("category")
df_group = df_group.agg({"description": "nunique"})
df_group = df_group.reset_index()
df_group.head(13)
Table 2
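
If you want a quick number for how unbalanced the classes are, here is a minimal sketch. It runs on a tiny made-up DataFrame (the rows and the English class names are assumptions for illustration, only the column names match our data) and uses value_counts instead of the groupby above, so it counts rows rather than unique descriptions.

```python
import pandas as pd

# Toy stand-in for the real data: the rows are made up for illustration
toy = pd.DataFrame({
    "category": ["dress", "dress", "dress", "pants", "pants", "shirt"],
    "description": ["d1", "d2", "d3", "p1", "p2", "s1"],
})

# Number of rows per class, largest class first
counts = toy["category"].value_counts()

# Share of each class in the whole dataset
shares = toy["category"].value_counts(normalize=True)

# Ratio between the largest and the smallest class
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(2))
print("Imbalance ratio:", imbalance_ratio)
```

A ratio close to 1 means the classes are roughly balanced. A large ratio is a hint that accuracy alone may hide poor results on the small classes.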

Barplot

A barplot is one of the most common types of plot. It shows the relationship between a numerical variable and a categorical variable.

import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 100
df_group.plot(x='category', y='description', kind='bar', legend=False, grid=True, figsize=(8, 5))
plt.title("Number of items per class")
plt.ylabel('Number of items', fontsize=12)
plt.xlabel('Class', fontsize=12)
Figure 1

Word Cloud

A word cloud is a visual representation of text data. It displays the most frequent words in the text and draws each word in a size proportional to how often it appears.

from wordcloud import WordCloud
from bidi.algorithm import get_display
import matplotlib as mpl
import matplotlib.pyplot as plt

mpl.rcParams['figure.dpi'] = 200

def generate_word_cloud(text):
    # font_path must point to a font file that contains Hebrew glyphs
    wordcloud = WordCloud(width=600, height=600,
                          background_color='white',
                          max_words=200,
                          min_font_size=10,
                          font_path='davidbd.ttf').generate(text)
    return wordcloud

# Reorder each right-to-left description so it is drawn correctly
cleaned_text = list()
for i, r in df.iterrows():
    text = r['description']
    if isinstance(text, str):
        bidi_text = get_display(text)
        cleaned_text.append(bidi_text)

# Join with a space so words from adjacent descriptions do not fuse together
wordcloud = generate_word_cloud(' '.join(cleaned_text))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Figure 2
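
To see the numbers behind a picture like this, we can count word frequencies directly. This is only a rough sketch with three made-up descriptions. It splits on whitespace, which is an assumption and not necessarily how the wordcloud package tokenizes text.

```python
from collections import Counter

# Made-up descriptions for illustration
descriptions = ["שמלה פרחונית ארוכה", "שמלה ארוכה", "מכנסי גינס"]

# Join with a space so the last word of one description
# does not fuse with the first word of the next one
words = " ".join(descriptions).split()

# Count how many times each word appears
freq = Counter(words)

# The most frequent words are the ones a word cloud draws largest
print(freq.most_common(3))
```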

Classifier Models

Now we are ready for a machine learning algorithm.

Split Data

First of all, we need to randomly split the data into training and test sets.

We do it with Scikit-Learn train_test_split:

from sklearn.model_selection import train_test_split
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(
    np.array(df['description']), np.array(df['category']),
    test_size=0.33, random_state=42)
X_train.shape, X_test.shape
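
The split above is purely random, so a small class can end up under-represented in one of the sets. A possible variation (not what we use in this article) is to pass stratify to train_test_split, which keeps the class proportions the same in both sets. Here is a small sketch on made-up data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 90 items of class "a" and 10 items of class "b"
X = np.array([f"item {i}" for i in range(100)])
y = np.array(["a"] * 90 + ["b"] * 10)

# stratify=y keeps the 90/10 class ratio in both the train and the test set
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Train share of 'b':", np.mean(y_tr == "b"))
print("Test share of 'b':", np.mean(y_te == "b"))
```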

Extract Features

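To use our product descriptions in the model, we must turn each description into a numeric vector. With the bag-of-words approach, every word in the vocabulary gets an integer index, and each description is represented by the counts of the words it contains.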

We do it with Scikit-Learn CountVectorizer:

from sklearn.feature_extraction.text import CountVectorizer

# Keep only the 1500 most frequent words as features
cv = CountVectorizer(max_features=1500)
cv_train_features = cv.fit_transform(X_train)
cv_test_features = cv.transform(X_test)
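
To make the encoding concrete, here is a minimal sketch on four short descriptions (a made-up mini corpus, and toy_cv is a throwaway name used only here). It prints the learned vocabulary and the count matrix, and shows that a word the vectorizer never saw during fit is simply ignored at transform time.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Made-up mini corpus for illustration
docs = ["שמלה פרחונית ארוכה", "מכנסי גינס", "חולצה מכופתרת", "שמלה ארוכה"]

toy_cv = CountVectorizer()
toy_counts = toy_cv.fit_transform(docs)

# vocabulary_ maps each word to its column index in the matrix
vocab = sorted(toy_cv.vocabulary_, key=toy_cv.vocabulary_.get)
print(vocab)

# One row per description, one column per vocabulary word
print(toy_counts.toarray())

# A word that was not in the fitted vocabulary produces an all-zero row
unseen = toy_cv.transform(["חצאית"])
print(unseen.toarray())
```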

Multinomial Naive Bayes

Multinomial Naive Bayes is a classification algorithm based on Bayes’ theorem. It estimates the probability that a given feature vector belongs to each class, under the "naive" assumption that the words are independent of each other given the class. For a given text, the algorithm calculates each class’s probability and then outputs the class with the highest one.

We do it with Scikit-Learn MultinomialNB:

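from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score

# alpha=1 applies Laplace (add-one) smoothing to the word counts
mnb = MultinomialNB(alpha=1)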
mnb.fit(cv_train_features, y_train)
mnb_test_score = mnb.score(cv_test_features, y_test)
mnb_cv_scores = cross_val_score(mnb, cv_train_features, y_train, cv=5)
mnb_cv_mean_score = np.mean(mnb_cv_scores)
print('Cross Validation Accuracy (5-fold):', mnb_cv_scores)
print('Mean of Cross Validation Accuracy:', mnb_cv_mean_score)
print('Test Accuracy:', mnb_test_score)
Figure 3
cv_pred_features = cv.transform([
    'שמלה פרחונית ארוכה',
    'מכנסי גינס',
    'חולצה מכופתרת'
])
mnb.predict(cv_pred_features)
Figure 4
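
To show what happens under the hood, here is a rough sketch that reproduces the Multinomial Naive Bayes calculation by hand with NumPy and compares it with MultinomialNB. The count matrix and the labels are made up, and the variable names are only for this illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Made-up word-count matrix: 4 documents, 3 vocabulary words
X = np.array([[2, 1, 0],
              [3, 0, 0],
              [0, 1, 2],
              [0, 0, 3]])
y = np.array([0, 0, 1, 1])
alpha = 1.0  # Laplace smoothing, same value as in the article

classes = np.unique(y)

# log P(class): share of training documents in each class
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

# log P(word | class): smoothed word counts per class, normalized per class
word_counts = np.array([X[y == c].sum(axis=0) for c in classes])
smoothed = word_counts + alpha
log_likelihood = np.log(smoothed / smoothed.sum(axis=1, keepdims=True))

# Score of a new document = log prior + sum of count * log likelihood
new_doc = np.array([[1, 1, 0]])
scores = new_doc @ log_likelihood.T + log_prior
manual_pred = classes[np.argmax(scores, axis=1)]

# The same prediction with Scikit-Learn
model = MultinomialNB(alpha=alpha).fit(X, y)
sk_pred = model.predict(new_doc)

print("Manual prediction:", manual_pred)
print("Scikit-Learn prediction:", sk_pred)
```

Working in log space turns the product of many small probabilities into a sum, which avoids numerical underflow on long texts.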

Logistic Regression

Logistic regression is a linear classification algorithm that learns the probability of a sample belonging to a certain class. Logistic regression tries to find the optimal decision boundary that best separates the classes.

We do it with Scikit-Learn LogisticRegression:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

lr = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=100, C=1, random_state=42)
lr.fit(cv_train_features, y_train)
lr_test_score = lr.score(cv_test_features, y_test)
lr_cv_scores = cross_val_score(lr, cv_train_features, y_train, cv=5)
lr_cv_mean_score = np.mean(lr_cv_scores)
print('Cross Validation Accuracy (5-fold):', lr_cv_scores)
print('Mean of Cross Validation Accuracy:', lr_cv_mean_score)
print('Test Accuracy:', lr_test_score)
Figure 5
cv_pred_features = cv.transform([
    'שמלה פרחונית ארוכה',
    'מכנסי גינס',
    'חולצה מכופתרת'
])
lr.predict(cv_pred_features)
Figure 6
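
In the code above, the vectorizer is fitted once on the whole training set before cross-validation. One alternative is to wrap the vectorizer and the classifier in a Scikit-Learn pipeline, so that the vocabulary is re-learned inside every fold and raw text can be passed straight to predict. Below is a minimal sketch of that idea. The tiny English dataset is made up, and max_iter=1000 is my own choice here, not a tuned value.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Made-up miniature dataset (English stand-ins for the Hebrew descriptions)
texts = ["long floral dress", "short summer dress", "evening dress", "red dress",
         "skinny jeans", "blue jeans", "ripped jeans", "wide jeans"]
labels = ["dress"] * 4 + ["jeans"] * 4

# The pipeline fits the vectorizer and the classifier together
pipe = make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000))

# Each fold fits its own vocabulary on the training part only
scores = cross_val_score(pipe, texts, labels, cv=2)
print("Cross Validation Accuracy (2-fold):", scores)

# After fitting, raw text goes straight into predict
pipe.fit(texts, labels)
prediction = pipe.predict(["floral dress"])
print(prediction)
```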

You can find the source code on GitHub. I look forward to hearing any feedback and questions.

