Multi-Class Hebrew Text Classification with Scikit-Learn

Photo by Burgess Milner on Unsplash

In this article, we will classify women’s clothing product descriptions into 13 predefined classes. All descriptions are in Hebrew.

We will train two Scikit-Learn classifiers that support multi-class problems: Multinomial Naive Bayes and Logistic Regression.

Goal

Data

Load Data

import pandas as pd
df = pd.read_csv("fashion_data.csv", index_col=[0], header=[0])
df=df.dropna()
df=df.drop_duplicates()
df=df.reset_index(drop=True)
df.info()
Figure 1
df.sample(10)
Table 1

Explore Data

df_group = df.groupby("category")
df_group = df_group.agg({"description": "nunique"})
df_group = df_group.reset_index()
df_group.head(13)
Table 2
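Note that "nunique" counts distinct descriptions per category, not rows. A minimal sketch of the difference, using a hypothetical toy DataFrame (the category names and rows below are made up for illustration and are not taken from fashion_data.csv):

```python
import pandas as pd

# Hypothetical toy data; the real dataset has 13 categories.
toy = pd.DataFrame({
    "category": ["dress", "dress", "dress", "jeans"],
    "description": ["שמלה ארוכה", "שמלה ארוכה", "שמלה קצרה", "מכנסי גינס"],
})

# Number of distinct descriptions per category (what the article computes)
unique_counts = toy.groupby("category")["description"].nunique()

# Number of rows per category, repeated descriptions included
row_counts = toy.groupby("category").size()

print(unique_counts)
print(row_counts)
```

The two counts agree only when no description is repeated within a category.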

Barplot

import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 100
df_group.plot(x='category', y='description', kind='bar', legend=False, grid=True, figsize=(8, 5))
plt.title("Number of items per class")
plt.ylabel('Number of items', fontsize=12)
plt.xlabel('Class', fontsize=12)
Figure 1

Word Cloud

from wordcloud import WordCloud
from bidi.algorithm import get_display
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 200

def generate_word_cloud(text):
    # font_path must point to a font file that contains Hebrew glyphs
    wordcloud = WordCloud(width=600, height=600,
                          background_color='white',
                          max_words=200,
                          min_font_size=10,
                          font_path='davidbd.ttf').generate(text)
    return wordcloud

# Reorder each right-to-left description so it renders correctly
cleaned_text = list()
for i, r in df.iterrows():
    text = r['description']
    if isinstance(text, str):
        bidi_text = get_display(text)
        cleaned_text.append(bidi_text)

# Join with a space so words from adjacent descriptions do not merge
wordcloud = generate_word_cloud(' '.join(cleaned_text))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
Figure 2

Classifier Models

Split Data

We split the data into training and test sets with Scikit-Learn's train_test_split, holding out 33% of the items for testing:

from sklearn.model_selection import train_test_split
import numpy as np
X_train, X_test, y_train, y_test = train_test_split(np.array(df['description']), np.array(df['category']), test_size=0.33, random_state=42)
X_train.shape, X_test.shape
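When class sizes are uneven, a plain random split can leave a small class under-represented in the test set. One option, not used in the article, is the stratify parameter of train_test_split, which keeps class proportions roughly equal in both parts. A minimal sketch on made-up labels (the class names and sizes are assumptions for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data: 18 dresses, 9 jeans, 3 shirts
X = np.array([f"item {i}" for i in range(30)])
y = np.array(["dress"] * 18 + ["jeans"] * 9 + ["shirt"] * 3)

# stratify=y preserves the class proportions in the train and test parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, random_state=42, stratify=y
)

print(X_tr.shape, X_te.shape)
print(np.unique(y_te, return_counts=True))
```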

Extract Features

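We convert the descriptions into bag-of-words count vectors with Scikit-Learn's CountVectorizer, keeping the 1,500 most frequent terms. The vectorizer is fitted on the training set only and then applied to the test set: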

from sklearn.feature_extraction.text import CountVectorizer
cv = CountVectorizer(max_features=1500)
cv_train_features = cv.fit_transform(X_train)
cv_test_features = cv.transform(X_test)
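To see what the vectorizer produces, here is a small sketch on three made-up Hebrew descriptions (the sample texts are illustrative, not rows from the dataset). Each column corresponds to one word of the learned vocabulary, and each cell holds how often that word appears in a description:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical sample descriptions
docs = ["שמלה פרחונית ארוכה", "שמלה קצרה", "מכנסי גינס"]

vec = CountVectorizer(max_features=1500)
X = vec.fit_transform(docs)  # sparse matrix: one row per description

# vocabulary_ maps each word to its column index
print(sorted(vec.vocabulary_))
print(X.toarray())
```

The default tokenizer splits on word boundaries and keeps tokens of two or more characters, which works for Hebrew letters without extra configuration.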

Multinomial Naive Bayes

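We train a Naive Bayes model with Scikit-Learn's MultinomialNB, then report 5-fold cross-validation accuracy on the training set and accuracy on the held-out test set: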

from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score
mnb = MultinomialNB(alpha=1)
mnb.fit(cv_train_features, y_train)
mnb_test_score = mnb.score(cv_test_features, y_test)
mnb_cv_scores = cross_val_score(mnb, cv_train_features, y_train, cv=5)
mnb_cv_mean_score = np.mean(mnb_cv_scores)
print('Cross Validation Accuracy (5-fold):', mnb_cv_scores)
print('Mean of Cross Validation Accuracy:', mnb_cv_mean_score)
print('Test Accuracy:', mnb_test_score)
Figure 3
cv_pred_features = cv.transform([
    'שמלה פרחונית ארוכה',
    'מכנסי גינס',
    'חולצה מכופתרת'
])
mnb.predict(cv_pred_features)
Figure 4
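Besides the predicted label, MultinomialNB can return a probability for every class through predict_proba, which helps to judge how confident a prediction is. A minimal sketch on a made-up training set (the texts and the three class names are assumptions for illustration, not the article's data):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy training set with three classes
train_texts = [
    "שמלה פרחונית ארוכה", "שמלה קצרה",
    "מכנסי גינס", "מכנסי בד",
    "חולצה מכופתרת", "חולצה לבנה",
]
train_labels = ["dress", "dress", "pants", "pants", "shirt", "shirt"]

vec = CountVectorizer()
clf = MultinomialNB(alpha=1)
clf.fit(vec.fit_transform(train_texts), train_labels)

# Probability of each class for one new description
query = vec.transform(["שמלה ארוכה"])
probs = clf.predict_proba(query)[0]

# Pair each class with its probability, most likely class first
ranked = sorted(zip(clf.classes_, probs), key=lambda p: p[1], reverse=True)
for label, p in ranked:
    print(label, round(p, 3))
```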

Logistic Regression

We train the second model with Scikit-Learn's LogisticRegression and evaluate it the same way:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression(penalty='l2', solver='lbfgs', max_iter=100, C=1, random_state=42)
lr.fit(cv_train_features, y_train)
lr_test_score = lr.score(cv_test_features, y_test)
lr_cv_scores = cross_val_score(lr, cv_train_features, y_train, cv=5)
lr_cv_mean_score = np.mean(lr_cv_scores)
print('Cross Validation Accuracy (5-fold):', lr_cv_scores)
print('Mean of Cross Validation Accuracy:', lr_cv_mean_score)
print('Test Accuracy:', lr_test_score)
Figure 5
cv_pred_features = cv.transform([
    'שמלה פרחונית ארוכה',
    'מכנסי גינס',
    'חולצה מכופתרת'
])
lr.predict(cv_pred_features)
Figure 6
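In the code above, the vectorizer is fitted once on the whole training set before cross-validation, so each validation fold has already influenced the vocabulary. An alternative, not used in the article, is to wrap the vectorizer and the classifier in a Pipeline so that both are refitted inside every fold. A minimal sketch on made-up data (the texts, class names, and the cv=2 setting are assumptions chosen to fit the tiny sample):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: four descriptions per class
texts = [
    "שמלה פרחונית ארוכה", "שמלה קצרה", "שמלה שחורה", "שמלה ארוכה",
    "מכנסי גינס", "מכנסי בד", "מכנסי גינס קצרים", "מכנסי בד ארוכים",
    "חולצה מכופתרת", "חולצה לבנה", "חולצה מכופתרת לבנה", "חולצה קצרה",
]
labels = ["dress"] * 4 + ["pants"] * 4 + ["shirt"] * 4

# The pipeline takes raw text and refits the vectorizer in each fold
pipe = make_pipeline(
    CountVectorizer(max_features=1500),
    LogisticRegression(C=1, max_iter=1000),
)

scores = cross_val_score(pipe, texts, labels, cv=2)
print("Cross Validation Accuracy (2-fold):", scores)

# Fit on all the toy data and classify a new raw description
pipe.fit(texts, labels)
prediction = pipe.predict(["מכנסי גינס"])[0]
print(prediction)
```

A pipeline also simplifies prediction, because new descriptions can be passed in as raw strings without a separate transform step.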

You can find the source code on GitHub. I look forward to your feedback and questions.

