Easy Way to Convert Categorical Variables in PySpark

Converting Categorical Data using OneHotEncoding

M. Masum, PhD
3 min read · Apr 23, 2022

We cannot feed categorical data directly into Machine Learning (ML) algorithms; we must first give the categorical features of a dataset a numerical representation. In Python, we generally use LabelEncoder, get_dummies, or OneHotEncoder for this conversion. In PySpark, however, the conversion works somewhat differently, and that is what this post covers.

There are two types of categorical features that we generally deal with:

  1. The feature contains two categories
  2. The feature contains more than two categories

In PySpark, there are two methods we can use for the conversion: StringIndexer and OneHotEncoder.

When a feature contains only two categories/groups, we can apply the StringIndexer method directly. StringIndexer is the equivalent of LabelEncoder from Python's scikit-learn package.
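A minimal sketch of StringIndexer on a binary column, assuming a toy DataFrame with a hypothetical `gender` column (the data and column names are made up for illustration):

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data with a two-category feature
df = spark.createDataFrame(
    [("male",), ("female",), ("female",), ("male",)],
    ["gender"],
)

# StringIndexer assigns a numeric index to each category
# (the most frequent category gets index 0.0)
indexer = StringIndexer(inputCol="gender", outputCol="gender_index")
indexed = indexer.fit(df).transform(df)
indexed.show()
```

With only two categories, the resulting 0.0/1.0 index column can be fed to an ML model as-is, just like the output of scikit-learn's LabelEncoder.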

However, when a feature contains more than two groups, we cannot apply OneHotEncoder directly as we would in Python. In PySpark, OneHotEncoder requires numeric input, so before fitting the categorical data to the OneHotEncoder we must first convert the feature into a numerical format using StringIndexer, and then pass the indexed column to the encoder.
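A minimal sketch of the two-step conversion, assuming Spark 3.x (where OneHotEncoder is an estimator with a `fit` step) and a hypothetical `city` column with more than two categories:

```python
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, OneHotEncoder

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data with a multi-category feature
df = spark.createDataFrame(
    [("NY",), ("LA",), ("Chicago",), ("NY",), ("LA",)],
    ["city"],
)

# Step 1: StringIndexer converts the string categories to numeric indices
indexer = StringIndexer(inputCol="city", outputCol="city_index")
indexed = indexer.fit(df).transform(df)

# Step 2: OneHotEncoder turns the indices into sparse one-hot vectors
encoder = OneHotEncoder(inputCols=["city_index"], outputCols=["city_vec"])
encoded = encoder.fit(indexed).transform(indexed)
encoded.show()
```

The two stages can also be chained inside a `pyspark.ml.Pipeline`, which keeps the indexing and encoding steps together when the same transformation has to be reapplied to new data.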
