Target Encoding in Machine Learning

I’ve been getting my hands dirty with machine learning lately. I did some research on target encoding and how it can improve or alter a model’s accuracy, and I want to share it with you. A feature with too many categorical variations can make an ML model overly complicated, for example unique fields like IDs, or categories with too many possible values (city names). One good option to still use these fields in an ML model is to aggregate and encode them. We will do this with target encoding. Let’s see how that works in more detail.

This is a self-made experiment; no external materials were used as inspiration.

What is Target Encoding?

Target encoding is a method that converts a categorical feature into the mean value of the target for each category. The result is a new numerical column that replaces the original categorical feature when training the ML model.

Let’s look at a simple example. We will build an ML model to predict a city’s temperature based on some basic data collected over time.

city name    temperature (Y)
Berlin       20
Frankfurt    22
Munich       24
Berlin       20
Hamburg      19
Munich       18
Berlin       23
Frankfurt    26
Munich       25

The temperatures at the different recordings will be our target (Y) for this model.

Calculate the mean value for each city:

Berlin = (20 + 20 + 23) / 3 = 21
Frankfurt = (22 + 26) / 2 = 24
Munich = (24 + 18 + 25) / 3 = 22.3
Hamburg = 19 / 1 = 19

The newly calculated mean values are used as a feature for training the model.

New dataset:

city name (encoded)   temperature (Y)
21                    20
24                    22
22.3                  24
21                    20
19                    19
22.3                  18
21                    23
24                    26
22.3                  25

Now the model can be trained with numerical features.
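
As a quick illustration, here is a minimal pandas sketch of this encoding. The DataFrame and column names are illustrative, not taken from the original experiment:

import pandas as pd

cities = pd.DataFrame({
    'city_name': ['Berlin', 'Frankfurt', 'Munich', 'Berlin', 'Hamburg',
                  'Munich', 'Berlin', 'Frankfurt', 'Munich'],
    'temperature': [20, 22, 24, 20, 19, 18, 23, 26, 25],
})

# Replace each city with the mean temperature recorded for that city
cities['city_encoded'] = cities.groupby('city_name')['temperature'].transform('mean')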

How target encoding affects ML model prediction results

We will use a very simple dataset to understand exactly how target encoding affects ML model training. For training we will use XGBoost, but the specific algorithm is not relevant for this analysis.

For validating the model we will use the same data that we used for training. We do this intentionally, to see in which cases the ML model will lose information when using target encoding and how that affects the results.

In a real use case, the dataset should be split into train, validation and test parts.
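
As a side note, here is a minimal sketch of such a split with scikit-learn’s train_test_split. It is not part of this experiment and assumes the full dataset is in a DataFrame called df; the proportions are illustrative:

from sklearn.model_selection import train_test_split

# 60% train, 20% validation, 20% test
train_df, rest_df = train_test_split(df, test_size=0.4, random_state=42)
val_df, test_df = train_test_split(rest_df, test_size=0.5, random_state=42)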

Training features:

restaurantId: the unique id of the restaurant. Could also be a unique name.
menu_items: the number of items that were ordered from the restaurant's menu
bill: the cost of the order

Here is the dataset we will work with:

#   restaurantId   menu_items   bill
0   1001           3            50
1   1001           2            30
2   1001           1            10
3   1002           1            15
4   1003           2            25


Example 1:

First, let’s train an ML model using the restaurantId and menu_items fields. For restaurantId we will use one-hot encoding, to have an accurate model to compare the other solutions against.

Training data:

#   restaurantId   menu_items
0   1001           3
1   1001           2
2   1001           1
3   1002           1
4   1003           2

Y:

#   bill
0   50
1   30
2   10
3   15
4   25

After one-hot encoding restaurantId and training the model, we feed the same training data back in to predict the outputs. The results are very accurate compared to the real bill (Y) values.

#   restaurantId   menu_items   bill   predicted_bill
0   1001           3            50     50
1   1001           2            30     30
2   1001           1            10     10
3   1002           1            15     15
4   1003           2            25     25

Using one-hot encoding on fields like restaurantId is not feasible in production because these fields can grow indefinitely. Each new restaurantId would create a new column in the one-hot encoded training data, and the model would keep growing with it.
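
Here is a minimal sketch of this one-hot workflow. The article only states that XGBoost was used; the exact parameters and variable names below are assumptions:

import pandas as pd
from xgboost import XGBRegressor

df = pd.DataFrame({
    'restaurantId': [1001, 1001, 1001, 1002, 1003],
    'menu_items':   [3, 2, 1, 1, 2],
    'bill':         [50, 30, 10, 15, 25],
})

# One-hot encode restaurantId: one 0/1 column per restaurant
X = pd.get_dummies(df[['restaurantId', 'menu_items']], columns=['restaurantId'])
y = df['bill']

model = XGBRegressor(n_estimators=100)
model.fit(X, y)
df['predicted_bill'] = model.predict(X)  # predicting on the training data on purpose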


Example 2:

Let’s use the same dataset, but instead of using restaurantId directly, we will target encode it against the bill by adding a new column called mean_bill that holds the mean bill value for each restaurant.

#   restaurantId   menu_items   bill   mean_bill
0   1001           3            50     30.0
1   1001           2            30     30.0
2   1001           1            10     30.0
3   1002           1            15     15.0
4   1003           2            25     25.0

We will train only with the following fields:

#   menu_items   mean_bill
0   3            30.0
1   2            30.0
2   1            30.0
3   1            15.0
4   2            25.0

And the same Y:

#   bill
0   50
1   30
2   10
3   15
4   25

After feeding this data into the same ML training logic, here are the results. Comparing bill with the model’s predicted_bill, we again get very accurate results.

#   restaurantId   menu_items   bill   mean_bill   predicted_bill
0   1001           3            50     30.0        50
1   1001           2            30     30.0        30
2   1001           1            10     30.0        10
3   1002           1            15     15.0        15
4   1003           2            25     25.0        25

The high accuracy comes from the fact that each combination of menu_items and mean_bill is unique. In a big dataset this is rarely the case, so let’s see next how duplicated encoded values can affect the outcome.
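
A quick sketch of how the mean_bill column can be derived with pandas, assuming the df DataFrame from the previous sketch (this is equivalent to the helper function shown later in the article):

# Target encode restaurantId against bill: mean bill per restaurant
means = df.groupby('restaurantId')['bill'].mean()
df['mean_bill'] = df['restaurantId'].map(means)

# Train only on the numerical features
X = df[['menu_items', 'mean_bill']]
y = df['bill']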


Example 3:

Target encoding works by reducing the number of variations of a field (restaurantId) and grouping rows that share a value (mean_bill). This means the ML model has less information in the training data: it loses the uniqueness of restaurantId, which results in duplicated encoded values. This shows up when multiple restaurants have the same mean_bill.

Here is an example. We add a new row with restaurantId = 1004 that has menu_items = 1 and encoded mean_bill = 30.0, the same as restaurant 1001 on row #2. The difference is that row #2 has bill = 10 as Y, while row #5 has bill = 30 as Y.

#   restaurantId   menu_items   bill   mean_bill
0   1001           3            50     30.0
1   1001           2            30     30.0
2   1001           1            10     30.0
3   1002           1            15     15.0
4   1003           2            25     25.0
5   1004           1            30     30.0

So for two records with the same menu_items = 1 and mean_bill = 30 as input, the model is trained with bill = 10 and bill = 30 as output. Even though these bills come from separate restaurants, the model loses this separation, so it averages them out to bill = (10 + 30) / 2 = 20.

And the results after training and predicting confirm this:

#   restaurantId   menu_items   bill   mean_bill   predicted_bill
0   1001           3            50     30.0        50
1   1001           2            30     30.0        30
2   1001           1            10     30.0        20
3   1002           1            15     15.0        15
4   1003           2            25     25.0        25
5   1004           1            30     30.0        20

Rows #2 and #5 both end up with predicted_bill = 20.
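
A quick way to spot such collisions, assuming the six rows above are loaded in a DataFrame df with the mean_bill column already added:

# Groups that share identical (menu_items, mean_bill) inputs but contain
# more than one distinct bill value force the model to average them
conflicts = df.groupby(['menu_items', 'mean_bill'])['bill'].nunique()
print(conflicts[conflicts > 1])  # (1, 30.0) -> 2 distinct bill values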


The conclusion of this research:

Categorical columns with many variations can still be useful if they are target encoded, but rows from different categories can affect each other’s results. It can be beneficial to add multiple target encoded columns to reduce the risk of duplications. Besides the mean_bill, we can also use mean_items, the deviation between bill and mean_bill, or any other external inputs to aggregate the rows.

Source code for this example on Github.


Code to calculate target encoding

First we create a list with all unique values in the categorical field and calculate the mean value of the target_field for each key:

def get_target_encoding(df, field, target_field):
    # For each unique value of the categorical field,
    # store the mean of the target field
    unique_fields = df[field].unique()
    targets = {}
    for f in unique_fields:
        targets[f] = df[df[field] == f][target_field].mean()
    return targets

targets = get_target_encoding(df, 'restaurantId', 'bill')
df['mean_bill'] = df['restaurantId'].apply(lambda x: targets[x])
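
One caveat not covered by the code above: targets is a plain dict, so a restaurantId that was never seen during encoding would raise a KeyError at prediction time. A common fallback, sketched here as an assumption rather than part of the original experiment, is the global mean:

global_mean = df['bill'].mean()
df['mean_bill'] = df['restaurantId'].apply(lambda x: targets.get(x, global_mean))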

Adding the deviation as another target encoding column:

df['mean_bill_deviation'] = df['bill'] - df['mean_bill']

Target encoding in a project with real data

For the Machine Learning training course offered by DataTalks.Club we had to build a full ML system: find a dataset, perform exploratory data analysis, clean up the features, try different ML algorithms, do hyperparameter tuning, find the most accurate model, put it on a Dockerised Python server and serve it over an API. Great stuff!

I chose to predict “Berlin base rent for apartments” using a real but older dataset. You can find the entire project here: ML capstone-project

Everything went well but the accuracy was below what I expected.

Results were:

MAE: 219 euro
Model max deviation of 50 euro: 27%.

Yes, apartments with similar specs have rents that deviate that much. Of course, this is just the asking price; it does not mean that someone will accept it. But in Berlin, trust me, you don’t negotiate down.

Anyway, let’s see if we can improve this accuracy by applying target encoding on a categorical column.

The cleaned up dataset looks like this:

[Image: cleaned dataset]

Notice we have a neighbourhood column. These are the 23 neighbourhoods of Berlin, and I used one-hot encoding for them. But we also have subneighbourhood, which are smaller areas within each neighbourhood. In the first training I excluded the subneighbourhood column because it had 79 variations and the model would get quite big if I one-hot encoded all of them.

First we calculate the mean value for each subneighbourhood:

subneighbourhood_names = berlinDf_select['subneighbourhood'].unique()  # returns 79 values

subneighbourhood_mean_baseRent = {}

# Mean base rent for each subneighbourhood, stored as an integer
for s in subneighbourhood_names:
    selected = berlinDf_select[berlinDf_select['subneighbourhood'] == s]
    mean_baseRent = selected.baseRent.mean()
    subneighbourhood_mean_baseRent[s] = int(mean_baseRent)

Then we update the dataset:

# Replacing sub-neighbourhood with mean base rent as target encoded:
berlinDf_select['subneighbourhood_meanBaseRent'] = berlinDf_select['subneighbourhood'].apply(lambda x: subneighbourhood_mean_baseRent[x])
berlinDf_select.drop('subneighbourhood', axis='columns', inplace=True)
berlinDf_select
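
As a side note, the same mapping could presumably be produced with the get_target_encoding helper from earlier; the only difference is that the values above are truncated to integers:

subneighbourhood_mean_baseRent = get_target_encoding(berlinDf_select, 'subneighbourhood', 'baseRent')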

We end up with the following dataset: [Image: target encoded dataset]

Notice that Spandau has different subneighbourhood_meanBaseRent values because its rows come from different subneighbourhoods.

After applying this new dataset to the same XGBoost model and performing some hyperparameter tuning, we get slightly improved accuracy:

MAE: 198 euro (-21)
Model max deviation of 50 euro: 26% (-1)
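
For context, here is a rough sketch of the kind of tuning loop this refers to. The parameter grid is illustrative, not the project’s actual grid, and it assumes train/validation frames X_train, y_train, X_val, y_val already exist:

from itertools import product
from xgboost import XGBRegressor
from sklearn.metrics import mean_absolute_error

# Illustrative grid, not the values used in the capstone project
grid = {'max_depth': [4, 6, 8], 'eta': [0.05, 0.1, 0.3]}

best_mae, best_params = float('inf'), None
for max_depth, eta in product(grid['max_depth'], grid['eta']):
    model = XGBRegressor(max_depth=max_depth, learning_rate=eta, n_estimators=200)
    model.fit(X_train, y_train)
    mae = mean_absolute_error(y_val, model.predict(X_val))
    if mae < best_mae:
        best_mae, best_params = mae, (max_depth, eta)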

Source code for this project on Github.


Don’t forget, in science everything is simpler than it first seems.

Thanks for reading and happy coding!

Written on January 15, 2022