Yeo-Johnson Transform in Machine Learning

Introduction

In machine learning, data preprocessing has a large effect on the performance of predictive models. One technique that often flies under the radar is the Yeo-Johnson Transform. It is a power transformation that, unlike the Box-Cox Transform, accepts zero and negative values as well as positive ones, which makes it useful for a wider range of datasets. In this article, we'll look at how the Yeo-Johnson Transform is defined, what its benefits are, and how to apply it with practical coding examples.

Yeo-Johnson Transformation

Developed by In-Kwon Yeo and Richard A. Johnson in 2000, the Yeo-Johnson Transform is a modification of the Box-Cox Transform, which is defined only for strictly positive data. The Yeo-Johnson Transform extends this to data containing zero and negative values, offering greater flexibility in data preprocessing tasks.
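The Yeo-Johnson Transform is defined as:

\[
y_i =
\begin{cases}
\dfrac{(x_i + 1)^{\lambda} - 1}{\lambda}, & \lambda \neq 0,\ x_i \geq 0 \\
\ln(x_i + 1), & \lambda = 0,\ x_i \geq 0 \\
-\dfrac{(1 - x_i)^{2 - \lambda} - 1}{2 - \lambda}, & \lambda \neq 2,\ x_i < 0 \\
-\ln(1 - x_i), & \lambda = 2,\ x_i < 0
\end{cases}
\]

Where

  • xi  is the original data point.
  • yi  is the transformed data point.
  • λ  is the transformation parameter. A single value is used for the whole variable, and it is usually estimated by maximum likelihood so that the transformed data are as close to normal as possible.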
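To make the definition concrete, here is a minimal sketch of the standard piecewise Yeo-Johnson function in plain Python. The function name `yeo_johnson` and the sample values are illustrative assumptions, not part of any library. It uses a logarithm when λ = 0 (for non-negative inputs) or λ = 2 (for negative inputs), and a shifted power otherwise.

```python
import math


def yeo_johnson(x: float, lam: float) -> float:
    """Apply the Yeo-Johnson transform to one value x with parameter lam."""
    if x >= 0:
        if abs(lam) < 1e-12:
            # lambda == 0: logarithmic branch for non-negative inputs
            return math.log1p(x)
        return ((x + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        # lambda == 2: logarithmic branch for negative inputs
        return -math.log1p(-x)
    return -((1.0 - x) ** (2.0 - lam) - 1.0) / (2.0 - lam)


# Illustrative values covering negative, zero, and positive inputs
values = [-2.0, -0.5, 0.0, 0.5, 2.0]
transformed = [yeo_johnson(v, 0.5) for v in values]
identity = [yeo_johnson(v, 1.0) for v in values]

print(transformed)
print(identity)
```

With λ = 1 the transform leaves the data unchanged, so λ can be read as a measure of how far the data must be bent away from the identity to look symmetric.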

Key Benefits of Yeo-Johnson Transform

  1. Handles both Positive and Negative Values: Unlike the Box-Cox Transform, which is limited to positive data, the Yeo-Johnson Transform can accommodate data with a broader range of values, including negative ones.
  2. Preserves Zero Values: Zero is a valid input and is mapped to zero by the transform itself (before any standardization), making it suitable for datasets with a mixture of zero and non-zero values.
  3. Flexibility in Transformation: The transformation parameter λ controls the shape of the transformation (λ = 1 leaves the data unchanged), so it can be tuned to the distribution of the data.
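These properties can be checked with a short sketch using scikit-learn's `PowerTransformer`. The toy array below is a made-up example, and `standardize=False` is chosen here only so that the raw transform output is visible.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical toy data with negative, zero, and positive values
x = np.array([[-3.0], [-1.0], [0.0], [0.5], [2.0], [10.0]])

# standardize=False keeps the raw transform output, so zero stays at zero
yj = PowerTransformer(method="yeo-johnson", standardize=False)
x_yj = yj.fit_transform(x)

# Box-Cox requires strictly positive inputs, so fitting on the same data fails
try:
    PowerTransformer(method="box-cox").fit(x)
    box_cox_failed = False
except ValueError:
    box_cox_failed = True

print("Estimated lambda:", yj.lambdas_[0])
print("Transformed zero:", x_yj[2, 0])
print("Box-Cox rejected the data:", box_cox_failed)
```

The estimated λ depends on the data, so on a different array the transformed values will differ, but zero still maps to zero and the ordering of the values is preserved.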

Implementation

import pandas as pd
from sklearn.datasets import fetch_openml
from sklearn.preprocessing import PowerTransformer
from sklearn.model_selection import train_test_split

# Load the Boston housing dataset.
# load_boston was removed in scikit-learn 1.2. Fetching the OpenML copy
# (assumed here to be name="boston", version=1) is one replacement;
# it needs network access on first use.
boston = fetch_openml(name="boston", version=1, as_frame=True)
data = boston.data
target = boston.target.astype(float).to_frame(name='MEDV')

# Concatenate features and target variable
df = pd.concat([data, target], axis=1)

# View the dataframe
df.head()

[Output: the first five rows of the dataframe]

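Now split the data into training and test sets and apply the transform to the target variable. The transformer is fitted on the training set only and then reused on the test set, so no information from the test data leaks into the estimate of λ.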

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data, target, test_size=0.2, random_state=42)

# Apply Yeo-Johnson Transform to the target variable
transformer = PowerTransformer(method='yeo-johnson', standardize=True)
y_train_transformed = transformer.fit_transform(y_train)
y_test_transformed = transformer.transform(y_test)

# Convert transformed data back to DataFrame
y_train_transformed = pd.DataFrame(y_train_transformed, columns=['MEDV'])
y_test_transformed = pd.DataFrame(y_test_transformed, columns=['MEDV'])

# Display transformed target variable
print("Transformed Train Target:")
print(y_train_transformed.head())

[Output: the first five rows of the transformed training target]

print("\nTransformed Test Target:")
print(y_test_transformed.head())

[Output: the first five rows of the transformed test target]
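When the target variable is transformed, model predictions come out on the transformed scale and usually need to be mapped back to the original units. The sketch below shows how the fitted λ can be inspected through `lambdas_` and how `inverse_transform` undoes the transformation. It uses synthetic right-skewed data (an assumption made so the example runs without downloading anything) rather than the housing data.

```python
import numpy as np
from sklearn.preprocessing import PowerTransformer

# Hypothetical right-skewed target standing in for a price-like variable
rng = np.random.default_rng(42)
y = rng.lognormal(mean=3.0, sigma=0.5, size=(200, 1))

transformer = PowerTransformer(method="yeo-johnson", standardize=True)
y_transformed = transformer.fit_transform(y)

# lambdas_ holds one estimated lambda per column
print("Estimated lambda:", transformer.lambdas_)

# With standardize=True the output has zero mean and unit variance
print("Mean:", y_transformed.mean(), "Std:", y_transformed.std())

# inverse_transform maps transformed values (for example, model
# predictions) back to the original units
y_recovered = transformer.inverse_transform(y_transformed)
print("Round trip OK:", np.allclose(y_recovered, y))
```

In a real pipeline, the same fitted transformer would be applied to predictions made on the transformed target before computing error metrics in the original units.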

Conclusion

The Yeo-Johnson transform is a valuable technique for reducing skew in numerical features and target variables in machine learning. Bringing data closer to a normal distribution can help models that work best with roughly symmetric, Gaussian-like inputs, such as linear regression, although the gain depends on the model and the data. Remember that data normalization is not a one-size-fits-all solution, but the Yeo-Johnson transform offers a versatile approach that works with positive, zero, and negative values alike.

