Pipeline and Column Transformer

pjaipurk · July 25, 2023, 3:34pm

Why pipeline and column transformer?

When working on a machine learning project, the most tedious step is often data cleaning and preprocessing. In a Jupyter Notebook especially, running code scattered across many cells can get confusing.

Before training a model, the data should be split into a training set and a test set. Each set has to go through the same cleaning and preprocessing steps before it reaches the model. Writing that code twice, once for the training set and once for the test set, is inefficient and error-prone. This is where a pipeline comes into play.

Pipeline and ColumnTransformer are elegant ways to create a data preprocessing workflow.

First of all, imagine that you can create a single pipeline that accepts any data and transforms it into the appropriate format before model training or prediction. It will shorten your code and make it easier to read and adjust.
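As a rough illustration of that idea, here is a minimal sketch using scikit-learn's Pipeline. The data is randomly generated, and the step names ("impute", "scale", "model") and the choice of imputer, scaler, and classifier are my own assumptions for the example, not something the workflow requires.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Made-up data: 100 rows, 3 numeric features, binary target.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X[::10, 2] = np.nan  # inject some missing values into the third feature

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# One object holds every preprocessing step plus the model.
pipe = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),  # fill missing values
    ("scale", StandardScaler()),                   # standardize features
    ("model", LogisticRegression()),               # final estimator
])

# fit() learns the imputation and scaling parameters from the training set only.
pipe.fit(X_train, y_train)

# predict() reuses those learned parameters on the test set,
# so there is no separate preprocessing code for it.
predictions = pipe.predict(X_test)
```

The same `pipe` object handles both the training set and the test set, which is what removes the repetitive code.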

Pipeline and ColumnTransformer

There is a big difference between Pipeline and ColumnTransformer that you must understand.

Pipeline: Use it to apply multiple transformations, one after another, to the same columns.

ColumnTransformer: Use it to transform each set of columns differently.

The ColumnTransformer doesn't chain its transformers step by step. Instead, it applies each transformer to its own set of columns separately and then concatenates the results side by side.
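To make the difference concrete, here is a small sketch that uses both together. The DataFrame, its column names (age, income, city), and the chosen transformers are made up for illustration. The numeric columns go through a Pipeline (impute, then scale, in sequence), while the ColumnTransformer sends the numeric and categorical columns to different transformers and stitches the outputs together.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data with two numeric columns and one categorical column.
df = pd.DataFrame({
    "age": [25, 32, None, 51],
    "income": [40000, 52000, 61000, None],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
})

# Pipeline: sequential steps applied to the SAME columns.
numeric_pipeline = Pipeline(steps=[
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# ColumnTransformer: DIFFERENT transformers for different column sets,
# applied independently and concatenated column-wise at the end.
preprocess = ColumnTransformer(
    transformers=[
        ("num", numeric_pipeline, ["age", "income"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["city"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

features = preprocess.fit_transform(df)
print(features.shape)  # 2 scaled numeric columns + 3 one-hot city columns
```

The resulting `preprocess` object can itself be used as a step inside a larger Pipeline that ends with a model.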