Data Preprocessing

Data preprocessing is a crucial step in the data analysis process, involving the preparation of raw data for further analysis or machine learning models. The goal of data preprocessing is to transform incomplete, incorrect, or irrelevant data into a useful and efficient format that enhances the quality and accuracy of the results.

Key Steps in Data Preprocessing include:

  1. Data Cleaning: This involves handling missing data, correcting errors, and removing outliers or duplicate records to improve the dataset’s quality.
  2. Data Transformation: This step modifies the data through normalization or scaling to bring it into a specific range, useful for comparison or integration within machine-learning algorithms.
  3. Data Reduction: Reducing the volume of data by eliminating redundant features or clustering data to simplify analysis without losing critical information.
  4. Data Integration: Combining data from different sources to create a coherent dataset, which often involves resolving data conflicts and inconsistencies.
  5. Feature Engineering: Creating new relevant features from existing data, which might provide additional insight when building predictive models.

These preprocessing steps make the data more manageable and better aligned with the requirements of the analytical task at hand, which leads to more reliable results.
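
As a rough illustration of three of the steps listed above (cleaning, transformation, and feature engineering), here is a minimal pandas sketch. The column names and values are invented for this example and are not taken from the tutorial's dataset.

```python
import pandas as pd

# Hypothetical toy records; the values are invented for illustration only.
raw = pd.DataFrame({
    "age": [25, 25, 40, 31],
    "salary": [30000, 30000, 80000, 52000],
})

# Data cleaning: drop exact duplicate rows (the first two rows are identical).
clean = raw.drop_duplicates().reset_index(drop=True)

# Data transformation: min-max scale each column into the range [0, 1].
scaled = (clean - clean.min()) / (clean.max() - clean.min())

# Feature engineering: derive a new column from the existing ones.
clean["salary_to_age_ratio"] = clean["salary"] / clean["age"]

print(len(raw), len(clean))  # 4 3
print(scaled)
```

Each step here is a single line, but in practice every one of them involves choices (what counts as a duplicate, which scaling to use) that depend on the data.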

Importing Libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

  • NumPy (np): Adds support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
  • Matplotlib (plt): Used for creating static, interactive, and animated visualizations in Python.
  • pandas (pd): Provides data structures and data analysis tools.

Data Loading

dataset = pd.read_csv('Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

  • dataset = pd.read_csv('Data.csv'): This line reads a CSV file named ‘Data.csv’ into a pandas DataFrame called dataset.
  • X = dataset.iloc[:, :-1].values: This extracts all rows and all columns except the last one as features (independent variables) into a NumPy array X.
  • y = dataset.iloc[:, -1].values: This extracts the last column of all rows, which represents the target variable (dependent variable), into a NumPy array y.
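
If you want to try the slicing without the file at hand, the sketch below reads a small in-memory CSV instead. The column names follow the ones used in this tutorial, but the rows are invented stand-ins, not the contents of the real ‘Data.csv’.

```python
import io
import pandas as pd

# Hypothetical stand-in for 'Data.csv'; the rows are invented for illustration.
csv_text = """Country,Age,Salary,Purchased
France,44,72000,No
Spain,27,48000,Yes
Germany,30,,No
Spain,,61000,Yes
"""

dataset = pd.read_csv(io.StringIO(csv_text))
X = dataset.iloc[:, :-1].values  # every column except the last (features)
y = dataset.iloc[:, -1].values   # only the last column (target)

print(X.shape)               # (4, 3)
print(y.shape)               # (4,)
print(dataset.isna().sum())  # empty CSV fields are read as NaN
```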

Handling Missing Data

from sklearn.impute import SimpleImputer
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])

The above code snippet addresses the issue of missing values within the ‘Age’ and ‘Salary’ columns of the dataset. Let’s break down what it does:

  • Import SimpleImputer:
    • from sklearn.impute import SimpleImputer – This line imports the SimpleImputer class from the scikit-learn library’s impute module. Scikit-learn provides various tools for data preprocessing, including handling missing data.
  • Create an Imputer Object:
    • imputer = SimpleImputer(missing_values=np.nan, strategy='mean') – This line creates an instance of the SimpleImputer class. Let’s analyze the parameters:
      • missing_values=np.nan: Tells the imputer to consider np.nan (Not a Number) values as missing values.
      • strategy='mean': Instructs the imputer to use the “mean” imputation strategy. This means it will replace missing values with the mean (arithmetic average) of the corresponding column.
  • Fit the Imputer:
    • imputer.fit(X[:, 1:3]) – This line “trains” the imputer on columns 1 and 2 of the feature matrix X (the ‘Age’ and ‘Salary’ columns). The imputer calculates the mean values for these columns.
  • Transform and Replace Missing Values:
    • X[:, 1:3] = imputer.transform(X[:, 1:3]) – This line applies the learned imputation strategy. It:
      1. Takes the same slice of your feature matrix (X[:, 1:3])
      2. Uses the imputer object’s transform method to replace any missing values in those columns with their calculated mean values.
      3. Assigns the transformed data back into the original slice, effectively updating X.
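
The fit and transform steps can be tried on a tiny array. This is a minimal sketch with invented values, where columns 1 and 2 play the role of ‘Age’ and ‘Salary’. The fitted means are exposed through the imputer's statistics_ attribute.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented rows: column 0 is a country label, columns 1 and 2 stand in for Age and Salary.
X = np.array([
    ["France", 40.0, 70000.0],
    ["Spain", np.nan, 50000.0],
    ["Germany", 30.0, np.nan],
], dtype=object)

imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
imputer.fit(X[:, 1:3])       # learns one mean per column, ignoring NaN
print(imputer.statistics_)   # column means: 35.0 and 60000.0

X[:, 1:3] = imputer.transform(X[:, 1:3])  # write the filled values back
print(X)
```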

Overall Effect

After running this code:

  • Missing values in the ‘Age’ and ‘Salary’ columns will be filled with the mean value of their respective columns.
  • The ‘Age’ and ‘Salary’ columns no longer contain missing values, so models that cannot handle missing values will be able to use them once the remaining preprocessing steps are done.

Important Considerations

  • Choice of Imputation Strategy: While using the mean is a common approach, it’s essential to consider if this is the most appropriate strategy for your data. Other strategies include ‘median’, ‘most_frequent’, or even more advanced approaches.
  • Data Distribution: If your data has outliers, the mean might not be the best representation of a typical value in a column.
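
To see how an outlier affects the choice of strategy, the following sketch compares ‘mean’ and ‘median’ on an invented salary column with one extreme value. The numbers are made up for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Invented salaries: three typical values, one extreme outlier, one missing entry.
salaries = np.array([[40000.0], [42000.0], [45000.0], [1000000.0], [np.nan]])

mean_filled = SimpleImputer(strategy="mean").fit_transform(salaries)
median_filled = SimpleImputer(strategy="median").fit_transform(salaries)

print(mean_filled[-1, 0])    # 281750.0, pulled far upwards by the outlier
print(median_filled[-1, 0])  # 43500.0, close to the typical salaries
```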

Encoding Categorical Data

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

Code Purpose:

The code snippet is designed to transform the categorical ‘Country’ column into a numerical format suitable for machine learning models. It uses One-Hot Encoding, a technique that creates new columns for each unique category.


  1. Import Classes:
    • from sklearn.compose import ColumnTransformer – Imports ColumnTransformer, a key class for applying different transformations to different columns of a dataset.
    • from sklearn.preprocessing import OneHotEncoder – Imports OneHotEncoder, used to achieve one-hot encoding.
  2. Create ColumnTransformer Object:
    • ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [0])], remainder='passthrough') – This creates a ColumnTransformer object named ct. Let’s dissect the parameters:
      • transformers: A list of tuples. Each tuple has these elements:
        • ‘encoder’: A name for this transformation step (arbitrary).
        • OneHotEncoder(): An instance of the one-hot encoder.
        • [0]: Specifies that this transformation should be applied to column index 0 (the ‘Country’ column).
      • remainder='passthrough': Indicates that all other columns (not specified in the transformers list) should be passed through without any changes.
  3. Fit, Transform, and Update Data:
    • X = np.array(ct.fit_transform(X)) – Essential steps in one-hot encoding:
      • fit_transform(X):
        • The ColumnTransformer learns the unique categories in the ‘Country’ column.
        • It performs one-hot encoding, creating new columns for each category (e.g., France, Spain, Germany).
      • np.array(...): Converts the transformed output into a NumPy array.
      • X = …: Assigns the transformed array back to X, replacing the original data.


  • The ‘Country’ column is replaced with new binary columns representing each country:
    • A value of ‘1’ in one of these columns will indicate the corresponding country for a given row.
    • All other new country columns will have a ‘0’ for that row.
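
A small run makes the resulting layout concrete. The rows below are invented, and the sketch assumes three countries. OneHotEncoder orders the new columns by sorted category, so here they come out as France, Germany, Spain, followed by the untouched ‘Age’ and ‘Salary’ columns.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Invented rows: Country, Age, Salary.
X = np.array([
    ["France", 44.0, 72000.0],
    ["Spain", 27.0, 48000.0],
    ["Germany", 30.0, 54000.0],
    ["Spain", 38.0, 61000.0],
], dtype=object)

ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [0])],
    remainder="passthrough",
)
X_encoded = np.array(ct.fit_transform(X))

# Categories learned for column 0, in the order of the new binary columns.
categories = ct.named_transformers_["encoder"].categories_[0]
print(categories)
print(X_encoded.shape)  # (4, 5): three country columns plus Age and Salary
print(X_encoded)
```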

Encoding the Target Variable

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
y = le.fit_transform(y)

The code transforms the categorical target variable ‘Purchased’ (containing text values ‘Yes’ and ‘No’) into numerical labels. Many machine learning models and libraries expect numerical targets.


  • Import LabelEncoder:
    • from sklearn.preprocessing import LabelEncoder – This imports the LabelEncoder class from scikit-learn’s preprocessing module.
  • Create LabelEncoder Object:
    • le = LabelEncoder() – This creates an instance of the LabelEncoder class and names it le.
  • Fit and Transform Target Variable:
    • y = le.fit_transform(y) – Here’s the combined effect of fit_transform:
      • fit: The encoder learns the unique labels from your target variable ‘y’ (‘Yes’ and ‘No’). It sorts them and assigns each one an integer, starting from 0.
      • transform: The encoder replaces the original text labels with their corresponding numerical labels.


  • The ‘Purchased’ column will now contain numbers instead of text. Because LabelEncoder assigns integers in sorted label order, ‘No’ becomes 0 and ‘Yes’ becomes 1.
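
The mapping can be checked directly on a short array of invented labels. The classes_ attribute lists the learned labels, and the position of each label in it is the integer it is encoded as.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Invented target values for illustration.
y = np.array(["No", "Yes", "No", "Yes", "Yes"])

le = LabelEncoder()
y_encoded = le.fit_transform(y)

print(le.classes_)  # ['No' 'Yes'], so 'No' is 0 and 'Yes' is 1
print(y_encoded)    # [0 1 0 1 1]
```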

Important Notes

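  • Order: Label encoding replaces categories with integers, which can suggest an order where none exists. This is harmless for a target such as ‘Purchased’, and LabelEncoder is intended for target values. For categorical input features, one-hot encoding is usually the more appropriate choice.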
  • Inverse Transformation: The LabelEncoder stores the mapping between text labels and numerical values. You can use its inverse_transform method to convert numerical labels back to the original text categories if needed.
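
A short sketch of the reverse mapping, using invented labels. The predicted array is a hypothetical stand-in for whatever numeric output a model would produce.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(np.array(["No", "Yes", "No"]))  # learns the mapping 'No' -> 0, 'Yes' -> 1

predicted = np.array([1, 0, 1])        # hypothetical numeric model output
restored = le.inverse_transform(predicted)
print(restored)                        # ['Yes' 'No' 'Yes']
```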

Splitting the Dataset

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1)

The code uses the train_test_split function from scikit-learn to divide your dataset into two portions:

  • Training set: Used to train your machine learning model.
  • Testing set: Used to evaluate how well your trained model performs on unseen data.


  • Import train_test_split:
    • from sklearn.model_selection import train_test_split – Imports the train_test_split function from scikit-learn’s model_selection module.
  • Splitting the Data:
    • X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 1) – This line does the following:
      • Input: Takes your feature matrix X and your target variable y as input.
      • test_size = 0.2: Specifies that 20% of your data should be allocated to the testing set. The remaining 80% will be in the training set.
      • random_state = 1: Sets a seed for the random number generator. This ensures that the data is split in the same way each time you run the code, making your results reproducible.
      • Output: Returns four new variables:
        • X_train: The feature matrix for the training set.
        • X_test: The feature matrix for the testing set.
        • y_train: The target variable values for the training set.
        • y_test: The target variable values for the testing set.
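
The sizes of the four outputs and the effect of random_state can be checked on a small invented dataset of 10 rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 10 rows, 2 features, alternating binary target.
X = np.arange(20).reshape(10, 2)
y = np.array([0, 1] * 5)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)
print(X_train.shape, X_test.shape)  # (8, 2) (2, 2)
print(y_train.shape, y_test.shape)  # (8,) (2,)

# Running the split again with the same seed gives the same rows.
_, X_test_again, _, _ = train_test_split(X, y, test_size=0.2, random_state=1)
print(np.array_equal(X_test, X_test_again))  # True
```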

Why Data Splitting is Important

Splitting data into training and testing sets is a crucial practice in machine learning because it lets you detect overfitting and estimate how well a model generalizes. Here’s why:

  • Overfitting: Occurs when a model learns the patterns in the training data too well, including the noise and specific quirks. This results in a model that performs very well on the training data but poorly on new, unseen data.
  • Evaluation: A testing set allows you to get a realistic assessment of how well your model will generalize to new data, giving you a better idea of its real-world performance.
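
The gap between training and testing performance can be seen with a deliberately flexible model. The sketch below is an illustration only: the data is synthetic with 20% of the labels flipped at random, and the unrestricted decision tree is an arbitrary choice made for this example, not a model used elsewhere in this tutorial.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends on the first feature, then 20% of labels are flipped.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = (X[:, 0] > 0).astype(int)
flip = rng.random(200) < 0.2
y[flip] = 1 - y[flip]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=1
)

# A tree with no depth limit can memorize the training rows, noise included.
tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
train_acc = tree.score(X_train, y_train)
test_acc = tree.score(X_test, y_test)

print(f"training accuracy: {train_acc:.2f}")
print(f"testing accuracy:  {test_acc:.2f}")
```

The training accuracy alone would suggest a perfect model. Only the score on the held-out rows shows how much of that is memorized noise.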

Feature Scaling

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])
X_test[:, 3:] = sc.transform(X_test[:, 3:])

The code applies standardization to specific columns of the feature matrix. Standardization is a common preprocessing technique that helps many machine learning algorithms by ensuring features have roughly a zero mean and unit variance (standard deviation of 1).


  • Import StandardScaler:
    • from sklearn.preprocessing import StandardScaler – Imports the StandardScaler class from scikit-learn’s preprocessing module.
  • Create StandardScaler Object:
    • sc = StandardScaler() – Creates an instance of the StandardScaler class and names it sc.
  • Fit and Transform Training Set:
    • X_train[:, 3:] = sc.fit_transform(X_train[:, 3:]):
      • X_train[:, 3:]: Selects all rows and the columns from index 3 onwards in the training set. Assuming ‘Country’ was encoded into three one-hot columns (indices 0 to 2), this selects the numerical columns ‘Age’ and ‘Salary’.
      • sc.fit_transform(...):
        • fit: Calculates the mean and standard deviation of each selected feature (column) in the training set.
        • transform: Subtracts the mean and divides by the standard deviation for each value in those columns, standardizing the data.
  • Transform Testing Set:
    • X_test[:, 3:] = sc.transform(X_test[:, 3:])
      • X_test[:, 3:]: Selects the corresponding columns in the testing set.
      • sc.transform(...): Applies the same scaling (using means and standard deviations calculated on the training set) to the testing set. This is crucial to avoid information leakage from the test set into the scaling process.
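
The sketch below runs the same two calls on a few invented rows, assuming three one-hot country columns followed by ‘Age’ and ‘Salary’. It shows that the scaler's statistics come from the training rows only and that the test row is scaled with those same statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented rows: three one-hot country columns, then Age and Salary.
X_train = np.array([
    [1.0, 0.0, 0.0, 30.0, 40000.0],
    [0.0, 1.0, 0.0, 40.0, 60000.0],
    [0.0, 0.0, 1.0, 50.0, 80000.0],
])
X_test = np.array([[1.0, 0.0, 0.0, 40.0, 70000.0]])

sc = StandardScaler()
X_train[:, 3:] = sc.fit_transform(X_train[:, 3:])  # learn mean/std on train, then scale
X_test[:, 3:] = sc.transform(X_test[:, 3:])        # reuse the training mean/std

print(sc.mean_)                     # training means: 40.0 and 60000.0
print(X_train[:, 3:].mean(axis=0))  # approximately 0 for both columns
print(X_train[:, 3:].std(axis=0))   # approximately 1 for both columns
print(X_test)                       # one-hot columns unchanged, Age and Salary scaled
```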

Why Feature Scaling is Useful

  • Algorithm Sensitivity: Many machine learning algorithms (such as those based on distance calculation or gradient descent) can be sensitive to feature scales. Large differences in scales between features can skew results. Standardization helps level the playing field.
  • Convergence: Some algorithms might converge faster (find the optimal solution) when features are on a similar scale.
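
A small numeric example shows how a large-scale feature can dominate a distance calculation. The three people and the standard deviations used for scaling are invented for illustration.

```python
import numpy as np

# Invented people described as (age in years, salary).
a = np.array([25.0, 50000.0])
b = np.array([65.0, 50500.0])  # very different age, almost the same salary
c = np.array([26.0, 52000.0])  # almost the same age, slightly different salary

# On the raw scale, salary dominates: b looks closer to a than c does.
raw_ab = np.linalg.norm(a - b)
raw_ac = np.linalg.norm(a - c)
print(raw_ab < raw_ac)  # True

# Hypothetical standard deviations for age and salary, assumed for this sketch.
std = np.array([10.0, 10000.0])

# After dividing by the standard deviations, the 40-year age gap counts again.
scaled_ab = np.linalg.norm((a - b) / std)
scaled_ac = np.linalg.norm((a - c) / std)
print(scaled_ab > scaled_ac)  # True
```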

Important Notes

  • Selective Standardization: The code standardizes only the columns from index 3 onwards and leaves the one-hot encoded columns unchanged. Check that this slice matches the numerical columns in your own data.
  • Only Scale Numerical Features: StandardScaler is designed for numerical features. Applying it to one-hot encoded columns is usually unnecessary, since they already take only the values 0 and 1, and it makes them harder to interpret.