Simple Linear Regression

In this lesson, I will discuss regression analysis using a machine learning package. The following are the steps to carry it out.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

NumPy (Numerical Python) is a fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays efficiently. NumPy arrays are faster and more compact than Python lists, providing an array-oriented computing environment that is both efficient and convenient.

Matplotlib is a plotting library for the Python programming language and its numerical mathematics extension NumPy. matplotlib.pyplot is a collection of command-style functions that make Matplotlib work like MATLAB. It provides visual representations of data, which make it easier to identify patterns, trends, and correlations.

Pandas is a Python library for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The DataFrame is one of these structures, and it is a powerful tool widely used in data science.

dataset = pd.read_csv('Salary_Data.csv')

  • pd.read_csv(): This function is part of the Pandas library and is used to read a comma-separated values (CSV) file into a Pandas DataFrame. CSV files are a common format for storing tabular data.
  • 'Salary_Data.csv': This is the filename or the path to the file that contains the data you want to load. This string should be the name of the file if it is in the same directory as the Python script or notebook, or it can be a path to the file located elsewhere.
  • dataset: This is the variable name assigned to the DataFrame created from the CSV file. After this line of code executes, dataset contains all the data loaded from the Salary_Data.csv file, structured in a format that can be easily manipulated using further Pandas functions (a quick inspection sketch follows below).
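
To confirm the file loaded as expected, it can help to inspect the DataFrame before going further. A minimal sketch; the column names YearsExperience and Salary are assumptions about how Salary_Data.csv is laid out:

# Inspect the loaded data (column names below are assumed, not confirmed by this lesson)
print(dataset.head())       # first five rows, e.g. YearsExperience and Salary columns
print(dataset.shape)        # (number of rows, number of columns)
print(dataset.describe())   # summary statistics for the numeric columns
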
# Extracting features and target variable from the dataset
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
  • dataset.iloc[:, :-1].values: This line selects all rows (:) and all columns except the last one (:-1) from the DataFrame dataset. The .values attribute converts the selected portion of the DataFrame into a NumPy array, which is typically required for machine learning algorithms in scikit-learn. This part usually contains the independent variables or features.
  • dataset.iloc[:, -1].values: Similar to the previous line, this selects all rows and only the last column of the DataFrame, which is typically the dependent variable or target. This is also converted into a NumPy array (see the sanity check below).
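
As a quick sanity check, you can confirm the shapes of the resulting arrays. The sketch below assumes a single feature column; the row count of 30 is also only an assumption about the file:

# X should be 2D with shape (n_samples, n_features); y should be 1D with shape (n_samples,)
print(X.shape)   # e.g. (30, 1) for 30 rows and one feature column (assumed)
print(y.shape)   # e.g. (30,)
print(type(X))   # <class 'numpy.ndarray'>
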
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

Importing the train_test_split function: from sklearn.model_selection import train_test_split

This line imports the train_test_split function from the sklearn.model_selection module. This function is used to split a dataset into a training set and a testing set. This is a common practice in machine learning to evaluate the performance of models.

Splitting the dataset: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

  • train_test_split(X, y, test_size = 1/3, random_state = 0): This function splits the features (X) and the labels (y) into training (X_train, y_train) and testing (X_test, y_test) sets.
    • test_size = 1/3: This argument specifies that 1/3 of the data will be used for testing, while the remaining 2/3 will be used for training the model.
    • random_state = 0: This is used for initializing the internal random number generator, which will decide the splitting of data into train and test indices. Providing a fixed random_state ensures that the splits are reproducible and consistent during repeated runs.

Use Cases and Example:

This setup is typically used in supervised learning contexts where you have a clear distinction between input features (X) and an output target (y) that you wish to predict. After splitting, you can fit a model on X_train and y_train and later evaluate its performance on X_test and y_test.
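
Before fitting anything, it can also be worth confirming that the split came out as intended. A minimal check; the row counts in the comments assume a 30-row dataset:

# With test_size = 1/3, roughly one third of the rows end up in the test set
print(len(X_train), len(X_test))   # e.g. 20 and 10 for a 30-row dataset (assumed)
print(len(y_train), len(y_test))   # the targets are split the same way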

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

1. Importing the LinearRegression class

from sklearn.linear_model import LinearRegression

This line imports the LinearRegression class from sklearn.linear_model. Linear regression is a basic and commonly used type of predictive analysis, primarily used to find the linear relationship between variables and predict a quantitative response.

2. Creating an instance of LinearRegression

regressor = LinearRegression()

Here, regressor is an instance of the LinearRegression class. This object will be used to access the linear regression functionality. The LinearRegression() constructor can accept several parameters to customize its behavior, but in this case, we’re using the default settings, which include fitting the intercept (fit_intercept=True).

3. Training the model

regressor.fit(X_train, y_train)

  • fit(X_train, y_train): This method fits the linear regression model to the training data. The X_train contains the feature(s) (independent variables) of the training data, and y_train contains the corresponding target (dependent variable) values. The fitting process involves finding the coefficients (parameters) for the regression equation that minimizes the error between the predicted and actual values in the training data.

Explanation of the Process:

The purpose of training the model is to find a linear relationship, often described as:

y = β₀ + β₁x₁ + ⋯ + βₙxₙ + ε

where:

  • β₀, β₁, …, βₙ are the coefficients.
  • x₁, …, xₙ are the features.
  • ε is the error term.

In the context of linear regression, training the model means calculating the best-fit line that minimizes the sum of the squared differences between the observed values in the dataset and those predicted by the model—a method known as Ordinary Least Squares (OLS).
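
For intuition, here is a sketch of that OLS computation solved in closed form with NumPy’s least-squares routine. This is purely illustrative, not how scikit-learn is implemented internally, but the resulting parameters should match the fitted model:

# Illustration only: solve ordinary least squares directly.
# The results should agree with regressor.intercept_ and regressor.coef_.
X_design = np.hstack([np.ones((X_train.shape[0], 1)), X_train])  # prepend an intercept column
beta, *_ = np.linalg.lstsq(X_design, y_train, rcond=None)        # minimize the sum of squared errors
print("Intercept:", beta[0])
print("Coefficient(s):", beta[1:])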

Example Use:

Once the model is trained, you can use it to make predictions, evaluate its performance, and interpret the results:

# Making predictions on the test set
y_pred = regressor.predict(X_test)

# Example: Displaying the coefficients of the model
print("Coefficients:", regressor.coef_)
print("Intercept:", regressor.intercept_)

# Evaluating the model
from sklearn.metrics import r2_score
print("R^2 Score:", r2_score(y_test, y_pred))
  • regressor.predict(X_test): This uses the trained model to make predictions on new data (the testing set in this case).
  • regressor.coef_ and regressor.intercept_: These properties store the coefficients of the features and the intercept of the model respectively.
  • r2_score: This function computes the R-squared value, which is a statistical measure of how close the data are to the fitted regression line.

This workflow is fundamental in predictive analytics and is widely used across various domains, such as economics, biological sciences, and social sciences, to make informed decisions based on the underlying trends the data suggests.

Purpose of the Prediction

The purpose of this step in a machine learning workflow is to evaluate how well the trained model performs on new, unseen data. The predictions y_pred can now be compared against the actual target values y_test (which were not provided to the model during the training phase) to assess the model’s accuracy, efficacy, and generalization capability.
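
One quick, informal way to make that comparison is to print predictions and actual values side by side. A minimal sketch:

# Compare each prediction against the corresponding actual salary in the test set
for actual, predicted in zip(y_test, y_pred):
    print(f"actual: {actual:10.2f}  predicted: {predicted:10.2f}  error: {actual - predicted:10.2f}")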

Example Usage for Model Evaluation

After making predictions, you would typically proceed to evaluate the model’s performance. Common metrics for regression tasks include the Mean Squared Error (MSE) and R-squared (the coefficient of determination):

from sklearn.metrics import mean_squared_error, r2_score

# Calculating Mean Squared Error
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse}')

# Calculating R-squared
r_squared = r2_score(y_test, y_pred)
print(f'R^2 Score: {r_squared}')
  • mean_squared_error(y_test, y_pred): This calculates the mean of the squares of the differences between the actual (y_test) and predicted (y_pred) values, providing a measure of the model’s prediction error.
  • r2_score(y_test, y_pred): This calculates the proportion of variance in the dependent variable that is predictable from the independent variable(s), offering insight into the goodness of fit of the model.

These evaluations help to determine how predictions deviate from the actual data and how much of the variance in the data the model is able to explain, respectively.
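
Because MSE is expressed in squared units, it is often reported as its square root (RMSE), which is in the same units as the target, here salary. A small follow-up sketch, reusing the np import and the mse value from above:

# RMSE puts the error back into the target's own units (salary)
rmse = np.sqrt(mse)
print(f'Root Mean Squared Error: {rmse}')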

plt.scatter(X_train, y_train, color = 'red')
plt.plot(X_train, regressor.predict(X_train), color = 'blue')
plt.title('Salary vs Experience (Training set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

Here’s a breakdown of what each line does:

Visualization Code Breakdown

plt.scatter(X_train, y_train, color = 'red')

  • plt.scatter: This function creates a scatter plot, which is ideal for visualizing the relationship between two numeric variables. It plots y_train against X_train as a collection of points.
  • X_train and y_train: These are the features and target values from the training dataset, respectively.
  • color = 'red': This parameter sets the color of the scatter plot points to red.

plt.plot(X_train, regressor.predict(X_train), color = 'blue')

  • plt.plot: Unlike plt.scatter, which displays individual data points, plt.plot is used to draw a line graph. This line represents the predictions made by the linear regression model across the X_train data.
  • regressor.predict(X_train): This method call generates the predicted y values (salary) for the given X_train data (years of experience). These predicted values are used to plot the regression line.
  • color = 'blue': Sets the color of the regression line to blue.
plt.title('Salary vs Experience (Training set)') 
plt.xlabel('Years of Experience') 
plt.ylabel('Salary')
  • These three lines add a title to the plot (plt.title), and labels to the x-axis (plt.xlabel) and y-axis (plt.ylabel). These help in understanding what the plot represents: the relationship between years of experience and salary as observed and predicted from the training data.

plt.show(): This function displays the plot. When using Matplotlib in a script, this call is essential: it tells Matplotlib to render the figure so the user can see it.

Purpose of This Visualization

This visualization serves multiple purposes:

  • Data Exploration: It provides a visual representation of the underlying data distribution, showing how salary varies with years of experience.
  • Model Evaluation: By overlaying the regression line on the scatter plot of actual data, it visually assesses how well the model fits the training data. The closer the points are to the line, the better the model’s predictions match the actual values.

This kind of plot is particularly useful in presentations or reports to convey, at a glance, how effective the linear regression model is in capturing the trend in the data. It’s a straightforward visual assessment tool for model performance before proceeding with more detailed statistical evaluation.

plt.scatter(X_test, y_test, color = 'red')
pred_sal = regressor.predict(X_test)
plt.plot(X_test, pred_sal, color = 'blue')
plt.title('Salary vs Experience (Test set)')
plt.xlabel('Years of Experience')
plt.ylabel('Salary')
plt.show()

Visualization Code Breakdown

plt.scatter(X_test, y_test, color = 'red')

  • plt.scatter: This function creates a scatter plot, plotting the actual test data points (y_test) against X_test.
  • X_test and y_test: These represent the features (years of experience) and the corresponding salaries from the test dataset.
  • color = 'red': Sets the color of the scatter points to red, making them distinct and visually striking.

pred_sal = regressor.predict(X_test)

plt.plot(X_test, pred_sal, color = 'blue')

  • regressor.predict(X_test): Generates predictions for the X_test data. These predictions represent the model’s output when applied to new, unseen data, helping evaluate how it generalizes beyond the training data.
  • plt.plot: This line draws the regression line based on the predicted values pred_sal over the X_test data. It shows how the model predicts the salary based on years of experience for the test set.
  • color = 'blue': Sets the color of the regression line to blue, providing a clear visual contrast to the red scatter points.

plt.title('Salary vs Experience (Test set)')

plt.xlabel('Years of Experience')

plt.ylabel('Salary')

  • These commands set the title of the plot and label the axes, which is crucial for clarity and understanding when presenting the plot to an audience. It specifies that this plot is for the test set, helping distinguish it from any training set plots.
  • plt.show(): Displays the finalized plot. This is necessary to render the plot when using scripts or interactive sessions in Python environments like Jupyter Notebooks.

Purpose of This Visualization

This plot serves to:

  • Evaluate Model Generalization: By plotting the test data and the corresponding model predictions, it visually assesses how well the model generalizes to new data. This is crucial for understanding the model’s performance in practical scenarios.
  • Visual Assessment: Provides a straightforward and intuitive visual assessment of the model’s predictive accuracy on the test set. The closer the blue line (model predictions) aligns with the red points (actual values), the better the model is at making predictions.

Such visualizations are vital for presentations, reports, or simply to get a quick visual confirmation of model performance. They help stakeholders, who may not be familiar with the underlying statistics, understand model effectiveness in a clear and straightforward manner.

Code Explanation

regressor.predict([[40]])

  • regressor.predict(): This method is used to make predictions using the trained linear regression model. You provide the input features inside the method as an argument, and it returns the predicted output.
  • [[40]]: The input to the predict() method must be in the same format as the data the model was trained on. Here, 40 is encapsulated within two pairs of brackets:
    • The outer brackets represent a list (or array-like structure) which could contain multiple samples if you were making more than one prediction at a time (see the sketch after this list).
    • The inner brackets represent the feature array for a single sample. Since the model expects a 2D array as input (like the X_train and X_test arrays it was trained on), you provide the input as a list within a list.
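
To make the 2D-input requirement concrete, here is a minimal sketch that passes several samples in one call; the experience values are arbitrary illustrations:

# Each inner list is one sample; a single call can predict for several samples at once
experience_values = [[1.5], [5.0], [10.0]]   # arbitrary example inputs
predicted_salaries = regressor.predict(experience_values)
print(predicted_salaries)                    # one predicted salary per input sample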

How It Works

The model uses the coefficient(s) and intercept it learned during the training phase to calculate the predicted salary for someone with 40 years of experience. The equation for this linear regression would typically look something like this: Salary = β₁ × Years of Experience + β₀

where β₁ is the coefficient for the “Years of Experience” feature, and β₀ is the intercept.
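
You can verify this arithmetic directly from the fitted attributes; a minimal sketch:

# Reproduce the prediction by hand from the learned parameters
manual_prediction = regressor.coef_[0] * 40 + regressor.intercept_
print(manual_prediction)   # should equal regressor.predict([[40]])[0]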

What It Returns

The predict() method returns the predicted salary value encapsulated in an array because it’s designed to handle multiple predictions at once. If you want just the single predicted value, you might access it by indexing the result:

predicted_salary = regressor.predict([[40]])[0]

Practical Use

This functionality is extremely useful when you need to make predictions based on the model you’ve developed. For example, if you’re developing a tool for an HR department to estimate salary ranges based on employee experience, this model could provide quick estimates.
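
As a sketch of how that might look, here is a small hypothetical helper; the function name estimate_salary is an invention for illustration, not part of any library:

# Hypothetical convenience wrapper for a salary-estimation tool
def estimate_salary(years_of_experience):
    """Return the model's salary estimate for the given years of experience."""
    return regressor.predict([[years_of_experience]])[0]

print(estimate_salary(5))   # estimated salary for 5 years of experience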