Multiple Linear Regression

We will start by importing the libraries needed to read a dataset from a CSV file and to separate it into features (X) and a target variable (y). These libraries are pandas, numpy, and matplotlib.

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
  • numpy is imported for numerical operations.
  • matplotlib.pyplot is imported for plotting graphs (it is used later to visualize the regression results).
  • pandas is imported for data manipulation and analysis.

Reading the dataset:

dataset = pd.read_csv('50_Startups.csv')
  • The pd.read_csv('50_Startups.csv') function reads the data from the CSV file named '50_Startups.csv' and stores it in a DataFrame named dataset.


Separating features and target variables:

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
  • dataset.iloc[:, :-1].values selects all rows (:) and all columns except the last one (:-1) and converts them to a NumPy array. This is stored in the variable X and represents the features of the dataset.
  • dataset.iloc[:, -1].values selects all rows (:) and only the last column (-1), converting it to a NumPy array. This is stored in the variable y and represents the target variable (i.e., the dependent variable) of the dataset.

Let’s examine the contents of the CSV file to understand the specific features and target variable in this dataset; displaying the first few rows gives a quick overview, as shown below.
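
A minimal way to peek at the data, using the dataset DataFrame created above:

print(dataset.head())     # first five rows
print(dataset.shape)      # (50, 5): 50 startups, 5 columns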

The dataset contains information on 50 startups with the following columns:

  1. R&D Spend: Amount of money spent on research and development.
  2. Administration: Amount of money spent on administration.
  3. Marketing Spend: Amount of money spent on marketing.
  4. State: The state where the startup is located (New York, California, Florida).
  5. Profit: The profit earned by the startup.

Given this structure, here’s a detailed explanation of the feature and target variable extraction:

Features (X) and Target Variable (y)

X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

The first line selects all rows and all columns except the last one (Profit). Thus, X includes the columns R&D Spend, Administration, Marketing Spend, and State.

The second line selects all rows and only the last column (Profit). Thus, y is the profit earned by the startups.
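
A quick check of the first row confirms this layout, and in particular that the categorical State column sits at index 3 of X (which matters for the encoding step below); the printed values are illustrative, matching the example table later in this section:

print(X[0])   # e.g. [165349.2 136897.8 471784.1 'New York']
print(y[0])   # e.g. 192261.83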

The following code snippet uses the ColumnTransformer and OneHotEncoder from the sklearn library to encode categorical data in the dataset. Here’s a detailed explanation of each part of the code:

Importing necessary classes:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
  • ColumnTransformer is used to apply different preprocessing transformations to different columns of the dataset.
  • OneHotEncoder is used to convert categorical variables into a format that can be provided to ML algorithms to do a better job in prediction.
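
To see what OneHotEncoder does on its own, here is a minimal standalone sketch on a toy column of state names (the values are taken from the State column described above):

from sklearn.preprocessing import OneHotEncoder
import numpy as np

states = np.array([['New York'], ['California'], ['Florida'], ['New York']])
encoder = OneHotEncoder()
one_hot = encoder.fit_transform(states)

print(encoder.categories_)   # categories in sorted order: California, Florida, New York
print(one_hot.toarray())     # one 0/1 column per category, one row per sample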

Creating a ColumnTransformer:

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
  • The ColumnTransformer is initialized with the following parameters:
    • transformers=[('encoder', OneHotEncoder(), [3])]: This parameter specifies the transformations to apply. It tells the ColumnTransformer to apply OneHotEncoder to the column at index 3 (the “State” column).
    • remainder='passthrough': This parameter specifies that all other columns (those not transformed) should be passed through without any changes.

Applying the ColumnTransformer to the feature matrix X:

X = np.array(ct.fit_transform(X))
  • ct.fit_transform(X) fits the ColumnTransformer to the data and then transforms it. Specifically, it:
    • Fits the OneHotEncoder to the “State” column (index 3).
    • Transforms the “State” column into multiple columns representing the one-hot encoded values.
    • Passes through all other columns without changes.
  • The transformed data is converted to a NumPy array using np.array() and assigned back to X.
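
A quick sanity check of the result (the shape assumes the 50-row dataset described above; the dummy columns produced by the encoder come first, followed by the passthrough columns):

print(X.shape)   # (50, 6): 3 one-hot columns + R&D Spend, Administration, Marketing Spend
print(X[0])      # first row: dummies first, then the numeric columns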

Example Transformation

Before transformation, X might look like this:

R&D Spend | Administration | Marketing Spend | State
165349.20 | 136897.80      | 471784.10       | New York
162597.70 | 151377.59      | 443898.53       | California
153441.51 | 101145.55      | 407934.54       | Florida

After transformation, X will look like this:

State_California | State_Florida | State_New York | R&D Spend | Administration | Marketing Spend
0                | 0             | 1              | 165349.20 | 136897.80      | 471784.10
1                | 0             | 0              | 162597.70 | 151377.59      | 443898.53
0                | 1             | 0              | 153441.51 | 101145.55      | 407934.54

In this transformed dataset:

  • The “State” column has been replaced by three columns (State_California, State_Florida, State_New York) representing the one-hot encoded values of the original “State” column.
  • The other columns (R&D Spend, Administration, Marketing Spend) remain unchanged.



The following code uses the train_test_split function from the sklearn.model_selection module to split the dataset into training and testing sets. Here’s a detailed explanation of each part of the code:

Importing the train_test_split function:

from sklearn.model_selection import train_test_split
  • This function is used to split arrays or matrices into random train and test subsets.

Splitting the dataset:

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
  • The train_test_split function is called with the following parameters:
    • X: The feature matrix.
    • y: The target variable array.
    • test_size=0.2: This parameter specifies that 20% of the dataset should be allocated to the test set and the remaining 80% to the training set.
    • random_state=0: This parameter ensures that the split is reproducible. By setting the random_state to 0, you ensure that the function produces the same split every time it is run.

Output Variables

  • X_train: This is the feature matrix for the training set.
  • X_test: This is the feature matrix for the test set.
  • y_train: This is the target variable array for the training set.
  • y_test: This is the target variable array for the test set.
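
With the 50-row dataset described above, the sizes of the four outputs can be verified directly (a quick check, not part of the original pipeline):

print(X_train.shape, X_test.shape)   # (40, 6) (10, 6)
print(y_train.shape, y_test.shape)   # (40,) (10,)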

Example

Let’s assume the dataset has 10 samples (for simplicity):

Sample | Feature 1 | Feature 2 | Feature 3 | Target
1      | 0.1       | 0.2       | 0.3       | 10
2      | 0.4       | 0.5       | 0.6       | 20
3      | 0.7       | 0.8       | 0.9       | 30
4      | 1.0       | 1.1       | 1.2       | 40
5      | 1.3       | 1.4       | 1.5       | 50
6      | 1.6       | 1.7       | 1.8       | 60
7      | 1.9       | 2.0       | 2.1       | 70
8      | 2.2       | 2.3       | 2.4       | 80
9      | 2.5       | 2.6       | 2.7       | 90
10     | 2.8       | 2.9       | 3.0       | 100

After splitting with test_size=0.2, we might get (depending on the random_state):

Training set (80% of data):

Sample | Feature 1 | Feature 2 | Feature 3 | Target
1      | 0.1       | 0.2       | 0.3       | 10
3      | 0.7       | 0.8       | 0.9       | 30
4      | 1.0       | 1.1       | 1.2       | 40
5      | 1.3       | 1.4       | 1.5       | 50
7      | 1.9       | 2.0       | 2.1       | 70
8      | 2.2       | 2.3       | 2.4       | 80
9      | 2.5       | 2.6       | 2.7       | 90
10     | 2.8       | 2.9       | 3.0       | 100

Test set (20% of data):

Sample | Feature 1 | Feature 2 | Feature 3 | Target
2      | 0.4       | 0.5       | 0.6       | 20
6      | 1.6       | 1.7       | 1.8       | 60

The split is random but reproducible due to the random_state parameter. The training set will be used to train the model, and the test set will be used to evaluate the model’s performance.

The following code uses the LinearRegression class from the sklearn.linear_model module to create and train a linear regression model. Here’s a detailed explanation of each part of the code:

Importing the LinearRegression class:

from sklearn.linear_model import LinearRegression
  • The LinearRegression class is used to perform linear regression, which is a simple and common type of predictive analysis.

Creating an instance of the LinearRegression class:

regressor = LinearRegression()
  • This line creates an instance of the LinearRegression class named regressor. This instance will be used to fit the model to the training data and make predictions.

Training the linear regression model:

regressor.fit(X_train, y_train)
  • The fit method is called on the regressor instance with X_train and y_train as arguments.
  • X_train is the feature matrix for the training set.
  • y_train is the target variable array for the training set.
  • The fit method trains the linear regression model on the training data by finding the best-fitting line (or hyperplane in higher dimensions) that minimizes the sum of the squared differences between the actual and predicted values.
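
To make “minimizes the sum of the squared differences” concrete, here is a rough numpy-only sketch of the same ordinary least-squares fit; this is an illustration of the idea, not a claim about how LinearRegression is implemented internally:

import numpy as np

# Append a column of ones so the intercept is estimated together with the weights.
X_design = np.hstack([X_train.astype(float), np.ones((len(X_train), 1))])
solution, *_ = np.linalg.lstsq(X_design, y_train, rcond=None)
weights, intercept = solution[:-1], solution[-1]   # comparable to coef_ and intercept_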

Example

Assume that the training data looks like this (simplified for illustration):

Training Set:

State_California | State_Florida | State_New York | R&D Spend | Administration | Marketing Spend | Profit
0                | 0             | 1              | 165349.20 | 136897.80      | 471784.10       | 192261.83
1                | 0             | 0              | 162597.70 | 151377.59      | 443898.53       | 191792.06
0                | 1             | 0              | 153441.51 | 101145.55      | 407934.54       | 191050.39

After calling regressor.fit(X_train, y_train), the linear regression model will learn the relationship between the features (X_train) and the target variable (y_train). Specifically, it will determine the coefficients (weights) for each feature that minimize the error in predicting the profit.

Output of the Training Process

The model will store the learned coefficients (weights) and the intercept. These parameters can be accessed using:

  • regressor.coef_: The coefficients for the features.
  • regressor.intercept_: The intercept of the regression line.

These parameters define the best-fit line (or hyperplane) used to make predictions on new data.
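
For instance (illustrative; the exact numbers depend on the data and the split):

print(regressor.coef_)        # one weight per column of X_train
print(regressor.intercept_)   # the constant term of the model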



The following code snippet uses the trained linear regression model to make predictions on the test set and then prints the predicted values alongside the actual values for comparison. Here’s a detailed explanation of each part of the code:

Making predictions on the test set:

y_pred = regressor.predict(X_test)
  • The predict method is called on the regressor instance with X_test as the argument.
  • X_test is the feature matrix for the test set.
  • The predict method uses the trained model to predict the target variable (Profit) for the test set features.
  • The predicted values are stored in the variable y_pred.

Setting print options for NumPy arrays:

np.set_printoptions(precision=2)
  • This line sets the print options for NumPy arrays to display floating-point numbers with 2 decimal places of precision. This makes the output more readable.

Printing the predicted and actual values:

print(np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1))
  • y_pred.reshape(len(y_pred), 1): This reshapes the y_pred array to have a single column and as many rows as there are elements in y_pred.
  • y_test.reshape(len(y_test), 1): This reshapes the y_test array to have a single column and as many rows as there are elements in y_test.
  • np.concatenate((y_pred.reshape(len(y_pred), 1), y_test.reshape(len(y_test), 1)), 1): This concatenates the reshaped y_pred and y_test arrays along the columns (axis 1), resulting in a 2D array where each row contains a predicted value and the corresponding actual value.
  • print(...): This prints the concatenated array.

Example Output

Assume that y_pred contains the predicted profits and y_test contains the actual profits for the test set:

y_pred = [105000.50, 120000.75, 135000.20]
y_test = [108000.00, 123000.00, 138000.00]

After reshaping and concatenating, the output would look like this:

[[105000.50 108000.00]
 [120000.75 123000.00]
 [135000.20 138000.00]]

This output shows the predicted profits alongside the actual profits for each test sample, allowing for easy comparison of the model’s performance.
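
Beyond eyeballing the two columns, the comparison can also be summarized numerically; a minimal sketch using sklearn’s standard regression metrics (this goes beyond the original snippet):

from sklearn.metrics import r2_score, mean_absolute_error

print(r2_score(y_test, y_pred))             # 1.0 would be a perfect fit
print(mean_absolute_error(y_test, y_pred))  # average absolute error, in units of profit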

To draw a graph of the regression results, you typically want to plot the actual vs. predicted values to visualize how well the model performed. Here’s how you can create such a plot using matplotlib:

  1. Plotting Actual vs. Predicted Values:
    • This plot will show how closely the predicted values match the actual values. Ideally, if the predictions are perfect, all points should lie on the line y = x.
  2. Plotting Residuals:
    • Residuals are the differences between the actual and predicted values. Plotting residuals can help you understand the distribution of errors.

Here’s the code to create these plots:

Plotting Actual vs. Predicted Values

import matplotlib.pyplot as plt

# Create a scatter plot of actual vs. predicted values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue', edgecolor='k', alpha=0.7)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linewidth=2)  # y = x line
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values')
plt.show()

Plotting Residuals

# Calculate residuals
residuals = y_test - y_pred

# Create a scatter plot of residuals
plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, color='blue', edgecolor='k', alpha=0.7)
plt.axhline(y=0, color='red', linewidth=2)  # y = 0 line
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Predicted Values')
plt.show()

Full Example Code

Here’s the complete example combining both plots:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
import pandas as pd

# Load dataset
file_path = '50_Startups.csv'
dataset = pd.read_csv(file_path)

# Separating features and target variable
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values

# Encoding categorical data
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))

# Splitting the dataset into the training set and test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Training the Linear Regression model on the training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Making predictions on the test set
y_pred = regressor.predict(X_test)

# Plotting Actual vs. Predicted Values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue', edgecolor='k', alpha=0.7)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], color='red', linewidth=2)  # y = x line
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values')
plt.show()

# Plotting Residuals
residuals = y_test - y_pred

plt.figure(figsize=(10, 6))
plt.scatter(y_pred, residuals, color='blue', edgecolor='k', alpha=0.7)
plt.axhline(y=0, color='red', linewidth=2)  # y = 0 line
plt.xlabel('Predicted Values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Predicted Values')
plt.show()