In this lesson, I will discuss regression analysis using scikit-learn, a machine learning package for Python. The steps are as follows.

    import numpy as np
    import matplotlib.pyplot as plt
    import pandas as pd

**NumPy** (Numerical Python) is a fundamental package for scientific computing in Python. It provides support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays efficiently. NumPy arrays are faster and more compact than Python lists, providing an array-oriented computing environment that is both efficient and convenient.
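As a small illustration of array-oriented computing, the sketch below applies one arithmetic expression to a whole array at once. The values are made up for illustration and do not come from this lesson's dataset.

```python
import numpy as np

# Hypothetical years-of-experience values (illustrative only)
years = np.array([1.0, 2.5, 4.0, 6.5])

# One vectorized expression operates on every element; no explicit loop is needed
months = years * 12

print(months)        # element-wise result
print(months.shape)  # arrays carry their shape with them
```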

**Matplotlib** is a plotting library for the Python programming language and its numerical mathematics extension NumPy. `matplotlib.pyplot` is a collection of command-style functions that make Matplotlib work like MATLAB. It provides a visual representation of data, which makes it easier to identify patterns, trends, and correlations.

**Pandas** is a library written for the Python programming language for data manipulation and analysis. In particular, it offers data structures and operations for manipulating numerical tables and time series. The DataFrame is one of these structures, and it is a powerful tool widely used in data science.

    dataset = pd.read_csv('Salary_Data.csv')

- `pd.read_csv()`: This function is part of the Pandas library and is used to read a comma-separated values (CSV) file into a Pandas DataFrame. CSV files are a common format for storing tabular data.
- `'Salary_Data.csv'`: This is the filename or the path to the file that contains the data you want to load. This string should be the name of the file if it is in the same directory as the Python script or notebook, or it can be a path to the file located elsewhere.
- `dataset`: This is the variable name assigned to the DataFrame created from the CSV file. After this line of code executes, `dataset` contains all the data loaded from the `Salary_Data.csv` file, structured in a format that can be easily manipulated using further Pandas functions.
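To try `pd.read_csv()` without the actual file, here is a minimal sketch that reads CSV text from memory. The column names and rows are assumptions standing in for `Salary_Data.csv`, whose real contents are not shown in this lesson.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for Salary_Data.csv (column names are assumed)
csv_text = """YearsExperience,Salary
1.0,40000
2.0,45000
3.0,50000
"""

# pd.read_csv accepts a file path or any file-like object
dataset = pd.read_csv(io.StringIO(csv_text))

print(dataset.shape)   # (rows, columns)
print(dataset.head())  # first rows of the DataFrame
```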

    # Extracting features and target variable from the dataset
    X = dataset.iloc[:, :-1].values
    y = dataset.iloc[:, -1].values

- `dataset.iloc[:, :-1].values`: This line selects all rows (`:`) and all columns except the last one (`:-1`) from the DataFrame `dataset`. The `.values` attribute converts the selected portion of the DataFrame into a NumPy array, the input format that scikit-learn estimators commonly work with. This part usually contains the independent variables or features.
- `dataset.iloc[:, -1].values`: Similar to the previous line, this selects all rows and only the last column of the DataFrame, which typically is the dependent variable or target. This is also converted into a NumPy array.
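The difference between the two slices is easiest to see from the resulting shapes. The sketch below uses a small hypothetical DataFrame (the column names are assumed, not taken from the real file): `:-1` keeps a 2D array even with one feature column, while `-1` returns a 1D array.

```python
import pandas as pd

# Hypothetical two-column frame; the real Salary_Data.csv layout is assumed
dataset = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0],
    "Salary": [40000, 45000, 50000, 55000],
})

X = dataset.iloc[:, :-1].values  # every column except the last -> 2D array
y = dataset.iloc[:, -1].values   # only the last column -> 1D array

print(X.shape)  # (4, 1)
print(y.shape)  # (4,)
```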

    from sklearn.model_selection import train_test_split
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

**Importing the `train_test_split` function:** `from sklearn.model_selection import train_test_split`

This line imports the `train_test_split` function from the `sklearn.model_selection` module. This function is used to split a dataset into a training set and a testing set. This is a common practice in machine learning to evaluate the performance of models.

**Splitting the dataset:** `X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)`

- `train_test_split(X, y, test_size = 1/3, random_state = 0)`: This function splits the features (`X`) and the labels (`y`) into training (`X_train`, `y_train`) and testing (`X_test`, `y_test`) sets.
- `test_size = 1/3`: This argument specifies that 1/3 of the data will be used for testing, while the remaining 2/3 will be used for training the model.
- `random_state = 0`: This is used for initializing the internal random number generator, which will decide the splitting of data into train and test indices. Providing a fixed `random_state` ensures that the splits are reproducible and consistent during repeated runs.
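A quick way to check both arguments is to split a small synthetic dataset and inspect the result. The 30 samples below are made up for illustration; the sketch shows the resulting set sizes and that repeating the call with the same `random_state` reproduces the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 30 samples with a single feature
X = np.arange(30, dtype=float).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0
)

# Repeating the call with the same random_state gives the same split
X_train2, X_test2, _, _ = train_test_split(X, y, test_size=1/3, random_state=0)

print(len(X_train), len(X_test))        # training and test set sizes
print(np.array_equal(X_test, X_test2))  # True when the split is reproduced
```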

### Use Cases and Example:

This setup is typically used in supervised learning contexts where you have a clear distinction between input features (X) and an output target (y) that you wish to predict. After splitting, you can fit a model on `X_train` and `y_train` and later evaluate its performance on `X_test` and `y_test`.

    from sklearn.linear_model import LinearRegression
    regressor = LinearRegression()
    regressor.fit(X_train, y_train)

### 1. **Importing the LinearRegression class**

**`from sklearn.linear_model import LinearRegression`**

This line imports the `LinearRegression` class from `sklearn.linear_model`. Linear regression is a basic and commonly used technique for predictive analysis. It models the linear relationship between one or more input variables and a quantitative response.

### 2. **Creating an instance of LinearRegression**

**`regressor = LinearRegression()`**

Here, `regressor` is an instance of the `LinearRegression` class. This object will be used to access the linear regression functionality. The `LinearRegression()` constructor can accept several parameters to customize its behavior, but in this case, we’re using the default settings, which include fitting the intercept (`fit_intercept=True`).
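To see what the intercept setting changes, the sketch below fits two models on synthetic data that lies exactly on y = 2x + 1 (made-up values, not the salary data). It uses `fit`, which is explained in the next step. With the default, the model recovers the intercept; with `fit_intercept=False`, the line is forced through the origin and `intercept_` is reported as 0.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 2x + 1 (illustrative only)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel() + 1.0

# Default behaviour: fit_intercept=True
with_intercept = LinearRegression().fit(X, y)

# Line forced through the origin
through_origin = LinearRegression(fit_intercept=False).fit(X, y)

print(with_intercept.coef_, with_intercept.intercept_)
print(through_origin.coef_, through_origin.intercept_)
```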

### 3. **Training the model**

**`regressor.fit(X_train, y_train)`**

- `fit(X_train, y_train)`: This method fits the linear regression model to the training data. `X_train` contains the feature(s) (independent variables) of the training data, and `y_train` contains the corresponding target (dependent variable) values. The fitting process involves finding the coefficients (parameters) for the regression equation that minimize the error between the predicted and actual values in the training data.

### Explanation of the Process:

The purpose of training the model is to find a linear relationship, often described as:

$$y = \beta_0 + \beta_1 x_1 + \cdots + \beta_n x_n + \epsilon$$

where:

- $\beta_0, \beta_1, \dots, \beta_n$ are the coefficients ($\beta_0$ is the intercept).
- $x_1, \dots, x_n$ are the features.
- $\epsilon$ is the error term.

In the context of linear regression, training the model means calculating the best-fit line that minimizes the sum of the squared differences between the observed values in the dataset and those predicted by the model—a method known as Ordinary Least Squares (OLS).
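As a rough check on the OLS description, the sketch below solves the least-squares problem directly with NumPy and compares the result with `LinearRegression`. The data is synthetic (a noisy line with made-up slope and intercept), so the numbers mean nothing beyond showing that the two approaches agree.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic noisy line; slope 3 and intercept 5 are arbitrary choices
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(20, 1))
y = 3.0 * X.ravel() + 5.0 + rng.normal(0, 1, size=20)

# Design matrix with a leading column of ones so beta_0 is estimated too
A = np.column_stack([np.ones(len(X)), X])

# Least-squares solution minimizing the sum of squared residuals
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

model = LinearRegression().fit(X, y)

print(beta)                           # [beta_0, beta_1] from NumPy
print(model.intercept_, model.coef_)  # the same quantities from scikit-learn
```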

### Example Use:

Once the model is trained, you can use it to make predictions, evaluate its performance, and interpret the results:

    # Making predictions on the test set
    y_pred = regressor.predict(X_test)

    # Example: Displaying the coefficients of the model
    print("Coefficients:", regressor.coef_)
    print("Intercept:", regressor.intercept_)

    # Evaluating the model
    from sklearn.metrics import r2_score
    print("R^2 Score:", r2_score(y_test, y_pred))

- `regressor.predict(X_test)`: This uses the trained model to make predictions on new data (the testing set in this case).
- `regressor.coef_` and `regressor.intercept_`: These attributes store the coefficients of the features and the intercept of the model, respectively.
- `r2_score`: This function computes the R-squared value, which is a statistical measure of how close the data are to the fitted regression line.

This workflow is fundamental in predictive analytics and is widely used across various domains, such as economics, biological sciences, and social sciences, to make informed decisions based on the trends the data suggest.

### Purpose of the Prediction

The purpose of this step in a machine learning workflow is to evaluate how well the trained model performs on new, unseen data. The predictions `y_pred` can now be compared against the actual target values `y_test` (which were not provided to the model during the training phase) to assess the model’s accuracy, efficacy, and generalization capability.

### Example Usage for Model Evaluation

After making predictions, you will typically evaluate the model’s performance using one or more metrics. Two common metrics for regression tasks are the Mean Squared Error (MSE) and R-squared (the coefficient of determination):

    from sklearn.metrics import mean_squared_error, r2_score

    # Calculating Mean Squared Error
    mse = mean_squared_error(y_test, y_pred)
    print(f'Mean Squared Error: {mse}')

    # Calculating R-squared
    r_squared = r2_score(y_test, y_pred)
    print(f'R^2 Score: {r_squared}')

- `mean_squared_error(y_test, y_pred)`: This calculates the mean of the squares of the differences between the actual (`y_test`) and predicted (`y_pred`) values, providing a measure of the model’s prediction error.
- `r2_score(y_test, y_pred)`: This calculates the proportion of variance in the dependent variable that is predictable from the independent variable(s), offering insight into the goodness of fit of the model.

These evaluations help to determine how predictions deviate from the actual data and how much of the variance in the data the model is able to explain, respectively.
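Both metrics are short formulas, so it can help to compute them by hand once. The sketch below uses four made-up actual and predicted values (not results from this lesson's model) and applies the standard definitions: MSE is the mean squared residual, and R-squared is one minus the ratio of the residual sum of squares to the total sum of squares.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only
y_test = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 9.5])

# Mean Squared Error: average of the squared residuals
mse = np.mean((y_test - y_pred) ** 2)

# R-squared: 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"Mean Squared Error: {mse}")
print(f"R^2 Score: {r_squared}")
```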

    plt.scatter(X_train, y_train, color = 'red')
    plt.plot(X_train, regressor.predict(X_train), color = 'blue')
    plt.title('Salary vs Experience (Training set)')
    plt.xlabel('Years of Experience')
    plt.ylabel('Salary')
    plt.show()

Here’s a breakdown of what each line does:

### Visualization Code Breakdown

**`plt.scatter(X_train, y_train, color = 'red')`**

- `plt.scatter`: This function creates a scatter plot, which is ideal for visualizing the relationship between two numeric variables. It plots `y_train` versus `X_train` as a collection of points in the plot.
- `X_train` and `y_train`: These are the features and target values from the training dataset, respectively.
- `color = 'red'`: This parameter sets the color of the scatter plot points to red.

**`plt.plot(X_train, regressor.predict(X_train), color = 'blue')`**

- `plt.plot`: Unlike `plt.scatter`, which displays individual data points, `plt.plot` is used to draw a line graph. This line represents the predictions made by the linear regression model across the `X_train` data.
- `regressor.predict(X_train)`: This method call generates the predicted `y` values (salary) for the given `X_train` data (years of experience). These predicted values are used to plot the regression line.
- `color = 'blue'`: Sets the color of the regression line to blue.

**`plt.title('Salary vs Experience (Training set)')`**

**`plt.xlabel('Years of Experience')`**

**`plt.ylabel('Salary')`**

- These three lines add a title to the plot (`plt.title`), and labels to the x-axis (`plt.xlabel`) and y-axis (`plt.ylabel`). These help in understanding what the plot represents: the relationship between years of experience and salary as observed and predicted from the training data.

**`plt.show()`**

- `plt.show()`: This function displays the plot. When using Matplotlib in a script, this function is essential as it tells Python to render the plot so the user can see it.

### Purpose of This Visualization

This visualization serves multiple purposes:

- **Data Exploration**: It provides a visual representation of the underlying data distribution, showing how salary varies with years of experience.
- **Model Evaluation**: By overlaying the regression line on the scatter plot of actual data, it visually assesses how well the model fits the training data. The closer the points are to the line, the better the model’s predictions match the actual values.

This kind of plot is particularly useful in presentations or reports to convey, at a glance, how effective the linear regression model is in capturing the trend in the data. It’s a straightforward visual assessment tool for model performance before proceeding with more detailed statistical evaluation.
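For anyone who wants to reproduce this kind of plot without the salary file, here is a self-contained sketch. The data is synthetic (the slope, intercept, and noise level are arbitrary assumptions), it uses Matplotlib's `Axes` methods instead of the `plt.*` functions, and it selects the non-interactive `Agg` backend so that it runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic "experience vs salary" data; all numbers are made up
rng = np.random.default_rng(0)
X = rng.uniform(1, 10, size=(20, 1))
y = 9000.0 * X.ravel() + 26000.0 + rng.normal(0, 3000, size=20)

regressor = LinearRegression().fit(X, y)

# Sort by X so the fitted line is drawn from left to right
order = np.argsort(X.ravel())

fig, ax = plt.subplots()
ax.scatter(X, y, color="red")                                 # observed points
ax.plot(X[order], regressor.predict(X[order]), color="blue")  # fitted line
ax.set_title("Salary vs Experience (synthetic data)")
ax.set_xlabel("Years of Experience")
ax.set_ylabel("Salary")
```

In a script with a display, the `matplotlib.use("Agg")` line can be dropped and `plt.show()` called at the end, as in the lesson's code.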

    plt.scatter(X_test, y_test, color = 'red')
    pred_sal = regressor.predict(X_test)
    plt.plot(X_test, pred_sal, color = 'blue')
    plt.title('Salary vs Experience (Test set)')
    plt.xlabel('Years of Experience')
    plt.ylabel('Salary')
    plt.show()

### Visualization Code Breakdown

**`plt.scatter(X_test, y_test, color = 'red')`**

- `plt.scatter`: This function creates a scatter plot, plotting the actual test data points (`y_test`) against `X_test`.
- `X_test` and `y_test`: These represent the features (years of experience) and the corresponding salaries from the test dataset.
- `color = 'red'`: Sets the color of the scatter points to red, making them distinct and visually striking.

**`pred_sal = regressor.predict(X_test)`**

**`plt.plot(X_test, pred_sal, color = 'blue')`**

- `regressor.predict(X_test)`: Generates predictions for the `X_test` data. These predictions represent the model’s output when applied to new, unseen data, helping evaluate how it generalizes beyond the training data.
- `plt.plot`: This line draws the regression line based on the predicted values `pred_sal` over the `X_test` data. It shows how the model predicts the salary based on years of experience for the test set.
- `color = 'blue'`: Sets the color of the regression line to blue, providing a clear visual contrast to the red scatter points.

**`plt.title('Salary vs Experience (Test set)')`**

**`plt.xlabel('Years of Experience')`**

**`plt.ylabel('Salary')`**

- These commands set the title of the plot and label the axes, which is crucial for clarity and understanding when presenting the plot to an audience. The title specifies that this plot is for the test set, helping distinguish it from any training set plots.

**`plt.show()`**

- `plt.show()`: Displays the finalized plot. This is needed to render the plot when running a script. In Jupyter Notebooks, figures are usually rendered inline even without it.

### Purpose of This Visualization

This plot serves to:

- **Evaluate Model Generalization**: By plotting the test data and the corresponding model predictions, it visually assesses how well the model generalizes to new data. This is crucial for understanding the model’s performance in practical scenarios.
- **Visual Assessment**: Provides a straightforward and intuitive visual assessment of the model’s predictive accuracy on the test set. The closer the blue line (model predictions) aligns with the red points (actual values), the better the model is at making predictions.

Such visualizations are vital for presentations, reports, or simply to get a quick visual confirmation of model performance. They help stakeholders, who may not be familiar with the underlying statistics, understand model effectiveness in a clear and straightforward manner.

### Code Explanation

`regressor.predict([[40]])`

- `regressor.predict()`: This method is used to make predictions using the trained linear regression model. You provide the input features inside the method as an argument, and it returns the predicted output.
- `[[40]]`: The input to the `predict()` method must be in the same format as the data the model was trained on. Here, `40` is encapsulated within two pairs of brackets:
  - The outer brackets represent a list (or array-like structure) which could contain multiple samples if you were making more than one prediction at a time.
  - The inner brackets represent the feature array for a single sample. Since the model expects a 2D array as input (like the `X_train` array it was trained on and the `X_test` array it was evaluated on), you provide the input as a list within a list.
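The bracket rules above can be checked on a tiny model. The sketch below trains on synthetic data lying exactly on y = 2x + 1 (not the salary data), then predicts for one sample, for two samples, and for a 1D array reshaped to 2D.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data lying exactly on y = 2x + 1 (illustrative only)
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 2.0 * X.ravel() + 1.0
regressor = LinearRegression().fit(X, y)

single = regressor.predict([[40]])        # one sample with one feature
several = regressor.predict([[5], [40]])  # two samples, one feature each

# A 1D array must be reshaped to (n_samples, n_features) before predicting
reshaped = regressor.predict(np.array([40]).reshape(-1, 1))

print(single.shape, several.shape)  # one prediction per input row
```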

### How It Works

The model uses the coefficient(s) and intercept it learned during the training phase to calculate the predicted salary for someone with 40 years of experience. The equation for this linear regression would typically look something like this:

$$\text{Salary} = (\beta_1 \times \text{Years of Experience}) + \beta_0$$

where $\beta_1$ is the coefficient for the “Years of Experience” feature, and $\beta_0$ is the intercept.
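To confirm that `predict()` does nothing more than evaluate this equation, the sketch below computes the value by hand from `coef_` and `intercept_` and compares it with the model's output. The training data is synthetic, with an arbitrary slope and intercept, so the predicted number itself is meaningless.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic "experience vs salary" data; all numbers are made up
rng = np.random.default_rng(1)
X = rng.uniform(1, 10, size=(15, 1))
y = 9000.0 * X.ravel() + 26000.0 + rng.normal(0, 2000, size=15)
regressor = LinearRegression().fit(X, y)

beta_1 = regressor.coef_[0]     # coefficient for the single feature
beta_0 = regressor.intercept_   # intercept

manual = beta_1 * 40 + beta_0             # the equation evaluated by hand
predicted = regressor.predict([[40]])[0]  # the same value from the model

print(manual, predicted)
```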

### What It Returns

The `predict()` method returns the predicted salary value encapsulated in an array because it’s designed to handle multiple predictions at once. If you want just the single predicted value, you might access it by indexing the result:

`predicted_salary = regressor.predict([[40]])[0]`

### Practical Use

This functionality is extremely useful when you need to make predictions based on the model you’ve developed. For example, if you’re developing a tool for an HR department to estimate salary ranges based on employee experience, this model could provide quick estimates.