Basic Statistics using Python

Importing Libraries

import pandas as pd
import numpy as np

This part of the code imports the necessary Python libraries. pandas is used for data manipulation and analysis, and numpy is used for working with numerical arrays. The first few snippets rely only on pandas; numpy comes into play in the later sections on dispersion measures.

Creating a DataFrame

data = {'students': [65, 82, 72, 92, 83, 74, 54, 84, 65, 66]}
df = pd.DataFrame(data)
  • data: A dictionary with one key ('students'), which becomes the column name, and a list of integer scores as its value.
  • df: A DataFrame created from the data dictionary using pandas.DataFrame(). This structure is well suited to tabular data whose columns may hold different types.
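A quick way to confirm that the DataFrame was built as intended is to inspect its shape, column types, and first rows. The following is a minimal sketch that uses the same data as above; the choice of three rows for head() is arbitrary.

```python
import pandas as pd

data = {'students': [65, 82, 72, 92, 83, 74, 54, 84, 65, 66]}
df = pd.DataFrame(data)

print(df.shape)    # (number of rows, number of columns)
print(df.dtypes)   # one dtype per column
print(df.head(3))  # first three rows
```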

Calculating the Mean

mean_value = df['students'].mean()
print(f"Mean value is {mean_value}")
print("Mean Value is...{}".format(mean_value))
  • The mean() method calculates the average of the numbers in the 'students' column of the DataFrame.
  • The mean is printed twice, once with a formatted string literal (f-string) and once with the str.format() method. Both lines report the same value, 73.7.
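To see what mean() does under the hood, the sketch below recomputes the average by hand as the sum divided by the count and compares it with the pandas result. The variable names scores, manual_mean, and pandas_mean are illustrative and not part of the original code.

```python
import pandas as pd

scores = [65, 82, 72, 92, 83, 74, 54, 84, 65, 66]
df = pd.DataFrame({'students': scores})

# Mean by definition: sum of the values divided by how many there are
manual_mean = sum(scores) / len(scores)   # 737 / 10
pandas_mean = df['students'].mean()

print(f"Manual: {manual_mean}, pandas: {pandas_mean}")
```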

Calculating the Median

median_value = df['students'].median()
print(f'Median using Pandas: {median_value}')
  • The median() method computes the median value of the data in the 'students' column. The median is the value separating the higher half from the lower half of a data sample.
  • Because there are ten scores (an even count), the median is the average of the two middle values of the sorted data, 72 and 74, which gives 73.0. It is printed using a formatted string literal.
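The definition of the median can be checked by hand: sort the values, then take the middle one, or the average of the two middle ones when the count is even. This is a minimal sketch with illustrative variable names, not the way pandas implements it internally.

```python
import pandas as pd

scores = [65, 82, 72, 92, 83, 74, 54, 84, 65, 66]

ordered = sorted(scores)
n = len(ordered)
mid = n // 2

if n % 2 == 0:
    # Even count: average the two middle values
    manual_median = (ordered[mid - 1] + ordered[mid]) / 2
else:
    # Odd count: the single middle value
    manual_median = float(ordered[mid])

pandas_median = pd.Series(scores).median()
print(manual_median, pandas_median)
```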

Calculating the Mode

mode_value = df['students'].mode()
print(f'Mode using Pandas: {mode_value}')
  • The mode() method identifies the most frequently occurring value(s) in the 'students' column.
  • mode() always returns a pandas Series, because a tie can produce several modes. Here 65 is the only score that occurs twice, so the Series holds a single entry at index 0 with the value 65.
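Because mode() returns a Series, a single scalar has to be pulled out explicitly, and tied data yields more than one entry. The sketch below shows both cases; the tied Series is made-up data used only for illustration.

```python
import pandas as pd

scores = pd.Series([65, 82, 72, 92, 83, 74, 54, 84, 65, 66])

modes = scores.mode()        # Series of all modes
first_mode = modes.iloc[0]   # scalar: the first (here the only) mode
print(f"Mode: {first_mode}")

# Hypothetical data with a tie: 1 and 2 both occur twice
tied = pd.Series([1, 1, 2, 2, 3])
tied_modes = tied.mode().tolist()
print(f"Modes of tied data: {tied_modes}")
```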

Calculating Quartiles

data = {'Values': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]}
df = pd.DataFrame(data)
q1 = df['Values'].quantile(0.25)
q2 = df['Values'].quantile(0.50)
q3 = df['Values'].quantile(0.75)
print(f"Q1 = {q1}, Q2 (median) = {q2}, Q3 = {q3}")
  • data: A dictionary containing a list of values from 1 to 11.
  • df: A DataFrame created from the data dictionary, which is used for further analysis.
  • quantile(): A method that returns the value at a given fraction of the sorted data, interpolating linearly between neighboring values by default.
    • 0.25 (Q1), 0.50 (Q2, which is also the median), and 0.75 (Q3) are the respective quartiles.
  • The print statement reports all three quartiles: Q1 = 3.5, Q2 = 6.0, and Q3 = 8.5. Take care to pair each label with the matching variable; printing q2 under a "Q1" label would show the median (6.0) instead of the first quartile.
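Quartile values depend on how positions between two data points are handled. The sketch below computes all three quartiles in one call, derives the interquartile range, and shows how the interpolation parameter of Series.quantile() changes Q1. The variable names are illustrative.

```python
import pandas as pd

values = pd.Series(range(1, 12))  # 1, 2, ..., 11

# Passing a list returns a Series with one entry per requested quantile
q1, q2, q3 = values.quantile([0.25, 0.50, 0.75])
iqr = q3 - q1
print(f"Q1 = {q1}, Q2 = {q2}, Q3 = {q3}, IQR = {iqr}")

# Q1 falls between the values 3 and 4; the default ('linear') interpolates,
# while 'lower' and 'higher' pick one of the two neighboring data points
q1_lower = values.quantile(0.25, interpolation='lower')
q1_higher = values.quantile(0.25, interpolation='higher')
print(f"Q1 lower = {q1_lower}, Q1 higher = {q1_higher}")
```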

Calculating Percentiles

p20 = df['Values'].quantile(0.20)
p60 = df['Values'].quantile(0.60)
print(f'20th percentile: {p20}, 60th percentile: {p60}')
  • This part calculates the 20th and 60th percentiles of the 'Values' column using the quantile() method, which expects fractions between 0 and 1 (0.20 and 0.60).
  • For this dataset the results are 3.0 for the 20th percentile and 7.0 for the 60th. The 20th percentile is the value below which roughly 20% of the observations fall, and likewise for the 60th percentile.
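A common source of confusion is the scale of the argument: pandas quantile() takes a fraction from 0 to 1, while numpy's np.percentile() takes a percentage from 0 to 100. The sketch below compares the two on the same data; with default settings both use linear interpolation.

```python
import numpy as np
import pandas as pd

values = pd.Series(range(1, 12))  # 1, 2, ..., 11

# pandas: fraction in [0, 1]
p20_pd = values.quantile(0.20)
p60_pd = values.quantile(0.60)

# numpy: percentage in [0, 100]
p20_np, p60_np = np.percentile(values.to_numpy(), [20, 60])

print(f"pandas: {p20_pd}, {p60_pd}")
print(f"numpy:  {p20_np}, {p60_np}")
```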

Calculating the Range

range_value = df['Values'].max() - df['Values'].min()
print(f'Range: {range_value}')
  • This part of the code calculates the range of the dataset, which is the difference between the maximum and minimum values.
  • At this point df is the DataFrame from the quartile section, so the column to use is 'Values'. Referencing the earlier 'students' column here would raise a KeyError, because that column no longer exists in df.
  • The range is 11 - 1 = 10.
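The range can also be obtained with numpy's np.ptp() ("peak to peak"). The sketch below computes it both ways on a Series holding the values 1 to 11; converting to a numpy array first with to_numpy() is a cautious choice rather than a requirement stated in the original.

```python
import numpy as np
import pandas as pd

values = pd.Series(range(1, 12))  # 1, 2, ..., 11

range_pd = values.max() - values.min()   # explicit max minus min
range_np = np.ptp(values.to_numpy())     # same quantity via numpy

print(f"Range (pandas): {range_pd}, Range (numpy): {range_np}")
```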

Define Data and Create DataFrame

data1 = np.array([2, 4, 4, 4, 5, 5, 7, 9])
data2 = np.array([10, 13, 15, 14, 10, 16, 18, 21])
df = pd.DataFrame({
    'Data1': data1,
    'Data2': data2
})
  • data1 and data2 are numpy arrays containing numerical data.
  • A pandas DataFrame df is created with two columns named Data1 and Data2, holding the respective data sets.

Calculate Basic Statistical Measures

mean_data1 = np.mean(data1)
mean_data2 = np.mean(data2)
std_data1 = np.std(data1, ddof=0)
std_data2 = np.std(data2, ddof=0)
mean_dev_data1 = np.mean(np.abs(data1 - mean_data1))
mean_dev_data2 = np.mean(np.abs(data2 - mean_data2))
  • Means of data1 and data2 are calculated using np.mean().
  • Standard deviations are calculated with np.std() using ddof=0, which sets the divisor to N (the number of elements) and therefore gives the population standard deviation. With ddof=1 the divisor would be N - 1, giving the sample standard deviation.
  • Mean deviations (average of absolute deviations from the mean) are calculated for both datasets.
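The ddof setting matters in practice because numpy and pandas use different defaults: np.std() defaults to ddof=0, while the pandas std() method defaults to ddof=1. The sketch below compares them on data1; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

data1 = np.array([2, 4, 4, 4, 5, 5, 7, 9])

pop_std = np.std(data1, ddof=0)      # divisor N     (population)
sample_std = np.std(data1, ddof=1)   # divisor N - 1 (sample)

series = pd.Series(data1)
pandas_default = series.std()        # pandas uses ddof=1 unless told otherwise
pandas_pop = series.std(ddof=0)      # matches np.std with ddof=0

print(f"Population: {pop_std}, sample: {sample_std}")
print(f"pandas default: {pandas_default}, pandas ddof=0: {pandas_pop}")
```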

Calculate Combined Metrics

combined_mean = np.mean(np.concatenate([data1, data2]))
combined_std = np.std(np.concatenate([data1, data2]), ddof=0)
  • np.concatenate() joins data1 and data2 into a single array of 16 values. The mean and the population standard deviation (ddof=0) are then computed over that combined array.
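The combined figures can also be derived from the per-group statistics alone, without concatenating the raw data. The sketch below applies the standard textbook formula for the combined mean and population standard deviation and checks it against the direct calculation. The names n1, n2, d1, d2, and formula_std are illustrative and not part of the original code.

```python
import numpy as np

data1 = np.array([2, 4, 4, 4, 5, 5, 7, 9])
data2 = np.array([10, 13, 15, 14, 10, 16, 18, 21])

n1, n2 = len(data1), len(data2)
m1, m2 = np.mean(data1), np.mean(data2)
s1, s2 = np.std(data1, ddof=0), np.std(data2, ddof=0)

# Combined mean: size-weighted average of the group means
formula_mean = (n1 * m1 + n2 * m2) / (n1 + n2)

# Combined variance: each group's variance plus the squared distance
# of its mean from the combined mean, weighted by group size
d1, d2 = m1 - formula_mean, m2 - formula_mean
formula_var = (n1 * (s1**2 + d1**2) + n2 * (s2**2 + d2**2)) / (n1 + n2)
formula_std = np.sqrt(formula_var)

# Direct calculation on the concatenated data for comparison
combined = np.concatenate([data1, data2])
direct_mean = np.mean(combined)
direct_std = np.std(combined, ddof=0)

print(f"Mean: formula {formula_mean}, direct {direct_mean}")
print(f"Std:  formula {formula_std}, direct {direct_std}")
```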

Calculate Range and Coefficients

range_data1 = np.ptp(data1)
range_data2 = np.ptp(data2)
coeff_of_range1 = range_data1 / (np.max(data1) + np.min(data1))
coeff_of_range2 = range_data2 / (np.max(data2) + np.min(data2))
  • Range (difference between maximum and minimum values) is calculated using np.ptp().
  • The coefficient of range is then calculated for both datasets as the range divided by the sum of the maximum and minimum values. It is a unitless, relative measure of spread.
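The same calculation can be wrapped in a small helper so that it is written once and reused. The function name coefficient_of_range is hypothetical, and the sketch assumes the maximum and minimum do not sum to zero.

```python
import numpy as np

def coefficient_of_range(values):
    """Return (max - min) / (max + min); assumes max + min != 0."""
    arr = np.asarray(values, dtype=float)
    highest, lowest = arr.max(), arr.min()
    return (highest - lowest) / (highest + lowest)

data1 = np.array([2, 4, 4, 4, 5, 5, 7, 9])
data2 = np.array([10, 13, 15, 14, 10, 16, 18, 21])

cr1 = coefficient_of_range(data1)   # (9 - 2) / (9 + 2)
cr2 = coefficient_of_range(data2)   # (21 - 10) / (21 + 10)
print(f"Data1: {cr1:.4f}, Data2: {cr2:.4f}")
```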

Calculate Quartiles and Coefficients of Quartile Deviation

quartiles_data1 = np.percentile(data1, [25, 75])
quartiles_data2 = np.percentile(data2, [25, 75])
coeff_of_quartile_dev1 = (quartiles_data1[1] - quartiles_data1[0]) / (quartiles_data1[1] + quartiles_data1[0])
coeff_of_quartile_dev2 = (quartiles_data2[1] - quartiles_data2[0]) / (quartiles_data2[1] + quartiles_data2[0])
  • Quartiles are calculated using np.percentile().
  • The coefficient of quartile deviation is then calculated for both datasets as the difference between the upper and lower quartiles divided by their sum, that is, (Q3 - Q1) / (Q3 + Q1).
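Indexing into the quartile arrays with [0] and [1] is easy to get wrong, so unpacking them into named variables can make the formula easier to read. The helper below is a sketch with a hypothetical function name; it relies on np.percentile()'s default linear interpolation.

```python
import numpy as np

def coefficient_of_quartile_deviation(values):
    """Return (Q3 - Q1) / (Q3 + Q1) using numpy's default interpolation."""
    q1, q3 = np.percentile(values, [25, 75])
    return (q3 - q1) / (q3 + q1)

data1 = np.array([2, 4, 4, 4, 5, 5, 7, 9])
data2 = np.array([10, 13, 15, 14, 10, 16, 18, 21])

cqd1 = coefficient_of_quartile_deviation(data1)   # Q1 = 4.0,   Q3 = 5.5
cqd2 = coefficient_of_quartile_deviation(data2)   # Q1 = 12.25, Q3 = 16.5
print(f"Data1: {cqd1:.4f}, Data2: {cqd2:.4f}")
```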

Calculate Coefficient of Variation

coeff_of_variation1 = (std_data1 / mean_data1) * 100
coeff_of_variation2 = (std_data2 / mean_data2) * 100
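  • The coefficient of variation is calculated for both datasets as the standard deviation divided by the mean, expressed as a percentage. Because it is relative to the mean, it allows the spread of datasets with different scales to be compared.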
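As a worked check, the sketch below wraps the formula in a helper and applies it to both datasets. The function name coefficient_of_variation and its ddof parameter are illustrative; ddof=0 is used to match the population standard deviation computed earlier.

```python
import numpy as np

def coefficient_of_variation(values, ddof=0):
    """Return std / mean * 100; assumes the mean is not zero."""
    arr = np.asarray(values, dtype=float)
    return np.std(arr, ddof=ddof) / np.mean(arr) * 100

data1 = np.array([2, 4, 4, 4, 5, 5, 7, 9])
data2 = np.array([10, 13, 15, 14, 10, 16, 18, 21])

cv1 = coefficient_of_variation(data1)   # 2.0 / 5.0 * 100
cv2 = coefficient_of_variation(data2)
print(f"Data1: {cv1:.2f}%, Data2: {cv2:.2f}%")
```

Under these assumptions Data1 has the larger relative spread, even though Data2 has the larger standard deviation in absolute terms.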

Print Results

print(f"Mean of Data1: {mean_data1}, Data2: {mean_data2}")
print(f"Standard Deviation of Data1: {std_data1}, Data2: {std_data2}")
print(f"Mean Deviation of Data1: {mean_dev_data1}, Data2: {mean_dev_data2}")
print(f"Combined Mean: {combined_mean}")
print(f"Combined Standard Deviation: {combined_std}")
print(f"Coefficient of Range Data1: {coeff_of_range1}, Data2: {coeff_of_range2}")
print(f"Coefficient of Quartile Deviation Data1: {coeff_of_quartile_dev1}, Data2: {coeff_of_quartile_dev2}")
print(f"Coefficient of Variation Data1: {coeff_of_variation1}, Data2: {coeff_of_variation2}")

All calculated values are printed out, providing a comprehensive statistical analysis of the two datasets.
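When many measures are printed one after another, a compact table can be easier to scan. The sketch below collects a few of the measures into a summary DataFrame with one row per dataset; the column labels and the rounding to three decimals are arbitrary choices made for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Data1': [2, 4, 4, 4, 5, 5, 7, 9],
    'Data2': [10, 13, 15, 14, 10, 16, 18, 21],
})

# Each entry is a Series indexed by column name, so the result
# has one row per dataset and one column per measure
summary = pd.DataFrame({
    'mean': df.mean(),
    'std (population)': df.std(ddof=0),
    'range': df.max() - df.min(),
    'cv (%)': df.std(ddof=0) / df.mean() * 100,
}).round(3)

print(summary)
```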
