Learning habit analysis of high school students
Overview
School closures caused by the pandemic were a big problem for students. They had to switch to an online environment to continue learning without interrupting their progress. Using data collected from 420 students in Ha Noi (Viet Nam), I applied statistical approaches in Python to analyze the responses and explore interesting information and knowledge. You can find the data source here. All source code used in this tutorial is available here. Feel free to share your own approach in the comment section.
This project was implemented using the Team Data Science Process (Microsoft). I applied some of its artifact templates to present the data and results.
Data summary
The dataset contains 420 responses (already cleaned by the authors) and 40 features. The features can be classified into the 3 groups shown below. Alternatively, you can categorize them by the major groups of questions the author designed, described in the description section: individual demographics, student learning habits, and students' perception of self-learning.
| Variable Type | Number of features |
| --- | --- |
| Nominal data | 6 |
| Ordinal data | 25 |
| Numerical data | 9 |
The processing method should be chosen to match the nature of the data. Both nominal and ordinal data are categorical; however, the latter has a specific order, and that order can provide additional information for our problem. So you should be careful when you start exploring them.
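For example, pandas can represent ordinal answers as an ordered categorical type, so comparisons and sorting respect the order of the levels. The snippet below is a minimal, self-contained sketch with made-up levels, not columns from this dataset.
import pandas as pd
from pandas.api.types import CategoricalDtype
# Made-up ordinal scale: an ordered categorical remembers the order of its levels
likert = CategoricalDtype(categories=['low', 'medium', 'high'], ordered=True)
answers = pd.Series(['medium', 'high', 'low']).astype(likert)
print(answers.min(), answers.max())  # low high
print(answers > 'low')               # element-wise comparison that respects the order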
Let's begin with the data profiling activity in Python.
You should import the necessary packages in the first lines.
# Dataframe and matrix operation
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
# Visualization packages
import matplotlib.pyplot as plt
import seaborn as sns
# Statistical packages
import statsmodels.api as sm
from statsmodels.stats import weightstats as stests
from scipy import stats
from scipy.stats import chi2_contingency
from scipy.stats import chi2
Let's create a DataFrame from the dataset.
data_url = './data_student_learning_habit_covid19-200408.xlsx'
sheet_name = 'data_student_learning_habit_cov'
data_file = pd.ExcelFile(data_url)
students_df = pd.read_excel(data_file, sheet_name)
print('Loaded {0} instances with {1} features'.format(students_df.shape[0], students_df.shape[1]))
If the DataFrame loads successfully, you should see this message.
Loaded 420 instances with 40 features
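Before the analysis, a quick profiling pass is a good sanity check on the data types and missing values; the snippet below only assumes the DataFrame above loaded correctly.
# Quick profiling: column dtypes, total missing values, and the first few rows
print(students_df.dtypes.value_counts())
print('Missing values:', students_df.isnull().sum().sum())
print(students_df.head())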
The dataset is only a sample, not the whole population, so its summary statistics cannot directly represent all students in Viet Nam. In this part, I use a confidence interval to give a more informative estimate than a single average value.
How can you determine whether high school students' learning habits changed?
This depends on 2 features: Lh_before_Cov and Lh_in_Cov. Both are ordinal variables, so we can compare them to see what happened during the pandemic.
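Before computing any proportion, a cross-tabulation of the two features gives a quick picture of how answers shifted; this snippet assumes both columns keep the original answer labels from the survey.
# Cross-tabulate answers before vs. during the pandemic;
# the diagonal counts students whose answer did not change
habit_crosstab = pd.crosstab(students_df['Lh_before_Cov'], students_df['Lh_in_Cov'])
print(habit_crosstab)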
First, calculate the proportion of students whose habits changed.
def is_diff(item):
    # Return 1 if the learning habit differs before vs. during the pandemic, else 0
    before, during = item
    if before == during:
        return 0
    return 1

# Flag every student whose answer changed, then compute the sample proportion
students_df['diff_before_during'] = students_df[['Lh_before_Cov', 'Lh_in_Cov']].apply(is_diff, axis=1)
p_hat = students_df['diff_before_during'].sum() / students_df.shape[0]
print(p_hat)
You should get a result that looks like this:
0.38333333333333336
This is the proportion in the sample data, not in the population. You need to estimate that figure for the population. How? By using a confidence interval approach.
To calculate a confidence interval with traditional methods, you need to know the variance of the population, which is difficult to obtain. The code below instead uses bootstrapping (resampling the data with replacement) to estimate the proportion with 95% confidence in Python.
def cal_conf_prop(dataframe, diff_func):
    sample_size = dataframe.shape[0]
    p_bootstrap = []
    # Resample the data with replacement 1000 times and record the proportion each time
    for i in range(1000):
        bootstrap_sample = dataframe.sample(n=sample_size, replace=True)
        bootstrap_sample['diff_before_during'] = bootstrap_sample[['Lh_before_Cov', 'Lh_in_Cov']].apply(diff_func, axis=1)
        p_sample = bootstrap_sample['diff_before_during'].sum() / sample_size
        p_bootstrap.append(p_sample)
    p_bootstrap.sort()
    # Take the 2.5th and 97.5th percentiles of the bootstrap distribution (95% confidence)
    alpha = 0.95
    lower = np.percentile(p_bootstrap, (1 - alpha) / 2 * 100)
    higher = np.percentile(p_bootstrap, (alpha + ((1 - alpha) / 2)) * 100)
    return lower, higher
lower, higher = cal_conf_prop(students_df, is_diff)
print('Percent of students who changed their learning habit: {0} {1}'.format(round(lower * 100, 1), round(higher * 100, 1)))
Percent of students who changed their learning habit: 33.8 42.9
You can see that the confidence interval gives a range for the population value instead of a single point from the sample data. This makes the result more meaningful than a single sample proportion.
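As a rough cross-check (not part of the original analysis), you can compare the bootstrap interval with the classic normal-approximation interval for a proportion from statsmodels; for a sample of this size the two should be close.
from statsmodels.stats.proportion import proportion_confint
# Normal-approximation (Wald) 95% confidence interval for the same proportion
count = students_df['diff_before_during'].sum()
nobs = students_df.shape[0]
wald_lower, wald_higher = proportion_confint(count, nobs, alpha=0.05, method='normal')
print('Normal-approximation interval: {0} {1}'.format(round(wald_lower * 100, 1), round(wald_higher * 100, 1)))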
More than 1 in 3 students changed their learning habits during the pandemic. This is easily explained by schools switching from offline to online learning. But was the change positive or negative: did students spend more or less time studying? The result below answers that question.
def is_diff_positive(item):
    # Return 1 only if the student reported spending more time learning during the pandemic
    before, during = item
    if before < during:
        return 1
    return 0
lower, higher = cal_conf_prop(students_df, is_diff_positive)
print('Percent of students who spent more time learning: {0} {1}'.format(round(lower * 100, 1), round(higher * 100, 1)))
Percent of students who spent more time learning: 27.6 36.4
The result shows that, among students whose habits changed, most spent more time learning than before. You can use the bootstrap to calculate confidence intervals easily and check your own assumptions about the dataset; it is a powerful tool for exploring insights in raw data. You can read more articles about the bootstrap to understand how it works. I will continue to update this project, so follow me to get a notification when I release a new article.