HR Analytics: Job changes of Data Scientist
Introduction
To improve candidate selection in its recruitment process, a company collects data and builds a model to predict whether a candidate will stay with the company or leave. In addition, they want to find out which variables affect a candidate's decision.
You can download the dataset here.
We use the ROC AUC score to evaluate model performance.
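For reference, scikit-learn's roc_auc_score expects the true labels as the first argument and the predicted scores (ideally the predicted probability of the positive class) as the second. A minimal sketch with toy values:

from sklearn.metrics import roc_auc_score

# toy labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_prob))  # 0.75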
Data Overview
The company provides 19,158 training observations and 2,129 test observations, each with 13 features excluding the response variable. Some of the features are numeric and the others are categorical. Because the project objective is modeling, we start by building a baseline model with the existing features.
A baseline model helps us think about the relationship between the predictor and response variables, and it is a good first step.
Let’s import the necessary packages.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
train_df = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
test_df = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_test.csv')
train_df.shape, test_df.shape
((19158, 14), (2129, 13))
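Before encoding anything, it is worth checking which columns are numeric, which are categorical, and how much data is missing. A quick inspection (each line in its own cell; output omitted here):

train_df.dtypes
train_df.isna().sum().sort_values(ascending=False)
train_df['target'].value_counts(normalize=True)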
Next, we need to convert the categorical columns to a numeric format, because scikit-learn estimators cannot handle string categories directly.
cat_features = list(train_df.select_dtypes(exclude=['float64', 'int64']).columns)
for col in cat_features:
    # encode each categorical column as integer codes in a new *_cat column
    train_df[col + '_cat'] = train_df[col].astype('category').cat.codes
train_df.shape
(19158, 24)
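One detail to keep in mind with this encoding: .cat.codes maps missing values to -1, so NaNs in the categorical columns quietly become their own numeric value rather than staying missing. A small illustration:

s = pd.Series(['a', 'b', None, 'a'])
print(s.astype('category').cat.codes.tolist())  # [0, 1, -1, 0]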
I used a Random Forest classifier to build the baseline model with the code below.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
X = train_df.drop(cat_features + ['target'], axis=1)
y = train_df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
base_params = {
    'n_estimators': 200,
    'max_features': 'sqrt',  # 'auto' was removed in newer scikit-learn; 'sqrt' is the equivalent for classifiers
    'max_depth': 6,
    'criterion': 'gini'
}
rfc = RandomForestClassifier(**base_params)
rfc.fit(X_train, y_train)

# roc_auc_score expects the true labels first and the predicted
# probability of the positive class second
print('Train AUC: {0}'.format(roc_auc_score(y_train, rfc.predict_proba(X_train)[:, 1])))
print('Test AUC: {0}'.format(roc_auc_score(y_test, rfc.predict_proba(X_test)[:, 1])))
Train AUC: 0.7373099536799569
Test AUC: 0.6928262958387448
Without any feature engineering, the baseline model reaches a ROC AUC of about 0.74 on the training split and about 0.69 on the held-out split. That is a reasonable starting point, and we will improve the score in the next steps.
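The company also wants to know which variables affect a candidate's decision, and the fitted forest already offers a rough first answer through its feature importances. A minimal sketch (the exact ranking will vary between runs, since no random seed is fixed above):

importances = pd.Series(rfc.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))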