HR Analytics: Job changes of Data Scientist
Introduction
To improve candidate selection in its recruitment process, a company collects data and builds a model to predict whether a candidate will stay with the company or leave. In addition, they want to find out which variables affect a candidate's decision.
You can download the dataset here.
We use the ROC AUC score to evaluate model performance.
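For reference, scikit-learn's roc_auc_score expects the true labels as the first argument and the predicted scores (ideally the predicted probability of the positive class) as the second. A minimal sketch with toy values:

from sklearn.metrics import roc_auc_score

# toy labels and predicted probabilities of the positive class
y_true = [0, 0, 1, 1]
y_prob = [0.1, 0.4, 0.35, 0.8]
print(roc_auc_score(y_true, y_prob))  # 0.75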
Data Overview
The company provides 19,158 training observations and 2,129 test observations, each with 13 features excluding the response variable. Some of the features are numeric and the others are categorical. Because the project objective is modeling, we start by building a baseline model with the existing features.
A baseline model helps us think about the relationship between the predictor and response variables, and it is a good first step.
Let’s import the necessary packages.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
train_df = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_train.csv')
test_df = pd.read_csv('/kaggle/input/hr-analytics-job-change-of-data-scientists/aug_test.csv')
train_df.shape, test_df.shape
((19158, 14), (2129, 13))
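Before encoding anything, it is worth checking which columns are numeric, which are categorical, and how much data is missing. A quick inspection (each line in its own cell; output omitted here):

train_df.dtypes
train_df.isna().sum().sort_values(ascending=False)
train_df['target'].value_counts(normalize=True)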
Next, we need to convert the categorical columns to a numeric format, because scikit-learn estimators cannot handle string categories directly.
cat_features = list(train_df.select_dtypes(exclude=['float64', 'int64']).columns)
for col in cat_features:
    # encode each categorical column as integer codes in a new *_cat column
    train_df[col + '_cat'] = train_df[col].astype('category').cat.codes
train_df.shape
(19158, 24)
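One detail to keep in mind with this encoding: .cat.codes maps missing values to -1, so NaNs in the categorical columns quietly become their own numeric value rather than staying missing. A small illustration:

s = pd.Series(['a', 'b', None, 'a'])
print(s.astype('category').cat.codes.tolist())  # [0, 1, -1, 0]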
I used a Random Forest classifier to build the baseline model with the code below.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
X = train_df.drop(cat_features + ['target'], axis=1)
y = train_df['target']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
base_params = {
    'n_estimators': 200,
    'max_features': 'sqrt',  # 'auto' was removed in newer scikit-learn; 'sqrt' is the equivalent for classifiers
    'max_depth': 6,
    'criterion': 'gini'
}
rfc = RandomForestClassifier(**base_params)
rfc.fit(X_train, y_train)

# roc_auc_score expects the true labels first and the predicted
# probability of the positive class second
print('Train AUC: {0}'.format(roc_auc_score(y_train, rfc.predict_proba(X_train)[:, 1])))
print('Test AUC: {0}'.format(roc_auc_score(y_test, rfc.predict_proba(X_test)[:, 1])))
Train AUC: 0.7373099536799569
Test AUC: 0.6928262958387448
Without any feature engineering, the baseline model reaches a ROC AUC of about 0.74 on the training split and about 0.69 on the held-out split. That is a reasonable starting point, and we will improve the score in the next steps.
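The company also wants to know which variables affect a candidate's decision, and the fitted forest already offers a rough first answer through its feature importances. A minimal sketch (the exact ranking will vary between runs, since no random seed is fixed above):

importances = pd.Series(rfc.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(10))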