Data Preparation

Topics

This week’s assignments will guide you through the following topics:

  • Understanding the authors’ motivations for constructing an XGBoost classifier, and why they think it’s a suitable approach to the problem they present
  • Preparing data for XGBoost classifier construction, including handling of categorical features through one-hot encoding and converting the response variable to a binary representation

Reading

Please read the following:

Replication task

  • Read in the heart dataset (HeartData_Full.csv) and explore the variable types
  • One-hot encode the categorical variables
  • Convert the response variable (num) into a binary variable in the same way the authors do
  • Create a pipeline for determining whether there are “problem” features with non-numeric values, missing data, etc.
  • Use the pipeline to look for correlations or other obvious relationships between pairs of features
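
As a starting point, the replication steps above can be sketched in pandas. This is a minimal sketch under stated assumptions, not the authors' implementation: it uses a small made-up table in place of HeartData_Full.csv, every column name other than num is hypothetical, and the rule "num greater than 0 means disease" is an assumed binarization that you should check against the paper. The helper names prepare and report_problem_features are also just illustrative.

```python
import pandas as pd

# Toy stand-in for the real data. In the actual task you would load the file:
#   df = pd.read_csv("HeartData_Full.csv")
# All column names except `num` are assumptions made for illustration.
df = pd.DataFrame({
    "age": [63, 67, 41, 56, 50],
    "chol": [233.0, 286.0, None, 236.0, 219.0],
    "cp": ["typical", "asymptomatic", "atypical", "asymptomatic", "non-anginal"],
    "num": [0, 2, 0, 1, 3],
})


def prepare(data, categorical_cols, response="num"):
    """One-hot encode the categorical columns and binarize the response."""
    out = pd.get_dummies(data, columns=categorical_cols, dtype=int)
    # ASSUMPTION: 0 means no disease and any positive value means disease.
    # Confirm this against the rule the authors actually use.
    out[response] = (out[response] > 0).astype(int)
    return out


def report_problem_features(data):
    """Summarize, per column, the dtype, whether it is non-numeric, and missing counts."""
    return pd.DataFrame({
        "dtype": data.dtypes.astype(str),
        "non_numeric": data.dtypes.map(
            lambda t: not pd.api.types.is_numeric_dtype(t)
        ),
        "n_missing": data.isna().sum(),
    })


prepared = prepare(df, categorical_cols=["cp"])
report = report_problem_features(prepared)

# Pairwise Pearson correlations between the numeric features.
correlations = prepared.select_dtypes("number").corr()

print(report)
print(correlations.round(2))
```

Running the report before and after encoding is a quick way to confirm that no non-numeric columns remain, and the missing-value counts show which features need attention before model fitting.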

Tasks

Complete the following tasks:

  • Complete the required reading and answer the questions below.
  • Complete the replication tasks above, referring to the supplemental reading if necessary.

Weekly Questions

Answer the following questions:

  • In your own words, describe what the authors are doing in this paper. What is their motivation, and what is their goal?
  • Where does the dataset that the authors use (and that you’ll use) come from? How was the data collected and what information is included?
  • What is one-hot encoding, and why is it necessary in this case?
  • What other data manipulations do the authors apply, and why?