Data Preparation
Topics
This week’s assignments will guide you through the following topics:
- Understanding the authors’ motivations for constructing an XGBoost classifier, and why they think it’s a suitable approach to the problem they present
- Preparing data for XGBoost classifier construction, including handling of categorical features through one-hot encoding and converting the response variable to a binary representation
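As a rough illustration of the two preparation steps named above, the sketch below one-hot encodes a categorical column and binarizes the response on a small made-up table. The column names other than num, the category labels, and the rule "num > 0 becomes 1" are all assumptions for illustration; check the paper for the encoding and threshold the authors actually use.

```python
import pandas as pd

# Toy stand-in for the heart dataset. Column names (other than "num")
# and all values are made up for illustration.
df = pd.DataFrame({
    "age": [63, 45, 58, 51],
    "cp": ["typical", "asymptomatic", "non-anginal", "asymptomatic"],
    "num": [0, 2, 1, 0],
})

# One-hot encode the categorical column: each category becomes its own
# 0/1 indicator column, and the original "cp" column is dropped.
encoded = pd.get_dummies(df, columns=["cp"], dtype=int)

# Binarize the response. ASSUMPTION: any nonzero value maps to 1;
# confirm the rule against the paper before relying on it.
encoded["num"] = (encoded["num"] > 0).astype(int)

print(encoded)
```

Indicator columns are used because tree libraries generally expect numeric inputs, and integer codes for unordered categories would imply an ordering that does not exist.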
Reading
Please read the following:
Replication task
- Read in the heart dataset (HeartData_Full.csv) and explore the variable types
- One-hot encode the categorical variables
- Convert the response variable (num) into a binary variable in the same way the authors do
- Create a pipeline for determining whether there are “problem” features with non-numeric values, missing data, etc.
- Use the pipeline to look at any correlations or obvious relationships between pairs of features
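One possible shape for the “problem feature” check and the pairwise correlation step is sketched below. The helper name report_problem_features, the toy column names, and the use of "?" as a placeholder for missing values are assumptions for illustration, not details taken from HeartData_Full.csv. In practice the table would be loaded with pd.read_csv instead of being built by hand.

```python
import numpy as np
import pandas as pd


def report_problem_features(df: pd.DataFrame) -> pd.DataFrame:
    """Summarize dtype, missing count, and non-numeric entries per column."""
    rows = []
    for col in df.columns:
        # Values that cannot be parsed as numbers become NaN here.
        as_numeric = pd.to_numeric(df[col], errors="coerce")
        n_missing = int(df[col].isna().sum())
        # Present in the data, but not interpretable as a number.
        n_non_numeric = int((as_numeric.isna() & df[col].notna()).sum())
        rows.append({
            "feature": col,
            "dtype": str(df[col].dtype),
            "n_missing": n_missing,
            "n_non_numeric": n_non_numeric,
        })
    return pd.DataFrame(rows).set_index("feature")


# Toy data standing in for the real file; all names and values are made up.
toy = pd.DataFrame({
    "age": [63, 45, 58, 51, 60],
    "chol": [233.0, np.nan, 250.0, 204.0, 286.0],
    "ca": ["0", "?", "2", "1", "3"],  # "?" as a missing marker is an assumption
    "num": [0, 1, 1, 0, 1],
})

report = report_problem_features(toy)
flagged = report[(report["n_missing"] > 0) | (report["n_non_numeric"] > 0)]
print(flagged)

# Pairwise Pearson correlations over the columns that are already numeric.
corr = toy.select_dtypes(include="number").corr()
print(corr.round(2))
```

Columns that show up in flagged need cleaning or encoding before they can enter the correlation matrix, which is why the check runs first.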
Tasks
Complete the following tasks:
- Complete the required reading and answer the questions below.
- Complete the replication tasks above, referring to the supplemental reading if necessary.
Weekly Questions
Answer the following questions:
- In your own words, describe what the authors are doing in this paper. What is their motivation, and what is their goal?
- Where does the dataset that the authors use (and that you’ll use) come from? How was the data collected and what information is included?
- What is one-hot encoding, and why is it necessary in this case?
- What other data manipulations do the authors apply and why do they do them?