Set Up Your Data for Learning
Learn to set up data by defining features and labels, splitting it correctly, and preventing leaks for trustworthy machine learning.
In real-world machine learning projects, data preparation is often said to take up to 80% of our time: defining what to predict, cleaning and splitting the data so evaluation is fair, and preventing subtle leaks that can undermine a model’s trustworthiness. While the “Clean It Up!” chapter covered general data-cleaning techniques (e.g., handling missing values, outlier treatment, normalization), this lesson focuses on the ML-specific steps we’ll repeat on every project:
Defining exactly what we’re predicting and which inputs will drive that prediction.
Splitting data so our model is truly tested on unseen examples.
Safeguarding against accidental peeks into test data (data leakage).
With this foundation in mind, let’s clarify the two core components of every supervised ML dataset: labels and features.
Label vs. features
In supervised machine learning, we aim to learn a function that maps features (the inputs) to a label (the output).
A label is the single variable we want to predict, sometimes called the target or response.
Features (also called predictors or covariates) are the inputs, the measurable properties or characteristics that help explain variation in the label.
If we choose the wrong label, our model won’t answer the question we care about. If we pick poor features, the model can’t learn the underlying patterns. Consider a real estate application that estimates home values. Here:
The label is the sale price of each property.
The features might include:
Total square footage
Number of bedrooms
Encoded indicator of neighborhood quality
Each feature contributes information: size suggests space, bedrooms hint at capacity, and neighborhood reflects demand. Together, these help predict price.
Example
Let’s now code the above example. Before training any model, we must explicitly define which columns serve as inputs and which column is the output. This clarity prevents confusion downstream and maintains a clean workflow.
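Here is a minimal sketch of the real estate example, assuming the data lives in a pandas DataFrame. The column names (square_feet, bedrooms, neighborhood_quality, sale_price) and all values are made up for illustration; a real dataset will use its own names and encodings.

```python
import pandas as pd

# Hypothetical toy data: names and values are illustrative assumptions only.
homes = pd.DataFrame({
    "square_feet": [1400, 2100, 950, 1800],
    "bedrooms": [3, 4, 2, 3],
    "neighborhood_quality": [2, 3, 1, 3],  # assumed encoding: 1 = low, 3 = high
    "sale_price": [240000, 410000, 150000, 335000],
})

# State the inputs and the output once, by name, so later steps reuse them.
FEATURE_COLUMNS = ["square_feet", "bedrooms", "neighborhood_quality"]
LABEL_COLUMN = "sale_price"

X = homes[FEATURE_COLUMNS]  # features: one row per home, one column per input
y = homes[LABEL_COLUMN]     # label: one sale price per home

# Sanity check: the label must never appear among the features.
assert LABEL_COLUMN not in X.columns

print(X.shape)  # (4, 3)
print(y.shape)  # (4,)
```

Keeping the feature list and label name in named constants is one possible convention. It makes the choice of inputs and output easy to review, and the final check catches the simplest form of leakage, where the label itself slips into the feature set.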