Set Up Your Data for Learning

Learn to set up data by defining features and labels, splitting it correctly, and preventing leaks for trustworthy machine learning.

In real-world machine learning projects, data preparation often takes the largest share of our time, by some estimates up to 80%. That work includes defining what to predict, cleaning the data, splitting it to ensure fair evaluation, and preventing subtle leaks that can undermine our model’s trustworthiness. While the “Clean It Up!” chapter covered general data-cleaning techniques (e.g., handling missing values, outlier treatment, normalization), this lesson focuses on the ML-specific steps we’ll repeat on every project:

  1. Defining exactly what we’re predicting and which inputs will drive that prediction.

  2. Splitting data so our model is truly tested on unseen examples.

  3. Safeguarding against accidental peeks into test data (data leakage).


With this foundation in mind, let’s clarify the two core components of every supervised ML dataset—labels and features.

Label vs. features

In supervised machine learning, we aim to learn a function that maps features (the inputs) to a label (the output).

  • A label is the single variable we want to predict, sometimes called the target or response.

  • Features (also called predictors or covariates) are the inputs, the measurable properties or characteristics that help explain variation in the label.

If we choose the wrong label, our model won’t answer the question we care about. If we pick poor features, the model can’t learn the underlying patterns. Consider a real estate application that estimates home values. Here:

  • The label is the sale price of each property.

  • The features might include:

    • Total square footage

    • Number of bedrooms

    • Encoded indicator of neighborhood quality

Each feature contributes information: square footage captures size, bedroom count hints at capacity, and neighborhood quality reflects demand. Together, they help predict price.

Example

Let’s now code the above example. Before training any model, we must explicitly define which columns serve as inputs and which column is the output. This clarity prevents confusion downstream and maintains a clean workflow.
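As a minimal sketch of the real estate example, we can separate features from the label with pandas. The column names (`sqft`, `bedrooms`, `neighborhood_quality`, `sale_price`) and all values below are made up for illustration; they are assumptions, not data from a real dataset.

```python
import pandas as pd

# Hypothetical toy data: column names and values are illustrative only.
homes = pd.DataFrame({
    "sqft": [1400, 2100, 850, 1750],
    "bedrooms": [3, 4, 2, 3],
    "neighborhood_quality": [2, 3, 1, 3],  # assumed encoding: 1 = low, 3 = high
    "sale_price": [240_000, 410_000, 150_000, 330_000],
})

# Name the inputs and the output explicitly, in one place.
FEATURE_COLUMNS = ["sqft", "bedrooms", "neighborhood_quality"]
LABEL_COLUMN = "sale_price"

# Guard against the label accidentally being used as an input.
assert LABEL_COLUMN not in FEATURE_COLUMNS

X = homes[FEATURE_COLUMNS]  # features: one row per home, one column per input
y = homes[LABEL_COLUMN]     # label: one sale price per home

print(X.shape, y.shape)  # (4, 3) (4,)
```

Keeping the column names in two constants means every later step (splitting, training, evaluation) refers to the same definition of inputs and output.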
