Key ML Lessons
Explore five fundamental machine learning lessons to better guide AI product development and avoid common pitfalls. Learn the importance of generalization, domain knowledge, feature engineering, ensemble models, and the difference between correlation and causation to effectively lead your team in building reliable machine learning applications.
Machine learning algorithms come with the promise of being able to figure out how to perform important tasks by learning from data, i.e., generalizing from examples without being explicitly told what to do. This means that the more data we have, the more ambitious the problems these algorithms can tackle. However, developing successful machine learning applications requires a fair amount of “black art” that is hard to come by.
Let’s go through the five key lessons compiled by machine learning researchers and practitioners (put together in a great research paper by Professor Pedro Domingos), so that you can guide your team to avoid such pitfalls.
1. It’s generalization that counts
The fundamental goal of machine learning is to generalize beyond the examples in the training set. No matter how much data we have, it is very unlikely that we will see those exact examples again at test time. Doing well on the training set is easy. The most common mistake among beginners is to test on the training data and enjoy the illusion of success; when the chosen classifier is then tested on new data, it is often no better than random guessing. So, set some of the data aside from the beginning, use it only to test your chosen classifier at the very end, and then learn your final classifier on the whole dataset.
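A minimal sketch of this holdout discipline in plain Python, with a made-up list of examples standing in for a real dataset:

```python
import random

def holdout_split(examples, test_fraction=0.2, seed=42):
    """Shuffle the examples and set a test fraction aside before any training."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    # The test set is touched only once, at the very end, to score the chosen model.
    return shuffled[n_test:], shuffled[:n_test]

train_set, test_set = holdout_split(range(100))
print(len(train_set), len(test_set))  # 80 20
```

The seed makes the split reproducible, so teammates evaluating different models all score against the same held-out examples.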
Of course, holding out data reduces the amount available for training. This can be mitigated by doing cross-validation: randomly dividing your training data into subsets, holding out each one while training on the rest, testing each learned classifier on the unseen examples, and averaging the results to see how well the particular parameter setting does.
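The cross-validation loop can be sketched in a few lines of plain Python; the `evaluate` function and the labels here are hypothetical stand-ins for a real learner and dataset, and the initial random shuffle is omitted for brevity:

```python
def k_fold_scores(examples, k, evaluate):
    """Hold out each of k folds in turn, score it, and collect the results."""
    fold_size = len(examples) // k
    scores = []
    for i in range(k):
        held_out = examples[i * fold_size:(i + 1) * fold_size]
        training = examples[:i * fold_size] + examples[(i + 1) * fold_size:]
        scores.append(evaluate(training, held_out))
    return scores

# Hypothetical evaluator: accuracy of always predicting the majority training label.
def evaluate(training, held_out):
    majority = max(set(training), key=training.count)
    return sum(1 for y in held_out if y == majority) / len(held_out)

labels = ["ham"] * 8 + ["spam"] * 2
scores = k_fold_scores(labels, k=5, evaluate=evaluate)
print(sum(scores) / len(scores))  # 0.8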
2. Data alone is not enough
When generalization is the goal, we bump into another major consequence: data alone is not enough, no matter how much of it you have. Very general assumptions, like similar examples having similar classes, are a large reason why machine learning has been so successful.
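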
Domain knowledge and an understanding of our data are crucial in making the right assumptions. The need for knowledge in learning should not be surprising. Machine learning is not magic; it can’t get something from nothing. It does is get more from less though. Programming, like all engineering, is a lot of work: we have to build everything from scratch. Learning is more like farming, which lets nature do most of the work. Farmers combine seeds with nutrients to grow crops. Learners combine knowledge with data to grow programs.
3. Feature engineering is the key
At the end of the day, some machine learning projects succeed, and some fail. What makes the difference? Easily the most important factor is the features used. If we have many independent features that correlate well with the class, learning is easy. On the other hand, if the class is based on a recipe that requires handling the ingredients in a complex way before they can be used, things become harder. Feature engineering is basically about creating new input features from your existing ones.
Very often the raw data does not even come in a form ready for learning. But we can construct features from it that can be used for learning. In fact, this is typically where most of the effort in a machine learning project goes. It is often also one of the most interesting parts, where intuition, creativity and “black art” are as important as the technical stuff.
First-timers are often surprised by how little time in a machine learning project is spent actually doing machine learning. However, it makes sense if you consider how time-consuming it is to gather, integrate, clean, and pre-process data and how much trial and error can go into feature design. Also, machine learning is not a one-shot process of building a dataset and running a learner, but rather, it is an **iterative process **of running the learner, analyzing the results, modifying the data and/or the learner, and repeating. Learning is often the quickest part of this, but that’s because we’ve already mastered it pretty well! Feature engineering is more difficult because it’s domain-specific, while learners can be largely general-purpose. Of course, one of the holy grails of machine learning is automating more and more of the feature engineering process.
4. Learn many models, not just one
In the early days of machine learning, people tried many variations of different learners but still only used the best one. Then, researchers noticed that, if instead of selecting the best variation found, we combine many variations of learners, the results are often much better despite only a little extra effort for the user. Now, creating such model ensembles is very common:
- In the simplest technique, called bagging, we use the same algorithm but train it on different subsets of original data. In the end, we just average answers or combine them with some voting mechanism.
- In boosting, learners are trained one by one sequentially. Each subsequent one paying most of its attention to data points that were is predicted by the previous one and continuing until we are satisfied with the results.
- In stacking, the output of different independent classifiers become the input of a new classifier, giving final predictions.
In the Netflix prize, teams from all over the world competed to build the best video recommender system. As the competition progressed, teams found that they obtained the best results by combining their learners with other teams’, and they merged into larger and larger teams. The winner and runner-up were both stacked ensembles of over one-hundred learners and combining the two ensembles further improved the results. Together is better!
5. Correlation does not imply causation
We have all heard that correlation does not imply causation, but still, people frequently think it does.
Often, the goal of learning predictive models is to use them as guides to action. If we find that beer and diapers are often bought together at the supermarket, perhaps putting beer next to the diaper section will increase sales. However, unless we do an actual experiment, it’s difficult to tell if this is true. Correlation is a sign of a potential causal connection, and we can use it as a guide to further investigation and not as our final conclusion.