As George Box noted several decades ago, “Essentially, all models are wrong, but some are useful.” Statistical analysis, and its modern-day descendants machine learning and deep learning, remains a bold attempt to capture complex activities in a much simpler set of equations, even if those equations themselves often seem esoteric. No set of equations can capture every detail, but can they capture the essence?
And how do you begin to trust those equations, that model – enough so that you are ready to bet your business on it?
There are no silver bullets, and there is no substitute for testing.
The first and most common approach is to test the model with “out of band” data. Models, as you may know, are typically built on a set of training data and tested on a separate set of test data. Usually, though, model builders carve the test data out of the same pool of data they train on. The problem is that such test data tends to share the characteristics of the training data, so the model is likely to perform reasonably well on it simply because the two look so alike. At this stage, it is critical to insist on testing the model with a completely new and relatively recent set of data, and to check the model’s performance on it. If that performance is acceptable, it is your first real clue that the model may actually perform well in a real, live situation.
Kaggle competition results are often validated this way. GE sponsored one such competition on flight times and validated the entries against fresh flight data before awarding the prize money to the winners.
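To make the difference between a carved-out test set and genuinely fresh data concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the records are invented, the "model" is a toy ratio fit, and the helper names `split_by_date`, `fit_ratio_model` and `mean_absolute_error` are hypothetical. The only point is the pattern of scoring the same model on a holdout from the old data and on data dated after a cutoff.

```python
from datetime import date

def split_by_date(records, cutoff):
    """Split records into those dated before the cutoff and those on or after it."""
    older = [r for r in records if r["date"] < cutoff]
    fresh = [r for r in records if r["date"] >= cutoff]
    return older, fresh

def fit_ratio_model(records):
    """Toy model (assumption): y is roughly k * x, with k the average y/x ratio."""
    k = sum(r["y"] / r["x"] for r in records) / len(records)
    return lambda x: k * x

def mean_absolute_error(model, records):
    """Average absolute gap between prediction and actual value."""
    return sum(abs(model(r["x"]) - r["y"]) for r in records) / len(records)

# Invented data: the relationship drifts from y = 2x (2023) to y = 3x (2024).
records = [
    {"date": date(2023, 1, 10), "x": 1.0, "y": 2.0},
    {"date": date(2023, 4, 2), "x": 2.0, "y": 4.0},
    {"date": date(2023, 7, 19), "x": 3.0, "y": 6.0},
    {"date": date(2023, 10, 5), "x": 4.0, "y": 8.0},
    {"date": date(2024, 2, 1), "x": 2.0, "y": 6.0},
    {"date": date(2024, 3, 15), "x": 4.0, "y": 12.0},
]

older, fresh = split_by_date(records, date(2024, 1, 1))
train, holdout = older[:3], older[3:]  # holdout carved out of the old data

model = fit_ratio_model(train)
holdout_error = mean_absolute_error(model, holdout)
fresh_error = mean_absolute_error(model, fresh)

print(f"error on carved-out holdout: {holdout_error}")
print(f"error on fresh data:         {fresh_error}")
```

In this contrived case the holdout looks perfect while the fresh data exposes the drift, which is exactly the gap the fresh-data test is meant to reveal.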
The second step, also intuitive, is to validate the model output against specific, well-understood situations. Where you already know how the answer ought to behave, the model should behave the same way.
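One way to turn this step into something repeatable is a small list of behavioral checks. The sketch below assumes a hypothetical flight-time model, `predict_flight_minutes`, with made-up coefficients; it stands in for whatever model you are testing. The checks encode two things most people would accept without argument: a longer route should not take less time, and a stronger headwind should not make the flight shorter.

```python
def predict_flight_minutes(distance_km, headwind_kmh):
    """Hypothetical stand-in model: 800 km/h cruise, slowed by headwind, plus 30 min overhead."""
    ground_speed = 800.0 - headwind_kmh
    return 30.0 + 60.0 * distance_km / ground_speed

def is_non_decreasing(values):
    """True if each value is at least as large as the one before it."""
    return all(a <= b for a, b in zip(values, values[1:]))

def check_known_behaviors(model):
    """Run the model through well-understood situations and list what it gets wrong."""
    failures = []

    by_distance = [model(d, 0.0) for d in (500.0, 1000.0, 2000.0, 4000.0)]
    if not is_non_decreasing(by_distance):
        failures.append("longer distance should not shorten the flight")

    by_headwind = [model(1000.0, w) for w in (0.0, 50.0, 100.0)]
    if not is_non_decreasing(by_headwind):
        failures.append("stronger headwind should not shorten the flight")

    return failures

# A deliberately broken model, to show the checks catching a problem.
def broken_model(distance_km, headwind_kmh):
    return 100.0 - distance_km / 100.0

print(check_known_behaviors(predict_flight_minutes))
print(check_known_behaviors(broken_model))
```

The value of such a list is less in the code than in the conversation it forces: domain experts supply the situations, and the model either respects them or has some explaining to do.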
The third step is to understand the robustness of the model against strange inputs. What if all the inputs were zero, or infinity, or completely absent? Does the model crash and burn, or does it take you to a safe place?
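A rough sketch of what a "safe place" can look like in code follows. The model here, `raw_model`, is a hypothetical linear score with invented feature names (`age`, `income`); the wrapper `safe_predict` is likewise an illustration, not a prescribed design. It refuses absent, non-numeric or non-finite inputs and returns a fallback value instead of raising an error or emitting a meaningless number.

```python
import math

REQUIRED_FIELDS = ("age", "income")  # hypothetical feature names

def raw_model(features):
    """Hypothetical stand-in for a trained model: a fixed linear score."""
    return 0.5 * features["age"] + 2.0 * features["income"]

def is_finite_number(value):
    """True for ints and floats that are neither NaN nor infinite (bools excluded)."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return math.isfinite(value)

def safe_predict(features, fallback=None):
    """Return a prediction, or the fallback when the input cannot be trusted."""
    if not features:
        return fallback  # input completely absent
    if not all(is_finite_number(features.get(name)) for name in REQUIRED_FIELDS):
        return fallback  # missing, non-numeric, NaN or infinite input
    result = raw_model(features)
    return result if math.isfinite(result) else fallback

# Probe the wrapper with the strange inputs described above.
probes = {
    "all zero": {"age": 0, "income": 0},
    "infinity": {"age": math.inf, "income": math.inf},
    "absent": None,
    "normal": {"age": 40, "income": 10},
}
for label, features in probes.items():
    print(label, "->", safe_predict(features))
```

Whether the fallback should be `None`, a conservative default or a hand-off to a human depends on the business; the sketch only shows that the decision is made deliberately rather than left to a stack trace.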
The fourth step is to look at the property of graceful degradation. What if most of the inputs are there, but one is missing? What if a binary input field receives a non-binary value? Does the model pack its bags and go on vacation, or does it give you something useful to live by?
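Graceful degradation can be sketched along the following lines. The feature names (`tenure_years`, `is_subscriber`), the default values and the scoring weights are all assumptions invented for this example. The idea is that a single missing or malformed field is replaced by a documented default, the prediction still goes out, and a warning travels with it so the consumer knows the answer is degraded.

```python
import math

# Hypothetical defaults, e.g. typical values seen in the training data.
DEFAULTS = {"tenure_years": 3.0, "is_subscriber": 0}

def normalize_binary(value):
    """Map common encodings to 0 or 1; return None if the value cannot be interpreted."""
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("1", "yes", "true", "y"):
            return 1
        if text in ("0", "no", "false", "n"):
            return 0
        return None
    if value in (0, 1):  # also accepts True/False and 0.0/1.0
        return int(value)
    return None

def predict_with_degradation(features):
    """Return (score, warnings), substituting defaults for unusable fields."""
    warnings = []

    tenure = features.get("tenure_years")
    usable = (
        isinstance(tenure, (int, float))
        and not isinstance(tenure, bool)
        and math.isfinite(tenure)
    )
    if not usable:
        tenure = DEFAULTS["tenure_years"]
        warnings.append("tenure_years missing or invalid; default used")

    flag = normalize_binary(features.get("is_subscriber"))
    if flag is None:
        flag = DEFAULTS["is_subscriber"]
        warnings.append("is_subscriber missing or not binary; default used")

    score = 0.25 * float(tenure) + 0.5 * flag  # hypothetical model weights
    return score, warnings

print(predict_with_degradation({"tenure_years": 5, "is_subscriber": "yes"}))
print(predict_with_degradation({"is_subscriber": 1}))
print(predict_with_degradation({"tenure_years": 4, "is_subscriber": 7}))
```

Returning the warnings alongside the score is a design choice: it keeps the model useful when the data is imperfect without hiding the fact that it had to improvise.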
A model that has been tested through these situations has a reasonable chance of performing well in a production environment. Even then, judge it over a sustained run rather than a handful of predictions: model outputs are games of probability, and while anyone can hit a jackpot now and then, in the end the house always makes the most money.
With this model in hand, you are ready for the more difficult part of the story: convincing your stakeholders of the model’s validity and earning their trust in its performance. For this, you will need the support of intuitive reasoning, explainability, proof and, finally, performance in a real situation. But that will be the subject of another post, on the all-important human angle.