We are living in the age of big data – I’m sure you are well aware of this as we check our heart rate on our Fitbits, let Spotify pick our music mix, and let Netflix recommend the next series to binge watch. With seemingly infinite data, we may ask ourselves: what’s the point of theory? Will machine learning take over the researcher’s role? Will this river of data finally simplify choice and optimize everything for us? The short answer, in my view, is no.
If there was ever a time we needed more theory, it is now. Amid the hype of machine learning, big data, AI, and all the buzzwords of the technological revolution, we are slowly forgetting the importance of strong theory. At the heart of any statistical method is the model – formed from economic theory – which is the driver of estimates.
This is why statistical methods, even AI, can never take the place of a researcher. We are the drivers of our research agenda. But this leads me to two concerns about the impact of (broadly speaking) machine learning and big data: 1. the kind of questions that we are asking, and 2. the thoughtful application of statistical methods.
On the first concern, I worry about a future in which economics values complex methodologies and better data for answering small questions over tackling the big questions. Economists have become lazy with our bigger and better data.
At the same time, access to this kind of data has opened up a whole world of more accurate answers. The role of the researcher is to direct the questions – to ask the important questions and consider their implications for the whole of society. This is not to say that small questions are unimportant – research is a culmination of many small questions. However, our role as researchers is to analyze results and aggregate them into the bigger picture. Let’s not forget the larger vision and goal of the questions we ask in our work, and keep reminding ourselves: what is the bigger-picture question that my work is trying to answer? Nor should we be afraid to ask the bigger questions, even if we don’t have the right kind of data, or the question is unfashionable (i.e., unlikely to be published).
The second concern is that we must remain open-minded about the statistical approaches that we use and, more importantly, be transparent about their weaknesses and limitations.
One major plague within the economics community is p-hacking our way to statistical significance – a problem that arises from the way frequentist methods (underlying most traditional econometric techniques) use p-values and confidence intervals for inference. It’s all too tempting, under publish-or-perish pressure, to search for stars. This hunt for significance has certainly contributed to our replication crisis. By p-hacking, we bias our results to reflect the results that we desire.
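To see why star-hunting inflates false positives, here is a minimal simulation (hypothetical numbers, numpy only): when the true effect is exactly zero, roughly 5% of regressions will still clear the |t| > 1.96 bar, so a researcher who runs many specifications and reports only the “significant” ones will always find stars.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_tests = 100, 1000

significant = 0
for _ in range(n_tests):
    x = rng.normal(size=n)
    y = rng.normal(size=n)          # the true effect of x on y is exactly zero
    beta = x @ y / (x @ x)          # OLS slope (no intercept, for simplicity)
    resid = y - beta * x
    se = np.sqrt(resid @ resid / (n - 1) / (x @ x))
    if abs(beta / se) > 1.96:       # the conventional 5% "stars" threshold
        significant += 1

print(f"'significant' results under a true null: {significant / n_tests:.1%}")
```

Run enough specifications and a handful of spurious stars is guaranteed; reporting only those is p-hacking.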
That is not to say we should never use these methods. Sometimes a traditional econometric approach is more computationally efficient and gives the least biased results; other times, machine learning approaches can provide the best results. What the “best” result is depends on the question and goal we have in mind. Often, the goal is the least biased estimate (along with all of the other properties we like, depending on the method) – for example, the impact of a minimum wage increase on poverty. But if our overall goal is predictive power/model fit – for example, which variables are the most important in predicting stock market movements – then machine learning may be far superior. Aptly summarized, it is b vs. R-squared: what is our research question trying to maximize, the least biased beta, or the best fit and most predictive power, the R-squared?
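The b vs. R-squared trade-off can be sketched in a few lines (a toy simulation, not any particular study): OLS gives an unbiased beta, while ridge regression deliberately shrinks – biases – the coefficients toward zero, which can pay off in out-of-sample prediction when there are many regressors relative to the sample size.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, p = 40, 500, 35    # many regressors relative to sample size

beta_true = np.zeros(p)
beta_true[:3] = [1.0, -0.5, 0.5]    # only a few features truly matter

X = rng.normal(size=(n_train, p))
y = X @ beta_true + rng.normal(size=n_train)
X_new = rng.normal(size=(n_test, p))
y_new = X_new @ beta_true + rng.normal(size=n_test)

# OLS: unbiased beta, but very high variance when p is close to n
b_ols = np.linalg.solve(X.T @ X, X.T @ y)
# Ridge: deliberately biased beta (shrunk toward zero), much lower variance
lam = 10.0
b_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

mse = lambda b: np.mean((y_new - X_new @ b) ** 2)
print(f"OLS   out-of-sample MSE: {mse(b_ols):.2f}")
print(f"Ridge out-of-sample MSE: {mse(b_ridge):.2f}")
```

If the question is the causal beta, the ridge coefficients are the wrong answer; if the question is prediction, the biased estimator wins.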
Economists are guilty of rarely maximizing predictive power (R-squared) in their models, even when the question at hand might call for it. However, machine learning techniques (often grouped under the umbrella of Bayesian statistics, which focuses on model fit and/or predictive power) are slowly becoming more widely used.
Of course, this is not without consequence. Blindly applying machine learning classification techniques to social problems has far-reaching effects due to biased parameters. Data analysis sans theory can be harmful, especially in the social sciences (or social applications), where these tools can influence policy and decision making.
A few examples have come to light, such as Amazon’s facial recognition technology, which has been shown to be biased, misclassifying Black women at much higher rates. This technology has been bought by police departments to identify suspects, which can lead to more false arrests. In this case, the training data contains more white men than any other group, so the system predicts white men more accurately but fails to classify Black women accurately. As research has shown, augmenting the sample selection could go a long way toward correcting this bias.
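A toy simulation makes the mechanism concrete (the groups, score distributions, and 90/10 split below are hypothetical illustrations, not Amazon’s data): a single classification threshold fit on a training set dominated by one group performs well on that group and worse on the underrepresented one, and rebalancing the training sample narrows the gap.

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_group(n, shift):
    """Scores for one group: negatives ~ N(shift, 1), positives ~ N(shift + 2, 1)."""
    labels = np.repeat([0, 1], n // 2)
    scores = rng.normal(loc=shift + 2.0 * labels, scale=1.0)
    return scores, labels

def fit_threshold(scores, labels):
    """Pick the cutoff that maximizes training accuracy (a one-parameter 'classifier')."""
    grid = np.linspace(scores.min(), scores.max(), 400)
    accs = [np.mean((scores > t) == labels) for t in grid]
    return grid[np.argmax(accs)]

def accuracy(t, scores, labels):
    return np.mean((scores > t) == labels)

# Group B's scores are shifted relative to group A's, so the best cutoffs differ.
test_A = sample_group(20000, shift=0.0)
test_B = sample_group(20000, shift=1.0)

# Imbalanced training set: 90% group A, 10% group B
sA, lA = sample_group(9000, 0.0)
sB, lB = sample_group(1000, 1.0)
t_imbal = fit_threshold(np.concatenate([sA, sB]), np.concatenate([lA, lB]))

# Balanced training set: 50/50
sA2, lA2 = sample_group(5000, 0.0)
sB2, lB2 = sample_group(5000, 1.0)
t_bal = fit_threshold(np.concatenate([sA2, sB2]), np.concatenate([lA2, lB2]))

print(f"imbalanced training: acc A = {accuracy(t_imbal, *test_A):.3f}, "
      f"acc B = {accuracy(t_imbal, *test_B):.3f}")
print(f"balanced training:   acc A = {accuracy(t_bal, *test_A):.3f}, "
      f"acc B = {accuracy(t_bal, *test_B):.3f}")
```

The imbalanced model maximizes overall training accuracy while quietly sacrificing the minority group – exactly the failure mode the facial recognition audits exposed.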
Machine learning has also been used nationally to inform decisions on releasing individuals on bail (and the price of that bail), which was shown to be biased against Black men, and even sentencing, which was shown to hand Black men harsher and longer prison sentences. In this case, the training data that informs the prediction (and assigns a probability of recidivism) reflects the institutional bias that exists in our society. Past judicial sentences are prejudiced decisions (given America’s history of institutional racism), so relying on past decisions can perpetuate unjust bias.
Traditional economists have long dealt with these issues by trying to identify the bias (sampling, selection, attrition, measurement error, and simultaneous causality are a few examples) and to understand its direction or potential magnitude. This approach should be extended to machine learning: understand its major weaknesses – how can these problems bias the end result?
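Measurement error is a good illustration of knowing the direction of a bias: classical measurement error in a regressor attenuates the OLS coefficient toward zero by the factor var(x) / (var(x) + var(u)). A minimal simulation (hypothetical parameters) shows the estimate landing at roughly half the true effect when the noise variance equals the signal variance:

```python
import numpy as np

rng = np.random.default_rng(3)
n, beta_true = 100_000, 2.0

x = rng.normal(size=n)                     # true regressor, variance 1
y = beta_true * x + rng.normal(size=n)     # outcome
x_obs = x + rng.normal(size=n)             # observed with measurement error, variance 1

# OLS of y on the mismeasured regressor
b_hat = np.cov(x_obs, y)[0, 1] / np.var(x_obs)
# Classical attenuation: plim(b_hat) = beta * var(x) / (var(x) + var(u)) = 2 * 1/2 = 1
print(f"true beta = {beta_true}, OLS estimate with measurement error = {b_hat:.2f}")
```

Knowing the sign of the bias, the researcher can at least bound the true effect; the same habit of reasoning should be applied to a machine learning pipeline’s training data and loss function.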
When you make a deal with the statistical devil, there is always a cost. No method is free from consequence. This is why it is very important to be open about potential bias and its consequences.
Above all, theory should inform our statistical models – we need to bring economic theory (and big questions) to the forefront of research.