It seems impossible to get published in the top 5 journals without a ‘model’ and to subject that to statistical testing against empirical data.
Judgements of university publication rates affect scores, which affect funding; tenure depends on funding. If you doubt this creates a huge bias in the priorities of departments, look at the Department of Geography at the University of East Anglia, which maintains dozens of professors in the most obscure specialisms of geomorphology simply to rank among the five most-published departments globally, to no gain for students (why would they need four or five professors of glaciation?).
The problem with this is that it encourages researchers to go data dredging – consciously or unconsciously. Only the most interesting results, those showing positive correlations, get published; negative results do not. This is publication bias.
A particularly glaring example comes from the field of behavioural psychology. Brian Wansink, one of its top researchers, published a blog post describing an experiment he had conducted at an all-you-can-eat buffet. Customers were randomly charged either $4 or $8; the hypothesis that those charged more would eat more was disproved.
“When [the grad student] arrived,” Wansink wrote, “I gave her a data set of a self-funded, failed study which had null results… I said, ‘This cost us a lot of time and our own money to collect. There’s got to be something here we can salvage because it’s a cool (rich & unique) data set.’ I had three ideas for potential Plan B, C, & D directions (since Plan A had failed).”
He then went on to publish three papers on the statistically significant results (p < 0.05) these fishing expeditions turned up. The papers horrified the statistical community, and dozens of his publications have since been scrutinised and glaring statistical errors found.
Why? Because this is data dredging, also known as p-hacking. Trawling data for significant p-values, rather than selecting the sample independently of the data used to formulate the hypothesis, enormously increases the chance of a type I error – a false positive. Economic data is particularly vulnerable: the number of sample countries is small, and all are to some degree covariate. Moreover, in economic data it is the ‘long tail’ events which have the most dramatic effects, but these can be swamped by the false positives which occur in the fat tail.
When we set a p-value threshold of, for example, 0.05, we are accepting a 5% chance that a result is a false positive – a statistically significant finding where, in reality, no actual correlation exists. While 5% (arbitrary as it is) may be acceptable for one test, if we run many tests on the same data that 5% threshold produces an accumulating number of false positives. Professors brainstorming potential correlations and building endless models are driving this probability ever higher. The FiveThirtyEight blog has an interactive tool where you can test this out, p-hacking away until you get the right p-values.
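The arithmetic of accumulating false positives is easy to demonstrate. The sketch below (illustrative, not from the original text) simulates many ‘research groups’, each testing twenty hypotheses on pure noise, so that every significant result is by construction a false positive:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate 1000 research groups, each testing 20 hypotheses on pure noise:
# two groups of 30 observations drawn from the SAME distribution, so any
# "significant" difference is by construction a false positive.
n_groups, n_tests, n = 1000, 20, 30
found_something = 0
for _ in range(n_groups):
    for _ in range(n_tests):
        a = rng.normal(size=n)
        b = rng.normal(size=n)
        # Welch-style t statistic, computed by hand to stay numpy-only
        t = (a.mean() - b.mean()) / np.sqrt(a.var(ddof=1) / n + b.var(ddof=1) / n)
        if abs(t) > 2.0:  # roughly the two-sided 5% threshold
            found_something += 1
            break  # a "publishable" result was found; stop dredging

# With 20 tries at roughly alpha = 0.05, the chance of at least one false
# positive is about 1 - 0.95**20, i.e. around 64%, not 5%.
print(found_something / n_groups)
```

The per-test error rate stays at 5%, but the chance that a group dredging twenty hypotheses finds *something* publishable is nearly two in three.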
It is like the following example.
1 out of 1000 drivers is driving drunk. The breathalyzer never fails to detect a truly drunk person, but for 50 out of the 999 sober drivers it falsely displays drunkenness. Suppose the police stop a driver at random and force him or her to take a breathalyzer test, and it indicates drunkenness. Assuming you know nothing else about the driver, how high is the probability that he or she really is drunk?
The answer, from Bayes' theorem, is only about 2%, because the false positives – the type I errors – swamp the true positives; in other words, the positive predictive value of the test is very low. How many papers in the top economics journals, however, even discuss the power and predictive value of their tests?
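That 2% figure can be checked directly with Bayes' theorem, using the numbers from the example:

```python
# Base rates from the breathalyzer example.
p_drunk = 1 / 1000            # prior: 1 in 1000 drivers is drunk
p_pos_given_drunk = 1.0       # the test never misses a truly drunk driver
p_pos_given_sober = 50 / 999  # ~5% of sober drivers falsely test positive

# Bayes' theorem: P(drunk | positive test)
#   = P(pos | drunk) * P(drunk) / P(pos)
p_pos = p_drunk * p_pos_given_drunk + (1 - p_drunk) * p_pos_given_sober
p_drunk_given_pos = p_drunk * p_pos_given_drunk / p_pos
print(round(p_drunk_given_pos, 3))  # ≈ 0.02
```

Despite a perfect detection rate, a positive reading is almost fifty times more likely to come from the large pool of sober drivers than from the tiny pool of drunk ones.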
A professor sits in his or her room with the grad students. They discuss many hypotheses and test them. The ones that bear out (significant p-values) are published; those that aren't are not. Publication bias. The problem here is ‘researcher degrees of freedom’: subjective choices driving the bias.
Once you introduce type I errors in this way, the chance that a published positive finding is false rises to around 61%. Ioannidis's paper on this – claiming most published findings in medical research were wrong – caused a crisis in the field. Similarly, we have a replication crisis in behavioural psychology and many other fields. Why is economics not torturing itself over this? We know that around 1 in 3 economic studies don't replicate – almost exactly the figure probability theory would suggest.
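The Ioannidis-style argument can be sketched with a back-of-envelope calculation. The numbers below are chosen for illustration (they are not Ioannidis's own, and the prior and power values are assumptions) but they show how a false-publication share around 60% arises quite naturally:

```python
# Illustrative assumptions (not Ioannidis's exact figures): suppose only
# 1 in 10 hypotheses a research group tests is actually true, tests are
# run at alpha = 0.05, and true effects are detected with power 0.30.
prior_true = 0.10
alpha = 0.05
power = 0.30

true_pos = prior_true * power          # true hypotheses found significant
false_pos = (1 - prior_true) * alpha   # false hypotheses found significant

# Share of published "significant" findings that are false, assuming
# only significant results get published:
false_share = false_pos / (true_pos + false_pos)
print(round(false_share, 2))  # → 0.6
```

With a low prior probability that any given hypothesis is true and modest power, most of what clears the significance filter is noise.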
Well, for a field that traditionally runs decades behind the modelling, philosophical and statistical advances in the rest of the social sciences, it should be no surprise that economics is congratulating itself on introducing errors other fields discovered long ago.
The empirical turn in economics is simply a symptom of bad modelling. Dimensional errors, stock–flow confusion and aggregation errors all proliferate, and in a world of data mining the tools for empirical research based on big data are easy to use.
The techniques to avoid p-hacking exist. You can form hypotheses from past events and test them only against future data. You can – like good data-mining practitioners – split the data in two, so that the set used for exploratory work to form hypotheses is entirely independent of the set used to test them. But both are hard to do with aggregated economic statistics. Economists should heed Koopmans' call to avoid ‘measurement without theory’ and equally avoid John Eatwell's conclusion that if the data does not concur with the model, to hell with the data.
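The exploratory/confirmatory split can be sketched in a few lines. In this toy example (names and data are hypothetical), one of twenty candidate predictors genuinely matters; the researcher dredges the exploratory half freely, then tests the single chosen hypothesis once on the held-out half:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data set: one outcome, 20 candidate predictors.
# Predictor 7 genuinely matters; the other 19 are pure noise.
n = 400
y = rng.normal(size=n)
X = rng.normal(size=(n, 20))
X[:, 7] = 0.5 * y + rng.normal(size=n)

# Split once, up front, before anyone looks at the numbers.
idx = rng.permutation(n)
explore, confirm = idx[:200], idx[200:]

# Dredge the exploratory half freely: pick whichever predictor looks best.
corrs = [abs(np.corrcoef(X[explore, j], y[explore])[0, 1]) for j in range(20)]
best = int(np.argmax(corrs))

# Test the single chosen hypothesis once, on data it has never seen.
# A real effect survives the held-out test; a dredged noise variable
# would collapse back toward zero here.
r_confirm = abs(np.corrcoef(X[confirm, best], y[confirm])[0, 1])
print(best, round(r_confirm, 2))
```

The confirmatory half is touched exactly once, so its 5% error rate actually means 5% – the dredging was all quarantined in the exploratory half.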
I will leave the final word to statistical experts Gelman and Loken:
In (largely) observational fields such as political science, economics, and sociology, replication is more difficult. We cannot easily gather data on additional wars, or additional financial crises, or additional countries. In such settings our only recommendation can be to more fully analyze existing data…
Or in other words, only time will tell. But there may be other solutions, and the gatherers of economic data can help. Suppose a statistical office gathers a large sample and splits it into two sets, A and B, without researchers knowing which set they receive. A researcher formulates a hypothesis on one half, and it is then tested – blindly – by a researcher who has access only to the other half. This is replication, and the statistical power of the combined A and B evidence is far higher than that of a single exploratory analysis of one large sample.