13 Flat priors: the strings attached
I have a feeling that one of the biggest stumbling blocks for people who consider Bayesian statistics is priors. This is something that all students worry about and feel insecure about. As a result, they would rather stay in the frequentist world of flat priors.
And flat priors have a lot going for them. They are convenient for mathematicians, as they make analytical derivations simpler. They are convenient for users of statistics, who do not have to worry about, or even think about, priors. And, if you do think about priors, flat priors look superior because they feel impartial. They do not impose any a priori knowledge and allow the results to be determined by the data alone. Thus, whatever results you get, you can claim that they are objective and untainted by the person who did the analysis.
However, flat priors have some strings attached. This may not be a deal-breaker (although it probably should be), but it is definitely something you should be aware of when you use them.
13.1 Flat priors are silly
Consider experimental data from a domain you are knowledgeable in. What is the biggest effect that you feel is borderline realistic? That is, an effect already so ridiculously large that anything larger must come from malfunctioning equipment or software, an error in the analysis, etc. For example, evoked potentials in EEG are measured in microvolts, so we can safely assume that any difference between them that falls outside the microvolt range must be artificial. Let’s say our threshold for a “real effect” is a way, way overly optimistic 1 millivolt, which is 1000 μV. To a specialist that already sounds ridiculous, but when we use flat priors we explicitly state that we believe equally strongly in a difference between evoked potentials on the scale of microvolts, millivolts, volts, or even billions of volts. Is an a priori belief that the human brain is equally capable of generating evoked potentials of microvolts and of billions of volts silly? It sure sounds silly to me. Forcing it onto a model does not make this belief any less silly.
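To make this concrete, here is a minimal sketch in Python with made-up numbers: a huge uniform prior (the closest proper stand-in for a flat prior) is compared with a weakly informative normal prior, and we ask how much prior probability each assigns to effects beyond the already absurd 1 mV threshold. The ±1 V range and the 100 μV standard deviation are assumptions for illustration only.

```python
import numpy as np
from scipy import stats

# Hypothetical "flat" prior on an ERP difference: uniform on +/- 1 volt, in microvolts.
# (A truly flat prior on the whole real line is improper; a very wide uniform is the
# closest proper stand-in.)
flat = stats.uniform(loc=-1e6, scale=2e6)      # support: +/- 1,000,000 uV = +/- 1 V

# A weakly informative alternative: most mass within a few hundred microvolts.
informative = stats.norm(loc=0, scale=100)     # sd = 100 uV (assumed, not from data)

threshold = 1000  # 1 mV in microvolts -- already absurdly large for an ERP difference

for name, prior in [("flat (uniform +/- 1 V)", flat), ("normal(0, 100 uV)", informative)]:
    mass_beyond = 2 * prior.sf(threshold)      # P(|effect| > 1 mV)
    print(f"{name:>22}: P(|effect| > 1 mV) = {mass_beyond:.3g}")
```

Under the wide uniform, essentially all of the prior mass (about 99.9%) sits on effects larger than anything a brain can produce; under the normal prior, that mass is effectively zero.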
13.2 Flat priors make you pretend that you are naïve
The other way to look at this is that by using flat priors you explicitly claim to be naïve with respect to the domain. You act as if you have no prior knowledge about the phenomenon, discarding any experience that you have acquired. Do you really feel that all those years of studying the subject are of no relevance? Do you really think that you cannot make a good prediction about at least the realistic range of the effect? You probably can. Even if you are very uncertain about it, the range from minus to plus infinity is awfully large and you can certainly do better than that. And yet, using flat priors implies that you are completely clueless and that there is not a single nugget of wisdom you possess that could aid your analysis.
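One simple way to turn even vague domain knowledge into a prior is to state a range you consider plausible and back out the corresponding normal distribution. The sketch below assumes, purely for illustration, that you judge the effect to almost certainly lie between −20 and +20 μV and treat that as a central 95% interval.

```python
from scipy import stats

# Hypothetical plausible range for the effect, treated as a central 95% interval.
lo, hi = -20.0, 20.0
mu = (lo + hi) / 2
sigma = (hi - lo) / (2 * 1.96)   # half-width of a 95% normal interval is 1.96 * sigma

prior = stats.norm(mu, sigma)
print(f"prior: Normal({mu:.1f}, {sigma:.2f})")
print("prior mass outside the stated range:", 2 * prior.sf(hi))
```

Even a rough guess like this is vastly more honest than claiming that all values from minus to plus infinity are equally credible.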
13.3 Flat priors tend to overfit
Even under the best circumstances (more on this below), flat priors do not restrain the model from fitting the sample as closely as possible. This means that such models will almost certainly overfit the data. Whether this is a big issue in any particular case depends on whether the noise exaggerates or attenuates the actual effect. I suspect that flat priors combined with fortunate noise are partly responsible for the plethora of reported strong effects that we cannot replicate. This is definitely something to keep in mind.
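Here is a small simulation sketch of that tendency, under assumed numbers: 15 observations, 10 predictors, and no real effects at all. The maximum-likelihood fit (the flat-prior analogue) is compared with a ridge fit, which corresponds to a normal prior centered at zero on the coefficients; the penalty strength is picked by hand just for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny simulated study: 15 observations, 10 noisy predictors, no true effects.
n, p = 15, 10
X_train, X_test = rng.normal(size=(n, p)), rng.normal(size=(1000, p))
y_train = rng.normal(size=n)             # pure noise: all true coefficients are zero
y_test = rng.normal(size=1000)

# "Flat prior" fit: ordinary least squares (the maximum-likelihood estimate).
b_flat = np.linalg.lstsq(X_train, y_train, rcond=None)[0]

# Regularized fit: ridge penalty, equivalent to a normal(0, tau) prior on coefficients.
lam = 10.0                               # penalty strength chosen by hand for illustration
b_ridge = np.linalg.solve(X_train.T @ X_train + lam * np.eye(p), X_train.T @ y_train)

def mse(b, X, y):
    return np.mean((y - X @ b) ** 2)

print("train MSE  flat:", mse(b_flat, X_train, y_train), " ridge:", mse(b_ridge, X_train, y_train))
print("test  MSE  flat:", mse(b_flat, X_test, y_test),  " ridge:", mse(b_ridge, X_test, y_test))
```

The unregularized fit hugs the training sample much more tightly, and pays for it with a noticeably worse error on new data.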
13.4 Flat priors are an exception
Although for most people flat priors probably feel like the norm, they are applicable only in very specific cases of (relatively) few predictors and plenty of data. Remember all the advice about having a certain minimum number of observations per variable? That is because you need them to afford flat priors and, more generally, the lack of regularization. However, that magic sweet spot is fairly small, and flat priors become extremely dangerous as soon as you step out of it.
Do you have an observational study with very little data? See, for example, “The Ecological Detective” by Hilborn and Mangel, who describe plenty of situations where this is unavoidable. Flat priors will lead to extreme overfitting, to the point of models being not just useless but dangerously misleading. The same book shows how the use of proper priors can rescue the analysis.
Do you have a lot of data but also a lot of predictors? You will probably end up overfitting. The field of machine learning invests a lot of time and energy into regularization. Given the sheer number of predictors, practitioners cannot set priors by hand and instead use other, method-specific forms of batch regularization: lasso or ridge penalties on coefficient weights, pruning trees, dropping out neurons, etc.
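As a rough sketch of what such batch regularization looks like in practice, the example below uses scikit-learn's Lasso with a single penalty applied to all coefficients at once. The data, the number of predictors, and the penalty value are all hypothetical choices for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso, LinearRegression

rng = np.random.default_rng(1)

# Many predictors, only a few of which matter (a hypothetical setup).
n, p = 100, 50
X = rng.normal(size=(n, p))
true_beta = np.zeros(p)
true_beta[:3] = [2.0, -1.5, 1.0]           # only the first three predictors are real
y = X @ true_beta + rng.normal(size=n)

ols = LinearRegression().fit(X, y)          # flat-prior analogue: no shrinkage at all
lasso = Lasso(alpha=0.1).fit(X, y)          # one penalty applied to every coefficient at once

print("non-zero OLS coefficients:  ", int(np.sum(np.abs(ols.coef_) > 1e-8)))
print("non-zero lasso coefficients:", int(np.sum(np.abs(lasso.coef_) > 1e-8)))
```

No coefficient gets an individually chosen prior; the single penalty does the shrinking wholesale, which is exactly what makes it workable with hundreds or thousands of predictors.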
In short, you can afford flat priors and no regularization only if you restrict yourself to fairly specific kinds of data sets. That is not the norm; that is the exception.
13.5 The irony of power analysis
Even if you are fond of using flat priors and you are not worried about any of the issues raised above, you still need to think about proper priors once in a while. Specifically, whenever you need to perform a power analysis. Here you cannot postulate silly things while remaining “objective and impartial”; you need to use your domain knowledge to specify the sign and magnitude of the effect in order to estimate the sample size you need. This is why power analysis is either extremely easy, if you know your priors, or extremely hard, if you do not. Thus, even “flat priors” people cannot avoid using proper priors, and I suspect that regularly thinking about priors in your domain makes it much easier to define them for a power analysis.
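For illustration, here is a minimal simulation-based power calculation. The assumed effect of 5 μV, the 10 μV noise level, and the group size are hypothetical stand-ins for the kind of domain judgment a power analysis forces you to make.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Assumed (hypothetical) effect: a 5 uV difference with 10 uV of trial-to-trial noise.
effect, sd, n_per_group, n_sims = 5.0, 10.0, 30, 2000

hits = 0
for _ in range(n_sims):
    a = rng.normal(0.0, sd, n_per_group)
    b = rng.normal(effect, sd, n_per_group)
    if stats.ttest_ind(a, b).pvalue < 0.05:   # did this simulated study "detect" the effect?
        hits += 1

print(f"estimated power at n = {n_per_group} per group: {hits / n_sims:.2f}")
```

Note that none of this works without committing to a concrete effect size, which is precisely the prior knowledge that flat priors pretend you do not have.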
13.6 Conclusions
My hope is that the notes above were able to show that flat priors are neither universal, nor the best priors, nor the norm. They are something you can afford only under very specific circumstances. Of course you can use them, but you should at least make a mental note of why you think they are applicable in that particular case, what advantages they have over the alternatives, and what the costs of using them in the analysis are.