COVID-19 Data Dives: The Banana Peel of COVID-19 Research

Tim T. Morris, MSc, PhD


May 19, 2020

Medscape has asked top experts to weigh in on the most pressing scientific questions about COVID-19. Check back frequently for more COVID-19 Data Dives.

Several of my colleagues and I at the Medical Research Council (MRC) Integrative Epidemiology Unit at the University of Bristol have recently written and uploaded a preprint discussing collider bias (also called selection bias, sampling bias, or ascertainment bias), with a focus on COVID-19 research.

We're of the opinion that collider bias represents a major problem in many current studies that aim to determine the risk factors underlying COVID-19 infection and severity. In a Twitter post, I referred to it as the big banana peel that a lot of studies may be slipping on.

A collider is a variable that is influenced by two other variables of interest (they "collide" in a directed acyclic graph) (Figure 1). Unfortunately, unlike many of the biases that we tend to be well aware of, such as confounding and generalizability bias, collider bias is quite unintuitive. It can be introduced into a study automatically whenever the variables of interest influence whether a participant is sampled, that is, whether an observation is present in a dataset. For COVID-19 research, this means that risk factors and infection could appear related in samples even if these associations don't hold in the larger population.

Figure 1.

A thought experiment can help to demonstrate collider bias due to sample selection. Suppose that sporting ability and academic ability are both normally distributed and independent in the population. The normal distributions underlying these variables tell us that some people are high on sporting ability and some people are high on academic ability; but as a general rule, those high on one will not necessarily be high on the other. That is, sporting ability has no bearing whatsoever on academic ability, and vice versa (Figure 2).

Figure 2.

Let's now suppose that a prestigious, highly selective school chooses to enroll children who have high sporting or academic ability. Let's say that the school has sufficient capacity to enroll the top 10% of pupils from the population based on their combined sporting and academic scores (Figure 3).

Figure 3.

Remember, sporting and academic ability are independent in the population, so by selecting from the top of these distributions, we ensure that enrolled pupils are likely to be either sporting or academic. This selection introduces a negative correlation in our school even though this correlation does not exist in the wider population from which our school was sampled. If we analyze sporting and academic ability in the school, we will conclude that the two are inversely related. If you see a pupil in the school's uniform swing to kick a football, miss, and fall on their arse, then you can reliably conclude that they must be really smart. However, if you see a child in the uniform of another school (or no school uniform) make this sporting faux pas, you know that you cannot infer anything about their academic ability.

It is through our selection on this nonrandomly generated sample that collider bias is induced; we have automatically conditioned on the collider merely by using the school dataset. This is an extreme example, but it nicely demonstrates how observed associations in-sample may not accurately reflect associations out of sample or in the general population.
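The school thought experiment can be reproduced in a few lines of Python. The sketch below is illustrative only: the population size, the random seed, and the function names (pearson, simulate_school) are assumptions made for this example and are not taken from the preprint. It draws two independent standard normal scores per child, enrolls the top 10% on their sum, and compares the correlation in the whole population with the correlation inside the school.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def simulate_school(n_children=100_000, top_fraction=0.10, seed=2020):
    """Return (correlation in population, correlation among enrolled pupils).

    All parameter values are hypothetical choices for illustration.
    """
    rng = random.Random(seed)
    # Sporting and academic ability: independent standard normal scores.
    population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n_children)]
    # The school enrolls the top fraction on the combined score.
    ranked = sorted(population, key=lambda child: child[0] + child[1], reverse=True)
    enrolled = ranked[: int(n_children * top_fraction)]
    r_population = pearson(*zip(*population))
    r_school = pearson(*zip(*enrolled))
    return r_population, r_school


if __name__ == "__main__":
    r_population, r_school = simulate_school()
    print(f"population correlation: {r_population:.3f}")
    print(f"in-school correlation:  {r_school:.3f}")
```

Under these assumptions the population correlation should sit close to zero, while the correlation among enrolled pupils comes out strongly negative (roughly -0.7), even though nothing links the two abilities causally.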

The Effect of Collider Bias on COVID-19 Research

How does this relate to coronavirus research? Well, many of the current COVID-19 datasets rely on nonrandom participation with strong selection pressures. We present a few different examples in the preprint to highlight this, but there are many more. For example, in the United Kingdom, those tested are more likely to be healthcare and essential workers with a high risk for infections, or hospitalized patients with severe symptoms (Figure 4).

Figure 4.

These selection pressures can cause some strange associations and may be partly responsible for some of those already observed, such as smoking appearing protective and ACE inhibitors appearing harmful.

To investigate whether it was possible to observe strong effects of smoking on COVID-19 infection as a result of selective sampling, we ran some simulations. These were modeled under a similar scenario to one of the smoking papers but with no true underlying causal relationship between smoking and infection. The simulations demonstrated that one could observe twofold protective/risk effects of smoking on COVID-19 infection purely because of selective sampling (Figure 5).
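The general idea behind such a simulation can be sketched as follows. This is not the simulation from the preprint: the prevalence of smoking and infection, the testing probabilities, and the function names (odds_ratio, simulate_testing) are all hypothetical values chosen for illustration. Smoking and infection are generated independently, so the true odds ratio is 1, and both raise the chance of being tested. The testing probabilities are combined additively, which is itself an assumption; the size and direction of the bias depend on how selection is modeled.

```python
import random


def odds_ratio(rows):
    """Odds ratio of infection, smokers vs nonsmokers, from (smoker, infected) pairs."""
    a = sum(1 for smoker, infected in rows if smoker and infected)
    b = sum(1 for smoker, infected in rows if smoker and not infected)
    c = sum(1 for smoker, infected in rows if not smoker and infected)
    d = sum(1 for smoker, infected in rows if not smoker and not infected)
    return (a * d) / (b * c)


def simulate_testing(n=200_000, seed=19):
    """Return (odds ratio in population, odds ratio among those tested).

    Every probability below is a made-up value for illustration only.
    """
    rng = random.Random(seed)
    population, tested = [], []
    for _ in range(n):
        smoker = rng.random() < 0.20    # hypothetical smoking prevalence
        infected = rng.random() < 0.10  # hypothetical infection prevalence, independent of smoking
        # Hypothetical selection model: both traits raise the chance of a test.
        p_tested = 0.05 + 0.05 * smoker + 0.40 * infected
        population.append((smoker, infected))
        if rng.random() < p_tested:
            tested.append((smoker, infected))
    return odds_ratio(population), odds_ratio(tested)


if __name__ == "__main__":
    or_population, or_tested = simulate_testing()
    print(f"odds ratio, whole population: {or_population:.2f}")
    print(f"odds ratio, tested only:      {or_tested:.2f}")
```

With these made-up parameters, the odds ratio in the full population should be close to 1, while among those tested it should fall to roughly 0.55, an apparent near-twofold protective effect produced by selection alone.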

Figure 5.

To investigate whether there could be selective sampling into COVID-19 testing within the UK Biobank, one of the largest studies with data linkage to testing data, we examined associations between 2556 traits and being tested for COVID-19 (note that this is being tested, not a positive test result). We found that 32% of the traits were associated with being tested for COVID-19 (false discovery rate < .05), including frailty, medications used, hypertension, and socioeconomic status. The QQ plot shows the enrichment for associations (Figure 6).
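For readers unfamiliar with false discovery rate thresholds, the sketch below shows one common way to apply them when many traits are tested at once: the Benjamini-Hochberg step-up procedure. The text does not say which procedure was used in the UK Biobank analysis, so treating it as Benjamini-Hochberg is an assumption, and the function name benjamini_hochberg is hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Flag which p-values are declared associations at the given FDR.

    Benjamini-Hochberg step-up procedure: sort the p-values, find the
    largest rank k with p_(k) <= (k / m) * fdr, and reject the k smallest.
    Returns a list of booleans in the same order as the input.
    """
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_k = rank
    rejected = [False] * m
    for idx in order[:largest_k]:
        rejected[idx] = True
    return rejected
```

Calling the function on a list of per-trait p-values returns one flag per trait; the share of True values is the kind of figure reported above as the percentage of traits associated with being tested.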

Figure 6.

Are these observed associations true or the result of collider bias? It's difficult to tell from the data, but there are some sensitivity analyses that can be run. However, the extent of sensitivity analyses that one can run depends on data that are available on nonparticipants.

The only way to be completely confident that collider bias doesn't underlie observed associations is to use samples that are not under selection pressures, meaning representative population samples with random participation and attrition patterns. Clearly, such samples are exceedingly rare, so it is important that we do everything we can to ensure that the data samples we select for analyses are as representative and free from strong selection pressures as possible.

Beyond this, we must do our best to hold in mind the likely selection pressures when interpreting (COVID-19) data. Gareth Griffiths, PhD, also created a brilliant app where you can vary different selection parameters to examine the effects of collider bias on associations.

So, in summary, there is potential for collider bias to distort associations and undermine our understanding of the risk factors for COVID-19 infection and severity when using highly selected samples. It is of vital importance that studies and their associations between variables of interest are interpreted in light of this.

Results from samples that are probably not representative of the target population and could have been subject to strong selection pressure should be treated with caution by scientists and policymakers. Naive interpretation of results from such studies could lead to public health decisions that fail or even cause unintentional harm.

Gemma Sharp, PhD, Lindsey Pike, PhD, and I have also tried to summarize the paper in an accessible blog.

Tim Morris, PhD, is a senior research associate at the MRC Integrative Epidemiology Unit at the University of Bristol and the Bristol Medical School. His doctoral research focused on the role of unobserved confounding in epidemiologic research and the use of complex statistical modeling to minimize this. Follow him on Twitter.
