The Missing COVID-19 Demographic Data

A Statewide Analysis of COVID-19-Related Demographic Data From Local Government Sources and a Comparison With Federal Public Surveillance Data

Angel Aliseda-Alonso, MPP; Sara Bertran de Lis, PhD; Adam Lee; Emily N. Pond, MPH; Beth Blauer, JD; Lainie Rutkow, JD, PhD, MPH; Jennifer B. Nuzzo, DrPH, SM


Am J Public Health. 2022;112(8):1161-1169. 


Abstract and Introduction


Objectives: To collect and standardize COVID-19 demographic data published by local public-facing Web sites and analyze how this information differs from Centers for Disease Control and Prevention (CDC) public surveillance data.

Methods: We aggregated and standardized COVID-19 data on cases and deaths by age, gender, race, and ethnicity from US state and territorial governmental sources between May 24 and June 4, 2021. We describe the standardization process and compare it with the CDC's process for public surveillance data.

Results: As of June 2021, the CDC's public demographic data set included 80.9% of total cases and 46.7% of total deaths reported by states, with significant variation across jurisdictions. Relative to state and territorial data sources, the CDC consistently underreports cases and deaths among African American and Hispanic or Latino individuals and overreports deaths among people older than 65 years and White individuals.

Conclusions: Differences exist in amounts of data included and demographic composition between the CDC's public surveillance data and state and territory reporting, with large heterogeneity across jurisdictions. A lack of standardization and reporting mechanisms limits the production of complete real-time demographic data.


The impact of the COVID-19 pandemic in the United States has not been equal across different demographic groups. Multiple studies have shown that US racial and ethnic minority populations have a proportionally higher number of COVID-19 cases,[1,2] higher mortality rates,[3–6] and lower access to testing.[7,8] Also, studies from other countries have shown that although the prevalence of COVID-19 is similar between males and females, males have higher mortality rates.[9–11] Advanced age is a significant risk factor for severe illness and death, with adults older than 65 years accounting for 75% of all COVID-19 deaths in the United States.[12]

Most epidemiological studies of demographic characteristics of cases, hospitalizations, and deaths rely on data from death certificates[3] or specific populations from metropolitan areas,[1,5,7] hospitals and health systems with high-quality data,[4,9] or data from foreign countries.[10,11,13] These data sources are informative but incomplete. They may be limited to specific populations, may lack subnational representativeness, and may not be updated rapidly enough to inform mitigation measures for specific populations.[14]

At the state and local levels, hospitals, health care providers, and laboratories report individualized data to health departments through a mandatory process known as "case reporting."[15] Using case reports, local health departments have created public-facing dashboards, data repositories, or Web sites with COVID-19 aggregated counts and demographic data. However, these dashboards vary considerably in the availability and presentation of data. Therefore, comparing and tracking these data require that they be collected from different sites, organized, standardized, and consolidated in a single data repository.
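The standardization step described above can be illustrated with a minimal Python sketch. It assumes each dashboard publishes aggregated counts under its own labels; the label map, category names, state names, and counts are hypothetical and are not the scheme or data used in this study.

```python
from collections import defaultdict

# Hypothetical mapping from source-specific labels to a common category set.
# These labels and categories are illustrative assumptions only.
RACE_LABEL_MAP = {
    "black": "Black or African American",
    "black or african american": "Black or African American",
    "african american": "Black or African American",
    "white": "White",
    "caucasian": "White",
    "asian": "Asian",
    "unknown": "Unknown",
    "not reported": "Unknown",
    "missing": "Unknown",
}


def standardize_counts(raw_counts):
    """Collapse source-specific labels into common categories, summing counts.

    Labels not found in the map are placed in "Other" rather than dropped,
    so the jurisdiction total is preserved.
    """
    standardized = defaultdict(int)
    for label, count in raw_counts.items():
        key = label.strip().lower()
        standardized[RACE_LABEL_MAP.get(key, "Other")] += count
    return dict(standardized)


# Two made-up dashboards reporting the same kind of data under different labels.
state_a = {"Black": 120, "White": 300, "Not Reported": 80}
state_b = {"African American": 45, "Caucasian": 150, "Asian": 20, "Multiracial": 5}

# A single repository keyed by jurisdiction, all in the common category set.
repository = {
    "State A": standardize_counts(state_a),
    "State B": standardize_counts(state_b),
}
```

Routing unmapped labels to "Other" instead of discarding them is a design choice made here so that totals remain comparable across jurisdictions; a production pipeline would likely also log unmapped labels for review.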

By contrast, the US Centers for Disease Control and Prevention (CDC) collects deidentified patient-level data, including demographic characteristics, through a reporting mechanism called "case notification."[15] Using these patient-level data, the CDC produces the COVID-19 Case Surveillance Public Use Data with Geography data set. This data set contains 19 different characteristics for each COVID-19 case shared with the CDC, including demographics and geography (state and county), exposure history, and disease severity indicators. However, case notification is slower, voluntary, and less complete, as it depends on each jurisdiction's reportable conditions. Moreover, the CDC follows a privacy protection review protocol that redacts specific information—including demographic characteristics—to reduce the risk of reidentification.[16]

Several independent efforts to gather and publish comprehensive race and ethnicity data from each jurisdiction's health department in a single publicly available aggregator have also emerged outside CDC sources. Examples of these efforts include the COVID Racial Data Tracker from the COVID Tracking Project and the Boston University Center for Antiracist Research,[17] The Color of Coronavirus project from the APM Research Lab,[18] and the COVID-19 Vaccine Monitor Dashboard from the Kaiser Family Foundation.[19] However, those efforts concluded in March 2021.

Despite the many advantages of having COVID-19 demographic data to mitigate disparities,[20] it is not well understood how the various public sites that report demographic data compare with one another. Moreover, it is unclear whether demographic data are complete and timely and whether they show consistent trends. To understand the impact of COVID-19 across different demographic groups at the state and territorial levels, the Johns Hopkins Coronavirus Resource Center (CRC) started collecting, processing, and publishing demographic data related to COVID-19 outcomes from state and territorial sources in April 2021.[21] Since then, the CRC has routinely gathered and standardized these data so that the diverse, publicly available data from all US states and territories can be compiled in a comprehensive, accurate, and uniform manner.

As part of this effort, we sought to understand how these data from public-facing state and territorial Web sites compare with the national aggregation published regularly by the CDC. Here we describe the methods used to collect and standardize COVID-19 demographic data from various local sources and compare the standardized data set with a similar publicly available data set from the CDC, focusing on the demographic composition of cases and deaths and the proportion of missing data.
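The kind of comparison described here can be sketched in Python as follows. This is an illustration under stated assumptions, not the study's actual procedure: the function names, the choice to compute group shares among records with known demographics, and all counts are hypothetical.

```python
def coverage(federal_total, state_total):
    """Fraction of state-reported records that appear in the federal data set."""
    return federal_total / state_total


def composition(counts, unknown_label="Unknown"):
    """Return (percentage share of each known group, percentage missing).

    Assumption: shares are computed among records with known demographics,
    and the missing percentage is computed over all records.
    """
    total = sum(counts.values())
    unknown = counts.get(unknown_label, 0)
    known_total = total - unknown
    shares = {
        group: 100 * n / known_total
        for group, n in counts.items()
        if group != unknown_label
    }
    missing_pct = 100 * unknown / total
    return shares, missing_pct


# Made-up standardized counts for one jurisdiction from each source.
state_counts = {"Black or African American": 200, "White": 600, "Unknown": 200}
federal_counts = {"Black or African American": 120, "White": 480, "Unknown": 200}

state_shares, state_missing = composition(state_counts)
federal_shares, federal_missing = composition(federal_counts)

# Fraction of the state's records that the federal data set includes.
cov = coverage(sum(federal_counts.values()), sum(state_counts.values()))

# Percentage-point difference in each group's share (federal minus state).
# A negative value means the group is a smaller share of the federal data.
share_diff = {g: federal_shares[g] - state_shares[g] for g in state_shares}
```

With these invented numbers, the federal data set covers a fraction of the state's records and gives one group a smaller share than the state source does, which is the pattern of discrepancy the comparison is designed to surface.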