Personalized Medicine for Chronic, Complex Diseases: Chronic Obstructive Pulmonary Disease as an Example

Josiah E Radder; Steven D Shapiro; Annerose Berndt


Personalized Medicine. 2014;11(7):669-679. 

Combining Genotype & Phenotype for Personalized Medicine

Collecting both phenotypic and genotypic information from patients with chronic disease will allow us to classify these patients more precisely, using attributes that predict outcomes or therapeutic response, and to return these results to the patient. To achieve this goal, we must not only collect all of this information but also store it in an integrated, secure and still easily accessible manner. Ideally, this will occur in a centralized repository of information, commonly called an enterprise data warehouse (EDW). When carefully designed, an EDW can provide a secure location for integrated information from multiple sources, together with a secondary level on which deidentified data can be provided to researchers and providers through analytics and business intelligence tools.[67] Such tools, whether self-designed or commercially available from companies with significant experience in 'big data' analysis, such as Oracle (CA, USA) and IBM (NY, USA), are specifically designed for the analysis and visualization of large datasets.[68]

To achieve a fully functional EDW, data quality issues must be addressed. The 'meaningful use' guidelines outlined in the HITECH Act have increased the amount of recorded information and standardized much of what data should be stored. Health information technologies have proliferated to handle these data, but with little standardization of how the data are stored. Thus, semantic interoperability between these systems has become a significant challenge to public health.[69] Furthermore, variability in how different users in different settings utilize and access the EHR[70] can affect data quality even before interoperability is achieved. In a study of quality issues in the EHR, incompleteness was shown to be a major challenge, with only 51% of patients with ICD-9 diagnoses of pancreatic cancer having supporting pathological documentation.[71] These quality concerns have been described as the true 'big data problem' for healthcare.[72]
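The kind of completeness measure reported above can be illustrated with a minimal sketch. It is not the method used in the cited study; the record structure, the function name and the toy patients are all hypothetical, and the only external fact assumed is that ICD-9 codes beginning with 157 denote malignant neoplasm of the pancreas.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PatientRecord:
    """Hypothetical, highly simplified view of one patient's EHR data."""
    patient_id: str
    diagnosis_codes: frozenset  # ICD-9 codes recorded for the patient
    has_pathology_report: bool  # True if supporting pathology is on file


def documentation_completeness(records, code_prefix):
    """Fraction of patients with a diagnosis code starting with
    `code_prefix` who also have a supporting pathology report.

    Returns None when no patient carries a matching diagnosis code.
    """
    diagnosed = [
        r for r in records
        if any(code.startswith(code_prefix) for code in r.diagnosis_codes)
    ]
    if not diagnosed:
        return None
    supported = sum(1 for r in diagnosed if r.has_pathology_report)
    return supported / len(diagnosed)


# Toy data, invented for illustration only.
records = [
    PatientRecord("p1", frozenset({"157.9"}), True),
    PatientRecord("p2", frozenset({"157.0"}), False),
    PatientRecord("p3", frozenset({"496"}), False),  # no pancreatic cancer code
    PatientRecord("p4", frozenset({"157.1", "250.00"}), True),
]

completeness = documentation_completeness(records, "157")
print(f"{completeness:.0%} of coded patients have supporting pathology")
```

Tracking a ratio of this kind per diagnosis and per site is one simple way to make incompleteness visible before records are merged into an EDW.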

The Electronic Health Records for Clinical Research (EHR4CR) project, which is working to combine health records from multiple European countries into a central repository for clinical research, offers an example of some of these issues. Funded by the European Innovative Medicines Initiative (IMI), the project involves a consortium of ten pharmaceutical companies, 24 public partners and two subcontractors working together with the major aim of establishing clinical research processes for existing data in EHRs across European countries.[73,74] Reports from this coalition have demonstrated the breadth of this challenge: even the choice of which structured data elements should be inventoried for clinical trials required significant input from a large group of experts.[75] However, once these elements were defined, the group was able to demonstrate whether a given site in the network was feasible for retrospective trials based on the information the elements contained, an impressive feat considering the multiple languages and hospitals from which they were sourced.

The most effective method for storing genetic data and integrating them into an EHR or EDW is even less well defined. Whole-exome sequencing at an average read depth of 30× produces approximately 5 GB of data, while whole-genome sequencing at the same depth produces closer to 100 GB of raw data. No other currently utilized diagnostic produces this quantity of data, and current implementations of the EHR, and even many EDWs, are not equipped to handle it. NGS is also an unusual diagnostic in that our understanding of the results may change over time, and this may change the way we treat a patient, even long after the date of the original diagnostic.[76] Thus, repositories for NGS data must have vast storage capacity but also remain accessible so that the results can be updated. Some have suggested that this will require a significant shift from our current human-centric approach to genetic analysis toward greater trust in computer-centric approaches.[76] To make this shift, EDWs will need to be carefully designed either to handle large quantities of data internally or to interact with external storage systems in a way that allows for the accurate linkage of results and rapid access for both research and clinical purposes.
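To give a sense of scale, the per-patient figures above can be multiplied out for a cohort. The sketch below is back-of-the-envelope arithmetic only: the cohort size of 10,000 patients and the `copies` parameter (for example, a backup copy) are assumptions made for illustration, and decimal units (1 TB = 1000 GB) are used.

```python
# Approximate per-patient data volumes at 30x depth, taken from the text.
EXOME_GB = 5
GENOME_GB = 100


def cohort_storage_tb(n_patients, gb_per_patient, copies=1):
    """Estimate storage in terabytes (1 TB = 1000 GB) for a cohort.

    `copies` is a hypothetical multiplier for redundant copies such as
    backups; it is not a figure taken from the article.
    """
    return n_patients * gb_per_patient * copies / 1000


# Hypothetical cohort of 10,000 sequenced patients.
print(cohort_storage_tb(10_000, EXOME_GB))             # 50.0
print(cohort_storage_tb(10_000, GENOME_GB))            # 1000.0
print(cohort_storage_tb(10_000, GENOME_GB, copies=2))  # 2000.0
```

Under these assumptions, whole-genome data for such a cohort would occupy roughly a petabyte before any redundancy, which illustrates why linkage to external storage may be preferable to holding raw reads inside the EDW itself.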

These challenges are currently being explored in depth by the Electronic Medical Records and Genomics (eMERGE) Network, which is funded by the National Human Genome Research Institute (NHGRI) and consists of nine institutions with the goal of integrating EHR data from all of the institutions and combining them with collected genetic data. A report on the original five sites elucidated EHR quality problems, including a significant need for natural language processing at those sites that utilized text-based EHRs, the lack of certain elements as structured data, including smoking status and allergies, and the variable capture of other 'meaningful use' data.[77,78] When genetic data were added, further problems emerged: sites used different genotyping technologies, and merging data from these disparate technologies, imputing missing genotypes and correcting for strand orientation all required attention.[79] The group has already overcome many of these problems for genotypic data, however, and some of these successes are outlined in Box 1. Other creative ways of approaching data quality problems have allowed the group to produce numerous publications demonstrating the power of EHR-based phenotyping and phenotype-based association studies.[80–82]
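The strand orientation problem mentioned above can be made concrete with a simplified sketch. This is not the eMERGE pipeline; the function names and return labels are hypothetical. The idea is that a site reporting a variant on the opposite DNA strand will show complementary alleles (A/G appears as T/C), so the alleles can be flipped to match a reference. Variants whose two alleles are complements of each other (A/T or C/G) look identical on both strands and cannot be resolved by allele comparison alone.

```python
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def flip(alleles):
    """Return the alleles as read from the opposite DNA strand."""
    return tuple(COMPLEMENT[a] for a in alleles)


def is_strand_ambiguous(alleles):
    """True for A/T and C/G variants, which read the same on both strands."""
    first, second = alleles
    return COMPLEMENT[first] == second


def harmonize(site_alleles, ref_alleles):
    """Express one site's alleles on the reference strand.

    Returns a (status, alleles) pair, where status is one of
    'same', 'flipped', 'ambiguous' or 'mismatch'; alleles is None
    when the variant cannot be safely harmonized.
    """
    if is_strand_ambiguous(ref_alleles):
        return "ambiguous", None
    if set(site_alleles) == set(ref_alleles):
        return "same", tuple(site_alleles)
    flipped = flip(site_alleles)
    if set(flipped) == set(ref_alleles):
        return "flipped", flipped
    return "mismatch", None


print(harmonize(("T", "C"), ("A", "G")))  # ('flipped', ('A', 'G'))
print(harmonize(("A", "T"), ("A", "T")))  # ('ambiguous', None)
```

In this sketch, ambiguous and mismatched variants are flagged, not guessed at, since a silent error at this step would propagate into every downstream association analysis.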

Large-scale attempts to integrate diverse datasets, including inputs from multiple EHRs and other previously untested sources, such as genotyping, will pave the way for better phenotyping via retrospective studies and for improved genetic studies, both of which are necessities for personalizing the treatment of chronic disease. As groups such as EHR4CR and eMERGE share what they have learned, other organizations will be able to follow suit and contribute their own lessons. Furthermore, as large corporations accustomed to dealing with complex datasets continue to develop analytics solutions that take these lessons into account, they will provide the technological basis for an EDW that will truly alter clinical practice.