Whole Exome Sequencing in Inborn Errors of Immunity

Use the Power but Mind the Limits

Giorgia Bucciol; Erika Van Nieuwenhove; Leen Moens; Yuval Itan; Isabelle Meyts


Curr Opin Allergy Clin Immunol. 2017;17(6):421-430. 

Whole Exome Sequencing and Whole Genome Sequencing Analysis Pipeline

The raw data output files of WES and WGS, generated by high-throughput sequencing instruments such as the Illumina Genome Analyzer (San Diego, California, US), are in FASTQ format. Burrows–Wheeler Aligner[14] is often used to align the reads to the reference genome, resulting in Binary Alignment Map files. In the next step, a Variant Call Format file, containing the chromosomal position and nucleotide change for each variant, is created using, for instance, the Genome Analysis Toolkit.[15] Finally, the Variant Call Format files (which already contain information on sequencing quality and read depth) can be further annotated with different tools available online to simplify data analysis. Annovar,[16] SnpEff,[17] Variant Effect Predictor,[18] and others add information about a specific variant based on available data: its position in the gene, the amino acid change, the molecular effect of the variant (missense, nonsense, and so on), the allele frequency in different populations, and the protein damage prediction. The Encyclopedia of DNA Elements (ENCODE) is a useful resource for annotating intergenic variants.[19] The allele frequency of the variant, which can be derived from publicly available databases, is a fundamental tool for filtering out common polymorphisms.[20,21] BioGPS[22] and Protein Atlas[23] show the expression of the gene and of the protein it encodes in human tissues and cell lines, whereas the Evolutionary Conserved Regions Browser[24] and the phylogenetic P-value (PhyloP) score[25] display the conservation of genes and DNA sequences across species.
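As a minimal sketch of what a Variant Call Format record carries and how allele frequency can serve as a filter, the Python snippet below parses one tab-separated VCF data line and flags rare variants. The function names, the use of an `AF` key in the INFO column, and the 1% cutoff are illustrative assumptions rather than the behavior of any tool cited above; the sketch also assumes one alternate allele per line.

```python
from typing import Dict, NamedTuple


class Variant(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    qual: float
    info: Dict[str, str]


def parse_vcf_line(line: str) -> Variant:
    """Parse the eight fixed, tab-separated columns of one VCF data line."""
    chrom, pos, _id, ref, alt, qual, _filter, info = line.rstrip("\n").split("\t")[:8]
    info_map: Dict[str, str] = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, _, value = item.partition("=")
        info_map[key] = value  # flag entries without "=" map to an empty string
    quality = float(qual) if qual != "." else float("nan")
    return Variant(chrom, int(pos), ref, alt, quality, info_map)


def is_rare(variant: Variant, max_af: float = 0.01) -> bool:
    """Return True when the allele frequency is at or below max_af.

    Assumption: a variant with no AF entry is treated as absent from the
    population database and is therefore kept.
    """
    af = variant.info.get("AF")
    if af is None:
        return True
    return float(af) <= max_af


# Hypothetical records, written only to exercise the functions above.
rare = parse_vcf_line("X\t153000\t.\tC\tT\t99.0\tPASS\tDP=42;AF=0.0001")
common = parse_vcf_line("1\t12345\t.\tA\tG\t50.0\tPASS\tDP=35;AF=0.25")
```

In practice the population frequencies would usually be added by an annotation step rather than read from the caller's own output, and the cutoff would depend on the suspected inheritance model.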
Relevant information about genes and their encoded products is collected in integrative databases, such as GeneCards (www.genecards.org).[26] Additional information can be gained by combining different annotations that predict deleteriousness or protein impact, as in the combined annotation-dependent depletion (CADD) score.[27] Relevant annotations can be chosen and implemented into a personalized pipeline, based on the specific needs of the research or diagnostic laboratory. Some of the most commonly used annotations are further discussed in the section 'Finding the disease-causing variant.'
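One way to picture a personalized pipeline is as a list of independent filters applied to each annotated variant. The sketch below is illustrative only: the dictionary keys (`effect`, `allele_frequency`, `cadd_phred`), the gene labels, and the thresholds are hypothetical and do not correspond to the output fields or recommended cutoffs of any specific annotation tool.

```python
from typing import Any, Callable, Dict, List, Set

AnnotatedVariant = Dict[str, Any]
Filter = Callable[[AnnotatedVariant], bool]


def max_allele_frequency(cutoff: float) -> Filter:
    """Keep variants at or below the cutoff; a missing frequency counts as 0."""
    return lambda v: float(v.get("allele_frequency", 0.0)) <= cutoff


def effect_in(allowed: Set[str]) -> Filter:
    """Keep variants whose predicted molecular effect is in the allowed set."""
    return lambda v: v.get("effect") in allowed


def min_cadd(cutoff: float) -> Filter:
    """Keep variants scoring at or above the cutoff; a missing score counts as 0."""
    return lambda v: float(v.get("cadd_phred", 0.0)) >= cutoff


def apply_filters(variants: List[AnnotatedVariant], filters: List[Filter]) -> List[AnnotatedVariant]:
    """Return the variants that pass every filter in the list."""
    return [v for v in variants if all(f(v) for f in filters)]


# Hypothetical annotated variants; values are invented for illustration.
variants = [
    {"gene": "GENE_A", "effect": "missense", "allele_frequency": 0.0002, "cadd_phred": 27.0},
    {"gene": "GENE_B", "effect": "synonymous", "allele_frequency": 0.0001, "cadd_phred": 3.0},
    {"gene": "GENE_C", "effect": "nonsense", "allele_frequency": 0.12, "cadd_phred": 35.0},
]

# Each laboratory would choose its own filters and thresholds.
pipeline = [
    max_allele_frequency(0.01),
    effect_in({"missense", "nonsense"}),
    min_cadd(20.0),
]
kept = apply_filters(variants, pipeline)  # only GENE_A passes all three filters
```

Keeping each annotation as a separate filter makes it straightforward to add, remove, or retune a criterion without touching the rest of the pipeline.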