Whole Exome Sequencing in Inborn Errors of Immunity

Use the Power but Mind the Limits

Giorgia Bucciol; Erika Van Nieuwenhove; Leen Moens; Yuval Itan; Isabelle Meyts


Curr Opin Allergy Clin Immunol. 2017;17(6):421-430. 


Pitfalls of Next-generation Sequencing

NGS, and WES in particular, has some common drawbacks and technical difficulties, which we outline here.

Incomplete Coverage of the Genome by Whole Exome Sequencing

The major pitfall of WES is that, by design, its coverage excludes noncoding exons (for example, RNA genes) and intronic/intergenic regions, although these can harbor disease-causing mutations. Examples are Roifman syndrome, caused by compound heterozygous mutations in the RNA U4 small nuclear AT-AC form (RNU4ATAC) gene, encoding the minor spliceosomal small nuclear RNA U4atac, and X-linked reticulate pigmentary disorder (X-linked PDR), caused by an intronic mutation in the DNA polymerase alpha gene (POLA1) that affects splicing and is hemizygous in men or heterozygous in women, who have a milder phenotype with only pigmentary anomalies. Both diseases were genetically characterized thanks to WGS performed on the clinically affected patients.[45–46] In addition to noncoding exons and introns, a large number of exons from protein-coding genes (up to 6% of the total, in Belkadi et al.[13]), including exons listed in the Consensus Coding DNA Sequence database, are not yet covered by exome kits. Moreover, it has been shown that, even in targeted regions, many high-quality true positive variants are missed by WES compared with WGS, whereas the variants missed by WGS but detected by WES are mostly false positives.[13]

When Quantity Matters

CNVs are large segments of DNA that, through duplication or deletion, are present in a variable number of copies in the genomes of different individuals (Figure 3). They can be disease-associated or simple polymorphisms, and are increasingly recognized to play a role in interindividual variability.[47–49] Segmental duplications, or low copy repeats, are inter- or intrachromosomal duplications of highly identical stretches of DNA, from 1 to 400 kb in length.[49–51] Segmental duplications have played a major role in human evolution from primates and are much more abundant in our genome than in that of the chimpanzee, with 138 gene duplications specific to humans identified since the human-chimpanzee divergence.[52–53] In fact, approximately 5% of the human genome was shown to consist of segmental duplications of 10 kb or larger,[54] and up to 97% of genes with a CNV correspond to known segmental duplications, illustrating a propensity for chromosomal rearrangements in these regions.[48] A change in the copy number of a gene, or a disruption of its integrity, can arise from nonallelic homologous recombination of repeated sequences (Figure 4). Such recombination is the cause of many genomic disorders, including X-linked ichthyosis, α-thalassemia, and DiGeorge syndrome.[49–51,54]

Figure 3.

The figure illustrates the types of copy number variations that can arise from deletions, inversions, and duplication of segments of DNA. The squares marked by A, B, C, and D represent large regions of genomic DNA.

Figure 4.

Different types of nonallelic homologous recombination between segmental duplications (red and blue arrows represent the same duplicated region). (1) Recombination between homologous sequences on different DNA strands can generate either a duplication or a deletion of the region between the segmental duplications. (2) Recombination between homologous sequences with reverse orientation leads to inversion of the recombined fragments.

Large structural variations of the DNA are traditionally analyzed by karyotyping and by array-based comparative genomic hybridization (array CGH). Standard karyotyping is based on microscopic observation of the chromosomes and can, therefore, only detect numerical or structural aberrations of at least 5–10 million base pairs (Mb). Fluorescence in situ hybridization employs labeled DNA probes to detect smaller chromosomal rearrangements at a resolution of 1–5 Mb, but it can be used only when the specific aberration is already known.[55] Array CGH instead allows the identification of smaller variations (from ~200 bp) by hybridizing the sample DNA together with a reference DNA to the normal human genome, and then measuring the differences between sample and reference.[56] An example of the application of array CGH to the diagnosis of inborn errors of immunity is the description by Kazenwadel et al.[57] of a large genomic deletion encompassing the GATA binding protein 2 (GATA2) gene in a family with myelodysplastic syndrome.

When approached by WES, segmental duplications and large structural variations such as CNVs present technical difficulties at two levels.

Copy Number Variations

CNVs can significantly contribute to the cause of Mendelian disorders. For this reason, it is important to find strategies that allow them to be detected by WES, even though calling CNVs from short-read sequence data can be very challenging. High-resolution mapping of copy number alterations is already very effective for WGS, whereas WES is currently not very reliable, as a result of the fragmentation of the captured exons and the extension of most CNVs beyond the covered amplicons.[13,48] Different strategies are being tested to improve CNV detection by WES; among them, read depth analysis has proven particularly effective.[58,59] This approach assumes a random distribution in mapping depth. It compares the number of reads mapping to a certain chromosome region with the prediction of a statistical model, and calls deviations from the model as CNVs.[58,60] This method yields better results when the ratio of read counts between a test sample and a control sample is used instead of a single-sample analysis, to compensate for the technical variability in capture efficiency and other sequencing biases.[58,60] The different algorithms and platforms available for CNV analysis are reviewed and compared in Kadalayil et al.[58]
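The test-versus-control read depth comparison described above can be illustrated with a minimal sketch. It is not the method of any specific published tool: the function names, the 0.5 pseudocount, and the log2-ratio thresholds are assumptions chosen for illustration, and real callers model capture bias and noise far more carefully.

```python
import math

def log2_ratios(test_counts, control_counts):
    """Per-exon log2 ratio of read counts (test vs control).

    Counts are first divided by each sample's total so that a difference
    in overall sequencing depth is not mistaken for a CNV.
    """
    test_total = sum(test_counts.values())
    control_total = sum(control_counts.values())
    ratios = {}
    for exon, control in control_counts.items():
        test = test_counts.get(exon, 0)
        # Pseudocount of 0.5 (an assumption) avoids log(0) and division by zero.
        test_frac = (test + 0.5) / test_total
        control_frac = (control + 0.5) / control_total
        ratios[exon] = math.log2(test_frac / control_frac)
    return ratios

def call_cnvs(ratios, loss=-0.6, gain=0.4):
    """Label exons whose log2 ratio deviates beyond illustrative thresholds.

    A heterozygous deletion is expected near log2(0.5) = -1 and a single-copy
    duplication near log2(1.5), about 0.58; the cutoffs here are hypothetical.
    """
    calls = {}
    for exon, ratio in ratios.items():
        if ratio <= loss:
            calls[exon] = "deletion"
        elif ratio >= gain:
            calls[exon] = "duplication"
    return calls

if __name__ == "__main__":
    # Made-up read counts for four exons; not real data.
    control = {"exon1": 100, "exon2": 100, "exon3": 100, "exon4": 100}
    test = {"exon1": 100, "exon2": 50, "exon3": 100, "exon4": 150}
    print(call_cnvs(log2_ratios(test, control)))
```

Using a ratio against a control captured with the same kit cancels out exon-specific capture efficiency, which is the reason given above for preferring paired analysis over a single-sample analysis.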

Segmental Duplications

In the case of segmental duplications, the relatively short sequence reads generated by WES make it impossible to distinguish between a duplicated gene and its parent gene during alignment to the reference sequence. This has prompted many researchers to systematically exclude these duplication-rich genomic regions from their downstream analysis. Caution is instead needed, as the largest segmental duplications exhibit an interindividual variability of only 3% and duplications do not preclude the occurrence of highly damaging mutations.[48] We draw on two examples from the field of inborn errors of immunity. Sequencing of the gene encoding filaggrin (FLG) proved challenging because of the large exon 3, which includes 10–12 highly identical full tandem repeats.[61] However, even nonsense and frameshift mutations in these imperfect C-terminal repeats are pathogenic for ichthyosis vulgaris and predispose to eczema and allergic diseases.[61] In our quest for a causative gene defect in a patient with severe atopic dermatitis, recurrent skin infections, and other features reminiscent of hyper-IgE syndrome, including a serum IgE of 30 000 kU/l, a pathogenic homozygous mutation in FLG was filtered out because it was located in a segmental duplication area. Autosomal dominant combined immunodeficiency with lymphoproliferation due to mutations in the phosphatidylinositol 3-kinase catalytic delta (PIK3CD) gene, in particular the known pathogenic E1021K substitution, may also be missed because of a duplicated region in exon 22.[21] This was our experience in a boy for whom we performed trio exome sequencing but found the causative gene only via single-gene Sanger sequencing, after carefully reconsidering the clinical phenotype. Hence, it is not safe to automatically exclude segmental duplications when analyzing WES data. This case also shows the crucial role of the astute clinician in guiding and navigating genetic analysis.
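One way to follow this advice in a pipeline is to annotate variants that fall in segmental duplications rather than discarding them, so that they remain available for review against the clinical phenotype. The sketch below assumes a simple in-memory representation; the field names, the function names, and all coordinates are hypothetical placeholders, not real annotations for FLG or PIK3CD.

```python
def in_any_interval(pos, intervals):
    """Return True if pos lies within any (start, end) interval, inclusive."""
    return any(start <= pos <= end for start, end in intervals)

def annotate_segdup(variants, segdups):
    """Add an 'in_segdup' flag to each variant instead of filtering it out.

    variants: list of dicts with 'chrom' and 'pos' keys (assumed layout).
    segdups:  dict mapping chromosome name to a list of (start, end) tuples.
    """
    annotated = []
    for variant in variants:
        intervals = segdups.get(variant["chrom"], [])
        flagged = in_any_interval(variant["pos"], intervals)
        # Copy the variant so the input list is left unchanged.
        annotated.append({**variant, "in_segdup": flagged})
    return annotated

if __name__ == "__main__":
    # Hypothetical coordinates for illustration only.
    segdups = {"chr1": [(1000, 2000)]}
    variants = [
        {"chrom": "chr1", "pos": 1500, "gene": "GENE_X"},
        {"chrom": "chr1", "pos": 5000, "gene": "GENE_Y"},
    ]
    for v in annotate_segdup(variants, segdups):
        print(v)
```

Every input variant is still present in the output. The flag can then be used to request orthogonal confirmation, for example Sanger sequencing, instead of acting as a silent exclusion.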


Pseudogenes

Pseudogenes are segments of genomic DNA that are related to normal genes but are not functional. They are characterized by sequence homology with a parent gene and the loss of protein-coding ability. This is the result of two main mechanisms: the duplication of a gene followed by accumulation of mutations that inactivate the redundant copy, or the retrotransposition of mRNA back into the genome, generating a gene that is by default nonfunctional because it lacks promoter regions (Figure 5).[62] The inhibitor of kappa light polypeptide gene enhancer in B cells, kinase gamma (IKBKG)/NF-κB essential modulator (NEMO) gene, coding for NEMO, is a notable example of both pseudogene and segmental duplication. It is in fact part of a segmental duplication containing the functional gene and, in the opposite orientation, its partial pseudogene copy IKBKGP.[63–64] Nonallelic homologous recombination between the two highly similar regions can generate rearrangements of both the gene and the pseudogene, most often CNVs. When they involve the functional gene, they can be the cause of incontinentia pigmenti in affected women.[63–64] This peculiar genomic structure renders analysis of IKBKG/NEMO by WES extremely difficult. In the alignment to the reference sequence, it is impossible to attribute reads from homologous regions to the gene or the pseudogene with certainty. This has important diagnostic consequences, as the same single nucleotide variant or indel is pathogenic when affecting the gene but benign in the pseudogene. A reliable method to assess whether a variant belongs to one or the other is long-range PCR, which allows discrimination between the two regions thanks to specific primers.[65]

Figure 5.

The figure illustrates two possible mechanisms of generation of pseudogenes: duplication of a gene followed by accumulation of deleterious mutations in the redundant copy, or retrotransposition of mRNA in the DNA, producing a nonfunctional pseudogene.

Gene and Variant-level Prediction Methods

At the gene level, the gene damage index (GDI) and residual variation intolerance score methods are founded on the fair assumption that genes frequently mutated in the healthy population are less likely to be pathogenic for rare, inherited disorders such as inborn errors of immunity. They are efficient in eliminating these overrepresented, highly mutated genes, which contain missense, nonsense, or indel variants not considered detrimental biochemically or evolutionarily. In other words, they provide us with a tool to remove plausible false positive variants from NGS data. Focusing on GDI, the cutoff values proposed for the filtering process in the context of inborn errors of immunity are 12.41 for all disease-causing genes, and more specifically 13.36 and 9.49 for autosomal recessive and autosomal dominant inheritance, respectively.[36] Despite its many advantages, it is prudent to consider its limitations. In the first place, a false negative rate of 5% was tolerated to generate these cutoff values.[36] Therefore, variants in some disease-causing genes [including Dedicator of cytokinesis 8 (DOCK8), Interferon-induced helicase C domain-containing protein 1 (IFIH1), Lipopolysaccharide-responsive beige-like anchor protein (LRBA), and Ataxia-telangiectasia mutated (ATM)] will be excluded from the downstream analysis. Furthermore, the GDI may be population specific, and when performing genetic analysis it remains imperative to take these factors into account. Finally, the GDI does not compensate for CNVs present in these genes, which translates into a higher than expected GDI value. The gene FLG is an example, with a GDI value of 27.35 but a demonstrated role in severe eczema. This issue of adjusting the cutoffs to include as many true positives as possible, while trying to eliminate false negatives, applies also to the variant-level prediction methods, such as allele frequency and CADD score.
We recommend either establishing a false negative rate suitable for your own objective or using these tools only for the prioritization of variants (what they were ultimately designed for), not as a final cutoff.
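The difference between using GDI as a hard cutoff and using it for prioritization can be sketched as follows. The three cutoff values and the FLG value of 27.35 are those quoted above; the other gene names and scores are made-up placeholders, and the assumption that a gene is dropped when its GDI exceeds the cutoff is ours.

```python
# GDI cutoffs quoted in the text for inborn errors of immunity.
GDI_CUTOFFS = {"all": 12.41, "AR": 13.36, "AD": 9.49}

def hard_filter(gene_gdi, mode="all"):
    """Keep only genes at or below the cutoff; higher-GDI genes are lost."""
    cutoff = GDI_CUTOFFS[mode]
    return [gene for gene, gdi in gene_gdi.items() if gdi <= cutoff]

def prioritize(gene_gdi):
    """Rank genes from lowest to highest GDI without discarding any."""
    return sorted(gene_gdi, key=gene_gdi.get)

if __name__ == "__main__":
    # GENE_A and GENE_B are hypothetical; FLG uses the GDI cited in the text.
    candidates = {"GENE_A": 3.2, "GENE_B": 11.0, "FLG": 27.35}
    print(hard_filter(candidates))  # FLG is removed by the cutoff
    print(prioritize(candidates))   # FLG is ranked last but retained
```

With prioritization, a gene such as FLG drops to the bottom of the list yet stays visible, so it can still be reconsidered when the phenotype points to it.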