Whole Exome Sequencing in Inborn Errors of Immunity

Use the Power but Mind the Limits

Giorgia Bucciol; Erika Van Nieuwenhove; Leen Moens; Yuval Itan; Isabelle Meyts


Curr Opin Allergy Clin Immunol. 2017;17(6):421-430. 


Finding the Disease-causing Variant

WES generates on average 20 000 high-quality coding variants. To minimize the presence of common variants and maximize the inclusion of probable disease-causing mutations, a gene-level and a variant-level filtering approach are applied (Figure 2).[3]

Figure 2.

Flow chart depicting the general filtering method to screen whole exome data for disease-causing mutations. AD, autosomal dominant; AR, autosomal recessive; CADD, combined annotation-dependent depletion; GDI, gene damage index; HGC, human gene connectome; MSC, mutation significance cutoff; WES, whole exome sequencing.

The first crucial step is the generation of a genetic hypothesis that takes into consideration the potential mode of inheritance, the severity of the phenotype, the penetrance, and so on. The presence of multiple affected siblings in a family with healthy ancestors is suggestive of autosomal recessive inheritance, so homozygous (especially in the case of consanguineous unions) or compound heterozygous variants should be prioritized. The inclusion of multiple affected and unaffected siblings is extremely useful to discern the potential disease-causing variant. Nevertheless, the possibility of autosomal dominant, X-linked recessive inheritance, or de novo mutations should not be neglected in consanguineous unions. Trio analysis, in which the index patient and parents are sequenced simultaneously, is particularly efficient to pinpoint disease-causing de novo mutations; it is also very informative in the detection of homozygous and compound heterozygous mutations. Severe phenotypes in single patients should be investigated in particular for de novo, homozygous, or compound heterozygous variants, although in the case of incomplete penetrance this approach could exclude the causal mutation. Finally, autosomal dominant inheritance implies that a heterozygous mutation is disease causing.

In accordance with the suspected inheritance, a hypothesis must be formed on the detected variants. Heterozygous dominant variants are generally considered gain-of-function (GOF) mutations, where the encoded protein acquires a new function or expression pattern. However, dominant loss-of-function (LOF) variants can also occur through haploinsufficiency, where a reduction of gene function by 50% is sufficient to result in disease, or by a dominant negative effect, where the mutant gene product acts antagonistically to the wild-type protein function.[7] This can occur, for instance, if the mutant protein hampers multimer formation. Disease-causing compound heterozygous or homozygous variants are presumed LOF mutations.
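The inheritance-based prioritization described above can be illustrated with a minimal Python sketch of a trio analysis. It is not taken from the article: the function name, the dictionary keys ('gene', 'child', 'mother', 'father'), and the encoding of genotypes as alternate-allele counts (0, 1, or 2) are all assumptions made for illustration. The sketch covers autosomal variants only and, as a simplification, calls a variant homozygous only when both parents are heterozygous carriers.

```python
from collections import defaultdict


def classify_trio_variants(variants):
    """Group a proband's variants by the inheritance pattern they fit.

    Each variant is a dict with hypothetical keys 'gene', 'child',
    'mother', and 'father'; the last three hold alternate-allele
    counts (0, 1, or 2). Autosomal variants only.
    """
    de_novo, homozygous = [], []
    maternal_hets = defaultdict(list)
    paternal_hets = defaultdict(list)

    for v in variants:
        child, mother, father = v["child"], v["mother"], v["father"]
        if child == 0:
            continue  # the proband does not carry the variant
        if mother == 0 and father == 0:
            de_novo.append(v)  # absent from both parents
        elif child == 2 and mother == 1 and father == 1:
            homozygous.append(v)  # one allele from each carrier parent
        elif child == 1 and mother == 1 and father == 0:
            maternal_hets[v["gene"]].append(v)
        elif child == 1 and mother == 0 and father == 1:
            paternal_hets[v["gene"]].append(v)

    # Compound heterozygosity: at least one maternally and one
    # paternally inherited heterozygous variant in the same gene.
    compound_het = {
        gene: maternal_hets[gene] + paternal_hets[gene]
        for gene in maternal_hets
        if gene in paternal_hets
    }
    return {
        "de_novo": de_novo,
        "homozygous": homozygous,
        "compound_het": compound_het,
    }
```

A real pipeline would also need to handle X-linked variants, genotype quality, and incomplete penetrance, which, as noted above, can cause a strict inheritance filter to exclude the causal mutation.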

The following criteria represent a general guideline for considering a gene likely to harbor disease-causing variants, although they are not always valid and several counterexamples can be found: it encodes a protein involved in a pathway related to the phenotype; it is expressed in the cells and tissues of interest; and it is intolerant to mutations in the general population, indicating that mutations in that gene have a higher probability of being deleterious. There are databases reporting the expression of human genes across human cell lines, cell types, tissues, and organs.[22–23,26] It is reasonable to hypothesize that the gene of interest is expressed in the affected cell or tissue, although mutations in ubiquitously expressed genes can result in a phenotype that is highly tissue specific.[33] The connection of the candidate gene with a certain pathway or protein that is known to be related to the disease can be tested using the Human Gene Connectome,[34–35] which is based on protein–protein interactions and depicts the biological distance and route between two genes [the gene(s) connecting them]. The gene damage index (GDI) is a tool made publicly available at http://lab.rockefeller.edu/casanova/GDI in 2015 to filter candidates at the gene level, complementing available variant-level metrics.[36] It is defined as the cumulative mutational damage to a particular human gene in the general population based on data from the 1000 Genomes Project,[21] and it can be used to remove genes that are highly mutated in the general population and, therefore, less likely to cause severe disease.[36] Other methods, such as the residual variation intolerance score, are complementary: they assess the likelihood that a gene is pathogenic based on its intolerance to mutations.[37]

At the variant level, the allele frequency, based on different public databases in accordance with the ethnicity of the studied kindred, is the main filter to substantially reduce the number of variants. The hypothesized inheritance model dictates the frequency cutoff that should be applied. In general, heterozygous variants with a frequency in the general population of more than 1% are considered polymorphisms, with at most a weak modulating effect on disease. In the context of monogenic disorders, for autosomal dominant transmission a frequency cutoff of less than 1% is thus applied. For autosomal recessive disorders, instead, a higher allele frequency cutoff is recommended to minimize false negatives: under Hardy–Weinberg equilibrium, a frequency of 1% for the homozygous state corresponds to two parental alleles each with a frequency of 10% (0.1 × 0.1 = 0.01). Exceptions can derive from population genetics phenomena, such as founder effect, resulting in a high carrier frequency [such as in Adenosine Deaminase 2 (ADA2) deficiency[38]], or evolutionary adaptation [e.g., Familial Mediterranean Fever gene (MEFV) mutations that provide a selective advantage against Yersinia pestis infection[39]]. These cutoffs can be further reduced based on knowledge of the disease prevalence and of the predicted clinical penetrance. In the research setting, stricter cutoffs are used to screen for novel disease-causing variants, which usually have a very low allele frequency. Coding variants that are synonymous, that is, variants that do not cause a change in the amino acid at that position, and noncoding variants that do not affect splicing, are usually excluded. This step significantly reduces the data load to be analyzed, as in most cases synonymous variants are silent. Rarely, synonymous variants alter disease susceptibility or cause disease, for example, by impairing splicing, or by affecting mRNA stability or protein folding (reviewed in[40]).
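The relationship between the inheritance model and the allele frequency cutoff can be made concrete with a short Python sketch. The function names and the "AD"/"AR" labels are assumptions chosen for illustration, and the 1% default is the general value quoted above, not a recommendation for any particular disease. For the recessive case, the sketch applies Hardy–Weinberg equilibrium: if the homozygous genotype frequency is q², the corresponding allele frequency is its square root.

```python
import math


def max_allele_frequency(genotype_cutoff, inheritance):
    """Allele-frequency cutoff implied by a genotype-frequency cutoff.

    'inheritance' is a hypothetical label: "AD" (autosomal dominant)
    or "AR" (autosomal recessive).
    """
    if inheritance == "AD":
        # A single heterozygous allele is presumed disease causing,
        # so the cutoff applies to the allele frequency directly.
        return genotype_cutoff
    if inheritance == "AR":
        # Hardy-Weinberg: homozygote frequency = q**2, so q = sqrt(cutoff).
        return math.sqrt(genotype_cutoff)
    raise ValueError(f"unsupported inheritance model: {inheritance!r}")


def passes_frequency_filter(allele_frequency, inheritance, genotype_cutoff=0.01):
    """Return True if the variant is rare enough to keep as a candidate."""
    return allele_frequency < max_allele_frequency(genotype_cutoff, inheritance)
```

With the default cutoff, an allele with a population frequency of 5% would be kept under a recessive hypothesis but discarded under a dominant one. As the text notes, founder effects and evolutionary adaptation can violate these assumptions, and research settings typically use stricter values.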

Various algorithms that attempt to predict the functional effect and assess the conservation/intolerance of a missense variant are publicly available, such as the PhyloP score,[25] the Genomic Evolutionary Rate Profiling programs,[41] Sorting Intolerant from Tolerant (SIFT),[42] or Polymorphism Phenotyping v2 (PolyPhen-2).[43] They are combined with information from other annotations to create the CADD score.[27] CADD is more precise in predicting the deleteriousness of both protein-altering and regulatory variants. CADD scores range from 0 to 99 with increasing deleteriousness, and 15 has been proposed as a standard cutoff for all human genes. However, like other prediction tools, CADD is not flawless.[27] For example, reliable input data are lacking for intronic variants, and it has been shown that the CADD score has a high false negative rate when a fixed cutoff is used.[44] A more precise assessment of the variant's effect at the gene level can be obtained by using the mutation significance cutoff (MSC), a quantitative method that provides gene-specific phenotypic impact cutoff values and represents the lowest expected clinically/biologically relevant CADD (as well as PolyPhen-2 and SIFT) cutoff value for a specific gene.[44] The combination of variant-level and gene-level approaches, for example, by considering the GDI, the MSC, and the CADD score together, allows better prioritization with fewer false negatives.[5,44] In other words, a benign variant (with a CADD score below the MSC of that gene) in a gene with a high GDI is expected to have the lowest phenotypic impact. A variant with a CADD score above the gene-specific MSC, in a gene that displays a low or medium GDI and is also biologically close to known disease-causing genes, can be expected to have the greatest phenotypic impact.
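The combined use of CADD, MSC, and GDI described above can be summarized as a small decision rule. The following Python sketch is illustrative only: the function name, the GDI category strings ("low", "medium", "high"), the boolean flag for biological proximity, and the "intermediate" label for all remaining combinations are assumptions, not part of the published tools. Treating a CADD score exactly equal to the MSC as passing is also an assumption.

```python
def predicted_impact(cadd, msc, gdi_category, close_to_known_gene=False):
    """Rank a variant by combining variant-level and gene-level evidence.

    cadd                -- CADD score of the variant
    msc                 -- gene-specific mutation significance cutoff
    gdi_category        -- hypothetical label: "low", "medium", or "high"
    close_to_known_gene -- True if the gene is biologically close to a
                           known disease-causing gene (for example, as
                           judged with the Human Gene Connectome)
    """
    above_msc = cadd >= msc  # assumption: equality counts as passing

    if above_msc and gdi_category in ("low", "medium") and close_to_known_gene:
        return "highest"
    if not above_msc and gdi_category == "high":
        return "lowest"
    # Every other combination is left unranked by the text above.
    return "intermediate"
```

Using the gene-specific MSC rather than a fixed CADD cutoff of 15 reflects the point made in the text that a single cutoff for all genes yields a high false negative rate.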