The rise of genomics and its impact on human health
Established in 1990, the Human Genome Project was one of the most expensive and collaborative ventures ever undertaken in science. Ten years since its completion, it has continued to provide a wealth of novel information, the implications of which are not yet fully understood [8]. The open-access nature of the project has stimulated scientists, as well as scientific companies, to develop better sequencing tools and accompanying analytical software. The ensuing innovations have helped to mark down the price of whole genome sequencing over the years, from nearly $3 billion at its inception to under $3,000, making it accessible to researchers from different biomedical disciplines [14].
Sequencing tools will play an important role in the development of personalized medicine. Some sequencing technologies are already used in clinics to test genetic conditions, diagnose complex diseases, or screen patient samples for rare variants. These tests allow health professionals to accurately diagnose a disease and prescribe appropriate medication specific to the patient [15, 16]. With the recent support of NIH grants in the US, neonatal sequencing is being explored to probe rare and complex disorders of newborn babies [17, 18]. There are technologies in development that allow non-invasive ways of sequencing a genome of an unborn child [19]. Personalized genome sequencing will transform the future of the healthcare landscape. However, the rise in the number of sequenced genomes is creating new problems. In particular, the way the genome analysis software works is through comparison of the obtained sequences with a reference. Because the human genome is different between different individuals, what is the reference sequence? What is the threshold to distinguish common from rare DNA variants?
Amid all these interesting implications of genome sequencing, the debate concerning the correct use of scientific terminology remains. Specifically, the nomenclature “mutation” and “polymorphism”, and also “point mutation” versus “SNP”, can be independently used to describe the same event, namely a difference in the sequence as compared with a reference. From a strictly grammatical and etymological point of view, a mutation is an event (of mutating) and a polymorphism is a condition or quality (of being polymorphic); but these terms by extension quickly came to mean the resulting event or condition itself. In principle, a point DNA variant can be labeled as a mutation or SNP. Since no clear rules are available, currently used software tools used for genome sequencing make no assignment and label the difference simply as DNA variant, blurring the distinction between the two categories.
“Mutation” and “polymorphism”: earlier definitions
The uniform and unequivocal description of sequence variants in human DNA and protein sequences (mutations, polymorphisms) were initiated by two papers published in 1993 [20, 21]. In this context, any rare change in the nucleotide sequence, usually but not always with a disease causing attribute, is termed a “mutation” [22]. This change in the nucleotide sequence may or may not cause phenotypic changes. Mutations can be inherited from parents (germline mutations) or acquired over the life of an individual (somatic mutations), the latter being the principal driver of human diseases like cancer. Germline mutations occur in the gametes. Since the offspring is initially derived from the fusion of an egg and a sperm, germline mutations of parents may also be found in each nucleated cell of their progeny. Mutations usually arise from unrepaired DNA damage, replication errors, or mobile genetic elements. There are several major classes of DNA mutations. A point mutation occurs when a single nucleotide is added, deleted or substituted. Along with point mutations, the whole structure of a chromosome can be altered, with chromosomal regions being flipped, deleted, duplicated, or translocated [23]. Another kind of DNA mutation is defined as “copy number variation”. In this case, the expression of a gene is amplified (or reduced) through increased (decreased) copy number of a locus allele [24, 25].
A variation in the DNA sequence that occurs in a population with a frequency of 1 % or higher is termed a polymorphism [26]. The higher incidence in the population suggests that a polymorphism is naturally occurring, with either a neutral or beneficial effect. Polymorphisms can also be of one or more nucleotide changes, just like mutations. The SNP exemplifies the commonest polymorphism, thought to arise every 1,000 base pairs in the human genome, and is usually found in areas flanking protein-coding genes [27] – regions now recognized as critical for microRNA binding and regulation of gene/protein expression [28]. However, SNPs can also occur in coding sequences, introns, or in intergenic regions [27]. SNPs are used as genetic signatures in populations to study the predisposition to certain traits, including diseases [29].
The anatomy of the problem
In the era of advanced DNA sequencing tools and personal genomics, these earlier definitions of mutation and polymorphism are antiquated. Before multiple parallel sequencing was developed, it was impossible to sequence multiple times the genome of the same patient. For these reasons at that time it was required to use a reference sequence coming from the assembly of multiple genomes. In the preparation of the consensus sequence, an arbitrary threshold of 1 % was established to distinguish common (polymorphism) from rare (mutation) variants [26].
The 1 % or higher frequency associated with a polymorphism is an arbitrary number [30] recommended by scientists prior to the era of Next Gen Sequencing. The threshold being arbitrary, redefining the population itself may affect the classification, with rare variants becoming polymorphisms or polymorphisms becoming rare variants according to the population analyzed. For decades, the use of this frequency to develop population models was preferred to the use of sequencing tools, which at that time were error-prone and labor-intensive. With the advent of new sequencing technologies and the subsequent sequencing of individuals, a very different picture of population dynamics has begun to emerge. Mutations that were thought to be rare in a population have been found to exceed the frequency threshold set at 1 % [31]. Even more surprising, there is a lack of association of some of these rare mutations with human diseases. When comparing populations separated by geographic and physical barriers, a disease-causing mutation in one population is found to be harmless in another, and vice versa [32].
For instance, sickle-cell anemia is caused by a nucleotide change (SNP rs334) in a gene coding for the beta chain of the hemoglobin protein [33]. In fact, rs334 is classified as a SNP, since its minor allele frequency in the population is >1 %. The disease manifests in people who have two copies of the mutated gene (rs334(T;T) genotype). Sickle cell anemia is usually rare (<1 %) in the populations of developed nations [34]. However, the heterozygous form of the gene (rs334(A;T) genotype) is persistent in populations of Africa, India, and other developing nations, where malaria is endemic [33]. In these geographic locations, heterozygote carriers of rs334 have a survival advantage against the malaria pathogen, and therefore this beneficial mutation is passed through the offspring to succeeding generations [35–37]. Here, a rare variant, which in one population (developed nations) causes a severe disease in homozygosis, can persist in another population to confer a survival advantage as a polymorphism in heterozygosis [38]. Such exceptions are increasing and show the need to redefine the terms mutation and polymorphism. The distinction between mutation and polymorphism on the basis of their disease-causing capacity is further complicated. Although thought to be naturally occurring, recent research into SNPs has shown that they can be associated with diseases like diabetes and cancers. At least 40 SNPs have been shown to associate with type-2 diabetes alone [39]. In short, it is not possible to classify the functional role of variations according to frequency in the population or their capability to cause a disease.
Context of personal genomics
This debate on “mutation” and “polymorphism” needs urgent evaluation in the era of Next Gen Sequencing and precision medicine. Multiple international collaborative projects like ENCODE (Encyclopedia of DNA elements) and HapMap (Haplotype Map) have ensued to map all the genes, genetic variation, and regulatory elements of the genome, to find associations with human biology, personal traits, and diseases [40].
In this climate, commercial companies like Illumina and Roche are developing advanced and robust platforms that tailor to the need of both small and large research facilities. The increasing competition among these companies has resulted in many different technologies, which are now available to facilitate new insights into genomics [11]. Similarly, advanced genomic tools and analytical software have been developed that can function independently of the particular platform. Researchers using tools like CLC genomics, Next Gene and Geno Matrix, can access and download sequencing datasets for their own streamlined research. The primary goal of such research is to look for subtle, complex, and dynamic sequence variations. The lack of consistent definitions and a uniform scientific language can hamper this upcoming field, where genomic platforms may formulate incorrect hypotheses and researchers may misinterpret data based on earlier definitions.
The problem is particularly important in the case of precision medicine and personalized treatments. For example, one of the main reasons to sequence the genome of a cancer consists in the identification of unique genetic features of cancer cells which may then be targeted with a personalized treatment [41]. Accordingly, it is required to classify the somatic mutations of the cancer cells and use such knowledge to exploit therapeutically all the differences between cancer and noncancerous cells. Therefore, in order to be treated with a targeted agent a cancer patient needs to express the target originated by the specific mutation occurring in cancer cells. However, should a difference be misclassified, it becomes possible for a polymorphism (present in all the cells of the patient) to be taken as a somatic mutation. The result could be a toxic effect, since the targeted treatment will impact both cancer and noncancerous cells carrying the same genetic variant. This problem is prevented if both germline and somatic cancer genomes would be sequenced in the same patient.
Another important reason underlying the need of such distinction is that a disease may originate with two subsequent mutations according to the two-hit hypothesis [42]. Within a population, a germline mutation (first hit) may predispose a subset of patients to a second, somatic, mutation whose effects will create the diseased phenotype [43]. In this context, in order to identify populations at risk it would be extremely helpful to distinguish between somatic and germline mutations. For example, multiple meningiomas occur in <10 % of meningioma patients. A first germline mutation in the SMARCB1 gene will predispose to meningioma, but this will occur only when a somatic mutation in the NF2 gene intervenes [44]. In the absence of a clear distinction between somatic and germline variants this kind of pathogenic discovery may be impossible.
This approach is now supported by a recent study. Jones et al. evaluated 815 tumor-normal paired samples coming from 15 different tumor types [45] using Next Gene Sequencing. Library preparation was performed with two methods, whole exome preparation and targeted amplification, for 111 genes. Analyses were then conducted either as if only the cancer tissue was sequenced (reference human genome assembly GRch37-lite) or taking as reference the germline DNA of the same patient. With the first analysis, the authors reported a very high rate of false-positive variants (31 % and 65 % in exome and targeted libraries, respectively). Furthermore, they identified germline mutations in 3 % of the cancers, even if they came from a cohort without family history (sporadic cancer). Now that the new sequencing technologies have dramatically reduced the cost of sequencing, precision medicine and personal genomics require that the reference of the DNA sequencing project should be obtained from the germline DNA of the same patient.
Ongoing debate and HGVS (Human Genome Variation Society) recommendations
The ongoing debate among scientists to resolve the nomenclature mutation and polymorphism is a step in the right direction. The HGVS, an alliance of 600 members from 34 countries, incorporates discussion and recommendations to establish consensus definitions and descriptions of generic terms that are accepted worldwide. Since the early 1990s, the HGVS has been instrumental in its push to standardize the mutation nomenclature. The recommendations of the HGVS have been based on extensive discussions among scientists over the years.
The papers published on this topic for the last 20 years show that HGVS was visionary to recommend new changes and extensions based on discoveries of relatively complex variants. In 2002, several researchers tried to address this nomenclature problem and the challenges to make more inclusive definitions.A special article by Condit et al. found that mutation had become increasingly negative in connotation since its use in the biological sciences, but particularly over the course of the 20th century [22]. This negativity of the term became entrenched with radiation experiments and the use of atomic weapons during the IInd world war, and later with science fiction books and movies. The paper suggested that a better term like “variation” and “alteration” might be useful, but its inconsistent usage in the scientific world makes it problematic.
More recently, additional papers have highlighted the urgency of a “consensus” guiding the selection of the sequencing methods (data collection) and reporting. These studies point out that the accurate classification of pathogenic variants requires a standardized approach and the building of data repositories including all these data [46]. In this context, Richards et al. on the behalf of the American College of Medical Genetics and Genomics (ACMG) have noted that the terms “mutation” and “polymorphism” often lead to confusion because of incorrect assumptions of pathogenic and benign effects, respectively. Thus, they recommended that both terms be replaced by the term “variant” with the following modifiers: (i) pathogenic, (ii) likely pathogenic, (iii) uncertain significance, (iv) likely benign, or (v) benign [47].