Multidisciplinary insight into clonal expansion of HTLV1–infected cells in adult Tcell leukemia via modeling by deterministic finite automata coupled with highthroughput sequencing
 Amir Farmanbar†^{1, 2},
 Sanaz Firouzi†^{1}Email author,
 SungJoon Park^{2},
 Kenta Nakai^{1, 2},
 Kaoru Uchimaru^{1, 3} and
 Toshiki Watanabe^{1, 4}Email author
DOI: 10.1186/s1292001602412
© The Author(s). 2017
Received: 30 October 2016
Accepted: 22 December 2016
Published: 31 January 2017
Abstract
Background
Clonal expansion of leukemic cells leads to onset of adult Tcell leukemia (ATL), an aggressive lymphoid malignancy with a very poor prognosis. Infection with human Tcell leukemia virus type1 (HTLV1) is the direct cause of ATL onset, and integration of HTLV1 into the human genome is essential for clonal expansion of leukemic cells. Therefore, monitoring clonal expansion of HTLV1–infected cells via isolation of integration sites assists in analyzing infected individuals from early infection to the final stage of ATL development. However, because of the complex nature of clonal expansion, the underlying mechanisms have yet to be clarified. Combining computational/mathematical modeling with experimental and clinical data of integration site–based clonality analysis derived from next generation sequencing technologies provides an appropriate strategy to achieve a better understanding of ATL development.
Methods
As a comprehensively interdisciplinary project, this study combined three main aspects: wet laboratory experiments, in silico analysis and empirical modeling.
Results
We analyzed clinical samples from HTLV1–infected individuals with a broad range of proviral loads using a highthroughput methodology that enables isolation of HTLV1 integration sites and accurate measurement of the size of infected clones. We categorized clones into four size groups, “very small”, “small”, “big”, and “very big”, based on the patterns of clonal growth and observed clone sizes. We propose an empirical formal model based on deterministic finite state automata (DFA) analysis of real clinical samples to illustrate patterns of clonal expansion.
Conclusions
Through the developed model, we have translated biological data of clonal expansion into the formal language of mathematics and represented the observed clonality data with DFA. Our data suggest that combining experimental data (absolute size of clones) with DFA can describe the clonality status of patients. This kind of modeling provides a basic understanding as well as a unique perspective for clarifying the mechanisms of clonal expansion in ATL.
Keywords
Mathematical computational modeling Deterministic finite state automata (DFA) Statetransition diagram Adult Tcell leukemia Human Tcell leukemia virus ype1 Integration site Clonal expansion Nextgeneration sequencingBackground
Cancer is a complex disease of the genome that behaves as a clonal evolutionary process in populations of cells [1–4]. Although cancer is a diverse and multifactorial disorder with differing origins and degrees of malignancy, clonal expansion and the presence of Darwinian or natural selection are generally accepted as common features [4, 5]. Since Nowell first proposed the clonal evolution theory of neoplasia in 1976 [1], a broad range of studies have provided support for this model. In recent years, the use of nextgeneration sequencing (NGS) technologies for the investigation of tumor genomes has generated increasing evidence that most neoplasms grow as a clonally expanded cell population [3, 6–8]. The vast amounts of invaluable data generated by NGS have surpassed analysis and interpretation capacity. However, the intricate nature of clonal expansion and evolution in cancer makes it difficult to convert the experimental and clinical data into medical practices [9, 10]. Experimental data alone are not generally sufficient enough to address the complex problem of cancer. Consequently, focus has shifted toward devising mathematical/computational models for simplification and extraction of fundamental meaning from the complex biological processes of cancer.
Adult Tcell leukemia (ATL) is a lifethreatening malignancy that manifests with very poor prognosis [11, 12]. ATL develops through a multistep leukemogenic process, the nature of which remains elusive [13]. Among the different types of cancer, ATL is a remarkably unique neoplasm in that it is directly caused by infection with human Tcell leukemia virus type1 (HTLV1), which is mainly transmitted via breastfeeding [14]. HTLV1 infection and integration of provirus into the host genome are intrinsic and inevitable early events for ATL development [15]. HTLV1 mainly survives in vivo by persistent clonal proliferation of infected cells [16]. Whereas the majority of HTLV1–infected individuals remain asymptomatic carriers (ACs) throughout their lifetime, ~5% of them develop ATL after a long period of clinical latency [17]. Currently, there is no clear determinant to distinguish between individuals who will remain ACs and those who will develop ATL [18, 19]. Our Joint Study on Predisposing Factors of ATL Development (JSPFAD) group examined ATL risk factors and demonstrated that a proviral load (PVL; i.e., the percentage of infected peripheral blood mononuclear cells) of >4% is one of the risk factors for progression to ATL; however, PVL alone cannot predict development of the disease [19]. Similar to other types of cancer, clonal expansion of abnormal cells is a hallmark of ATL [20, 21]. Considering that the incidence of large clones increases with disease progression from the healthy AC state to the malignant states of smoldering (SM), chronic, or acute ATL [22–24], monitoring clonal expansion via an accurate method of detection is of great clinical importance [8, 23].
Generally, mutation patterns of cells can be used to define clones and monitor clonal expansion in different types of cancer [7]. ATL development has an advantage in that not only the mutation pattern but also the integration site of the provirus can be used to define clones and monitor clonal expansion [8, 23]. Individual infected cells can be uniquely characterized based on their integration site because, typically, a single integration of HTLV1 occurs per host cell [25]. Detecting the clonality dynamics, including clonal status and alterations, requires an appropriate method for defining two main characteristics of clones, HTLV1 integration site and clone size.
Research in this area would be greatly benefitted by an easier to understanding representation and description of how cancer develops in terms of clonal expansion, which is expected to be provided by appropriate models. A realistic model would provide a better understanding of cancer and would provide a comprehensive perspective on cancer processes by integrating clinical and biological data within a mathematical and computer science framework. As with other malignancies, suitable models for ATL would help to simplify the dynamics of cooperative and complex behaviors in cancer development [26, 27]. Quantitative NGS data have the potential for creating robust and reliable mathematical modeling approaches [28]. Increasingly complex mathematical models of cancerous growth (particularly of solid tumors) that are based primarily on mutation patterns have been developed [29, 30]. The prominent role of mathematical modeling in the detailed quantitative description of diseases, and the contribution of mathematical modeling to solving biological problems have been eloquently discussed by Tanaka and Ono [31]. Currently, there is a broad range of theoretical models available; however, empirical mathematical models are still limited [30].
Several mathematical modeling studies are available in the field of HTLV1 research [32–38], although none of these studies have focused on modeling clonal expansion and its correlation with ATL development. The earliest mathematical model for HTLV1 explored the correlation between the antiviral immune response, viral load and viral diversity [32]. Later Stilianakis et al. used a nonlinear differential equation and theoretical assumptions to describe HTLV1 infection of CD4^{+} Tcells [34]. This model was further optimized to test different assumptions and/or alteration of the proposed differential equations [35–38]. Therefore, there is an obvious lack of a datadriven mathematical model that describes the role of clonal expansion of HTLV1–infected cells in ATL development. Mathematical models that are datadriven and hypothesisfree are considered to be the most applicable in many situations and have the lowest risk of confirmation bias [31]. Moreover, there is currently no computational model available for ATL development. A model that not only reflects details of biological phenomena like mathematical models but also allows abstract visualization of the observed information like computational models would be most informative to biologists [39]. Establishing suitable expressive formalisms requires filling the gap between mathematics and computer science by using advantages of both approaches.
In this study, we used deterministic finite state automata (DFA), which are a concept in automata theory [40]. Automata are the main mathematical objects in computer science that are capable of applying sequential algorithms, formalism, to system description and specification [41, 42]. DFA can abstractly display evolutionary processes and other phenomena with a sequential order of events [40, 43]. DFA represent a framework to describe the behavior of clonal expansion as discretestate systems. Our main goal was to illustrate clonality patterns and to design a conceptually clear framework based on real biological data on clonality obtained from individuals with different PVLs and progression states of ATL. We also categorized the observed clone sizes accurately based on our integration site–mediated clonality analysis approach. Moreover, we propose the first wellsuited empirical model for intuitive description of clonal expansion in ATL.
Methods
Wet laboratory experiments
HTLV1–infected individuals harbor complex populations of infected clones and uninfected cells [8, 23]. HTLV1 integration sites and the number of infected cells in each clone (i.e., clone size) are two main characteristics of infected clones that we monitored. Each HTLV1–infected cell naturally harbors only a single integration site [25]. Therefore, the number of detected unique integration sites reflects the number of infected clones. The most challenging aspect of our clonality analysis was measuring the number of infected cells in each clone. We used a molecular tagging system for this purpose. Tags acts as molecular barcodes which give DNA fragment unique signatures before PCR [8]. Information on the frequency of observed tags from the NGS data was used to remove PCR duplicates and thereby estimate the original clonal abundance in the starting sample. Because of the random design of tags, they could theoretically provide ~65,536 variations, and thus can uniquely mark a large number of cells in each clone. This method has been comprehensively validated using control samples with known clone sizes and clinical samples [8].
Sample characteristics
Sample  Clinical status  PVL (%)  DFA machine  Final state  Integration sites 

F1  AC  7.57  M1  q1  876 
F2  AC  5.24  M1  q1  802 
F3  AC  7.16  M1  q1  1473 
F4  SM  6.02  M1  q1  1827 
F5  SM  31.15  M4  q2  225 
F6  SM  23.56  M2  q2  398 
F7  SM  36.63  M4  q2  570 
F8  SM  43.24  M7  q3  417 
F9  Chronic  28.53  M4  q2  260 
F10  Chronic  15.25  M4  q2  1345 
F11  Chronic  100.70  M3  q3  73 
F12  Chronic  83.81  M3  q3  65 
F13  Acute  64.43  M6  q3  138 
F14  Acute  27.92  M5  q3  390 
F15  Acute  51.90  M3  q3  40 
F16  Acute  51.42  M3  q3  19 
F17  AC  1.24  ND  ND  233 
F18  AC  3.52  ND  ND  739 
To prepare the samples for sequencing, 5 μg genomic DNA from peripheral blood mononuclear cells was isolated using a QIAGEN DNA Blood kit. PVLs were measured by realtime PCR using the ABI PRISM 7000 Sequence Detection System as described [19].
We used a library preparation protocol specifically designed to isolate HTLV1 integration sites. All information about the design and detailed protocols has been described [8]. In brief, the starting template DNA was fragmented by sonication. The resulting fragments represented a size range of 300 to 700 bp as indicated by an Agilent 2100 Bioanalyzer and DNA 7500 kit. Fragmented DNA underwent the library construction steps of end repair, Atailing, adaptor ligation, size selection and nested PCR. The generated products contained all the specific sequences necessary for the Illumina HiSeq 2000/2500 platform (Additional file 1: Figure S1).
In silico analysis
We analyzed the large amount of NGS data with a pipeline specifically designed for HTLV1 integration sites and clone size measurement. We processed raw sequencing data according to the workflow that we previously reported [8]. Briefly, raw data of Read1 (100 bp forward), Read3 (100 bp reverse), and Read2 (8 bp index) were obtained from the Illumina HiSeq 2000/2500 platform. The quality of sequencing outputs was confirmed with the FastQC tool [47]. In the case of Read1, the first 5 bp were trimmed, and the next 5 bp were used to demultiplex indexed samples. The following 23 bp, which correspond to the long terminal repeat primer, were then removed. The next 27 bp were subjected to a BLAST search [48] against the long terminal repeat reference sequence. For the BLAST output reads, the remaining 40 bp were subjected to a BLAST search against an HTLV1 reference sequence [49]. Reads confirmed to be from HTLV1 were removed, and the sequences and IDs for the remaining reads, which were considered to be human, were collected. Subsequently, reads from Read3 with IDs corresponding to IDs from Read1 were collected. The first 40 bp of Read3 were trimmed to have the same length as Read1 sequences. The paired sequences of Read1 and Read3 were mapped against the human genome (version 19) by Bowtie [50]. For each sample, two million mapped reads were used for subsequent analysis. The 5′mapped positions were considered to be integration sites. The output format of isolated integration sites is chromosome:position (strand) (e.g., chr7:9408533 (−)). Subsequently, Read2 information, which contained 8bp randomly designed barcodes, was used to retrieve the clone size based on the tags. Finally, clone size was measured by computing the frequency of unique tags per each integration site.
Expressing results via empirical modeling
Formal definition can precisely describe automata by alphabet and formation rules in mathematics. Parameters of DFA, such as the number of accept states and the number of transitions exiting from a state, can be clearly defined by the formal definition. In mathematical language, a DFA is a 5tuple where the components are (Q, Σ, δ, q0, F). “Q” is a finite set of states. Σ is a nonempty finite set of symbols (inputs). Transition rules are denoted by a function called the transition function, δ: States × Alphabet → States (δ: Q × Σ → Q). “q0” is a start state, where q0 ∈ Q. “F” indicates final states that are a subset of states Q [40, 42, 51].
Results
Analyzing clonality of clinical samples by highthroughput sequencing
The clone size and category of clones among Top5 largest clones across all samples
F1  F2  F3  F4  F5  F6  F7  F8  F9  F10  F11  F12  F13  F14  F15  F16  F17  F18  

310  S  357  S  388  S  314  S  1427  B  1446  B  1904  B  2055  VB  2029  B  736  B  4883  VB  5377  VB  3721  VB  2848  VB  2634  VB  4909  VB  112  VS  77  VS 
252  S  147  S  287  S  118  VS  552  B  1088  B  1690  B  1293  B  361  S  725  B  25  VS  14  VS  871  B  158  S  13  VS  17  VS  54  VS  29  VS 
65  VS  58  VS  206  S  50  VS  263  S  67  VS  317  S  47  VS  69  VS  131  S  24  VS  8  VS  774  B  138  S  9  VS  3  VS  47  VS  28  VS 
58  VS  43  VS  206  S  38  VS  161  S  24  VS  205  S  46  VS  64  VS  70  VS  12  VS  7  VS  769  B  121  VS  4  VS  2  VS  42  VS  27  VS 
55  VS  35  VS  167  S  35  VS  60  VS  15  VS  20  VS  30  VS  52  VS  56  VS  11  VS  3  VS  475  S  117  VS  3  VS  2  VS  39  VS  26  VS 
Defining appropriate thresholds for the absolute clone size
Taking advantage of automata theory to describe clonality data
Discussion
Modern medicine has done much to eradicate and cure disease, but it has been less successful in some areas, such as cancer, which still remains one of the most common incurable diseases. Remarkable progress has been made recently in the genomics of cancer with the advent of NGS technologies. However, this explosion in rapidly generated, massive sets of loosely structured raw data has challenged our abilities to quantitatively analyze and draw knowledge from this information [52]. Formal modeling can address this problem by enabling appropriate simplification of real data and making sense of observed experimental data. Empirical modeling provides an accurate and complete picture of observed complex data, and it has many applications in the life science [53]. The merging of mathematics, computer science and biology in empirical models can reshape these fields by providing new ways of thinking about a problem. The virtue of mathematics in modeling is to confer clarity and precision to explanations, and to provide coherence and formalism to experimental observations [54]. The quantitative and objective power of mathematics allows understanding of otherwise hidden aspects of biological phenomena [28]. Computational models enable intuitive representation of the masses of biological data via their visualization capability, which in turn facilitates mechanistic understanding of disease [55].
The most effective and appropriate type of mathematical/computational modeling varies for each biological question. A practical model that is properly formulated to explain and interpret experimental and clinical data obtained from analyzing clonal expansion in ATL is greatly needed. We need a model that can imitate the components of our biological system (the clonality patterns and clone sizes) and reflect its properties intuitively.
In the current study, we aimed to organize and intuitively express data from NGS on clonal expansion of HTLV1–infected individuals using finite automata theory. Finite automata theory can describe and analyze dynamic behaviors of systems, and it is capable of simply representing complicated processes [43, 52]. Finite automata theory, which is a welldeveloped formal system, is used in processing various strings and sequences, especially in DNA sequence processing [51]. DFA are a subtype of finite automata theory and are simple computational structures that can formally illustrate the size order and combination of observed clones (clonality patterns). Our model translates the observed data into formal mathematical language by formulating a precise relationship between a set of clones in terms of their sizes and presenting this relationship in an easily understood state diagram.
Conventionally, clonality has been described as polyclonal, oligoclonal and monoclonal [56, 57]. However, these pattern descriptions are not quantifiable. For instance, it is known that the monoclonal expansion that results in large clones is an intrinsic feature of ATL development [20]; however, absolute clone sizes to describe this phenomenon have not been determined. In recognition of this limitation, we categorized the observed clone sizes into defined groups by which we could intuitively assess the degree of clonal expansion. We defined four groups of clonality patterns and four groups of clone sizes. Thus we defined polyclonal as a pattern showing different combinations and large numbers of VS and/or S clones, oligoclonal as a pattern showing more than one B or VB clone in combination with large numbers of VS and/or S clones, and monoclonal as a pattern showing a single VB or B clone in combination with a background of VS and/or S clones. In this way, we could attribute a meaning to the observed clone sizes and assess their contribution to ATL progression. In other words, we quantified how large a clone must be to affect the clinical status of an infected individual.
Generally, it is known that competition between clones shapes their distribution [3], but we do not know how a clone wins this competition to undergo clonal expansion. Presumably, a clone needs to become large enough to gain a fitness advantage to outcompete other clones. Coexistence of large numbers of S and VS clones, as well as presence of limited numbers of B or VB clones together with large numbers of S or VS clones in each given sample was observed. Total number and type of isolated clones are provided in Table 1, and Table 2. Therefore, we suggest that small cell populations (VS and S) do not have a selective advantage and can coexist with other clones. ACs and patients with SM ATL with low PVL harbored only VS and S clones, whereas all dominant clones in aggressive ATL (acute) were VB. The observed clone sizes were sorted in ascending order, and then thresholds of 2^{7}, 2^{9} and 2^{11} cells were applied. Hence, observed clone sizes were categorized into four distinct groups. Over the threshold of 2^{11} cells, the largest clones in the samples that had monoclonal patterns were categorized within the same group. Within the threshold range of 2^{9} to 2^{11} cells, the two largest clones in the samples that had oligoclonal patterns were categorized in the same group. Within the threshold range of 2^{7} to 2^{9} cells, clones in the samples that had polyclonal patterns with PVLs > 4% were categorized in the same group. Below the threshold of 2^{7} cells, the clones in the samples that had polyclonal patterns with PVLs < 4% were categorized in the same group. This quantified categorization of clone size not only is more intuitive for biological interpretation but also facilitates the transferal of clone size information into our model.
To convert the complex nature of data on clonal expansion into a manageable level of simplicity, borrowed the aid of mathematics and computer science. Our proposed DFA describe the clonality status of infected individuals as the output of final states q1, q2 and q3. Transitions are described by the function δ, which specifies exactly one next state for each possible combination of state and input symbol. The rows in the transition table indicate the states Q, the columns the input symbols, and the table entries the transition function δ. We indicated the accepting state with an asterisk in the figures. F = q1, q2, q3 indicates the final states, which consist of a set of states Q.
The final state of q1 represents an early stage of clonal expansion in which clone sizes does not exceed the threshold of 512 infected cells. AC patients with PVL > 4% and the SM ATL patients with low PVL terminated in this state. The DFA of samples of clinically progressed patients with SM and chronic subtypes with maximum clone sizes of 2048 infected cells terminated at the final state of q2. The final state of q3 included samples of the SM, chronic and acute subtypes with clone size > 2048. Acute samples, which represent the final stage of ATL progression, were observed only in q3. VB clones were observed only in the samples whose DFA terminated at q3. In the current study, c4 (VB clone) was observed only once in each analyzed sample. Since, presence of more than one VB clone is theoretically possible, we put a loop on the q3 final state of our final DFA machine. In the case of observing such a sample, the clonality will be defined as a largely expanded oligoclonal pattern with q3 final state.
We conducted a crosssectional analysis of HTLV1–infected individuals with a broad range of PVLs, representing different progression states of disease. Although analyzing the same individuals over time is of great importance, obtaining these kinds of samples is difficult and needs to be addressed in future studies. However, having access to the JSPFAD biomaterials bank allowed us to obtain two longitudinal (2 years apart) samples from the same individual (F12 and F16). By analyzing these samples, we could directly examine the hypotheses that samples with a higher final state have a higher chance of disease progression. At the first time point (F12) the patient was diagnosed with chronic ATL and had a PVL of 83.81% and major clone size of 5377. At the second time point (F16), the patient had progressed to the acute stage and had a PVL of 51.42% and clone size of 4909. The PVL and size of the major clone at the second time point were presumably decreased because of therapy. However, the major clone with integration site of Chr9:123682855 (+) remained stable and dominant over the 2year period. The DFA for both time points terminated in q3. Thus, it appears that reaching the final state of q3 is a factor that can be used as a risk indicator. Because the final state of this patient was already q3 at the first time point, progression to the acute stage was predicted by our DFA. As further validation of our DFA, the other patients with acute ATL also had DFA that terminated at q3, and thus we expected these patients to have a poor prognosis. Subsequently, we confirmed the poor prognosis of these patients (F13, F14, F15 and F16) by checking their clinical followup data, which showed that they had all died of the disease. However, AC patients (F1, F2 and F3) and patients with SM ATL with low PVL (F4) who showed the final state of q1 in the DFA remained clinically stable without disease progression in two years.
The data suggest that our final proposed machine (M) not only describes the clonality status of patients at single time points within a crosssectional analysis but also opens the door for future analyses of longitudinal samples for predictive purposes. The predictive ability of this model with larger numbers of samples from the same individuals over time still needs to be examined.
We believe our model is an appropriate empirical model for this system because it uses real biological data without theoretical assumptions as well as the fewest number of variables and the simplest set of relationships to explain the clonality status of samples. This model has the potential to provide insight into clonal expansion of ATL, but we are still far from understanding exactly when, where and at which step clonal expansion and transformation occur and how they can be controlled. However, our multidisciplinary strategy for translating the data of clonal expansion into the computable language of mathematics via modeling opens new avenues to approach these relevant questions in future studies.
Currently, ATL patients are categorized into different subtypes of disease progression based on clinical manifestations [46]. These standard clinical criteria for diagnosis mainly include organ involvement, leukemic manifestation, and levels of lactate dehydrogenase and calcium [58]. However, molecular features that represent the disease status remain to be characterized. Considering that in clinical practice distinct therapeutic strategies are used for the treatment of different subtypes of ATL, accurate subtype classification is of great importance. Thus, there is demand for more robust classification of ATL subtypes mediated by a genomic feature, such as HTLV1 clonal composition. This kind of analysis would be also helpful in clinical decisionmaking, such as monitoring the outcome of therapeutic interventions, based on analysis of the clonality status of patients before and after therapy [59]. In this respect, constructing an empirical model of clonal expansion would be one of the primary steps towards developing a powerful software tool for automated analysis and interpretation of individual clonality, which holds great promise for molecular diagnostics and personalized therapeutic interventions.
Conclusions
We used HTLV1 integration sites as a stable fingerprint to identify infected cells and accurately monitor their clonal expansion. We isolated large numbers of integration sites and quantified the clone sizes of eighteen clinical samples by our high throughput and validated methodology. We defined a threshold system that categorizes the size of clones into discrete groups based on the number of infected cells in each clone. We could quantify polyclonal, oligoclonal and monoclonal patterns using this categorization. We found that harboring larger clones was strongly associated with the progression of a patient to the more aggressive type of ATL, whereas smaller clones were observed across all samples and had little impact on progression. All samples with low PVLs (<10%) had smaller clones, however those with higher PVL had both smaller clones and one or two dominant larger clones. For the first time, we suggested DFA as a formalism that can represent sequential order of clones. We found that our DFA accurately reflect the true patterns of clonal expansion for each sample. Analyzing a large cohort of clinical samples from the same patients over time with the appropriate formal models will provide new insights into the clonal expansion of ATL and will allow for possible clinical applications of clonality in molecular diagnostics and predictions of prognosis.
Abbreviations
 ATL:

Adult Tcell leukemia
 HTLV1:

Human Tcell leukemia virus type1
 PVL:

Proviral load
 AC:

Asymptomatic carrier
 SM:

Smoldering
 JSPFAD:

Joint Study on Predisposing Factors of ATL Development
 NGS:

Next generation sequencing
 DFA:

Deterministic finite state automata
Declarations
Acknowledgements
A. Farmanbar expresses deep respect and gratitude to the Otsuka Toshimi scholarship foundation for supporting his graduate studies. We thank JSPFAD for providing clinical samples; M. Nakashima and T. Akashi for maintenance of JSPFAD; T. Yamada and H. Farmanbar for their advices on in silico analysis and mathematics; Y. Suzuki, K. Abe, K. Imamura, T. Horiuchi, and M. Tosaka for sequencing technical support; and U. Firouzi for helps in design of figures.
Computational analyses were provided by the supercomputer system of the Human Genome Center, the Institute of Medical Science, the University of Tokyo. We appreciate the technical assistance by Hideyuki Nishijima.
Funding
This work was supported by Japan Agency for Medical Research and Development (AMED) grant number of 16ck0106133h0103 and 16ck0106136h0103 to (TW), Ministry of Education, Culture, Sports, Science, and Technology grant number of 221S0001 and 26293226 to (TW); the Japanese Society for the Promotion of Science, DC1 grant number of 24.6916 to (SF).
Availability of data and materials
The datasets generated during the current study have been deposited in the Sequence Read Archive of NCBI with SRA study accession of SRP080123.
Authors’ contributions
TW, KU, AF, and SF conceived the project. AF and SF designed and carried out the experiments and wrote the manuscript. AF and SF performed in silico data analysis and interpreted the data. SF performed the wet laboratory experiments. AF performed DFA modeling. SP and KN contributed to the in silico data analysis and interpretation. All authors assisted in drafting and critically revising the manuscript. TW supervised the study. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
This project was approved by research ethics committee of the University of Tokyo (approval No.10–50 and No.14–155).
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Nowell PC. The clonal evolution of tumor cell populations. Science. 1976;194(4260):23–8.View ArticlePubMedGoogle Scholar
 Crespi B, Summers K. Evolutionary biology of cancer. Trends Ecol Evol. 2005;20(10):545–52.View ArticlePubMedGoogle Scholar
 Merlo LMF, Pepper JW, Reid BJ, Maley CC. Cancer as an evolutionary and ecological process. Nat Rev Cancer. 2006;6(12):924–35.View ArticlePubMedGoogle Scholar
 Greaves M. Evolutionary determinants of cancer. Cancer Discov. 2015;5(8):806–20.View ArticlePubMedPubMed CentralGoogle Scholar
 Greaves M. Cancer causation: the Darwinian downside of past success? Lancet Oncol. 2002;3(4):244–51.View ArticlePubMedGoogle Scholar
 Greaves M, Maley CC. Clonal evolution in cancer. Nature. 2012;481(7381):306–13.View ArticlePubMedPubMed CentralGoogle Scholar
 Sprouffske K, Merlo LM, Gerrish PJ, Maley CC, Sniegowski PD. Cancer in light of experimental evolution. Curr Biol. 2012;22(17):R762–771.View ArticlePubMedPubMed CentralGoogle Scholar
 Firouzi S, Lopez Y, Suzuki Y, Nakai K, Sugano S, Yamochi T, Watanabe T. Development and validation of a new highthroughput method to investigate the clonality of HTLV1infected cells based on provirus integration sites. Genome Med. 2014;6(6):46.View ArticlePubMedPubMed CentralGoogle Scholar
 Horne SD, Ye CJ, Abdallah BY, Liu G, Heng HHQ. Cancer genome evolution. Transl Cancer Res. 2015;4(3):303–13.Google Scholar
 Sidow A, Spies N. Concepts in solid tumor evolution. Trends Genet. 2015;31(4):208–14.View ArticlePubMedPubMed CentralGoogle Scholar
 Takatsuki K. Discovery of adult Tcell leukemia. Retrovirology. 2005;2:16.View ArticlePubMedPubMed CentralGoogle Scholar
 Gallo RC. The discovery of the first human retrovirus: HTLV1 and HTLV2. Retrovirology. 2005;2:17.View ArticlePubMedPubMed CentralGoogle Scholar
 Okamoto T, Ohno Y, Tsugane S, Watanabe S, Shimoyama M, Tajima K, Miwa M, Shimotohno K. Multistep carcinogenesis model for adult Tcell leukemia. Jpn J Cancer Res. 1989;80(3):191–5.View ArticlePubMedGoogle Scholar
 Fujino T, Nagata Y. HTLVI transmission from mother to child. J Reprod Immunol. 2000;47(2):197–206.View ArticlePubMedGoogle Scholar
 Tsukasaki K, Tobinai K. Biology and treatment of HTLV1 associated Tcell lymphomas. Best Pract Res Clin Haematol. 2013;26(1):3–14.View ArticlePubMedGoogle Scholar
 Etoh K, Tamiya S, Yamaguchi K, Okayama A, Tsubouchi H, Ideta T, Mueller N, Takatsuki K, Matsuoka M. Persistent clonal proliferation of human Tlymphotropic virus type Iinfected cells in vivo. Cancer Res. 1997;57(21):4862–7.PubMedGoogle Scholar
 Ishitsuka K, Tamura K. Human Tcell leukaemia virus type I and adult Tcell leukaemialymphoma. Lancet Oncol. 2014;15(11):e517–526.View ArticlePubMedGoogle Scholar
 Iwanaga M, Watanabe T, Yamaguchi K. Adult Tcell leukemia: a review of epidemiological evidence. Front Microbiol. 2012;3:322.View ArticlePubMedPubMed CentralGoogle Scholar
 Iwanaga M, Watanabe T, Utsunomiya A, Okayama A, Uchimaru K, Koh KR, Ogata M, Kikuchi H, Sagara Y, Uozumi K, et al. Human Tcell leukemia virus type I (HTLV1) proviral load and disease progression in asymptomatic HTLV1 carriers: a nationwide prospective study in Japan. Blood. 2010;116(8):1211–9.View ArticlePubMedGoogle Scholar
 Yoshida M, Seiki M, Yamaguchi K, Takatsuki K. Monoclonal integration of human Tcell leukemia provirus in all primary tumors of adult Tcell leukemia suggests causative role of human Tcell leukemia virus in the disease. Proc Natl Acad Sci U S A. 1984;81(8):2534–7.View ArticlePubMedPubMed CentralGoogle Scholar
 RodriguezBrenes IA, Wodarz D. Preventing clonal evolutionary processes in cancer: Insights from mathematical models. Proc Natl Acad Sci U S A. 2015;112(29):8843–50.View ArticlePubMedPubMed CentralGoogle Scholar
 Berry CC, Gillet NA, Melamed A, Gormley N, Bangham CRM, Bushman FD. Estimating abundances of retroviral insertion sites from DNA fragment length data. Bioinformatics. 2012;28(6):755–62.View ArticlePubMedPubMed CentralGoogle Scholar
 Gillet NA, Malani N, Melamed A, Gormley N, Carter R, Bentley D, Berry C, Bushman FD, Taylor GP, Bangham CRM. The host genomic environment of the provirus determines the abundance of HTLV1–infected Tcell clones. Blood. 2011;117(11):3113–22.View ArticlePubMedPubMed CentralGoogle Scholar
 Kamihira S, Sugahara K, Tsuruda K, Minami S, Uemura A, Akamatsu N, Nagai H, Murata K, Hasegawa H, Hirakata Y, et al. Proviral status of HTLV1 integrated into the host genomic DNA of adult Tcell leukemia cells. Clin Lab Haematol. 2005;27(4):235–41.View ArticlePubMedGoogle Scholar
 Cook LB, Rowan AG, Melamed A, Taylor GP, Bangham CR. HTLV1infected T cells contain a single integrated provirus in natural infection. Blood. 2012;120(17):3488–90.View ArticlePubMedPubMed CentralGoogle Scholar
 Gammon K. Mathematical modelling: Forecasting cancer. Nature. 2012;491(7425):S66–67.View ArticlePubMedGoogle Scholar
 Segal E, Friedman N, Kaminski N, Regev A, Koller D. From signatures to models: understanding cancer using microarrays. Nat Genet. 2005;37:S38–45.View ArticlePubMedGoogle Scholar
 Altrock PM, Liu LL, Michor F. The mathematics of cancer: integrating quantitative models. Nat Rev Cancer. 2015;15(12):730–45.View ArticlePubMedGoogle Scholar
 Araujo RP, McElwain DL. A history of the study of solid tumour growth: the contribution of mathematical modelling. Bull Math Biol. 2004;66(5):1039–91.View ArticlePubMedGoogle Scholar
 Beerenwinkel N, Schwarz RF, Gerstung M, Markowetz F. Cancer evolution: mathematical models and computational inference. Syst Biol. 2015;64(1):e1–25.View ArticlePubMedGoogle Scholar
 Tanaka RJ, Ono M. Skin Disease Modeling from a Mathematical Perspective. J Invest Dermatol. 2013;133(6):1472–8.View ArticlePubMedGoogle Scholar
 Nowak MA, Bangham CR. Population dynamics of immune responses to persistent viruses. Science. 1996;272(5258):74–9.View ArticlePubMedGoogle Scholar
 Wodarz D, Bangham CR. Evolutionary dynamics of HTLVI. J Mol Evol. 2000;50(5):448–55.View ArticlePubMedGoogle Scholar
 Stilianakis NI, Seydel J. Modeling the Tcell dynamics and pathogenesis of HTLVI infection. Bull Math Biol. 1999;61(5):935–47.View ArticlePubMedGoogle Scholar
 Wang L, Li MY, Kirschner D. Mathematical analysis of the global dynamics of a model for HTLVI infection and ATL progression. Math Biosci. 2002;179(2):207–17.View ArticlePubMedGoogle Scholar
 Katri P, Ruan S. Dynamics of human Tcell lymphotropic virus I (HTLVI) infection of CD4+ Tcells. C R Biol. 2004;327(11):1009–16.View ArticlePubMedGoogle Scholar
 Cai L, Li X, Ghosh M. Global dynamics of a mathematical model for HTLVI infection of CD4+ Tcells. Appl Math Model. 2011;35(7):3587–95.View ArticleGoogle Scholar
 Lim AG, Maini PK. HTLVI infection: a dynamic struggle between viral persistence and host immunity. J Theor Biol. 2014;352:92–108.View ArticlePubMedGoogle Scholar
 Harel D. Statecharts  a Visual Formalism for ComplexSystems. Sci Comput Program. 1987;8(3):231–74.View ArticleGoogle Scholar
 Turing AM. On computable numbers, with an application to the Entscheidungsproblem. P Lond Math Soc. 1937;42:230–65.View ArticleGoogle Scholar
 Khwaja AA, Urban JE. A property based specification formalism classification. J Syst Software. 2010;83(11):2344–62.View ArticleGoogle Scholar
 Hopcroft JE, Motwani R, Ullman JD. Introduction to Automata Theory, Languages, and Computation. 3rd ed. Prentice Hall; 2006. ISBN 10: 0321455371 / ISBN 13: 9780321455376
 Demetriu LA. Abstract Biological Systems as Sequential Machines  Behavioral Reversibility. B Math Biophys. 1966;28(2):153–60.View ArticleGoogle Scholar
 Yamaguchi K, Uozumi K, Taguchi H, Kikuchi H, Okayama A, Kamihira S, Hino S, Nosaka K, Watanabe T. Nationwide cohort study of HTLV1 carriers in Japan: Joint study on predisposing factors of ATL development (JSPFAD). Aids Res Hum Retrov. 2007;23(4):582–2.
 Biomaterial resource bank of HTLV1 carriers JSPFAD: http://htlv1.org/old/banken.html. Accessed 29 Oct 2016.
 Shimoyama M. Diagnostic criteria and classification of clinical subtypes of adult Tcell leukaemialymphoma. A report from the Lymphoma Study Group (1984–87). Br J Haematol. 1991;79(3):428–37.View ArticlePubMedGoogle Scholar
 FastQC: A quality control tool for high throughput sequence data. http://wwwbioinformaticsbabrahamacuk/projects/fastqc/.
 Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic Local Alignment Search Tool. J Mol Biol. 1990;215(3):403–10.View ArticlePubMedGoogle Scholar
 Seiki M, Hattori S, Hirayama Y, Yoshida M. Human adult Tcell leukemia virus: complete nucleotide sequence of the provirus genome integrated in leukemia cell DNA. Proc Natl Acad Sci U S A. 1983;80(12):3618–22.View ArticlePubMedPubMed CentralGoogle Scholar
 Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memoryefficient alignment of short DNA sequences to the human genome. Genome Biol. 2009;10(3):R25.View ArticlePubMedPubMed CentralGoogle Scholar
 Holub J. The Finite Automata Approaches in Stringology. Kybernetika. 2012;48(3):386–401.Google Scholar
 Shendure J, Aiden EL. The expanding scope of DNA sequencing. Nat Biotechnol. 2012;30(11):1084–94.View ArticlePubMedPubMed CentralGoogle Scholar
 Shou W, Bergstrom CT, Chakraborty AK, Skinner FK. Theory, models and biology. Elife. 2015;4:e07158.PubMedPubMed CentralGoogle Scholar
 May RM. Uses and abuses of mathematics in biology. Science. 2004;303(5659):790–3.View ArticlePubMedGoogle Scholar
 Fisher J, Henzinger TA. Executable cell biology. Nat Biotechnol. 2007;25(11):1239–49.View ArticlePubMedGoogle Scholar
 Takemoto S, Matsuoka M, Yamaguchi K, Takatsuki K. A novel diagnostic method of adult Tcell leukemia: monoclonal integration of human Tcell lymphotropic virus type I provirus DNA detected by inverse polymerase chain reaction. Blood. 1994;84(9):3080–5.PubMedGoogle Scholar
 Tsukasaki K, Tsushima H, Yamamura M, Hata T, Murata K, Maeda T, Atogami S, Sohda H, Momita S, Ideda S, et al. Integration patterns of HTLVI provirus in relation to the clinical course of ATL: frequent clonal change at crisis from indolent disease. Blood. 1997;89(3):948–56.PubMedGoogle Scholar
 Tsukasaki K, Hermine O, Bazarbachi A, Ratner L, Ramos JC, Harrington Jr W, O’Mahony D, Janik JE, Bittencourt AL, Taylor GP, et al. Definition, prognostic factors, treatment, and response criteria of adult Tcell leukemialymphoma: a proposal from an international consensus meeting. J Clin Oncol. 2009;27(3):453–9.View ArticlePubMedPubMed CentralGoogle Scholar
 Aoki S, Firouzi S, Lopez Y, Yamochi T, Nakano K, Uchimaru K, Utusnomiya A, Iwanaga M, Watanabe T. Transition of adult Tcell leukemia/lymphoma clones during clinical progression. Int J Hematol. 2016;104(3):330–7.View ArticlePubMedGoogle Scholar