From June 2009 to August 2010, we recruited a total of 903 participants prospectively from the Shenzhen People’s Hospital, the Zhuhai Municipal Maternal and Child Healthcare Hospital and the Shenzhen Maternal and Child Care Center. We recruited another 19 euploid adult males for the estimation of fetal DNA fraction. Institutional Review Board approval was obtained at each site, and all participants gave informed written consent. We obtained the full karyotyping results for all samples from regular clinical tests. We randomly selected 300 euploid samples among the karyotyping results to use as the reference controls.
Maternal plasma DNA sequencing
We collected 5 ml of peripheral venous blood from each of the 903 pregnant women in EDTA tubes. The tubes were centrifuged at 1,600 × g for 10 min within four hours of collection. Plasma was transferred to microcentrifuge tubes and centrifuged at 16,000 × g for 10 min to remove residual cells. Cell-free plasma was stored at −80°C until DNA extraction. Each plasma sample was frozen and thawed only once.
For massively parallel genomic sequencing, DNA fragments from 600 μl of maternal plasma were used for library construction according to a modified protocol from Illumina. End-repairing of maternal plasma DNA fragments was performed using T4 DNA polymerase, Klenow polymerase, and T4 polynucleotide kinase. Afterwards, A-base tailing adapters were ligated to the DNA fragments. Standard multiplex primers were introduced by 17-cycle PCR. The libraries were analysed for size distribution by Agilent Bioanalyzer and quantified using real-time PCR. Thirty-six-cycle single-end multiplex sequencing and 50-cycle single-end multiplex sequencing were used for the Illumina GAIIx and Illumina HiSeq 2000 platform, respectively.
Highly effective alignment with the universal unique reads set
Computationally, we split the human reference genome (hg18, NCBI build 36) into k-mers (where k is the length of the sequencing reads) and then aligned the k-mers back to the reference genome. All of the k-mers that could be uniquely mapped to a single position on the reference genome (the unique mapping reads) were named the universal unique reads set. For our analysis, we selected the sequencing reads (i.e. the tags) that could be mapped with 0 mismatches to the universal unique reads set.
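The construction of the unique reads set can be illustrated on a toy reference. The following is a minimal sketch, not the pipeline used in this study: the function names, the two short toy sequences and the forward-strand-only matching are all assumptions made for illustration, and a genome-scale implementation would rely on an aligner rather than an in-memory dictionary.

```python
from collections import Counter

def unique_kmers(reference, k):
    """Return {kmer: (chrom, pos)} for k-mers occurring exactly once in the reference.

    Assumption: only the forward strand is considered in this toy version.
    """
    counts = Counter()
    first_seen = {}
    for chrom, seq in reference.items():
        for pos in range(len(seq) - k + 1):
            kmer = seq[pos:pos + k]
            counts[kmer] += 1
            first_seen.setdefault(kmer, (chrom, pos))
    return {kmer: loc for kmer, loc in first_seen.items() if counts[kmer] == 1}

def map_reads(reads, unique_set):
    """Keep reads that match a unique k-mer exactly (0 mismatches)."""
    return [unique_set[read] for read in reads if read in unique_set]

# Hypothetical toy reference; "ACGT" occurs on both sequences and is therefore excluded.
toy_reference = {"chrA": "ACGTACGGA", "chrB": "TTGACGTTC"}
unique_set = unique_kmers(toy_reference, k=4)
hits = map_reads(["CGGA", "ACGT", "GGGG", "TTGA"], unique_set)
print(hits)
```

Reads that match a repeated k-mer ("ACGT") or no k-mer at all ("GGGG") are discarded, so every retained read has one unambiguous position.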
K-mer coverage and GC-correlation
We computed the k-mer coverage for each chromosome and every sample as $c_{i,j} = n_{i,j} / N_{i,j}$, where $i$ is the sample ID; $j$ is the chromosome ID; $n_{i,j}$ is the number of unique reads mapped onto chromosome $j$ from sample $i$; and $N_{i,j}$ is the total number of unique reads for chromosome $j$. Because of the differences among the samples, we normalized the data and computed the relative k-mer coverage for each sample as $cr_{i,j} = c_{i,j} / \bar{c}_i$, where $\bar{c}_i$ is the average k-mer coverage of the 22 autosomes in the $i$-th sample.
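The two normalization steps can be sketched as follows. The sketch assumes that the k-mer coverage is the ratio of mapped unique reads to the total number of unique reads for the chromosome, as the definitions above suggest; the function names and the counts are hypothetical, and three autosomes stand in for the 22 used in the study.

```python
def kmer_coverage(read_counts, unique_totals):
    """Per-chromosome coverage: mapped unique reads / total unique reads (assumed form)."""
    return {chrom: read_counts[chrom] / unique_totals[chrom] for chrom in read_counts}

def relative_coverage(coverage, autosomes):
    """Divide each chromosome's coverage by the mean coverage of the autosomes."""
    mean_autosomal = sum(coverage[c] for c in autosomes) / len(autosomes)
    return {chrom: value / mean_autosomal for chrom, value in coverage.items()}

# Hypothetical counts for a single sample.
read_counts = {"chr1": 200, "chr2": 300, "chr3": 100, "chrX": 90}
unique_totals = {"chr1": 1000, "chr2": 1000, "chr3": 1000, "chrX": 1000}

coverage = kmer_coverage(read_counts, unique_totals)
relative = relative_coverage(coverage, autosomes=["chr1", "chr2", "chr3"])
print(relative)
```

Because every chromosome is divided by the same per-sample autosomal mean, differences in sequencing depth between samples cancel out.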
Given the unclear mechanism of GC bias, we performed a LOESS regression to fit the relative k-mer coverage to the corresponding GC content. We denoted the fitted relative k-mer coverage as $\widehat{cr}_{i,j} = f(GC_{i,j})$, where $f$ is the fitted regression curve and $GC_{i,j}$ is the corresponding GC content. The fitted value, which we used as the theoretical value, was vital to our statistical model for cff-DNA concentration estimation and aneuploidy detection.
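Reading the regression above as a LOESS (locally weighted) fit, the fitted relative coverage at a given GC content can be sketched with a tricube-weighted local linear regression. This is an illustrative re-implementation under that assumption, not the software used in the study; the function name, the `frac` smoothing parameter and the data points are hypothetical.

```python
def loess_fit(xs, ys, x0, frac=0.6):
    """Evaluate a tricube-weighted local linear fit at x0 (minimal LOESS sketch)."""
    n = len(xs)
    k = max(3, int(round(frac * n)))            # number of neighbours in the window
    distances = sorted(abs(x - x0) for x in xs)
    bandwidth = distances[k - 1] or 1e-12       # distance to the k-th nearest point
    weights = [(1 - min(abs(x - x0) / bandwidth, 1.0) ** 3) ** 3 for x in xs]
    total = sum(weights)
    mean_x = sum(w * x for w, x in zip(weights, xs)) / total
    mean_y = sum(w * y for w, y in zip(weights, ys)) / total
    sxx = sum(w * (x - mean_x) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mean_x) * (y - mean_y) for w, x, y in zip(weights, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return mean_y + slope * (x0 - mean_x)

# Hypothetical (GC content, relative coverage) pairs lying on a straight line.
gc = [0.36, 0.38, 0.40, 0.42, 0.44, 0.46, 0.48]
cr = [2.0 - 2.0 * g for g in gc]
fitted = loess_fit(gc, cr, 0.41)
print(fitted)
```

On exactly linear data a local linear fit reproduces the line, which gives a simple sanity check; on real data the curve follows the local trend without assuming a global functional form, which suits a bias whose mechanism is unclear.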
Because we used a male/female data set, we had different fitted values for the analysis of sex chromosomes. We calculated the fitted relative k-mer values for the sex chromosome analysis as follows, where $f_m$ and $f_f$ are the regression curves of relative k-mer coverage on GC content ($GC_{i,j}$):
$\widehat{cr}^{\,m}_{i,j} = f_m(GC_{i,j})$ $(j = X, Y)$, for the fitted relative k-mer coverage from a regression of an adult male data set; and
$\widehat{cr}^{\,f}_{i,j} = f_f(GC_{i,j})$ $(j = X, Y)$, for the fitted relative k-mer coverage from a regression of a fetal-female data set.
Cff-DNA concentration estimation
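Using the gender difference in the relative k-mer coverage of the sex chromosomes, we estimated the cff-DNA concentration, which we denote as ε. Subscripts corresponding to chromosome IDs indicate concentrations estimated from different chromosomes:
$\varepsilon_{i,Y} = (cr_{i,Y} - \widehat{cr}^{\,f}_{i,Y}) / (\widehat{cr}^{\,m}_{i,Y} - \widehat{cr}^{\,f}_{i,Y})$ is the estimation using the data for chromosome Y; and
$\varepsilon_{i,X} = (cr_{i,X} - \widehat{cr}^{\,f}_{i,X}) / (\widehat{cr}^{\,m}_{i,X} - \widehat{cr}^{\,f}_{i,X})$ is the estimation using data for chromosome X, where $\widehat{cr}^{\,m}_{i,j}$ and $\widehat{cr}^{\,f}_{i,j}$ are the fitted relative k-mer coverages from the adult male and fetal-female regressions, respectively.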
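One way to read the concentration estimate is as a two-component mixture: plasma from a pregnancy with a male fetus behaves like a fraction ε of male DNA mixed with a fraction 1 − ε of female DNA, so the observed relative coverage of a sex chromosome lies between the female-fitted and male-fitted values. The sketch below solves this assumed linear model for ε; the model, the function name and the numerical values are illustrative assumptions rather than quantities taken from the study.

```python
def cff_fraction(observed, fitted_female, fitted_male):
    """Solve observed = (1 - eps) * fitted_female + eps * fitted_male for eps."""
    return (observed - fitted_female) / (fitted_male - fitted_female)

# Hypothetical fitted values: chromosome Y coverage rises with male DNA,
# chromosome X coverage falls (one X copy in males versus two in females).
eps_y = cff_fraction(observed=0.08, fitted_female=0.03, fitted_male=0.53)
eps_x = cff_fraction(observed=0.95, fitted_female=1.00, fitted_male=0.50)
print(eps_y, eps_x)
```

For a euploid male fetus the two estimates should agree, which is why a disagreement between them is informative for sex chromosomal aneuploidy.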
Autosomal aneuploidy detection with binary hypothesis
We developed a binary hypothesis strategy to achieve higher sensitivity and specificity. We performed two Student’s t-tests based on null/alternative hypotheses and subsequently calculated the relative logarithmic likelihood odds ratio. The null and alternative hypotheses are shown below.
For the first test:
H0 (null hypothesis): the fetal chromosome was euploid.
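H1 (alternative hypothesis): the fetal chromosome was trisomic.
The first t-value was $t1_{i,j} = (cr_{i,j} - \widehat{cr}_{i,j}) / sd_j$, where $\widehat{cr}_{i,j}$ is the fitted relative k-mer coverage and $sd_j$ is the standard deviation of $cr_{i,j} - \widehat{cr}_{i,j}$ among the reference controls.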
For the second test:
H0 (null hypothesis): the test fetal chromosome was trisomic.
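H1 (alternative hypothesis): the test fetal chromosome was euploid.
The second t-value was $t2_{i,j} = (cr_{i,j} - \widehat{cr}_{i,j}(1 + \varepsilon_i / 2)) / sd_j$, where $\widehat{cr}_{i,j}$ is the fitted relative k-mer coverage, $\varepsilon_i$ is the estimated cff-DNA concentration of sample $i$, and $sd_j$ is the standard deviation of $cr_{i,j} - \widehat{cr}_{i,j}$ among the reference controls.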
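The two tests can be illustrated numerically. The sketch assumes that each t-value has the usual form (observed − expected) / standard deviation, that the expected relative coverage of a euploid chromosome is the GC-fitted value, and that a fetal trisomy raises it by a factor of 1 + ε/2 (one extra fetal copy on top of two). These forms, the function names and the numbers are assumptions for illustration only.

```python
def t_euploid(observed, fitted, sd):
    """t-value under the hypothesis that the fetal chromosome is euploid."""
    return (observed - fitted) / sd

def t_trisomy(observed, fitted, sd, fetal_fraction):
    """t-value under the hypothesis that the fetal chromosome is trisomic.

    Assumption: a trisomy scales the expected coverage by (1 + fetal_fraction / 2).
    """
    return (observed - fitted * (1 + fetal_fraction / 2)) / sd

# Hypothetical sample: 10% fetal DNA, observed coverage about 4.9% above the fit.
t1 = t_euploid(observed=1.049, fitted=1.0, sd=0.005)
t2 = t_trisomy(observed=1.049, fitted=1.0, sd=0.005, fetal_fraction=0.10)
print(round(t1, 3), round(t2, 3))
```

In this example the sample is far from the euploid expectation (large |t1|) and close to the trisomic expectation (small |t2|), which is the pattern the pair of tests is designed to separate from the reverse one.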
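The logarithmic likelihood odds ratio between our binary hypotheses was defined as $L_{i,j} = \log p(t1_{i,j}, \mathrm{DOF} \mid D) / \log p(t2_{i,j}, \mathrm{DOF} \mid T)$, where DOF = the degree of freedom, D = the euploid (disomic) hypothesis and T = the trisomic hypothesis. We used $|t1_{i,j}| > 3$ and $|t2_{i,j}| < 3$ as warning criteria. From the logarithmic likelihood odds ratio, we could make a confident judgment of autosomal aneuploidy if L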
Fetal gender classification and sex chromosomal aneuploidy detection
We developed a dual strategy, combining an experimental threshold with logistic regression, to determine the fetal gender. The k-mer coverage on chromosome Y was an ideal choice for distinguishing genders. Based on the 300 reference controls, we took $cr_{i,Y} < 0.04$ as the threshold for identifying a female fetus, and we regarded samples with $cr_{i,Y} > 0.051$ as having a male fetus. We considered samples with $0.04 < cr_{i,Y} < 0.051$ to be gender-uncertain.
Additionally, we developed a logistic regression strategy to improve the specificity of the gender determination. We computed the probability ($p_i$) that a fetus was male by the following formula:
$p_i = 1 / (1 + e^{-(\beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2})})$, where $x_{i,1}$ and $x_{i,2}$ are the two regression covariates for sample $i$ and the parameters (β0, β1, β2) were determined by regression using the 300 reference controls mentioned above.
We regarded samples with $p_i > 0.8$ as having male fetuses, samples with $p_i < 0.3$ as having female fetuses, and the remaining samples as being gender-uncertain.
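The two gender criteria can be written as small decision functions. The thresholds (0.04, 0.051, 0.8 and 0.3) are those stated above; the standard logistic form with two covariates, the function names and the coefficient values in the example are assumptions, since the fitted parameters are not reported in this section.

```python
import math

def classify_by_threshold(cr_y, female_max=0.04, male_min=0.051):
    """Experimental threshold on the relative k-mer coverage of chromosome Y."""
    if cr_y < female_max:
        return "female"
    if cr_y > male_min:
        return "male"
    return "uncertain"

def male_probability(x1, x2, beta):
    """Standard logistic model with two covariates (assumed form)."""
    b0, b1, b2 = beta
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x1 + b2 * x2)))

def classify_by_probability(p, male_min=0.8, female_max=0.3):
    """Probability cut-offs for the logistic regression criterion."""
    if p > male_min:
        return "male"
    if p < female_max:
        return "female"
    return "uncertain"

# Hypothetical coefficients and covariates, chosen only to exercise the functions.
p = male_probability(x1=0.95, x2=0.08, beta=(-4.0, 0.0, 100.0))
print(classify_by_threshold(0.08), classify_by_probability(p))
```

Leaving an explicit "uncertain" band in both criteria keeps borderline samples out of the downstream sex chromosome tests instead of forcing a call.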
After gender classification, we performed XXX and XO detection on samples with a female fetus and XXY and XYY detection on samples with a male fetus.
For samples with a female fetus, we performed a t-test for chromosome abnormality detection.
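$t_{i,X} = (cr_{i,X} - \widehat{cr}^{\,f}_{i,X}) / sd^{\,f}_{X}$, where $\widehat{cr}^{\,f}_{i,X}$ is the fitted relative k-mer coverage of chromosome X from the fetal-female regression and $sd^{\,f}_{X}$ is the standard deviation of $cr_{i,X} - \widehat{cr}^{\,f}_{i,X}$ calculated from the reference controls with female fetuses; we expected this difference to equal zero. We considered samples with $t_{i,X} < -2.5$ to be XXX or XO.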
For a male fetus, we first supposed that chromosome Y is monosomic and extrapolated the fitted k-mer coverage for chromosome X, with the fetal DNA fraction estimated only by the k-mer coverage of chromosome Y. We calculated the t-score by the following formula:
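$t_{i,X} = \left(cr_{i,X} - \left[(1 - \varepsilon_{i,Y})\,\widehat{cr}^{\,f}_{i,X} + \varepsilon_{i,Y}\,\widehat{cr}^{\,m}_{i,X}\right]\right) / sd^{\,f}_{X}$, where $\varepsilon_{i,Y}$ is the estimated cff-DNA concentration using chromosome Y data, $\widehat{cr}^{\,f}_{i,X}$ and $\widehat{cr}^{\,m}_{i,X}$ are the fitted relative k-mer coverages of chromosome X from the fetal-female and adult male regressions, and $sd^{\,f}_{X}$ is the standard deviation of $cr_{i,X} - \widehat{cr}^{\,f}_{i,X}$ calculated from the reference controls carrying female fetuses, with an expectation of zero. Both of these quantities are defined above.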
We regarded samples with $t_{i,X} > 2.5$ as being XXY or XYY. Additionally, the cff-DNA concentrations estimated independently from chromosomes X and Y served as a combined marker for sex chromosomal aneuploidy detection, especially for XXY and XYY. For an XXY sample, not only was $t_{i,X} > 2.5$, but the cff-DNA concentration estimated by chromosome X was also nearly zero, with a confidence interval from −0.03 to 0.03. For an XYY sample, not only was $t_{i,X} > 2.5$, but the R-value (the ratio of the cff-DNA concentration estimated by chromosome Y to that estimated by chromosome X) was also nearly two, reflecting the fact that there were two copies of chromosome Y and only a single copy of chromosome X.
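The combined marker for male fetuses can be summarised as a short decision rule. The t-score cut-off of 2.5 and the ±0.03 band for the chromosome X estimate are taken from the text above; the tolerance used to decide that the R-value is "nearly two", the function name and the output labels are hypothetical choices made for this sketch.

```python
def classify_male_fetus(t_x, eps_x, eps_y, t_cut=2.5, eps_x_band=0.03, ratio_tol=0.3):
    """Combine the chromosome X t-score with the X- and Y-based cff-DNA estimates.

    ratio_tol is an assumed tolerance for "R-value nearly two"; it is not from the study.
    """
    if t_x <= t_cut:
        return "no XXY/XYY signal"
    if abs(eps_x) <= eps_x_band:
        # Two fetal X copies match the maternal dosage, so the X-based estimate is near zero.
        return "XXY-like"
    if eps_x > 0 and abs(eps_y / eps_x - 2.0) <= ratio_tol:
        # Two fetal Y copies double the Y-based estimate relative to the X-based one.
        return "XYY-like"
    return "flagged, unresolved"

print(classify_male_fetus(t_x=3.1, eps_x=0.01, eps_y=0.12))
print(classify_male_fetus(t_x=3.0, eps_x=0.06, eps_y=0.12))
print(classify_male_fetus(t_x=1.0, eps_x=0.10, eps_y=0.10))
```

Samples that exceed the t-score cut-off but match neither pattern are kept as a separate category here, since the text does not say how such cases were resolved.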