Sequence information of primers used in this study is available upon request.
Genome-wide copy number analysis for ~ 2100 individuals
We analyzed CNVs in the genome of ~ 2100 individuals with various types of developmental defects. Salient clinical features of these individuals were congenital malformations, growth failure, mental retardation, and/or disorders of sex development. Most participants were children of Japanese origin.
Genomic DNA samples were obtained from peripheral leukocytes or buccal swabs of the participants and subjected to array-based comparative genomic hybridization (aCGH) using catalog microarrays (SurePrint G3 human CGH; Agilent Technologies, Santa Clara, CA, USA). The experiments were carried out according to the manufacturer’s instructions. The results were analyzed using the Genomic Workbench 7.0 (Agilent Technologies). We searched for cases carrying multiple large (≥ 0.1 Mb) CNVs. Known recurrent CNVs in the Database of Genomic Variants  were excluded from further analyses. We also analyzed DNA samples obtained from the patients’ parents to exclude inherited CNVs.
Clinical and cytogenetic analyses of a patient with mdnCNVs
A patient with mdnCNVs was subjected to clinical analysis. Furthermore, the patient underwent conventional cytogenetic analyses. G-banding analysis was performed using lymphocyte metaphase spreads of 20 cells (SRL Inc., Tokyo, Japan). In addition, we performed multicolor fluorescent in situ hybridization to determine the presence or absence of interchromosomal translocations. The experiments were carried out according to a standard protocol (LSI Medience Corporation, Tokyo, Japan).
DNA methylation analysis
We analyzed DNA methylation indexes of seven CpG sites in the differentially methylated region at 6q24.2 for the patient with mdnCNVs. These CpG sites are located within a differentially methylated region adjacent to PLAGL1, an imprinted gene involved in glucose homeostasis [9, 10]. Usually, these CpG sites are methylated when transmitted from the mother and remain unmethylated when transmitted from the father . Hypomethylation of these CpG sites is known to result in transient neonatal diabetes mellitus (TNDM) via PLAGL1 overexpression. Pyrosequencing was performed as described previously . In brief, a genomic DNA sample obtained from the patient’s leukocytes was treated with bisulfite using the EZ DNA Methylation-Gold Kit (Zymo Research, Irvine, CA, USA). The genomic region encompassing seven CpG sites was PCR-amplified using 5′-biotinylated and unlabeled primers. The PCR products were hybridized to a sequencing primer and loaded on the PyroMark Q24 sequencer (QIAGEN, Hilden, Germany). Methylation indexes were calculated using the PyroMark CpG Software (QIAGEN). The results of the patient were compared with the reference data obtained from 49 unaffected Japanese individuals .
Further copy number analysis of the patient
To examine the presence or absence of somatic mosaicism and postnatal genomic instability of the patient, we performed further copy number analysis. Genomic DNA samples were obtained from peripheral leukocytes and oral swabs of the patient at 2 weeks and at 8 months of age, and subjected to aCGH. We calculated the average log ratios of signal intensity for probes located within the CNVs. Theoretically, the average log ratios for probes in non-mosaic heterozygous deletions and duplications are − 1.0 and + 0.58, respectively. Therefore, the average log ratios significantly higher than − 1.0 (for probes in deletions) or lower than + 0.58 (for probes in duplications) were considered as signs of mosaicism.
Microsatellite analysis was performed for six loci in the duplicated regions on 6q and 13q. The methods were described previously . In brief, genomic DNA samples of the patient and his parents were PCR-amplified using 5′-FAM-labeled and unlabeled primers. The PCR products were loaded on the Applied Biosystems 3130xl Genetic Analyzer (Thermo Fisher Scientific, Waltham, MA, USA). The results were analyzed using GeneMapper Software 3.7 (Thermo Fisher Scientific). Copy numbers of paternally and maternally derived alleles in the patient were estimated by calculating the areas under the curve of PCR products.
Short-read paired-end whole genome sequencing
Genomic DNA samples obtained from peripheral leukocytes of the patient and his mother were subjected to paired-end whole genome sequencing. Genomic libraries were prepared using the TruSeq DNA PCR-Free Library Preparation Kit (Illumina, San Diego, CA, USA) and loaded on a HiSeqX sequencer (Illumina). Sequence data were obtained as 150-bp paired-end reads (GENEWIZ, South Plainfield, NJ, USA). Base calling and quality scoring were performed using Real Time Analysis 2.7.7 (Illumina). The base call files were converted to FASTQ files using the bcl2fastq Conversion Software 2.17 (Illumina). We removed the library adaptor sequences using cutadapt 1.14 . We also trimmed low quality bases (quality scores of less than 16) at both ends and removed short reads of less than 72-bp. Sequence reads were mapped against the human genome reference (hg19/GRCh37) using Burrows-Wheeler Aligner (BWA) 0.7.15 . We used Picard 2.8.1 (Broad Institute, Cambridge, MA, USA) to remove duplicate reads and to verify mate-pairs. Local realignment and base quality score recalibration were performed with the Genome Analysis Toolkit 3.5 (Broad Institute). BAM file data were visualized on the Integrative Genomic Viewer (IGV; Broad Institute).
Synthetic long-read whole genome sequencing and haplotype-phasing
Haplotype-phasing of the patient and his father were performed through synthetic long-read whole genome sequencing (Macrogen, Seoul, South Korea). Genomic DNA samples were extracted from leukocytes. Sequencing libraries were prepared using the 10x Genomics Chromium System (10x Genomics, Pleasanton, CA, USA) and loaded on the HiSeqX sequencer (Illumina). Sequence data were obtained as 150-bp paired-end reads. Base calling and quality scoring were performed using Real Time Analysis 2 (Illumina). The base call files were converted to FASTQ files using the Long Ranger 2.1.3 (10x Genomics). Then, Long Ranger 2.1.5 (10x Genomics) was used to align sequence reads and to call single nucleotide variants (SNVs), indels, and structural variants. The output data were visualized on the Loupe 2.1.2 (10x Genomics). We genotyped SNVs of the patient and his father and constructed haplotype-phase blocks. In this analysis, we focused primarily on genomic regions flanking the breakpoints.
Genotyping of SNVs in the duplicated regions
We referred to the whole genome sequencing data of the patient and his parents, to determine the number of alleles involved in the patient’s duplications. To this end, we selected SNVs in the duplicated regions for which the father and the mother were heterozygous and homozygous, respectively, and the patient was heterozygous with the minor allele accounting for 26–40% of the total sequence reads. We assessed the genotype of the patient as diallelic and triallelic when the reads of the father-specific allele accounted for approximately 2/3 and 1/3 of the total reads, respectively.
To determine the precise genomic position of each breakpoint, we visualized the short-read whole genome sequencing data on IGV and searched for discordant paired-end reads. We referred to the results of G-banding analysis and aCGH to decide the candidate regions. We also analyzed the data of synthetic long-read whole genome sequencing on Loupe 2.1.2 (10x Genomics) to detect breakpoints that were missed by the G-banding analysis and aCGH.
The junction structures were determined by direct sequencing of the PCR-amplified DNA fragments harboring the fusion points. To this end, we performed PCR reactions using serial primers flanking the breakpoints. The genomic position of each breakpoint was decided against the human genome reference (hg19/GRCh37).
In silico analyses for the breakpoint-flanking regions
In silico analyses were carried out for genomic regions at the proximal and distal sides of each breakpoint. First, we searched for microhomologies and microhomeologies at the breakpoints. Microhomology was defined as 100% identical DNA sequences less than 70 nucleotides, while microhomeology was defined as five or more nucleotides with 70% or more homology, without > 2 consecutive nucleotide gaps. We analyzed 20-bp sequences at the proximal and distal sides of each breakpoint using LALIGN (Swiss Institute of Bioinformatics, Lausanne, Switzerland) . We set the opening gap penalty to 0 to detect all candidate sequences. The detected sequences were evaluated manually. Second, we examined the presence or absence of repetitive sequences within 2-kb regions at the proximal and distal sides of each breakpoint. Long interspersed nuclear elements (LINEs), short interspersed nuclear elements (SINEs), and human endogenous retroviruses (HERVs), were searched using RepeatMasker . Third, we checked whether the breakpoints were associated with open chromatin histone marks, i.e., H3K4Me1, H3K4Me3, and H3K27Ac. We referred to the data sets of seven cell types (GM12878, H1-hESC, HSMM, HUVEC, K562, NHEK, and NHLF) in the UCSC Genome Browser . Fourth, replication timing of breakpoint-containing genomic regions was assessed as described previously . We employed the data of tested loci closest to each breakpoint. Lastly, we analyzed the frequency of de novo SNVs in the 100-bp regions around the breakpoints. To this end, we compared sequence data of the patient and his parents.
SNVs analyses using the whole genome data
First, to examine whether the patient’s genome was vulnerable to de novo point mutations, we counted the total number of de novo SNVs in the patient. We called de novo SNVs in the entire genome as described previously . Next, to identify disease-associated SNVs in the patient or his father, we searched for protein-altering variants and splice site substitutions in 302 genes known to be involved in genomic stability [21, 22]. Sequence reads were mapped against the human genome reference (hg19/GRCh37) using BWA. We used Picard 2.8.1 to remove duplicate reads and to verify mate-pairs. Local realignment, base quality score recalibration, variant calling, and variant annotation were performed with the Genome Analysis Toolkit 3.5 (Broad Institute) and ANNOVAR . The analyzed genes are shown in Additional file 1: Table S1 [21, 22]. To exclude common polymorphisms, we referred to the data of the 1000 Genomes Project , Exome Aggregation Consortium , and Human Genetic Variation Database . SNVs that were predicted as benign/tolerated by all of the five in silico programs, namely, Sorting Intolerant From Tolerant , PolyPhen-2 HVAR , MutationTaster , Combined Annotation Dependent Depletion , and Mendelian Clinically Applicable Pathogenicity Score  were excluded as probably benign polymorphisms. SNVs of interest were confirmed by Sanger sequencing.