Cases of DVT were selected from 2139 unrelated patients referred to the Angelo Bianchi Bonomi Hemophilia and Thrombosis Center, Milan, (Italy) for diagnostic workout and thrombophilia testing after a first episode of DVT of the lower limbs in the years 1995-2010. Patients were asked to bring to the center the diagnostic documentation of their thrombotic episodes and underwent a clinical interview. DVT had been diagnosed by compression ultrasonography or venography. All patients underwent a complete thrombophilia screening, including measurement of natural anticoagulant proteins, genotyping of FVL and prothrombin G20210A and search for antiphospholipid autoantibodies. Coagulation factor VIII and fibrinogen coagulant activities were also measured.
A patient selection flow-chart is shown in Additional file 1 Figure S1. Patients selected for next-generation sequencing were required to have (a) history of idiopathic DVT of the lower limbs, (b) age of disease onset < 55 years, (c) wild-type FVL and prothrombin G20210A genotypes, (d) absence of natural anticoagulant deficiencies, (e) negative search for anti-phospholipid autoantibodies, (f) been born in Lombardy, the 10 million people region of Italy that has Milan as a regional capital. Since 42 patients matched these criteria, patients were further prioritized based on age of onset, familial history of venous or arterial thrombosis, completeness of clinical information, DNA amount and quality. Patients included in the replication were all the remaining 719 patients with a first episode of idiopathic lower-limb DVT and available DNA. Controls included in the study were selected from a total of 1938 healthy Italian individuals recruited among friends and non-consanguineous relatives who accompanied patients to the Hemophilia and Thrombosis Center and agreed to be tested for thrombophilia. Previous arterial or venous thrombosis in the controls was excluded using a validated questionnaire . Controls had similar age of the idiopathic DVT cases (± 5 years) and the same gender. Controls who underwent next-generation sequencing were matched with the patients for geographic provenience (born in Lombardy), in order to minimize population stratification. In case more than one control matched the same patient, control inclusion was random. The study was approved by the Institutional Review Board of the Fondazione IRCCS Ca' Granda - Ospedale Maggiore Policlinico and all subjects gave their informed consent. Patient recruitment, sampling and thrombophilia screening were carried out at the Angelo Bianchi Bonomi Hemophilia and Thrombosis Center, Milan (Italy). Next-generation DNA analysis and replication PCR and Sanger sequencing and associated analyses were carried out at the Human Genome Sequencing Center (HGSC), Baylor College of Medicine, Houston (USA).
Target area selection
The protein coding exons, 3' and 5' UTRs, and the intron-exon boundaries of 186 genes were chosen as target area. Target genes included all coagulation factor genes, anticoagulant genes and genes involved in fibrinolysis, platelet adhesion and aggregation, cell-cell interaction, endothelial activation and inflammation. The genomic coordinate intervals corresponding to the target area were obtained from the UCSC Genome Browser database and sent to NimbleGen for probe design. Probes were designed for all of the submitted intervals, with slightly different genomic coordinates (i.e. tiled regions). The final target area spanned 644, 472 bp. Probes were arrayed on Roche NimbleGen HD2 2.1 M-probe custom chips. The complete lists of target and tiled coordinate-intervals and of the target genes are in the Additional file 2 and Additional file 1 Table S1.
Two experiments were carried out. In the first experiment, genomic libraries from 4 DVT patients (DVT_P_01, DVT_P_02, DVT_P_03, DVT_P_04) and 4 controls (DVT_C_01, DVT_C_02, DVT_C_03, DVT_C_04) were captured on the same Roche NimbleGen HD2 chip and sequenced in the same ABI SOLiD 4 spot (one quarter of a slide). In the second, libraries from 8 cases (DVT_P_01, DVT_P_05, DVT_P_06, DVT_P_07, DVT_P_08, DVT_P_09, DVT_P_10, DVT_P_11) and 8 controls (DVT_C_05, DVT_C_06, DVT_C_07, DVT_C_08, DVT_C_09, DVT_C_10, DVT_C_11, DVT_C_12) were captured on one Roche NimbleGen HD2 chip and sequenced in two ABI SOLiD 4 spots. Sample DVT_P_11 failed capture during the second experiment. Patient DVT_P_01 was sequenced twice as a quality control procedure.
Genomic library preparation, barcoding and enrichment of target DNA sequences
Two micrograms of genomic DNA were used to prepare libraries of DNA fragments that were ligated with ABI SOLiD P1-and P2-adaptors . Different modified P2-adaptors, each containing a specific DNA-tag sequence (molecular barcode), were used for each individual library. Libraries with barcodes were used to prepare equimolar DNA pools of 8 and 16 samples. Four micrograms of DNA from each pool were hybridized on one Roche NimbleGen HD2 chip.
ABI SOLiD 4 sequencing
Capture products underwent emulsion PCR with P1-adaptor mediated attachment of clonally-amplified templates to loading beads. Beads were covalently attached on glass slides at the P2-adaptor extremity. Sequencing by oligonucleotide ligation and detection (SOLiD) was performed on ABI SOLiD 4 platforms .
Read barcode assignment and mapping to the reference genome
Reads with the barcodes were assigned to the corresponding sample using custom Perl scripts and mapped to reference human genome, NCBI36/hg18, using BFAST software . Reads that mapped on the same starting and end coordinates, considered likely to be PCR duplicates, were marked in the binary alignment/mapping (BAM) files, where mapping information was stored.
Genetic variant calls and quality control (QC)
Sorted BAM files were processed in a variant-calling pipeline consisting of a BAM filtering process and a variant calling process. In the first step, duplicate reads were eliminated from the BAM files, retaining only the read with the top mapping quality at each pair of start and end mapping coordinates. Also, reads with mapping quality score of less than 50 were expunged from the BAM files. In the variant-calling step, Samtools  was used to generate PILEUP files with read information at sites where mismatches from the reference sequence were detected. Consensus in the presence of mismatches, read- and base-quality parameters were used as criteria to distinguish genetic variants from sequencing errors, filtering high-quality calls. In a final QC process, this set of calls was further cleared from variants with an allele balance (variant allele reads/total number of reference plus variant allele reads) below 20% and/or with significant strand bias (i.e. sites at which the variant sequence did not appear on both forward and reverse reads) in spite of high variant quality.
Genetic variants were annotated on RefSeq database, dbSNP129 and 1000 Genomes pilot release (March 2010)  using Annovar software . Missense variants were also annotated on SIFT  and Polyphen 2 .
Several pieces of software have been developed to analyze genotyping result of commercially SNP-genotyping arrays (e.g. those used for GWAS), in particular the successful PLINK package . On the other hand, few tools are available for genetic association analysis of next-generation sequencing datasets. For this reason we developed Nxtgen2plink.rb, a software capable of generating PLINK-compatible input files with phenotypic information on sequenced individuals and genotype calls at each site that was found to be variable in at least one of the sequenced individuals. The software uses read alignment and coverage information (BAM files), genetic-variant calls (PILEUP files) and quality control information on variants with low quality (also in the PILEUP file format) to generate genotype calls across all individuals. Details of the workflow of the Nxtgen2plink.rb software are presented in Additional file 1 Figure S2. Association analysis was the carried out by PLINK and custom Perl and Python scripts.
In our dataset we chose to carry out (a) association analysis by Fisher's exact test of all identified variants that had at least a minor allele frequency (MAF) of 8% in the entire dataset of cases plus controls (i.e. the threshold MAF enabling a variant to reach a statistical significance with p < 0.05 given the sample size) and a genotype missingness of less than 10% and (b) 'collapsing' analysis comparing the total number of nonsynonymous single nucleotide variants (SNVs) in coagulation factor and anticoagulant protein genes in DVT cases and controls.
We chose to carry over to replication in up to 719 idiopathic DVT patients and 719 matched controls the top-5 variants from the single-variant association analysis. Replication was carried out by PCR and Sanger sequencing, which were chosen as replication techniques given the high-throughput capacity of HGSC, their accuracy and their potential to reveal neighboring genetic variation in linkage disequilibrium with or with similar effect to that of the variant undergoing replication. Genotype calls were performed automatically by SNP-detector  and 10% of the calls were verified manually on chromatograms. Replication comprised two stages (1) initial replication in 284 individuals (2) full replication in 1438 individuals. We selected for stage 2 variants that given their allele frequency and effect size estimated in stage 1 had 80% chances to be replicated at p < 0.005 in the entire cohort of 1438 individuals. Genetic association was carried out by PLINK. The interaction of rs6050 with FVL and prothrombin G20210A was tested using PLINK with the epistasis option. Association of rs6050 with DVT after adjustment for covariates was assessed by multivariable logistic regression using STATA 10 software.