Supplementary Methods and Results

Supplementary methods
Gene-sets used for feature construction
hi015: genes with predicted haploinsufficiency score [R1] >= 0.15 (most sensitive cutoff)
hi035: genes with predicted haploinsufficiency score [R1] >= 0.35 (best tradeoff between sensitivity and specificity)
hi055: genes with predicted haploinsufficiency score [R1] >= 0.55 (most specific cutoff)
ExpsNov_BrainFeAd_sp: genes specifically expressed in fetal or adult brain, defined as: rma expression index for the fetal or adult brain greater than the median expression for the entire data-set and greater than twice the median expression of non-brain tissue; based on the Novartis Tissue Expression Atlas (U133A Affymetrix array) [R2]
Synapse_GrantFull: full list of post-synaptic density components based on human neocortex proteomics [R3]
FMR1_Targets_Darnell: human orthologs (NCBI Homologene) of mouse genes whose mRNA translation in neurons is likely to be regulated by the FMR1 protein, based on crosslinking immunoprecipitation (HITS-CLIP) of mouse brain polyribosomal mRNAs [R4]
FMR1_Targets_Ascano: genes whose mRNA translation in neurons is likely to be regulated by the FMR1 protein, based on bioinformatics prediction supported by regulatory sequence motifs [R5]
thr4.86_log2rpkm: genes with at least 5 BrainSpan [R6] data points for which log2 (rpkm) >= 4.86, thus deemed expressed at (very) high levels in brain.
thr3.32_log2rpkm: genes with at least 5 BrainSpan [R6] data points for which 4.86 > log2 (rpkm) >= 3.32, thus deemed expressed at high/medium levels in brain.
thr0.84_log2rpkm: genes with at least 5 BrainSpan [R6] data points for which 3.32 > log2 (rpkm) >= 0.84, thus deemed expressed at medium/low levels in brain.
thr.MIN_log2rpkm: genes with BrainSpan [R6] data points failing all previous criteria, thus deemed expressed at very low level or not expressed in brain.
thrEXPR_log2rpkm: union of genes in the sets thr4.86_log2rpkm, thr3.32_log2rpkm, thr0.84_log2rpkm, thus deemed expressed in brain
PhHs_NervSys_ADX: genes implicated in human disorders with abnormality of the nervous system, autosomal dominant or X-linked mode of inheritance, downloaded from HPO (Human Phenotype Ontology) [R7] in June 2013.
PhHs_NervSys_All: genes implicated in human disorders with abnormality of the nervous system, any mode of inheritance, downloaded from HPO (Human Phenotype Ontology) [R7] in June 2013.
PhHs_MindFun_ADX: genes implicated in human disorders with abnormality of higher mental function, autosomal dominant or X-linked mode of inheritance, downloaded from HPO (Human Phenotype Ontology) [R7] in June 2013.
PhHs_MindFun_All: genes implicated in human disorders with abnormality of higher mental function, any mode of inheritance, downloaded from HPO (Human Phenotype Ontology) [R7] in June 2013.
MmHs_Neuro_All: genes whose knock out (or other genetic construct) produces a (a) nervous system or (b) behavior/neurological phenotype in mouse, downloaded from MGI (Mouse Genome Informatics) [R8] in June 2013.
MmHs_Extend_All: genes whose knock out (or other genetic construct) produces (a) embryogenesis or (b) growth/size/body or (c) craniofacial phenotype in mouse, downloaded from MGI (Mouse Genome Informatics) [R8] in June 2013.
NeuroF_large: genes in at least one of the curated Gene Ontology and pathway derived sets of neurobiological relevance
NeuroF_small: genes in at least two of the curated Gene Ontology and pathway derived sets of neurobiological relevance
The following list of Gene Ontology and pathway-derived sets of neurobiological relevance was used for the definition of NeuroF_large and NeuroF_small, as well as for the assessment of GO and pathway feature selection: GO:0007399 nervous system development, GO
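As an illustration of the BrainSpan expression tiers defined above (thr4.86_log2rpkm through thrEXPR_log2rpkm), the following Python sketch assigns genes to tiers by a literal reading of the definitions. The function name, the input layout (a dictionary mapping each gene to its log2 RPKM data points) and the decision to evaluate each tier independently are assumptions made for illustration; under this reading a gene could in principle satisfy more than one tier, and the original pipeline may have handled such cases differently.

```python
import numpy as np

# Tier boundaries (log2 RPKM) are taken from the gene-set names in the text.
TIERS = [
    ("thr4.86_log2rpkm", 4.86, np.inf),
    ("thr3.32_log2rpkm", 3.32, 4.86),
    ("thr0.84_log2rpkm", 0.84, 3.32),
]
MIN_POINTS = 5  # "at least 5 BrainSpan data points"


def expression_tiers(log2_rpkm_by_gene):
    """Return a dict of gene-sets keyed by tier name (hypothetical helper)."""
    sets = {name: set() for name, _, _ in TIERS}
    sets["thr.MIN_log2rpkm"] = set()
    for gene, values in log2_rpkm_by_gene.items():
        values = np.asarray(values, dtype=float)
        in_any_tier = False
        for name, low, high in TIERS:
            # Count data points falling in the half-open interval [low, high).
            n_points = int(np.sum((values >= low) & (values < high)))
            if n_points >= MIN_POINTS:
                sets[name].add(gene)
                in_any_tier = True
        if not in_any_tier:
            # Genes failing all previous criteria.
            sets["thr.MIN_log2rpkm"].add(gene)
    # thrEXPR is the union of the three expressed tiers.
    sets["thrEXPR_log2rpkm"] = set().union(*(sets[name] for name, _, _ in TIERS))
    return sets
```

For example, calling `expression_tiers({"A": [5.0] * 6, "D": [0.0] * 10})` places the hypothetical gene A in thr4.86_log2rpkm and gene D in thr.MIN_log2rpkm.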

Feature selection with stepwise decorrelation for GO and pathway features (CF)
1. Given the set of all features F = {f1, f2, …, fn}, where n is the total number of features
2. Calculate the Mean Decrease Accuracy for each feature and rank features in decreasing order
3. Select the top-ranked feature among those not yet selected or removed
4. Binarize features (by setting gene count values greater than 1 to 1), and calculate pairwise Jaccard similarity as J (fi, fj) = sum (fi AND fj) / sum (fi OR fj), where AND and OR are the element-wise logical operators, and logical values are expressed as TRUE = 1 and FALSE = 0 (this formulation is equivalent to the set operator based definition, but perhaps more intuitive for binary vectors)
5. Remove all features that have lower ranks than the selected feature and similarity to the selected feature above the cutoff of 0.5
6. Repeat steps 3-5 until reaching the desired number of selected features, or until no feature is left (step 4 does not need to be repeated, as Jaccard similarities do not change)
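The procedure above can be sketched in Python as follows. This is a minimal illustration, not the code used in the study: the function names, the subjects-by-features matrix layout, and the convention that two all-zero features have similarity 0 are assumptions, and the Mean Decrease Accuracy scores are taken as a precomputed input.

```python
import numpy as np


def jaccard(a, b):
    """Jaccard similarity of two binary vectors: sum(a AND b) / sum(a OR b)."""
    a = np.asarray(a, dtype=bool)
    b = np.asarray(b, dtype=bool)
    union = np.sum(a | b)
    # Assumption: two all-zero vectors are treated as having similarity 0.
    return float(np.sum(a & b) / union) if union else 0.0


def stepwise_decorrelation(X, relevance, n_select, cutoff=0.5):
    """Select features by relevance rank, dropping lower-ranked similar ones.

    X         : (subjects x features) matrix of non-negative gene counts
    relevance : one relevance score per feature (e.g. Mean Decrease Accuracy)
    """
    binary = np.asarray(X) > 0                 # step 4: counts > 1 become 1
    scores = np.asarray(relevance, dtype=float)
    remaining = [int(j) for j in np.argsort(-scores, kind="stable")]  # step 2
    selected = []
    while remaining and len(selected) < n_select:
        top = remaining.pop(0)                 # step 3: best remaining feature
        selected.append(top)
        # Step 5: remove lower-ranked features too similar to the selected one.
        remaining = [
            j for j in remaining
            if jaccard(binary[:, top], binary[:, j]) <= cutoff
        ]
    return selected
```

In this sketch the Jaccard similarities are computed on demand rather than as a full pairwise matrix, which gives the same selection while avoiding similarities that are never needed.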

MRMR Feature selection for GO and pathway features (CF)
MRMR (Minimum Redundancy Maximum Relevance Feature Selection): features were ranked using the MRMR method, an effective approach for large, noisy feature sets with a high degree of mutual redundancy, where only a small unknown subset of features is truly discriminative. Features are typically selected one at a time: the first selected feature is the one maximally relevant to the class labels, and each subsequent feature is the unselected feature displaying minimal redundancy with the set of features already selected and maximal relevance to the true class labels. In this study, we scored each feature fi using the ratio between D = I (fi, Y) and R = I (FS, fi), where I represents the mutual information function, Y represents the subject's class (ASD = 1, control = 0), and FS represents the set of selected features; therefore, D represents the relevance of the feature being evaluated with respect to the true class labels, whereas R represents the redundancy of the feature being evaluated with respect to the features already selected. It can be proven mathematically that the resulting feature set is maximally dependent on the true class labels and has a reduced correlational structure compared with the original feature set.
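A greedy ranking by the relevance-to-redundancy ratio can be sketched in Python as below. Several choices here are assumptions rather than details given in the text: features are treated as discrete variables, the redundancy with the selected set is taken as the mean pairwise mutual information with each selected feature, and a small constant eps is added to the denominator to avoid division by zero. The function names are hypothetical.

```python
import numpy as np


def mutual_information(x, y):
    """Empirical mutual information (in nats) of two discrete 1-D arrays."""
    x = np.asarray(x)
    y = np.asarray(y)
    mi = 0.0
    for xv in np.unique(x):
        px = np.mean(x == xv)
        for yv in np.unique(y):
            py = np.mean(y == yv)
            pxy = np.mean((x == xv) & (y == yv))
            if pxy > 0:
                mi += pxy * np.log(pxy / (px * py))
    return float(mi)


def mrmr_rank(X, y, n_select, eps=1e-12):
    """Greedy MRMR ranking using the quotient of relevance and redundancy."""
    X = np.asarray(X)
    n_features = X.shape[1]
    # Relevance of every feature with respect to the class labels.
    relevance = np.array(
        [mutual_information(X[:, j], y) for j in range(n_features)]
    )
    # The first selected feature is the one maximally relevant to the labels.
    selected = [int(np.argmax(relevance))]
    while len(selected) < min(n_select, n_features):
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            # Assumption: redundancy = mean pairwise MI with selected features.
            redundancy = np.mean(
                [mutual_information(X[:, j], X[:, s]) for s in selected]
            )
            score = relevance[j] / (redundancy + eps)
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

With this scoring, an exact duplicate of an already selected feature receives a low score because its redundancy is high, whereas an equally relevant feature that is independent of the selected set is ranked ahead of it.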

Feature selection for GO and pathway features (Linear SVM, Neural Network)
Features were selected using only the MRMR D / R ratio described above. For each cross-validation iteration, the number of selected features was chosen at performance saturation (i.e. the point beyond which selecting more features leads to similar or lower performance).
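One possible way to operationalize performance saturation is sketched below in Python. The function name, the dictionary input mapping each candidate number of features to its AUC, and the tolerance tol used to define "similar" performance are all assumptions for illustration; the text does not specify how saturation was determined numerically.

```python
def features_at_saturation(auc_by_k, tol=0.0):
    """Smallest feature count k that no larger k improves on by more than tol.

    auc_by_k : dict mapping number of selected features -> AUC
    tol      : hypothetical tolerance defining "similar" performance
    """
    ks = sorted(auc_by_k)
    for i, k in enumerate(ks):
        later = [auc_by_k[m] for m in ks[i + 1:]]
        # Saturated if every larger feature count gives similar or lower AUC.
        if all(auc <= auc_by_k[k] + tol for auc in later):
            return k
    return None  # only reached when auc_by_k is empty
```

A non-zero tolerance trades a small loss in AUC for a smaller feature set; with tol = 0 the function returns the smallest feature count attaining the maximum observed AUC.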

Supplementary results
Feature relevance for 20 curated features capturing brain expression, synaptic components, neuro-phenotypes
Carefully inspecting the CF feature relevance metrics, and specifically comparing the results when classifying all case subjects to those when classifying de novo or pathogenic case subjects only, we identified several meaningful patterns:
1. feature relevance is overall similar when classifying all subjects or only de novo or pathogenic CNV carriers, although with notable exceptions;
2. features based on medium size sets (750-5,000 genes), such as synaptic components (FMR1_Targets_Darnell, Synapse_GrantFull), high brain expression (thr4.86_log2rpkm), and mouse neuro-phenotypes (MmHs_Neuro_All), typically have a higher relevance score for all subjects than for de novo or pathogenic case subjects only;
3. features based on larger sets (> 5,000 genes), such as all brain expressed (thrEXPR_log2rpkm), moderate to high predicted haploinsufficiency (hi015), total count (Total), have a higher relevance score for de novo and pathogenic case subjects than for all subjects, especially for losses; this can be interpreted in relation to the larger size (and number of genes) of de novo and pathogenic CNVs;
4. haploinsufficiency features are more relevant for losses than for gains, which is expected based on the definition of haploinsufficiency as sensitivity to decreased gene product dosage;
5. features based on smaller sets (< 700-800 genes), such as human neurological phenotype genes (PhHs_...), are less relevant, probably because they account for a smaller number of subjects, and most of their genes are already present in other better ranked gene-sets;
6. the feature based on the very low or absent brain expression set (thr.MIN_log2rpkm) ranks in the bottom half for both gains and losses.

GO and pathway feature selection results
We first assessed performance using the size-filtered Gene Ontology and pathway collection, without any extra feature selection step. We assessed performance in comparison to a manually selected subset of Gene Ontology sets and pathways of neurobiological relevance, thus more likely to contribute to ASD risk; we always included the total gene count as a feature.
We found that using all size-filtered Gene Ontology sets leads to suboptimal performance: the AUC was slightly lower or within one sd unit of the AUC using the total gene count only, and also slightly lower than using the manually-selected Gene Ontology subset; in contrast, the 20 curated neurally-relevant features and total gene count achieved an AUC that is larger by several sd units. The performance for pathways was markedly worse, with the AUC very close to 0.5 even when classifying de novo or pathogenic carriers only; restricting to the manually-selected pathway subset led only to minor improvements. This trend was consistent for the different sub-groups of cases (all subjects vs de novo or pathogenic variant carriers, all variants vs loss-only or gain-only).
Three feature selection procedures were adopted and compared: (i) feature relevance (Mean Decrease Accuracy, MDA) based selection, (ii) feature relevance (Mean Decrease Accuracy) based selection with stepwise decorrelation, (iii) MRMR (Minimum Redundancy Maximum Relevance Feature Selection). For each procedure, we selected the top 20, top 15% and top 40% ranking features excluding the total gene count, and then added the total gene count.
When classifying all subjects, the best results for Gene Ontology based features were achieved by Mean Decrease Accuracy, either by taking the top 15% without decorrelation, or taking the top 20 features with decorrelation. After decorrelation, the top 15% had a lower performance, suggesting that many relevant yet highly correlated features are removed by decorrelation. The best feature selection strategy achieved a slightly better performance than the manually-selected Gene Ontology subset (1 sd unit or more), but still inferior to the 20 curated neurally-relevant features. When classifying all subjects, the best results for pathway based features were achieved by Mean Decrease Accuracy, taking the top 20 features, with performance largely independent of decorrelation; this is reasonable, considering that pathway derived gene-sets have less mutual overlap than Gene Ontology derived gene-sets, and this is reflected in feature correlation. Also for pathways, the best feature selection strategy achieved a slightly better performance than the manually-selected pathway subset, although the improvement was very modest, suggesting that pathways have limited classification power.

Results with other classifiers
Using the same cross-validation strategy as for RF and CF, and the 20 curated neurally-relevant features, both the linear SVM and NN achieved comparable or lower AUC than CF. We did not find evidence of overfitting, as the AUC obtained using the randomization of the 20 curated neurally-relevant features was similar to the AUC of the total gene count. For both the neural network and the linear SVM, GO and pathway-based features produced better performance than using CF, yet still inferior to the 20 curated features. However, we found potential evidence of some modest degree of overfitting for GO and pathway-based features, as the AUC for random features exceeded by more than one AUC absolute unit (and more than two AUC standard deviation units) the AUC for the total gene count only, and was also very close to the AUC of real Gene Ontology features.

CF robustness to parameter change
For the 20 curated neurally-relevant features, we tested different inferential statistics used by CF for tree construction and observed minor differences in performance; the default settings (Teststat = max, Testtype = Teststatistic) usually had the best performance, or performance comparable to other settings (i.e. within one sd unit). Similarly, we observed minor differences by modifying the "mincriterion" for the default inferential test statistic (default value: 0.9), which corresponds to (1 - test p-value) and needs to be satisfied by all features used for tree construction.