Discovery of stroke-related blood biomarkers from gene expression network models

Background Identifying molecular biomarkers characteristic of ischemic stroke has the potential to aid in distinguishing stroke cases from stroke mimicking symptoms, as well as advancing the understanding of the physiological changes that underlie the body’s response to stroke. This study uses machine learning-based analysis of gene co-expression to identify transcription patterns characteristic of patients with acute ischemic stroke. Methods Mutual information values for the expression levels among 13,243 quantified transcripts were computed for blood samples from 82 stroke patients and 68 controls to construct a co-expression network of genes (separately) for stroke and control samples. Page rank centrality scores were computed for every gene; a gene’s significance in the network was assessed according to the differences in their network’s pagerank centrality between stroke and control expression patterns. A hybrid genetic algorithm – support vector machine learning tool was used to classify samples based on gene centrality in order to identify an optimal set of predictor genes for stroke while minimizing the number of genes in the model. Results A predictive model with 89.6% accuracy was identified using 6 network-central and differentially expressed genes (ID3, MBTPS1, NOG, SFXN2, BMX, SLC22A1), characterized by large differences in association network connectivity between stroke and control samples. In contrast, classification models based solely on individual genes identified by significant fold-changes in expression level provided lower predictive accuracies: < 71% for any single gene, and even models with larger (10–25) numbers of gene transcript biomarkers gave lower predictive accuracies (≤ 82%) than the 6 network-based gene signature classification. miRNA:mRNA target prediction computational analysis revealed 8 differentially expressed micro-RNAs (miRNAs) that are significantly associated with at least 2 of the 6 network-central genes. Conclusions Network-based models have the potential to identify a more statistically robust pattern of gene expression typical of acute ischemic stroke and to generate hypotheses about possible interactions among functionally relevant genes, leading to the identification of more informative biomarkers. Electronic supplementary material The online version of this article (10.1186/s12920-019-0566-8) contains supplementary material, which is available to authorized users.


Background
The identification of biomarkers characteristic of acute ischemic stroke (AIS) is important both from the standpoint of basic research and for clinical practice. Distinguishing actual instances of AIS from stroke mimics is critical for the triage process in emergency medicine. In addition to providing a method for corroborating diagnosis of AIS, stroke biomarkers have the potential to serve as predictors of stroke severity and clinical outcomes when the abundance of a particular gene product has a significant association with patient outcome measures. Biomarker discovery can also provide additional insight into the basic physiology and molecular biology surrounding AIS and similar infarctions, including apoptosis of brain cells, by identifying genes in key regulatory pathways.
Several recent studies have identified molecular AIS biomarkers from blood samples. Among the biomarkers considered are the standard metabolites assayed in hospital labs, mRNA expression array, RNASeq data, miRNA data, and mass spectroscopy proteomic data [1][2][3]. One of the molecular markers most strongly associated with AIS is the tetracicopeptide repeat protein TTC7B, which is responsible for localizing the P14K kinase in the plasma membrane. Mutations in TTC7B have been found to be associated with stroke risk, and the gene appears to be downregulated in stroke patients in comparison to control samples [4]. Similarly, miRNA panel assays found higher blood let-73-5p levels associated with downregulation of CASP3 and NLK in AIS patients compared to controls [5].
The conventional approach taken in most of these studies is to compare gene expression levels (or metabolite densities) between an experimental and control group of samples, which in this context means samples taken from stroke patients vs. non-stroke patients. Gene expression comparisons are made by computing the log fold change (FC) in relative transcript densities between stroke and non-stroke in order to determine the extent to which a gene's level of expression is up or down-regulated between (for instance) stroke vs. non-stroke cases.
While simple and efficient, such approaches suffer from several drawbacks. Among these is the fact that even with p-value corrections for multiple comparisons, the number of candidate significant genes remains too high to be of practical value as biomarkers. Furthermore, the genes identified from FC often have no functional relationship with one another, nor do comparisons of gene expression levels identify associations among genes in common pathways. As a result, even if some transcriptomic analyses of stroke patients seem promising and were validated with additional qPCR experiments (e.g. [4], which identified TTC7B), the results were not reproducible in further studies and thus their predictive performance for the diagnosis of stroke is limited.
Due to the limitations in the analyses of expression data that are based solely on FC differences, a number of recent studies, e.g. [6][7][8] have applied machine learning algorithms such as support vector machines, discriminant analysis, and k-nearest neighbour (KNN) clustering to identify more statistically robust set of genetic predictors that can consistently distinguish stroke from non-stroke cases. The present study uses such methods in combination with gene expression network models as a novel approach to stroke biomarker discovery. Analyses of gene expression data make increasing use of network-based approaches that identify covarying transcription patterns among genes [9][10][11][12]. Several algorithms have been used to construct gene networks from co-expression data: usually graph edges between pairs of genes are identified from their expression covariances or mutual Shannon-Weaver information measures [12]. The significance of a gene within the network can be quantified in terms of its topological relationship to the other genespotentially indicating that a gene plays a key regulatory role in an expression pathway. Such network significance metrics are functions of the degree (number of edges) of each node and the weighted degrees of some neighborhood set. Specifically, the "centrality" of a node, computed from its own degree and a weighted count of its neighbors' (defined up to some Hamming distance) determines its topological importance in a network. A number of network measures can be used to quantify the functional importance of genes in an interaction network, including eigenvector centrality and Page centrality [13,14].
There has been limited application of network centrality measures to the identification ofg stroke biomarkers. One such study [15] used network models and PageRank centrality to identify significant miRNA-mRNA interactions in animal models of AIS and validated the significant mRNAs by fold change comparisons of human gene expression levels in stroke vs. control patients. In this study, we analyse gene expression data from the blood samples of AIS patients and control groups to construct gene expression network models. The functional importance of genes will be determined by their network centrality, the differences in gene centrality between stroke and control samples will be used to identify stroke biomarkers using machine learning approaches. The efficacy of this network-based method will be compared to the traditional approaches based on the magnitude of difference (FC) in transcript abundance between stroke and control samples.

Datasets
The Gene Expression Omnibus (GEO) was queried to obtain expression profile data in blood samples taken from stroke patients. The transcriptomics data for both AIS and control (non-stroke) samples were obtained from published stroke gene expression studies [4,16,17]. All of these data were from microarray experiments (Affymetrix whole-genome expression arrays U133 2.0) on peripheral blood samples from stroke patients and from control non-stroke patients.

Data integration
Analysis of the pooled data requires integration of the different expression array datasets from [4,16,17] into a single expression matrix with a consistent scaling of the expression levels. Dataset [4] consists of 20 stroke and control peripheral blood mononuclear cell (PBMC) samples, [16] has 39 stroke and 25 control whole blood samples, while [17] also has whole blood for 23 stroke and control samples. Because the three datasets were generated via different instrumentation and experimental setup, they were separately and differently normalized in their initial format. Therefore, we also initially re-normalized each data set separately using robust multi-array averaging (RMA [18]) and log transformation of the data. We note that the stroke samples in [16] were evaluated at three time points (within 3 h, 5, and 24 h of the stroke event); only the 3 h time point data were used for this study.
Additionally, pooling data from different studies meant combining expression arrays derived from whole blood samples vs. those from PBMC. Because PBMC excludes non-nucleate cells, one may expect somewhat different mRNA profiles with respect to whole blood due to differences in cell/tissue type. Consequently, it was necessary to determine whether pooling PBMC and whole blood data may introduce artifacts due to mRNA profile differences between the different types of blood samples. This was achieved by comparing the FC per-gene in the whole blood only datasets to the pooled blood/PBMC data using Spearman rank-based correlation analyses.
The data sets were merged by performing a second layer of joint normalization similar to the standard approaches used for qPCR data [19]. Initially, we selected 8 commonly used housekeeping genes (ACTB, B2M, HMBS, HPRT1, RPL13A, SDHA, TBA, YWHAZ) that were expressed at comparable mean levels in both treatment (stroke) and control groups and followed the procedures outlined in [19] to identify the minimal subset of genes that show the most among-experiment variability to use as normalizers. Because all of these genes showed relatively high FC across samples (Log 2 FC prior to median per-sample normalization was > 0.3 for all the examined housekeeping genes), normalization was performed based on median per-sample expression level rather than rescaling with respect to expression levels of housekeeping genes.
The pooled data were filtered so that only genes with less than 10% missing expression values would be retained for further analysis. For the remaining missing values, the KNN-Impute method [20] was applied to properly impute the missing data with K = 20. For the final stage of quality control, outliers were identified with a method based on principal components analysis (PCA)retaining the principal components that accounted for 90% of covariation, and then applying the local outlier factor (LOF) approach [21] to cluster samples and detect outliers as the unclustered samples. This analysis indicated that less than 5% of the data were marked as outliers, thereby passing a predefined threshold of fewer than 10% outliers for a dataset to be considered valid for further analysis.

Analysis of differential expression
Analyses of differential expression (fold-change in gene expression levels between stroke and non-stroke samples) were performed using InSyBio's implementation of limma (linear model for microarray) using the Ebayes algorithm [22] to evaluate linear model fit considering the sample type (PBMC/whole blood) as a covariate. Moderated t-statistics (t-scores where the standard errors are reduced to a common value across probes) were used to determine the p-values associated with the log foldchange (log FC), i.e. log 2 [X S /X C ], where X S and X C , are, respectively, the mean gene expression levels in the stroke and control samples. The Benjamini-Hochberg false discovery rate (FDR) correction [23] was performed to adjust the p-values for multiple comparison. Statistically significant log FC values were defined as those with FDR-adjusted p < 0.05 of the moderated t-statistic.
Differences in instrumentation across samples were considered a potential source of bias, and was removed by specifying this variation as a covariate in the Ebayes algorithm to account for heterogeneity between the combined expression arrays. Genes with statistically significant log FC values between stroke and control samples are retained as input for prediction models. The prediction models developed from this gene set were used for comparison with prediction models using gene sets derived from the network-based methods of identifying significant genes, as outlined below.

Network-based biomarker characterization
In order to identify biomarkers that are both statistically and potentially functionally significant, the InSyBio Bionets network-based approach was applied to the expression data. The network analyses leverage mutual covariance in expression levels to characterize statistical association of transcription patterns among genes. Following the mutual information-based approach described in [24], a correlation network was constructed for both control and stroke sample expression data. For every gene pair, the mutual information value I between the expression levels of two genes x,y is computed, where for the probabilities (frequencies) of observed states i,j ∈ x,y and joint frequencies p ij A first cut-off value of I < 0.2 is used to eliminate uncorrelated data. The statistical significance of covariance in expression between the remaining gene pairs was determined based on 95% confidence intervals in a bivariate normal distribution, with the final significance threshold that was used for most gene pairs typically corresponded to mutual information values of I = 0.7-1.0. As outlined in [25], weighted edges are assigned to pairs of nodes with significant mutual information.
The centrality of each node (gene) in the graph was determined using the PageRank algorithm [13,14] a modified eigenvector centrality algorithm best-known for its use by Google to identify the most relevant matches to search query terms. Centrality scores were assigned to genes in both the stroke and control data sets. The mean log 2 FC in centrality scores between stroke and control were compared using t-tests (following Spiro-Wilcoxon tests for normality) where statistically significant nodes were defined as those with FDR-adjusted p < 0.05. We refer to these as network significant genes, to distinguish this set of genes from those identified by expression level FC significance.
The Gene Ontology (GO) resource [26] and DAVID [27,28] annotation tools were used to characterize the functional roles and other shared features of centrally significant genes using enrichment analysis of the gene set with respect to functional roles and pathways. In addition, the Hamming distance d = 1 neighbor set of each network-central gene was compared to the set of genes with known or predicted interactions and functional associations with the network-central genes according to the STRING database tool [29]. Associations of genes in the neighborhood sets with hereditary diseases was assessed using the DisGeNet [30] tool, which assigns an association score to genes whose mutational variants are linked to known diseases in the biomedical literature.

Characterization of predictive accuracy
The set of genes with significantly different network centrality values was used to develop machine learning models that optimize the predictive accuracy of stroke vs. non-stroke classification using a minimal number of genes. Generating a manageable and robust model for stroke prediction/classification requires optimization with respect to two criteria: the first is the predictive accuracy, i.e. the frequency with which samples are correctly classified as coming from AIS vs. control based on gene expression), and second, simultaneously attempting to minimize the number of genes that are used to generate the predictive model. This algorithm was applied to both the set of network-significant genes and the FC-significant genes in order to compare the relative predictive value of genes identified from networks vs. individual expression levels.
In the optimization process, a multi-objective genetic algorithm (GA) is used to identify the optimal feature set input from a population of float vector solutions (subset of predictor genes and model parameters). The float vector is initialized with a small number of genes, and in each subsequent generation the feature set is added to or subtracted from sequentially via mutation and recombination operators. Replication of a feature set proportional to the fitness of a predictive model. In the multi-objective optimization technique used for this task the overall fitness is calculated using a combination of the following independent fitness functions: The fitness functions 1 and 4 were used to promote solutions which lead to the simplest, most general possible models. The other fitness functions were used to achieve accurate classification performance, dealing effectively with the imbalanced nature of the dataset. Specifically, a weighted sum of the independent fitness functions was used to calculate the overall fitness of a solution using the following weights: Fitness function 1: 1, Fitness function 2: 5, Fitness Function 3: 5, Fitness Function 4: 5. These weights were selected in order to provide the same (high) emphasis in classification metrics, while considering that the simplicity of the model is of less importance for these problems.
A model's efficacy was assessed by computing its Predictive Accuracy = (TP + TN)/(TP + TN + FP + FN) with TP: True Positives, TN: True Negatives, FP: False positives, FN: False Negatives. The accuracy of each model was calculated via 5-fold cross-validation of the dataset. For every iteration, 80% of the data (both stroke and control) were used as a training set while the frequency at which the remaining 20% are correctly classified defined the predictive accuracy.

Differential expression of micro-RNAs
Recent studies have indicated that several miRNAs are associated with stroke-related cellular mechanisms [31,32]. In order to identify miRNAs that may be acting as regulators of genes that are stroke-predictive, we utilized the InSyBio ncRNAseq tool [33] to identify all miRNAs that can potentially target the six transcripts of interest. For this process we used the default threshold (0.3) as suggested by [34] for the predicted confidence score for each miRNA:mRNA examined found 907 miRNAs. This predicted confidence score is estimated from an epsilon-SVR regression model using 124 sequential, thermodynamic and structural features of all potential miRNA:mRNA target sites and initially trained with a balanced positive and negative miRNA:mRNA pairs. However, not all these microRNAs are expressed across multiple tissue types, therefore, most of them are not identifiable in blood samples while others are not differentiated in stroke patients. These 907 miRNAs were further filtered using the high-throughput transcriptomics data from [35].
Significant differential expression in miRNA between stroke and controls was analysed using this data. The analyses were similar to the workflow for mRNA described in subsection C above, i.e. log FC was computed for miRNA densities in stroke vs. control samples, and statistically significant magnitudes of FC expression level were identified via moderated t-statistics and FDR corrections.

Normalization and processed dataset
Following normalization, imputing of missing data, and outlier removal, the total merged dataset consists of 137 samples (82 stroke patients and 55 control subjects), with a total of 13,243 quantified transcripts. The validity of merging PBMC with whole blood datasets was confirmed by analyses of differential expression in the combined vs. whole blood-only samples. Additional file 1: shows a volcano plot of FC vs. p-value for the blood only data (compare to Fig. 1 for the pooled data), while the second figure in the supplement shows a scatterplot of FC between blood-only and pooled expression arrays. A Spearman rank-based correlation value of 0.99 and p < < 0.001 confirms the consistency of blood-only and pooled data doesn't introduce artifacts associated with disparate gene expression profiles. Pooling PBMC and whole blood samples does slightly increase the variance and range in gene expression, which accounts for the fact Fig. 1 Volcano Plot of Differential Expression Analysis. The genes with the largest and most statistically significant absolute fold-change between treatment and control are those at the upper left and right corners that 584 significantly differentially expressed genes were found in the whole blood samples vs. 557 in the combined data set.

Differential expression analysis
A total of 557 genes are significantly differentially expressed between stroke and non-stroke samples (see Additional file 2: for a complete list). The distribution of log FC and p-values for the entire set of genes is shown as a volcano plot in Fig. 1, and Table 1 summarizes the log FC data and statistics for the 10 genes with the most statistically significant (smallest FDR-adjusted p-value based on t-statistics) differential expression between stroke and control samples. The median values in expression level between the stroke and control samples are strongly separated for these 10 genes, as shown in the Fig. 2 boxplot panels.

Network-based biomarkers
Using the criteria of differences in PageRank centrality values between stroke and normal samples, 47 potentially significant differentially expressed genes were identified. The rank list of genes in the differential expression analysis is summarized in the Additional file 3.
Both the stroke and control networks have high connectivity, with statistically significant edges linking most groups of genes in such a way that there are few disjoint sets and the path distance between randomly selected genes is short. This can be seen from the distribution of node (gene) degrees over the entire gene set is summarized in Fig. 3, which compares connectivity within the stroke and control networks to a power law distribution. Due to a larger number of degree 1-2 nodes in comparison to intermediate or high degree nodes, the fit of the observed node degree distribution in the gene networks to a power law distribution appears poor. However, a Spearman rank correlation analysis comparing predicted to observed degrees give correlation coefficients ρ = 0.596 and 0.612 for stroke and control, respectively, with p < 0.001, indicates a statistically significant concordance between the observed distribution and a power law. This pattern is consistent with "small world" phenomena [36], where every node connects to most other nodes through a comparatively short path. This high connectivity indicates at least an indirect statistical interaction between co-expression levels of most genes, especially in the stroke samples. Specifically, the expected number of edges connecting a random gene pair increases logarithmically with the number of genes, i.e. at a less than linear rate.

Predictive analytics
Single-gene expression levels give a maximum predictive accuracy of 70%, this can be seen from the plot of individual predictive accuracy among the 10 most strongly differentially expressed genes in Fig. 4 (with DNA oxidative demethylase ALKBH2 and lactadherin MFGE8 giving the highest predictive accuracies of 0.70). Furthermore, when the 10 most highly dysregulated genes are used jointly in machine learning models, the predictive accuracy remains at or below 75.1%. The same is true if a larger set of highly dysregulated genes is selected using the same optimization criteria (objective function) as for selection of the network gene set, i.e. a model with 26 genes selected by the GA-SVM optimization algorithm only increases the predictive accuracy to 81.2% (see Table 2). A model using the 10 genes identified by O'Connell et al. [8] gave predictive accuracy of 82.1% with cross-validation sampling, as shown in Table 2.
Application of feature selection to the significant network-based biomarkers led to the identification of 6 genes whose expression values gave higher joint predictive accuracy than any combination of genes from the differential expression analysis (Table 2). In what follows, we will refer to this set as "network-central predictors," in reference to the fact that they were identified by significant differences in network centrality between stroke and control expression arrays. As seen in Table 2, expression values from the genes ID3, MBTPS1, NOG, SFXN2, BMX, and SLC22A1 can jointly distinguish stroke from non-stroke samples with a high level of accuracy: 89.6% of samples are correctly classified in cross-validation sampling. As a qualitative comparison, a predictive model based on the 10 genes in [8] had 82.07% accuracy when trained with the dataset of the current study and evaluated with the same cross-validating sampling strategy. While most of the 6 network central genes also have significantly different expression levels between stroke and control, with the exception of ID3, they are not necessarily among the set of most highly dysregulated genes. Indeed, the differences in expression level for SLC22A11 between stroke and control are not even statistically significant. The mRNAs of ID3, MBTPS1, NOG, and These network-central genes have intersecting sets of Hamming distance 1,2 neighbors. Indeed some gene pairs are within mutual d = 1 neighborhoods (e.g. SLC22A1 is a d = 1 neighbor to SFXN2, BMX, NOG, and MBTPS1, and all of the gene pairs except those with ID3 are within mutual Hamming distance d ≤ 2 of one another). Consequently, none of these neighborhoods are mutually disjoint, so that no more than two significant associations (edges) separate any pair from the 6 genes. This can be seen in Fig. 5a-b, which respectively show the d ≤ 2 neighborhoods of the 6 network-central genes in the control and stroke networks (note that these graphs are somewhat "truncated" for readability by showing only the edges with I ≥ 0.75). This high connectivity is consistent with the distribution of node degrees and fit to power law distributions shown in Fig. 3, and with the fact that these genes have physiologically related functions associated with secretion and cell signalling.
However, there are significant differences in connectivity between the stroke and control networks. For example, Fig. 5b shows a single I ≥ 0.75 connection for ID3, vs. 3 in the control network. If all d = 1 neighbors are considered without the truncation (see Additional file 4), there are actually 7 neighbors in the stroke network and 9 in the Fig. 2 Boxplots showing the range in gene expression for the 10 most statistically significant differentially expressed genes, illustrating the extent of FC and the separation of median expression levels between stroke and control samples control, indicating a small decrease in connectivitynone of the genes in the two networks are shared, which accounts for the statistically significant changes in network centrality. In contrast, the connectivity of the other 5 genes is higher in the stroke vs. None of the genes in the d = 1 neighborhoods of the network-central gene set are identified as neighbors in STRING db. However, in the case of SFXNL, STRING db identifies functionally similar genes to those in the neighborhood set of our networks. Specifically, STRING db lists solute carriers SLC25A19, 25A1 and heat shock protein HSPA14 as interacting with SFXNL. The d = 1 neighborhoods for SFXNL in our analyses include the solute carriers SLC22A1, SLC41A1 and the heat-shock protein HSPH1.

Functional annotation
The six network-central predictor genes are functionally disparate, however, mosts of them are.involved in secretory and signalling pathways. The inhibitor of DNA binding ID3 interferes with the binding of helix-loophelix proteins; the protein noggin (NOG) binds and inactivates TGF-beta growth factors; membrane bound transcription factor site-1 protease MBTPS1 processes proteins through secretory pathways; cytoplasmic tyrosine protein kinase BMX is a receptor molecular involved in several transduction pathways, solute carrier SLCA11 is a voltage-gated transporter, and sideroflexin SFXN2 regulates cation transport.  Table 3, while Additional file 5: contain the complete DAVID enrichment analysis tables.
For the 47 genes with significant differences in their network centrality between stroke and non-stroke, there is significant enrichment of nucleoplasm proteins (number of genes n g = 15, OR = 2.18). There are high odds ratios associated with enrichment of other functional and structural classes of genes, e.g. U-box domain genes (n g = 2, OR = 121.96), but these are not statistically significant.
The relatively small neighborhood of ID3 (7 genes with shared edges + ID3 itself ) contain 2 genes involved in repression of apoptosis in bone marrow (LEF1, FLT3LG, with OR = 25) as well as enrichment for transcription factor binding genes (LEF1, ID3, TRIB2, OR = 37.5). For MBTPS1, the strongest enrichments are seen for zinc finger proteins (n g = 23, OR = 89.23) and for ligases (5, n g = 5.60). The NOG neighborhood is enriched for 5 genes (OR = 6.89) in the endoplasmic reticulum, 2 genes of the microvillus assembly (OR = 100.25, and 10 genes whose proteins bind poly(A) RNA (OR = 2.27). Of the 65 genes in the SFXN2 d = 1 neighborhood, 39 are phosphoprotein (OR = 1.52), 13 are mitochondrial (OR = 2.83), and 19 are involved in acetylation (OR = 1.78). In the degree 1 neighborhood of BMX, the strongest enrichments are for nucleotide and ATP-binding (14 and 11 genes with OR = 2.60, 2.63, respectively) and 6 genes for nuclear localization (OR = 5.64). SLC22A1, being within d = 1 of SFXN2, also has a neighborhood enriched Fig. 4 Predictive accuracies of the 10 genes with the strongest differential expression (see Table 1) between stroke and control, illustrating how no gene provides a predictive accuracy individually exceeding 70% Table 2 Comparison of gene set and predictive accuracy based on log FC expression level significance vs the network-based model. The first row is for predictive models using single differentially expressed genes; the second is based on the 10 most significant FC values (see Table 1); the third uses the hybrid machine learning algorithm to identify a set of predictor genes from the 557 signficant FC genes; the fourth uses 10 genes identified by O'Connell et al's predictive model; the last row uses the 6 networkcentral genes in the prediction model

82.07%
InSyBio predictive analytics using network significant gene set 6 (ID3, MBTPS1, NOG, SFXN2, BMX, SLC22A1) 89.57% in mitochondrial genes (n g = 11 of the 65 neighbors + SLC22A1, OR = 3.07), as well as nucleolar (n g = 8, OR = 2.74 and ATP-binding (n g = 10, OR = 2.24). A DisGeNet [28] search of the 6 network-central genes identified several diseases associated with their mutational variants or dysregulation in the literature. Figure 6 summarizes the diseases with a threshold association score > 0.001 for each gene (see Additional file 6: for a complete list); only SFXN2 had not been previously significantly associated with any disease in the database While the other genes have not been previously linked to AIS, they have been linked with diseases and comorbidities such as diabetes (ID3, SLC22A1), obesity (SLC22A1), and myocardial ischemia (ID3).

Stroke related miRNAs
A set of 115 miRNAs were found to be significantly differentially expressed between stroke and controls. Following the preliminary filtering based on confidence scores, this initial set of miRNAs was reduced to 27 (Table 4, see Additional file 7: for a complete list of associations between miRNA and the network-central 6gene set). Among this set of miRNAs, eight (hsa-miR-1181, hsa -miR-1207-3p, hsa -miR-1246, hsa -miR-3180, hsa -miR-3960, hsa -miR-4436a, hsa -miR-517a-3p, hsa -miR-517a-3p) were determined to target two or more transcripts from the set of 6 network-central predictor genes, with the exception of SFXN2. The increased correlation between the stroke related miRNAs identifiable in blood and the set of predictor genes provides an additional validation of the statistical and potentially functional significance of these genes as stroke biomarkers.

Discussion
Previous studies have used gene sets identified from FC significance to achieve similar predictive accuracies stroke vs. controls samples via machine learning models. For example, [6] identified a set of 29 genes that provided 93.5% sensitivity and 89.5% specificity in distinguishing control vs. stroke, while [7,8] used expression levels from a panel of 10 genes to achieve 95.6% predictive accuracy. None of the 29 genes in [6] are in our set of 6 network central genes, while of the 10 in [7,8], ID3 is the only shared gene.
The predictive accuracies in these studies vs. our results are not directly comparable, insofar as we use a different (merged) dataset for model building. Nevertheless, the regression analyses summarized in S1 indicate that differences in gene expression level between whole blood and PBMC are sufficiently similar to make at least qualitative comparisons of predictive accuracies across these sample types and combinations thereof. To further address the contraints of comparison across sample types, the prediction model suggested in [8] was retrained using the dataset of the current study and evaluated using the same cross validation setup used for the evaluation of the network-central gene models. When gene sets identified in [7,8] are used as model predictors with the merged training data, our network-based models perform favorably in comparison. Another potentially significant caveat to the current study is the fact that due to limited data, the examined predictive models and biosignatures used only cross-validation analysis, as opposed to an independent external dataset for validation.
As additional gene expression data in stroke patients become available, the prediction model based on network-central gene biosignatures should be subjected to further external validation with additional, independent data. Apart from potentially achieving similar levels of predictive accuracy with a smaller array of genes, the principal advantage of the network-based approach lies in providing greater information content through the identification of genes with co-expression patterns correlated with those in the prediction model, i.e. the d = 1, 2 network neighbors. Network models leverage joint information about pairwise co-expression patterns, and so can provide additional insights into the functions and pathways of genes underlying a disease or condition. Such statistical associations generate hypotheses for future experimental confirmation of the significance of the identified network-central predictor genes and their neighbors to stroke physiology, and consequently, their potential to serve as biomarkers in clinical assays.
There are a number of possible biological explanations for the observed changes in network connectivity and gene centrality between stroke and control samples. The identification of 6 network-central genes whose expression levels are strongly characteristic of acute ischemic stroke is thought to be a consequence of specific, correlated patterns of gene expression linked to stroke and associated reactions to anoxia, inflammation, and cell death in both the brain and in blood vessels. Such correlated patterns of gene activity are particularly evident in the increased number of interactions (or at least statistically significant associations) between control and stroke for NOG, BMX, SFXN2 and SLC22A12 vs. the decrease in association with ID3 and MBTPS1, as evidenced by the relative sizes of d = 1 neighborhoods.
While differential gene expression is necessary for the identification of gene centrality in a network, it is not necessarily the case that the most strongly dysregulated genes are associated with the overall biological changes driven by a particular disease or condition. The fact that none of the genes with the strongest differences in network centrality (with the exception of ID3) are characterized by with largest FC further illustrates the efficacy and power of network-based approaches to the analysis of gene expression. It is also noteworthy that significantly differentially expressed genes from independent studies such as TTC7B [4] were not found to be significant in the integrated dataset, either in the analyses of log FC differences or in the network models. This could indicate that the significant FC observed for this gene in individual studies are due to particular characteristics of that experimental design, or due to the gene expression outcomes specific to the therapy/drug treatments received by a specific cohort of patients.
A large difference in centrality score between stroke vs. non-stroke samples may be indicative of gene regulatory pathways that are disrupted (either differentially activated or repressed) during a stroke even in the absence of large changes in the expression levels of individual genes. For example, the existence of less highly connected neighborhoods associated with specific genes Fig. 6 Association of 5network-central predictor genes (ID3, MBTPS1, NOG, BMX, SLC22A1) with known diseases according to the DisGeNet database using the default threshold of 0.001 (no such associations were identified for SFXN2) may provide insight into gene expression "modules" associated with specific pathways. This modularity is disrupted during stroke events (and potentially other pathological states) through the up or down-regulation of genes that link the once functionally and statistically separated pathways.
As a potential limitation to these interpretations, it is noted that some of the changes in mRNA density associated with stroke may be a "passive" result of cell death rather than due to stroke-specific changes in gene expression patterns. The genes associated with these mRNAs are less likely to be part of regulatory or signalling cascades unique to stroke, and as such of limited interest as possible therapeutic targets. This aspect is particularly relevant in the case of circulating miRNAs. miRNA/mRNA interactions occur in the cytoplasm, so changes in the concentrations of circulating miRNAs may be the result of differential expression of miRNA in cells that die and lyse.
Similar considerations apply to the differences in mRNA densities that are a consequence of changes in relative leukocyte densities rather than differential expression. Several recent papers [37][38][39] have shown that the density of lymphoid cells in whole blood decreases following stroke, while the relative abundance of neutrophils and other myeloid cells increases. Consequently, at least part of the changes to mRNA densities in whole blood reflect changes in relative leukocyte abundance rather than changes in transcription patterns as such.