The architecture of Cox-PASNet
Cox-PASNet consists of: (1) a gene layer, (2) a pathway layer, (3) multiple hidden layers, (4) a clinical layer, and (5) a Cox layer (see Fig. 6). Cox-PASNet requires two types of data from the same patients: gene expression data, which are introduced to the gene layer, and clinical data, which are introduced to the clinical layer. The pipelines of the two data types are merged in the last hidden layer, which produces a Prognostic Index (PI) that serves as the input to Cox proportional hazards regression. In this study, we included only age as clinical data, so the clinical layer is connected to the last hidden layer directly, without any additional hidden layers. Higher-dimensional clinical data could instead be integrated through hidden layers in the clinical pipeline.
Gene layer
The gene layer is the input layer of Cox-PASNet, introducing zero-mean gene expression data (X) of n patient samples with p genes each, i.e., X={x1,...,xp} and \(\mathbf{x}_{i} \sim \mathcal{N}(0, 1)\). For pathway-based analysis, only genes that belong to at least one pathway are considered in the gene layer.
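As a minimal sketch of how the zero-mean, unit-variance input described above could be prepared (the function name `standardize_genes` is our own illustration, not from the paper):

```python
import numpy as np

def standardize_genes(X):
    """Z-score each gene (column) so that x_i is approximately N(0, 1)."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant genes
    return (X - mu) / sigma
```

Here rows are patient samples and columns are genes, matching the n × p layout of X.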
Pathway layer
The pathway layer represents biological pathways, where each node explicitly corresponds to a specific biological pathway. The pathway layer incorporates prior biological knowledge so that the neural network of Cox-PASNet is biologically interpretable. Pathway databases (e.g., KEGG and Reactome) specify the set of genes involved in each pathway, and each pathway characterizes a biological process. This known association between genes and pathways forms sparse connections between the gene layer and the pathway layer in Cox-PASNet, rather than fully connecting the layers. The node values in the pathway layer measure the corresponding pathways as high-level representations for the survival model.
To implement the sparse connections between the gene and pathway layers, we consider a binary bi-adjacency matrix. Given pathway databases containing pairs of p genes and q pathways, the binary bi-adjacency matrix (\(\mathbf {A} \in \mathbb {B}^{q \times p}\)) is constructed, where an element aij is one if gene j belongs to pathway i and zero otherwise, i.e., A={aij|1≤i≤q,1≤j≤p} and aij∈{0,1}.
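The bi-adjacency matrix could be built from database gene–pathway memberships as follows (a sketch with hypothetical gene and pathway names; `build_biadjacency` is our own naming):

```python
import numpy as np

def build_biadjacency(pathways, genes):
    """Build the binary bi-adjacency matrix A (q pathways x p genes),
    where A[i, j] = 1 iff gene j belongs to pathway i."""
    gene_index = {g: j for j, g in enumerate(genes)}
    A = np.zeros((len(pathways), len(genes)), dtype=np.int8)
    for i, members in enumerate(pathways.values()):
        for g in members:
            if g in gene_index:  # skip genes absent from the gene layer
                A[i, gene_index[g]] = 1
    return A

genes = ["TP53", "EGFR", "KRAS"]
pathways = {"p53_signaling": ["TP53"], "MAPK": ["EGFR", "KRAS", "TP53"]}
A = build_biadjacency(pathways, genes)  # shape (2, 3)
```

A then serves directly as the fixed mask on the weights between the gene layer and the pathway layer.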
Hidden layers
The hidden layers depict the nonlinear and hierarchical effects of pathways. Node values in the pathway layer indicate the active/inactive status of a single pathway in a biological system, whereas the hidden layers show the interactive effects of multiple pathways. Deeper hidden layers express higher-level representations of biological pathways. The connections in the hidden layers are sparsely established by sparse coding, so that the model remains interpretable.
Clinical layer
The clinical layer introduces clinical data to the model separately from genomic data in order to capture clinical effects. The separate pipeline for clinical data also prevents the relatively high-dimensional genomic data from dominating the model. In Cox-PASNet, the complex genomic effects of gene expression data are captured from the gene layer through the hidden layers, whereas the clinical data are introduced directly into the output layer, together with the highest-level representation of the genomic data (i.e., the node values of the last hidden layer). Therefore, Cox-PASNet accounts for the effects of genomic and clinical data separately in the neural network model. If richer clinical information is available, multiple hidden layers in the clinical pipeline can be considered.
Cox layer
The Cox layer is the output layer, which has only one node. The node value produces a linear predictor, also known as the Prognostic Index (PI), from both the genomic and clinical data, which is introduced to a Cox-PH model. Note that, by the design of the Cox model, the Cox layer has no bias node.
Furthermore, we introduce sparse coding, so that the model is biologically interpretable and the overfitting problem is mitigated. In a biological system, only a few biological components participate in any given biological process. Sparse coding enables the model to include only the significant components, yielding better biological interpretation. Sparse coding is applied to the connections from the gene layer to the last hidden layer by mask matrices. It also makes the model much simpler, with many fewer parameters, which relieves the overfitting problem.
Objective function
Cox-PASNet optimizes the parameters of the model, Θ={β,W}, by minimizing the average negative log partial likelihood with L2 regularization, where β are the Cox proportional hazards coefficients (the weights between the last hidden layer and the Cox layer) and W is the union of the weight matrices of the layers before the Cox layer. The objective function of the average negative log partial likelihood is defined as follows:
$$ \ell(\boldsymbol{\Theta}) = -\frac{1}{n_{E}}\sum_{i \in E}\left(\mathbf{h}_{i}^{I}\boldsymbol{\beta} - \log\sum_{j \in R(T_{i})}\exp\left(\mathbf{h}_{j}^{I}\boldsymbol{\beta}\right)\right) + \lambda\|\boldsymbol{\Theta}\|_{2}, $$
(1)
where hI is the layer that combines the second hidden layer’s outputs and the clinical inputs from the clinical layer; E is the set of uncensored samples; and nE is the total number of uncensored samples. R(Ti)={j|Tj≥Ti} is the set of samples at risk of failure at time Ti; ∥Θ∥2 is the L2-norm of {W,β} together; and λ is a regularization hyper-parameter controlling sensitivity (λ>0).
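Eq. (1) can be sketched directly in numpy. This is our own illustrative implementation (the function name and the precomputed-PI interface are assumptions; ties in event times are handled implicitly via the risk-set definition):

```python
import numpy as np

def neg_log_partial_likelihood(pi, times, events, theta_l2=0.0, lam=0.0):
    """Average negative log partial likelihood of Eq. (1).

    pi     : (n,) prognostic indices, i.e. h_i^I @ beta
    times  : (n,) observed times T_i
    events : (n,) 1 if the event was observed (uncensored), 0 if censored
    """
    loss = 0.0
    n_e = events.sum()
    for i in np.flatnonzero(events):          # sum over i in E
        risk_set = times >= times[i]          # R(T_i) = {j : T_j >= T_i}
        log_sum = np.log(np.sum(np.exp(pi[risk_set])))
        loss -= pi[i] - log_sum
    return loss / n_e + lam * theta_l2        # theta_l2 = ||Theta||_2
```

Note the loss depends on all samples jointly through the risk sets, so it cannot be decomposed per-sample as in ordinary regression losses.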
We optimize the model by iteratively training small sub-networks with sparse coding. Training a small sub-network keeps optimization feasible, with only a small set of parameters in each epoch. The overall training flow of Cox-PASNet is illustrated in Fig. 7.
Initially, we assume that all layers are fully connected except between the gene layer and the pathway layer. The initial weights and biases are randomly initialized. For the connections between the gene layer and the pathway layer, sparse connections are enforced by the bi-adjacency matrix, a mask matrix that encodes the gene memberships of pathways. A small sub-network is randomly chosen by a dropout technique in the hidden layers, excluding the Cox layer (Fig. 7a). Then the weights and biases of the sub-network are optimized by backpropagation. Once training of the sub-network is complete, sparse coding is applied to it by trimming the connections within the small network that do not contribute to minimizing the loss. Figure 7b illustrates the sparse connections; the nodes dropped by sparse coding are marked with bold and dashed lines. The algorithm of Cox-PASNet is briefly described in Algorithm 1.
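The two per-epoch steps above, dropout-based sub-network selection and connection trimming, could be sketched as follows (the function names and the `keep_prob` parameter are our own; the actual backpropagation step between them is omitted):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_subnetwork(layer_sizes, keep_prob=0.5):
    """Randomly keep a fraction of nodes per hidden layer (dropout),
    defining the small sub-network trained in this epoch."""
    return [rng.random(n) < keep_prob for n in layer_sizes]

def trim_connections(W, threshold):
    """After the sub-network is trained, trim (zero out) connections
    whose absolute weight falls below the sparsity threshold."""
    M = (np.abs(W) >= threshold).astype(W.dtype)  # updated mask
    return W * M, M
```

Trimming is what carries sparsity from one epoch to the next: the returned mask M is reused in the masked forward pass of subsequent epochs.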
Sparse coding
Sparse coding is proposed to make the connections between layers sparse for model interpretation. It is implemented by a mask matrix at each layer of the model. A binary mask matrix M determines the sparse connections of the network, where each element indicates whether the corresponding weight is zero or not. The outputs, h(ℓ), of the ℓ-th layer are then computed by:
$$ \mathbf{h}^{(\ell +1)} = a\left((\mathbf{W}^{(\ell)}\star\mathbf{M}^{(\ell)})\mathbf{h}^{(\ell)}+\mathbf{b}^{(\ell)}\right), $$
(2)
where ⋆ denotes element-wise multiplication; a(·) is a nonlinear activation function (e.g., sigmoid or Tanh); and W(ℓ) and b(ℓ) are the weight matrix and bias vector, respectively (1≤ℓ≤L−2, where L is the number of layers).
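The masked forward step of Eq. (2) is a one-liner in numpy (the function name is our own; tanh stands in for the activation a(·)):

```python
import numpy as np

def masked_forward(h, W, M, b, activation=np.tanh):
    """One masked layer of Eq. (2): h_{l+1} = a((W * M) h_l + b)."""
    return activation((W * M) @ h + b)
```

Because the mask multiplies the weights before the matrix–vector product, a zero mask entry removes the corresponding connection from both the forward pass and, in a framework with autodiff, its gradient.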
In particular, an element of the binary mask matrix M is set to one if the absolute value of the corresponding weight is greater than threshold s(ℓ); otherwise it is zero. The mask matrix between the gene layer and pathway layer (M(0)) is given from pathway databases, whereas other mask matrices (M(ℓ),ℓ≠0) are determined by:
$$ \mathbf{M}^{(\ell)}=\mathbbm{1}\left(|\mathbf{W}^{(\ell)}| \geq s^{(\ell)}\right), \quad \ell \neq 0, $$
(3)
where s(ℓ) is the optimal sparsity level, and the function 𝟙(x) returns one if x is true and zero otherwise. The optimal s(ℓ) is heuristically estimated on each layer of the sub-network to minimize the cost function. In this study, we considered a finite set of sparsity levels in the range s=[0,100) and computed the corresponding cost scores. Note that a sparsity level of zero produces a fully-connected layer, whereas a level of 100 disconnects the layers. We then approximated the cost function with respect to the sparsity level by applying cubic-spline interpolation to the cost scores computed at the finite set of levels, and took the level minimizing the interpolated cost as the optimal sparsity level. The optimal s(ℓ) is approximated individually on each layer of the sub-network. Optimizing the sparsity of each layer individually reflects the varying levels of biological association among genes and pathways.
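This per-layer search could be sketched as follows, assuming the sparsity level s is interpreted as the percentile of weights trimmed (the function name, the grid of levels, and the percentile interpretation are our assumptions; `scipy` provides the cubic spline):

```python
import numpy as np
from scipy.interpolate import CubicSpline

def optimal_sparsity(W, cost_fn, levels=np.arange(0, 100, 5)):
    """Heuristic per-layer search of Eq. (3): evaluate the cost at a finite
    grid of sparsity levels, fit a cubic spline to the cost scores, and
    return the level minimizing the interpolated cost."""
    costs = []
    for s in levels:
        thresh = np.percentile(np.abs(W), s)   # s% of weights fall below
        M = (np.abs(W) >= thresh).astype(W.dtype)
        costs.append(cost_fn(W * M))
    spline = CubicSpline(levels, costs)
    fine = np.linspace(levels[0], levels[-1], 1000)
    return fine[np.argmin(spline(fine))]
```

In Cox-PASNet the cost would be the loss of Eq. (1) evaluated with the masked weights; any callable `cost_fn` works in this sketch. The spline lets the returned level fall between grid points.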