Our methodology was first introduced in [18]; below we present it in more detail.
Regarding notation, we shall use capital letters to denote model parameters and random variables, and lowercase letters for the realizations of these variables. Let X_i, i = 1, ..., N, be random categorical variables. A categorical Bayesian network (CBN) (G, P) with nodes X_i is a probability model based on a directed acyclic graph (DAG) G and a conditional probability table P, defined as follows. G can be described by a collection of sets Pa_i such that, for each i, Pa_i, called the parent set of X_i, comprises all X_j for which there is a directed edge from X_j to X_i in G. We shall use the index k to denote the categorical levels of X_i and the multi-index j for the combinations of parent states of X_i, and with a slight abuse of notation we shall write k ∈ X_i and j ∈ Pa_i. The second component of a CBN is the table P of conditional probabilities Pr(X_i | Pa_i). Let P_{i,kj} ≡ Pr(X_i = k | Pa_i = j). For each i and j, the multinomial distribution of (X_i | Pa_i = j) is defined by the probability vector {P}_{i,j}\equiv {\left({P}_{i,kj}\right)}_{k\in {X}_{i}}, with {\sum}_{k\in {X}_{i}}{P}_{i,kj}=1. Then we have P={\left\{{P}_{i,j}\right\}}_{i,j}.
With [X_i] and [Pa_i] we shall denote the number of states of X_i and Pa_i, respectively, and with |Pa_i| the number of parents of X_i. Clearly, \left[P{a}_{i}\right]={\prod}_{{X}_{a}\in P{a}_{i}}\left[{X}_{a}\right]. The complexity of the CBN (G, P) is given by df\left(G\right)={\sum}_{i=1}^{N}\left[P{a}_{i}\right]\left(\left[{X}_{i}\right]-1\right) and equals the degrees of freedom for defining the probability table P.
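To make the complexity formula concrete, here is a minimal sketch (with a hypothetical data layout of our own choosing) that computes df(G) from the node cardinalities and parent sets:

```python
from math import prod

def df(card, parents):
    """Degrees of freedom of a CBN: sum_i [Pa_i]([X_i] - 1),
    where [Pa_i] is the product of the parent cardinalities."""
    total = 0
    for i, pa in parents.items():
        pa_states = prod(card[a] for a in pa)  # [Pa_i]; empty product is 1
        total += pa_states * (card[i] - 1)
    return total

# Hypothetical 3-node chain X1 -> X2 -> X3, each node with 3 categories.
card = {1: 3, 2: 3, 3: 3}
parents = {1: [], 2: [1], 3: [2]}
print(df(card, parents))  # 2 + 3*2 + 3*2 = 14
```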
For any DAG G, there is a node order Ω, also called causal order, such that the parents of each node always precede that node in Ω, a fact that we write symbolically as X_{Ω(1)} ≺ X_{Ω(2)} ≺... ≺ X_{Ω(N)}. Formally, Ω is a permutation of the indices 1,..., N, such that for any i > 0, Pa_{Ω(i) }⊂ {X_{Ω(1)}, X_{Ω(2)}, ..., X_{Ω(i−1)}}. For any order Ω, with \mathcal{G}\left(\text{\Omega}\right) we shall denote the class of DAGs that are compatible with Ω.
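The causal-order condition is straightforward to verify programmatically; the sketch below (our own illustration, not part of the original method) checks that every node's parents precede it in a candidate order Ω:

```python
def is_causal_order(order, parents):
    """True if every node's parents precede the node in `order`,
    i.e. Pa_{Omega(i)} is a subset of {X_Omega(1), ..., X_Omega(i-1)}."""
    pos = {x: t for t, x in enumerate(order)}
    return all(pos[p] < pos[x] for x, pa in parents.items() for p in pa)

# Hypothetical DAG: 1 -> 2, 1 -> 3, 2 -> 3.
parents = {1: [], 2: [1], 3: [1, 2]}
print(is_causal_order([1, 2, 3], parents))  # True
print(is_causal_order([3, 1, 2], parents))  # False: node 3 precedes its parents
```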
Learning categorical Bayesian networks from two-class data
We approach CBN learning from the perspective of classification problems where the observations are assumed to come from two different classes. We describe an algorithm that utilizes the class label information to find a DAG attaining maximum class discrimination with respect to a suitable measure. The essential component of our method is the graph structure estimation since the optimal conditional probability table can be easily inferred for any given DAG.
Let {\left\{{x}^{s}\right\}}_{s=1}^{n} be an n-sample of independent observations on {\left\{{X}_{i}\right\}}_{i=1}^{N} and let each observation x^{s} have a label c^{s} ∈ {0, 1} that assigns it to one of the two classes; the c^{s}'s are assumed to be observations on a binary random variable C. We denote the labeled sample by {D}_{n}={\left\{\left({x}^{s},{c}^{s}\right)\right\}}_{s=1}^{n}.
The log-likelihood function of a CBN (G, P) with nodes {\left\{{X}_{i}\right\}}_{i=1}^{N} with respect to the unlabeled sample {\left\{{x}^{s}\right\}}_{s=1}^{n} is
l\left(G,P|{D}_{n}\right)={\sum}_{i=1}^{N}{\sum}_{j\in P{a}_{i}}{\sum}_{k\in {X}_{i}}{n}_{i,kj}\,\text{log}\,{P}_{i,kj},
(1)
where {n}_{i,kj}\equiv {\sum}_{s=1}^{n}{1}_{\left\{{x}_{i}^{s}=k,\,p{a}_{i}^{s}=j\right\}} and {n}_{i,j}\equiv {\sum}_{k}{n}_{i,kj}. Note that for each i we have {\sum}_{j\in P{a}_{i}}{n}_{i,j}=n. For a fixed DAG G, the MLE \widehat{P} of P is obtained by maximizing l(G, P|D_{n}) as a function of P
l\left(G|{D}_{n}\right)\equiv {\text{max}}_{P}\,l\left(G,P|{D}_{n}\right)={\sum}_{i=1}^{N}{\sum}_{j\in P{a}_{i}}{n}_{i,j}{\sum}_{k\in {X}_{i}}{\widehat{P}}_{i,kj}\,\text{log}\,{\widehat{P}}_{i,kj},
(2)
where {\widehat{P}}_{i,kj}\equiv {n}_{i,kj}/{n}_{i,j} is the point estimate of P_{i,kj}. It is a well-known fact that by increasing the complexity of G, that is, by adding new edges to G, the likelihood (2) can only increase. Therefore the MLE solution for G based on (1) will tend to overfit the training data. This can be overcome if, instead of the log-likelihood, one optimizes a scoring function of the form l(G|D_{n}) − λ_{n}df(G) [13], where λ_{n} is a penalization parameter indexed by the sample size n. Some standard choices include BIC, λ_{n} = 0.5log(n)/n, and AIC, λ_{n} = 1/n; however, no 'universally' optimal penalization criterion is available. As we have already commented in the introduction, data-driven approaches for selecting λ_{n}, such as CV, are not necessarily optimal and their efficiency in two-class discrimination problems is unclear. In contrast, our approach is based on a scoring function, very similar to the likelihood ratio test, that can be used to learn an optimal graph structure without involving additional penalization parameters.
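To illustrate Eq. (2), the following sketch computes the counts n_{i,kj}, the point estimates P̂_{i,kj} = n_{i,kj}/n_{i,j} and one node's contribution to the maximized log-likelihood; the data layout and function names are hypothetical:

```python
import math
from collections import Counter

def mle_cpt(xs, pas):
    """Point estimates phat_{i,kj} = n_{i,kj} / n_{i,j} for one node,
    keyed by (category k, parent state j)."""
    n_kj = Counter(zip(xs, pas))  # n_{i,kj}
    n_j = Counter(pas)            # n_{i,j}
    return {(k, j): c / n_j[j] for (k, j), c in n_kj.items()}

def node_loglik(xs, pas):
    """This node's term of l(G | D_n): sum_{k,j} n_{i,kj} log phat_{i,kj}."""
    p = mle_cpt(xs, pas)
    n_kj = Counter(zip(xs, pas))
    return sum(c * math.log(p[k, j]) for (k, j), c in n_kj.items())

# Toy binary node X2 with a single binary parent X1.
x1 = (0, 0, 0, 0, 1, 1, 1, 1)
x2 = (0, 0, 0, 1, 1, 1, 1, 0)
print(mle_cpt(x2, x1)[(0, 0)])          # 3/4 = 0.75
print(round(node_loglik(x2, x1), 3))    # about -4.499
```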
Similarly to P_{i,kj}, let us define the conditional probabilities pertaining to the first experimental class, {P}_{i,kj}^{0}=Pr\left({X}_{i}=k\,|\,P{a}_{i}=j,C=0\right), and let {\widehat{P}}_{i,kj}^{0} be the corresponding point estimates as in (2), that is, {\widehat{P}}_{i,kj}^{0}={n}_{i,kj}^{0}/{n}_{i,j}^{0}, where {n}_{i,kj}^{0}\equiv {\sum}_{s=1}^{n}{1}_{\left\{{x}_{i}^{s}=k,\,p{a}_{i}^{s}=j,\,{c}^{s}=0\right\}}. We then consider the statistic
\mathcal{R}\left(G|{D}_{n}\right)=\frac{1}{n}{\sum}_{i=1}^{N}{\sum}_{j\in P{a}_{i}}{\sum}_{k\in {X}_{i}}{n}_{i,kj}\,\text{log}\left({\widehat{P}}_{i,kj}/{\widehat{P}}_{i,kj}^{0}\right)
(3)
and introduce the scoring function
\mathcal{S}\left(G|{D}_{n}\right)=\frac{\mathcal{R}\left(G|{D}_{n}\right)}{df\left(G\right)},
(4)
the intuition behind which is given below. Given a collection \mathcal{G} of DAGs with nodes {\left\{{X}_{i}\right\}}_{i=1}^{N}, we propose the following estimator of G
\hat{G}=arg\,{\text{max}}_{G\in \mathcal{G}}\,\mathcal{S}\left(G|{D}_{n}\right)
(5)
which we shall tentatively refer to as the BNKL estimator.
Equivalently, \mathcal{R} can be expressed as
\mathcal{R}\left(G|{D}_{n}\right)=\frac{1}{n}{\sum}_{i=1}^{N}{\sum}_{j\in P{a}_{i}}{n}_{i,j}\,{d}_{KL}\left({\widehat{P}}_{i,j}\,\|\,{\widehat{P}}_{i,j}^{0}\right),
where d_{KL} denotes the Kullback-Leibler (KL) divergence between the multinomial distributions {\widehat{P}}_{i,j}={\left\{{\widehat{P}}_{i,kj}\right\}}_{k} and {\widehat{P}}_{i,j}^{0}={\left\{{\widehat{P}}_{i,kj}^{0}\right\}}_{k}. The optimization problem (5) aims at finding a DAG that achieves maximum class separation with respect to the accumulated KL-divergence between the node conditional probability tables. Note that we always have \mathcal{R}\left(G|{D}_{n}\right)\ge 0. Moreover, if the {\widehat{P}}_{i,j}^{0} are uniform distributions, that is, {\widehat{P}}_{i,j}^{0}={\left(1/\left[{X}_{i}\right]\right)}_{k=1}^{\left[{X}_{i}\right]}, then (3) reduces to (1) up to an additive constant due to the equality
\frac{1}{n}{\sum}_{i=1}^{N}{\sum}_{j\in P{a}_{i}}{n}_{i,j}\,{d}_{KL}\left({\widehat{P}}_{i,j}\,\|\,{\widehat{P}}_{i,j}^{0}\right)=\frac{1}{n}l\left(G|{D}_{n}\right)+{\sum}_{i=1}^{N}\text{log}\left(\left[{X}_{i}\right]\right).
We can therefore look at \mathcal{R}\left(G|{D}_{n}\right) as an extension of the maximum log-likelihood l(G|D_{n}) to two-class problems.
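The KL form of \mathcal{R} can be sketched directly from its definition; the helper names and toy numbers below are our own illustration:

```python
import math

def kl(p, q):
    """Kullback-Leibler divergence d_KL(p || q) between two discrete
    distributions given as aligned probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def r_stat(n_j, p_hat, p_hat0):
    """R(G | D_n) restricted to one node: (1/n) sum_j n_{i,j} *
    d_KL(phat_{i,j} || phat0_{i,j}); summing over nodes gives the full score."""
    n = sum(n_j)
    return sum(nj * kl(p, p0) for nj, p, p0 in zip(n_j, p_hat, p_hat0)) / n

# Hypothetical node with two parent states and 3 categories.
n_j = [10, 6]                                  # n_{i,j}
p_hat = [[0.5, 0.3, 0.2], [0.4, 0.4, 0.2]]     # all-sample tables phat_{i,j}
p_hat0 = [[0.2, 0.5, 0.3], [0.4, 0.4, 0.2]]    # class-0 tables phat0_{i,j}
print(r_stat(n_j, p_hat, p_hat0) >= 0)  # True: R(G | D_n) is never negative
```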
For a fixed DAG G, the statistic 2n\mathcal{R} is in fact equivalent to the likelihood ratio chi-squared statistic (also known as the G^{2} statistic) applied to n\widehat{P} and n{\widehat{P}}^{0} viewed as observed and expected counts, respectively. Not surprisingly then, under the null hypothesis {H}_{0}:{P}_{i,j}={P}_{i,j}^{0} for all i, j, the statistic 2n\kappa \mathcal{R}\left(G|{D}_{n}\right) is asymptotically χ^{2} distributed with df(G) degrees of freedom, where κ = Pr(C = 0)/Pr(C = 1) is the odds of the first class (the formal proof of this fact is beyond the scope of this article).
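Under this asymptotic result, an approximate p-value for H_{0} can be read off the χ² survival function; the sketch below (using SciPy, with hypothetical input values) is our own illustration and not part of the original method:

```python
from scipy.stats import chi2

def lr_test_pvalue(R, n, kappa, df_G):
    """Asymptotic p-value for H0: P_{i,j} = P^0_{i,j} for all i, j,
    using 2*n*kappa*R(G | D_n) ~ chi^2 with df(G) degrees of freedom."""
    return chi2.sf(2 * n * kappa * R, df_G)

# Hypothetical inputs: n = 100 samples, balanced classes (kappa = 1),
# R = 0.2 and a DAG with df(G) = 14.
p = lr_test_pvalue(0.2, 100, 1.0, 14)
print(0.0 < p < 1.0)  # True
```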
The role of the factor df(G) in the denominator of (5) is to assist model selection. From an information-theoretical perspective, df(G) represents the amount of memory required for storing all of the states of a CBN with DAG G. Since \mathcal{R}\left(G|{D}_{n}\right) measures the class differences with respect to G, we can think of the scoring function (4) as an estimate of the degree of class separation per unit of complexity. Let \mathcal{R}\left(G\right) be the population version of (3) obtained by replacing \widehat{P} with the population probabilities P, that is
\mathcal{R}\left(G\right)={\sum}_{i=1}^{N}{\sum}_{j\in P{a}_{i}}Pr\left(P{a}_{i}=j\right)\,{d}_{KL}\left({P}_{i,j}\,\|\,{P}_{i,j}^{0}\right).
We say that G_{0} achieves the most efficient class separation in \mathcal{G} if
{G}_{0}=arg{\displaystyle \underset{G\in \mathcal{G}}{\text{max}}}\frac{\mathcal{R}\left(G\right)}{df\left(G\right)}.
(6)
Then, provided that G_{0} is the unique maximizer, it can easily be shown that \hat{G} is a consistent estimator of G_{0}, which makes (5) a sound statistical procedure.
We proceed to some computational aspects of problem (5). Because \mathcal{S}\left(G|{D}_{n}\right) is usually a highly nonregular function (nonsmooth and nonconvex), finding the optimal DAG essentially requires an exhaustive search in \mathcal{G}. To make the problem computationally manageable we thus need to impose some strong restrictive conditions on \mathcal{G}. First, we assume that the parent set sizes are bounded above by a constant M. The value of M should depend on the sample size n used for estimation and we do not recommend M > 2 unless n is in the hundreds. Second, we limit the search in (5) to DAGs compatible with a fixed but optimally chosen node order. An intuitive causality argument suggests that if we believe that node conditional distributions are set rather independently of each other, then, for a regulation X_{1} → X_{2}, it seems more plausible for X_{2} to have a higher between-class marginal difference than X_{1}. If we accept this argument, we would be inclined to assume that:
nodes with lower marginal class difference are upstream in the DAG of the network;
(7)
where 'upstream' is understood as earlier in the node order of the DAG. In the context of gene regulation we paraphrase this principle in 'target' and 'biomarker' terminology as follows: 'target' genes present lower differential expression in comparison to the 'biomarker' genes and are thus situated upstream in the regulation network with respect to the latter. Hereafter we refer to an order satisfying (7) as an order of increasing differential expression, or IDE; see Algorithm 1 below for its estimation.
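Estimating an IDE order then amounts to sorting the nodes by decreasing two-sample t-test p-value, as in step 1(a) of Algorithm 1 below; a minimal sketch with toy data (assuming SciPy's ttest_ind):

```python
from scipy.stats import ttest_ind

def ide_order(Y0, Y1):
    """IDE node order: sort node indices by decreasing two-sample t-test
    p-value, so that weakly differential nodes come upstream (first)."""
    pvals = [ttest_ind(y0, y1).pvalue for y0, y1 in zip(Y0, Y1)]
    return sorted(range(len(pvals)), key=lambda i: -pvals[i])

# Toy data: per-gene observation lists for class 0 and class 1.
Y0 = [[1.0, 1.1, 0.9, 1.0], [0.0, 0.1, -0.1, 0.0]]   # gene 0, gene 1
Y1 = [[1.0, 0.9, 1.1, 1.0], [2.0, 2.1, 1.9, 2.0]]
print(ide_order(Y0, Y1))  # [0, 1]: the non-differential gene 0 goes upstream
```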
Formally, for the purpose of solving (5), we consider collections of DAGs of the form
\mathcal{G}\left(\text{\Omega},M\right)=\left\{G:{X}_{\text{\Omega}\left(1\right)}\prec {X}_{\text{\Omega}\left(2\right)}\prec \dots \prec {X}_{\text{\Omega}\left(N\right)},\;|P{a}_{i}|\le M,\;\forall i\right\},
(8)
where Ω satisfies (7). In the actual data analyses below we use M = 2 as a compromise between the degree of network connectivity and computational complexity. For classes \mathcal{G}\left(\text{\Omega},M\right), the optimal DAG \hat{G} can be found by an efficient exhaustive search with polynomial complexity. In fact, BN estimation restricted to classes of DAGs of type (8) is not new and can be traced back to [19]. The BNKL algorithm is implemented in the sdnet package for R [20]. Below, we present average times for BNKL estimation of random CBNs with different sizes N, 3 categories per node, maximum parent size M = 2 and sample size n = 250 (Table 1).
The computational times are concordant with the theoretical complexity of the algorithm O(nN^{M + 1}).
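A possible organization of the restricted search is sketched below. We stress that the per-node decomposition of problem (5) used here is our own simplifying assumption (the actual sdnet implementation may differ); it does, however, exhibit the pattern behind the O(nN^{M+1}) complexity: for each node, every parent set of size at most M among its predecessors in the order is evaluated.

```python
import math
from collections import Counter
from itertools import combinations

def node_score(x, pa_cols, class0, levels):
    """Node-level analogue of the score (4): count-weighted KL divergence
    between the all-sample and class-0 conditional tables, divided by this
    node's degrees of freedom [Pa_i]([X_i] - 1)."""
    n = len(x)
    eps = 1.0 / (3 * n)                       # floor for empty class-0 cells
    js = list(zip(*pa_cols)) if pa_cols else [()] * n
    n_kj, n_j = Counter(zip(x, js)), Counter(js)
    n0_kj = Counter((xi, j) for xi, j, c0 in zip(x, js, class0) if c0)
    n0_j = Counter(j for j, c0 in zip(js, class0) if c0)
    r = 0.0
    for (k, j), c in n_kj.items():
        p = c / n_j[j]
        p0 = max(n0_kj.get((k, j), 0) / max(n0_j.get(j, 0), 1), eps)
        r += c * math.log(p / p0)
    df_i = (levels - 1) * levels ** len(pa_cols)
    return (r / n) / df_i

def bnkl_search(X, c, order, M=2, levels=3):
    """For each node, examine every parent set of size <= M among its
    predecessors in `order` and keep the best-scoring one: O(N^M)
    candidate sets per node, over N nodes, on n samples."""
    class0 = [ci == 0 for ci in c]
    parents = {}
    for t, i in enumerate(order):
        best, best_pa = -math.inf, ()
        for m in range(M + 1):
            for pa in combinations(order[:t], m):
                s = node_score(X[i], [X[p] for p in pa], class0, levels)
                if s > best:
                    best, best_pa = s, pa
        parents[i] = best_pa
    return parents

# Toy data: X1 copies X0 in class 0 and flips it in class 1, so the
# edge X0 -> X1 carries all of the class difference.
X = [[0, 1, 0, 1, 0, 1, 0, 1],
     [0, 1, 0, 1, 1, 0, 1, 0]]
c = [0, 0, 0, 0, 1, 1, 1, 1]
print(bnkl_search(X, c, order=[0, 1], M=1, levels=2))  # {0: (), 1: (0,)}
```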
A network model for classification of gene expression profiles
We return to the main goal of this investigation: developing a CBN-based classifier for two-class problems. We have shown, Eq. (5), how to choose a graph structure that achieves optimal separation of a labeled sample. We use the estimated structure as a common DAG of two CBNs that model the two classes with distinct probability tables. This approach, 'one DAG, two probability tables', has previously been adopted by other BN-based classifiers [21].
Gene expression data is acquired by a multistage process resulting in continuous variables that represent the expression levels of prespecified gene probes. Since a CBN is a discrete model, the initial step in our inference framework involves discretization: any sample {\left\{{y}^{s}\right\}}_{s=1}^{n} of observations on the gene-nodes {\left\{{X}_{i}\right\}}_{i=1}^{N} is transformed into a categorical sample {\left\{{x}^{s}\right\}}_{s=1}^{n}. Gene expression levels are often discretized into 3 categories: 'under-expressed', 'baseline' and 'over-expressed' [16]. Although more sophisticated procedures are certainly possible, in our experiments we employ a 3-level uniform discretization as follows. After excluding 5% of the most extreme values, a standard precaution against outliers, the range of the y's is divided into equal intervals and an observation y is assigned a categorical value x according to the interval into which y falls. The uniform discretization is simple to implement and has good performance in practice. We emphasize that, as should be the case in all well-designed training → test prediction studies, the discretization parameters (cutoff points) are determined strictly from the training sample and are then used to discretize the test sample.
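A minimal sketch of the 3-level uniform discretization (the index-based percentile rule is our own simplification; library quantile conventions differ slightly):

```python
def uniform_cutoffs(y):
    """Cutoff points for 3-level uniform discretization: drop the most
    extreme 5% of values (2.5% per tail), then split the remaining
    range into 3 equal intervals."""
    ys = sorted(y)
    n = len(ys)
    lo = ys[int(0.025 * (n - 1))]   # crude 2.5 percentile
    hi = ys[int(0.975 * (n - 1))]   # crude 97.5 percentile
    step = (hi - lo) / 3
    return lo + step, lo + 2 * step

def discretize(v, cuts):
    """Map a value to 0 ('under-expressed'), 1 ('baseline') or 2 ('over-expressed')."""
    c1, c2 = cuts
    return 0 if v < c1 else (1 if v < c2 else 2)

# Cutoffs are computed from the TRAINING sample only and then reused
# on test observations, as the text requires.
train = [float(i) for i in range(101)]          # hypothetical gene values
cuts = uniform_cutoffs(train)
print(discretize(10.0, cuts), discretize(50.0, cuts), discretize(90.0, cuts))  # 0 1 2
```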
More formally, we assume that: (i) the class samples D_{0} = D ∩ {c = 0} and D_{1} = D ∩ {c = 1} come from two CBNs, (G_{0}, P^{0}) and (G_{0}, P^{1}), with common DAG G_{0} and probability tables P^{0} and P^{1}; (ii) G_{0} is efficient in the sense of (6); (iii) G_{0} is compatible with an IDE order Ω and has a maximum parent size of M. Since G_{0} is unknown in advance, assumptions (ii) and (iii) cannot be checked. Instead, (ii) and (iii) should be considered technical assumptions specifying the properties of the estimated networks. All prerequisites being set, we propose Algorithm 1: its first part estimates G_{0}, P^{0} and P^{1}, while the second performs classification of test samples.
Algorithm 1 BNKL Classification

1. Training. Input: continuous labeled training sample {\left\{\left({y}^{s},{c}^{s}\right)\right\}}_{s=1}^{n}.

(a) Node order estimation, IDE (7): For each i, perform a t-test on the y_{i}'s comparing the 2 classes. Set \widehat{\text{\Omega}} to be the order of decreasing (t-test) p-values.

(b) Uniform discretization: For each i, set µ_{ik} = q_{1} + k(q_{2} − q_{1})/3, k = 1, 2, where q_{1} and q_{2} are the 2.5 and 97.5 percentiles of the training observations y_{i}. Discretize the y_{i}'s into 3 categories using the cutoff points µ_{i1} and µ_{i2}.

(c) Find the optimal DAG \hat{G} in \mathcal{G}\left(\widehat{\text{\Omega}}\right) according to Eq. (5).

(d) Define CBNs \left(\hat{G},{\widehat{P}}^{0}\right) and \left(\hat{G},{\widehat{P}}^{1}\right) by estimating the class-specific conditional probability tables {\widehat{P}}^{0}\left(\hat{G}|{D}_{0}\right) and {\widehat{P}}^{1}\left(\hat{G}|{D}_{1}\right) as in Eq. (2).

2. Prediction. Input: continuous test observation z.

(a) For each i, discretize {z}_{i}\mapsto {x}_{i} using the training cutoff points µ_{i1} and µ_{i2}.

(b) Calculate the log-likelihoods {l}_{0}=l\left(\hat{G},{\widehat{P}}^{0}|x\right) and {l}_{1}=l\left(\hat{G},{\widehat{P}}^{1}|x\right) according to Eq. (1).

(c) Assign z to the class with the greater log-likelihood.
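The prediction part of Algorithm 1 can be sketched as follows; the conditional probability tables below are hypothetical stand-ins for the estimates produced in training:

```python
import math

def loglik(x, parents, cpt):
    """l(G, P | x) for a single discretized observation x:
    sum_i log P_{i, x_i | pa_i(x)} (Eq. 1 with n = 1)."""
    return sum(math.log(cpt[i][(x[i], tuple(x[p] for p in parents[i]))])
               for i in parents)

def classify(x, parents, cpt0, cpt1):
    """Assign x to the class whose CBN gives the larger log-likelihood."""
    return 0 if loglik(x, parents, cpt0) >= loglik(x, parents, cpt1) else 1

# Tiny hypothetical model: X0 -> X1, binary nodes. Class 0 makes X1
# copy X0; class 1 makes X1 flip X0.
parents = {0: (), 1: (0,)}
cpt0 = {0: {(0, ()): 0.5, (1, ()): 0.5},
        1: {(0, (0,)): 0.9, (1, (0,)): 0.1, (0, (1,)): 0.1, (1, (1,)): 0.9}}
cpt1 = {0: {(0, ()): 0.5, (1, ()): 0.5},
        1: {(0, (0,)): 0.1, (1, (0,)): 0.9, (0, (1,)): 0.9, (1, (1,)): 0.1}}
print(classify((0, 0), parents, cpt0, cpt1))  # 0: X1 = X0 favors class 0
print(classify((0, 1), parents, cpt0, cpt1))  # 1: X1 != X0 favors class 1
```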
To avoid numerical instabilities in Algorithm 1, all zero slots in the estimated conditional probabilities {\widehat{P}}^{0} and {\widehat{P}}^{1} are reset to a minimum positive value of 1/(3n) (the resolution of a training sample of size n for populating a trinomial distribution) and then renormalized so that {\sum}_{k}{\widehat{P}}_{i,kj}=1 for all i and j ∈ Pa_{i}.
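The zero-cell adjustment can be sketched as a clip-and-renormalize step (our own minimal illustration):

```python
def smooth(probs, n):
    """Replace zero cells of a conditional distribution with 1/(3n)
    and renormalize so it still sums to 1."""
    eps = 1.0 / (3 * n)
    p = [max(pk, eps) for pk in probs]
    s = sum(p)
    return [pk / s for pk in p]

p = smooth([0.7, 0.3, 0.0], n=100)
print(p[2] > 0, abs(sum(p) - 1.0) < 1e-12)  # True True
```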
Figure 1 shows some examples of gene expression discretization and of the representation of gene interactions with trinomial distributions. For instance, the probabilities P_{kj} = Pr(X_{AKT1} = k | X_{PIK3R3} = j) of the regulation PIK3R3→AKT1 are P_{k1} = (0.4, 0.4, 0.2), P_{k2} = (0.23, 0.54, 0.23) and P_{k3} = (0.17, 0.42, 0.41). The probabilities corresponding to the first class only are {P}_{k1}^{0}=\left(0.14,0.72,0.14\right), {P}_{k2}^{0}=\left(0.28,0.14,0.57\right) and {P}_{k3}^{0}=\left(0.25,0.5,0.25\right). A formal test for linear association between the two genes fails to detect a significant class difference (p-val = 0.18). The relatively large KL-divergence score between P and P^{0}, however, indicates a significant class difference (p-val ≈ 0). The examples also illustrate different types of regulation such as activation (PIK3R3→AKT1, LAMB1→TRAF3) and inhibition (FHIT→LAMB1). The BNKL model, recall, is designed to detect changes in the interactions. For example, FHIT→LAMB1 is apparently inhibitory for the second class (in blue) but neutral for the first (in red), and BNKL captures that difference.
Benchmark classifiers
To evaluate the performance of the BNKL algorithm we compare it to 3 well-established classification methods. We consider SVM with Gaussian kernel as implemented in the e1071 package for R. The kernel parameter γ is tuned for optimality via CV on the training data. The benchmark performance of SVM is well established [22]. Our second choice is LASSO, an algorithm based on l_{1}-penalized linear regression, applied as follows. The expectation of the binary class variable is assumed to be a linear combination of a given set of gene-covariates. LASSO then selects a subset of significant predictor genes using an l_{1}-norm-based penalization criterion and discards the rest. The sum of squared errors is used as the classification criterion. We use the implementation of the algorithm provided by the lars package for R.
The third reference classifier, PC, employs a linear BN model as follows: (1) a DAG \hat{G} is fitted to the combined sample D_{0} ∪ D_{1} using the PC algorithm [23] with a Gaussian test for conditional independence at α-level 0.05 (see the pcalg package for R); thus a parent set Pa_{i} is selected for each i; (2) for each i, two distinct sets of (Y_{i}|Pa_{i})-regression parameters are estimated, one for each class; (3) test samples are classified according to the conditional likelihoods. Note that SVM, LASSO and PC, in contrast to BNKL, are applied directly to the continuous observations on {\left({X}_{i}\right)}_{i=1}^{N}.