We outline the derivation of the bounds and algorithms for the case *i* = *N* (all *N* training vectors are used). The results are clearly also valid for *i* = 1, 2, ..., *N*. We suppress the subscript *N* and simply write *V*, *K*, **σ***, £, etc.

Consider the pre-Hilbert space of models *f*(*x*; *a*) = Σ *a*_{x'} *K*(*x*', *x*), where the sums are initially over finitely many *x*'; where *K*(*u*, *v*) is a piecewise continuous, bounded, symmetric, non-negative kernel function on *V* × *V*, positive at diagonal points (*u*, *u*); and where the matrix *K*(*z*_{i}, *z*_{j}) is positive semi-definite for any finite {*z*_{i}} ⊆ *V* (positive definite for distinct *z*_{i}). Define an inner product [*f*(*x*; *a*), *f*(*x*; *b*)] = ΣΣ *a*_{x'} *b*_{x''} *K*(*x*', *x*''). Now extend this to a real Hilbert space by completion. Any *g* in the constructed Hilbert space can be identified with the pointwise limit of a sequence of models in the pre-Hilbert space that converges to *g* in the constructed Hilbert space. It can easily be shown that [*g*, *K*(*u*, ·)] = *g*(*u*), where *g*(*u*) is the value of the associated pointwise limit at *u*. Hence the space is called a reproducing kernel Hilbert space (RKHS). Assume throughout that || || denotes the RKHS norm and that there are *N* predictors {*x*_{i}} ⊆ *V*.
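As a concrete numerical check of the reproducing property, the sketch below builds a small model *f*(*x*) = Σ *a*_{x'} *K*(*x*', *x*) with a Gaussian kernel (an illustrative choice; the theory only requires the kernel properties listed above) and verifies that [*f*, *K*(*u*, ·)] = *f*(*u*):

```python
import numpy as np

# Gaussian kernel on V = R^d; illustrative choice -- the construction only
# needs a symmetric, bounded, positive semi-definite kernel.
def K(u, v, s=1.0):
    return np.exp(-np.sum((u - v) ** 2) / (2 * s ** 2))

rng = np.random.default_rng(0)
centers = rng.normal(size=(5, 2))   # finitely many points x' in V
a = rng.normal(size=5)              # coefficients a_{x'}

# Model f(x) = sum_{x'} a_{x'} K(x', x)
def f(x):
    return sum(a[i] * K(centers[i], x) for i in range(len(a)))

# RKHS inner product of two such models: [f, g] = sum sum a_{x'} b_{x''} K(x', x'')
def inner(a1, c1, a2, c2):
    return sum(a1[i] * a2[j] * K(c1[i], c2[j])
               for i in range(len(a1)) for j in range(len(a2)))

# Reproducing property: [f, K(u, .)] = f(u)
u = np.array([0.3, -0.7])
lhs = inner(a, centers, np.array([1.0]), u[None, :])
assert np.isclose(lhs, f(u))
```

Here *K*(*u*, ·) is itself a model with a single coefficient 1 at *u*, which is why the check reduces to one call to `inner`.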

Let *M* = {*g*: the RKHS norm of *g* is less than or equal to *M*}. By translation we may assume that the query vector *x*_{0} = 0. The following two theorems are proven in [10] for the more general case of *f*(*x*) being within *ε*(*x*) on *V* (where *ε*(0) = 0) of some member of the family *M*.

### Theorem I (Minimax Query-based Vector Machine)

Let *f*(*x*) be any function (not just a "probability of class 1 given *x*" function) in *M*, and let *Y*_{j} = *f*(*x*_{j}) + *N*_{j}, *j* = 1, 2, ..., *N*, with the noise covariance matrix **N** bounded (in the semi-definite order, i.e. **σ** − **N** is positive semi-definite) by a positive definite **σ** (in this paper **σ** = 0.25 **I**). Consider the matrix **K*** = ((*K*(*x*_{i}, *x*_{j}))), *i*, *j* = 0, 1, 2, ..., *N*. (*V* is centered at the query point *x*_{0}, which we are taking as 0, but the results obtained are the same for any query point *x*_{0}: move *x*_{0} to the origin by subtracting *x*_{0} from each predictor *x*_{j}.) Set *w*_{0} = −1 (*w* now has *N* + 1 components), let **σ*** be the (*N* + 1) × (*N* + 1) matrix formed by adding a 0-th row and 0-th column of zeros to the noise covariance upper bound **σ**, and let *u* be the (*N* + 1)-dimensional vector (1, 0, 0, ..., 0)^{t}. Let

*F*(*w*) = *w** + Σ_{j=1}^{N} *w*_{j} *Y*_{j}.

Then the mean squared error of *F*(*w*), where *w** = 0 (note that *F*(*w*) does not involve *w*_{0}), is bounded by £ if

*M*^{2} *w*^{t} **K*** *w* + *w*^{t} **σ*** *w* ≤ £,

and this is the best possible bound on the mean squared error if we allow any noise covariance **N** bounded in the semi-definite order by **σ**.

Proof: see theorem VI in [10].
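One standard way to realize such weights — our reading of the construction, not necessarily the exact algorithm of [10] — is to note that by Cauchy–Schwarz the squared bias of *F*(*w*) is at most *M*² *w*^{t}**K***​*w* (with *w*_{0} = −1) and the noise variance is at most *w*^{t}**σ***​*w*, so the best guaranteed MSE comes from minimizing *w*^{t}(*M*²**K*** + **σ***)*w* subject to *u*^{t}*w* = *w*_{0} = −1. A minimal numerical sketch, in which the Gaussian kernel, the value of *M*, and the data are all illustrative assumptions:

```python
import numpy as np

# Illustrative kernel; the theorem assumes only the kernel properties stated earlier.
def gauss_K(u, v, s=1.0):
    return np.exp(-np.linalg.norm(u - v) ** 2 / (2 * s ** 2))

rng = np.random.default_rng(1)
N, d, M = 8, 2, 2.0                      # M: assumed bound on the RKHS norm of f
X = np.vstack([np.zeros(d),              # row 0 is the query point x_0 = 0
               rng.normal(size=(N, d))])

# K*: (N+1) x (N+1) kernel matrix over the query point and the N predictors
Kstar = np.array([[gauss_K(X[i], X[j]) for j in range(N + 1)]
                  for i in range(N + 1)])

# sigma*: the bound sigma = 0.25 I, bordered by a 0-th row and column of zeros
sigma_star = np.zeros((N + 1, N + 1))
sigma_star[1:, 1:] = 0.25 * np.eye(N)

u = np.zeros(N + 1); u[0] = 1.0          # u = (1, 0, ..., 0)^t

# Minimize w^t (M^2 K* + sigma*) w subject to u^t w = w_0 = -1
# (Lagrange multipliers give w proportional to A^{-1} u).
A = M ** 2 * Kstar + sigma_star
Ainv_u = np.linalg.solve(A, u)
w = -Ainv_u / (u @ Ainv_u)               # rescaled so that w_0 = -1

# Guaranteed MSE: M^2 w^t K* w bounds the bias^2, w^t sigma* w the variance
bound = M ** 2 * (w @ Kstar @ w) + w @ sigma_star @ w
# F(w) = sum_{j>=1} w_j Y_j then estimates f(0) with MSE <= bound
```

The trivial choice *w* = (−1, 0, ..., 0) gives the a priori bound *M*²*K*(0, 0), so the minimizer's `bound` can never exceed that.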

### Theorem II (Vector Machine with Context)

Assume the hypotheses and notation of Theorem I, except that *f*(*x*) takes values in [0, 1] and is in *PK*_{α}(*V*). Then the estimator *F*(*w*), which equals the *F*(*w*) of Theorem I except that *w** = −*α*(*w*_{0} + *w*_{1} + ... + *w*_{N}), has mean squared error bounded by the £ of Theorem I.

For such *M*, we call *F*(*w*) the contextual Tikhonov estimator. In fact, for any *M* greater than or equal to the right-hand side of the above inequality, the same result holds. A good choice for *α* is 0.5 (which is used in our experiments), since it minimizes *M*_{V} as a function of *α*.

Proof: see theorem VII of [10].
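Given weights *w* from Theorem I, the contextual estimator changes only the constant term: *w** = −*α*(*w*_{0} + *w*_{1} + ... + *w*_{N}). A minimal sketch with the paper's choice *α* = 0.5 (the specific weights and labels below are illustrative, not from [10]):

```python
import numpy as np

def contextual_F(w, Y, alpha=0.5):
    """Contextual Tikhonov estimate at the query point.

    w     : (N+1,) weights from Theorem I, with w[0] = -1 (the query slot);
    Y     : (N,) noisy labels Y_j = f(x_j) + N_j at the N predictors;
    alpha : context level; 0.5 is the choice used in the experiments.
    """
    w_star = -alpha * np.sum(w)          # w* = -alpha (w_0 + w_1 + ... + w_N)
    return w_star + w[1:] @ Y            # F(w) = w* + sum_{j>=1} w_j Y_j

# Illustrative call: simple averaging weights, so w sums to 0 and w* vanishes,
# and the estimate is just the mean label 0.5.
N = 4
w = np.concatenate([[-1.0], np.full(N, 1.0 / N)])
Y = np.array([0.0, 1.0, 1.0, 0.0])
est = contextual_F(w, Y)
```

The only difference from Theorem I's estimator is `w_star`; the data-dependent part *w*_{1:N} · *Y* is unchanged.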