In this section, we first present the intuition behind the proposed algorithm framework and then, following the proposal of Gurcan et al. [1], describe the framework as three steps:

1. Multi-instance sample representation

2. Feature extraction

3. Training of learning algorithms
Figure 3 illustrates the framework of the above three steps. We note that the proposed framework is adaptable and flexible: it only prescribes a general pipeline, and each step can be replaced by a different implementation according to the application domain.
Formulation
The proposed annotation framework is motivated by the nature of skin biopsy image recognition, which can be naturally expressed as a multi-instance learning problem. To make this intuition clearer, it is helpful to review the procedure of manually annotating skin biopsy images. In dermatopathological clinical practice, doctors annotate an image with a set of standard terms. However, doctors are not required to explicitly record the correspondence between standard terms and regions within a given image, which leads to the term ambiguity described in the previous section. Because terms are actually associated with certain local regions, it is not reasonable to connect each region of an image to all associated terms, which results in poor models from a machine learning perspective [16]. As illustrated in Figures 2(a)-(d), regions within a given image may have different relationships to the attached terms. It would be time-consuming to manually label each region with a set of terms to meet the requirements of traditional single-instance learning. For this reason, by regarding each image as a bag and the regions within the image as instances, multi-instance learning is naturally suitable for the annotation task. According to the basic assumption of multi-instance learning [15], a bag can be annotated with a term if it contains at least one region labelled with that term; otherwise, the bag cannot be annotated with that term. Thus, we can build a set of binary multi-instance classifiers, each of which corresponds to a term. Given an image, each classifier outputs a Boolean value indicating whether its term should be annotated to the image. Thereby, we address the term ambiguity within a multi-instance learning framework.
Another challenge is how to effectively represent an image as a multi-instance sample, or bag. The key problem is how to partition an image into regions to construct the instances. Skin tissue is microscopically composed of several different structures, and a doctor needs to inspect them individually to determine abnormal areas. Regions of a skin biopsy image should therefore be divided according to the structures of the skin tissue so that a feature description can be produced for each part, but clustering-based algorithms [17] may not generate contiguous regions. Hence, we apply an image-cutting algorithm, namely Normalized Cut (NCut) [18], to generate visually disjoint local regions. Prior knowledge in dermatopathology suggests that, on the one hand, examining an individual visually disjoint region is sufficient to annotate it in most cases and, on the other hand, there is no strong correlation between the terms to be annotated in a given image. The former supports the application of our image-cutting method, and the latter allows us to decompose the annotation task into a set of binary multi-instance classification tasks.
Formally, let D = {(I_{i}, T_{i}) | i = 1, ..., n, I_{i} ∈ I, T_{i} ⊆ T} be a set of skin biopsy images associated with sets of annotated terms, where T = {t_{1}, t_{2}, ..., t_{m}} is the set of standard annotation terms and I is the set of images. Each image is stored as a pixel matrix in 24-bit RGB colour space. The task is to learn a function f : I → 2^{T} from D. Given an unseen image I_{x}, f outputs a subset of T corresponding to the annotation terms of I_{x}.
We first apply a cutting algorithm to generate visually disjoint regions for each image, given by I_{i} = {I_{ij} | j = 1, ..., n_{i}}, where n_{i} is the number of regions in image I_{i}, followed by a feature extraction procedure that expresses each generated region as a feature vector. Then, we train the target model with two algorithms.
Skin biopsy image representation
We now present a method for representing a skin biopsy image. First, we express each image as a bag of regions (instances), and then apply two transformation-invariant feature extraction methods to further express the regions as feature vectors.
Multi-instance sample representation
To generate visually disjoint regions, we adopt a well-known graph-cutting algorithm, Normalized Cut (NCut), proposed by Shi et al. [18] in 2000 and aimed at extracting perceptual groupings from a given image. In contrast with clustering-based image segmentation algorithms, e.g., [17], NCut extracts the global impression of a given image, i.e., disjoint visual groupings. To make this article self-contained, we briefly present the main idea of NCut.
NCut approaches the segmentation of an image as a graph-cutting problem. It constructs a locally connected graph over neighbouring pixels within an image: the vertices of the graph are pixels, and the edge weights are similarities between pixels. The goal of NCut is to find a cut that maximises within-segment similarity and minimises cross-segment similarity. Formally, supposing there is a graph G = (V, E), we aim to find an optimal cut that partitions it into two disjoint sets A and B, where A ∩ B = ∅ and A ∪ B = V. A measure of the quality of a cut is defined in Eq. 1:
\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)}
(1)
where \mathrm{cut}(A, B) = \sum_{u \in A, v \in B} w(u, v), w(u, v) is the weight of the edge between vertices u and v, and \mathrm{assoc}(A, V) = \sum_{u \in A, t \in V} w(u, t) is the total weight of the edges between the vertices in segment A and all vertices in graph G. Because graph G is locally connected, a binary column vector x of dimension |V| × 1 can be defined to indicate whether each vertex belongs to subset A. The goal of NCut is to find a cut that minimises Ncut(A, B), as Eq. 2 shows.
\min_{x} \mathrm{Ncut}(x)
(2)
According to [18], the solution to Eq. 2 captures a visual segmentation of an image whose underlying idea is naturally consistent with the clinical experience of skin biopsy image recognition. Eq. 2 can be solved as a standard Rayleigh quotient [19]; we omit the detailed procedure for brevity. The computational time complexity of NCut is O(n^{2}), where n is the number of pixels in the image.
The number of regions p is a parameter that must be set beforehand. Figure 4 shows the NCut outputs of the same image under different parameter settings. Parameter p affects the model performance to some extent; we examine this in the discussion section.
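As a sanity check on the graph formulation above, the two-way NCut relaxation can be sketched in a few lines of numpy: the second-smallest eigenvector of the symmetric normalized Laplacian, thresholded at zero, yields the bipartition. This is an illustrative toy on a four-vertex graph, not the segmentation implementation used in our experiments, which operates on pixel graphs and produces p segments.

```python
import numpy as np

def ncut_bipartition(W):
    """Bipartition a weighted graph via the NCut relaxation: take the
    second-smallest eigenvector of the symmetric normalized Laplacian
    D^{-1/2} (D - W) D^{-1/2} and threshold it at zero."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)          # eigenvalues ascending
    # second-smallest eigenvector (the smallest is trivial), mapped back
    y = d_inv_sqrt * eigvecs[:, 1]
    return y >= 0                                  # boolean segment indicator

# Toy graph: vertices {0, 1} and {2, 3} are strongly connected internally
# and weakly connected across -- the cut should separate the two pairs.
W = np.array([[0.00, 1.00, 0.01, 0.01],
              [1.00, 0.00, 0.01, 0.01],
              [0.01, 0.01, 0.00, 1.00],
              [0.01, 0.01, 1.00, 0.00]])
labels = ncut_bipartition(W)
```

The sign of the eigenvector is arbitrary, so only the grouping (not which side is `True`) is meaningful.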
Feature extraction based on 2D-DWT
Previous work on skin image analysis has indicated that a good feature extraction method significantly affects model performance. Many problem-oriented feature expression methods have been proposed and proven successful in histopathology and dermatopathology [1]. However, feature extraction methods for skin biopsy images are seldom reported. Considering the variation in colour, rotation, magnification and even resolution across skin biopsy images, we propose a transformation-invariant feature extraction method based on the two-dimensional discrete wavelet transform (2D-DWT). The basic idea of the proposed feature extraction originated from [20, 17], which suggested applying the 2D-DWT in colour space to each block within a given image. We briefly describe the proposed feature extraction method as follows.

1. Input a local region IR generated by NCut. Note that regions generated by NCut are irregular. For convenience, we store them as minimum covering rectangles by padding the regions with black pixels, as indicated in Figure 5. This padding does not significantly affect model performance, as most of the padding pixels are discarded in later steps.

2. Colour space transformation. IR is an RGB representation and is transformed to LUV space, denoted IR_LUV. Calculate features f_{1} = mean(IR_LUV.L), f_{2} = mean(IR_LUV.U) and f_{3} = mean(IR_LUV.V).

3. Divide IR_LUV into squares of size m × m pixels, resulting in (width/m) × (height/m) blocks, denoted B_{pq}, where p = 1, ..., width/m and q = 1, ..., height/m. Eliminate blocks that are totally black, so as to remove as many padding pixels as possible.

4. Apply the 2D-DWT to each B_{pq}, and keep the coefficient subbands LH, HL and HH. Let t_{x} = \sqrt{\frac{1}{4} x^{T} x}, where x ∈ {LH, HL, HH}. Average t_{x} over all blocks within the region to obtain features f_{4}, f_{5}, f_{6}.

5. Following [20], calculate the normalized inertia of orders 1, 2 and 3 as features f_{7}, f_{8}, f_{9}.
After the above five steps, a 9-dimensional real-valued vector is obtained for each region. An image is thus transformed into a set of disjoint regions, each represented as a real feature vector, which turns the original dataset into a multi-instance representation. Note that this representation is transformation invariant: the 2D-DWT extracts texture features of regions that are insensitive to rotation angle and magnification, and the other features, the LUV means and the normalized inertia of orders 1, 2 and 3, are also transformation invariant. In the following section, we provide an in-depth discussion of the effectiveness of this feature extraction method.
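The block-level texture features f_{4}-f_{6} from steps 3-4 can be sketched as follows for a single colour channel. A hand-rolled one-level Haar transform stands in for a wavelet library, and the block size m = 4 (giving four coefficients per detail subband, matching the 1/4 factor in t_{x}) is an illustrative choice rather than a value fixed by the paper.

```python
import numpy as np

def haar_dwt2(block):
    """One-level 2D Haar DWT; returns the (LL, LH, HL, HH) subbands."""
    a = block[0::2, 0::2]; b = block[0::2, 1::2]
    c = block[1::2, 0::2]; d = block[1::2, 1::2]
    LL = (a + b + c + d) / 2.0
    LH = (a + b - c - d) / 2.0   # vertical detail
    HL = (a - b + c - d) / 2.0   # horizontal detail
    HH = (a - b - c + d) / 2.0   # diagonal detail
    return LL, LH, HL, HH

def block_texture_features(channel, m=4):
    """Average t_x = sqrt(mean(x^2)) of the LH, HL, HH coefficients over
    all m-by-m blocks of a single-channel region (features f4..f6)."""
    h, w = channel.shape
    feats = []
    for p in range(h // m):
        for q in range(w // m):
            blk = channel[p*m:(p+1)*m, q*m:(q+1)*m]
            _, LH, HL, HH = haar_dwt2(blk)
            feats.append([np.sqrt(np.mean(x ** 2)) for x in (LH, HL, HH)])
    return np.mean(feats, axis=0)   # -> [f4, f5, f6]

region = np.ones((8, 8))            # a perfectly flat region
f = block_texture_features(region)  # flat texture -> all detail energy is 0
```

A flat region yields zero detail energy, while any textured region (e.g., a checkerboard) produces positive values, which is the behaviour the features rely on.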
Feature extraction based on SIFT
The scale-invariant feature transform (SIFT) [21] is a well-studied feature extraction method widely used in medical image classification. Caicedo et al. [10] used SIFT to extract histopathological image features. We apply SIFT as our second feature extraction strategy. Unlike the 2D-DWT, SIFT has been proven to be a robust key-point selector across different image annotation and analysis applications. We use the common SIFT setting of 8 orientations and 4 × 4 blocks, resulting in a 128-dimensional vector. Intuitively speaking, SIFT selects several salient points to represent a given image. We apply SIFT to the NCut-generated regions to obtain a feature vector.
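The 8-orientation, 4 × 4-cell layout mentioned above is what yields SIFT's 128-dimensional descriptor. The sketch below reproduces only that histogram layout on a raw patch; key-point detection, Gaussian weighting and the trilinear interpolation of the real SIFT pipeline are omitted, and in practice a library implementation would be used.

```python
import numpy as np

def sift_like_descriptor(patch):
    """Build a 4x4-cell, 8-orientation-bin gradient histogram (128-D),
    mirroring SIFT's descriptor layout on a single square patch."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)                              # gradient magnitude
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)         # orientation in [0, 2pi)
    bins = np.minimum((ori / (2 * np.pi) * 8).astype(int), 7)
    h, w = patch.shape
    ch, cw = h // 4, w // 4
    desc = np.zeros((4, 4, 8))
    for i in range(4):
        for j in range(4):
            sl = (slice(i * ch, (i + 1) * ch), slice(j * cw, (j + 1) * cw))
            # accumulate gradient magnitude into orientation bins per cell
            np.add.at(desc[i, j], bins[sl].ravel(), mag[sl].ravel())
    desc = desc.ravel()
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc                  # unit-normalized

rng = np.random.default_rng(0)
d = sift_like_descriptor(rng.random((16, 16)))
```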
Model training
We propose two multi-instance learning algorithms to train our model. The first algorithm is based on Citation-KNN [22], and the second is a Bayesian multi-instance learning algorithm, namely Gaussian Process Multi-Instance Learning (GPMIL) [23]. Citation-KNN was first proposed by Wang et al. [22] and can be regarded as a multi-instance version of the traditional KNN classifier. To determine a test bag's label, Citation-KNN considers not only the K nearest labelled bags, i.e., references, but also the labelled bags that regard the given bag as a K nearest neighbour, i.e., citers. Citation-KNN is well studied and has many successful applications in machine learning. GPMIL introduces a Gaussian process prior and solves the multi-instance learning problem in a Bayesian learning framework. The essential idea of GPMIL is that, by defining a set of latent variables and a likelihood function, it establishes the relationship between class labels and instances in a probabilistic framework. By imposing a Gaussian process prior on these latent variables, we can use a Bayesian learning strategy to derive a posterior distribution of annotation terms given a training dataset and a test image.
We extend these two algorithms to meet the requirements of our annotation task, taking into consideration some insights into skin biopsy image annotation. On the one hand, because there is no prior knowledge on which to base more specific multi-instance learning assumptions [24] for our task, we build our model on the original assumption [15]; Citation-KNN with a properly defined similarity metric is a simple but effective algorithm in this case. On the other hand, a confidence level for each term to be annotated to a given image is preferred, which requires us to model the predictive distribution of annotation terms. To achieve this goal, we extend Bayesian learning to the multi-instance setting and model the posterior distribution of the annotation terms. An additional benefit of the Bayesian learning framework is that it is possible to model correlation between annotation terms, leading to a more general model.
Citation-KNN for annotation
Citation-KNN is a multi-instance learning algorithm inspired by the citation and reference system of the scientific literature. To determine the label of a test bag X, it considers not only the neighbours (references) of X but also the bags (citers) that regard X as a neighbour, and uses both to determine the unseen bag's label. The key problem is how to evaluate distances between bags in order to identify references and citers.
Citation-KNN implements a simple idea: if two images A and B share the same term, they should regard each other as neighbours under a properly defined similarity measure, i.e., B is one of the K nearest neighbours of A and vice versa. In our work, a modified version of the Hausdorff distance [25] is used as the similarity measure, given by
\mathrm{AHD}(A, B) = \frac{\sum_{a \in A} \min_{b \in B} d(a, b) + \sum_{b \in B} \min_{a \in A} d(b, a)}{|A| + |B|}
(3)
where AHD is the average Hausdorff distance between two bags A and B, a and b are instances in each bag, and d(x, y) is the Euclidean distance in instance space. As indicated in [25], the AHD achieves better performance than other set distance functions in multi-instance learning. Intuitively, the AHD is the average minimal distance between instances from the two bags, which better evaluates the spatial relationship between a pair of bags.
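Eq. (3) is straightforward to implement; the following numpy sketch computes the AHD between two bags of instance vectors on toy data.

```python
import numpy as np

def ahd(A, B):
    """Average Hausdorff distance (Eq. 3) between bags A and B, each an
    (n_instances, n_features) array, using Euclidean instance distance."""
    # pairwise Euclidean distances: D[i, j] = d(A[i], B[j])
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # average of the per-instance minimal distances in both directions
    return (D.min(axis=1).sum() + D.min(axis=0).sum()) / (len(A) + len(B))

bag1 = np.array([[0.0, 0.0], [1.0, 0.0]])   # two instances in 2-D
bag2 = np.array([[5.0, 0.0], [6.0, 0.0]])   # a distant bag
```

Unlike the classical Hausdorff distance, which takes a max over the minimal distances, the average makes the measure less sensitive to a single outlying instance.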
Note that Citation-KNN is a memory-based algorithm: all training samples must be stored, no training procedure is required, and, when testing a given image, the AHD must be computed between the test image and all training samples. To reduce this computational cost, we define a locality matrix LM to speed up the algorithm as follows.

1. Cluster the training set D to obtain s clusters and denote the centroid of each cluster as c_{i}, i = 1, ..., s.

2. Compute the AHD between each training sample and each centroid c_{i}, and keep the K nearest training samples for each c_{i} in the i-th row of LM.
Thus we obtain an s-by-K locality matrix LM. When testing an image, we first calculate the distances between the centroids and the given image, then discard the centroids that are far from it. For the remaining centroids, we look up the corresponding rows of LM; only the training samples in those rows are needed in the distance computation. We can thus prune out a large portion of the training samples that are far away from the test image, which greatly reduces the computational cost. The matrix needs to be computed only once before testing, at cost O(n^{2}), where n = |D| is the size of the training set.
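The two steps above can be sketched as follows. For illustration, the centroids are assumed to be chosen among the training bags, and the pairwise bag distances (the AHD in our setting) are assumed to be precomputed; a simple 1-D toy distance matrix stands in for real data.

```python
import numpy as np

def build_locality_matrix(dist, centroid_idx, K):
    """Precompute the s-by-K locality matrix LM: row i holds the indices
    of the K training samples nearest to centroid i.  `dist` is an
    n-by-n matrix of precomputed bag-to-bag distances."""
    LM = np.empty((len(centroid_idx), K), dtype=int)
    for i, c in enumerate(centroid_idx):
        LM[i] = np.argsort(dist[c])[:K]    # K nearest training samples
    return LM

# toy example: 6 training bags placed on a line at x = 0 .. 5,
# with samples 0 and 5 standing in for the two cluster centroids
x = np.arange(6.0)
dist = np.abs(x[:, None] - x[None, :])
LM = build_locality_matrix(dist, centroid_idx=[0, 5], K=3)
```

At test time, only the rows of LM belonging to centroids near the test image are consulted, so most bag-to-bag AHD computations are skipped.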
GPMIL
We propose a Bayesian learning algorithm with a Gaussian process prior for our annotation task. Following [23], we first introduce an unobserved latent function g(x), defined on the instance space, for each annotation term t, such that for each instance x, g(x) gives a probability indicating the confidence that x should be annotated with term t. We further impose a Gaussian process prior on g over the whole instance space. Let G = {g(x_{i}) | i = 1, ..., n_{inst}}, where n_{inst} denotes the size of the instance space. We have G ~ N(0, K) as a Gaussian process prior [26], where K is the Gram matrix of some well-known kernel evaluated on all instance pairs. To establish the connection between g(x) and the annotated terms of images, a likelihood function is defined according to the basic multi-instance assumption [15] as Eq. (4):
p(t \mid G_{B}) = p(t \mid g(x_{1}), \cdots, g(x_{|B|})) = \max_{j} g(x_{j})
(4)
where G_{B} represents the outputs of g(x) for all instances in bag B, and |B| is the size of bag B. For mathematical convenience, the softmax function is used instead of the max, thus we have
p(t \mid G_{B}) = \max_{j} g(x_{j}) \approx \frac{1}{\alpha} \ln \sum_{x \in B} e^{\alpha g(x)}
(5)
where α is an amplifying factor of the softmax function. If the largest g(x_{j}) over all j is less than 0.5, bag B is not annotated with term t, because p(t | G_{B}) < 0.5. The joint likelihood function on the whole training set D can be written as
p(T \mid G_{D}) = \prod_{B \in D} p(t \mid G_{B})
(6)
where T is a Boolean vector indicating whether each bag B in D is annotated with term t. However, we are concerned with the label of a test bag, not G_{B} or G_{D} themselves. Following Bayes' rule, the posterior distribution over G_{D} given the training dataset D and term t can be written as
p(G_{D} \mid D, T) = \frac{p(T \mid G_{D}) \, p(G_{D})}{p(T \mid D)}
(7)
where p(T | G_{D}) is the joint likelihood defined in Eq. (6), p(G_{D}) is the Gaussian process prior and p(T | D) is the marginal likelihood given by
p(T \mid D) = \int p(T \mid G_{D}) \, p(G_{D}) \, dG_{D}
(8)
With Eqs. (7) and (8), we can further obtain the predictive distribution for annotating a test bag X with term t as
p(t \mid D, T, X) = \int p(t \mid G_{X}, X) \, p(G_{X} \mid D, T, X) \, dG_{X}
(9)
where, on the right-hand side of Eq. (9), p(t | G_{X}, X) is the likelihood function of the test image X, and p(G_{X} | D, T, X) is the posterior distribution of the latent variables G_{X}, given by p(G_{X} \mid D, T, X) = \int p(G_{X} \mid G_{D}, D, X) \, p(G_{D} \mid D, T) \, dG_{D}. For each test image X, using the whole training dataset and the corresponding annotation vector T, we obtain a predictive distribution that is a function of X and t. Effective methods for solving Eq. (9) can be found in [27, 23].
To make the idea of GPMIL clearer, we provide an example as follows:

1. Suppose we have a training image set D associated with a binary annotation vector for term t, and a test image X.

2. Following Eqs. (4) and (6), calculate the likelihood function for the training set D.

3. Following Eqs. (7), (8) and (9), write down the analytical form of the predictive distribution for X.

4. Use an approximation method to transform the predictive distribution into a Gaussian distribution that can be handled analytically. After this step, a closed-form solution is obtained for testing any unseen image; in other words, the training set can be discarded in the testing step.
For each annotation term t, a model is trained by using GPMIL. For a test image, each model calculates a probability indicating the confidence of annotating the image with the corresponding term.