In this section, we first present the intuition behind the proposed algorithm framework and then, following the proposal of Gurcan et al. [1], describe the framework as three steps:

1. Multi-instance sample representation

2. Feature extraction

3. Training of learning algorithms
Figure 3 illustrates the framework of the above three steps. We note that the proposed framework is adaptable and flexible: it only prescribes a general pipeline, and each step can be replaced by a different implementation according to the application domain.
Formulation
The proposed annotation framework is motivated by the nature of skin biopsy image recognition, which can be naturally expressed as a multi-instance learning problem. To make this intuition clearer, it is helpful to review the procedure of manually annotating skin biopsy images. In dermatopathological clinical practice, doctors annotate an image with a set of standard terms. However, doctors are not required to explicitly record the correspondence between standard terms and regions within a given image, which leads to the term ambiguity described in the previous section. Because terms are actually associated with certain local regions, it is not reasonable to connect each region of an image to all associated terms, which results in poor models from a machine learning perspective [16]. As illustrated in Figures 2(a)-(d), regions within a given image may have different relationships to the attached terms. It would be time-consuming to manually label each region with a set of terms to meet the requirements of traditional single-instance learning. For this reason, by regarding each image as a bag and the regions within the image as instances, multi-instance learning is naturally suitable for the annotation task. According to the basic assumption of multi-instance learning [15], a bag can be annotated with a term if it contains at least one region labelled with that term; otherwise, the bag cannot be annotated with that term. Thus, we can build a set of binary multi-instance classifiers, each of which corresponds to a term. Given an image, each classifier outputs a Boolean value indicating whether its term should be annotated to the image. Thereby, we address the term ambiguity within a multi-instance learning framework.
Another challenge is how to effectively represent an image as a multi-instance sample, or bag. The key problem is how to partition an image into regions to construct the instances. Skin tissue is microscopically composed of several different structures, and a doctor needs to inspect them individually to determine abnormal areas. Regions of a skin biopsy image should therefore be divided according to the structures of the skin tissue so that a feature description can be produced for each part, but clustering-based algorithms [17] may not generate contiguous regions. Hence, we apply an image-cutting algorithm, namely Normalized Cut (NCut) [18], to generate visually disjoint local regions. Prior knowledge in dermatopathology suggests that, on the one hand, examining an individual visually disjoint region is sufficient to annotate it in most cases and, on the other hand, there is no strong correlation between the terms to be annotated in a given image. The former supports the application of our image-cutting method, and the latter allows us to decompose the annotation task into a set of binary multi-instance classification tasks.
Formally, let D = {(I_{i}, T_{i}) | i = 1, ..., n, I_{i} ∈ I, T_{i} ⊆ T} be a set of skin biopsy images associated with sets of annotated terms, where T = {t_{1}, t_{2}, ..., t_{m}} is the set of standard annotation terms and I is the set of images. Each image is stored as a pixel matrix in 24-bit RGB colour space. The task is to learn a function f : I → 2^{T} from D. Given an unseen image I_{x}, f outputs a subset of T corresponding to the annotation terms of I_{x}.
We first apply a cutting algorithm to generate visually disjoint regions for each image, given by I_{i} = {I_{ij} | j = 1, ..., n_{i}}, where n_{i} is the number of regions in image I_{i}, followed by a feature extraction procedure that expresses each generated region as a feature vector. Then, we train the target model with two algorithms.
Skin biopsy image representation
We now present a method for representing a skin biopsy image. First, we express each image as a bag of regions (instances), and then apply two transformation-invariant feature extraction methods to further express the regions as feature vectors.
Multi-instance sample representation
To generate visually disjoint regions, we adopt a well-known graph-cutting algorithm, Normalized Cut (NCut), proposed by Shi et al. [18] in 2000 and aimed at extracting perceptual groupings from a given image. In contrast with clustering-based image segmentation algorithms, e.g., [17], NCut extracts the global impression of a given image, i.e., disjoint visual groupings. To make this article self-contained, we briefly present the main idea of NCut.
NCut approaches the segmentation of an image as a graph-cutting problem. It constructs a locally connected graph over neighbouring pixels within an image: the vertices of the graph are pixels, and the edge weights are similarities between pixels. The goal of NCut is to find a cut that maximises within-segment similarity and minimises cross-segment similarity. Formally, supposing there is a graph G = (V, E), we aim to find an optimal cut that partitions it into two disjoint sets A and B, where A ∩ B = ∅ and A ∪ B = V. A measure of the quality of a cut is defined in Eq. 1:
\mathrm{Ncut}(A, B) = \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(A, V)} + \frac{\mathrm{cut}(A, B)}{\mathrm{assoc}(B, V)}
(1)
where \mathrm{cut}(A, B) = \sum_{u \in A, v \in B} w(u, v), w(u, v) is the weight of the edge between vertices u and v, and \mathrm{assoc}(A, V) = \sum_{u \in A, t \in V} w(u, t) is the total weight of the edges between the vertices in segment A and all vertices in graph G. Because graph G is locally connected, a binary column vector x of dimension |V| × 1 can be defined to indicate whether each vertex belongs to subset A. The goal of NCut is to find a cut that minimises Ncut(A, B), as Eq. 2 shows.
\min_{x} \mathrm{Ncut}(x)
(2)
According to [18], the solution to Eq. 2 captures a visual segmentation of an image whose underlying idea is naturally consistent with the clinical experience of skin biopsy image recognition. Eq. 2 can be solved as a standard Rayleigh quotient [19]; we omit the detailed procedure for brevity. The computational time complexity of NCut is O(n^{2}), where n is the number of pixels in the image.
The number of regions p is a parameter that must be set beforehand. Figure 4 shows the NCut outputs of the same image under different parameter settings. Parameter p affects the model performance to some extent; we examine this in the discussion section.
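As a sanity check on the graph formulation above, the two-way NCut relaxation can be sketched in a few lines of numpy: the second-smallest eigenvector of the symmetric normalized Laplacian, thresholded at zero, yields the bipartition. This is an illustrative toy on a four-vertex graph, not the segmentation implementation used in our experiments, which operates on pixel graphs and produces p segments.

```python
import numpy as np

def ncut_bipartition(W):
    """Bipartition a weighted graph via the NCut relaxation: take the
    second-smallest eigenvector of the symmetric normalized Laplacian
    D^{-1/2} (D - W) D^{-1/2} and threshold it at zero."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    eigvals, eigvecs = np.linalg.eigh(L)          # eigenvalues ascending
    # second-smallest eigenvector (the smallest is trivial), mapped back
    y = d_inv_sqrt * eigvecs[:, 1]
    return y >= 0                                  # boolean segment indicator

# Toy graph: vertices {0, 1} and {2, 3} are strongly connected internally
# and weakly connected across -- the cut should separate the two pairs.
W = np.array([[0.00, 1.00, 0.01, 0.01],
              [1.00, 0.00, 0.01, 0.01],
              [0.01, 0.01, 0.00, 1.00],
              [0.01, 0.01, 1.00, 0.00]])
labels = ncut_bipartition(W)
```

The sign of the eigenvector is arbitrary, so only the grouping (not which side is `True`) is meaningful.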
Feature extraction based on 2D-DWT
Previous work on skin image analysis has indicated that a good feature extraction method significantly affects model performance. Many problem-oriented feature expression methods have been proposed and proven successful in histopathology and dermatopathology [1]. However, feature extraction methods for skin biopsy images are seldom reported. Considering the variation in colour, rotation, magnification and even resolution across skin biopsy images, we propose a transformation-invariant feature extraction method based on the two-dimensional discrete wavelet transform (2D-DWT). The basic idea of the proposed feature extraction originated from [20, 17], which suggested applying the 2D-DWT in colour space to each block within a given image. We briefly describe the proposed feature extraction method as follows.

1. Input a local region IR generated by NCut. Note that regions generated by NCut are irregular. For convenience, we store them as minimum covering rectangles by padding the regions with black pixels, as indicated in Figure 5. This padding does not significantly affect model performance, as most of the padding pixels are discarded in later steps.

2. Colour space transformation. IR is an RGB representation and is transformed to LUV space, denoted IR_LUV. Calculate features f_{1} = mean(IR_LUV.L), f_{2} = mean(IR_LUV.U) and f_{3} = mean(IR_LUV.V).

3. Divide IR_LUV into squares of size m × m pixels, resulting in (width/m) × (height/m) blocks, denoted B_{pq}, where p = 1, ..., width/m and q = 1, ..., height/m. Eliminate blocks that are totally black, so as to remove as many padding pixels as possible.

4. Apply the 2D-DWT to each B_{pq}, and keep the coefficient subbands LH, HL and HH. Let t_{x} = \sqrt{\frac{1}{4} x^{T} x}, where x ∈ {LH, HL, HH}. Average t_{x} over all blocks within the region to obtain features f_{4}, f_{5}, f_{6}.

5. Following [20], calculate the normalized inertia of orders 1, 2 and 3 as features f_{7}, f_{8}, f_{9}.
After the above five steps, a 9-dimensional real-valued vector is obtained for each region. An image is thus transformed into a set of disjoint regions, each represented as a real feature vector, which turns the original dataset into a multi-instance representation. Note that this representation is transformation invariant: the 2D-DWT extracts texture features of regions that are insensitive to rotation angle and magnification, and the other features, the LUV means and the normalized inertia of orders 1, 2 and 3, are also transformation invariant. In the following section, we provide an in-depth discussion of the effectiveness of this feature extraction method.
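The block-level texture features f_{4}-f_{6} from steps 3-4 can be sketched as follows for a single colour channel. A hand-rolled one-level Haar transform stands in for a wavelet library, and the block size m = 4 (giving four coefficients per detail subband, matching the 1/4 factor in t_{x}) is an illustrative choice rather than a value fixed by the paper.

```python
import numpy as np

def haar_dwt2(block):
    """One-level 2D Haar DWT; returns the (LL, LH, HL, HH) subbands."""
    a = block[0::2, 0::2]; b = block[0::2, 1::2]
    c = block[1::2, 0::2]; d = block[1::2, 1::2]
    LL = (a + b + c + d) / 2.0
    LH = (a + b - c - d) / 2.0   # vertical detail
    HL = (a - b + c - d) / 2.0   # horizontal detail
    HH = (a - b - c + d) / 2.0   # diagonal detail
    return LL, LH, HL, HH

def block_texture_features(channel, m=4):
    """Average t_x = sqrt(mean(x^2)) of the LH, HL, HH coefficients over
    all m-by-m blocks of a single-channel region (features f4..f6)."""
    h, w = channel.shape
    feats = []
    for p in range(h // m):
        for q in range(w // m):
            blk = channel[p*m:(p+1)*m, q*m:(q+1)*m]
            _, LH, HL, HH = haar_dwt2(blk)
            feats.append([np.sqrt(np.mean(x ** 2)) for x in (LH, HL, HH)])
    return np.mean(feats, axis=0)   # -> [f4, f5, f6]

region = np.ones((8, 8))            # a perfectly flat region
f = block_texture_features(region)  # flat texture -> all detail energy is 0
```

A flat region yields zero detail energy, while any textured region (e.g., a checkerboard) produces positive values, which is the behaviour the features rely on.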
Feature extraction based on SIFT
The scale-invariant feature transform (SIFT) [21] is a well-studied feature extraction method widely used in medical image classification. Caicedo et al. [10] used SIFT to extract histopathological image features. We apply SIFT as our second feature extraction strategy. Unlike the 2D-DWT, SIFT has been proven to be a robust key-point selector across different image annotation and analysis applications. We use the common SIFT setting of 8 orientations and 4 × 4 blocks, resulting in a 128-dimensional vector. Intuitively speaking, SIFT selects several salient points to represent a given image. We apply SIFT to the NCut-generated regions to obtain a feature vector.
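The 8-orientation, 4 × 4-cell layout mentioned above is what yields SIFT's 128-dimensional descriptor. The sketch below reproduces only that histogram layout on a raw patch; key-point detection, Gaussian weighting and the trilinear interpolation of the real SIFT pipeline are omitted, and in practice a library implementation would be used.

```python
import numpy as np

def sift_like_descriptor(patch):
    """Build a 4x4-cell, 8-orientation-bin gradient histogram (128-D),
    mirroring SIFT's descriptor layout on a single square patch."""
    gy, gx = np.gradient(patch.astype(float))
    mag = np.hypot(gx, gy)                              # gradient magnitude
    ori = np.mod(np.arctan2(gy, gx), 2 * np.pi)         # orientation in [0, 2pi)
    bins = np.minimum((ori / (2 * np.pi) * 8).astype(int), 7)
    h, w = patch.shape
    ch, cw = h // 4, w // 4
    desc = np.zeros((4, 4, 8))
    for i in range(4):
        for j in range(4):
            sl = (slice(i * ch, (i + 1) * ch), slice(j * cw, (j + 1) * cw))
            # accumulate gradient magnitude into orientation bins per cell
            np.add.at(desc[i, j], bins[sl].ravel(), mag[sl].ravel())
    desc = desc.ravel()
    n = np.linalg.norm(desc)
    return desc / n if n > 0 else desc                  # unit-normalized

rng = np.random.default_rng(0)
d = sift_like_descriptor(rng.random((16, 16)))
```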
Model training
We propose two multi-instance learning algorithms to train our model. The first algorithm is based on Citation-KNN [22], and the second is a Bayesian multi-instance learning algorithm, namely Gaussian Process Multi-Instance Learning (GPMIL) [23]. Citation-KNN was first proposed by Wang et al. [22] and can be regarded as a multi-instance version of the traditional KNN classifier. To determine a test bag's label, Citation-KNN considers not only the K nearest labelled bags, i.e., references, but also the labelled bags that regard the given bag as a K nearest neighbour, i.e., citers. Citation-KNN is well studied and has many successful applications in machine learning. GPMIL introduces a Gaussian process prior and solves the multi-instance learning problem in a Bayesian learning framework. The essential idea of GPMIL is that, by defining a set of latent variables and a likelihood function, it establishes the relationship between class labels and instances in a probabilistic framework. By imposing a Gaussian process prior on these latent variables, we can use a Bayesian learning strategy to derive a posterior distribution of annotation terms given a training dataset and a test image.
We extend these two algorithms to meet the requirements of our annotation task, taking into consideration some insights into skin biopsy image annotation. On the one hand, because there is no prior knowledge on which to base more specific multi-instance learning assumptions [24] for our task, we build our model on the original assumption [15]; Citation-KNN with a properly defined similarity metric is a simple but effective algorithm in this case. On the other hand, a confidence level for each term to be annotated to a given image is preferred, which requires us to model the predictive distribution of annotation terms. To achieve this goal, we extend Bayesian learning to the multi-instance setting and model the posterior distribution of the annotation terms. An additional benefit of the Bayesian learning framework is that it is possible to model correlation between annotation terms, leading to a more general model.
Citation-KNN for annotation
Citation-KNN is a multi-instance learning algorithm inspired by the citation and reference system of the scientific literature. To determine the label of a test bag X, it considers not only the neighbours (references) of X but also the bags (citers) that regard X as a neighbour, and uses both to determine the unseen bag's label. The key problem is how to evaluate distances between bags in order to identify references and citers.
Citation-KNN implements a simple idea: if two images A and B share the same term, they should regard each other as neighbours under a properly defined similarity measure, i.e., B is one of the K nearest neighbours of A and vice versa. In our work, a modified version of the Hausdorff distance [25] is used as the similarity measure, given by
\mathrm{AHD}(A, B) = \frac{\sum_{a \in A} \min_{b \in B} d(a, b) + \sum_{b \in B} \min_{a \in A} d(b, a)}{|A| + |B|}
(3)
where AHD is the average Hausdorff distance between two bags A and B, a and b are instances in each bag, and d(x, y) is the Euclidean distance in instance space. As indicated in [25], the AHD achieves better performance than other set distance functions in multi-instance learning. Intuitively, the AHD is the average minimal distance between instances from the two bags, which better evaluates the spatial relationship between a pair of bags.
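Eq. (3) is straightforward to implement; the following numpy sketch computes the AHD between two bags of instance vectors on toy data.

```python
import numpy as np

def ahd(A, B):
    """Average Hausdorff distance (Eq. 3) between bags A and B, each an
    (n_instances, n_features) array, using Euclidean instance distance."""
    # pairwise Euclidean distances: D[i, j] = d(A[i], B[j])
    D = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    # average of the per-instance minimal distances in both directions
    return (D.min(axis=1).sum() + D.min(axis=0).sum()) / (len(A) + len(B))

bag1 = np.array([[0.0, 0.0], [1.0, 0.0]])   # two instances in 2-D
bag2 = np.array([[5.0, 0.0], [6.0, 0.0]])   # a distant bag
```

Unlike the classical Hausdorff distance, which takes a max over the minimal distances, the average makes the measure less sensitive to a single outlying instance.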
Note that Citation-KNN is a memory-based algorithm: all training samples must be stored, no training procedure is required, and, when testing a given image, the AHD must be computed between the test image and all training samples. To reduce this computational cost, we define a locality matrix LM to speed up the algorithm as follows.

1. Cluster the training set D to obtain s clusters and denote the centroid of each cluster as c_{i}, i = 1, ..., s.

2. Compute the AHD between each training sample and each centroid c_{i}, and keep the K nearest training samples for each c_{i} in the i-th row of LM.
Thus we obtain an s-by-K locality matrix LM. When testing an image, we first calculate the distances between the centroids and the given image, then discard the centroids that are far from it. For the remaining centroids, we look up the corresponding rows of LM; only the training samples in those rows are needed in the distance computation. We can thus prune out a large portion of the training samples that are far away from the test image, which greatly reduces the computational cost. The matrix needs to be computed only once before testing, at cost O(n^{2}), where n = |D| is the size of the training set.
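The two steps above can be sketched as follows. For illustration, the centroids are assumed to be chosen among the training bags, and the pairwise bag distances (the AHD in our setting) are assumed to be precomputed; a simple 1-D toy distance matrix stands in for real data.

```python
import numpy as np

def build_locality_matrix(dist, centroid_idx, K):
    """Precompute the s-by-K locality matrix LM: row i holds the indices
    of the K training samples nearest to centroid i.  `dist` is an
    n-by-n matrix of precomputed bag-to-bag distances."""
    LM = np.empty((len(centroid_idx), K), dtype=int)
    for i, c in enumerate(centroid_idx):
        LM[i] = np.argsort(dist[c])[:K]    # K nearest training samples
    return LM

# toy example: 6 training bags placed on a line at x = 0 .. 5,
# with samples 0 and 5 standing in for the two cluster centroids
x = np.arange(6.0)
dist = np.abs(x[:, None] - x[None, :])
LM = build_locality_matrix(dist, centroid_idx=[0, 5], K=3)
```

At test time, only the rows of LM belonging to centroids near the test image are consulted, so most bag-to-bag AHD computations are skipped.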
GPMIL
We propose a Bayesian learning algorithm with a Gaussian process prior for our annotation task. Following [23], we first introduce an unobserved latent function g(x), defined on the instance space, for each annotation term t, such that for each instance x, g(x) gives a probability indicating the confidence that x should be annotated with term t. We further impose a Gaussian process prior on g over the whole instance space. Let G = {g(x_{i}) | i = 1, ..., n_{inst}}, where n_{inst} denotes the size of the instance space. We have G ~ N(0, K) as a Gaussian process prior [26], where K is the Gram matrix of some well-known kernel evaluated on all instance pairs. To establish the connection between g(x) and the annotated terms of images, a likelihood function is defined according to the basic multi-instance assumption [15] as Eq. (4):
p(t \mid G_{B}) = p(t \mid g(x_{1}), \cdots, g(x_{|B|})) = \max_{j} g(x_{j})
(4)
where G_{B} represents the outputs of g(x) for all instances in bag B, and |B| is the size of bag B. For mathematical convenience, the softmax function is used instead of the max, thus we have
p(t \mid G_{B}) = \max_{j} g(x_{j}) \approx \frac{1}{\alpha} \ln \sum_{x \in B} e^{\alpha g(x)}
(5)
where α is an amplifying factor of the softmax function. If the largest g(x_{j}) over all j is less than 0.5, bag B is not annotated with term t, because p(t | G_{B}) < 0.5. The joint likelihood function on the whole training set D can be written as
p(T \mid G_{D}) = \prod_{B \in D} p(t \mid G_{B})
(6)
where T is a Boolean vector indicating whether each bag B in D is annotated with term t. However, we are concerned with the label of a test bag, not G_{B} or G_{D} themselves. Following Bayes' rule, the posterior distribution over G_{D} given the training dataset D and term t can be written as
p(G_{D} \mid D, T) = \frac{p(T \mid G_{D}) \, p(G_{D})}{p(T \mid D)}
(7)
where p(T | G_{D}) is the joint likelihood defined in Eq. (6), p(G_{D}) is the Gaussian process prior and p(T | D) is the marginal likelihood given by
p(T \mid D) = \int p(T \mid G_{D}) \, p(G_{D}) \, dG_{D}
(8)
With Eqs. (7) and (8), we can further obtain the predictive distribution for annotating a test bag X with term t as
p(t \mid D, T, X) = \int p(t \mid G_{X}, X) \, p(G_{X} \mid D, T, X) \, dG_{X}
(9)
where, on the right-hand side of Eq. (9), p(t | G_{X}, X) is the likelihood function of the test image X, and p(G_{X} | D, T, X) is the posterior distribution of the latent variables G_{X}, given by p(G_{X} \mid D, T, X) = \int p(G_{X} \mid G_{D}, D, X) \, p(G_{D} \mid D, T) \, dG_{D}. For each test image X, using the whole training dataset and the corresponding annotation vector T, we obtain a predictive distribution that is a function of X and t. Effective methods for solving Eq. (9) can be found in [27, 23].
To make the idea of GPMIL clearer, we provide an example as follows:

1. Suppose we have a training image set D associated with a binary annotation vector for term t, and a test image X.

2. Following Eqs. (4) and (6), calculate the likelihood function for the training set D.

3. Following Eqs. (7), (8) and (9), write down the analytical form of the predictive distribution for X.

4. Use an approximation method to transform the predictive distribution into a Gaussian distribution that can be handled analytically. After this step, a closed-form solution is obtained for testing any unseen image; in other words, the training set can be discarded in the testing step.
For each annotation term t, a model is trained by using GPMIL. For a test image, each model calculates a probability indicating the confidence of annotating the image with the corresponding term.