Automated skin biopsy histopathological image annotation using multi-instance representation and learning
BMC Medical Genomics volume 6, Article number: S10 (2013)
Abstract
With digitisation and the development of computer-aided diagnosis, histopathological image analysis has attracted considerable interest in recent years. In this article, we address the problem of the automated annotation of skin biopsy images, a special type of histopathological image analysis. In contrast to previous well-studied methods in histopathology, we propose a novel annotation method based on a multi-instance learning framework. The proposed framework first represents each skin biopsy image as a multi-instance sample using a graph-cutting method, decomposing the image into a set of visually disjoint regions. Then, we construct two classification models using multi-instance learning algorithms, one of which provides determinate results while the other calculates a posterior probability. We evaluate the proposed annotation framework using a real dataset containing 6691 skin biopsy images, with 15 properties as target annotation terms. The results indicate that the proposed method is effective and medically acceptable.
Background
With the rapid development of computer-aided diagnosis, increasingly more digital data have been stored electronically. It has been a great challenge for doctors and experts to effectively analyse these data. Introducing the power of computational intelligence into this analysis problem would be meaningful and practical, with the potential not only to ease the burden of doctors but also to save time so that doctors and experts can pay more attention to confusing and difficult cases [1].
In skin disease diagnosis, histopathological data provide a microscopic view of skin tissue architecture, which contributes to the correct diagnosis of skin diseases. Microscopic analysis of skin tissue provides further information about what happens under the skin's surface. To confirm a skin disease, on the one hand, doctors should have a clear understanding of the patient's medical history and careful observations of the skin eruption. On the other hand, histopathological data are of great necessity. For example, different patients may appear to have the same rash; however, differences in their histopathological data can distinguish them and aid in diagnosis. Histopathological data provide a comprehensive view of the presence of disease and its effects on patients. Some skin diseases, especially benign skin tumours and skin cancer, should be diagnosed using histopathological information. The information we extract from the data can help a doctor judge a patient's condition, estimate the prognosis, direct treatment, and evaluate the curative effects of treatments. For undiagnosed disease, complete histopathological data can provide an initial assessment of a condition's nature and severity.
Generally, there are two levels of skin disease diagnosis: skin surface inspection [2] and skin biopsy image analysis [3]. The former is a diagnostic procedure that yields a rough diagnosis after routine exams, including observation and the physical examination of skin lesions, whereas the latter complements the former [4, 5] and is used in cases where the doctor has less confidence or cannot make a decision based only on an inspection of the skin surface. As indicated in histopathological studies, skin biopsy images reveal further information about what happens beneath the skin's surface at a microscopic level [4, 6]. Therefore, the results of skin biopsy image analysis can be interpreted more accurately than observations of the surface. For a medically acceptable diagnosis, many skin biopsy image cases are usually required to identify the significant changes associated with that specific diagnosis and differentiate them from those of similar skin diseases [7]. Because understanding skin biopsy images requires more professional knowledge and richer experience [8] than inspecting the skin's surface, it is a great challenge for doctors to correctly interpret the huge number of skin biopsy images.
An important step in skin biopsy image analysis is to annotate an image with a set of standard terms as a professional description of what is happening in the tissues. Due to the large number of biopsy images, computer-aided automated annotation methods have been investigated [1]. However, the task of automating skin biopsy image annotation poses at least two significant challenges. The first is the implicit ambiguity between annotation words and images. In clinical practice, a doctor recognises skin biopsy images based on his or her expertise without explicitly attaching annotation terms to the exactly corresponding regions. What we can obtain is a whole image associated with a set of annotation terms, as indicated in Figure 1. The ambiguity also appears in the relationship between the numbers of terms and corresponding regions. Figure 2 illustrates one-to-one, one-to-many, many-to-one and many-to-many relationships between terms and regions. The second challenge is the complexity and variety of local regions annotated with the same term. The complexity and variety comprise differences in the texture, shape, size, colour, magnification and even resolution of local regions, as shown in Figure 1. Two images in a row may share the same annotation terms but have totally different appearances. Hence, it is a great challenge to construct an automated annotation model that captures the essential features of the terms.
Several attempts to address the automated histopathological image analysis problem have been reported. Gurcan et al. [1] reviewed important work on histopathological data analysis, covering the processing of different information sources, segmentation and feature extraction methods for different application backgrounds, and model training algorithms. Syed et al. [9] presented an analysis of feature extraction methods for bag-of-features representations of histopathological images. Caicedo et al. [10] proposed a histopathological image classification method based on bag-of-features and a kernel-function-based model training algorithm. They approached the skin cancer histopathology image classification problem by representing images through bag-of-features methods. However, they solved the problem as a traditional single-instance learning problem [11] with a kernel machine. Though widely used in histopathological image feature extraction, bag-of-features representations do not, in fact, reveal the inner structures of histopathological images and, most importantly, lose original information to some extent [12].
Much work on skin image recognition has been published. We review two important works closely related to ours here. Bunte et al. [13] proposed a novel machine learning method for skin surface image classification. They noticed that existing skin surface image feature extraction methods differ only in how they weight the colour space. Hence, if an optimal weighting strategy is learned from the training dataset, it can achieve very good performance. In their work, an optimal weight vector is learned through a maximal margin classification algorithm; that is, instead of searching for a proper weighting by hand, they derive one from the data. However, their method is not suitable for our task. On the one hand, in their work, manual labelling of normal and lesion regions is required for each skin surface image. Because understanding a skin biopsy image requires more skill and expertise than understanding a skin surface image, this requirement would be a heavy burden for doctors. On the other hand, in the work of Bunte et al., only RGB colour space-based features are used, which cannot fully describe the essential features of biopsy images, e.g., texture, local structures and even visual edges. Moreover, biopsy images are often stained for clearer illustration of tissue structures and different types of cells, which would lead to the failure of purely colour-based feature extraction methods.
Another work that should be emphasised is on Drosophila gene image annotation, proposed by Li et al. [12]. They addressed the problem of the automated annotation of Drosophila embryogene expression patterns in a multi-instance multi-label learning (MIML) framework [14]. Annotation terms are associated with groups of images corresponding to different embryogene developmental stages, but more specifically, the terms are in fact associated with some patches within the group of images. They solve the problem by regarding each image group as a multi-instance sample and the annotated terms as labels attached to the sample. They proposed two MIML algorithms for model training. To express a group of images as a bag, they adopt a block division method to generate equal-size patches as instances. Though the general framework of [12] is consistent with our task, it is not naturally suited to skin biopsy image annotation: Drosophila embryogene images do not contain complex inner structures, textures or colours, whereas skin biopsy images do. Therefore, equal-size block division does not make sense for our task.
In this article, we propose a novel automated annotation framework based on the theory of multi-instance learning. Multi-instance learning is a special learning framework introduced by Dietterich et al. [15] to solve the drug activity prediction problem. Unlike in single-instance learning, samples in multi-instance learning (also called bags) are composed of several instances with potential concept labels, and only the concept labels of the bags are known. For binary classification tasks, a bag is positive if and only if it contains at least one positive instance, and negative otherwise. The task of multi-instance learning is to predict the labels of unseen bags by training a model with labelled bags.
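The bag-labelling rule stated above can be written down in a few lines. The following is a minimal sketch for illustration only; the function name `bag_label` and the example bags are hypothetical, and in real multi-instance data the instance labels are hidden, so this rule describes how bag labels arise rather than something that can be computed at training time.

```python
from typing import Sequence


def bag_label(instance_labels: Sequence[bool]) -> bool:
    """Standard multi-instance assumption: a bag is positive
    if and only if at least one of its instances is positive."""
    return any(instance_labels)


# Hypothetical instance labels, shown only to illustrate the rule.
positive_bag = [False, True, False]   # one positive instance
negative_bag = [False, False, False]  # no positive instance

print(bag_label(positive_bag), bag_label(negative_bag))
```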
We first show that the skin biopsy image annotation task can naturally be decomposed into several binary multi-instance classification tasks. Then, by applying a graph-cutting algorithm and region-based feature extraction methods, we propose an effective method of expressing each skin biopsy image as a bag whose instances are regions. Finally, we propose two algorithms for model building. One is discriminative and produces a binary output indicating whether a given image should be annotated with a certain term. The other models the conditional distribution $p(t_i \mid I, D)$ to calculate the posterior probability of annotating an image $I$ with a term $t_i$, given a training dataset $D$.
Methods
In this section, we first show the intuition behind the proposed algorithm framework and then, following Gurcan et al.'s proposal [1], present the framework in three steps:

1. Multi-instance sample representation
2. Feature extraction
3. Training of learning algorithms
Figure 3 illustrates the framework of the above three steps. We should note that the proposed framework is adaptable and flexible because it only provides a general framework and different implementations can be replaced according to the application domain.
Formulation
The proposed annotation framework is motivated by the nature of skin biopsy image recognition, which can be naturally expressed as a multi-instance learning problem. To make this intuition clearer, it is necessary to review the procedure of manually annotating skin biopsy images. In dermatopathological clinical practice, doctors use a set of standard terms to annotate an image. However, doctors are not required to explicitly record the correspondence between standard terms and regions within a given image, leading to the term ambiguity described in the previous section. Because terms are actually associated with certain local regions, it is not reasonable to connect each region of an image to all associated terms; doing so results in poor models from a machine learning perspective [16]. As illustrated in Figures 2(a)-(d), regions within a given image may have different relationships to the attached terms. It is time-consuming to manually label each region with a set of terms to meet the requirement of traditional single-instance learning. For this reason, by regarding each image as a bag and regions within the image as instances, multi-instance learning is naturally suitable for the annotation task. According to the basic assumption of multi-instance learning [15], a bag can be annotated with a term if it contains at least one region labelled with that term. Otherwise, the bag cannot be annotated with that term. Thus, we can build a set of binary multi-instance classifiers, each of which corresponds to a term. Given an image, each classifier outputs a Boolean value indicating whether its term should be annotated to the image. Thereby, we can address the term ambiguity within a multi-instance learning framework.
Another challenge is how to effectively represent an image as a multi-instance sample, or a bag. The key problem is how to partition an image into several regions to construct instances. Skin tissue is microscopically composed of several different structures, and a doctor needs to inspect them individually to determine abnormal areas. Regions of a skin biopsy image should be divided according to the structures of skin tissue to obtain a feature description for each part, but clustering-based algorithms [17] may not generate contiguous regions. Hence, we apply an image-cutting algorithm, namely Normalized Cut (NCut) [18], to generate visually disjoint local regions. Prior knowledge in dermatopathology suggests that, on the one hand, examining an individual visually disjoint region is sufficient to annotate it in most cases and, on the other hand, there is no considerable relationship between the terms to be annotated in a given image. The former supports the application of our image-cutting method, and the latter allows us to decompose the annotation task into a set of multi-instance binary classification tasks.
Formally, let $D = \{(I_i, T_i) \mid i = 1, \ldots, n,\ I_i \in I,\ T_i \subseteq T\}$ be a set of skin biopsy images associated with a set of annotated terms, where $T = \{t_1, t_2, \ldots, t_m\}$ is a set of standard terms for annotation and $I$ is a set of images. Each image is stored as a pixel matrix in 24k RGB colour space. The task is to learn a function $f : I \to 2^{T}$ given $D$. When given an unseen image $I_x$, $f$ outputs a subset of $T$ corresponding to the annotation terms of $I_x$.
We first apply a cutting algorithm to generate visually disjoint regions for each image, given by $I_i = \{I_{ij} \mid j = 1, \ldots, n_i\}$, where $n_i$ is the number of regions in image $I_i$, followed by a feature extraction procedure to express each generated region as a feature vector. Then, we train the target model through two algorithms.
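As an illustration of this decomposition, the sketch below turns per-image term sets into one Boolean target vector per term and assembles the output of $f$ from a collection of per-term binary classifiers. It is only a sketch under stated assumptions: the names `binary_targets` and `annotate`, the term names `t1` and `t2`, and the toy classifiers are hypothetical and are not part of the proposed method.

```python
from typing import Callable, Dict, List, Sequence, Set

# One image is a bag: a list of region feature vectors.
Bag = List[Sequence[float]]


def binary_targets(annotations: List[Set[str]],
                   terms: List[str]) -> Dict[str, List[bool]]:
    """For each term, build the Boolean label vector over all images."""
    return {t: [t in ann for ann in annotations] for t in terms}


def annotate(bag: Bag,
             classifiers: Dict[str, Callable[[Bag], bool]]) -> Set[str]:
    """Output of f: the subset of terms whose binary classifier fires."""
    return {t for t, clf in classifiers.items() if clf(bag)}


# Hypothetical toy data: three images annotated with subsets of two terms.
terms = ["t1", "t2"]
annotations = [{"t1", "t2"}, {"t2"}, set()]
targets = binary_targets(annotations, terms)

# Hypothetical stand-ins for trained multi-instance classifiers.
toy_classifiers = {
    "t1": lambda bag: len(bag) > 2,
    "t2": lambda bag: True,
}
print(targets)
print(annotate([[0.1, 0.2], [0.3, 0.4]], toy_classifiers))
```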
Skin biopsy image representation
We now present a method for representing a skin biopsy image. First, we express each image as a bag whose instances are regions; then we apply two transformation-invariant feature extraction methods to express the regions as vectors.
Multi-instance sample representation
To generate visually disjoint regions, we adopt a well-known graph-cutting algorithm, Normalized Cut (NCut), proposed by Shi et al. [18] in 2000, which aims to extract perceptual groupings from a given image. In contrast with clustering-based image segmentation algorithms, e.g., [17], NCut extracts the global impression of a given image, i.e., disjoint visual groupings. To make this article self-contained, we briefly present the main idea of NCut.
NCut approaches the segmentation of an image as a graph-cutting problem. It constructs a graph with local connections between neighbouring pixels within an image: the vertices of the graph are pixels, and the edge weights are the similarities between pixels. The goal of NCut is to find a cut that minimises cross-segment similarity while maximising within-segment similarity. Formally, given a graph $G = (V, E)$, we aim to find an optimal cut that partitions it into two disjoint sets $A$ and $B$, where $A \cap B = \emptyset$ and $A \cup B = V$. The measure of an optimal graph cut is defined in Eq. 1:
$$Ncut(A,B) = \frac{cut(A,B)}{assoc(A,V)} + \frac{cut(A,B)}{assoc(B,V)} \qquad (1)$$
where $cut(A,B)=\sum_{u\in A,\,v\in B}w(u,v)$, $w(u, v)$ is the weight of the edge between vertices $u$ and $v$, and $assoc(A,V)=\sum_{u\in A,\,t\in V}w(u,t)$ is the summed weight of the edges between the vertices in segment $A$ and all vertices in graph $G$. Because graph $G$ is locally connected, a binary column vector $x$ of size $|V| \times 1$ can be defined to indicate whether each vertex belongs to subset $A$. The goal of NCut is to find a cut that minimises $Ncut(A, B)$, as Eq. 2 shows:
$$\min_{x} Ncut(x) \qquad (2)$$
According to [18], the solution to Eq. 2 captures a visual segmentation of an image, an idea that is naturally consistent with the clinical experience of skin biopsy image recognition. Eq. 2 can be solved as a standard Rayleigh quotient [19]. We omit the detailed procedure for brevity. The computational time complexity of NCut for a given image is $O(n^{2})$, where $n$ is the number of pixels in the image.
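To make the normalised-cut criterion concrete, the sketch below evaluates it on a tiny hand-made graph and finds the best bipartition by exhaustive search. This is only an illustration of the criterion, assuming the standard definition from [18], $Ncut(A,B) = cut(A,B)/assoc(A,V) + cut(A,B)/assoc(B,V)$; it is not the eigenvector-based solver used in practice, and the weight matrix `W` and function names are hypothetical.

```python
from itertools import combinations
from typing import List, Set, Tuple


def ncut_value(W: List[List[float]], A: Set[int]) -> float:
    """Normalised-cut value of the bipartition (A, V minus A).

    Assumes every vertex has at least one edge, so assoc() is non-zero.
    """
    n = len(W)
    B = set(range(n)) - A
    cut = sum(W[u][v] for u in A for v in B)
    assoc_a = sum(W[u][t] for u in A for t in range(n))
    assoc_b = sum(W[u][t] for u in B for t in range(n))
    return cut / assoc_a + cut / assoc_b


def best_bipartition(W: List[List[float]]) -> Tuple[float, Set[int]]:
    """Exhaustive search over bipartitions; only feasible for tiny graphs."""
    n = len(W)
    best_value, best_set = float("inf"), set()
    for size in range(1, n // 2 + 1):
        for combo in combinations(range(n), size):
            value = ncut_value(W, set(combo))
            if value < best_value:
                best_value, best_set = value, set(combo)
    return best_value, best_set


# Hypothetical graph: two tightly connected triangles joined by one weak edge.
W = [
    [0.0, 1.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, 1.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.1, 0.0, 0.0],
    [0.0, 0.0, 0.1, 0.0, 1.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 0.0, 1.0],
    [0.0, 0.0, 0.0, 1.0, 1.0, 0.0],
]
value, segment = best_bipartition(W)
print(value, segment)
```

On this graph the search separates the two triangles, since cutting the single weak edge costs far less than cutting inside a triangle.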
The number of regions p is a parameter to be set beforehand. Figure 4 shows the NCut outputs of the same image with different parameter settings. Parameter p will affect the model performance to some extent. We will present this in the discussion section.
Feature extraction based on 2D-DWT
Previous work on skin image analysis has indicated that a good feature extraction method significantly affects model performance. Many problem-oriented feature expression methods have been proposed and proven to be successful in histopathology and dermatopathology [1]. However, feature extraction methods for skin biopsy images are seldom reported. Considering the variation of colour, rotation, magnification and even resolution in skin biopsy images, we propose a transformation-invariant feature extraction method based on the two-dimensional discrete wavelet transform (2D-DWT). The basic idea of the proposed feature extraction originated from [20, 17], which suggested applying 2D-DWT in colour space to each block within a given image. We briefly describe the proposed feature extraction method as follows.

1. Input a local region IR generated by NCut. Note that regions generated by NCut are irregular. For convenience, we store them as minimum covering rectangles by padding the regions with black pixels, as indicated in Figure 5. This padding does not significantly affect model performance, as most of the padding pixels are discarded in later steps.
2. Colour space transformation. IR is an RGB image and is converted to LUV space, denoted as IR_LUV. Calculate features $f_1$ = mean(IR_LUV.L), $f_2$ = mean(IR_LUV.U) and $f_3$ = mean(IR_LUV.V).
3. Divide IR_LUV into squares of size m × m pixels, resulting in (width/m) × (height/m) blocks, denoted as $B_{pq}$, where p ∈ {1, ..., width/m} and q ∈ {1, ..., height/m}. Eliminate blocks that are totally black, so as to remove as many padding pixels as possible.
4. Apply 2D-DWT to each $B_{pq}$ and keep the coefficients LH, HL and HH. Let $t_x = \sqrt{\frac{1}{4}x^{T}x}$, where x ∈ {LH, HL, HH}. Average $t_x$ over all blocks within a region to obtain features $f_4$, $f_5$, $f_6$.
5. Following [20], calculate the normalized inertia of orders 1, 2 and 3 as features $f_7$, $f_8$, $f_9$.
After the above five steps, a 9-dimensional real vector is obtained for each region. An image is thus transformed into a set of disjoint regions, each represented as a real feature vector, and the original dataset is turned into a multi-instance representation. Note that this representation is invariant to transformation, as 2D-DWT extracts texture features of regions that are independent of rotation angle and magnification. The other features, LUV mean and normalized inertia of orders 1, 2 and 3, are also transformation-invariant. In the following section, we will provide an in-depth discussion of the effectiveness of this feature extraction method.
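The texture part of the procedure (step 4) can be sketched as follows. The text does not state which wavelet is used, so this sketch assumes a one-level orthonormal Haar transform applied to a single channel of each block; the sub-band naming (LH, HL) follows one common convention and is also an assumption, as are all function names. With m = 4, each sub-band holds four coefficients, so the factor 1/4 makes $t_x$ the root mean square of the coefficients in that band.

```python
import math
from typing import List, Sequence, Tuple

Block = List[List[float]]  # m x m values of one colour channel, m even


def haar_dwt2(block: Block) -> Tuple[List[float], ...]:
    """One-level orthonormal 2-D Haar transform (assumed wavelet).

    Returns the LL, LH, HL and HH sub-bands as flat coefficient lists.
    """
    m = len(block)
    ll, lh, hl, hh = [], [], [], []
    for r in range(0, m, 2):
        for c in range(0, m, 2):
            a, b = block[r][c], block[r][c + 1]
            d, e = block[r + 1][c], block[r + 1][c + 1]
            ll.append((a + b + d + e) / 2.0)
            lh.append((a + b - d - e) / 2.0)  # difference between rows
            hl.append((a - b + d - e) / 2.0)  # difference between columns
            hh.append((a - b - d + e) / 2.0)  # diagonal difference
    return ll, lh, hl, hh


def band_energy(x: Sequence[float]) -> float:
    """t_x = sqrt(x^T x / 4), as in step 4."""
    return math.sqrt(sum(v * v for v in x) / 4.0)


def texture_features(blocks: List[Block]) -> Tuple[float, float, float]:
    """Average t_LH, t_HL, t_HH over the blocks of a region (f4, f5, f6)."""
    sums = [0.0, 0.0, 0.0]
    for block in blocks:
        _, lh, hl, hh = haar_dwt2(block)
        for i, band in enumerate((lh, hl, hh)):
            sums[i] += band_energy(band)
    return tuple(s / len(blocks) for s in sums)


# Hypothetical 4 x 4 blocks: a flat patch and a patch with vertical stripes.
flat = [[5.0] * 4 for _ in range(4)]
striped = [[0.0, 2.0, 0.0, 2.0] for _ in range(4)]
print(texture_features([flat]), texture_features([striped]))
```

The flat patch has no detail energy in any band, while the striped patch responds only in the band that measures differences between columns.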
Feature extraction based on SIFT
Scale-invariant feature transform (SIFT) [21] is a well-studied feature extraction method widely used in the study of medical image classification. Caicedo et al. [10] used SIFT to extract histopathological image features. We apply SIFT as our second feature extraction strategy. Unlike 2D-DWT, SIFT has been proven to be a robust key point selector in different image annotation and analysis applications. We use the common setting of SIFT, in which 8 orientations and 4 × 4 blocks are used, resulting in a 128-dimensional vector representation. Intuitively speaking, SIFT selects several salient points to represent a given image. We apply SIFT to the NCut-generated regions to obtain a feature vector.
Model training
We propose two multi-instance learning algorithms to train our model. The first algorithm is based on Citation-KNN [22], and the second is a Bayesian multi-instance learning algorithm, namely Gaussian Process Multi-Instance Learning (GPMIL) [23]. Citation-KNN was first proposed by Wang et al. [22] and can be regarded as a multi-instance version of traditional KNN classifiers. To determine a given test bag's label, Citation-KNN considers not only the K nearest labelled bags, i.e., references, but also labelled bags that regard the given bag as a K nearest neighbour, i.e., citers. Citation-KNN is well studied and has many successful applications in machine learning. GPMIL introduces a Gaussian process prior and solves the multi-instance learning problem in a Bayesian learning framework. The essential idea of GPMIL is that by defining a set of latent variables and a likelihood function, it establishes the relationship between class labels and instances in a probabilistic framework. By imposing a Gaussian process prior on these latent variables, we can use a Bayesian learning strategy to derive a posterior distribution of annotation terms given a training dataset and a test image.
We extend these two algorithms to meet the requirements of our annotation task, taking into consideration some insights into skin biopsy image annotation. On the one hand, because there is no prior knowledge on which to base alternative multi-instance learning assumptions [24] for our task, we build our models on the original assumption [15]. Citation-KNN with a properly defined similarity metric is a simple but effective algorithm in this case. On the other hand, a confidence level for annotating a given image with a term is desirable, which requires us to model the predictive distribution of annotation terms. To achieve this goal, we extend Bayesian learning to the multi-instance setting and model the posterior distribution of the annotation terms. An additional benefit of the Bayesian learning framework is that it makes it possible to model correlations between annotation terms, leading to a more general model.
Citation-KNN for annotation
Citation-KNN is a multi-instance learning algorithm inspired by the citation and reference system in scientific literature. To determine the label of a test bag X, it considers not only the neighbours (references) of X but also the bags (citers) that regard X as a neighbour. Citation-KNN uses both references and citers to determine an unseen bag's concept label. The key problem is how to evaluate distances between bags to identify references and citers.
Citation-KNN implements a simple idea: if two images A and B share the same term, they should regard each other as neighbours under a properly defined similarity measure, i.e., B is one of the K nearest neighbours of A and vice versa. In our work, a modified version of the Hausdorff distance [25] is used as the similarity measure, which is given by
$$AHD(A,B) = \frac{\sum_{a \in A} \min_{b \in B} d(a,b) + \sum_{b \in B} \min_{a \in A} d(b,a)}{|A| + |B|} \qquad (3)$$
where AHD measures the average Hausdorff distance between two bags $A$ and $B$, and $a$, $b$ are instances in each bag. $d(x, y)$ is the Euclidean distance function in instance space. As indicated in [25], AHD achieves better performance than other set distance functions in multi-instance learning. Intuitively, AHD is the average minimal distance between instances from two bags, which better evaluates the spatial relationship between a pair of bags.
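A minimal sketch of this bag distance follows. It assumes the usual form of the average Hausdorff distance, in which every instance of either bag is matched to its nearest instance in the other bag and the matched distances are averaged over $|A| + |B|$; the function names and example bags are hypothetical.

```python
import math
from typing import List, Sequence

Bag = List[Sequence[float]]  # a bag is a list of instance feature vectors


def euclidean(x: Sequence[float], y: Sequence[float]) -> float:
    """Euclidean distance d(x, y) in instance space."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def ahd(bag_a: Bag, bag_b: Bag) -> float:
    """Average Hausdorff distance between two non-empty bags (assumed form)."""
    a_to_b = sum(min(euclidean(a, b) for b in bag_b) for a in bag_a)
    b_to_a = sum(min(euclidean(b, a) for a in bag_a) for b in bag_b)
    return (a_to_b + b_to_a) / (len(bag_a) + len(bag_b))


# Hypothetical bags of 2-D instances.
bag_a = [(0.0, 0.0), (1.0, 0.0)]
bag_b = [(0.0, 0.0)]
print(ahd(bag_a, bag_b))
```

Unlike the classical (maximal) Hausdorff distance, a single outlying instance only shifts this average slightly, which is one plausible reason for its robustness on bags of image regions.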
Note that Citation-KNN is a memory-based algorithm, meaning that all training samples must be stored when testing a given image and that no training procedure is required. When testing, AHD must be computed between the test image and all training samples. To reduce the computational cost, we define a locality matrix LM to speed up the algorithm as follows.

1. Cluster the training set $D$ to obtain $s$ clusters, and denote the centroid of each cluster as $c_i$, $i = 1, \ldots, s$.
2. Compute the AHD distance between each training sample and each centroid $c_i$, and keep the K nearest training samples for each $c_i$ in the $i$th row of LM.
Thus we obtain an s-by-K locality matrix LM. When testing an image, we first calculate the distance between each centroid and the given image and then discard the centroids that are far from the given image. For the remaining centroids, we look up the corresponding rows of LM; only the training samples stored in those rows are needed in the distance computation. We can thereby prune a large portion of the training samples that are far away from the test image, which greatly reduces the computational cost. The matrix needs to be computed only once before testing, with cost $O(n^{2})$, where $n = |D|$ is the size of the training set.
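The reference and citer voting described in this subsection can be sketched as below. This is a simplified illustration, not the implementation used in the evaluation: it omits the locality-matrix pruning, it assumes the average Hausdorff form of the bag distance, and the parameter names `n_refs` and `n_citers`, the tie-breaking rule (ties go to the negative class) and the toy data are all assumptions.

```python
import math
from typing import List, Sequence

Bag = List[Sequence[float]]


def ahd(bag_a: Bag, bag_b: Bag) -> float:
    """Average Hausdorff distance between two non-empty bags (assumed form)."""
    def d(x, y):
        return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))
    a_to_b = sum(min(d(a, b) for b in bag_b) for a in bag_a)
    b_to_a = sum(min(d(b, a) for a in bag_a) for b in bag_b)
    return (a_to_b + b_to_a) / (len(bag_a) + len(bag_b))


def citation_knn_predict(test_bag: Bag, train_bags: List[Bag],
                         train_labels: List[bool],
                         n_refs: int = 2, n_citers: int = 2) -> bool:
    """Vote among references and citers of the test bag."""
    n = len(train_bags)
    d_test = [ahd(test_bag, bag) for bag in train_bags]
    # References: the n_refs training bags nearest to the test bag.
    refs = sorted(range(n), key=lambda i: d_test[i])[:n_refs]
    # Citers: training bags that rank the test bag among their
    # n_citers nearest neighbours.
    citers = []
    for i in range(n):
        closer = sum(1 for j in range(n)
                     if j != i and ahd(train_bags[i], train_bags[j]) < d_test[i])
        if closer < n_citers:
            citers.append(i)
    voters = refs + citers
    positives = sum(1 for i in voters if train_labels[i])
    return positives > len(voters) - positives  # ties go to negative


# Hypothetical 1-D toy data: negative bags near 0, positive bags near 10.
train_bags = [[(0.0,)], [(0.5,)], [(10.0,)], [(10.5,)]]
train_labels = [False, False, True, True]
print(citation_knn_predict([(9.8,), (10.2,)], train_bags, train_labels))
print(citation_knn_predict([(0.2,)], train_bags, train_labels))
```

For annotation, one such vote would be run per term, with `train_labels` taken from that term's column of the binary annotation matrix.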
GPMIL
We propose a Bayesian learning algorithm with a Gaussian process prior for our annotation task. Following [23], we first introduce an unobserved latent function $g(x)$ defined on the instance space for each annotation term $t$, such that for each instance $x$, $g(x)$ gives a probability indicating the confidence that $x$ should be annotated with term $t$. We further impose a Gaussian process prior on $g(x)$ over the whole instance space. Let $G = \{g(x_i) \mid i = 1, \ldots, n_{inst}\}$, where $n_{inst}$ denotes the size of the instance space. We have $G \sim \mathcal{N}(0, K)$ as a Gaussian process prior [26], where $K$ is the Gram matrix of some well-known kernel over all instance pairs. To establish the connection between $g(x)$ and the annotated terms of images, a likelihood function is defined according to the basic multi-instance assumption [15] as Eq. (4):
$$p(t \mid G_B) = \max_{j = 1, \ldots, |B|} g(x_j) \qquad (4)$$
where $G_B$ represents the outputs of $g(x)$ for all instances in bag $B$, and $|B|$ is the size of bag $B$. For mathematical convenience, softmax is used instead of max; thus we have
$$p(t \mid G_B) \approx \frac{1}{\alpha} \ln \sum_{j=1}^{|B|} e^{\alpha g(x_j)} \qquad (5)$$
where $\alpha$ is an amplifying factor of the softmax function. If the largest $g(x_j)$ over all $j$ is less than 0.5, bag $B$ would not be annotated with term $t$ because $p(t \mid G_B) < 0.5$. The joint likelihood function on the whole training set $D$ can be written as
$$p(T \mid G_D) = \prod_{B \in D} p(T_B \mid G_B) \qquad (6)$$
where $T$ is a Boolean vector whose entry $T_B$ indicates whether bag $B$ in $D$ is annotated with term $t$. However, we are concerned with the label of a test bag, not with $G_B$ or $G_D$ themselves. Following Bayes' rule, the posterior distribution over $G_D$ for training dataset $D$ and term $t$ can be written as:
$$p(G_D \mid D, T) = \frac{p(T \mid G_D)\, p(G_D)}{p(T \mid D)} \qquad (7)$$
where $p(T \mid G_D)$ is the joint likelihood defined in Eq. (6), $p(G_D)$ is the Gaussian process prior and $p(T \mid D)$ is the marginal likelihood given by
$$p(T \mid D) = \int p(T \mid G_D)\, p(G_D)\, dG_D \qquad (8)$$
With Eqs. (7) and (8), we can further obtain the predictive distribution of a test bag $X$ for annotating term $t$ as
$$p(t \mid D, T, X) = \int p(t \mid G_X, X)\, p(G_X \mid D, T, X)\, dG_X \qquad (9)$$
where, on the right-hand side of Eq. (9), $p(t \mid G_X, X)$ represents the likelihood function of the test image $X$, and $p(G_X \mid D, T, X)$ represents the posterior distribution of the latent variable $G_X$, given by $p(G_X \mid D, T, X)=\int p(G_X \mid G_D, D, X)\, p(G_D \mid D, T)\, dG_D$. For each test image $X$, using the whole training dataset and the corresponding annotation vector $T$, we can obtain a predictive distribution that is a function of $X$ and $t$. An effective method for solving Eq. (9) can be found in [27, 23].
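The relationship between the max-based bag likelihood and its softmax replacement can be checked numerically. The sketch below assumes the log-sum-exp form of the softmax, $\frac{1}{\alpha}\ln\sum_j e^{\alpha g(x_j)}$, with amplifying factor $\alpha$; the function names and the example values of $g$ are hypothetical. Under this assumption the smooth value always lies between $\max_j g(x_j)$ and $\max_j g(x_j) + \ln|B| / \alpha$, so it approaches the max as $\alpha$ grows.

```python
import math
from typing import Sequence


def bag_likelihood_max(g_values: Sequence[float]) -> float:
    """Hard multi-instance likelihood: the largest instance-level output."""
    return max(g_values)


def bag_likelihood_softmax(g_values: Sequence[float], alpha: float = 20.0) -> float:
    """Smooth approximation of the max: (1/alpha) * ln(sum_j exp(alpha * g_j)).

    The maximum is factored out before exponentiating for numerical stability.
    """
    m = max(g_values)
    return m + math.log(sum(math.exp(alpha * (g - m)) for g in g_values)) / alpha


# Hypothetical latent outputs g(x_j) for the regions of one image.
g_bag = [0.1, 0.2, 0.9]
for alpha in (1.0, 5.0, 20.0, 100.0):
    print(alpha, bag_likelihood_softmax(g_bag, alpha))
print(bag_likelihood_max(g_bag))
```

Because the approximation overshoots the max slightly, a large $\alpha$ is needed if the value is to be read directly as a probability.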
To make the idea of GPMIL clearer, we provide an example as follows:

1. Suppose we have a training image set D associated with a binary annotation vector for term t and a test image X.
2. Following Eqs. (4) and (6), calculate the likelihood function for the training set D.
3. Following Eqs. (7), (8) and (9), write down the analytical form of the predictive distribution for X.
4. Use an approximation method to transform the predictive distribution into a Gaussian distribution that can be solved analytically. After this step, a closed-form solution is available for testing any unseen image. In other words, the training set can be discarded in the testing step.
For each annotation term t, a model is trained by using GPMIL. For a test image, each model calculates a probability indicating the confidence of annotating the image with the corresponding term.
Evaluation
Dataset description
We evaluated the proposed method using a real skin biopsy image dataset from The Second Affiliated Hospital of Guangzhou University of Chinese Medicine and The Third Affiliated Hospital of Sun Yat-sen University. The dataset contains diagnosis data from 2010 to 2012, including 2734 patient records and 6691 skin biopsy images associated with a set of standard dermatopathology annotations in Chinese. The dataset was generated by manually selecting 2-3 biopsy images at the same magnification ratio for each patient. Each term indicates a certain feature of concern in the biopsy images of a certain patient. Each image has pixels with 24k colours in RGB space with a size of 2048 × 1536 pixels. We considered 15 annotation terms in the evaluation, among which some often appear in lesion regions and others are only observed for some special types of skin diseases. Table 1 lists these terms with their rates of occurrence in the evaluation dataset.
A binary matrix is obtained by text matching, in which each row is a 15-dimensional binary vector indicating whether an image has been annotated with each of these terms. Based on domain knowledge, a skin biopsy image may be composed of up to 15 regions. We set the number of regions p to 8, 10 or 12 for separate runs of our proposed algorithm and then combine the runs through majority voting. Images fed to NCut are all rescaled to 200 × 150 pixels for efficient calculation. The feature extraction methods were applied to the rescaled images instead of the original ones because the rescaled images contain sufficient information.
Evaluation criteria
As mentioned in the previous section, we decomposed the annotation task into several binary classification tasks. Zero-one loss (also called precision) is a straightforward criterion for our task. Because multiple terms are associated with an image, multi-label machine learning evaluation criteria are also suitable for our task. We therefore also use Hamming loss for evaluation, whose definition can be found in [28]. Intuitively speaking, Hamming loss measures how many object-term pairs are annotated by mistake. Note that smaller values of Hamming loss indicate better model performance. Zero-one loss evaluates the annotation performance of a single term, whereas Hamming loss evaluates the whole model output for all terms.
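The two criteria can be illustrated on a toy annotation matrix. The sketch below uses the standard definitions, per-term zero-one loss as the fraction of images on which one term is predicted incorrectly and Hamming loss as the fraction of incorrect image-term pairs; the function names and the small matrices are hypothetical, and the numbers are not results from the evaluation.

```python
from typing import List, Sequence


def zero_one_loss(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of images on which a single term is predicted incorrectly."""
    return sum(a != b for a, b in zip(y_true, y_pred)) / len(y_true)


def hamming_loss(Y_true: List[Sequence[int]], Y_pred: List[Sequence[int]]) -> float:
    """Fraction of image-term pairs that are annotated incorrectly."""
    n_pairs = sum(len(row) for row in Y_true)
    wrong = sum(a != b
                for row_t, row_p in zip(Y_true, Y_pred)
                for a, b in zip(row_t, row_p))
    return wrong / n_pairs


# Hypothetical annotations: rows are images, columns are terms.
Y_true = [[1, 0, 1], [0, 1, 0]]
Y_pred = [[1, 1, 1], [0, 1, 1]]

second_term_true = [row[1] for row in Y_true]
second_term_pred = [row[1] for row in Y_pred]
print(zero_one_loss(second_term_true, second_term_pred))
print(hamming_loss(Y_true, Y_pred))
```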
Evaluation results
Evaluation of feature extraction and model training methods
We evaluated the performance of the two feature extraction methods, 2D-DWT and SIFT, combined with the two model training algorithms. The purpose was to show the effectiveness of different feature representations with different models. We used the following configuration. The whole dataset was randomly divided into a training set and a testing set with a ratio of 3:7. The number of regions generated by NCut was set to 10. The block size for 2D-DWT was set to 4 × 4. Images were all rescaled to 200 × 150 for efficient computation. SIFT was used with its default settings, as mentioned above. For GPMIL, because the model provides a probability r, it can be converted into a binary value through b = sign(r − 0.5). We also implemented the bag-of-features method with an RBF kernel function [10] as a baseline for comparison. For every model, we ran 10 trials and averaged all of the results to obtain a final result. Table 2 shows the results, measured by zero-one loss, for the annotation of 15 terms.
In Table 2, the column BOF stands for the result of the bag-of-features method proposed in [10]. The best result in each row is highlighted in bold. It can be observed that the multi-instance learning-based methods are superior to the bag-of-features-based method for annotating most terms. Each feature extraction method achieved the best performance in some cases, so we cannot simply declare one superior to the other. Prior knowledge or experience can be introduced to determine the most suitable feature representation method. Another factor that should be noted is the stability of the proposed method, which achieves higher precision and lower variance than the baseline method, meaning that the proposed method is more reliable and stable for the annotation of different terms.
Table 3 illustrates the performance as evaluated by Hamming loss. GPMIL with 2DDWT feature representation achieves the best Hamming loss. Note that Hamming loss is often higher than the average error rate for the annotation of all terms, as the correct annotations may not be in the same image, leading to some increase in Hamming loss.
The impact of number of regions
We varied the number of regions generated by NCut to demonstrate its impact on model performance and to reveal the relationship between the proposed method and clinical experience. We used 2DDWT as the only feature extraction method and varied p from 6 to 12 in steps of 2. As indicated in Figure 4, a small p value may lead to complex regions that exhibit several terms at once, whereas a large p value may lead to fragmented regions. Figure 6 shows the results for the first 8 terms.
We can see that the parameter p affects model performance to some extent. In most cases, it is true that a larger p means better performance. Even with p = 6, however, the proposed algorithm achieves an acceptable result for some terms, which runs counter to our experience. Because we do not know in advance which number of regions would be best, we propose to use an ensemble method to create a model with better generalisation, reducing the impact of an improper setting of p. To do this, we adopted a majority voting strategy: a model is trained for each value of p, and when an image is tested, the models vote to determine the final result. Because the models have binary outputs, they vote separately for each annotation term. Table 4 shows the ensemble result for each term.
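The per-term voting can be sketched as below, assuming one binary prediction vector per value of p for a given test image. The function name and the toy predictions are illustrative only. With an odd number of models, as with p = 8, 10 and 12, no ties can occur.

```python
def majority_vote(predictions_by_p):
    """Combine per-term binary predictions from models trained with different p.

    predictions_by_p maps each value of p to a list of 0/1 predictions, one per term.
    A term is assigned when more than half of the models predict it.
    """
    votes = list(predictions_by_p.values())
    n_models = len(votes)
    n_terms = len(votes[0])
    return [
        1 if 2 * sum(model[j] for model in votes) > n_models else 0
        for j in range(n_terms)
    ]


# Toy predictions for one test image and four terms.
predictions_by_p = {
    8: [1, 0, 1, 0],
    10: [1, 1, 0, 0],
    12: [0, 1, 1, 0],
}
final = majority_vote(predictions_by_p)
```

Here each of the first three terms receives two of three votes and is kept, while the last term receives none.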
The impact of an imbalanced training set
As indicated in Table 1, the frequency of different terms varies significantly. When a model is trained with an imbalanced dataset, it tends to be biased toward the majority class. We varied the ratio r between positive and negative samples to determine a good strategy for building a training dataset. To do this, a series of datasets Dr of size N are constructed by first randomly selecting N × r images annotated with a term from the training set, then randomly selecting N × (1 − r) images not annotated with the same term. We used Citation KNN and 2DDWT feature extraction for this evaluation. Note that in this case accuracy may not be a proper measure because the model tends to predict all test samples as one class when trained with a highly imbalanced dataset. For example, when a dataset is composed of 90% positive and 10% negative samples, a model that always makes positive predictions would achieve an accuracy of 90%. However, this accuracy would be meaningless. We therefore used false positive (FP) and false negative (FN) ratios to measure performance. Figure 7 shows the model performance for different values of r for the first 4 terms.
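A minimal sketch of the dataset construction and of the FP and FN ratios follows. The normalisation of the ratios is not stated in the text, so dividing FP by the number of true negatives and FN by the number of true positives is an assumption, as are the function names and the sample identifiers. The last lines reproduce the 90%/10% example with an always-positive model.

```python
import random


def build_dataset(positives, negatives, n, r, seed=0):
    """Draw round(n * r) positive and the remaining negative images at random."""
    rng = random.Random(seed)
    n_pos = round(n * r)
    n_neg = n - n_pos
    return rng.sample(positives, n_pos), rng.sample(negatives, n_neg)


def fp_fn_ratios(y_true, y_pred):
    """Return (FP / #negatives, FN / #positives); assumes both classes are present."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return fp / y_true.count(0), fn / y_true.count(1)


# Hypothetical pools of image identifiers with and without a given term.
pos_sample, neg_sample = build_dataset(list(range(500)), list(range(500, 1000)), n=100, r=0.2)

# The 90%/10% example: a model that always predicts the positive class.
y_true = [1] * 90 + [0] * 10
y_pred = [1] * 100
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
fp_ratio, fn_ratio = fp_fn_ratios(y_true, y_pred)
```

The accuracy is 0.9, yet every negative sample is misclassified, which the FP ratio of 1.0 makes visible.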
An illustration of the model output
Finally, we illustrate a comparison between the model output and the real annotation terms attached to the test images. We selected three images from the evaluation dataset, taken in 2011 from three different patients. Figure 8 illustrates the annotation results of Citation KNN and GPMIL. The column True stands for the annotation terms that belong to the images according to the diagnosis records. Citation KNN provides a set of terms, and GPMIL further outputs a confidence level for each term. In Figure 8, we omitted terms with a probability of less than 50%.
Discussion
Multi-instance representation vs. bag-of-features
In histopathological and dermatopathological image analysis, a large amount of work has been based on bag-of-features construction [10, 29–31], in which a dictionary is built whose elements are small patches from a set of training images and can be regarded as keywords. To classify or annotate a given image, these methods need only examine the presence or quantity of keywords in the image. Thus the image can be expressed as a histogram of elements in the dictionary.
Our multi-instance framework is quite different from bag-of-features-based methods. The proposed framework retains original features through direct feature extraction methods, whereas bag-of-features-based methods only generate some statistical measures, e.g., a histogram of the elements in a dictionary, which may cause some loss of discriminative information. Meanwhile, the elements of a dictionary in a bag-of-features-based method are often derived from grid-based image patches. We argue that such patches are not able to fully capture the essential discriminative information contained in histopathological images. The proposed framework generates meaningful local regions with visually disjoint edges using NCut, which is more consistent with diagnostic experience in dermatopathology.
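The histogram representation can be sketched as follows, assuming each patch is described by a numeric descriptor and assigned to its nearest dictionary element by squared Euclidean distance. The two-dimensional descriptors, the three-word dictionary and the function names are toy assumptions; dictionary learning itself is not shown.

```python
def nearest_word(descriptor, dictionary):
    """Index of the dictionary element closest to the descriptor."""
    def squared_distance(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(dictionary)), key=lambda k: squared_distance(descriptor, dictionary[k]))


def bof_histogram(descriptors, dictionary):
    """Normalised count of patches assigned to each dictionary element."""
    counts = [0] * len(dictionary)
    for descriptor in descriptors:
        counts[nearest_word(descriptor, dictionary)] += 1
    return [count / len(descriptors) for count in counts]


# Toy dictionary of three visual words and four patch descriptors from one image.
dictionary = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
patches = [(0.1, 0.0), (0.9, 1.1), (0.2, 0.1), (0.0, 0.9)]
hist = bof_histogram(patches, dictionary)
```

The image is reduced to the single vector `hist`, so the individual patch descriptors are no longer available to the classifier.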
Number of regions of Normalized Cut
We addressed some issues related to setting a reasonable number of regions. Though the evaluation results showed that an ensemble with different numbers of regions yields an acceptable result, this method lacks a good explanation. When inspecting skin biopsy images, a small number of regions indicates that the doctor is focusing on relatively global features, whereas a large number indicates more detailed features. Doctors' behaviour may range from global to detailed according to their knowledge and experience. Skin tissue is composed of three anatomically distinct layers, namely the epidermis, dermis, and subcutaneous tissue (fat). The epidermis can be further divided into four layers. Each layer has a distinctive stained colour and special structures. Distinct pathological changes involving any of these whole layers, such as hyperkeratosis, acanthosis and hyperpigmentation of the basal cell layer, can be easily recognised with a small number of segments. Specific changes within a layer, such as a Munro microabscess, nevocytic nests or infiltration of lymphocytes, can be more accurately detected when the image is divided into more pieces. Either a global or a detailed view is reasonable in diagnosis, which is consistent with the above evaluation results.
Relationship between regions
Considering the relationships between regions, it should be noted that skin tissues have clearly featured inner structures. Some correlation can be observed between the presence of different terms within an image. For example, terms such as hyperkeratosis and parakeratosis can only be found in certain regions and above features such as acanthosis or hyperpigmentation of the basal cell layer (if the term is attached to the same image). Theoretically speaking, GPMIL can capture such correlations to some extent by defining a different likelihood function [27]. Our Gaussian process prior for GPMIL also implies such relationships. However, previous work [32] reported that the inclusion of such relationships did not make a positive contribution to model performance. We attribute this phenomenon to the doctors' experience implied in the training dataset, i.e., doctors or experts pay more attention to important local regions, which statistically reduces the emphasis on relationships between regions.
Conclusion
In this work, we introduce the application of multi-instance representation and learning to the recognition and annotation of dermatopathological skin biopsy images. To represent a skin biopsy image as a multi-instance sample, we apply Normalized Cut to divide an image into visually disjoint regions and then extract features for each region through 2DDWT and SIFT-based algorithms. Two training algorithms have been proposed for model building: Citation KNN provides a binary output, and GPMIL calculates a probability indicating the confidence level of the model output. The evaluation results show that the proposed method is effective for biopsy image recognition and annotation.
Medically, the results contribute to the development of dermatopathology. Time consumption and expenditure would be lower if a computer program could take over the annotation work of a pathologist. The accuracy of diagnosis would be increased if subjective factors, such as a doctor's skill, and objective factors, such as light, were eliminated. The application accords with developing trends in dermatopathology. Further work will include introducing relationships between terms in a multi-instance multi-label framework and designing more powerful region recognition and feature extraction methods.
References
1. Gurcan MN, Boucheron LE, Can A, Madabhushi A, Rajpoot NM, Yener B: Histopathological Image Analysis: A Review. Biomedical Engineering, IEEE Reviews in. 2009, 2: 147-171.
2. Cerroni L, Argenyi Z, Cerio R, Facchetti F, Kittler H, Kutzner H, Requena L, Sangueza OP, Smoller B, Wechsler J, Kerl H: Influence of evaluation of clinical pictures on the histopathologic diagnosis of inflammatory skin disorders. J Am Acad Dermatol. 2010, 63 (4): 647-52. 10.1016/j.jaad.2009.09.009.
3. Fogelberg A, Ioffreda M, Helm KF: The utility of digital clinical photographs in dermatopathology. J Cutan Med Surg. 2004, 8 (2): 116-21.
4. Llamas-Velasco M, Paredes BE: Basic concepts in skin biopsy. Part I. Actas Dermosifiliogr. 2012, 103: 12-20. 10.1016/j.ad.2011.05.007.
5. Ferrara G, Argenyi Z, Argenziano G, Cerio R, Cerroni L, Di Blasi A, Feudale EA, Giorgio CM, Massone C, Nappi O, Tomasini C, Urso C, Zalaudek I, Kittler H, Soyer HP: The influence of clinical information in the histopathologic diagnosis of melanocytic skin neoplasms. PLoS One. 2009, 4 (4): e5375. 10.1371/journal.pone.0005375.
6. Neitzel CD: Biopsy techniques for skin disease and skin cancer. Oral Maxillofac Surg Clin North Am. 2005, 17 (2): 143-6, v. 10.1016/j.coms.2005.02.002.
7. Grayson W: Recognition of Dual or Multiple Pathology in Skin Biopsies from Patients with HIV/AIDS. Patholog Res Int. 2011, 2011: 398546.
8. Sellheyer K, Bergfeld WF: A retrospective biopsy study of the clinical diagnostic accuracy of common skin diseases by different specialties compared with dermatology. J Am Acad Dermatol. 2005, 52 (5): 823-30. 10.1016/j.jaad.2004.11.072.
9. Raza SH, Parry RM, Moffitt RA, Young AN, Wang MD: An Analysis of Scale and Rotation Invariance in the Bag-of-Features Method for Histopathological Image Classification. MICCAI (3), Volume 6893 of Lecture Notes in Computer Science. Edited by: Fichtinger G, Martel AL, Peters TM. 2011, Springer, 66-74.
10. Caicedo JC, Cruz-Roa A, González FA: Histopathology Image Classification Using Bag of Features and Kernel Functions. AIME, Volume 5651 of Lecture Notes in Computer Science. Edited by: Combi C, Shahar Y, Abu-Hanna A. 2009, 126-135.
11. Bishop CM: Pattern Recognition and Machine Learning (Information Science and Statistics). 2006, Secaucus, NJ, USA: Springer-Verlag New York, Inc.
12. Li YX, Ji S, Kumar S, Ye J, Zhou ZH: Drosophila Gene Expression Pattern Annotation through Multi-Instance Multi-Label Learning. IEEE/ACM Trans Comput Biology Bioinform. 2012, 9: 98-112.
13. Bunte K, Biehl M, Jonkman MF, Petkov N: Learning effective color features for content based image retrieval in dermatology. Pattern Recogn. 2011, 44 (9): 1892-1902. 10.1016/j.patcog.2010.10.024.
14. Zhou ZH, Zhang ML, Huang SJ, Li YF: Multi-instance multi-label learning. Artif Intell. 2012, 176: 2291-2320. 10.1016/j.artint.2011.10.002.
15. Dietterich TG, Lathrop RH, Lozano-Pérez T: Solving the multiple instance problem with axis-parallel rectangles. Artif Intell. 1997, 89 (1-2): 31-71.
16. Zhang ML: Generalized Multi-Instance Learning: Problems, Algorithms and Data Sets. 2009.
17. Chen Y, Wang JZ: Image Categorization by Learning and Reasoning with Regions. J Mach Learn Res. 2004, 5: 913-939.
18. Shi J, Malik J: Normalized Cuts and Image Segmentation. IEEE Trans Pattern Anal Mach Intell. 2000, 22 (8): 888-905. 10.1109/34.868688.
19. Golub GH, Van Loan CF: Matrix computations. Johns Hopkins series in the mathematical sciences. 1989, Baltimore: Johns Hopkins University Press, 2nd edition.
20. Gersho A: Asymptotically optimal block quantization. Information Theory, IEEE Transactions on. 1979, 25 (4): 373-380. 10.1109/TIT.1979.1056067.
21. Lowe DG: Distinctive Image Features from Scale-Invariant Keypoints. Int J Comput Vision. 2004, 60 (2): 91-110.
22. Wang J, Zucker JD: Solving Multiple-Instance Problem: A Lazy Learning Approach. 2000.
23. Kim M, la Torre FD: Gaussian Processes Multiple Instance Learning. Proceedings of the 27th International Conference on Machine Learning (ICML-10), June 21-24, 2010, Haifa, Israel. Edited by: Fürnkranz J, Joachims T. 2010, Omnipress, 535-542.
24. Foulds J, Frank E: A Review of Multi-Instance Learning Assumptions. Knowl Eng Rev. 2010, 25: 1-25. 10.1017/S026988890999035X.
25. Zhang ML, Zhou ZH: Multi-instance clustering with applications to multi-instance prediction. Applied Intelligence. 2009, 31: 47-68. 10.1007/s10489-007-0111-x.
26. Rasmussen CE, Williams C: Gaussian Processes for Machine Learning. 2006, MIT Press.
27. He J, Gu H, Wang Z: Bayesian multi-instance multi-label learning using Gaussian process prior. Mach Learn. 2012, 88 (1-2): 273-295.
28. Zhang ML, Wang ZJ: MIMLRBF: RBF neural networks for multi-instance multi-label learning. Neurocomputing. 2009, 3951-3956.
29. Ji S, Li YX, Zhou ZH, Kumar S, Ye J: A bag-of-words approach for Drosophila gene expression pattern annotation. BMC Bioinformatics. 2009, 10: 119. 10.1186/1471-2105-10-119.
30. Cruz-Roa A, Caicedo JC, González FA: Visual Pattern Analysis in Histopathology Images Using Bag of Features. CIARP, Volume 5856 of Lecture Notes in Computer Science. Edited by: Bayro-Corrochano E, Eklundh JO. 2009, Springer, 521-528.
31. Rueda A, Arevalo JE, Cruz-Roa A, Romero E, González FA: Bag of Features for Automatic Classification of Alzheimer's Disease in Magnetic Resonance Images. CIARP, Volume 7441 of Lecture Notes in Computer Science. Edited by: Álvarez L, Mejail M, Gómez L, Jacobo JC. 2012, Springer, 559-566.
32. Zhang G, Shu X, Liang Z, Liang Y, Chen S, Yin J: Multi-instance learning for skin biopsy image features recognition. Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on. 2012, 1-6. 10.1109/BIBM.2012.6392648.
Acknowledgements
Based on "Multi-Instance Learning for Skin Biopsy Image Features Recognition", by Gang Zhang, Xiangyang Shu, Zhaohui Liang, Yunting Liang, Shuyi Chen and Jian Yin, which appeared in Bioinformatics and Biomedicine (BIBM), 2012 IEEE International Conference on. © 2012 IEEE 9781467325608/12/.
The authors would like to thank Prof. Dacan Chen and Prof. Zhaohui Liang from The Second Affiliated Hospital of Guangzhou University of Chinese Medicine for their inspiring suggestions, assistance and financial aid during the study. This work is supported by the National Natural Science Foundation of China (No. 81274003, 61033010, 61272065), Guangdong Provincial Foundation of Medical Science Research (No. A2012215), Natural Science Foundation of Guangdong Province (S2011020001182), Research Foundation of Science and Technology Plan Project in Guangdong Province and Guangzhou City (2009B030801090, 2010A040303004, 11A12050914, 11A31090341, 2011Y500004), Research Foundation of Guangdong Provincial Hospital of Chinese Medicine (No. 2013KT1067), Research Foundation of SysungEtri project (2011A091000026) and the 2012 College Student Career and Innovation Training Plan Project (1184512043).
Declarations
The publication costs for this article were funded by the corresponding author.
This article has been published as part of BMC Medical Genomics Volume 6 Supplement 3, 2013: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2012: Medical Genomics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcmedgenomics/supplements/6/S3.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
Gang Zhang and Ziping Li are equally contributing authors of this article. Gang Zhang was responsible for the main framework of this article, as well as algorithm development and programming, and he wrote the first draft of the paper. Ziping Li was responsible for study design and crude data processing, and he revised the paper from a clinical perspective. Honglai Zhang is the coordinator who organised the study, Xiangyang Shu provided useful advice on dermatology, and Jian Yin and Guozheng Li provided useful advice on data processing.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhang, G., Yin, J., Li, Z. et al. Automated skin biopsy histopathological image annotation using multi-instance representation and learning. BMC Med Genomics 6, S10 (2013). https://doi.org/10.1186/1755-8794-6-S3-S10
Keywords
 Feature Extraction Method
 Annotation Term
 Annotation Task
 Histopathological Image
 Binary Classification Task