Validation of the Nanostring of the Banff Human Organ Transplant Gene Panel in Archival Data from Human Kidney Transplants and its Modelling Complexity

RNA gene expression of renal transplantation biopsies is commonly used to identify rejection. Mostly done with microarrays, seminal ndings describe and dene the patterns of genes associated with types of rejection and non-rejection kidney allograft diagnoses. To make gene expression more accessible for pathology laboratories, the Molecular Diagnostics Working Group of the Banff Foundation for Allograft Pathology and NanoString Technologies partnered to create the Banff Human Organ Transplant Panel (BHOT), a gene panel set of 770 genes as a substitute for microarrays (~ 50,000 genes). The advantage of this platform is that gene expressions are quantiable on formalin xed and paran embedded archival tissue samples. This new technology, thus, makes gene expressions cheaper and accessible to more laboratories and investigators. The purpose of this report is to validate the BHOT panel as a surrogate for microarrays and test the accuracy of the modelled BHOT data.


Results
This limited NanoString gene set readily identi es renal rejection and non-rejection diagnostic patterns using in silico statistical analyses of seminal archival databases derived from renal transplant RNA expression arrays. Multiple modelling algorithms show a highly variable pattern within the error matrices per sample. The discrepancies within the error matrices are most likely related to the gene expression heterogeneity of samples within a given pathological diagnosis. This was con rmed by clustering the data into 8 groups, which modelled with fewer misclassi cations.

Conclusion
This report validates gene expression of human renal allografts using the Banff Human Organ Transplant Panel as a surrogate for microarrays and con rms the its modelling complexity. Background RNA gene expression is now commonly used to nd patterns of gene expression in renal transplants.
Recent technology, NanoString nCounter, employs formalin xed para n embedded tissue as the RNA source for gene expression studies, i.e., archival specimens [20]. Advantages of NanoString include gene expression analyses on the same tissue used for diagnosis, and more importantly, readily permits the expansion of gene expression studies to more laboratories that routinely make renal transplant diagnoses. NanoString uses synthetic oligonucleotides as targets for the RNA fragments derived by extraction of the RNA. NanoString gene panels employ only 770 gene targets and, therefore, are not gene discovery tools.
To promote gene expression in renal transplants, the Molecular Diagnostic Working Group of the Banff Foundation for Allograft Pathology and NanoString Technologies partnered to create a subset of microarray genes, the Banff Human Organ Transplant (BHOT) panel, which could be widely used by many laboratories, thereby, allowing widespread collaborations and clinical trials [21].
Optimally validation of the BHOT panel would be done by comparing the BHOT panel and microarrays on the same RNA, but such an experiment has not yet been done. The purpose of this report is to test in silico on archival data if the BHOT panel genes shows similar expression patterns corresponding to those previously reported using microarrays [1][2][3][4][5][6][7][8][9]22]. Validation of the BHOT panel will encourage additional investigations of human renal allograft rejection. In addition, modelling studies were performed to test how well the BHOT expression data identi es the diagnostic classes.

Materials And Methods
Data. Downloaded text les of GSE data sets 30718 [6], 36059 [10,23,24], and 48581 [10,25] from NCBI all derived from HU-133 plus 2 microarrays with their diagnostic annotations were rst imported into excel. These seminal reports from these three databases established the gene expression patterns for T Cell Mediated Rejection (TCMR), Antibody Mediated Rejection (ABMR), and delayed graft function (Acute Kidney Injury, AKI) [6,10,[23][24][25]. Gene names were checked due to gene label updates used in the Banff Human Organ Transplant gene set panel (BHOT). Microarrays contain many duplicate probes for individual genes, and the highest value per gene probe was chosen. These data were joined with the BHOT panel and excluded non-renal parenchymal and viral genes. Data were renormalized using the housekeeping probes with negligible effect. Coe cients of variation were calculated and the lowest % CV including housekeeping genes were deleted, leaving 667 genes and 764 samples [26]. Data were then log 2 transformed.
Software. Analyses were performed using SAS/JMP 14.2 using linear models with validation, principal components, multiple logistic regression, K-means clustering with determination of cluster number or Python 3.7 with the scik-learn module (Pycaret 2.0). UMAP (Uniform Manifold Approximation and Projection for Dimension Reduction) clustering was performed in JMP14.2/R4.0. Graphing was performed with Graph Builder (JMP 14.2). Batch normalization was performed with JMP Genomics 9.0.
From power calculations (power = 0.8), signi cance was set at a False Discovery Rate Adjusted P-value of 0.005 (-log10 = 2.3). This was also applied to the signi cance of any mean difference.
Parsing of Genes. Three methods of parsing genes were used to identify multiple principal components that were used to optimally partition the diagnostic groups. The rst method, supervised, nds the highest gene expressions by linear models within two comparison groups, TCMR, ABMR, AKI, MIXED, or NORMAL as compared to NO REJECTION. These derived gene sets are heterogenous (collinear and nonparametric) as multiple principal components with eigen values from 3 to > 100 were derived from each derived binary gene set. These principal components are called Pathological Based Principal Components (PBPC) [1][2][3][4][5][6][7][8][9]27]. The second method, semi-supervised, used genes from CIBERSORT LM22, Blood Atlas, and KEGG, and NanoString Annotations for a speci c cell type or immunological pathway. This method was inspired by Nanostring Advanced Analysis software, in which "scores" are created using singular value decomposition, a sparse principal component, of genes that identify a cell type or immunological pathway. Care was taken to eliminate genes, which cross react in other cell types or pathways. The genes within a cell type or pathway created one principal component with an eigen value > 5 and are called Cell Pathway Principal Components (CPPC). The third method, unsupervised (agnostic), derives multiple principal components with eigen values from 3 to > 100, from all genes without regard to a class or diagnosis and called Unsupervised Principal Components (UPC). Principal component analysis was chosen for data reduction due to the massive collinearity of individual gene expressions.
Batch effects. UMAP was used to identify by clustering unknown anomalous effects (batch effects) in the archival data using genes with a coe cient of variation (CV) of ≤ 5%, which included the house keeping genes from BHOT. Genes with CV ≤ 5% have little partitioning value. The three anomalous clusters were manually coded as categorical variables. Batch normalization was performed in JMP Genomics 9.2 using batch normalization.
Pathological diagnoses. Pathological diagnoses, derived from annotations of the downloaded databases, are categorical classes: T Cell Mediated Rejection (TCMR), Antibody Mediated Rejection (ABMR), Mixed (both TCMR and ABMR), NO REJECTION (NR), Acute Kidney Injury (AKI) which is de ned as renal dysfunction unrelated to rejection and often occurring post transplantation, and Normal Native (NORMAL). These diagnostic classes are summary classes derived from the more complex classi cations of renal allograft rejections per Banff classi cation schemes, which employ microscopic criteria, many of which cannot be evaluated, identi ed, or correlated with RNA expression [28,29].
De nitions, abbreviations, and classi cation parameters are shown in supplemental data, Table 1A &B.

Results
Although the combined data were from the same array, batch effects can often skew data. It is unknown how many experiments were done to create the archived datasets, so that batch corrections cannot be done on individual experiments. To work around this problem UMAP clustering was performed on the genes with the lowest 5% coe cient of variation. Figure 1 shows a graph in which three clusters (A, B, &C, black arrows) were identi ed before batch correction. After batch normalization, one cluster remained (red arrow). Such batch effects have a slight in uence (F = 0.02) on classi cation accuracy (Pycaret classi cation, compare_models module), Supplemental Table 2, using all data and the target as DX (diagnosis). Batch normalized data was throughout.
Feature selection of the supervised (PBPC) and unsupervised (UPC) principal components using linear model, and active effects in multinomial logistic regression identi ed optimal sets of principal components that best partitioned the diagnostic groups with estimates and signi cance (Supplemental Table 3). Figure 2 shows a graph of the principal components (PCs) for the three different methods of gene selection. Pathologically based PCs (PBPCs, Fig. 2A) and the unsupervised PCs (UPC, Fig. 2C) readily partition the diagnostic groups. CPPC (Fig. 2B) identify immunologically interpretable patterns with a high PC for tubules in NORMALS, and a high endothelial PC in ABMR and MIXED but low in the other groups. In ammatory cell types and mediators are highest in TCMR and MIXED rejections, known to contain in ammatory in ltrates and low in NORMALS and AKI.
The highest gene expressions per principal component (CPPC and PBPC) de ned by PC loading tables were compared to the transcript patterns previously identi ed in microarrays called pathologically based transcripts (PBTs) [1][2][3][4][5][6][7][8][9]22]. PBPC1 is dominated by genes identifying adaptive immunity, chemokine and cytokine signaling, cytotoxicity, T cell receptor signaling, toll-like receptor signaling, type 2 interferon, CD4 and CD8 T cells, and macrophages and found in PBTs (Type 2 interferon induced, cytotoxic lymphocyte induced, T cell, injury and injury repair transcripts). PBPC1, therefore, is an in ammatory signature that is associated with the in ammation commonly seen in TCMR and MIXED rejections. PBPC2, low in the AKI diagnoses, is low for genes in the cytokine signaling (JAK2) pathway, innate immunity, TH17 pathways, and toll-like receptor signaling and found in PBTs (late injury repair and type 2 interferon induced transcripts). PBPC3, high in AKI, identi es higher and different cytokines (CXCL13, 16, and CXCR6) and is found in PBTs (injury repair, endothelial, type 2 interferon induced, and decreased solute carrier transcripts). PBPC4, highest in ABMR and MIXED rejections, is dominated by the expression of endothelial genes and some CD4 cells and found in PBTs (endothelial and alloantibody induced transcripts), and is an endothelial pattern closely associated with antibody mediated rejections (ABMR and MIXED). PBPC5, high in AKI and TCMR, contains genes for B cells, complement, and innate immunity and is found in PBTs (B cell, macrophage, injury-related transcripts). PBPC6, lowest in the AKI and normal diagnoses is low for genes in innate immunity, type 2 interferon, and CD4s, and CD8 T cells and low in PBTs (injury related type 2 interferon inducible, T cell transcripts). PBPC7, is low for chemokine, T, B endothelial, and macrophage genes and low in PBTs (B cell, alloantibody induced, endothelial injury repair and type 2 interferon induced transcripts). PCPC8, highest in the normal diagnosis is high for glomerular, tubular, TH17 pathway, and tissue homeostasis genes and found in PBTs (solute carrier (high), alloantibody induced (low), endothelial (low), type 2 interferon induced (low)).
Within the unsupervised principal components UPC1, like PBPC1, identi es an in ammatory pattern highest in genes for adaptive immunity, chemokines, cytokines, cytotoxicity, innate immunity, toll-like receptor signaling, CD4 and CD8 T cells, and macrophages. UPC2, highest in NO REJECTION contains type 1 and 2 interferon related gene expressions, chemokine and cytokine, innate and toll-like receptor related gene expressions without in ammatory cells and found in PBT interferon related transcripts. UPC3, highest in AKI, shows the highest gene patterns in cytokines, complement, innate immunity, oxidative stress, without markers for B, T, or macrophage cells and is found in PBTs (interferon and injury repair transcripts and solute carrier (low)). UPC4, high in ABMR and MIXED, identi es an endothelial pattern with many endothelial genes, adaptive immunity, chemokines, complement, cytokines, B cell, CD4, CD8, macrophage genes without any cytotoxicity signals and is found in PBTs (endothelial and alloantibody induced, B cell, type 2 interferon). UPC5, highest in the normal diagnosis includes gene signals for glomeruli, tubules, some innate, oxidative, TH17, TNF without type 2 interferon, plasma, CD4, CD8, or macrophage cells and is found in PBTs (high tubular, high endothelial, injury repair). UPC6, high in AKI, shows the greatest number of genes in adaptive, cytokine, complement, innate, and CD8, and macrophages and is found in PBTs (type 2 interferon, injury related, macrophage related transcripts). UPC7, highest in NO REJECTION, is highest for genes in chemokine, cytokine, innate, oxidative, TH17 pathways, with some markers for B, CD4, CD8, and macrophage cells and is found in PBTs (injury related, type 2 interferon related transcripts).
Modelling. Modelling is used to estimate how good variables describe the classi cation parameters, in this case, how accurately the principal components identify the diagnoses. Modelling programs assign a class or in this case a diagnosis to each sample based on the highest probability within the target diagnosis for a speci c sample. Initially, multinomial logistic regression (JMP14.2) modelled the data (CPPC, PBPC, UPC and all three (ALL)) with the target as the diagnosis (DX). All four models created acceptable ROC curves, Fig. 3A, B, C, & D. However, the errors in the confusion matrices were substantial 43-60%, indicating that modelling poorly matched the DX, Table 1A. Also, problematic, the per sample error of 46.9% indicates per sample discrepancies showing that subtle variations in the data type engender different patterns of errors matching the DX. Reeve et.al. also, identi ed variations in misclassi cations when clustering using archetypal analysis as compared to annotated diagnoses [30].
Individual samples are misclassi ed differently depending on the data set.
The classi cation assignment is derived from the highest probability per group, whether the highest probability is below or above 50%. The average probabilities were examined in misclassi ed and concordant samples, Fig. 4. Misclassi ed samples have lower average probabilities as compared to concordant samples suggesting that samples without a consistently high probability cause per sample variations in the error matrices [30].
One possibility for the complexity of modelling as shown by the high per sample misclassi cation patterns is that the annotated diagnosis is not a pure category and that gene expression heterogeneity exists within samples of a diagnostic class. To explore this, distributions were analyzed for all principal components by diagnosis. Figure 5 shows the distributions of all the principal components from the three data sets (CPPC, UPC, and PBPC) for the six diagnostic classes for each sample (grey lines) and a group mean (black line). The grey lines show wide distribution patterns within a diagnosis, and the group mean shows biphasic distributions for the diagnoses AKI, TCMR, and MIXED, which are, therefore, mixtures of distributions. The red vertical line is the modal average of NORMAL. Vast heterogeneity is evident for all diagnoses.
To try to reduce the heterogeneity across samples, the samples were clustered using K Means, resulting in 8 clusters. Table 2 is a contingency table of DX vs clusters. NORMAL and AKI cluster similarly to DX, but the other diagnoses are widely distributed among the clusters, especially NO REJECTION. Again, using multinomial logistic regression (JMP14.2) modelled the data (CPPC, PBPC, UPC and all three (ALL)) with the target as the Clusters. Table 1B shows that the percent misclassi cation from the confusion matrices dropped to 6-17%, a dramatic improvement as compared to using the DX as the target, Table 1A. However, the per sample errors in misclassi cations were 27.9%, which were much lower than 46.9% using the DX as the target.
To further re ne and optimize the prior models, Pycaret, which uses Python and sciklearn modules, was used to test additional models. The Pycaret compare_models module permits comparison of multiple models to nd the most optimal model by accuracy. The models tested are found in the supplemental Table 1A. The best model with the highest accuracy was also tested with the tuning, bagging, boosting, and blending with negligible improvements in accuracy (data not shown). Making an ensemble model of the top three models also did not show any improvement in accuracy. The optimal model for the data (All, CPPC, PBPC, UPC) with targets as DX and Clusters is shown in Table 3. Modeling all principal components (All) or the individual sets (CPPC, PBPC, and UPC) vs the target DX showed similar accuracies and precision with confusion matrix errors of 33-37% with per sample error rates of about 15%. Dramatic improvements in accuracies (> .9) was shown using Clusters as the target with dramatic reductions in the error rates.

Discussion
Findings show that the BHOT panel of genes recapitulates the diagnostic patterns from the archival data using these archival data sets (Fig. 2, supplementary Table 2). As the selected BHOT panel genes are derived from many microarray studies, it is not surprising that BHOT panel genes identify the expected patterns of rejection. Nevertheless, such in silico con rmation will encourage investigators to use this panel as it is easier than microarrays. All three methods of parsing genes created workable models with high average ROC scores. It is unclear which method of parsing the genes into principal components is easiest or is most suitable to create an e cient and standardized data analysis pipeline. Using PCs (PBPC) from sets of the highest genes between binary diagnostic comparisons is conceptually simple but engenders many principal components, which share collinearity and make feature selection for modelling both tedious and di cult. Using principal components of cell types and pathways is conceptually easier to understand immunologically. Creating unsupervised principal components is the easiest for feature selection and has an advantage that a latent variable or pathway may be present, which is not readily identi ed by the rst two gene selection methods [18,19]. These three methods, including just nding the highest genes by t-tests, will likely vary between independently derived data sets because results are very dependent on the sample size of the data set, the balance of the classes employed, and the purity of the annotated class diagnosis.
Some investigators argue that gene expression models assign a more accurate diagnosis than the original pathological diagnosis and use such testing for clinical diagnosis. However, heterogeneity is present in the misclassi cation assignments per sample by different models [30] or data sets (CPPC, UPC, PBPC), (Table 1, Table 3) engendering discordant patterns of misclassi cations. Therefore, changing the sample diagnosis by modelling may be premature. Model averaging or an ensemble of models does not solve this problem as a new error matrix is created, which still maintains the per sample variations in classi cations. Arbitrarily, using only a high probability to assign a diagnostic classi cation solves part of the misclassi cation problem by reducing some misclassi cations but limits model utility as many samples could remain unclassi ed [30]. To improve assignment of diagnoses of ambiguously modelled samples, additional clinical information such as some histological parameters, alloantibody, C4d, or time after transplant, could be incorporated (expert knowledge of prior probabilities) with gene expression to create a clinical pathological diagnosis [30,31]. Although using such expert knowledge may allow assignment of some samples to a diagnosis, and make overall interpretation easier, interpreting such heterogeneous variables, absent in the model, is subjective and may work for some samples but not all.
Although clustering data independent of diagnosis makes a better model with fewer misclassi cations, interpreting individual clusters remains problematic. It is better to nd the best model for the data rather than nd the best data for a model. Are these synthetic clusters just "toy" data, that models well but has no biological relevance? Some clusters resemble canonical diagnoses, but others do not. How do you assign a meaningful and clinically interpretable name to synthetic cluster? Nevertheless, creating more homogeneous groups of samples may identify clinically important groups of samples. Clustering expression rejection data can identify novel subgroups, not appreciated in the annotated classes [19,27]. This is most important in the NO REJECTION diagnosis, which is the most heterogeneous by gene expression (Table 2) and the most frequent diagnosis. This diagnosis lacks evidence of rejection, and subjects usually have a preserved creatinine, yet the gene expression pattern within the NO REJECTION diagnosis is markedly heterogeneous as indicated by clustering. If some gene expression patterns in the NO REJECTION diagnostic category correlate with a subsequent clinical rejection or correlate with renal outcome, then analysis of gene expressions adds value to clinical decisions.
The gene expression data are heterogeneous within the original diagnostic classes because clustering all the principal components creates a better model with fewer misclassi cations. This is mostly likely because pathological diagnoses are complex and critically dependent on microscopic ndings that cannot be identi ed within a mixture of extracted RNAs. For example, tubulitis, which is mononuclear in ammation identi ed within tubules, is required for a diagnosis of TCMR but cannot be captured in a slurry of RNA. In addition, many of the Banff histological lesions used or required for allograft diagnoses are also somewhat non-speci c [10,23,32]. In addition, the rejection classes (TCMR, ABMR, and MIXED) used to categorize the data are summaries of rejections patterns, which have grades from clinically mild to clinically very serious, so that the annotated class diagnoses are heterogeneous and mere summary approximations.
Modelling creates additional challenges for investigators, who wish to use gene expression to inform diagnostic decisions. Diagnostic changes cannot be made solely by gene expression as different modelling algorithms or modelling slightly different sets of principal components do not uniformly assign a consistent pattern of per sample diagnosis. Additional information as covariates might include creatinine trajectory, proteinuria, time after transplant, prior diagnoses, or histological parameters might improve or alter modelling performance [33,34]. This is problematic as investigators using different models or slightly different data sets could reassign diagnoses discordantly. Uncertainties could also arise in clinical trials using allograft transplant gene expression for classi cations if contributors to the clinical trial assign variant gene expression classi cations, depending on how the genes are analyzed or modelled. This problem also applies to aligning disparate studies investigating a similar hypothesis.
Creating additional categories by clustering more diagnostic categories, which makes modelling more consistent and lowers misclassi cation rates. But it is unclear if these additional categories represent biologically relevant diagnostic classes, or if they represent inconsequential minor variants, biopsy sampling error or evolving forms of established diagnoses. Only correlation of pathological diagnoses and gene expression patterns with the endpoint of renal allograft survival or subsequent rejections can resolve such discrepancies and identify the optimal and biologically relevant classes for clinical decisionmaking. This is likely best done by longitudinal analysis of a patient's samples.

Conclusion
In conclusion, these ndings con rm the BHOT gene panel is a useful surrogate for microarrays to identify the expected gene expression patterns in human allografts and would encourage this easier gene expression method. These ndings also con rm the complexity of modelling gene expressions and suggest that reassigning a diagnosis based solely on gene expression is not straightforward and possibly premature. This creates challenges for the application of gene expression for the classi cation of an individual patient's specimen for clinical decision making. Future analytical challenges facing investigators, who desire to use transplant gene expressions for clinical care include: 1) How and which genes are best and most e ciently parsed to create an e cient data analysis pipeline; 2) how is modelling is best performed to assign a diagnostic probability to a patient's sample; 3) what clinical and pathological parameters improve model performance; 4) how to resolve the heterogeneity of gene expression and pathological diagnoses into more homogeneous groups that permit the most accurate modelling and immunological interpretation, and, nally, determine if new and more homogeneous classes are biologically relevant.