Review of TCGA transcriptomic and clinical annotation data download
The TCGA data portal exposes the clinical data commons, which is publicly available for data mining in the clinical databank [29]. These data are accessible in multiple ways, including bulk/batch API access, the TCGAbiolinks package via Bioconductor, and cart building on the portal website through a patient-by-patient search [29]. Currently, no unified patient disease progression information is directly available for bulk data mining on the portal website. Our progression annotation was therefore built by text mining the clinical files project by project using the batch query function in the TCGAbiolinks package. Each patient has multiple unique identifiers. In a project-by-project manner, each Case ID was cataloged. Each Case ID query produced a case UUID that was used across data types, including gene expression counts, VCF files, FASTQ files, slide images, and clinical annotations for each experiment for each patient. Each UUID produces a patient summary, which was broken down into data category, experimental strategy, clinical annotations, and clinical supplemental files. The transcriptome counts files for each project were downloaded, normalized, and analyzed.
Each project has between 53 and 261 clinical annotation columns. The stringr and dplyr packages were used for clinical annotation, data cleaning, and anatomical annotation [30]. Metastatic tumors identified in the clinical annotation file were drawn from the “metastatic tissue”, “sites of metastases”, or “metastatic tissue site” column(s). Tumors whose progression was labeled “synchronous” were excluded from the metastatic data because the clinical timeline of diagnosis was ambiguous: depending on the tumor type, a tumor may be classified as synchronous if it is detected anywhere from the time of diagnosis up to 6 months afterward.
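For readers who prefer scripting the retrieval directly, the sketch below shows a minimal bulk query against the GDC API that maps each Case ID (submitter_id) to its case UUID. The study itself used the TCGAbiolinks batch query in R; the project ID and field list here are illustrative only.

```python
import json
import requests

# Minimal bulk query against the public GDC API (https://api.gdc.cancer.gov).
# This mirrors the Case ID -> case UUID lookup described above.
CASES_ENDPOINT = "https://api.gdc.cancer.gov/cases"

filters = {
    "op": "in",
    "content": {"field": "project.project_id", "value": ["TCGA-BRCA"]},  # illustrative project
}
params = {
    "filters": json.dumps(filters),
    "fields": "case_id,submitter_id,project.project_id",  # case UUID + Case ID
    "format": "JSON",
    "size": "2000",
}

response = requests.get(CASES_ENDPOINT, params=params)
response.raise_for_status()
for hit in response.json()["data"]["hits"]:
    # Each submitter_id (Case ID) maps to a case_id (UUID) reused across data types.
    print(hit["submitter_id"], hit["case_id"])
```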
Review of synthetic sample generation
Synthetic samples were generated with the SMOTE algorithm to balance positive and negative classes, where positive classes were tumors that developed a metastasis in the tested location and negative classes were tumors that did not [31]. Briefly, the Synthetic Minority Oversampling Technique (SMOTE) is an algorithm that increases the representation of a minority class in machine learning classification problems. Its objective function sits on top of a distance-based k-nearest-neighbors (KNN) algorithm. The technique begins by selecting a minority class instance and finding that instance’s k nearest neighbors. One of the minority class neighbors is chosen at random, a line is drawn between the two instances, and a synthetic sample is generated along that line as a convex combination of the two real instances. This process repeats until the desired number of synthetic samples has been created. The number of synthetic samples generated was specific to each binary comparison. The authors suggest that the SMOTE algorithm can generate a large number of representative synthetic samples; however, how many can be generated without overfitting the model is unknown. We therefore employed an overfit prevention method during sample balancing: we computed 80% of the majority class size and oversampled the minority class to approximately that number, rounded to the nearest integer.
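As a minimal sketch, assuming the imbalanced-learn implementation of SMOTE (the specific library is not named above), the balancing rule corresponds to sampling_strategy=0.8, which oversamples the minority class to 80% of the majority class size:

```python
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 20))        # toy expression matrix
y = np.array([0] * 900 + [1] * 100)    # 0 = no metastasis at site, 1 = metastasis

# sampling_strategy=0.8 raises the minority class to ~80% of the majority
# class size (rounded to the nearest integer), matching the overfit
# prevention rule described above.
smote = SMOTE(sampling_strategy=0.8, k_neighbors=5, random_state=0)
X_bal, y_bal = smote.fit_resample(X, y)
print(np.bincount(y_bal))  # -> [900 720]
```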
Review of feature selection
Feature selection is a method in model building for reducing the dimensionality of a dataset. Overfitting can occur when the number of columns (features) outnumbers the rows (instances) available to the model. To reduce the dimensionality of the problem we employed three kinds of feature selection: filter-based, wrapper-based, and embedded. Chi-square filtering computes the chi-square statistic between each numerical variable and the target and retains only the features with the largest chi-square values. The SelectKBest, chi2, and MinMaxScaler classes from the scikit-learn feature_selection and preprocessing modules were used [32]. A recursive feature elimination (RFE) estimator iteratively reduced the dimensionality of the dataset by recursively considering smaller and smaller subsets of each feature block; the RFE was trained on each initial block of features, and the importance of each feature was obtained through the estimator’s feature importance attribute. The RFE and LogisticRegression classes from scikit-learn were used [32]. For embedded methods, a random forest classifier, a random forest regressor, and lasso regression (a logistic regression estimator with an L1 penalty) were employed; these algorithms have an embedded feature importance measure used to stratify and rank features. The SelectFromModel, RandomForestClassifier, and RandomForestRegressor classes were imported from scikit-learn [32, 33]. We cross-validated these approaches by extracting support values from each method with the get_support method, summing the true feature-support Booleans for each feature in each block across all five methods, and sorting features by selection support, as sketched below.
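A condensed sketch of the five-method support tally on a toy 600-feature block; the toy data, k, and hyperparameters are illustrative rather than the study’s settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.feature_selection import RFE, SelectFromModel, SelectKBest, chi2
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import MinMaxScaler

X, y = make_classification(n_samples=200, n_features=600, random_state=0)
X_scaled = MinMaxScaler().fit_transform(X)  # chi2 requires non-negative inputs

k = 60  # features to retain per method (illustrative)
selectors = [
    SelectKBest(chi2, k=k).fit(X_scaled, y),                                  # filter
    RFE(LogisticRegression(max_iter=1000),
        n_features_to_select=k, step=20).fit(X, y),                           # wrapper
    SelectFromModel(RandomForestClassifier(random_state=0),
                    max_features=k, threshold=-np.inf).fit(X, y),             # embedded
    SelectFromModel(RandomForestRegressor(random_state=0),
                    max_features=k, threshold=-np.inf).fit(X, y),
    SelectFromModel(LogisticRegression(penalty="l1", solver="liblinear"),
                    max_features=k, threshold=-np.inf).fit(X, y),             # lasso
]

# Sum the Boolean support of each feature across all five methods and
# rank features by cross-method agreement.
support = np.sum([s.get_support() for s in selectors], axis=0)
ranked = np.argsort(support)[::-1]
```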
Iterative feature selection was conducted by splitting the 60,483 transcript features into 100 blocks of approximately 600 features, each assessed by the algorithms above. We extracted support values for each feature from each selection method, and each block was assessed independently of all other blocks in each classification. Transcripts were filtered for features that showed the highest cross-validated support across multiple or all algorithms. Dimensionality was further reduced by filtering out collinear features. The top 10% of highest-scoring features were kept from each block, for a total of approximately 5,000 candidate transcripts (roughly 50 transcripts × 100 blocks). These retained transcripts were used as the input features in each binary classification. Tree-based models were selected as the best fit for the classification, both to account for the variability in the number of selected features in each classification and to allow model attributions to be extracted post hoc.
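The block loop can be expressed compactly. The sketch below is a schematic: score_block stands in for the cross-method support tally above, and the chi-square scorer is shown only as an illustration.

```python
import numpy as np
from sklearn.feature_selection import chi2
from sklearn.preprocessing import MinMaxScaler

def select_in_blocks(X, y, score_block, n_blocks=100, keep_frac=0.10):
    """Split features into ~equal blocks, score each block independently,
    and keep the top `keep_frac` of features per block."""
    blocks = np.array_split(np.arange(X.shape[1]), n_blocks)
    kept = []
    for block in blocks:
        scores = score_block(X[:, block], y)          # per-feature support/score
        n_keep = max(1, round(keep_frac * len(block)))
        kept.extend(block[np.argsort(scores)[::-1][:n_keep]].tolist())
    return kept

# Illustrative scorer: chi-square statistic (the study instead tallied
# get_support Booleans across the five selection methods).
chi2_scorer = lambda Xb, yb: chi2(MinMaxScaler().fit_transform(Xb), yb)[0]
```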
Review of model building
Random forest and gradient boosted tree classifiers were built to classify site-specific progression from primary tumors. The selected features in each binary classification were used as input attributes for model classification. Each model is set to report a rounded (hard) class label but is also capable of reporting the posterior probability of each class. The code and the pretrained models are available through the documented GitHub repository, and model building and usage are documented on the GitHub wiki page.
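A minimal sketch of the classifier construction in scikit-learn; the placeholder data and hyperparameters (e.g., n_estimators) are assumptions, not the study’s tuned settings:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import train_test_split

# Placeholder for the matrix of selected transcripts per sample;
# y = 1 means a metastasis developed at the tested site.
X_sel, y = make_classification(n_samples=500, n_features=100, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X_sel, y, stratify=y, random_state=0)

rf = RandomForestClassifier(n_estimators=500, random_state=0).fit(X_train, y_train)
gbt = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

hard_labels = rf.predict(X_test)         # rounded class labels
posteriors = rf.predict_proba(X_test)    # posterior probability per class
```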
Review of feature recapture
Feature recapture was the final phase of model building and analysis. Testing the statistical significance of feature recapture in independently generated lists following bioinformatic analysis is an indirect but well-documented technique for detecting non-random enrichment [34]. Two sets of feature recapture were analyzed and displayed in Additional file 1: Table S7: recapture within a cancer class across seeding loci, and recapture between cancer classes metastasizing to matching locations. Fisher’s exact test was used to evaluate the significance of recapture between lists, as the significance of the deviation from the null hypothesis can be calculated directly. Our null hypothesis was that feature recapture across matched seeding locations in different cancer types occurred by chance, so no biological meaning could be drawn from the phenomenon. Our alternative hypothesis was that recapture of features within a class or between matching seeding locations indicates similar distant metastatic potential and offers candidate biomarkers for organotropic metastasis, respectively. The background of each contingency table was set to the search space of the feature selection algorithms. Because every binary comparison initially considered all 60,483 transcripts, and each set of selected features was independently generated, the entire human transcriptome remained the background for all tests. In list A of each contingency table we placed the top 1,000 features for each classification of primary tumor seeding location; in list B we placed the feature list of a second primary tumor type and/or metastatic location. We then tested the significance of the intersection of the two lists given the list sizes, background, and overlap in the contingency table. The GeneOverlap package on Bioconductor was used to conduct the Fisher’s exact tests [35].
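The study used the GeneOverlap R package [35]; an equivalent test can be sketched in Python with scipy, with the 60,483-transcript background fixed across comparisons:

```python
from scipy.stats import fisher_exact

BACKGROUND = 60483  # all transcripts considered by every classification

def recapture_test(list_a, list_b, background=BACKGROUND):
    """Fisher's exact test on the overlap of two feature lists
    against a shared background."""
    a, b = set(list_a), set(list_b)
    overlap = len(a & b)
    table = [
        [overlap, len(a) - overlap],                 # in A: also in B / not in B
        [len(b) - overlap, background - len(a | b)], # not in A: in B / in neither
    ]
    odds_ratio, p_value = fisher_exact(table, alternative="greater")
    return overlap, odds_ratio, p_value
```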
Gene set overrepresentation and semantic analysis
The clusterProfiler package was used to conduct an overrepresentation test against the GO database [36]. The selected features for each metastatic location in each cancer type were translated into their associated GO biological process IDs using the bitr function in the clusterProfiler package [36]. The overrepresented GO biological processes were passed to the GOSemSim and simplifyEnrichment packages [37]. A similarity matrix of biological functions was built with the simplifyEnrichment package in R [38], and a heatmap was produced by clustering the similarity scores of the biological functions using the package’s default binary cut method. A Fisher’s exact test was conducted using the GeneOverlap package in R [35]; the background was changed from the human transcriptome to the GO database to account for the change in search space [39]. The UpSetR package was used to display the bar graph of overlapping biological processes among tumors seeding matched locations [40]. All overlaps were tested between cancers metastasizing to similar organs.
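The same contingency framework applies after the enrichment step, with the background swapped from the transcriptome to the GO term universe; the term lists and universe size below are purely hypothetical placeholders:

```python
from scipy.stats import fisher_exact

# Hypothetical GO biological process ID lists for two cancers seeding the same organ.
go_a = {"GO:0001525", "GO:0030198", "GO:0045765"}
go_b = {"GO:0001525", "GO:0030198", "GO:0008284"}
go_universe = 15000  # placeholder: set to the size of the GO BP universe actually used

overlap = len(go_a & go_b)
table = [[overlap, len(go_a) - overlap],
         [len(go_b) - overlap, go_universe - len(go_a | go_b)]]
print(fisher_exact(table, alternative="greater"))
```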
Data availability and code
We used public datasets drawn from the TCGA database via the GDC Data Commons for this project and its analyses [41, 42]. We have provided all of the custom computer code used to produce these models.
Our code is available for viewing and use in a public GitHub repository: https://github.com/michaelSkaro/Classification_of_organotropic_metastases. A Docker image containing all relevant environment variables, dependencies, and a demo test dataset is also publicly available on Docker Hub and integrated into the repository’s GitHub Actions. A documented wiki page is available that demonstrates installation, displays visualizations, and describes script usage within the pipeline.
We have provided a general usage script that runs the entire metastatic classification pipeline; at the command line it can be run via the metastasis_pipeline.py script within the built Docker container. We have also provided a general usage feature selection pipeline, Feature_selection.py. The organotropic feature sets for all cancer types selected in this study are provided in the Additional file 1 data tables, and all enrichment and recapture code is provided in the source code.