XenoCell is a Python-based wrapper around Xenome [5] supplied with functions for data processing. An overview of the XenoCell workflow is shown in Fig. 1 (see Implementation section). The input to XenoCell consists of paired reads from droplet-based single-cell sequencing experiments of xenograft samples from any proportion of host/graft species.
Once the percentage of graft- and host-specific reads has been calculated for every cellular barcode (Step A), the cellular barcodes representing cells from the organism of interest are extracted (Step B) on the basis of a user-defined tolerance threshold (Fig. 2).
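Steps A and B can be illustrated with a minimal Python sketch. This is not XenoCell's actual code: the function names, the read labels ("graft", "host") and the choice to compute the host fraction over species-specific reads only (ignoring reads classified otherwise) are assumptions made for illustration.

```python
from collections import Counter


def host_fraction_per_barcode(read_labels):
    """Step A (sketch): fraction of host-specific reads per cellular barcode.

    read_labels is an iterable of (barcode, label) pairs, where label is
    "graft", "host", or anything else (e.g. ambiguous reads).
    Assumption: only graft- and host-specific reads enter the denominator.
    """
    graft = Counter()
    host = Counter()
    for barcode, label in read_labels:
        if label == "graft":
            graft[barcode] += 1
        elif label == "host":
            host[barcode] += 1

    fractions = {}
    for barcode in set(graft) | set(host):
        total = graft[barcode] + host[barcode]
        fractions[barcode] = host[barcode] / total
    return fractions


def select_barcodes(fractions, lower, upper):
    """Step B (sketch): keep barcodes whose host fraction lies in [lower, upper]."""
    return {bc for bc, frac in fractions.items() if lower <= frac <= upper}


# Toy input: three hypothetical barcodes with different species compositions.
reads = (
    [("AAA", "graft")] * 9 + [("AAA", "host")] * 1
    + [("CCC", "host")] * 10
    + [("GGG", "graft")] * 5 + [("GGG", "host")] * 5
    + [("GGG", "ambiguous")] * 3
)
fractions = host_fraction_per_barcode(reads)
graft_barcodes = select_barcodes(fractions, 0.0, 0.1)  # at most 10% host reads
host_barcodes = select_barcodes(fractions, 0.9, 1.0)   # at least 90% host reads
```

In this toy input the mixed barcode "GGG" falls into neither set, which mirrors how barcodes with substantial reads from both species are discarded by the thresholds.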
To assess the performance of XenoCell, we applied it to a publicly available single-cell gene expression dataset released by 10× Genomics (supplementary data), which is composed of a 1:1 mixture of fresh frozen human (HEK293T) and mouse (NIH3T3) cells (for a total of ~ 5000 cells).
We retrieved graft- and host-specific cellular barcodes, containing a maximum of 10% or a minimum of 90% host-specific reads, respectively. Then, we used the XenoCell-filtered FASTQ output files as input for Cell Ranger to align the reads against the hg19 and mm10 genome assemblies, respectively, resulting in 2532 graft and 2626 host cells, reflecting the initial 1:1 mixture of cells.
We compared our results against the well-established Cell Ranger pipeline, which is currently the only pipeline supporting a comparable approach (albeit exclusively applicable to scRNA-seq data generated with the 10× Genomics platform).
We took advantage of a Cell Ranger function that creates a combined reference genome from two species, aligns reads from a single data set against it, and quantifies the fraction of species-specific reads for each cell of the data set, thereby allowing the cells to be split by their species of origin. To compare the assignment of cells to one of the two species by XenoCell and Cell Ranger in the aforementioned example data set, we prepared a combined reference genome from human (hg19) and mouse (mm10) using this feature of Cell Ranger.
We performed a total of three analyses: (1) alignment of the unfiltered data set to the combined human-mouse reference genome, (2) alignment of human cells extracted with XenoCell to the human reference genome, (3) alignment of mouse cells extracted with XenoCell to the mouse reference genome. The number of cells identified after XenoCell filtering in analyses 2 and 3 confirms the initial 1:1 proportion of human and mouse cells in the data set. Moreover, ~ 97% and 100% of the cells identified as graft- and host-specific by XenoCell, respectively, were classified concordantly by Cell Ranger (Additional file 2: Fig. S1). To check whether filtering of cellular barcodes with XenoCell affects the transcriptional profiles of the single cells, potentially due to the removal of reads classified as ambiguous by Xenome, we aligned the unfiltered sample and the graft-specific cells retrieved by XenoCell to the human reference genome (hg19) using Cell Ranger, and represented the transcriptional profiles in a UMAP projection generated with Seurat (Fig. 3). The results clearly show that the graft cells retrieved by XenoCell occupy the same transcriptional space as in the unfiltered sample, with the other population of cells likely representing mouse cells. Similarly, we aligned the unfiltered sample and the host-specific cells retrieved by XenoCell to the mouse reference genome (mm10) using the same procedure as for the human cells (Fig. 4), obtaining the same result. As a confirmation, we measured the correlation of transcripts per gene between XenoCell-filtered and unfiltered cells, for both graft (hg19, r = 1) and host (mm10, r = 1) cells (Additional file 3: Fig. S2). These results lead to the same conclusion: XenoCell does not introduce a systematic bias to the transcriptional profiles of the investigated cells due to the removal of reads which cannot be unambiguously assigned to either human (graft) or mouse (host).
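The per-gene comparison between filtered and unfiltered cells can be sketched as follows. The text does not state which correlation coefficient was used; this sketch assumes the Pearson coefficient. The function names, gene symbols and counts are hypothetical and serve only to show the calculation.

```python
import math


def pearson_r(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def per_gene_correlation(counts_a, counts_b):
    """Correlate total transcript counts per gene between two samples.

    counts_a and counts_b map gene name -> summed transcript count.
    Genes missing from one sample are treated as having a count of 0.
    """
    genes = sorted(set(counts_a) | set(counts_b))
    x = [counts_a.get(g, 0) for g in genes]
    y = [counts_b.get(g, 0) for g in genes]
    return pearson_r(x, y)


# Hypothetical per-gene totals: the filtered sample retains half of the
# counts of every gene, so the profiles are perfectly proportional.
unfiltered = {"GAPDH": 100, "ACTB": 200, "MYC": 50}
filtered = {"GAPDH": 50, "ACTB": 100, "MYC": 25}
r = per_gene_correlation(unfiltered, filtered)
```

A coefficient close to 1 indicates that read removal scales the counts of all genes proportionally rather than depleting particular genes.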
Moreover, in both cases, we investigated whether the removed cells (visually identifiable in the UMAP projection of the unfiltered sample as the grey cluster that is not present in the graft and host subsets of Figs. 3 and 4, respectively) were correctly assigned to their respective counterpart. As expected, this was the case for 99.7% and 97.9% of cells in the graft-specific and host-specific scenarios, respectively. The remaining 7 (0.3%) and 47 (2.1%) cells were found to have significant UMI counts for both organisms (range 16–87%), and were therefore correctly discarded by the imposed thresholds (0–10% of host-specific reads for cellular barcodes from the graft; 90–100% of host-specific reads for cellular barcodes from the host; Additional file 4: Fig. S3).
Overall, XenoCell and the multi-species analysis with Cell Ranger produced concordant results. However, the functionality of Cell Ranger to align reads from scRNA-seq experiments to a multi-species reference genome is only available for samples generated by the 10× Genomics scRNA-seq kits, thereby limiting its applicability. Instead, XenoCell offers the flexibility to set a threshold on the permitted fraction of host/graft-specific reads, depending on the biological question the user poses, and is not restricted to any particular technology or library preparation kit.
In this light, we assessed the effectiveness of XenoCell on a publicly available PDX dataset [14] generated using the Drop-seq protocol [13]. When we applied XenoCell to this dataset, we detected 11 cellular barcodes containing more than 250,000 host-specific reads and less than 5% graft-specific reads (Additional file 5: Fig. S4), which would likely pass common cell whitelisting methods and ultimately contaminate the dataset. To confirm our suspicion, we performed three separate analyses: (1) alignment of human cells extracted with XenoCell to the human reference genome, (2) alignment of mouse cells extracted with XenoCell to the mouse reference genome, (3) alignment of the unfiltered data set to the human reference genome. Using STARsolo with default parameters to generate transcript count matrices, and intersecting the sets of identified cells from the different analyses, we found that 19 of the 581 cells (3.3%) identified in the unfiltered dataset belonged to the host and were successfully removed by XenoCell in the purified graft sample.
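The intersection of cell sets described above can be sketched in a few lines of Python. The function names and barcodes are hypothetical; in practice the sets would be read from the cell lists produced by each alignment run.

```python
def contaminating_cells(unfiltered_cells, host_cells):
    """Return the cells of the unfiltered analysis that also appear in the
    host-specific analysis, together with their share of the unfiltered set."""
    unfiltered = set(unfiltered_cells)
    shared = unfiltered & set(host_cells)
    fraction = len(shared) / len(unfiltered) if unfiltered else 0.0
    return shared, fraction


def percent(part, whole):
    """Percentage rounded to one decimal place, as reported in the text."""
    return round(100 * part / whole, 1)


# Toy barcodes (hypothetical): one of four unfiltered cells is a host cell.
unfiltered = ["AAAC", "CCGT", "GGTA", "TTAG"]
host = ["GGTA", "ACGT"]
shared, fraction = contaminating_cells(unfiltered, host)

# Arithmetic behind the reported figure: 19 of 581 cells.
reported = percent(19, 581)
```

With the counts reported above, 19 host cells among 581 unfiltered cells corresponds to 3.3%.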
These results suggest that upstream biochemical or physical strategies to purify xenograft samples by removing contaminating cells from the host organism might not always be fully successful, and, therefore, bioinformatic pre-processing of the sequenced data should become a routine practice in single-cell experiments from xenografts. Overall, we demonstrated that XenoCell can be applied to data generated by multiple single-cell technologies and we tested its effectiveness on a mixed-species dataset and a real PDX dataset.