All about scATAC-seq's quality control

One main challenge in scATAC-seq analysis is distinguishing the good cells (high-quality cells), the bad cells (low-quality cells), and the ugly cells (doublets).

The good, the bad, and the ugly in scATAC-seq

The bad: low-quality cells

Why should I perform quality control?

Low-quality cells and doublets can distort downstream analyses: in clustering they can form a spurious cluster of their own, and in trajectory analysis they can create artificial trajectories. The OSCA book has a great explanation of this issue.

What metrics are used to distinguish between the good and the bad?

Different frameworks prefer different metrics for quality control. For example, ArchR and SnapATAC2 use TSS enrichment and the number of unique fragments, while Muon uses TSS enrichment and nucleosome signal. To work out the best quality control strategy, I first grouped the metrics into the following categories.

Sequencing depth

As with scRNA-seq, scATAC-seq data has a sequencing depth, which is reflected in the total number of fragments in peaks, the number of unique fragments, the number of unique peaks, and so on. Low sequencing depth indicates low quality. High sequencing depth may indicate doublets, nuclei clumps, or other artefacts.
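
As a rough illustration of the "unique fragment number" idea, the sketch below counts distinct fragments per cell barcode. It assumes the fragments are already available as (chromosome, start, end, barcode) tuples; the function name and the record layout are mine, not those of any particular framework.

```python
from collections import defaultdict


def unique_fragments_per_barcode(records):
    """Count distinct (chrom, start, end) fragments for each barcode.

    `records` is an iterable of (chrom, start, end, barcode) tuples.
    This layout is an assumption made for the example.
    """
    seen = defaultdict(set)
    for chrom, start, end, barcode in records:
        # A set drops exact duplicates (same coordinates, same barcode).
        seen[barcode].add((chrom, start, end))
    return {barcode: len(frags) for barcode, frags in seen.items()}
```

Barcodes with very few unique fragments are candidates for the low-depth group, and barcodes with unusually many are worth checking for doublets or clumps.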

Transcriptional start site (TSS) enrichment score

The TSS enrichment score is the ratio of fragments centered at the TSS to fragments in TSS-flanking regions. This metric is used in nearly every scATAC-seq framework. Both extremely high and extremely low TSS scores are bad signs.
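
To make the ratio concrete, here is a minimal sketch for a single cell on a single chromosome. The window sizes (a 50 bp half-window around the TSS, flanks at 1900 to 2000 bp away) are assumptions chosen for illustration; each framework uses its own windows and smoothing, so the numbers will not match any specific tool.

```python
def tss_enrichment(cut_sites, tss_positions, center=50,
                   flank_inner=1900, flank_outer=2000):
    """Per-base signal near TSSs divided by per-base signal in the flanks.

    `cut_sites` and `tss_positions` are integer coordinates on one
    chromosome. All window sizes are illustrative assumptions.
    """
    center_count = 0
    flank_count = 0
    for pos in cut_sites:
        for tss in tss_positions:
            distance = abs(pos - tss)
            if distance <= center:
                center_count += 1
            elif flank_inner <= distance <= flank_outer:
                flank_count += 1
    # Normalize by window width so the two densities are comparable.
    center_width = 2 * center + 1
    flank_width = 2 * (flank_outer - flank_inner + 1)
    flank_density = flank_count / flank_width
    if flank_density == 0:
        return float("nan")  # no background signal, score undefined
    return (center_count / center_width) / flank_density
```

A cell whose cut sites pile up at TSSs gets a high score, while a cell with uniformly scattered cut sites scores close to 1.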

Fraction of fragments in peaks (FRiP)

FRiP is the fraction of all fragments that fall within ATAC-seq peaks. It is a good indicator of the signal-to-noise ratio.
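
A minimal sketch of FRiP for one cell follows. It assumes fragments and peaks are half-open (start, end) intervals on the same chromosome, and it counts a fragment as "in a peak" if it overlaps a peak by at least one base. Whether tools require overlap or full containment varies, so treat that choice as an assumption.

```python
def frip(fragments, peaks):
    """Fraction of fragments overlapping at least one peak.

    `fragments` and `peaks` are lists of half-open (start, end)
    intervals on one chromosome (an assumption for this example).
    """
    if not fragments:
        return 0.0
    in_peaks = sum(
        1
        for f_start, f_end in fragments
        # Two half-open intervals overlap when each starts before the other ends.
        if any(f_start < p_end and p_start < f_end for p_start, p_end in peaks)
    )
    return in_peaks / len(fragments)
```

The nested scan is fine for a toy example; real data would need sorted intervals or an interval tree.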

Nucleosome banding pattern

Because chromatin is packaged into units called nucleosomes, fragment sizes show a periodic distribution that follows nucleosome size. In high-quality cells, most fragments are nucleosome-free (<147 bp).
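
One common way to turn the banding pattern into a single number is a nucleosome signal: the ratio of mononucleosome-sized fragments to nucleosome-free fragments. The sketch below uses 147 bp and 294 bp as the bin boundaries; the 294 bp upper bound is an assumption based on my understanding of the usual convention, and the function name is hypothetical.

```python
def nucleosome_signal(fragment_lengths, nfr_max=147, mono_max=294):
    """Ratio of mononucleosomal to nucleosome-free fragments.

    Fragments shorter than `nfr_max` count as nucleosome-free and
    fragments from `nfr_max` to `mono_max` bp as mononucleosomal.
    The bin boundaries are assumptions for illustration.
    """
    nucleosome_free = sum(1 for n in fragment_lengths if n < nfr_max)
    mononucleosomal = sum(1 for n in fragment_lengths if nfr_max <= n <= mono_max)
    if nucleosome_free == 0:
        return float("inf")  # no short fragments at all
    return mononucleosomal / nucleosome_free
```

A lower value means more nucleosome-free fragments, which is what we expect from a high-quality cell.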

Others

For example, the duplication fraction in pycisTopic and the fraction of reads in genomic blacklist regions in Signac.

How to choose quality control metrics?

A good metric should clearly distinguish good cells from bad cells. So we begin by checking whether the metrics above are informative.

We observed that the four metrics separate bad cells (red box) from good cells well. Combining the metrics makes the separation even clearer.

How to filter cells based on QC metrics

While cells can be low quality for different reasons (e.g. low sequencing depth or high nucleosome signal), we noticed that high-quality cells usually cluster together. This gives us an intuition: we can select cells with an unsupervised clustering method such as k-means.

Based on this idea, I applied k-means to the QC metrics to assess whether it can select high-quality cells.

I selected TSS enrichment, nucleosome signal, percentage of reads in peaks, and number of fragments as features. After normalization, I applied k-means.
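
The idea can be sketched in a few lines of plain Python. This is not the code behind the figures: it assumes z-score normalization of each metric, and it uses a deterministic farthest-point initialization instead of the random starts a library implementation would use. All function names are hypothetical.

```python
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale every QC metric (column) to mean 0 and standard deviation 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) or 1.0 for col in columns]  # guard against constant columns
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def sq_dist(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100):
    """Lloyd's algorithm with farthest-point initialization (a simplification)."""
    centers = [list(points[0])]
    while len(centers) < k:
        farthest = max(points, key=lambda p: min(sq_dist(p, c) for c in centers))
        centers.append(list(farthest))
    labels = None
    for _ in range(n_iter):
        # Assign each cell to its nearest center.
        new_labels = [
            min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Move each center to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = [mean(col) for col in zip(*members)]
    return labels, centers
```

Each row passed to `zscore_columns` would hold one cell's metrics, for example `[tss_enrichment, nucleosome_signal, frip, n_fragments]`. After clustering, you inspect the cluster centers and keep the cluster whose center looks like a good cell (high TSS enrichment, low nucleosome signal, high FRiP). For real datasets an established library implementation would be the more usual choice than this hand-rolled version.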

As we can see, k-means is excellent at selecting high-quality cells. Low-quality cells group together into a low-sequencing-depth cluster, a high-nucleosome-signal cluster, and a low-TSS-score cluster.

I tried different numbers of clusters in k-means. As we can see, the method is very robust: high-quality cells always cluster together until the number of clusters reaches 6.

What about other QC methods?

I compared the QC methods provided by various frameworks. The results are shown below.

Not surprisingly, the different methods give similar results. But which quality control method is best? I used (total cell number - unique cell number) to roughly estimate the performance of each method.

As we can see, k-means performs best on our dataset.

What is Cell Ranger’s filtering method?

You may notice that Cell Ranger’s quality control is very strict. While other methods keep 18000-20000 cells, Cell Ranger’s QC keeps only 11577 cells. You can find the filtering results in filtered_peak_bc_matrix.h5 and singlecell.csv. From its pipeline, we can see that it removes doublets and low-quality cells with its own methods.

It uses a cell-calling heuristic to filter out cells with low sequencing depth. Because the filtering is so stringent, I don’t think starting from Cell Ranger’s filtered results is a good idea.

Should I use a strict or a loose QC threshold?

I suggest starting with a loose threshold. The worst outcome of a loose QC threshold is an artificial cluster or trajectory, which you can usually spot downstream. If you start with strict QC, however, you may never know which discoveries you have missed.

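The ugly: doublets

Common strategies for finding doublets

There are two strategies for detecting doublets. One approach is based on simulation, the other on coverage. Simulation-based methods create artificial doublets by combining the profiles of randomly chosen cells, then flag real cells that resemble them. Coverage-based methods rely on the fact that a diploid nucleus should yield at most two overlapping fragments at any locus, so a cell with many loci covered by more than two fragments is likely a doublet.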
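
To give a feel for the coverage-based strategy, the sketch below counts, for one cell, the regions where more than two fragments overlap. This is only the counting idea in its simplest form, under the assumption of half-open (start, end) fragments on one chromosome. Published coverage-based tools do considerably more on top of it, so do not read this as the implementation of any of them.

```python
def count_overcovered_regions(fragments, max_copies=2):
    """Count regions where more than `max_copies` fragments overlap.

    `fragments` is a list of half-open (start, end) intervals from one
    cell on one chromosome (an assumption for this example).
    """
    events = []
    for start, end in fragments:
        events.append((start, 1))   # a fragment opens
        events.append((end, -1))    # a fragment closes
    # At equal positions, closes (-1) sort before opens (+1), which
    # matches half-open intervals: touching fragments do not overlap.
    events.sort()
    depth = 0
    regions = 0
    for _, delta in events:
        previous = depth
        depth += delta
        # Count each time coverage rises above the allowed copy number.
        if previous <= max_copies < depth:
            regions += 1
    return regions
```

A singlet from a diploid genome should return a value near zero, while a doublet tends to accumulate such regions across the genome.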

What’s the best way to detect doublets?

As the author of scDblFinder suggests, it is better to combine the two approaches (coverage-based and simulation-based) to find doublets. Since most frameworks’ doublet detection methods are simulation-based, I compared the results of scDblFinder, ArchR (demuxlet), and SnapATAC2 (scrublet). The more closely a method’s results agree with AMULET (a coverage-based method), the better I consider it.

As we can see, ArchR performed poorly, while SnapATAC2 and scDblFinder both performed well.

In fact, the results of SnapATAC2 (scrublet) and scDblFinder are similar. If you prefer strict quality control, use scDblFinder; otherwise, use SnapATAC2.

Take home message

K-means is a good strategy for filtering low-quality cells.
It is better not to start from Cell Ranger’s filtered results.
Do not use ArchR to detect doublets.
Combine AMULET with scDblFinder or SnapATAC2 to detect doublets.
