Sequencing depth - Hubrecht Genome Facility

Sequencing depth / Coverage

Before starting an NGS experiment, you need to determine the sequencing depth, which is roughly defined as the number of reads covering a given region of the genome in an experiment. In general, the required sequencing depth depends on the type of study, the expression levels, the size of the reference genome, the published literature, and the best practices defined by the scientific community.

The sequencing depth directly affects the reproducibility of variant detection: the higher the number of aligned sequence reads, the higher the confidence with which a base is called at a particular position. This holds regardless of whether the called base matches the reference base or is mutated. In other words, individual reads containing sequencing errors are statistically irrelevant when they are outnumbered by correct reads. The deeper the sequencing, the more robust your results will be. However, this comes at a price. Therefore, estimating the number of reads you require helps you to reduce the costs of your experiment.

Increasing the coverage has different consequences for different techniques:

ChIC-seq

Increasing the coverage leads to better-defined peaks, improved manifold estimation/clustering, and less sparsity in the single-cell data. Basically, you pick up the genes of interest in more of the individual cells.

NlaIII/Karyo-seq

Increasing the coverage reduces allelic dropout, resulting in a higher chance of capturing both alleles. Additionally, it becomes possible to detect smaller copy number aberrations.

Choosing the right sequencing depth is a trade-off between the sequencing costs, the complexity of the library (i.e. the number of unique sequences present in the library) and the required sequencing depth to enable statistically significant biological conclusions. Sequencing more reads will generally increase both the power of your assay and the chances of covering rare events.

When sequencing at shallow depth, almost all reads will be unique, i.e. the exact same read sequence will be seen only once. When sequencing at greater depth, some reads may be encountered more than once due to amplification biases or other reasons. We identify these duplicate reads with UMIs (Unique Molecular Identifiers) incorporated in the library.

As an example, sequencing a library at shallow depth might recover 0.9 unique molecules for every sequenced read. At medium depth, more reads would be seen multiple times and only 0.5 unique molecules per read would be recovered. If one were to sequence such a library even deeper, perhaps only 0.1 unique molecules per read would be recovered, meaning that 90% of sequencing reads (and thus costs) would be wasted.

The best decision therefore depends on the whole experimental design. Sometimes sequencing at low depth will be sufficient (e.g. CNV calling in scNlaIII/Karyo-seq libraries), while in other cases more depth may be needed and the sequencing costs may limit the number of cells sequenced (e.g. SNV calling in complex scNlaIII/Karyo-seq libraries). For relatively low complexity libraries it may often be better to sequence at lower depth (to avoid wasting sequencing on duplicate reads) but to sequence more cells, especially if these cells are clustered in later analysis, because each read in a single cell then also adds information on all cells in that cluster. Finally, there may be cases where library complexity is unusually low or not sufficient for the experimental design, and new, more complex libraries should be made.
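The relation between sequenced reads and unique molecules can be illustrated with a minimal Python sketch. It assumes each read has already been reduced to a (cell barcode, genomic position, UMI) tuple and that reads sharing all three values are duplicates of one molecule; this is a simplification for illustration, not the facility's actual deduplication pipeline. The function name and the toy data are hypothetical.

```python
def unique_fraction(reads):
    """Return the number of unique molecules recovered per sequenced read.

    `reads` is an iterable of (cell_barcode, position, umi) tuples.
    Assumption: reads with identical tuples are duplicates of the same
    original molecule (real pipelines may also tolerate UMI mismatches).
    """
    reads = list(reads)
    if not reads:
        raise ValueError("no reads given")
    unique_molecules = set(reads)
    return len(unique_molecules) / len(reads)


# Hypothetical toy data, not taken from a real library.
example_reads = [
    ("cell_1", "chr1:1000", "ACGT"),
    ("cell_1", "chr1:1000", "ACGT"),  # duplicate of the first read
    ("cell_1", "chr1:1000", "TTAG"),  # same position, different UMI
    ("cell_2", "chr1:1000", "ACGT"),  # same position and UMI, different cell
]

fraction = unique_fraction(example_reads)
print(f"unique molecules per read: {fraction:.2f}")
print(f"fraction of reads spent on duplicates: {1 - fraction:.2f}")
```

A value close to 1 suggests the library has not yet been sequenced to saturation, while a low value suggests that additional reads would mostly be duplicates.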

We recommend sequencing at shallow depth initially, as we can always re-sequence your samples.

One could use the following simple formula to get a rough idea of the required sequencing depth:

Number of required reads = ( desired coverage * expected size of the genome * # expected cells ) / read length

desired coverage
Ideally, the coverage for DNA-seq experiments is 30x. This means every nucleotide in your reconstructed sequence is covered by 30 sequencing reads on average. For scDNA-seq this would be very costly and, for many types of experiments (like karyotyping), not even necessary. Usually the coverage will be between 0.005x and 0.1x.
expected size of the genome
This is dependent on your cell type and the technique you are using. When performing scNlaIII/scKaryo-seq, this relates to the total size of the genome of the processed cells. With scChIC-seq, researchers look at genomic regions related to a certain histone mark. Therefore, it depends on the abundance of this particular mark and again the size of the genome.
# expected cells
As most of our services are plate-based, one would expect to have 376 cells (384 – 8 empty wells as control) per plate. In practice, not all cells will be of good quality and some cells will drop out because of technical artefacts. We aim to get at least 3/4 of a plate of good cells.
read length
This variable depends on the sequencing platform. By default, we use the Illumina NextSeq 2000 P2 2×100 bp platform, so the read length is 100 bp.
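As a rough worked example of the formula above, the following Python sketch plugs in illustrative values. The coverage of 0.01x lies within the range mentioned above, 376 cells corresponds to one plate, and the read length is 100 bp. The genome size of 3.1e9 bp (approximately a human genome) and the function name are assumptions made for this example only.

```python
def required_reads(desired_coverage, genome_size_bp, n_cells, read_length_bp):
    """Number of required reads =
    (desired coverage * expected genome size * # expected cells) / read length
    """
    if read_length_bp <= 0:
        raise ValueError("read length must be positive")
    return desired_coverage * genome_size_bp * n_cells / read_length_bp


# Illustrative inputs; the genome size is an assumed, approximate value.
reads_needed = required_reads(
    desired_coverage=0.01,   # 0.01x, within the 0.005x to 0.1x range
    genome_size_bp=3.1e9,    # assumption: roughly a human genome
    n_cells=376,             # one plate: 384 wells minus 8 empty controls
    read_length_bp=100,      # default 2x100 bp run
)
print(f"approximately {reads_needed / 1e6:.0f} million reads")
```

With these inputs the estimate comes out at roughly 117 million reads for one plate. Treat it as an order-of-magnitude guide rather than an exact requirement.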

When using the Mission Bio Tapestri platform, we recommend the following formula (typical coverage in this case would be 60-80x):

Number of required reads = desired coverage * # amplicons * # expected cells
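The Tapestri formula can be sketched in the same way. The coverage of 70x is taken from the typical 60-80x range above; the panel size of 100 amplicons and the 5,000 expected cells are hypothetical values chosen only to illustrate the arithmetic, and the function name is likewise an assumption.

```python
def required_reads_tapestri(desired_coverage, n_amplicons, n_cells):
    """Number of required reads =
    desired coverage * # amplicons * # expected cells
    """
    return desired_coverage * n_amplicons * n_cells


tapestri_reads = required_reads_tapestri(
    desired_coverage=70,  # within the typical 60-80x range
    n_amplicons=100,      # hypothetical panel size
    n_cells=5000,         # hypothetical number of expected cells
)
print(f"{tapestri_reads / 1e6:.0f} million reads")
```

For these example values the formula gives 35 million reads. Substitute the amplicon count of your own panel and your expected cell number.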