DESeq2 - overview

DESeq2


gribskov

DESeq2 is one of the best programs around for RNA-Seq based differential gene expression analysis.

Requirements

DESeq2 requires un-normalized read counts, i.e. the number of reads mapped to each transcript, NOT transcripts per million (TPM) or reads per kilobase per million mapped reads (RPKM). Generally you will have multiple biological replicates of each sample. DO NOT average the replicates. Such counts can be obtained using many programs; my recommendation is that you use Salmon in either mapping or alignment mode. Another option would be StringTie. Once you have your counts, it will usually be most convenient to do the analysis on your personal computer, since neither the count files nor the DEG output are very large.
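
A quick sanity check can catch a TPM or RPKM table before it goes into DESeq2. The base R sketch below is only an illustration: `looks_like_raw_counts()` is a hypothetical helper, not part of DESeq2, and the toy matrices are made up. Note that Salmon's estimated read counts can be fractional, so they would fail the whole-number test until they are rounded.

```r
# Hypothetical helper (not a DESeq2 function): TRUE when a genes x samples
# matrix looks like raw counts rather than TPM-style values.
looks_like_raw_counts <- function(m) {
  m <- as.matrix(m)
  is_whole     <- all(m == round(m))   # raw counts are whole numbers
  non_negative <- all(m >= 0)
  # TPM columns each sum to one million; raw count columns almost never do.
  tpm_like <- all(abs(colSums(m) - 1e6) < 1)
  is_whole && non_negative && !tpm_like
}

# Toy data: 3 genes x 2 samples of raw counts.
raw <- matrix(c(10, 0, 250, 31, 3, 198), nrow = 3,
              dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))

# The same data rescaled so that each sample sums to one million, as TPM does.
tpm <- sweep(raw, 2, colSums(raw), "/") * 1e6

looks_like_raw_counts(raw)  # TRUE
looks_like_raw_counts(tpm)  # FALSE
```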

Step-by-step

  1. Copy the count file to your local computer. It's easiest if you have the counts for all samples in a single file, with the genes/transcripts as rows and the samples as columns.
  2. Load the counts into a DESeqDataSet object. This requires defining your metadata matrix and experimental design (see the vignette).
  3. Run DESeq()
  4. Preliminary analyses
    • Whisker plots of un-normalized data
    • Read count density histogram
  5. Run DESeq() once (if you have not already done so in step 3) to obtain normalized counts. This step normalizes for differences in library size (i.e. the number of reads sequenced for each sample).
  6. Prefilter the counts, removing very low count transcripts that are poorly measured and so variable that they will never be significantly differentially expressed. For a simple experiment you may be able to simply keep those transcripts over a minimum threshold. For a more complicated experiment with many sets of samples, I evaluate each replicate group separately and retain any transcript that is well determined in ANY group. This is so you do not eliminate transcripts that are expressed at level 0 in some or most samples, but increase to high levels in one or a few samples. What is well measured? The answer is subjective; my seat-of-the-pants estimate is that a transcript requires 30-100 counts (mapped reads) in a sample to be well determined. Transcripts with very low counts will have greater differences between replicates than their mean value - that is, you cannot reliably distinguish them from zero. Removing low-level transcripts increases your ability to confidently identify DEGs because it reduces the number of tests and therefore the multiple testing correction (FDR). At the same time, removing the highly variable low-expression genes does not increase your chance of missing interesting effects, because none of the removed genes would be significantly differentially expressed.
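  7. Load the filtered counts into a new DESeqDataSet object and run DESeq() again. A DESeqDataSet accepts only raw integer counts, because DESeq() corrects for library size itself, so use the normalized counts to decide which transcripts pass the filter in step 6 and then load the raw counts for those transcripts. Whenever you need the normalized values, call counts() with normalized=TRUE (lowercase n); without it you will get the raw counts.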
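
The steps above can be sketched in R roughly as follows. This is a minimal illustration under stated assumptions, not a definitive pipeline: the count matrix is simulated rather than read from a file, the two-group design and the names `counts_raw`, `col_data`, `min_count` and `keep` are made up for the example, and the threshold of 30 is simply the low end of the range suggested in step 6. The filter is computed on normalized counts, but the raw counts of the retained genes are what get reloaded, since DESeqDataSetFromMatrix() only accepts integer counts.

```r
library(DESeq2)

set.seed(1)

# Simulated raw counts: 200 genes x 6 samples (3 replicates each of A and B).
# With real data this matrix would come from the count file in step 1.
n_genes <- 200
gene_mu <- round(2^seq(7, 12, length.out = n_genes))   # per-gene mean counts
counts_raw <- matrix(rnbinom(n_genes * 6, mu = rep(gene_mu, times = 6), size = 5),
                     nrow = n_genes,
                     dimnames = list(paste0("gene", seq_len(n_genes)),
                                     paste0(rep(c("A", "B"), each = 3), 1:3)))
counts_raw[1:50, ] <- rpois(50 * 6, lambda = 2)     # barely detected anywhere
counts_raw[51, ]   <- c(0, 0, 0, 150, 180, 160)     # off in A, on in B
storage.mode(counts_raw) <- "integer"

# Step 2: metadata (one row per sample) and experimental design.
col_data <- data.frame(condition = factor(rep(c("A", "B"), each = 3)),
                       row.names = colnames(counts_raw))
dds <- DESeqDataSetFromMatrix(countData = counts_raw, colData = col_data,
                              design = ~ condition)

# Step 4: preliminary plots of the un-normalized data (log2, pseudocount 1).
boxplot(log2(counts_raw + 1), ylab = "log2(count + 1)")
hist(log2(counts_raw + 1), breaks = 50, main = "Read count density")

# Step 5: size factors are enough to obtain library-size-normalized counts.
dds  <- estimateSizeFactors(dds)
norm <- counts(dds, normalized = TRUE)

# Step 6: keep a gene if its mean normalized count reaches the threshold
# in at least one replicate group.
min_count <- 30   # assumed threshold
group_means <- sapply(levels(col_data$condition), function(g)
  rowMeans(norm[, col_data$condition == g, drop = FALSE]))
keep <- apply(group_means >= min_count, 1, any)

# Step 7: reload the raw counts of the retained genes and run DESeq() again.
dds_filt <- DESeqDataSetFromMatrix(countData = counts_raw[keep, ],
                                   colData = col_data, design = ~ condition)
dds_filt <- DESeq(dds_filt)
res <- results(dds_filt)
head(res[order(res$padj), ])
```

In this toy data, gene51 survives the filter because it is well determined in group B even though it is zero in group A, which is the case the per-group rule in step 6 is meant to protect. Subsetting the existing object with `dds[keep, ]` would be a shorter alternative to rebuilding it.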
Unless otherwise stated, the content of this page is licensed under Creative Commons Attribution-ShareAlike 3.0 License