3.4. Analysis of count data (EDA)
Abel Vertesy
Disclaimer
- Using data from: Zheng, G. X. Y et al. (2017) ‘Massively parallel digital transcriptional profiling of single cells’. doi: 10.1038/ncomms14049.
- Following steps of the Seurat Tutorial
- Download the data from: https://bit.ly/2mVpJvM
Example case: Study the immune system
Peripheral blood mononuclear cells (PBMC)
A PBMC is any peripheral blood cell having a round nucleus. These cells consist of:
- lymphocytes
- T cells,
- B cells,
- NK cells (natural killer cells),
- monocytes (macrophage precursor).
Other cell types in the blood:
- Erythrocytes and platelets have no nuclei, and
- granulocytes (neutrophils, basophils, and eosinophils) have multi-lobed nuclei.
Multipoptent progenitors give rise to lymphoid and myeloid lineages
A quiescent stem cell is activated to proliferates through multiple stages
PBMCs are both part of the lymphoid and myeloid lineages
Dataset: 3k PBMCs from a Healthy Donor
Single Cell Gene Expression Dataset by Cell Ranger 1.1.0
Peripheral blood mononuclear cells (PBMCs) from a healthy donor (same donor as pbmc6k).
PBMCs are primary cells with relatively small amounts of RNA (~1pg RNA/cell).
- 2,700 cells detected
- Sequenced on Illumina NextSeq 500 with ~69,000 reads per cell
- 98bp read1 (transcript), 8bp I5 sample barcode, 14bp I7 GemCode barcode and 10bp read2 (UMI)
- Analysis run with –cells=3000
Published on May 26, 2016
This dataset is licensed under theCreative Commons Attribution license.
Link to summary html file
Wake-up (brainstorming) exercise #1: you are the data analyst of this project
You learnt many things already about the concepts of (single-cell) data analysis and machine learning during the EDA course.
Start a new text document and write down in bullet points:
- What questions can you ask?
- There are no wrong questions at this point, just try to get on paper as many questions as you can.
- Use your biological intuition,
- experiences in your own project (or studies), and
- ask for more info if you need (I did not work with )
Open one of the shared excel files.
Wake-up exercise #2:
- How would you answer them?
- Which methods that we learned about / heard about before?
- How would you put them together into a pipeline? Think about what
- steps of
- analysis
- would you
- do?
Hint:
- First just try to find a method that can answer each of your questions
- Order them into a logical order
1. What questions would I ask?
I would ask:
- Is our dataset high quality ?
- How many UMI per cell do I see?
- How many genes per cell do I see?
- What cell types do we see?
- Do we see all expected, and no unexpected cell types?
- What is the proportion of these cell types?
- What genes are expressed in each type?
- What genes distinguish the cell types (marker genes)?
- Are these marker genes uniformly expressed, or is there variation among them?
- Can we find sub-types?
- e.g.: are there multiple kind of T-cells?
2. / 5. How would I approach my questions?
- Quality control and filtering:
- Visualise the distribution of read counts and gene counts
- Filter out the bad cells
- (Identify variable genes)
- Clustering
- Check for batch effects
- Visualise clusters and heterogeneity
- t-SNE (a non-linear dimensionality reduction method)
- Differential gene expression
- Find genes marker genes
- Gene set analysis: find functional relevance in genes
- [Loop between any of the steps above]
2. / 5. How SEURAT approaches these questions?
- Quality control and filtering:
- Visualise the distribution of read counts and gene counts
- Filter out the bad cells
- Heterogeneity of the data
- Identify variable genes (it is also a )
- Visualise
- PCA (Principle Component Analysis, a linear dimensionality reduction method)
- t-SNE (t-Stochastic Neighbourhood Embedding , a non-linear dimensionality reduction method)
- Clustering
- Seurat
Graph based clustering
based on Principle Components
instead of genes.
- Differential expression
- Sub-clustering
Why not just using Excel style tables (data frames)?
Single-cell data is special, because
- large size and
- sparsity of the data (90% of the tables can be 0s)
Sparse matrices can be stored in .mtx
files
Name: MatrixMarket
format.
- Plain text format.
- Used for exchanging and storing test matrices.
- Check out the file using the bash
head
command.
- Try to interpret what you see: This is a file format you never heard of but you can recognise some patterns + you learned sth about the typical structure of files
- Separate the output
- What are the parts of this file?
- What are in the columns?
Structure of the .mtx
(solution)
- You will see…
- 2 header lines (% sign is a common way to annotate “comments” or headers - similarly to the # sign.
- followed by a line detailing the total number of rows, columns and counts for the full matrix.
- Entries per line in the following format:
rows, columns, entries
- Since only the coordinates are stored in the .mtx file, the
- names of each row & column must be stored separately in the “genes.tsv” and “barcodes.tsv” files.
R can open .mtx files using the Matrix package
- Installed as a dependency to
Seurat
- require (“Matrix”)
Load the data and create a Seurat object
… Open an R script and load the data!
“Seurat” is and object, a new data structure
Objects are:
- the most complex data structures we encountered,
- central idea for
object-oriented programming
.
- You can think of it as a list
What is pbmc in r?
What data structure is created?
# Figure out what
is(pbmc)
str(pbmc)
- pbmc is a
class
or in other words: an object
5.4.1. Filtering genes and cells
- Problem
- The dataset is huge how an I make it workable?
- Filtering
- Rough-cut:
- Keep all genes expressed in >= 3 cells
- Keep all cells with at least 200 detected genes
- Remove QC-outlier cells:
- % of mitochondrial genes
- Read counts
Filtering genes and cells in Seurat
… Go back to the R script!
5.4.3. Identify the most variable genes
- Problem:
FindVariableGenes()
- calculates the average expression & dispersion for each gene
- bin per expression level,
- calculates a z-score for dispersion within each bin.
Z-score normalisation
a.k.a.: Standard score
Definition:
The distance from the mean, as a number of standard deviations (STD).
Z-score calculation:
(value - mean) / STD
Z-scores of the normal distribution
When would you use the Z-score ?
- Think abstractly!
- What kind of comparison?
5.4.4. Removing unwanted sources of variation by linear regression (“scaling”)
- Removing unwanted sources of variation by linear regression (“scaling”)
To mitigate the effect of confounding factors,
Seurat
constructs linear models to predict gene expression based on user-defined variables.
- The residuals of these models are used later for dimensionality reduction and clustering.
Removing confounding effects in Seurat
Seurat
can regress out cell-cell variation in gene expression
- driven by batch,
- mapping quality
- the number of detected molecules
- mitochondrial gene expression and
- cell cycle.
- Here we regress on the number of detected molecules per cell.