tSNE: t distribution-Stochastic neighborhood embedding
An intuitive tSNE explanation
→ See tSNE slides from StatQuest
How does t-SNE really work?
You have 1000 dimensional data set:
- impossible to visualize & get a sense of it
- We have to convert that 1000D data set to something that we can visualize & play around.
Conversion to similarity matrix
- Measures similarity for every pair of data points (cell).
- Similar data points will have more value of similarity and the different data points will have less value.
- You convert from
cell * gene (feature)
dimensions to
cell * cell
dimensioned similarity matrix.
- Remember the sample correlation heatmap? That was a similarity matrix between worms.
- You have now a similarity matrix
S1
- (You already left the original highest dimensional space.)
Stochastic Neighbourhood Embedding
- Convert this similarity distance to probability (joint probability) according to the normal distribution.
- Now, t-SNE arranges all of the data points randomly on an even lower dimensional space (say 2D) .
- Stochastic Neighbourhood Embedding
- Put cells close to each other based on their similarity (~ probability) in high-D
- Because its sampling from a probability distribution, its stochastic.
- Now you have distances in 2D space.
Finding the optimal 2D representation
- Move points around (optimise their location) on their similarity (~ probability) in high-D
- Repeat until convergence
- convergence = when you cannot get better
Compare Similarity in high-D and similarity in (2)D
- Now we also have the similarity matrix for the data points in lower dimension:
S2
- t-SNE then compares matrix
S1
and S2
&
- tries to make the difference smaller by (some complex mathematics.)
- Remember optimization? There is actually a gradient descent optimisation inside.
- At the end we will have lower dimensional data points which tries to capture even complex relationships at which PCA fails.
Why t-distribution?
The distribution of the pairwise distances is very different between a high-dimensional and a low-dimensional space.
- Think of the triangle example in the PCA presentation