5.4.7 Non-linear dimensionality reduction by tSNE

When do we use non-linear dimensionality reduction?

  • We learnt about PCA; why don’t we just use that?
  • In some cases the data has a non-linear structure that linear methods cannot unfold.
    • i.e. there are problems that are too hard to solve linearly.

How would you reduce the dimensionality of this object?

[Figure: Swiss roll]

  • This pineapple skin has a folded 2D surface.
    • Surfaces in multiple dimensions are called manifolds (a mathematician would kill me for this).
    • Manifold learning is a way (and the name) of reducing dimensions (see the sketch below).
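As a minimal illustration (a sketch assuming Python with scikit-learn, which is not mentioned in the slides), we can generate a Swiss-roll point cloud and compare a linear 2D reduction (PCA) with a non-linear one (t-SNE):

```python
# A minimal sketch (scikit-learn assumed): a Swiss roll is a 2D sheet folded
# into 3D; a linear projection keeps it folded, a non-linear method works
# along the folded surface instead.
from sklearn.datasets import make_swiss_roll
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

X, position_on_roll = make_swiss_roll(n_samples=1000, noise=0.05, random_state=0)

X_pca = PCA(n_components=2).fit_transform(X)               # linear: layers overlap
X_tsne = TSNE(n_components=2, perplexity=30,
              random_state=0).fit_transform(X)             # non-linear: neighbourhoods kept

print(X.shape, X_pca.shape, X_tsne.shape)                  # (1000, 3) (1000, 2) (1000, 2)
```

Plotting X_pca and X_tsne coloured by position_on_roll shows the difference: in the PCA plot points from different layers of the roll land on top of each other, while t-SNE tends to keep points from the same part of the surface together.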

How could you simplify this 3-dimensional data?

[Figure: non-linear dimensionality reduction (from Andy’s blog)]

Simplify the data by unfolding the higher-dimensional surface

[Figure: non-linear dimensionality reduction, unfolded]

Crowding problem in PCA leads to overlapping points / inseparable clusters

[Figure: PCA of the iris data set (factoextra package)]
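The figure comes from R’s factoextra package; a rough Python equivalent (a sketch assuming scikit-learn and matplotlib) shows the same effect on the iris data set:

```python
# A minimal sketch: PCA of the iris data set (4 features -> 2 PCs).
# Setosa separates cleanly, but versicolor and virginica partly overlap in
# PC1/PC2 space: the crowded, hard-to-separate situation the slide refers to.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

iris = load_iris()
pcs = PCA(n_components=2).fit_transform(iris.data)

plt.scatter(pcs[:, 0], pcs[:, 1], c=iris.target)   # colour points by species
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.show()
```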

Why is there a crowding problem in dimensionality reduction?

  • The distribution of the pairwise distances is very different between a high-dimensional and a low-dimensional space.
  • In a 2D space, how many points can you place equidistantly from each other?
  • You can only place n+1 equidistant points (n = number of dimensions); see the sketch below.
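A quick numeric check of this claim (a sketch in plain NumPy; the construction below, basis vectors plus one extra point, is just one convenient way to build such a set):

```python
# A minimal sketch: in n dimensions at most n+1 points can be mutually equidistant.
# Build such a set (a regular simplex) and verify all pairwise distances are equal.
import numpy as np
from itertools import combinations

n = 5                                   # number of dimensions
points = list(np.eye(n))                # the n unit basis vectors: pairwise distance sqrt(2)
c = (1 + np.sqrt(1 + n)) / n            # (n+1)-th point c*(1,...,1) at that same distance
points.append(np.full(n, c))

dists = [np.linalg.norm(p - q) for p, q in combinations(points, 2)]
print(np.allclose(dists, np.sqrt(2)))   # True: 6 equidistant points in 5 dimensions
```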

Linearly simplifying data with non-linear structure can also lead to the crowding problem

[Figure: Swiss roll]

How to map 1000 dimensions?

  • Now you have 100 cells that are equidistant in 1000 dimensions - how do you map this out?
    • In short, you cannot: this is known as the crowding problem (demonstrated in the sketch below).
    • But you can try to approximate it.
      • Spring example: each data point is pulled by a spring with a force proportional to its distance in the high-dimensional (1000D) space.
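A quick way to see the problem (a sketch with NumPy and SciPy, random data standing in for the 100 cells): in 1000 dimensions the pairwise distances between random points are all nearly identical, and no 2D arrangement can reproduce that.

```python
# A minimal sketch: 100 random "cells" in 1000 dimensions are almost equidistant,
# exactly the situation a 2D map cannot reproduce faithfully.
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))     # 100 cells x 1000 dimensions

d = pdist(X)                         # all 4950 pairwise Euclidean distances
print(d.std() / d.mean())            # relative spread ~0.02: nearly equidistant
```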

tSNE: t-distributed Stochastic Neighbor Embedding

An intuitive tSNE explanation

→ See tSNE slides from StatQuest

How does t-SNE really work?

You have a 1000-dimensional data set:

  • impossible to visualize & get a sense of it
    • We have to convert that 1000D data set to something that we can visualize & play around with.

Steps of tSNE

See the StatQuest presentation!

Conversion to similarity matrix

  • Measure the similarity for every pair of data points (cells).
  • Similar data points get a higher similarity value, and different data points get a lower one.
  • You convert from
    • cell * gene (feature) dimensions to
    • a cell * cell similarity matrix.
      • Remember the sample correlation heatmap? That was a similarity matrix between worms.
    • You now have a similarity matrix S1 (see the sketch below).
    • (You have already left the original high-dimensional space.)
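A rough sketch of this conversion (NumPy and SciPy; note that real t-SNE chooses a separate Gaussian bandwidth per cell from the perplexity parameter, whereas a single shared bandwidth is used here to keep the sketch short):

```python
# A minimal sketch: from a cells x genes matrix to a cells x cells similarity
# matrix S1, using a Gaussian kernel on the pairwise distances.
import numpy as np
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 1000))          # 100 cells x 1000 genes (toy data)

sq = cdist(X, X, "sqeuclidean")           # pairwise squared distances between cells
S1 = np.exp(-sq / np.median(sq))          # similar cells -> near 1, dissimilar -> near 0
np.fill_diagonal(S1, 0.0)                 # a cell is not its own neighbour

print(S1.shape)                           # (100, 100): we have left gene space
```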

Stochastic Neighbourhood Embedding

  • Convert these similarities to probabilities (a joint probability distribution), using the normal distribution.
  • Now, t-SNE arranges all of the data points randomly in an even lower-dimensional space (say 2D).
    • Stochastic Neighbourhood Embedding
    • Put cells close to each other based on their similarity (~ probability) in high-D.
      • Because it samples from a probability distribution, it is stochastic.
    • Now you have distances in the 2D space as well (see the sketch below).
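A minimal sketch of this step (NumPy; S1 below is a random stand-in for a similarity matrix like the one above, and the per-pair normal-kernel probabilities are simplified to a single global normalisation):

```python
# A minimal sketch: turn similarities into a joint probability matrix P, then
# scatter the cells at random 2D positions; this random start is the
# "stochastic" part of the embedding.
import numpy as np

rng = np.random.default_rng(0)
S1 = rng.random((100, 100))                 # stand-in similarity matrix
S1 = (S1 + S1.T) / 2.0                      # make it symmetric
np.fill_diagonal(S1, 0.0)

P = S1 / S1.sum()                           # joint probabilities: similar pairs get more mass
P = np.maximum(P, 1e-12)                    # avoid log(0) later in the KL divergence

Y = rng.normal(scale=1e-2, size=(100, 2))   # random 2D layout, to be optimised
print(round(P.sum(), 3), Y.shape)           # 1.0 (100, 2)
```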

Finding the optimal 2D representation

  • Move points around (optimise their locations) based on their similarity (~ probability) in high-D.
  • Repeat until convergence.
    • Convergence = when the layout cannot get any better.

Compare similarity in high-D and similarity in low (2)D

  • Now we also have the similarity matrix for the data points in the lower dimension: S2.
  • t-SNE then compares matrices S1 and S2 &
  • tries to make the difference smaller (by some complex mathematics).
  • Remember optimization? There is actually a gradient descent optimisation inside.
  • At the end we will have lower-dimensional data points that try to capture even the complex relationships at which PCA fails (see the sketch below).
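To make the "complex mathematics" a bit more concrete, here is a bare-bones sketch of that inner loop (NumPy only; no momentum, early exaggeration or per-cell bandwidths, so it illustrates the idea rather than reproducing the real algorithm). Q plays the role of S2, and gradient descent on the KL divergence pulls Q towards P:

```python
# A minimal sketch of the t-SNE optimisation loop.
import numpy as np

def tsne_step(P, Y, lr=100.0):
    """One gradient-descent step: build low-D similarities Q (~ S2) with a
    Student-t kernel and move the 2D points Y so that Q gets closer to P (~ S1)."""
    d2 = np.sum((Y[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    num = 1.0 / (1.0 + d2)                      # heavy-tailed t kernel in low-D
    np.fill_diagonal(num, 0.0)
    Q = np.maximum(num / num.sum(), 1e-12)

    PQ = (P - Q) * num                          # gradient of KL(P || Q) w.r.t. Y
    grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
    cost = np.sum(P * np.log(P / Q))            # how different S1 and S2 still are
    return Y - lr * grad, cost

# toy inputs: a random P and a random 2D start, as in the sketches above
rng = np.random.default_rng(0)
P = rng.random((100, 100)); P = (P + P.T) / 2.0; np.fill_diagonal(P, 0.0)
P = np.maximum(P / P.sum(), 1e-12)
Y = rng.normal(scale=1e-2, size=(100, 2))

for _ in range(200):                            # "repeat until convergence"
    Y, cost = tsne_step(P, Y)
print(float(cost))                              # remaining mismatch between P and Q
```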

Why t-distribution?

The distribution of the pairwise distances is very different between a high-dimensional and a low-dimensional space.

  • Think of the triangle example in the PCA presentation
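A tiny numeric illustration (a sketch assuming SciPy): t-SNE uses a normal kernel for the high-D similarities but a Student-t kernel (one degree of freedom) in low-D, because the t-distribution's heavy tails let moderately dissimilar points sit further apart in 2D, which relieves the crowding.

```python
# A minimal sketch (SciPy): compare tail heaviness of the normal distribution
# (used for the high-D similarities) and the Student-t with df=1 (used in low-D).
import numpy as np
from scipy.stats import norm, t

x = np.array([0.0, 1.0, 2.0, 4.0, 8.0])
print(norm.pdf(x))       # ~5e-15 by x = 8: the normal tail vanishes quickly
print(t.pdf(x, df=1))    # ~5e-3 at x = 8: the t tail stays fat, giving distant
                         # points room to spread out in the 2D map
```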

Sources:

Quora

analyticsvidhya.com

blog.paperspace.com

statquest.org (recommended) or video