app
for checking the characteristics of sequencing data.fastqc MyExperiment.fastq
Trim Galore!
is a wrapper for the reads trimming software cutadapt
.
trim_galore -h
for helptrim_galore --nextera MyExperiment.fastq Share/ERR522959_2.fastq
.fastq
fileSo why not removing some more reads that would be useless anyways? >- Does your reads have the correct CBC and UMI? >- Pre-filtering reads for correct barcodes and UMI
Is a .fasta
file with 20-24 sequences: the chromosomes.
Why that? - It’s simple: >- Using R you could write it yourself > - compare base to base “A”=="A"
> - your.base == base.in.reference
> - in 2 gigantic for loops >- Comparing every base in your read vs every base in the reference is This is way to slow >- When you try to speed it up, it soon gets very complicated > - Creating a search index (lookup) for the reference is usually a key step >- There are a lot of other issues
STAR
is such an alignerTwo steps are required to perform STAR
(or most other) alignments.
A search index for the reference is usually a key step
mkdir results
mkdir results/STAR
STAR --runThreadN 4 --genomeDir indices/STAR --readFilesIn Share/ERR522959_1.fastq Share/ERR522959_2.fastq --outFileNamePrefix results/STAR/
.sam
file.bam
to .sam
Part of the HTSlib (High Throughput Sequencing Library )
SAMtools provide various utilities for manipulating alignments in the SAM format, including
We wrote our own scripts to count reads per gene and cell, and finally create an expression matrix
→ We need a .sam file as input and the aligners typically generate a .bam file
ATTTTTTTTTTTTTTTTTTTTN
.samtools
to viewYou learned yesterday some commands:
# After maooing
# You can use "samtools" to convert a .bam into a .sam
samtools view -h file.bam # creates a .sam
# Use pipe to count the number of lines in file.txt
# counts the number of lines in the samtools output
samtools view -h file.bam | wc -l
# Check the file
wc -l file.txt
samtools view -h file.bam | head | more
# Select a certain chromosome (if mapped against the genome)
samtools view -b in.bam chr1 > in_chr1.bam
# ...or per gene if mapped against the transcripome
samtools view -b in.bam Pou5f1__chr19 > in_chr1.bam
3 ways to count
Why?
You are sequencing molecules from a large pool of molecules. This is a sampling process.
Imagine:
This is what we do in when we convert reads to transcript-counts.
Read more about the mapping process.
N
-s