--- editor_options: markdown: wrap: 72 --- # Step 3: QAQC RNA sequences Sarah Tanja October 10, 2024 - [Goals](#goals) - [Setup conda environment](#setup-conda-environment) - [Run `FastQC`](#run-fastqc) - [Compile the `MultiQC` report](#compile-the-multiqc-report) - [Interpretation of `MultiQC` report](#interpretation-of-multiqc-report) - [Clean reads with `fastp`](#clean-reads-with-fastp) - [Summary](#summary) # Goals {#goals} In this script, we will generate FastQC/MultiQC for raw sequences, conduct trimming and cleaning, then generate reports for cleaned sequences. # Setup conda environment {#setup-conda-environment} Create and activate conda environment (must already have installed miniconda) > [!CAUTION] > > Execute the following conda commands in the terminal so that you can > respond to `Proceed([y]/n)?` with a `y` ``` bash conda create -n mcap2024 conda activate mcap2024 ``` Install programs within conda environment - [fastqc](https://anaconda.org/bioconda/fastqc) ``` bash conda install bioconda::fastqc ``` *FastQC generates sequence quality information of your reads* - [multiqc](https://anaconda.org/bioconda/multiqc), [git developer version](Installation) ``` bash pip install --upgrade --force-reinstall git+https://github.com/MultiQC/MultiQC.git ``` *Multiqc summarizes FastQC analysis logs and summarizes results in an html report* - [fastp](https://anaconda.org/bioconda/fastp) ``` bash conda install bioconda::fastp ``` *FastP provides fast all-in-one preprocessing for FastQ files* conda install hisat2 conda install samtools # Run `FastQC` {#run-fastqc} ``` bash cd rawfastq/00_fastq fastqc ./*.fastq.gz ``` Make a subdirectory for your FastQC results and move FastQC results there ``` bash cd ../output mkdir 03_qaqc mv ../../rawfastq/00_fastq/*fastqc* ./ ``` # Compile the `MultiQC` report {#compile-the-multiqc-report} ``` bash cd ../output/03_qaqc multiqc ./ ``` # Interpretation of `MultiQC` report {#interpretation-of-multiqc-report} Watch a quick [6-min tutorial](https://www.youtube.com/watch?v=qPbIlO_KWN0) on how to navigate in the MultiQC Report # Clean reads with `fastp` {#clean-reads-with-fastp} - remove adapters - remove low-quality reads - remove reads with high number of unknown bases In this script, we are trimming and cleaning with the following settings: - remove adapter sequences `--adapter_sequence=AGATCGGAAGAGCACACGTCTGAACTCCAGTCA` - enable polyX trimming on 3' end at length of 6 `--trim_poly_x 6` - filter by minimum phred quality score of \>30 `-q 30` - enable low complexity filter `-y` - set complexity filter threshold of 50% required `-Y 50` Make a subdirectory for cleaned reads within the data directory. ``` bash cd ../data mkdir cleaned_reads ``` `fastp` all [options](https://github.com/OpenGene/fastp?tab=readme-ov-file#:~:text=of%201%20~%206.-,all%20options,-usage%3A%20fastp%20%2Di) can be found in the git README. - --in1 - Path to forward read input - --in2 - Path to reverse read input - --out1 - Path to forward read output - --out2 - Path to reservse read output - --failed_out - Specify file to store reads that fail filters - --qualified_quality_phred - Phred quality \>= -q is qualified (20) - --unqualified_percent_limit - % of bases allowed to be unqualified (10) - --length_required - Set required sequence length (100) - --detect_adapter_for_pe - Adapters can be trimmed by overlap analysis, however, --detect_adapter_for_pe will usually result in slightly cleaner output than overlap detection alone. This results in a slightly slower run time - --cut_right - Move a sliding window from front to tail. Use cut_right_window_size to set the window size (5), and cut_right_mean_quality (20) to set the mean quality threshold. - --html - The html format report file name ``` bash fastp --in1 ${file}_R1_001.fastq.gz --in2 ${file}_R2_001.fastq.gz --out1 ../cleaned_reads/${file}_R1_001.clean.fastq.gz --out2 ../cleaned_reads/${file}_R2_001.clean.fastq.gz --failed_out ../cleaned_reads/${file}_failed.txt --detect_adapter_for_pe --trim_poly_x 6 -q 30 -y -Y --qualified_quality_phred 20 --unqualified_percent_limit 10 --length_required 100 detect_adapter_for_pe --cut_right cut_right_window_size 5 --cut_right_mean_quality 20 ``` # Summary {#summary} Clean sequences are now ready for alignment.