--- title: "Rhizospheric fungal/bacteria metagenome project" format: html editor: markdown: wrap: sentence --- We are analyzing the metagenomes from 2 soil samples from a greenhouse study. These will give us a sense of resident microbial (fungal and bacteiral) communities prior to biofertilizer treatment. 1. F1B-KM40 - bulk field soil 2. F2\$-KM41 - rhizosphere soil # Data upload The goal for week 2 is to upload fastq files to Raven server. Note that this needs to be done from the local terminal, not the one on the server. ```{bash} # move files from lab drive to Raven scp "/Volumes/CRUZER16GB/F1B-KM40_R1_001.fastq.gz" \ rendavis@raven.fish.washington.edu:/home/shared/16TB_HDD_01/fish546/renee/ ``` That was successful. Now we upload the rest using rsync instead (better for handling multiple files, fewer chances of duplicates, conflicts etc). ```{bash} # move remaining files from lab drive to Raven rsync -av --progress \ /Volumes/CRUZER16GB/*.fastq.gz \ rendavis@raven.fish.washington.edu:/home/shared/16TB_HDD_01/fish546/renee/ ``` # Data integrity and quality control ## Checksums Before we begin working, let's run a MD5 check on the 4 metagenome fastq files. ```{bash} find . -type f -exec md5sum-lite {} \; >> output.txt #found all files in the working directory, calculated its MD5 checksum, save to an output txt file ``` This prints the results to a text file. Here is the output: ``` b24ae50750964ef8d01a8808051c73c7 ./F2R-KM41_R2_001.fastq.gz a48a5ad0bc4ce07f641c0c647fe0dbaf ./F1B-KM40_R2_001.fastq.gz 88d5e83290598639d54218bc505bfe66 ./output.txt a8be224ab035ea641ccd5a8c9d46c024 ./F2R-KM41_R1_001.fastq.gz e34585c824e429f54ff34d919719f1c4 ./F1B-KM40_R1_001.fastq.gz ``` ```{bash} mv output.txt md5checksums_metagenomes_2025.txt #renamed the file to something more descriptive ``` To validate file integrity later, you could run the following code to ensure a match. ```{bash} md5sum -c md5checksums_metagenomes_2025.txt ``` ## Quality control ### FastQC I have been told that QC has been run on these, but we won't take that at face value. For this we will use FastQC. ```{bash} # run FastQC on the fastq files /home/shared/FastQC-0.12.1/fastqc -t 36 -o output /home/shared/16TB_HDD_01/fish546/renee/*.fastq.gz ``` ```{bash} Analysis complete for F2R-KM41_R1_001.fastq.gz Analysis complete for F2R-KM41_R2_001.fastq.gz Analysis complete for F1B-KM40_R1_001.fastq.gz Analysis complete for F1B-KM40_R2_001.fastq.gz ``` .png files downloaded to my desktop. ### MultiQC ```{bash echo=TRUE} eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)" conda activate which multiqc cd output multiqc . ``` See html report in output folder. # Preprocessing We have to preprocess raw data to enable clean, high-quality reads ready for assembly. This involves 2 steps: trimming and merging forward and reverse reads. ## Trimming using Trimmomatic We need to trim raw data to remove adapters and low-quality bases. It's important to do this prior to merging, as merging reads relies on the overlapping region between forward and reverse reads. If you try to merge before trimming, you could end up with incorrect overlaps, low quality merges, and related issues. For this step we will use [Trimmomatic](#0), which we will have to install manually. ```{bash} scp -r ~/Downloads/Trimmomatic-0.39 rendavis@raven.fish.washington.edu:/home/shared/16TB_HDD_01/fish546/renee #from the local terminal ``` Now we will run it to trim adapters and low quality bases. We'll stick with defaults of greater than or equal to 50 base pairs (bp), and 3 leading and trailing bp. We will do this in 2 steps with F1B R1/R2 and then F2R R1/R2. ```{bash} cd /home/shared/16TB_HDD_01/fish546/renee java -jar /home/shared/16TB_HDD_01/fish546/renee/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 4 \ F1B-KM40_R1_001.fastq.gz F1B-KM40_R2_001.fastq.gz \ F1B-KM40_trimmed_R1_paired.fastq.gz F1B-KM40_trimmed_R1_unpaired.fastq.gz \ F1B-KM40_trimmed_R2_paired.fastq.gz F1B-KM40_trimmed_R2_unpaired.fastq.gz \ ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \ LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50 #keeps bps equal or above 50 ``` ``` Quality encoding detected as phred33 Input Read Pairs: 31944124 Both Surviving: 29010263 (90.82%) Forward Only Surviving: 1518244 (4.75%) Reverse Only Surviving: 921543 (2.88%) Dropped: 494074 (1.55%) TrimmomaticPE: Completed successfully ``` These are good numbers. Generally, anything over 80% both surviving is considered good. And 1.5% is a very low drop rate. ```{bash} cd /home/shared/16TB_HDD_01/fish546/renee/raw-data java -jar /home/shared/16TB_HDD_01/fish546/renee/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 4 \ F2R-KM41_R1_001.fastq.gz F2R-KM41_R2_001.fastq.gz \ F2R-KM41_trimmed_R1_paired.fastq.gz F2R-KM41_trimmed_R1_unpaired.fastq.gz \ F2R-KM41_trimmed_R2_paired.fastq.gz F2R-KM41_trimmed_R2_unpaired.fastq.gz \ ILLUMINACLIP:/home/shared/16TB_HDD_01/fish546/renee/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10 #had to use the whole file path here LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50 #keeps bps equal or above 50 ``` ``` Quality encoding detected as phred33 Input Read Pairs: 27015389 Both Surviving: 26807327 (99.23%) Forward Only Surviving: 208057 (0.77%) Reverse Only Surviving: 0 (0.00%) Dropped: 5 (0.00%) TrimmomaticPE: Completed successfully ``` Trimmomatic produces **unpaired** and **paired** file outputs. Paired reads are those which both forward and reverse survived trimming. Unpaired reads indicate where only one of the pair survived (the other was discarded due to low quality or short length). We will use the paired files for most downstream steps like merging and assembly, but will keep the unpaired files around just in case we need to embark on deeper analysis. ## Merging forward and reverse reads using PEAR These are R1/R2 (forward and reverse reads) and will have to be merged. This is the last component of pre-processing as we work towards metagenome assembly.For this we will use fastp which is already installed on Raven. ```{bash} cd /home/shared/16TB_HDD_01/fish546/renee # Merge F1B-KM40 /home/shared/fastp-v0.24.0/fastp \ -i F1B-KM40_trimmed_R1_paired.fastq.gz \ -I F1B-KM40_trimmed_R2_paired.fastq.gz \ --merge \ --merged_out F1B-KM40_merged.fastq.gz \ # Merge F2R-KM41 /home/shared/fastp-v0.24.0/fastp \ -i F2R-KM41_trimmed_R1_paired.fastq.gz \ -I F2R-KM41_trimmed_R2_paired.fastq.gz \ --merge \ --merged_out F2R-KM41_merged.fastq.gz \ ``` ``` Read1 before filtering: total reads: 29010263 total bases: 4140614008 Q20 bases: 4050447746(97.8224%) Q30 bases: 3845738339(92.8785%) Read2 before filtering: total reads: 29010263 total bases: 4154017583 Q20 bases: 4065704832(97.874%) Q30 bases: 3866115895(93.0693%) Merged and filtered: total reads: 7474936 total bases: 1650421008 Q20 bases: 1630687369(98.8043%) Q30 bases: 1578354375(95.6334%) Filtering result: reads passed filter: 58018466 reads failed due to low quality: 0 reads failed due to too many N: 2060 reads failed due to too short: 0 reads with adapter trimmed: 635320 bases trimmed due to adapters: 11744183 reads corrected by overlap analysis: 2160886 bases corrected by overlap analysis: 2804599 Duplication rate: 7.13807% Insert size peak (evaluated by paired-end reads): 268 Read pairs merged: 7474936 % of original read pairs: 25.7665% % in reads after filtering: 100% JSON report: fastp.json HTML report: fastp.html /home/shared/fastp-v0.24.0/fastp -i F1B-KM40_trimmed_R1_paired.fastq.gz -I F1B-KM40_trimmed_R2_paired.fastq.gz --merge --merged_out F1B-KM40_merged.fastq.gz fastp v0.24.0, time used: 178 seconds Read1 before filtering: total reads: 26807327 total bases: 4021099050 Q20 bases: 3808188117(94.7052%) Q30 bases: 3500492825(87.0531%) Read2 before filtering: total reads: 26807327 total bases: 4021099050 Q20 bases: 3793251312(94.3337%) Q30 bases: 3477357474(86.4778%) Merged and filtered: total reads: 6724249 total bases: 1525795473 Q20 bases: 1489384743(97.6137%) Q30 bases: 1418180834(92.947%) Filtering result: reads passed filter: 51640358 reads failed due to low quality: 1969720 reads failed due to too many N: 4576 reads failed due to too short: 0 reads with adapter trimmed: 170138 bases trimmed due to adapters: 2613154 reads corrected by overlap analysis: 2552667 bases corrected by overlap analysis: 5056209 Duplication rate: 6.67899% Insert size peak (evaluated by paired-end reads): 268 Read pairs merged: 6724249 % of original read pairs: 25.0836% % in reads after filtering: 100% ``` Around 25% of original read pairs got merged, which appears to be typical for paired-end read merging. As a final step, we'll preview the file and do a quality check on the merged files using FastQC. ```{bash} # run FastQC on the fastq files /home/shared/FastQC-0.12.1/fastqc -t 36 -o output /home/shared/16TB_HDD_01/fish546/renee/*merged.fastq.gz ``` HTML report results: [F1B-KM40_merged.fastq.gz](http://raven.fish.washington.edu:8787/files/renee-myco/project/output/F1B-KM40_merged_fastqc.html); [F2R-KM41_merged.fastq.gz](http://raven.fish.washington.edu:8787/files/renee-myco/project/output/F2R-KM41_merged_fastqc.html). F1B has higher per base sequence quality, but both look good. # Assembly using MEGAHIT ## Installation First we clone the repository into our working folder. ```{bash} cd /home/shared/16TB_HDD_01/fish546/renee/ git clone https://github.com/voutcn/megahit.git #this brings the installation files into our working directory cd megahit #navigating to the folder ``` Then we have to compile MEGAHIT. ```{bash} cd /home/shared/16TB_HDD_01/fish546/renee/ mkdir build cd build cmake ../megahit make ./megahit --version # tests the installation; MEGAHIT v1.2.9 ``` ## Assembly process Once this is done, we run each file independently to produce two assemblies. ```{bash} cd /home/shared/16TB_HDD_01/fish546/renee/build #pointing wd to the binary files ./megahit -r ../F1B-KM40_merged.fastq.gz #specifying input file -o megahit_F1B_KM40_out #output directory --min-contig-len 500 #over 500 bps -t 8 #8 threads ``` ```{bash} cd /home/shared/16TB_HDD_01/fish546/renee/build ./megahit \ -r ../F2R-KM41_merged.fastq.gz \ -o megahit_F2R_KM41_out \ --min-contig-len 500 \ -t 8 ``` Each run took 120 minutes. ## Check files We want to see a text file when we check files using the head command. ```{bash} #check F1B_KM40 cd /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F1B_KM40_out head final.contigs.fa ``` ```{bash} #checkF2R_KM41 cd /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out head final.contigs.fa ``` We should get something that looks like this: ``` >k141_8240 flag=1 multi=2.0000 len=511 GGGGAACGCGCGCGGACGCGGGAAGGCCGCCTTCAGGTCGAAC ``` `>` = start of a header.For the meta data: k141 refers to the k-mer size (k=141) used when assembling that contig. 8240 is the ID or serial number assigned to that contig. It's the name of the contig. Then there is the internal MEGAHIT flag for the contig, usually about whether the contig is trusted/high-confidence (flag=1 usually means it's "normal" and usable). Then there is the multiplicity (or coverage estimate), which refers to how many times this contig appeared in the assembly graph. Finally, we have the contig length. # Annotation using MG-RAST MG-RAST (metagenomics Rapid Annotation using Subsystem Technology) is an open source, browser based annotation tool that has been previously been used in Winkler lab bioinformatics projects. There is a [user manual]((metagenomics%20Rapid%20Annotation%20using%20Subsystem%20Technology)). # Next steps (week 5-9) 1. marker gene extraction - identify 16S (bacteria) and ITS (fungi) reads (OR annotation?) 2. taxonomic classification 3. visualizations # References Bolger, A. M., Lohse, M., & Usadel, B.(2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data. Bioinformatics, btu170. F. Meyer, D. Paarmann, M. D'Souza, R. Olson , E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, and R. A. Edwards. The Metagenomics RAST server --- A public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinformatics 2008, 9:386. [http://www.biomedcentral.com/1471-2105/ 9/386](http://www.biomedcentral.com/1471-2105/9/386) Thomas T, Gilbert J, Meyer F. Metagenomics - a guide from sampling to data analysis. Microb Inform Exp. 2012 Feb 9;2(1):3. doi: 10.1186/2042-5783-2-3. PMID: 22587947; PMCID: PMC3351745. Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: A fast and accurate Illimuna Paired-End reAd mergeR. Vollmers J, Wiegand S, Kaster AK. Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist's Perspective - Not Only Size Matters! PLoS One. 2017 Jan 18;12(1):e0169662. doi: 10.1371/journal.pone.0169662. PMID: 28099457; PMCID: PMC5242441.