---
title: "Rhizospheric fungal/bacteria metagenome project"
format: html
editor: 
  markdown: 
    wrap: sentence
---

We are analyzing the metagenomes from 2 soil samples from a greenhouse study.
These will give us a sense of resident microbial (fungal and bacteiral) communities prior to biofertilizer treatment.

1.  F1B-KM40 - bulk field soil
2.  F2\$-KM41 - rhizosphere soil

# Data upload

The goal for week 2 is to upload fastq files to Raven server.
Note that this needs to be done from the local terminal, not the one on the server.

```{bash}
# move files from lab drive to Raven
scp "/Volumes/CRUZER16GB/F1B-KM40_R1_001.fastq.gz" \
rendavis@raven.fish.washington.edu:/home/shared/16TB_HDD_01/fish546/renee/
```

That was successful.
Now we upload the rest using rsync instead (better for handling multiple files, fewer chances of duplicates, conflicts etc).

```{bash}
# move remaining files from lab drive to Raven
rsync -av --progress \
    /Volumes/CRUZER16GB/*.fastq.gz \
    rendavis@raven.fish.washington.edu:/home/shared/16TB_HDD_01/fish546/renee/
```

# Data integrity and quality control

## Checksums

Before we begin working, let's run a MD5 check on the 4 metagenome fastq files.

```{bash}
find . -type f -exec md5sum-lite {} \; >> output.txt  #found all files in the working directory, calculated its MD5 checksum, save to an output txt file
```

This prints the results to a text file.
Here is the output:

```         
b24ae50750964ef8d01a8808051c73c7  ./F2R-KM41_R2_001.fastq.gz
a48a5ad0bc4ce07f641c0c647fe0dbaf  ./F1B-KM40_R2_001.fastq.gz
88d5e83290598639d54218bc505bfe66  ./output.txt
a8be224ab035ea641ccd5a8c9d46c024  ./F2R-KM41_R1_001.fastq.gz
e34585c824e429f54ff34d919719f1c4  ./F1B-KM40_R1_001.fastq.gz
```

```{bash}
mv output.txt md5checksums_metagenomes_2025.txt #renamed the file to something more descriptive
```

To validate file integrity later, you could run the following code to ensure a match.

```{bash}
md5sum -c md5checksums_metagenomes_2025.txt
```

## Quality control

### FastQC

I have been told that QC has been run on these, but we won't take that at face value.
For this we will use FastQC.

```{bash}
# run FastQC on the fastq files
/home/shared/FastQC-0.12.1/fastqc -t 36 -o output /home/shared/16TB_HDD_01/fish546/renee/*.fastq.gz
```

```{bash}
Analysis complete for F2R-KM41_R1_001.fastq.gz
Analysis complete for F2R-KM41_R2_001.fastq.gz
Analysis complete for F1B-KM40_R1_001.fastq.gz
Analysis complete for F1B-KM40_R2_001.fastq.gz
```

.png files downloaded to my desktop.

### MultiQC

```{bash echo=TRUE}
eval "$(/opt/anaconda/anaconda3/bin/conda shell.bash hook)"
conda activate
which multiqc

cd output

multiqc .
```

See html report in output folder.

# Preprocessing

We have to preprocess raw data to enable clean, high-quality reads ready for assembly.
This involves 2 steps: trimming and merging forward and reverse reads.

## Trimming using Trimmomatic

We need to trim raw data to remove adapters and low-quality bases.
It's important to do this prior to merging, as merging reads relies on the overlapping region between forward and reverse reads.
If you try to merge before trimming, you could end up with incorrect overlaps, low quality merges, and related issues.

For this step we will use [Trimmomatic](#0), which we will have to install manually.

```{bash}
scp -r ~/Downloads/Trimmomatic-0.39 rendavis@raven.fish.washington.edu:/home/shared/16TB_HDD_01/fish546/renee  #from the local terminal
```

Now we will run it to trim adapters and low quality bases.
We'll stick with defaults of greater than or equal to 50 base pairs (bp), and 3 leading and trailing bp.
We will do this in 2 steps with F1B R1/R2 and then F2R R1/R2.

```{bash}
cd /home/shared/16TB_HDD_01/fish546/renee
java -jar /home/shared/16TB_HDD_01/fish546/renee/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 4 \
  F1B-KM40_R1_001.fastq.gz F1B-KM40_R2_001.fastq.gz \
  F1B-KM40_trimmed_R1_paired.fastq.gz F1B-KM40_trimmed_R1_unpaired.fastq.gz \
  F1B-KM40_trimmed_R2_paired.fastq.gz F1B-KM40_trimmed_R2_unpaired.fastq.gz \
  ILLUMINACLIP:TruSeq3-PE.fa:2:30:10 \
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50  #keeps bps equal or above 50 
```

```         
Quality encoding detected as phred33
Input Read Pairs: 31944124 
Both Surviving: 29010263 (90.82%) 
Forward Only Surviving: 1518244 (4.75%) 
Reverse Only Surviving: 921543 (2.88%) 
Dropped: 494074 (1.55%)
TrimmomaticPE: Completed successfully
```

These are good numbers.
Generally, anything over 80% both surviving is considered good.
And 1.5% is a very low drop rate.

```{bash}
cd /home/shared/16TB_HDD_01/fish546/renee/raw-data  
java -jar /home/shared/16TB_HDD_01/fish546/renee/Trimmomatic-0.39/trimmomatic-0.39.jar PE -threads 4 \
  F2R-KM41_R1_001.fastq.gz F2R-KM41_R2_001.fastq.gz \
  F2R-KM41_trimmed_R1_paired.fastq.gz F2R-KM41_trimmed_R1_unpaired.fastq.gz \
  F2R-KM41_trimmed_R2_paired.fastq.gz F2R-KM41_trimmed_R2_unpaired.fastq.gz \
  ILLUMINACLIP:/home/shared/16TB_HDD_01/fish546/renee/Trimmomatic-0.39/adapters/TruSeq3-PE.fa:2:30:10  #had to use the whole file path here
  LEADING:3 TRAILING:3 SLIDINGWINDOW:4:20 MINLEN:50  #keeps bps equal or above 50 
```

```         
Quality encoding detected as phred33
Input Read Pairs: 27015389 
Both Surviving: 26807327 (99.23%) 
Forward Only Surviving: 208057 (0.77%) 
Reverse Only Surviving: 0 (0.00%) 
Dropped: 5 (0.00%)
TrimmomaticPE: Completed successfully
```

Trimmomatic produces **unpaired** and **paired** file outputs.
Paired reads are those which both forward and reverse survived trimming.
Unpaired reads indicate where only one of the pair survived (the other was discarded due to low quality or short length).

We will use the paired files for most downstream steps like merging and assembly, but will keep the unpaired files around just in case we need to embark on deeper analysis.

## Merging forward and reverse reads using PEAR

These are R1/R2 (forward and reverse reads) and will have to be merged.
This is the last component of pre-processing as we work towards metagenome assembly.For this we will use fastp which is already installed on Raven.

```{bash}
cd /home/shared/16TB_HDD_01/fish546/renee

# Merge F1B-KM40
/home/shared/fastp-v0.24.0/fastp \
  -i F1B-KM40_trimmed_R1_paired.fastq.gz \
  -I F1B-KM40_trimmed_R2_paired.fastq.gz \
  --merge \
  --merged_out F1B-KM40_merged.fastq.gz \
  
# Merge F2R-KM41
/home/shared/fastp-v0.24.0/fastp \
  -i F2R-KM41_trimmed_R1_paired.fastq.gz \
  -I F2R-KM41_trimmed_R2_paired.fastq.gz \
  --merge \
  --merged_out F2R-KM41_merged.fastq.gz \
```

```         
Read1 before filtering:
total reads: 29010263
total bases: 4140614008
Q20 bases: 4050447746(97.8224%)
Q30 bases: 3845738339(92.8785%)

Read2 before filtering:
total reads: 29010263
total bases: 4154017583
Q20 bases: 4065704832(97.874%)
Q30 bases: 3866115895(93.0693%)

Merged and filtered:
total reads: 7474936
total bases: 1650421008
Q20 bases: 1630687369(98.8043%)
Q30 bases: 1578354375(95.6334%)

Filtering result:
reads passed filter: 58018466
reads failed due to low quality: 0
reads failed due to too many N: 2060
reads failed due to too short: 0
reads with adapter trimmed: 635320
bases trimmed due to adapters: 11744183
reads corrected by overlap analysis: 2160886
bases corrected by overlap analysis: 2804599

Duplication rate: 7.13807%

Insert size peak (evaluated by paired-end reads): 268

Read pairs merged: 7474936
% of original read pairs: 25.7665%
% in reads after filtering: 100%


JSON report: fastp.json
HTML report: fastp.html

/home/shared/fastp-v0.24.0/fastp -i F1B-KM40_trimmed_R1_paired.fastq.gz -I F1B-KM40_trimmed_R2_paired.fastq.gz --merge --merged_out F1B-KM40_merged.fastq.gz 
fastp v0.24.0, time used: 178 seconds
Read1 before filtering:
total reads: 26807327
total bases: 4021099050
Q20 bases: 3808188117(94.7052%)
Q30 bases: 3500492825(87.0531%)

Read2 before filtering:
total reads: 26807327
total bases: 4021099050
Q20 bases: 3793251312(94.3337%)
Q30 bases: 3477357474(86.4778%)

Merged and filtered:
total reads: 6724249
total bases: 1525795473
Q20 bases: 1489384743(97.6137%)
Q30 bases: 1418180834(92.947%)

Filtering result:
reads passed filter: 51640358
reads failed due to low quality: 1969720
reads failed due to too many N: 4576
reads failed due to too short: 0
reads with adapter trimmed: 170138
bases trimmed due to adapters: 2613154
reads corrected by overlap analysis: 2552667
bases corrected by overlap analysis: 5056209

Duplication rate: 6.67899%

Insert size peak (evaluated by paired-end reads): 268

Read pairs merged: 6724249
% of original read pairs: 25.0836%
% in reads after filtering: 100%
```

Around 25% of original read pairs got merged, which appears to be typical for paired-end read merging.

As a final step, we'll preview the file and do a quality check on the merged files using FastQC.

```{bash}
# run FastQC on the fastq files
/home/shared/FastQC-0.12.1/fastqc -t 36 -o output /home/shared/16TB_HDD_01/fish546/renee/*merged.fastq.gz
```

HTML report results: [F1B-KM40_merged.fastq.gz](http://raven.fish.washington.edu:8787/files/renee-myco/project/output/F1B-KM40_merged_fastqc.html); [F2R-KM41_merged.fastq.gz](http://raven.fish.washington.edu:8787/files/renee-myco/project/output/F2R-KM41_merged_fastqc.html).
F1B has higher per base sequence quality, but both look good.

# Assembly using MEGAHIT

## Installation

First we clone the repository into our working folder.

```{bash}
cd /home/shared/16TB_HDD_01/fish546/renee/
git clone https://github.com/voutcn/megahit.git #this brings the installation files into our working directory
cd megahit #navigating to the folder
```

Then we have to compile MEGAHIT.

```{bash}
cd /home/shared/16TB_HDD_01/fish546/renee/
mkdir build
cd build
cmake ../megahit
make

./megahit --version # tests the installation; MEGAHIT v1.2.9
```

## Assembly process

Once this is done, we run each file independently to produce two assemblies.

```{bash}
cd /home/shared/16TB_HDD_01/fish546/renee/build  #pointing wd to the binary files
./megahit -r ../F1B-KM40_merged.fastq.gz  #specifying input file
-o megahit_F1B_KM40_out   #output directory
--min-contig-len 500  #over 500 bps
-t 8  #8 threads
```

```{bash}
cd /home/shared/16TB_HDD_01/fish546/renee/build

./megahit \
  -r ../F2R-KM41_merged.fastq.gz \
  -o megahit_F2R_KM41_out \
  --min-contig-len 500 \
  -t 8
```

Each run took 120 minutes.

## Check files

We want to see a text file when we check files using the head command.

```{bash}
#check F1B_KM40
cd /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F1B_KM40_out
head final.contigs.fa
```

```{bash}
#checkF2R_KM41
cd /home/shared/16TB_HDD_01/fish546/renee/build/megahit_F2R_KM41_out
head final.contigs.fa
```

We should get something that looks like this:

```         
>k141_8240 flag=1 multi=2.0000 len=511
GGGGAACGCGCGCGGACGCGGGAAGGCCGCCTTCAGGTCGAAC
```

`>` = start of a header.For the meta data: k141 refers to the k-mer size (k=141) used when assembling that contig. 8240 is the ID or serial number assigned to that contig. It's the name of the contig. Then there is the internal MEGAHIT flag for the contig, usually about whether the contig is trusted/high-confidence (flag=1 usually means it's "normal" and usable). Then there is the multiplicity (or coverage estimate), which refers to how many times this contig appeared in the assembly graph. Finally, we have the contig length. 

# Annotation using MG-RAST

MG-RAST (metagenomics Rapid Annotation using Subsystem Technology) is an open source, browser based annotation tool that has been previously been used in Winkler lab bioinformatics projects.
There is a [user manual]((metagenomics%20Rapid%20Annotation%20using%20Subsystem%20Technology)).

# Next steps (week 5-9)

1.  marker gene extraction - identify 16S (bacteria) and ITS (fungi) reads (OR annotation?)

2.  taxonomic classification

3.  visualizations

# References

Bolger, A. M., Lohse, M., & Usadel, B.(2014). Trimmomatic: A flexible trimmer for Illumina Sequence Data.
Bioinformatics, btu170.

F. Meyer, D. Paarmann, M. D'Souza, R. Olson , E.
M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, and R.
A. Edwards.
The Metagenomics RAST server --- A public resource for the automatic phylogenetic and functional analysis of metagenomes.
BMC Bioinformatics 2008, 9:386.
[http://www.biomedcentral.com/1471-2105/ 9/386](http://www.biomedcentral.com/1471-2105/9/386)

Thomas T, Gilbert J, Meyer F. Metagenomics - a guide from sampling to data analysis.
Microb Inform Exp. 2012 Feb 9;2(1):3.
doi: 10.1186/2042-5783-2-3.
PMID: 22587947; PMCID: PMC3351745.

Zhang J, Kobert K, Flouri T, Stamatakis A. PEAR: A fast and accurate Illimuna Paired-End reAd mergeR.

Vollmers J, Wiegand S, Kaster AK. Comparing and Evaluating Metagenome Assembly Tools from a Microbiologist's Perspective - Not Only Size Matters!
PLoS One.
2017 Jan 18;12(1):e0169662.
doi: 10.1371/journal.pone.0169662.
PMID: 28099457; PMCID: PMC5242441.