--- title: "06-alignment-data" output: md_document date: "2023-05-04" --- ```{r setup, include=FALSE} knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Setting this to false will allow us to knit the codument without evaluating code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages fig.width = 6, # Set plot width in inches fig.height = 4, # Set plot height in inches fig.align = "center" # Align plots to the center ) ``` # Task 1 Curling in sorted.bam files: ```{r, engine='bash'} cd ../data curl -O https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/120321-cvBS/19F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam curl -O https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/120321-cvBS/19F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam.bai ``` Curling in reference genome: ```{r, engine='bash'} cd ../data curl -O https://gannet.fish.washington.edu/seashell/bu-mox/data/Cvirg-genome/GCF_002022765.2_C_virginica-3.0_genomic.fa curl -O https://gannet.fish.washington.edu/seashell/bu-mox/data/Cvirg-genome/GCF_002022765.2_C_virginica-3.0_genomic.fa.fai ``` Using tview to quickly take a peak at the sorted.bam file and genome (looking at alignment data against genome) ```{r, engine='bash'} /home/shared/samtools-1.12/samtools tview \ ../data/19F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam \ ../data/GCF_002022765.2_C_virginica-3.0_genomic.fa ``` ![tview screenshot](https://github.com/course-fish546-2023/celeste-coursework/blob/main/week-06-tview.png?raw=true) # Task 2 ## Creating bam files For this task we will align RNAseq data to a reference genome using HISat2 and then proceed to visualize it with IGV. Fist we download the fastq files: ```{r, engine='bash'} cd ../data curl -O https://owl.fish.washington.edu/nightingales/C_gigas/F143n08_R2_001.fastq.gz curl -O https://owl.fish.washington.edu/nightingales/C_gigas/F143n08_R1_001.fastq.gz ``` Next we download some genome references: ```{r, engine='bash'} cd ../data curl -O https://gannet.fish.washington.edu/panopea/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa curl -O https://gannet.fish.washington.edu/panopea/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa.fai curl -O https://gannet.fish.washington.edu/panopea/Cg-roslin/GCF_902806645.1_cgigas_uk_roslin_v1_genomic-mito.gtf ``` Creating reference index using the gtf file with HISat2 ```{r, engine='bash'} /home/shared/hisat2-2.2.1/hisat2-build \ -f ../data/cgigas_uk_roslin_v1_genomic-mito.fa \ ../output/cgigas_uk_roslin_v1_genomic-mito.index ``` Alignment of the RNAseq data with hisat2: ```{r, engine='bash'} /home/shared/hisat2-2.2.1/hisat2 \ -x ../output/cgigas_uk_roslin_v1_genomic-mito.index \ -p 4 \ -1 ../data/F143n08_R1_001.fastq.gz \ -2 ../data/F143n08_R2_001.fastq.gz \ -S ../output/F143_cgigas.sam ``` Peaking at the alignment data (SAM file format): ```{r, engine='bash'} tail -1 ../output/F143_cgigas.sam ``` Next we are create a bam file from the sam file ```{r, engine ='bash'} /home/shared/samtools-1.12/samtools view \ -@ 7 \ -Su ../output/F143_cgigas.sam \ | /home/shared/samtools-1.12/samtools sort - \ -@ 7 \ -o ../output/F143_cgigas.sorted.bam /home/shared/samtools-1.12/samtools index ../output/F143_cgigas.sorted.bam ``` When done making the BAM files, delete the SAM files to free up space on device so you can proceed to the next step without hiccups. ## Using mpileup to make a vcf file for visualization in IGV Did this but couldn't get the txt file to be used towards making a vcf.gz file. Just made a vcf.gz file directly, ```{r, engine ='bash'} #!/bin/bash ## I leave in the code chunk because it does work well, just not used in the following workflow: # loop through all sorted.bam files #for file in ../output/*.sorted.bam; do # get the sample name from the filename # sample=$(basename $file .sorted.bam) # run mpileup # /home/shared/bcftools-1.14/bcftools mpileup --threads 4 --no-BAQ \ # --fasta-ref ../data/cgigas_uk_roslin_v1_genomic-mito.fa \ # $file > ../output/${sample}_mpileup_output.txt #done ``` ```{r, engine='bash'} #tail ../output/F143_cgigas_mpileup_output.txt ``` Skipping the txt file to vcf.gz and making a vcf.gz file directly from the sorted.bam files: ```{r, engine='bash'} /home/shared/bcftools-1.14/bcftools mpileup --threads 4 --no-BAQ \ --fasta-ref ../data/cgigas_uk_roslin_v1_genomic-mito.fa \ ../output/*.sorted.bam | /home/shared/bcftools-1.14/bcftools call -Oz -v -m -o ../output/F143_mpile.vcf.gz ``` You can take a look at what's in the file here: ```{r, engine='bash'} zgrep "^##" -v ../output/F143_mpile.vcf.gz | \ awk 'BEGIN{OFS="\t"} {split($8, a, ";"); print $1,$2,$4,$5,$6,a[1],$9,$10}' | head ``` Now we convert the vcf.gz file to a vcf file using the call command, but the vcf.gz file itself is truncated since there is not much room left on my current device. ```{r, engine='bash'} /home/shared/bcftools-1.14/bcftools call \ -v -c - f ../output/F143_mpile.vcf.gz \ > ../output/F143_mpile_calls.vcf ``` Inputting the reference genome, bam file, and vcf file into IGV I get the following where it reports that there are no variants found. I suspect it has to do with the fact that download of the file was truncated due to a lack of space on my current device. I will be re-running the code later on to try to get the files to fully download. ![IGV](https://github.com/course-fish546-2023/celeste-coursework/blob/main/igv%20screenshot.png?raw=true)