Create and inspect and alignment files. Including visualizing and capturing “outside” graphics. Publish notebook in rpubs and provide link at top of code.
Minimally show bam file, and at least 2 genome feature files.
Goals: 1) Work with large files but keep them off the machine 2) work with SAM/BAM 3) Properly mark/ comment work
This downloads the BAM file
cd ../data
curl -O https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/120321-cvBS/19F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam
curl -O https://gannet.fish.washington.edu/seashell/bu-mox/scrubbed/120321-cvBS/19F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam.bai
This downloads the genome for C. virginica (Eastern Oyster)
cd ../data
curl -O https://gannet.fish.washington.edu/seashell/bu-mox/data/Cvirg-genome/GCF_002022765.2_C_virginica-3.0_genomic.fa
curl -O https://gannet.fish.washington.edu/seashell/bu-mox/data/Cvirg-genome/GCF_002022765.2_C_virginica-3.0_genomic.fa.fai
Run this in terminal: make sure to navigate to appropriate directory first
/home/shared/samtools-1.12/samtools tview
../data/19F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam
../data/GCF_002022765.2_C_virginica-3.0_genomic.fa
Take a screen shot of this to include for the assignment!
This chunk downloads fastq files for C. gigas (Pacific Oyster) R1 is 4.6 GB and R2 is 4.9 GB
cd ../data
curl -O https://owl.fish.washington.edu/nightingales/C_gigas/F143n08_R2_001.fastq.gz
curl -O https://owl.fish.washington.edu/nightingales/C_gigas/F143n08_R1_001.fastq.gz
This downloads the genome C. gigas (Pacific Oyster)
cd ../data
curl -O https://gannet.fish.washington.edu/panopea/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa
curl -O https://gannet.fish.washington.edu/panopea/Cg-roslin/cgigas_uk_roslin_v1_genomic-mito.fa.fai
curl -O https://gannet.fish.washington.edu/panopea/Cg-roslin/GCF_902806645.1_cgigas_uk_roslin_v1_genomic-mito.gtf
Alignment
This builds index using Hisat. It will have 8 files
/home/shared/hisat2-2.2.1/hisat2-build \
-f ../data/cgigas_uk_roslin_v1_genomic-mito.fa \
../output/cgigas_uk_roslin_v1_genomic-mito.index
Build the SAM file.
Sam file is very large: 63.9 GB
/home/shared/hisat2-2.2.1/hisat2 \
-x ../output/cgigas_uk_roslin_v1_genomic-mito.index \
-p 4 \
-1 ../data/F143n08_R1_001.fastq.gz \
-2 ../data/F143n08_R2_001.fastq.gz \
-S ../output/F143_cgigas.sam
tail -1 ../output/F143_cgigas.sam
Will likely have to point to Crabby-Duck Repo for the rest of the assignment…
This should convert SAM to BAM
# Convert SAM to BAM, using 4 additional threads
/home/shared/samtools-1.12/samtools view -@ 4 -bS \
/home/shared/8TB_HDD_01/sr320/github/steven-crabby-duck/output/F143_cgigas.sam > ../output/F143_cgigas.bam
This sorts and indexs the BAM file
# Sort the BAM file, using 4 additional threads
/home/shared/samtools-1.12/samtools sort -@ 4 \
../output/F143_cgigas.bam -o ../output/F143_cgigas_sorted.bam
# Index the sorted BAM file (multi-threading is not applicable to this operation)
/home/shared/samtools-1.12/samtools index \
../output/F143_cgigas_sorted.bam
Use bcftools mpileup
to generate a pileup in a (.txt)
Binary Call Format file
However, according to announcement #10, do not have to run bcftools
/home/shared/bcftools-1.14/bcftools mpileup --threads 4 --no-BAQ \
--fasta-ref ../data/cgigas_uk_roslin_v1_genomic-mito.fa \
../output/F143_cgigas_sorted.bam > ../output/F143_mpileup_output.txt
Take a look at the mileup file
tail ../output/F143_mpileup_output.txt
call variants, output variant sites, and store results in VCF file
cat ../output/F143_mpileup_output.txt \
| /home/shared/bcftools-1.14/bcftools call -mv -Oz \
> ../output/F143_mpile.vcf.gz
Note, these chunks may not work but cause use VCF from chunk above for visualization
zgrep "^##" -v ../output/F143_mpile.vcf.gz | \
awk 'BEGIN{OFS="\t"} {split($8, a, ";"); print $1,$2,$4,$5,$6,a[1],$9,$10}' | head
/home/shared/bcftools-1.14/bcftools call \
-v -c ../output/F143_mpile.vcf.gz \
> ../output/F143_mpile_calls.vcf
I don’t think it’s supposed to look like that but I couldn’t zoom in on the program….