--- title: "Where are the SNPs?" author: Steven Roberts date: "`r format(Sys.time(), '%d %B, %Y')`" output: github_document: toc: true toc_depth: 3 number_sections: true html_preview: true html_document: theme: readable highlight: zenburn toc: true toc_float: true number_sections: true code_folding: show code_download: true --- ```{r setup, include=FALSE} library(knitr) library(tidyverse) library(kableExtra) library(DESeq2) library(pheatmap) library(RColorBrewer) library(data.table) library(DT) library(Biostrings) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages fig.width = 6, # Set plot width in inches fig.height = 4, # Set plot height in inches fig.align = "center" # Align plots to the center ) ``` # Background At minimum, identifying C-\>T SNPs on both strands so we have a list of SNPs from the EpiDiverse output. Other tasks that build upon that: - Compare Bs-snper and EpiDiverse SNP lists. Determine which source our final C-\>T SNP list should come from - Remove SNPs from methylKit analysis - Remove SNPs and recalculate average gene methylation and CV of gene methylation for modeling analyses ------------------------------------------------------------------------ There are two ways we have used to get genetic information from DNA methylation data. - EPIDiverse - BS-SNPer # Epidiverse Sam ran Epidiverse - [Notebook](https://robertslab.github.io/sams-notebook/2022/09/21/BSseq-SNP-Analysis-Nextflow-EpiDiverse-SNP-Pipeline-for-C.virginica-CEABIGR-BSseq-data.html) - VCF Directory - - results dir - ![](http://gannet.fish.washington.edu/seashell/snaps/Monosnap_Compiling_genetic_data__Issue_69__sr320ceabigr_2023-05-03_10-08-58.png) ## Merging Epidiverse VCFs Next step for capturing SNP info in Epidiverse Workflow is merging. ```{r, engine='bash'} cd ../data/big wget -r \ --no-directories --no-parent \ -P . \ -A "*vcf.g*" https://gannet.fish.washington.edu/Atumefaciens/20220921-cvir-ceabigr-nextflow-epidiverse-snp/snps/vcf/ ``` ```{r, engine='bash'} /home/shared/bcftools-1.14/bcftools merge \ --force-samples \ ../data/big/*.vcf.gz \ --merge all \ --threads 40 \ -O v \ -o ../output/51-SNPs/EpiDiv_merged.vcf ``` ```{r, engine='bash', eval=TRUE} head ../output/51-SNPs/EpiDiv_merged.vcf tail -2 ../output/51-SNPs/EpiDiv_merged.vcf ``` ```{r, engine='bash'} /home/shared/vcftools-0.1.16/bin/vcftools \ --vcf ../output/51-SNPs/EpiDiv_merged.vcf \ --recode --recode-INFO-all \ --min-alleles 2 --max-alleles 2 \ --max-missing 0.5 \ --mac 2 \ --out ../output/51-SNPs/EpiDiv_merged.f ``` After filtering, kept 26 out of 26 Individuals Outputting VCF file... After filtering, kept 2343637 out of a possible 144873997 Sites Run Time = 897.00 seconds ```{r, engine='bash', eval=TRUE} head ../output/51-SNPs/EpiDiv_merged.f.recode.vcf tail -2 ../output/51-SNPs/EpiDiv_merged.f.recode.vcf ``` ### Trying to get Genetic Relatedness Matrix ngsDist can also be used to compute genetic distance matrices from next-generation sequencing (NGS) data. ngsDist is a part of the ngsTools suite (https://github.com/fgvieira/ngsTools) that provides various tools for analyzing NGS data. Convert the VCF file to genotype likelihoods file (GLF) or binary alignment/map (BAM) format: ngsDist requires input data in GLF or BAM format. If you have a BAM file, you can use the 'angsd' tool from ngsTools to compute genotype likelihoods. If you have a VCF file, you can use 'bcftools' to convert it to GLF format. 
Here's an example of how to convert a VCF file to GLF format using bcftools:

```
bcftools query -e 'GT="mis" || GT="hom"' -f '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\t%GL\n]' input.vcf > input.glf
```

```{r, engine='bash'}
/home/shared/bcftools-1.14/bcftools query \
-e 'GT="mis" || GT="hom"' \
-f '[%SAMPLE\t%CHROM\t%POS\t%REF\t%ALT\t%GL\n]' \
../output/51-SNPs/EpiDiv_merged.f.recode.vcf \
> ../output/51-SNPs/EpiDiv_merged.f.recode_mishom.glf
```

```{r, engine='bash', eval=TRUE}
head -2 ../output/51-SNPs/EpiDiv_merged.f.recode_mishom.glf
tail -2 ../output/51-SNPs/EpiDiv_merged.f.recode_mishom.glf
```

```{r, engine='bash'}
/home/shared/bcftools-1.14/bcftools query \
-f '[%CHROM\t%POS\t%REF\t%ALT\t%GL\n]' \
../output/51-SNPs/EpiDiv_merged.f.recode.vcf \
> ../output/51-SNPs/likliehood.txt
```

```{r, engine='bash', eval=TRUE}
tail ../output/51-SNPs/likliehood.txt
```

```{python}
# Reformat the bcftools likelihood table: split the comma-separated GL field
# into space-separated values so ngsDist can read it.
input_file = "../output/51-SNPs/likliehood.txt"
output_file = "../output/51-SNPs/EpiDiv_merged.f.recode.glf"

with open(input_file, "r") as infile, open(output_file, "w") as outfile:
    for line in infile:
        fields = line.strip().split("\t")
        chrom, pos, ref, alt, gl = fields
        gl_values = gl.split(",")
        outfile.write(f"{chrom}\t{pos}\t{ref}\t{alt}\t{' '.join(gl_values)}\n")
```

```{r, engine='bash', eval=TRUE}
tail ../output/51-SNPs/EpiDiv_merged.f.recode.glf
```

```{r, engine='bash'}
#/home/shared/ngsTools/angsd/angsd \
#-doGlF 2 -doMajorMinor 1 -GL 2 \
#-out ../output/51-SNPs/EpiDiv_merged.f.recode.glf \
#-vcf-pl ../output/51-SNPs/EpiDiv_merged.f.recode.vcf
```

```{r, engine='bash', eval = TRUE}
head -3 ../output/51-SNPs/EpiDiv_merged.f.recode.glf
tail -3 ../output/51-SNPs/EpiDiv_merged.f.recode.glf
```

Create a sites file: ngsDist requires a sites file, a tab-separated file containing information about the sites to be analyzed. You can create this file from the filtered VCF file:

```
awk '{print $1"\t"$2}' filtered.recode.vcf > sites.txt
```

```{r, engine='bash'}
awk '{print $1"\t"$2}' ../output/51-SNPs/EpiDiv_merged.f.recode.vcf \
> ../output/51-SNPs/sites.txt
```

```{r, engine='bash', eval=TRUE}
#tail -1 ../output/51-SNPs/EpiDiv_merged.f.recode.vcf
head -10 ../output/51-SNPs/sites.txt
```

```{r, engine='bash', eval=TRUE}
wc -l ../output/51-SNPs/sites.txt
```

Run ngsDist: now you can run ngsDist using the GLF (or BAM) file and the sites file you created:

```{r, engine='bash'}
/home/shared/ngsTools/ngsDist/ngsDist \
--geno ../output/51-SNPs/EpiDiv_merged.f.recode.glf \
--probs --n_ind 26 \
--n_sites 2343713 \
--out ../output/51-SNPs/genotype.txt
```
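Once ngsDist finishes, the distance matrix could be pulled into R for a quick look at relatedness among the 26 individuals. This is only a sketch and is not evaluated here; it assumes `../output/51-SNPs/genotype.txt` is a tab-delimited pairwise distance matrix with sample labels in the first column, so the read-in will need adjusting to match the actual ngsDist output format.

```{r}
# Sketch only: assumes genotype.txt is a tab-delimited pairwise distance
# matrix with sample labels in the first column (check the ngsDist output
# and adjust the header/row.names arguments as needed).
dist_tab <- read.table("../output/51-SNPs/genotype.txt",
                       header = FALSE, sep = "\t", row.names = 1)
dist_mat <- as.matrix(dist_tab)
colnames(dist_mat) <- rownames(dist_mat)

# Cluster and visualize pairwise genetic distances among the 26 individuals
pheatmap(dist_mat,
         clustering_distance_rows = as.dist(dist_mat),
         clustering_distance_cols = as.dist(dist_mat),
         main = "ngsDist pairwise genetic distances")
```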
# BS-SNPer

## draft paper

![](http://gannet.fish.washington.edu/seashell/snaps/Monosnap_C._virginica_Methylation_and_Gene_Expression_-_Google_Docs_2023-05-03_10-38-20.png)

## Wiki

10x coverage files (SNP corrected) at `[sampleID]_R1_val_1_10x.SNPcorr.bedgraph`

## issue (backtracking)

see

![](http://gannet.fish.washington.edu/seashell/snaps/Monosnap_Index_of_seashellbu-moxscrubbed_2023-05-03_10-33-53.png)

``` bash
# Directories and programs
bismark_dir="/gscratch/srlab/programs/Bismark-0.22.3/"
bowtie2_dir="/gscratch/srlab/programs/bowtie2-2.3.4.1-linux-x86_64/"
samtools="/gscratch/srlab/programs/samtools-1.9/samtools/"
reads_dir="/gscratch/srlab/sr320/data/cvirg/"
fastqc="/gscratch/srlab/programs/fastqc_v0.11.9/fastqc"
genome_folder="/gscratch/srlab/sr320/data/Cvirg-genome/"
bcftools_dir="/gscratch/srlab/programs/bcftools-1.9/"
htlslib_dir="/gscratch/srlab/programs/htslib-1.9/"

source /gscratch/srlab/programs/scripts/paths.sh

#
# Ah sorry, I always forget about this with bcftools.
#
# Loop through files and compress like this:
#
# bcftools view -Oz -o compressed.vcf.gz plain.vcf
# htsfile compressed.vcf.gz
# bcftools index compressed.vcf.gz

# FILES=$(ls *vcf)
#
# for file in ${FILES}
# do
#   NAME=$(echo ${file} | awk -F "." '{print $1}')
#   echo ${NAME}
#
#   /gscratch/srlab/programs/bcftools-1.9/bcftools view -O z -o ${NAME}.compressed.vcf.gz \
#   ${NAME}.SNP-results.vcf
#   /gscratch/srlab/programs/htslib-1.9/htsfile ${NAME}.compressed.vcf.gz
#   /gscratch/srlab/programs/bcftools-1.9/bcftools index ${NAME}.compressed.vcf.gz
# done
#
#
# /gscratch/srlab/programs/bcftools-1.9/bcftools merge \
# 12M_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 44F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 13M_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 48M_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 16F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 50F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 19F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 52F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 22F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 53F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 23M_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 54F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 29F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 59M_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 31M_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 64M_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 35F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 6M_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 36F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 76F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 39F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 77F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 3F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 7M_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 41F_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# 9M_R1_val_1_bismark_bt2_pe.compressed.vcf.gz \
# --merge all \
# --threads 20 \
# -O v \
# -o Cv_10x_merged.vcf

/gscratch/srlab/programs/vcftools-0.1.16/bin/vcftools \
--vcf Cv_10x_merged.vcf \
--recode --recode-INFO-all \
--min-alleles 2 --max-alleles 2 \
--max-missing 0.5 \
--mac 2 \
--out Cv_10x_merged.filtered
```

The merged VCF is 16 GB. Filtered file:

```{r, engine='bash', eval=TRUE}
head ../data/Cv_10x_fxmerge_05.recode.vcf
```

```{r, engine='bash', eval=TRUE}
tail -2 ../data/Cv_10x_fxmerge_05.recode.vcf
```
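Toward the Background task of comparing the BS-SNPer and EpiDiverse SNP lists, below is a minimal, unevaluated sketch using `bcftools`. The filter expression, the output paths under `../output/51-SNPs/`, and the choice to represent reverse-strand C-\>T SNPs as G-\>A substitutions are assumptions here, not settled choices from the pipelines above.

```{r, engine='bash'}
# Sketch only (not run). Restrict each filtered call set to C->T and the
# reverse-strand G->A substitutions, then intersect the two lists.
# Paths for the bgzipped/indexed copies are placeholders.

/home/shared/bcftools-1.14/bcftools view \
-i '(REF="C" && ALT="T") || (REF="G" && ALT="A")' \
-O z -o ../output/51-SNPs/EpiDiv_CT.vcf.gz \
../output/51-SNPs/EpiDiv_merged.f.recode.vcf
/home/shared/bcftools-1.14/bcftools index ../output/51-SNPs/EpiDiv_CT.vcf.gz

/home/shared/bcftools-1.14/bcftools view \
-i '(REF="C" && ALT="T") || (REF="G" && ALT="A")' \
-O z -o ../output/51-SNPs/BSsnper_CT.vcf.gz \
../data/Cv_10x_fxmerge_05.recode.vcf
/home/shared/bcftools-1.14/bcftools index ../output/51-SNPs/BSsnper_CT.vcf.gz

# bcftools isec writes sites private to EpiDiverse (0000.vcf), private to
# BS-SNPer (0001.vcf), and shared by both (0002.vcf / 0003.vcf) into isec_CT/
/home/shared/bcftools-1.14/bcftools isec \
-p ../output/51-SNPs/isec_CT \
../output/51-SNPs/EpiDiv_CT.vcf.gz \
../output/51-SNPs/BSsnper_CT.vcf.gz
```

Counting records in each of those outputs should give a quick sense of how much the two callers agree before settling on the source for the final C-\>T SNP list.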