---
title: "Revisiting EpiDiverse SNPs"
author: Steven Roberts
date: "`r format(Sys.time(), '%d %B, %Y')`"
output: 
  github_document:
    toc: true
    toc_depth: 3
    number_sections: true
    html_preview: true
  html_document:
    theme: readable
    highlight: zenburn
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
---

```{r setup, include=FALSE}
library(knitr)
library(tidyverse)
library(kableExtra)
library(DESeq2)
library(pheatmap)
library(RColorBrewer)
library(data.table)
library(DT)
library(Biostrings)
library(spaa)

knitr::opts_chunk$set(
  echo = TRUE,          # Display code chunks
  eval = FALSE,         # Do not evaluate code chunks by default
  warning = TRUE,       # Display warnings
  message = TRUE,       # Display messages
  fig.width = 6,        # Set plot width in inches
  fig.height = 4,       # Set plot height in inches
  fig.align = "center"  # Align plots to the center
)
```

Sam ran the EpiDiverse SNP pipeline:

- [Notebook](https://robertslab.github.io/sams-notebook/2022/09/21/BSseq-SNP-Analysis-Nextflow-EpiDiverse-SNP-Pipeline-for-C.virginica-CEABIGR-BSseq-data.html)
- VCF Directory -
- results dir -

![](http://gannet.fish.washington.edu/seashell/snaps/Monosnap_Compiling_genetic_data__Issue_69__sr320ceabigr_2023-05-03_10-08-58.png)

## Merging EpiDiverse VCFs

The next step for capturing SNP information in the EpiDiverse workflow is merging the per-sample VCFs. First, download the gzipped VCFs:

```{r, engine='bash'}
cd ../data/big

wget -r \
--no-directories --no-parent \
-P . \
-A "*vcf.g*" https://gannet.fish.washington.edu/Atumefaciens/20220921-cvir-ceabigr-nextflow-epidiverse-snp/snps/vcf/
```

Then merge them with `bcftools`:

```{r, engine='bash'}
/home/shared/bcftools-1.14/bcftools merge \
--force-samples \
../data/big/*.vcf.gz \
--merge all \
--threads 40 \
-O v \
-o ../output/51-SNPs/EpiDiv_merged.vcf
```

```{r, engine='bash', eval=TRUE}
head -20 ../output/51-SNPs/EpiDiv_merged.vcf
```

The eventual GRM does not include sample names, so I wanted to check the order of the merged files (assuming this order carries through to how the GRM file is created):

```{r, engine='bash'}
fgrep "R1_val_1" ../output/51-SNPs/EpiDiv_merged.vcf
```

```
##commandline="freebayes -f GCF_002022765.2_C_virginica-3.0_genomic.fa 12M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.bam --strict-vcf --no-partial-observations --report-genotype-likelihood-max --genotype-qualities --min-repeat-entropy 1 --min-coverage 0 --min-base-quality 1 --region NC_035780.1:0-100000"
##bcftools_viewCommand=view -Oz 12M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf; Date=Sat Sep 24 13:39:51 2022
##bcftools_mergeCommand=merge --force-samples --merge all --threads 40 -O v -o ../output/51-SNPs/EpiDiv_merged.vcf ../data/big/12M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/13M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/16F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/19F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/22F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/23M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/29F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/31M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/35F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/36F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/39F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/3F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz
../data/big/41F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/44F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/48M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/50F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/52F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/53F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/54F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/59M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/64M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/6M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/76F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/77F_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/7M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz ../data/big/9M_R1_val_1_bismark_bt2_pe.deduplicated.sorted.vcf.gz; Date=Sun May 7 10:33:42 2023
```

Here is a breakdown of the filtering command that comes next:

- **`/home/shared/vcftools-0.1.16/bin/vcftools`**: The path to the VCFtools executable.
- **`--vcf ../output/51-SNPs/EpiDiv_merged.vcf`**: The input VCF file to filter.
- **`--recode --recode-INFO-all`**: `--recode` tells VCFtools to write a new VCF file after applying all of the filtering operations; `--recode-INFO-all` keeps all INFO fields from the input VCF in the output.
- **`--min-alleles 2 --max-alleles 2`**: Keeps only bi-allelic sites, i.e. sites with exactly two alleles (typically the reference allele and one alternate allele).
- **`--max-missing 0.5`**: Sets the maximum allowable proportion of missing genotypes per site; any variant where more than 50% of individuals have missing data is discarded.
- **`--mac 2`**: Keeps only sites with a minor allele count of at least 2, i.e. the less common allele must be observed at least twice across the samples.
- **`--out ../output/51-SNPs/EpiDiv_merged.f`**: The location and prefix for the output files; depending on the options used, several files may be generated, all starting with this prefix.

In summary, this command filters the merged VCF down to bi-allelic sites where less than 50% of the genotypes are missing and the minor allele is observed at least twice, keeping all INFO fields from the original VCF.

```{r, engine='bash'}
/home/shared/vcftools-0.1.16/bin/vcftools \
--vcf ../output/51-SNPs/EpiDiv_merged.vcf \
--recode --recode-INFO-all \
--min-alleles 2 --max-alleles 2 \
--max-missing 0.5 \
--mac 2 \
--out ../output/51-SNPs/EpiDiv_merged.f
```

Note that VCFtools appends `.recode.vcf` to the `--out` prefix, so the filtered file is `EpiDiv_merged.f.recode.vcf`.

```
After filtering, kept 26 out of 26 Individuals
Outputting VCF file...
After filtering, kept 2343637 out of a possible 144873997 Sites
Run Time = 897.00 seconds
```

```{r, engine='bash', eval=TRUE}
head -20 ../output/51-SNPs/EpiDiv_merged.f.recode.vcf
```

```{r, engine='bash'}
fgrep "R1_val_1" ../output/51-SNPs/EpiDiv_merged.f.recode.vcf
```

# NgsRelate a la chatGPT

First, let's understand what these tools are.

1. **ngsRelate**: A software package for inferring pairwise relatedness from next-generation sequencing (NGS) data. It can calculate several relatedness coefficients as well as inbreeding coefficients.
2. **spaa**: An R package that can calculate the genomic relationship matrix (GRM) and genomic kinship matrix (GKM) using SNP array and NGS data.
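Before running either tool, it is worth recalling that the eventual GRM carries no sample names, so the sample order has to come from the merged VCF itself: it is defined by the sample columns of the `#CHROM` header line. Below is a minimal shell sketch of pulling that order; the header file and sample names here are toy stand-ins, not the real merged data.

```shell
# Toy stand-in for a merged VCF header; the sample names (12M, 13M, 16F)
# are hypothetical placeholders for the bismark-derived names.
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\t12M\t13M\t16F\n' > /tmp/toy_header.txt

# Sample columns start at field 10 of the #CHROM line; print one name per line.
grep '^#CHROM' /tmp/toy_header.txt | cut -f10- | tr '\t' '\n'
```

Against the real file the same one-liner would be `grep '^#CHROM' ../output/51-SNPs/EpiDiv_merged.vcf | cut -f10- | tr '\t' '\n'`.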
As of version 2, NgsRelate can parse BCF/VCF files using htslib with the following command:

```
./ngsrelate -h my.VCF.gz -O vcf.res
```

By default, NgsRelate estimates the allele frequencies from the individuals provided in the VCF files. Allele frequencies from the INFO field can be used instead with `-A TAG`; the TAG usually takes the form `AF` or `AF1` but can be set to anything. By default the PL data (phred-scaled genotype likelihoods) are parsed; the called genotypes can be used instead with the `-T GT` option. If called genotypes are used, the software requires an additional argument (`-c 1`). If using `-c 2`, NgsRelate calls genotypes assuming Hardy-Weinberg equilibrium.

```{r, engine='bash'}
cut -d ',' -f 1 <
```

```{r, eval=TRUE}
write.table((distrab %>% tibble::rownames_to_column("sample")),
            file = "../output/53-revisit-epi-SNPs/epiMATRIX_mbd_rab.txt",
            sep = '\t', na = "NA", row.names = FALSE, col.names = TRUE)
```

```{r, eval=TRUE}
#write.table(distrab, file="../output/53-revisit-epi-SNPs/epiMATRIX_mbd_rab.txt",
#            col.names = T, row.names = T, sep = "\t")
```

```{r, eval=TRUE}
# Plot the heatmap
heatmap(distrab,
        Rowv = NA, Colv = NA,
        col = cm.colors(256),
        scale = "none")
```

```{r, eval=TRUE}
# Perform hierarchical clustering
hc <- hclust(dist(distrab))

# Plot the dendrogram (cladogram)
plot(hc, hang = -1, main = "Cladogram")

# Optionally grab the leaf labels and outline k = 6 clusters on the tree
labels <- rownames(distrab)
rect.hclust(hc, k = 6, border = "gray")
```
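As a final sanity check, the shape of the exported matrix file can be verified from the shell: with a leading `sample` column and a header row, a square matrix should have the same number of tab-separated fields on every line, and the line count should equal the field count. A minimal sketch on a toy stand-in for `epiMATRIX_mbd_rab.txt` (the contents below are assumed placeholder values, not real data):

```shell
# Toy stand-in for epiMATRIX_mbd_rab.txt: a 3-sample matrix exported
# with a leading "sample" column (values are hypothetical).
{
  printf 'sample\t12M\t13M\t16F\n'
  printf '12M\t0\t0.1\t0.2\n'
  printf '13M\t0.1\t0\t0.3\n'
  printf '16F\t0.2\t0.3\t0\n'
} > /tmp/toy_matrix.txt

# Report line count, field count, and whether any row's field count
# differs from the header's; a square export prints "4 4 OK".
awk -F'\t' 'NR==1 {n=NF} NF!=n {bad=1} END {print NR, n, (bad ? "MISMATCH" : "OK")}' /tmp/toy_matrix.txt
```

The same `awk` one-liner, pointed at the real export, gives a quick way to catch a malformed write before feeding the matrix to downstream tools.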