--- title: "Genetic Relatedness Matrix" author: Steven Roberts date: "`r format(Sys.time(), '%d %B, %Y')`" output: github_document: toc: true toc_depth: 3 number_sections: true html_preview: true html_document: theme: readable highlight: zenburn toc: true toc_float: true number_sections: true code_folding: show code_download: true --- ```{r setup, include=FALSE} library(knitr) library(tidyverse) library(kableExtra) library(DESeq2) library(pheatmap) library(RColorBrewer) library(data.table) library(DT) library(Biostrings) library(spaa) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = TRUE, # Hide warnings message = TRUE, # Hide messages fig.width = 6, # Set plot width in inches fig.height = 4, # Set plot height in inches fig.align = "center" # Align plots to the center ) ``` We have a couple of merged VCFs (epidiverse, bs-snper) ```{r, engine='bash', eval=TRUE} echo "epidiverse" tail -2 ../output/51-SNPs/EpiDiv_merged.f.recode.vcf echo "BS" tail -2 ../data/Cv_10x_fxmerge_05.recode.vcf ``` # plink PLINK is a widely used open-source whole-genome association analysis toolset, which can handle large datasets and perform a variety of analyses including population stratification. Here's a general outline of the steps you'd take: 1. **Conversion of VCF to PLINK format**: PLINK operates on a different file format than VCF. You need to convert your VCF to PLINK format, which generally comprises three separate files: `.bed` (binary pedigree), `.bim` (extended MAP file), and `.fam` (family file). You can do this using PLINK's `--vcf` option: ``` plink --vcf your_merged.vcf --make-bed --out plink_files ``` This will create `plink_files.bed`, `plink_files.bim`, and `plink_files.fam`. 2. **Calculate Identity-by-State (IBS) matrix**: You can calculate the IBS matrix using the `--genome` option in PLINK. This gives a measure of the genetic similarity between all pairs of individuals in your dataset: ``` plink --bfile plink_files --genome --out ibs_matrix ``` This will produce a file `ibs_matrix.genome` which contains the pairwise IBS matrix. 3. **Create the Genetic Relatedness Matrix (GRM)**: You can use this IBS matrix to create a GRM. The GRM is a square symmetric matrix that quantifies the degree of genetic variance that can be explained by the genetic markers between each pair of individuals in your samples. GCTA (Genome-wide Complex Trait Analysis) is a tool that can help to create a GRM from your IBS matrix: ``` gcta64 --bfile plink_files --make-grm --out grm_files ``` This will produce two files (`grm_files.grm.bin` and `grm_files.grm.N.bin`) representing the GRM. ```{r, engine='bash'} /home/shared/plink_linux_x86_64_20230116/plink \ --vcf ../data/Cv_10x_fxmerge_05.recode.vcf \ --make-bed \ --allow-extra-chr \ --out ../output/52-genetic-relatedness/plink_files ``` ```{bash} /home/shared/plink_linux_x86_64_20230116/plink -help ``` ``` --vcf-half-call : Specify how '0/.' and similar VCF GT values should be handled. The following four modes are supported: * 'error'/'e' (default) errors out and reports line #. * 'haploid'/'h' treats them as haploid calls. * 'missing'/'m' treats them as missing. * 'reference'/'r' treats the missing value as 0. ``` ```{r, engine='bash', eval=TRUE} /home/shared/plink_linux_x86_64_20230116/plink \ --vcf ../output/51-SNPs/EpiDiv_merged.f.recode.vcf \ --make-bed \ --allow-extra-chr \ --vcf-half-call m \ --out ../output/52-genetic-relatedness/plink_files ``` ```{r, engine='bash'} /home/shared/plink_linux_x86_64_20230116/plink \ --bfile ../output/52-genetic-relatedness/plink_files \ --allow-extra-chr \ --genome \ --out ../output/52-genetic-relatedness/ibs_matrix ``` ```{r, engine='bash'} /home/shared/gcta-1.94.1-linux-x86_64-static \ --bfile ../output/52-genetic-relatedness/plink_files \ --make-grm \ --out ../output/52-genetic-relatedness/grm_files ``` # ANGSD Yes, absolutely. The program ANGSD (Analysis of Next Generation Sequencing Data) is a versatile tool for analyzing low depth Next Generation Sequencing data. This is particularly helpful when working with non-model organisms, ancient DNA, or other datasets where high quality reference genomes might not be available. ANGSD can calculate the genetic relatedness matrix among individuals based on the genotype likelihoods instead of called genotypes, making it a good choice for low depth sequencing data. Here are the steps you might follow: 1. **Filtering**: It's generally a good idea to filter your VCF files first, based on your research needs. This might include filtering for minor allele frequency, quality scores, or read depth. 2. **Calculate Relatedness**: Use the `-doIBS` option in ANGSD to calculate the relatedness between individuals. You would use something like: ``` bash angsd -vcf-gl your_merged.vcf -doIBS 1 -out output ``` 3. **Create genetic relatedness matrix**: The previous step will output a file with the pairwise relatedness values. You can then convert this file to a matrix format for easier analysis. Please note, as with any analysis, you should carefully consider the assumptions and requirements of the method you're using. Some of these commands might need to be adjusted depending on the specifics of your data and research question. Finally, remember to adhere to ethical guidelines and regulations when handling genetic data. ```{r, engine='bash', eval=TRUE} /home/shared/ngsTools/angsd/angsd \ -vcf-gl ../output/51-SNPs/EpiDiv_merged.f.recode.vcf \ -doIBS 1 \ -out ../output/52-genetic-relatedness/angsd-output ``` # NgsRelate a la chatGPT First, let's understand what these tools are. 1. **ngsRelate**: This is a software package used for inferring pairwise relatedness from next-generation sequencing (NGS) data. It can calculate different coefficients of relatedness as well as inbreeding coefficients. 2. **spaa**: This is an R package that can calculate the genomic relationship matrix (GRM) and genomic kinship matrix (GKM) using SNP array and NGS data. As of version 2, NgsRelate can parse BCF/VCF files using htslib with the following command: ``` ./ngsrelate -h my.VCF.gz -O vcf.res By default, NgsRelate will estimate the allele frequencies using the individuals provided in the VCF files. Allele frequencies from the INFO field can used be used instead using -A TAG. The TAG usually take the form of AF or AF1 but can be set to anything. By default the PL data (Phred-scaled likelihoods for genotypes) is parsed, however, the called genotypes can also be used instead with the -T GT option. If called genotypes are being used, the software requires an additional argument (-c 1). If using -c 2, ngsRelate calls genotypes assuming hardy-weinberg. ``` ```{r, engine='bash', eval=FALSE} /home/shared/ngsRelate/ngsRelate/ngsRelate \ -h ../output/51-SNPs/EpiDiv_merged.f.recode.vcf \ -T GT \ -c 1 \ -O ../output/52-genetic-relatedness/vcf.relatedness ``` ```{r, engine='bash', eval=FALSE} /home/shared/ngsRelate/ngsRelate/ngsRelate \ -h ../data/Cv_10x_fxmerge_05.recode.vcf \ -T GT \ -c 1 \ -O ../output/52-genetic-relatedness/vcf.bs.relatedness ``` ```{r, engine='bash', eval=TRUE} head -2 ../output/52-genetic-relatedness/vcf.relatedness ``` ```{r, engine='bash', eval=TRUE} head -2 ../output/52-genetic-relatedness/vcf.bs.relatedness ``` ## Output format ``` a b nSites J9 J8 J7 J6 J5 J4 J3 J2 J1 rab Fa Fb theta inbred_relatedness_1_2 inbred_relatedness_2_1 fraternity identity zygosity 2of3IDB FDiff loglh nIter coverage 2dsfs R0 R1 KING 2dsfs_loglike 2dsfsf_niter 0 1 99927 0.384487 0.360978 0.001416 0.178610 0.071681 0.000617 0.002172 0.000034 0.000005 0.237300 0.002828 0.250330 0.127884 0.001091 0.035846 0.001451 0.000005 0.001456 0.345411 -0.088997 -341223.335664 103 0.999270 0.154920,0.087526,0.038724,0.143087,0.155155,0.139345,0.038473,0.087632,0.155138 0.497548 0.290124 0.000991 -356967.175857 7 ``` The first two columns contain indices of the two individuals used for the analysis. The third column is the number of genomic sites considered. The following nine columns are the maximum likelihood (ML) estimates of the nine jacquard coefficients, where K0==J9; K1==J8; K2==J7 in absence of inbreeding. Based on these Jacquard coefficients, NgsRelate calculates 11 summary statistics: 13. rab is the pairwise relatedness `(J1+J7+0.75*(J3+J5)+.5*J8)` [Hedrick et al](https://academic.oup.com/jhered/article/106/1/20/2961876) 14. Fa is the inbreeding coefficient of individual a `J1+J2+J3+J4` [Jacquard](https://www.springer.com/gp/book/9783642884177) 15. Fb is the inbreeding coefficient of individual b `J1+J2+J5+J6` [Jacquard](https://www.springer.com/gp/book/9783642884177) 16. theta is the coefficient of kinship `J1 + 0.5*(J3+J5+J7) + 0.25*J8)` [Jacquard](https://www.springer.com/gp/book/9783642884177) 17. inbred_relatedness_1\_2 `J1+0.5*J3` [Ackerman et al](http://www.genetics.org/content/206/1/105) 18. inbred_relatedness_2\_1 `J1+0.5*J5` [Ackerman et al](http://www.genetics.org/content/206/1/105) 19. fraternity `J2+J7` [Ackerman et al](http://www.genetics.org/content/206/1/105) 20. identity `J1` [Ackerman et al](http://www.genetics.org/content/206/1/105) 21. zygosity `J1+J2+J7` [Ackerman et al](http://www.genetics.org/content/206/1/105) 22. Two-out-of-three IBD `J1+J2+J3+J5+J7+0.5*(J4+J6+J8)` [Miklos csuros](https://www.sciencedirect.com/science/article/pii/S0040580913001123) 23. Inbreeding difference `0.5*(J4-J6)` [Miklos csuros](https://www.sciencedirect.com/science/article/pii/S0040580913001123) 24. the log-likelihood of the ML estimate. 25. number of EM iterations. If a `-1` is displayed. A boundary estimate had a higher likelihood. 26. If differs from `-1`, a boundary estimate had a higher likelihood. Reported loglikelihood should be highly similar to the corresponding value reported in `loglh` 27. fraction of sites used for the ML estimate The remaining columns relate to statistics based on a 2D-SFS. 28. 2dsfs estimates using the same methodology as implemented in realSFS, see [ANGSD](https://github.com/ANGSD/angsd) 29. R0 [Waples et al](https://www.biorxiv.org/content/early/2018/08/31/260497) 30. R1 [Waples et al](https://www.biorxiv.org/content/early/2018/08/31/260497) 31. KING [Waples et al](https://www.biorxiv.org/content/early/2018/08/31/260497) 32. the log-likelihood of the 2dsfs estimate. 33. number of iterations for 2dsfs estimate You can also input a file with the IDs of the individuals (on ID per line), using the `-z` option, in the same order as in the file `filelist` used to make the genotype likelihoods or the VCF file. If you do the output will also contain these IDs as column 3 and 4. Note that in some cases nIter is -1. This indicates that values on the boundary of the parameter space had a higher likelihood than the values achieved using the EM-algorithm (ML methods sometimes have trouble finding the ML estimate when it is on the boundary of the parameter space, and we therefore test the boundary values explicitly and output these if these have the highest likelihood) ```{r,eval=FALSE} df = read.table("../output/52-genetic-relatedness/vcf.relatedness",header = T) dfrab <- df[,c("a","b","rab")] distrab <- as.matrix(list2dist(dfrab)) write.table(distrab,file="../output/52-genetic-relatedness/epiMATRIX_mbd_rab.txt", col.names = F, row.names = F, sep = "\t") ``` ```{r,eval=FALSE} dfbs = read.table("../output/52-genetic-relatedness/vcf.bs.relatedness",header = T) dfrabbs <- dfbs[,c("a","b","rab")] distrabbs <- as.matrix(list2dist(dfrabbs)) write.table(distrabbs,file="../output/52-genetic-relatedness/bsMATRIX_mbd_rab.txt", col.names = F, row.names = F, sep = "\t") ```