In this notebook I assign our experimental fish, which were collected off Kodiak Island, AK, to known spawning populations. Fin clips were collected from all experimental fish (n=160, including the 4 morts) and sequenced using lcWGS. From a previous experiment (Sara Schaal and Ingrid Spies) there is lcWGS data from hundreds of fish from many different spawning locations in the northeast Pacific Ocean. I will attempt to assign my fish to these spawning locations at the finest possible resolution.


I use a combination of the AFSC lcWGS pipeline developed by Laura Timm and Sara Schaal, and the pipeline outlined in Matt DeSaix's amre-adaptation GitHub repository, which includes his program WGSassign. This approach is based on suggestions from Sara Schaal. Here's what she had to say:

I would still try with all the loci and use the low-coverage assignment method that Eric Anderson's group put out (paper here). If it is anything like GTseq, which I assume it will be, what you would have to do is build a sites file from the baseline dataset (all my samples) and only use those sites with your dataset. Otherwise there will be a ton of sites in one dataset but not the other, which will drive the PCA (hence big batch effect). Then once you only get data from those sites you would filter based on missing data. That would probably need to be played with a bit to see what levels of missing data you can get away with. I had to go fairly high with GTseq to get rid of obvious missing data driving results. A PCA is good for determining how much the missing data influences results before you do any assignments.

I think this is your best option for assignment. Low-coverage data is hard to get accurate genotype calls from so I’d be worried about biases in your called genotype data from low-coverage. I really think that WGSassign is what you need.

I’d be more than happy to try and help you implement this method. It’s something I’ve wanted to do with the baseline for a while, but just haven’t had time. I do have sites for GT-seq and I can share that, but I would still be worried about your results given called genotypes from low-coverage data. I have markers in ZP3 that I purposely designed the panel to include. Let me know what you think and maybe we could buddy up trying to get WGSassign to work for your data. I think it’ll be useful for a lot of folks and for me too, so I’d be happy to carve out time to work on this with you.

0.0.0.1 Load libraries and source scripts

# Add all required libraries that are installed with install.packages() here
list.of.packages <- c("tidyverse", "readxl", "janitor", "purrr", "ggpubr", "googlesheets4", "plotly", "ggrepel", "clipr", "knitr", "scales", "colorspace")

# Load all required libraries
all.packages <- c(list.of.packages)
lapply(all.packages, FUN = function(X) {
  do.call("require", list(X))
})

`%!in%` = Negate(`%in%`)
source("../references/biostats.R")

0.0.1 Read in lcWGS metadata

# treatment info for experimental fish 
sample.info.lcwgs <- read_excel("../../data/Pcod Temp Growth experiment 2022-23 DATA.xlsx", sheet = "AllData") %>% clean_names() %>%
  mutate_at(c("tank", "temperature"), factor) %>%
  dplyr::select(temperature, genetic_sampling_count) %>% 
  dplyr::rename(sample=genetic_sampling_count, treatment=temperature) %>% mutate(marine_region=NA) %>%
  
   # add reference fish metadata
   rbind(
    read_excel("../references/20230414_pcod_named.xlsx") %>% clean_names() %>%
      dplyr::select(ablg, location1, marine_region) %>% dplyr::rename(sample=ablg, treatment=location1)) %>%
  
  # add Ingrid's marine region 2
  left_join(
    read_delim("../analysis-20240606/wgsassign/allcod_filtered_bamslist_meta_rmasia.csv") %>% clean_names() %>%
      dplyr::select(ablg, marine_region2) %>% dplyr::rename(sample=ablg, marine_region_ingrid=marine_region2), by="sample") %>%
      mutate(marine_region=as.factor(marine_region), marine_region_ingrid=as.factor(marine_region_ingrid)) %>%
  
  #fill in marine_region2 with marine_region where empty 
  mutate(marine_region2=case_when(
    !is.na(marine_region_ingrid) ~ marine_region_ingrid,
    TRUE ~ marine_region))

0.1 Identify putative population-specific markers in reference fish

Here I prep inputs for the AFSC lcWGS pipeline developed by Laura Timm. This pipeline will get me through the alignment step.

  • Create a new analysis directory with two sub-directories: analysis-20240606/reference/ for all lcWGS data from fish of known origin (filtered to remove some fish that can cause issues, as per Sara/Ingrid), and analysis-20240606/experimental/ with lcWGS data from our experimental fish of unknown origin.
  • I will use the NCBI version of the Pacific cod genome for alignment etc., since that’s what I’m using for RNASeq and Ingrid is also using that.
  • Create a chromosomes_pcod-ncbi.txt file using grep ">" /home/lspencer/references/pcod-ncbi/GCF_031168955.1_ASM3116895v1_genomic.fa | tr -d '>' >> chromosomes_pcod-ncbi.txt
  • Generate input files: adapters.txt (got this from Laura Timm), blacklist.txt (empty for now), and lcWGS_config.txt (copied from the lcWGS repository)
  • Modify the lcWGS_config.txt file with appropriate paths.
  • rsync the experimental fish data (all .fq beginning with “GM”) to the experimental/ subdirectory (used rsync instead of mv to ensure data integrity)
  • Transfer all the reference fish lcWGS data from SednaGold using rsync:
cd /home/lspencer/pcod-lcWGS/analysis20240606/reference/

rsync --archive --progress --verbose laura.spencer@161.55.97.203:/sednagold/Gadus_macrocephalus/novaseq/*.fq.gz .
rsync --archive --progress --verbose laura.spencer@161.55.97.203:/sednagold/Gadus_macrocephalus/novaseq/20220713_secondSeqRun/*.fq.gz .
rsync --archive --progress --verbose laura.spencer@161.55.97.203:/sednagold/Gadus_macrocephalus/novaseq/20221215_seqR3/*.fastq.gz .

For each of the reference and experimental data sets, I created a file listing the data:

  • reference: pcod-reference-fastqs-filtered.txt
  • experimental: pcod-experimental-fastqs.txt

0.1.1 Run most AFSC pipeline steps on reference fish to get alignments

Initiate an interactive node on Sedna: srun --pty /bin/bash

Activate virtual environment to access MultiQC:

source /home/lspencer/venv/bin/activate

Define path variables:

scripts=/home/lspencer/lcWGS-pipeline/
inputs=/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/

Run the first four Python scripts, which generate the SLURM scripts needed for fastqc, trimming, and alignment:

${scripts}lcWGSpipeline_step0-configure.py -c ${inputs}lcWGS_config.txt
${scripts}lcWGSpipeline_step1-qc.py -p pcod-refs.ckpt
${scripts}lcWGSpipeline_step2-trim.py -p pcod-refs.ckpt
/home/lspencer/lcWGS-pipeline/lcWGSpipeline_step3-align.py -p pcod-refs.ckpt

As per the instructions, I edited the MultiQC scripts (for raw and trimmed data) using nano to use the correct path to activate my virtual environment (/home/lspencer/venv/bin/activate).

Then initiated these SLURM scripts (they ran at the same time):

sbatch scripts/GCF_031168955.1_ASM3116895v1_genomic_bwa-indexSLURM.sh
sbatch scripts/GCF_031168955.1_ASM3116895v1_genomic_faiSLURM.sh

Issue encountered due to >1k sample files

These jobs are run as job arrays (1 per .fq file), which have a maximum array size of 1,000. I have 1,310 .fq files to process, so I cannot simply run the SLURM array scripts as generated. With help from Giles, I broke the fastqcARRAY.sh script into two separate jobs:

Here is Script #1 (scripts/pcod-refs-raw_fastqcARRAY-1.sh), which runs through files 1-1000:

#!/bin/bash

#SBATCH --job-name=fqc_array_pcod-refs
#SBATCH --cpus-per-task=1
#SBATCH --output=/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/job_outfiles/pcod-refs-raw_fastqc_%A-%a.out
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=laura.spencer@noaa.gov
#SBATCH --time=0-03:00:00
#SBATCH --array=1-1000%24

module unload bio/fastqc/0.11.9
module load bio/fastqc/0.11.9

JOBS_FILE=/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/scripts/pcod-refs-raw_fqcARRAY_input.txt
IDS=$(cat ${JOBS_FILE})

for sample_line in ${IDS}
do
        job_index=$(echo ${sample_line} | awk -F ":" '{print $1}')
        fq=$(echo ${sample_line} | awk -F ":" '{print $2}')
        if [[ ${SLURM_ARRAY_TASK_ID} == ${job_index} ]]; then
                break
        fi
done

fastqc ${fq} -o /home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/fastqc/raw/

Here is Script #2 (scripts/pcod-refs-raw_fastqcARRAY-2.sh), which runs through files 1001+

#!/bin/bash

#SBATCH --job-name=fqc_array_pcod-refs
#SBATCH --cpus-per-task=1
#SBATCH --output=/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/job_outfiles/pcod-refs-raw_fastqc_%A-%a.out
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=laura.spencer@noaa.gov
#SBATCH --time=0-03:00:00
#SBATCH --array=1-310%24

module unload bio/fastqc/0.11.9
module load bio/fastqc/0.11.9

JOBS_FILE=/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/scripts/pcod-refs-raw_fqcARRAY_input.txt
IDS=$(cat ${JOBS_FILE})

POSMOD=1000
NEW_TASK_ID=$((${SLURM_ARRAY_TASK_ID} + ${POSMOD}))

for sample_line in ${IDS}
do
        job_index=$(echo ${sample_line} | awk -F ":" '{print $1}')
        fq=$(echo ${sample_line} | awk -F ":" '{print $2}')
        if [[ ${NEW_TASK_ID} == ${job_index} ]]; then
                break
        fi
done

fastqc ${fq} -o /home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/fastqc/raw/

When the fastqc SLURM jobs were done, I edited the multiqc script to remove the cpus option (deleted #SBATCH --cpus-per-task=1) and replaced it with a directive specifying a himem node (#SBATCH -p himem). I then ran sbatch scripts/pcod-refs-raw_multiqcSLURM.sh; after it finished I appended “_raw” to the output names and transferred the multiqc files to my local computer to view: “pcod-juveniles-2023/references/multiqc_report_raw.html”.

I then edited the trim script to increase the time limit (it was only 12 hr before) and ran it:

sbatch scripts/pcod-refs_trimARRAY.sh

I then split the pcod-refs-trim_fastqcARRAY.sh script into two, as I did before with the raw data, since the array size exceeded 1,000. I ran those, and when they finished I ran the MultiQC script.

sbatch scripts/pcod-refs-trim_fastqcARRAY-1.sh  
sbatch scripts/pcod-refs-trim_fastqcARRAY-2.sh  
sbatch scripts/pcod-refs-trim_multiqcSLURM.sh  

Transferred the multiqc html file to my local computer, renamed to “_trimmed”. A handful of samples have low counts, but the adapter trimming looks good!

Ran the alignment scripts one after the other (I had to modify the depths script to include the full path for the Python script /home/lspencer/lcWGS-pipeline/mean_cov_ind.py):

sbatch pcod-refs_alignARRAY.sh
sbatch scripts/pcod-refs_depthsARRAY.sh

0.1.2 Examine number of reads and alignment depths to identify fish for “blacklist”

Here I create a barplot with the # of unique reads for each sample.

#ggplotly(
read.delim("../analysis-20240606/reference/multiqc_data_trimmed/mqc_fastqc_sequence_counts_plot_1.txt") %>% clean_names() %>%
    mutate(sample=gsub("_trimmed_clipped|_paired", "", sample)) %>%
  arrange(unique_reads) %>% mutate(sample=factor(sample, ordered=TRUE)) %>%
  ggplot() + geom_bar(aes(x=fct_reorder(sample, as.numeric(unique_reads)), y=unique_reads), stat = "identity") + ggtitle("Unique reads by sample") +
  theme(axis.text.x = element_text(angle=90), axis.ticks.x = element_blank())#) 

Potential blacklist members with <1M unique reads (although per the AFSC pipeline it looks like I should determine that after alignment): ABLG1972, ABLG10851, ABLG2136, ABLG2223, ABLG2927, ABLG2571, ABLG2210, ABLG2357, ABLG2509, ABLG2506, ABLG1948

Now look at depths data after alignment. The lcWGS pipeline states that samples with mean depth much <1 should be blacklisted. 22 samples have depth <0.75, which I added to the blacklist file, “blacklist.txt”.

#ggplotly(
  read.delim("../analysis-20240606/reference/pcod-refs_depths.csv", header = F, col.names = c("sample", "depth")) %>%
  arrange(depth) %>% mutate(sample=factor(sample, ordered=TRUE)) %>%
  ggplot() + geom_bar(aes(x=fct_reorder(sample, as.numeric(depth)), y=depth), stat = "identity") + ggtitle("Mean depth, reference fish") +
  scale_y_continuous(limits=c(0, 8), breaks = c(0,1,2,3,4,5,6,7,8)) + ylab("Mean depth") + xlab("Sample") +
  theme(axis.text.x = element_blank(), axis.ticks.x = element_blank())#) 

read.delim("../analysis-20240606/reference/pcod-refs_depths.csv", header = F, col.names = c("sample", "depth")) %>%
  arrange(depth) %>%
  filter(depth < 0.75) %>%
  dplyr::select(sample) %>%
  write_delim("../analysis-20240606/reference/pcod-refs_blacklist.txt", col_names = F, delim = "\t")

0.1.2.1 Here’s how many blacklisted reference samples I have from each population (mean depth <0.75)

read.delim("../analysis-20240606/reference/pcod-refs_depths.csv", header = F, col.names = c("sample", "depth")) %>%
  arrange(depth) %>%
  filter(depth < 0.75) %>% mutate(sample=as.numeric(gsub("ABLG", "", sample))) %>% 
  left_join(sample.info.lcwgs) %>% 
  dplyr::rename("population"="treatment") %>%
  group_by(population) %>% summarize(n=n()) %>% arrange(desc(n)) %>% 
  kable()
population n
Shumagins 5
WestKodiak 4
Adak 4
NearIslands 3
Kodiak 2
HecateStrait 1
Pervenets 1
Russia 1
Pribilof 1
paste("Mean depth across all fish after removing blacklisted ones: ",
  round((read.delim("../analysis-20240606/reference/pcod-refs_depths.csv", header = F, col.names = c("sample", "depth")) %>%
  arrange(depth) %>% filter(depth > 0.75) %>% summarize(mean=mean(depth)))$mean, digits = 2), "%", sep="")
## [1] "Mean depth across all fish after removing blacklisted ones: 3.21%"

0.1.2.2 Here is a tally of reference fish per spawning location & marine region

marine_region2 is a combination of marine_region and marine_region_ingrid (Ingrid’s region assignments take precedence where available)

ref.bams <- read_delim("../analysis-20240606/reference/pcod-refs_filtered_bamslist.txt", delim = "\t", col_names = "file") %>%
  mutate(sample=as.numeric(gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/bamtools/pcod-refs_ABLG|_sorted_dedup_clipped.bam", "", file))) %>%
  left_join(sample.info.lcwgs)

ref.bams %>% group_by(treatment,marine_region2) %>% tally() %>% arrange(treatment) %>% kable()
treatment marine_region2 n
HecateStrait eGOA 47
NearIslands Aleutians 32
NearIslands NBS 1
TanagaIsland Aleutians 41
TanagaIsland eGOA 1
Unimak wGOA 41
Kodiak wGOA 46
Shumagins wGOA 19
AmchitkaPass Aleutians 39
AmchitkaPass NBS 2
Pervenets EBS 32
Pervenets NBS 6
CookInlet wGOA 25
PWS eGOA 1
PWS wGOA 33
AK_Knight-tagged EBS 8
AK_Knight-tagged NBS 12
Vesteraalen-tagged EBS 2
Vesteraalen-tagged NBS 7
LynnCanal eGOA 47
Russia NBS 12
WestKodiak wGOA 44
Korea Eastern_Pacific 13
Zhemchug EBS 32
Zhemchug NBS 2
Pribilof EBS 33
Pribilof NBS 4
Japan Eastern_Pacific 10
Adak Aleutians 41
ref.bams %>% group_by(marine_region2) %>% tally() %>% kable()
marine_region2 n
Aleutians 153
EBS 107
eGOA 96
NBS 46
wGOA 208
Eastern_Pacific 23

0.1.3 Generate genotype likelihoods for reference fish

Here I use ANGSD, following the amre-adaptation/WGSassign settings, to generate genotype likelihoods for my reference fish. To do so, I ran /home/lspencer/lcWGS-pipeline/lcWGSpipeline_step4-data.py -p pcod-refs.ckpt -b blacklist.txt, which generated two SLURM scripts that run ANGSD to generate genotype likelihood data (.beagle files). I edited and renamed one of the scripts (scripts/pcod-refs_polymorphicARRAY.sh) to use ANGSD filtering thresholds as per the WGSassign paper/pipeline. I ran that script via sbatch scripts/pcod-refs_wgassign-polymorphicARRAY.sh, then concatenated GLs from each separate chromosome into one whole-genome beagle file with the script scripts/pcod-refs_concatenate_beagles_corrected.sh. NOTE: I had to manually correct that script to not include extraneous chromosome header information.
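For orientation, here is a minimal sketch of the kind of per-chromosome ANGSD call the script runs. The filter values shown are placeholders in the spirit of the WGSassign pipeline, not necessarily the exact thresholds in my script:

# Sketch only: per-chromosome ANGSD genotype-likelihood call. Filter
# thresholds are placeholders; see scripts/pcod-refs_wgassign-polymorphicARRAY.sh
# for the values actually used.
chrom=NC_082382.1   # example chromosome from chromosomes_pcod-ncbi.txt
angsd -b pcod-refs_filtered_bamslist.txt \
      -ref /home/lspencer/references/pcod-ncbi/GCF_031168955.1_ASM3116895v1_genomic.fa \
      -r ${chrom}: \
      -GL 1 -doGlf 2 -doMajorMinor 1 -doMaf 1 \
      -SNP_pval 1e-6 -minMapQ 20 -minQ 20 -minMaf 0.05 \
      -nThreads 10 \
      -out gls_wgassign/pcod-refs_${chrom}_wgassign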

0.1.3.1 The resulting beagle file containing whole-genome genotype likelihoods was: reference/gls_wgassign/pcod-refs_wholegenome_wgassign.beagle

0.1.4 Check reference SNPs for linkage disequilibrium

I then completely abandoned the lcWGS pipeline and followed the protocol outlined in mgdesaix/amre-adaptation to check my list of reference-fish SNPs for linkage disequilibrium using the program ngsLD. Linked sites are typically pruned (removed) since their presence can bias results. As per the protocol, I referenced this tutorial to prepare my input files for ngsLD (note: ngsLD is already on Sedna, which is handy).

0.1.4.1 Prep files for ngsLD program

Prep -geno file: remove the first three columns (i.e. positions, major allele, minor allele) and header row from whole genome beagle file

zcat gls_wgassign/pcod-refs_wholegenome_wgassign.beagle.gz | cut -f 4- | tail -n +2 | gzip > gls_wgassign/ngsLD/pcod-refs_wholegenome_wgassign_4ngsLD.beagle.gz

Prep -pos file: input file with site coordinates (one per line), where the 1st column stands for the chromosome/contig and the 2nd for the position (bp). One convenient way to generate this is by selecting the first two columns of the mafs file outputted by ANGSD, with the header removed.

zcat gls_wgassign/pcod-refs_wholegenome_wgassign.mafs.gz | cut -f 1,2 | sed 's/:/_/g'| gzip > gls_wgassign/ngsLD/pcod-refs_wholegenome_wgassign_4ngsLD.sites.gz

With the slurm script “pcod-refs_ngsLD.sh” I ran the ngsLD program. ngsLD outputs a TSV file with LD results for all pairs of sites for which LD was calculated: the first two columns are the positions of the SNPs, the third column is the distance (in bp) between them, and the following 4 columns are the various measures of LD calculated. The amre-adaptation protocol prunes/removes correlated pairs (r² ≥ 0.5 in the 7th column of the ngsLD output) with the program prune_graph, which Giles installed as a module on Sedna for me! I pruned linked SNPs using the script reference/scripts/pcod-refs_get-LDpruned.sh.
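For reference, the core commands inside those two scripts amount to something like this sketch (the thread count, individual count, and distance cutoff below are placeholders; check the actual scripts for the values used):

# Sketch of pcod-refs_ngsLD.sh (placeholder settings)
N_IND=591   # placeholder: number of individuals in the reference beagle
N_SITES=$(zcat pcod-refs_wholegenome_wgassign_4ngsLD.sites.gz | wc -l)
ngsLD --geno pcod-refs_wholegenome_wgassign_4ngsLD.beagle.gz \
      --pos pcod-refs_wholegenome_wgassign_4ngsLD.sites.gz \
      --probs --n_ind ${N_IND} --n_sites ${N_SITES} \
      --max_kb_dist 50 --n_threads 8 \
      --out pcod-refs_wholegenome.ngsLD

# Sketch of pcod-refs_get-LDpruned.sh: drop one SNP from each pair with r2 >= 0.5
prune_graph --in pcod-refs_wholegenome.ngsLD \
            --weight-field column_7 --weight-filter "column_7 >= 0.5" \
            --out pcod-refs_wholegenome_unlinked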

(base) [lspencer@sedna ngsLD]$ cat ../../job_outfiles/LD-prune.txt
[2024-07-22 12:10:22] INFO: Creating graph...
[2024-07-22 12:10:22] INFO: Reading from input file...
[2024-07-22 12:10:57] INFO: Input file has 431683 nodes with 20544101 edges (334507 edges with column_7 >= 0.5)
[2024-07-22 12:10:57] INFO: Pruning heaviest position...
[2024-07-22 12:12:06] INFO: Pruned 683 nodes in 68s (9.91 nodes/s); 431000 nodes remaining with 310916 edges.
...
[2024-07-22 14:42:15] INFO: Pruned 1000 nodes in 45s (22.22 nodes/s); 317000 nodes remaining with 339 edges.
[2024-07-22 14:42:29] INFO: Pruning complete! Final graph has 316661 nodes with 0 edges
[2024-07-22 14:42:29] INFO: Saving remaining nodes...
[2024-07-22 14:42:30] INFO: Total runtime: 152.13 mins

zcat pcod-refs_wholegenome_wgassign_4ngsLD.sites.gz | wc -l
431691

cat pcod-refs_wholegenome_unlinked | wc -l
316661

# Create sites file and index
awk -F":" '{print $1, $2}' pcod-refs_wholegenome_unlinked > pcod-refs_wholegenome_unlinked.sites
angsd sites index pcod-refs_wholegenome_unlinked.sites

0.1.4.2 Linkage disequilibrium pruning reduced the number of polymorphic sites in my reference fish from 431,691 to 316,661 sites. Cool!

We will use this set of reference fish SNPs (reference/gls_wgassign/ngsLD/pcod-refs_wholegenome_unlinked.sites) later when we run ANGSD on our experimental fish, so that we only analyze/retain the same SNP set!

Finally, I create a filtered beagle file with genotype likelihoods for the final 316,661 polymorphic, unlinked sites. First, generated a file with one column containing “chromosome_location” for all unlinked sites (using sed to replace the space between chrom and loc in the .sites file with a “_” and saving to a temp file), then I used the beagle filtering code:

cat ../reference/gls_wgassign/ngsLD/pcod-refs_wholegenome_unlinked.sites | sed 's/ /_/g' > ../reference/gls_wgassign/ngsLD/pcod-refs_wholegenome_unlinked.sites.temp
ld_snps=../reference/gls_wgassign/ngsLD/pcod-refs_wholegenome_unlinked.sites.temp
input=../reference/gls_wgassign/pcod-refs_wholegenome_wgassign
output=../reference/gls_wgassign/pcod-refs_wholegenome_wgassign_unlinked
zcat ${input}.beagle.gz | head -1 > ${output}.beagle
awk 'NR==FNR{c[$1]++;next};c[$1]' ${ld_snps} <(zcat ${input}.beagle.gz) >> ${output}.beagle
gzip ${output}.beagle
rm ../reference/gls_wgassign/ngsLD/pcod-refs_wholegenome_unlinked.sites.temp

I checked how many loci are in the reference fish filtered SNP set and the experimental fish SNP set:

gzip -cd ../reference/gls_wgassign/pcod-refs_wholegenome_wgassign_unlinked.beagle.gz | wc -l  #= 316,661
cat ../reference/gls_wgassign/ngsLD/pcod-refs_wholegenome_unlinked.sites | wc -l #= 316,661

Same number! To double-check that I have the same loci:

diff <(zcat ../reference/gls_wgassign/pcod-refs_wholegenome_wgassign_unlinked.beagle.gz | cut -f1 | sort) <(cat ../reference/gls_wgassign/ngsLD/pcod-refs_wholegenome_unlinked.sites.temp | cut -f1 | sort)

0.1.4.3 Final file with genotype likelihoods for polymorphic, unlinked sites in reference populations: ../reference/gls_wgassign/pcod-refs_wholegenome_wgassign_unlinked.beagle.gz.

This could be used for any project needing to assign Pacific cod to a spawning population using the amre-adaptation/WGSassign pipeline. NEATO!

0.2 Explore population structure in reference fish

I used pcangsd to generate a covariance matrix from genome-wide genotype likelihoods with the script reference/scripts/pcod-refs_wholegenome_pcangsd.sh.
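That script boils down to a single pcangsd call; here is a minimal sketch, assuming the modern pcangsd command-line interface (older versions are invoked as python pcangsd.py -beagle ... instead):

# Sketch of pcod-refs_wholegenome_pcangsd.sh (CLI style varies by pcangsd version)
pcangsd --beagle gls_wgassign/pcod-refs_wholegenome_wgassign_unlinked.beagle.gz \
        --out pcod-refs_wholegenome-pca \
        --threads 8
# The resulting .cov covariance matrix is what gets read into R for the PCA.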

Before using WGSassign to identify population-of-origin for my experimental fish, I want to explore my reference fish population structure. I will do this using PCA.

Importance of components: PC1 + PC2 explain ~5.59% of the variance in the reference fish.

Create filtered set of reference fish that do not include Japan or Korea

pops.refs.filt <- paste("ABLG", (pc.scores.refs %>% filter(sample.id!=121, treatment %!in% c("Japan", "Korea")))$sample.id, sep="")

0.3 Get genotype likelihoods for experimental fish

Now I need to re-run the AFSC lcWGS pipeline and ANGSD on my experimental fish to get genotype likelihoods, and filter them to only include unlinked sites found in the reference fish.

0.3.1 Run lcWGS pipeline on Experimental fish

Initiate an interactive node on Sedna: srun --pty /bin/bash

Activate virtual environment to access MultiQC:

source /home/lspencer/venv/bin/activate

Define path variables:

scripts=/home/lspencer/lcWGS-pipeline/
inputs=/home/lspencer/pcod-lcwgs-2023/analysis-20240606/experimental/

Run the first four Python scripts, which generate the SLURM scripts needed for fastqc, trimming, and alignment:

${scripts}lcWGSpipeline_step0-configure.py -c ${inputs}lcWGS_config.txt
${scripts}lcWGSpipeline_step1-qc.py -p pcod-exp.ckpt
${scripts}lcWGSpipeline_step2-trim.py -p pcod-exp.ckpt
/home/lspencer/lcWGS-pipeline/lcWGSpipeline_step3-align.py -p pcod-exp.ckpt

As per the instructions, I edited the MultiQC scripts (for raw and trimmed data) using nano to use the correct path to activate my virtual environment (/home/lspencer/venv/bin/activate).

Then initiated these SLURM scripts (they ran at the same time):

sbatch scripts/GCF_031168955.1_ASM3116895v1_genomic_bwa-indexSLURM.sh
sbatch scripts/GCF_031168955.1_ASM3116895v1_genomic_faiSLURM.sh
sbatch scripts/pcod-lcWGS-raw_fastqcARRAY.sh

When the SLURM jobs were done, I ran the multiqc script (sbatch scripts/pcod-exp-raw_multiqcSLURM.sh). After it finished I added “_raw” to the file/directory names, then transferred the multiqc files to my local computer to view.

I then ran these scripts, one after the other, to trim and then look at the trimmed data:

sbatch scripts/pcod-exp_trimARRAY.sh
sbatch scripts/pcod-exp-trim_fastqcARRAY.sh
sbatch scripts/pcod-exp-trim_multiqcSLURM.sh

I added “_trimmed” to the file/directory names, then transferred the multiqc html file to my local computer. Looks good!

Ran the alignment scripts one after the other (I had to modify the depths script to include the full path for the Python script /home/lspencer/lcWGS-pipeline/mean_cov_ind.py):

sbatch pcod-exp_alignARRAY.sh
sbatch scripts/pcod-exp_depthsARRAY.sh

Experimental fish: alignment depths, and any bad samples to add to the “blacklist”?

I rsynced the pcod-exp_depths.csv file from sedna to local, read it into RStudio, and plotted depths by treatment. Mean depth across all samples is ~3x, which is great. No samples fall below the 1x depth threshold. There is a weird treatment difference, particularly in the 5C treatment. I asked Sam to contact the sequencing facility to see why that might be.

depths.exp <- 
  read_delim(file = "../analysis-20240606/experimental/pcod-exp_depths.csv", # analysis with just experimental fish 
#  read_delim(file = "../analysis-20230922/pcod-lcWGS_depths.csv",  # analysis also with reference fish of known origin
             delim = "\t",  col_names = c("sample", "ave_depth")) %>%
  mutate(sample=gsub("GM", "", sample)) %>%
  mutate(sample=as.numeric(sample))

depths.exp %>%
  left_join(sample.info.lcwgs) %>%
  ggplot(aes(x=reorder(sample, ave_depth), y=ave_depth, fill=treatment)) + 
  geom_bar(stat="identity") + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle=90, vjust=0.5, size=8.5)) +
  ggtitle("lcWGS average read depth") +
  scale_fill_manual(values=c(`0`="royalblue1",`5`="darkgreen", `9`="yellow3",`16`="firebrick4")) #comment out when including reference fish

depths.exp %>%
  left_join(sample.info.lcwgs) %>%
  ggplot(aes(x=treatment, y=ave_depth, fill=treatment)) + 
  geom_boxplot() + 
  theme_minimal() +
  theme(axis.text.x = element_text(angle=90, vjust=0.5, size=8.5)) +
  ggtitle("lcWGS average read depth") +
  scale_fill_manual(values=c(`0`="royalblue1",`5`="darkgreen", `9`="yellow3",`16`="firebrick4")) #comment out when including reference fish

paste("Mean depth, experimental fish lcWGS data ", round(mean(depths.exp$ave_depth), digits=2), "x", sep="") #mean depth = 3x
## [1] "Mean depth, experimental fish lcWGS data 3.02x"
paste("Number of experimental samples", nrow(depths.exp), sep=" ") #number samples = 157
## [1] "Number of experimental samples 157"

0.3.2 Run ANGSD on experimental fish to get polymorphic GLs

Now, I am departing from the AFSC lcWGS pipeline.

I customized the ANGSD script to call and filter variants (polymorphic SNPs) only for sites identified as polymorphic/unlinked in the reference fish (using the -sites <file> angsd option), i.e., those that remained after linkage disequilibrium pruning. These are listed in the file /reference/gls_wgassign/ngsLD/pcod-refs_wholegenome_unlinked.sites, which has the format: chr pos. NOTE: this sites file does NOT contain major and minor alleles (i.e. it is not an “augmented” sites file with 4 columns), but that’s okay since we use the option -doMajorMinor 1, which infers major/minor alleles from genotype likelihoods. My custom ANGSD script is /experimental/scripts/pcod-exp_wgassign-polymorphicARRAY.sh.
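A minimal sketch of what that array script does per chromosome (filter values are placeholders; the key piece is the -sites restriction combined with -doMajorMinor 1):

# Sketch only: restrict the experimental-fish ANGSD run to the reference
# fish's unlinked sites. The sites file must be indexed first (done once).
angsd sites index pcod-refs_wholegenome_unlinked.sites
angsd -b pcod-exp_filtered_bamslist.txt \
      -ref /home/lspencer/references/pcod-ncbi/GCF_031168955.1_ASM3116895v1_genomic.fa \
      -r NC_082382.1: \
      -sites pcod-refs_wholegenome_unlinked.sites \
      -GL 1 -doGlf 2 -doMajorMinor 1 -doMaf 1 \
      -minMapQ 20 -minQ 20 \
      -out gls_wgassign/pcod-exp_NC_082382.1_wgassign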

After the custom angsd script ran, I concatenated the beagle.gz and mafs.gz files for separate chromosomes into wholegenome files (as per the AFSC lcWGS pipeline) using the scripts scripts/pcod-exp_concatenate_beagles.sh and scripts/pcod-exp_concatenate_mafs.sh.

Resulting genotype likelihoods file for experimental fish: experimental/gls_wgassign/pcod-exp_wholegenome_wgassign.beagle.gz

How many sites do I have for experimental fish?

cat gls_wgassign/pcod-exp_wholegenome_wgassign.sites | wc -l
232,629

0.3.3 Explore possible population structure in experimental fish

Use pcangsd to generate a covariance matrix from genome-wide genotype likelihoods with the script experimental/scripts/pcod-exp_wholegenome_pcangsd.sh.

sample.order.exp <- (read_delim(
  file="../analysis-20240606/experimental/pcod-exp_filtered_bamslist.txt", delim = "/t", col_names = "sample") %>%
    mutate(sample=gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/experimental/bamtools/pcod-exp_|_sorted_dedup_clipped.bam",
                       "", sample)))$sample
genome.cov.exp <- read_delim(file="../analysis-20240606/experimental/pcod-exp_wholegenome-wgassign.cov", col_names = sample.order.exp) %>% 
  #    dplyr::select(-outlier) %>% 
  as.matrix() %>%
  `rownames<-`(sample.order.exp)

# Get sample IDs for populations we'd like to include in PCAs
pops.exp <- (sample.info.lcwgs %>%
                filter(sample %in% gsub("GM", "", sample.order.exp)) %>%
                mutate(sample=paste("GM", sample, sep="")) %>% 
                  filter(sample %!in% c("GM121", "GM1", "GM2")) %>% #remove outlier sample(s)
        filter(treatment %in% c("0", "5", "9", "16")))$sample

pca.exp <- prcomp(genome.cov.exp[,pops.exp], scale=F) #scale=F for variance-covariance matrix
#pca.eigenval(pca.princomp) #The Proportion of Variance = %variance 
pc.percent.exp <- pca.eigenval(pca.exp)[2,1:6]*100
## Importance of components:
screeplot(pca.exp, bstick=FALSE) 

pc.percent.exp[1:2] %>% sum()
## [1] 2.12649
#### Generate dataframe with prcomp results 
pc.scores.exp <- data.frame(sample.id = colnames(genome.cov.exp[,pops.exp]),
  PC1 = pca.exp$rotation[,1],    # the first eigenvector
  PC2 = pca.exp$rotation[,2],    # the second eigenvector
  PC3 = pca.exp$rotation[,3],    # the third eigenvector
  PC4 = pca.exp$rotation[,4],    # the fourth eigenvector
  PC5 = pca.exp$rotation[,5],    # the fifth eigenvector
  PC6 = pca.exp$rotation[,6],    # the sixth eigenvector
  stringsAsFactors = FALSE)
#shapiro.test(pca.princomp$x) #sample size too large for shapiro test which is weird 
#hist(pca.princomp$x) #normal? hard to say, maybe
pc.scores.exp <- left_join(pc.scores.exp %>% mutate(sample.id=as.numeric(sub("GM|ABLG", "", sample.id))), 
                          sample.info.lcwgs[c("treatment", "sample")], by=c("sample.id"="sample")) %>% droplevels() 
# axes <- data.frame("pc.x"=c(1,1,1,1,1,2,2,2,2,3,3,3,4,4,5), 
#                    "pc.y"=c(2,3,4,5,6,3,4,5,6,4,5,6,5,6,6)) %>%
axes <- data.frame("pc.x"=c(1,1,2), 
                   "pc.y"=c(2,3,3)) %>%
  mutate(pc.x=paste("PC", pc.x, sep=""), pc.y=paste("PC", pc.y, sep=""))

variance.exp <- pc.percent.exp %>% as.data.frame() %>% set_names("variance") %>% rownames_to_column("axis") #%>% 
  #mutate(axis=as.numeric(gsub("PC", "", axis))) 

# pcas.genome.exp <-list(1:nrow(axes))
# for (i in 1:nrow(axes)){
#   pcas.genome.exp[[i]] <- 
#     ggplot(pc.scores.exp, aes(col=treatment, text=sample.id)) +
#                geom_point(aes_string(x=axes[i,"pc.x"], y=axes[i,"pc.y"]), size=3, alpha=0.85) +
#     theme_minimal() + ggtitle("Global gene expression PC1xPC2") + 
#     ylab(paste(axes[i, "pc.y"], " (", round(variance.exp[variance.exp$axis==axes[i, "pc.y"], "variance"], digits = 2), "%)", sep="")) + 
#     xlab(paste(axes[i, "pc.x"], " (", round(variance.exp[variance.exp$axis==axes[i, "pc.x"], "variance"], digits = 2), "%)", sep="")) + 
#     theme(legend.position = "right", legend.text=element_text(size=8), legend.title=element_text(size=9)) + 
#       ggtitle(paste(axes[i, "pc.y"], "x", axes[i, "pc.x"], sep=" "))
# }
# for (plot in pcas.genome.exp) {
#   print(ggplotly(plot, tooltip = "sample.id"))
# }

ggplotly(ggplot(pc.scores.exp, aes(col=treatment, text=sample.id)) +
               geom_point(aes_string(x=axes[1,"pc.x"], y=axes[1,"pc.y"]), size=3, alpha=0.85) +
    theme_minimal() + ggtitle("Global gene expression PC1xPC2") + 
    ylab(paste(axes[1, "pc.y"], " (", round(variance.exp[variance.exp$axis==axes[1, "pc.y"], "variance"], digits = 2), "%)", sep="")) + 
    xlab(paste(axes[1, "pc.x"], " (", round(variance.exp[variance.exp$axis==axes[1, "pc.x"], "variance"], digits = 2), "%)", sep="")) + 
    theme(legend.position = "right", legend.text=element_text(size=8), legend.title=element_text(size=9)) + 
      ggtitle(paste(axes[1, "pc.y"], "x", axes[1, "pc.x"], sep=" ")), tooltip = "sample.id")
ggplotly(ggplot(pc.scores.exp, aes(col=treatment, text=sample.id)) +
               geom_point(aes_string(x=axes[2,"pc.x"], y=axes[2,"pc.y"]), size=3, alpha=0.85) +
    theme_minimal() + ggtitle("Global gene expression PC1xPC3") + 
    ylab(paste(axes[2, "pc.y"], " (", round(variance.exp[variance.exp$axis==axes[2, "pc.y"], "variance"], digits = 2), "%)", sep="")) + 
    xlab(paste(axes[2, "pc.x"], " (", round(variance.exp[variance.exp$axis==axes[2, "pc.x"], "variance"], digits = 2), "%)", sep="")) + 
    theme(legend.position = "right", legend.text=element_text(size=8), legend.title=element_text(size=9)) + 
      ggtitle(paste(axes[2, "pc.y"], "x", axes[2, "pc.x"], sep=" ")), tooltip = "sample.id")
ggplotly(ggplot(pc.scores.exp, aes(col=treatment, text=sample.id)) +
               geom_point(aes_string(x=axes[3,"pc.x"], y=axes[3,"pc.y"]), size=3, alpha=0.85) +
    theme_minimal() + ggtitle("Global gene expression PC2xPC3") + 
    ylab(paste(axes[3, "pc.y"], " (", round(variance.exp[variance.exp$axis==axes[3, "pc.y"], "variance"], digits = 2), "%)", sep="")) + 
    xlab(paste(axes[3, "pc.x"], " (", round(variance.exp[variance.exp$axis==axes[3, "pc.x"], "variance"], digits = 2), "%)", sep="")) + 
    theme(legend.position = "right", legend.text=element_text(size=8), legend.title=element_text(size=9)) + 
      ggtitle(paste(axes[3, "pc.y"], "x", axes[3, "pc.x"], sep=" ")), tooltip = "sample.id")

0.4 Explore genetic overlap among experimental and reference fish

I want to use PCA to look for overlaps among our reference fish and experimental fish. So, I need to merge the two beagle files.

0.4.1 Merge separate reference and experimental beagle files into one

This is easier said than done! I originally developed R code to do this (located in the #Boneyard section of this notebook), but it took a long time, and Ingrid had trouble using it on her beagle files, which had many more sites. SO, I created a bash slurm script to do it instead!

The script is called join-beagles.sh, and its inputs are:

  • 2 beagle files that you want to merge, e.g. beagle1 from reference fish and beagle2 from experimental fish
  • the 2 bamslist.txt files that were used to create those beagles
  • the sample prefixes found in each file listed in the bamslist.txt files (e.g. “ABLG”)

The script first renames the beagle columns with sample IDs (based on the bamslist.txt files) plus the suffixes “_AA” (homozygous for the major allele), “_AB” (heterozygous), and “_BB” (homozygous for the minor allele). Below is data at 4 sites for 2 samples.

marker              allele1 allele2     ABLG10408_AA    ABLG10408_AB    ABLG10408_BB    ABLG10409_AA    ABLG10409_AB    ABLG10409_BB
NC_082382.1_2119        2       0       0.333333        0.333333        0.333333        0.333333        0.333333        0.333333
NC_082382.1_8587        2       0       0.000000        0.999468        0.000532        0.666622        0.333333        0.000044
NC_082382.1_9999        0       2       0.666622        0.333333        0.000044        0.666622        0.333333        0.000044
NC_082382.1_10949       1       0       0.333333        0.333333        0.333333        0.666622        0.333333        0.000044

Then the script filters both beagles for common markers, sorts by marker name, and left-joins them, keeping only sites with identical marker names and major/minor alleles. Some major/minor alleles don’t match: most are simply swapped, and a few are just different. For markers with swapped alleles, the GL columns in the second beagle (beagle2) are “corrected” by swapping the column order for AA and BB (scary! But I’m fairly confident it worked correctly). The corrected/swapped markers are then added back to the left-joined file.
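To make the swap step concrete, here is a minimal awk sketch (an illustration under the column layout described above, not the actual join-beagles.sh code; file names are hypothetical) that flips the allele codes and the per-sample AA/BB columns for markers whose major/minor alleles are reversed relative to the other beagle:

# Illustration only: after the first three columns (marker, allele1, allele2),
# GLs come in triplets of AA, AB, BB per sample, so a major/minor swap means
# exchanging AA <-> BB within each sample's triplet.
zcat beagle2-swapped-markers.beagle.gz | awk 'BEGIN{OFS="\t"} {
  tmp = $2; $2 = $3; $3 = tmp                 # swap allele codes
  for (i = 4; i <= NF; i += 3) {              # walk the per-sample triplets
    tmp = $i; $i = $(i+2); $(i+2) = tmp       # swap AA and BB
  }
  print
}' | gzip > beagle2-swapped-markers.corrected.beagle.gz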

I used this script to merge my reference and experimental beagle files into one:

  • Experimental beagle file: experimental/gls_wgassign/pcod-exp_wholegenome_wgassign.beagle.gz
  • Reference beagle file: reference/gls_wgassign/pcod-refs_wholegenome_wgassign_unlinked.beagle.gz
  • Resulting merged beagle file: wgsassign/refs-exp-merged.beagle.gz

Number of sites in merged beagle file: 232,583

I used PCAngsd to generate a covariance matrix from the merged beagle file with the script wgsassign/angsd/pcod-refs-and-exp_wholegenome_pcangsd.sh.

0.4.2 PCA with combined reference and experimental fish

Here I perform a PCA on the covariance matrix generated using PCAngsd with both reference and experimental fish at overlapping markers (~230k).

#genome.cov.all <- read_delim(file="../analysis-20240606/pcod-refs-and-exp_wholegenome-pca.cov") %>%  #? 

# merged beagle 
genome.cov.all.order <- (read_delim("../analysis-20240606/wgsassign/refs-exp-merged_samples-order.txt", delim = "\t", col_names = "sample"))$sample #sample order in covariance matrix 
genome.cov.all <- read_delim(file="../analysis-20240606/wgsassign/refs-exp-merged_pca.cov", 
                             col_names = genome.cov.all.order) %>%  
  #    dplyr::select(-outlier) %>% 
  as.matrix() %>%
  `rownames<-`(genome.cov.all.order)

# Get sample IDs for populations we'd like to include in PCAs
pops.all <- (sample.info.lcwgs %>%
                filter(sample %in% gsub("GM|ABLG", "", c(sample.order.exp, sample.order.refs))) %>%
               mutate(sample=case_when(
                 (treatment=="0" | treatment=="5" | treatment=="9" | treatment=="16") ~ paste("GM", sample, sep=""),
                 TRUE ~ paste("ABLG", sample, sep=""))) %>%
               #remove outlier reference fish
               filter(treatment %!in% c("Japan", "Korea")) %>%  #"AK_Knight-tagged","Vesteraalen-tagged", 
               filter(sample %!in% c("GM1", "GM2", "GM121")))$sample

pca.all <- prcomp(genome.cov.all[,pops.all], scale=F) #scale=F for variance-covariance matrix
#pca.eigenval(pca.princomp) #The Proportion of Variance = %variance 
pc.percent.all <- pca.eigenval(pca.all)[2,1:6]*100
## Importance of components:
screeplot(pca.all, bstick=FALSE) 

pc.percent.all[1:2] %>% sum()
## [1] 7.05458
#### Generate dataframe with prcomp results 
pc.scores.all <- data.frame(sample.id = colnames(genome.cov.all[,pops.all]),
  PC1 = pca.all$rotation[,1],    # the first eigenvector
  PC2 = pca.all$rotation[,2],    # the second eigenvector
  PC3 = pca.all$rotation[,3],    # the third eigenvector
  PC4 = pca.all$rotation[,4],    # the fourth eigenvector
  PC5 = pca.all$rotation[,5],    # the fifth eigenvector
  PC6 = pca.all$rotation[,6],    # the sixth eigenvector
  stringsAsFactors = FALSE)
#shapiro.test(pca.princomp$x) #sample size too large for shapiro test which is weird 
#hist(pca.princomp$x) #normal? hard to say, maybe
pc.scores.all <- left_join(pc.scores.all %>% mutate(sample.id=as.numeric(sub("GM|ABLG", "", sample.id))), 
                          sample.info.lcwgs[c("treatment", "sample")], by=c("sample.id"="sample")) %>% droplevels() %>%
  mutate(mort=case_when(sample.id %in% c(157, 158, 159, 160) ~ "mort", TRUE~"survived")) %>% 
  mutate(sample.id=as.character(sample.id)) #%>%
#  left_join(haplos.allzp3[c("sample", "haplo.lcwgs", "haplo.exons.lcwgs")], by = c("sample.id"="sample"))

# axes <- data.frame("pc.x"=c(1,1,1,1,1,2,2,2,2,3,3,3,4,4,5), 
#                    "pc.y"=c(2,3,4,5,6,3,4,5,6,4,5,6,5,6,6)) %>%
axes <- data.frame("pc.x"=c(1,1,2), 
                   "pc.y"=c(2,3,3)) %>%
  mutate(pc.x=paste("PC", pc.x, sep=""), pc.y=paste("PC", pc.y, sep=""))

variance.all <- pc.percent.all %>% as.data.frame() %>% set_names("variance") %>% rownames_to_column("axis") #%>% 
#  mutate(axis=as.numeric(gsub("PC", "", axis))) 

# PLOTS
# pcas.genome.all <- list(1:nrow(axes))
# for (i in 1:nrow(axes)){
#   pcas.genome.all[[i]] <- 
#       ggplot(pc.scores.all,
#              aes(col=treatment, #shape=haplo.exons.lcwgs,
# #             aes(col=marine_region2, # Alternative line for the grouping variable  
#                     text=sample.id)) +  #text=, 
#         geom_point(aes_string(x=axes[i,"pc.x"], y=axes[i,"pc.y"]), size=1.5, alpha=0.85) +  
#         theme_minimal() + ggtitle("Global gene expression PC1xPC2") + 
#         ylab(paste(axes[i, "pc.y"], " (", round(variance.all[variance.all$axis==axes[i, "pc.y"], "variance"], digits = 1), "%)", sep="")) + 
#         xlab(paste(axes[i, "pc.x"], " (", round(variance.all[variance.all$axis==axes[i, "pc.x"], "variance"], digits = 1), "%)", sep="")) + 
#         theme(legend.position = "right", legend.text=element_text(size=8), legend.title=element_text(size=9)) + 
#         ggtitle(paste(axes[i, "pc.y"], "x", axes[i, "pc.x"], sep=" "))
# }
# for (plot in pcas.genome.all) {
#   print(ggplotly(plot, tooltip = c("treatment", "sample.id", "haplo.exons.lcwgs")))
# }

ggplotly(ggplot(pc.scores.all,
             aes(col=treatment, #shape=haplo.exons.lcwgs,
#             aes(col=marine_region2, # Alternative line for the grouping variable  
                    text=sample.id)) +  #text=, 
        geom_point(aes_string(x=axes[1,"pc.x"], y=axes[1,"pc.y"]), size=1.5, alpha=0.85) +  
        theme_minimal() + 
        ylab(paste(axes[1, "pc.y"], " (", round(variance.all[variance.all$axis==axes[1, "pc.y"], "variance"], digits = 1), "%)", sep="")) + 
        xlab(paste(axes[1, "pc.x"], " (", round(variance.all[variance.all$axis==axes[1, "pc.x"], "variance"], digits = 1), "%)", sep="")) + 
        theme(legend.position = "right", legend.text=element_text(size=8), legend.title=element_text(size=9)) + 
        ggtitle(paste(axes[1, "pc.y"], "x", axes[1, "pc.x"], sep=" ")),  tooltip = c("treatment", "sample.id"))
ggplotly(ggplot(pc.scores.all,
             aes(col=treatment, #shape=haplo.exons.lcwgs,
#             aes(col=marine_region2, # Alternative line for the grouping variable  
                    text=sample.id)) +  #text=, 
        geom_point(aes_string(x=axes[2,"pc.x"], y=axes[2,"pc.y"]), size=1.5, alpha=0.85) +  
        theme_minimal() + 
        ylab(paste(axes[2, "pc.y"], " (", round(variance.all[variance.all$axis==axes[2, "pc.y"], "variance"], digits = 1), "%)", sep="")) + 
        xlab(paste(axes[2, "pc.x"], " (", round(variance.all[variance.all$axis==axes[2, "pc.x"], "variance"], digits = 1), "%)", sep="")) + 
        theme(legend.position = "right", legend.text=element_text(size=8), legend.title=element_text(size=9)) + 
        ggtitle(paste(axes[2, "pc.y"], "x", axes[2, "pc.x"], sep=" ")),  tooltip = c("treatment", "sample.id"))
ggplotly(ggplot(pc.scores.all,
             aes(col=treatment, #shape=haplo.exons.lcwgs,
#             aes(col=marine_region2, # Alternative line for the grouping variable  
                    text=sample.id)) +  #text=, 
        geom_point(aes_string(x=axes[3,"pc.x"], y=axes[3,"pc.y"]), size=1.5, alpha=0.85) +  
        theme_minimal() + 
        ylab(paste(axes[3, "pc.y"], " (", round(variance.all[variance.all$axis==axes[3, "pc.y"], "variance"], digits = 1), "%)", sep="")) + 
        xlab(paste(axes[3, "pc.x"], " (", round(variance.all[variance.all$axis==axes[3, "pc.x"], "variance"], digits = 1), "%)", sep="")) + 
        theme(legend.position = "right", legend.text=element_text(size=8), legend.title=element_text(size=9)) + 
        ggtitle(paste(axes[3, "pc.y"], "x", axes[3, "pc.x"], sep=" ")),  tooltip = c("treatment", "sample.id"))

Great to see my experimental fish overlap well with the reference fish on PC1 x PC2! This gives me confidence that the steps I took to reduce the set of markers to just ~230k have alleviated some batch effects.

Here are possible location groupings based only on the PC1xPC2 biplot:

A. Pervenets
B. NearIslands, AmchitkaPass, TanagaIsland, Adak
C. Pribilof
D. Unimak, Zhemchug
E. CookInlet, Kodiak, WestKodiak
F. PWS
G. LynnCanal
H. HecateStrait

0.5 Identify SNP sets for spawning population assignments

As per the amre-adaptation tutorial I’m following, I now need to reduce my SNP set even further, to sites that predict population assignment well (high-Fst sites). For each population I will split the reference fish in half to make two groups, a training set and a testing set. Using the training set I will identify the top N markers that predict population; using the testing set I will evaluate the accuracy of assignments. This will help me identify a) which and how many sites to use for the experimental fish, and b) how confident I am in those assignments.

0.5.1 Generate training and testing groups from reference populations

I created a new directory, /home/lspencer/pcod-lcwgs-2023/analysis-20240606/wgsassign/, which is where I’ll perform the remaining steps to assign my experimental fish!

First, subset samples/bams for training and test purposes: With our filtered list of reference samples, half will be used for WGSassign training purposes (to select the SNPs needed for making assignments), and half for testing to make sure those SNPs work.

0.5.1.1 IMPORTANT NOTE: Here I remove Shumagins spawning location (Area 610) as per Ingrid, since it’s not clear where those fish are actually from

# Randomly sample half of each spawning population for training purposes 
refs.training <- sample.info.lcwgs %>% filter(sample %in% gsub("ABLG", "", pops.refs.filt)) %>%
  filter(treatment!="Shumagins") %>% 
  mutate(sample.id=paste("ABLG", sample, sep = "")) %>%
   group_by(marine_region_ingrid, treatment) %>% slice_sample(prop=0.50)

# The rest are for testing purposes
refs.test <-  sample.info.lcwgs %>% filter(sample %in% gsub("ABLG", "", pops.refs.filt)) %>%
  filter(treatment!="Shumagins") %>% 
  mutate(sample.id=paste("ABLG", sample, sep = "")) %>% 
  filter(sample %!in% refs.training$sample)

# Double check that the number of training and testing samples are equal 
refs.training %>% group_by(marine_region_ingrid, treatment) %>% tally() %>% adorn_totals() %>% as.data.frame() %>%
  dplyr::rename("n.training"="n") %>% left_join(
    refs.test %>% group_by(marine_region_ingrid, treatment) %>% tally() %>% adorn_totals() %>% as.data.frame() %>%
  dplyr::rename("n.test"="n")) %>% dplyr::rename("population"="treatment")
##    marine_region_ingrid         population n.training n.test
## 1             Aleutians        NearIslands         15     16
## 2             Aleutians       TanagaIsland         20     21
## 3             Aleutians       AmchitkaPass         19     20
## 4                   EBS          Pervenets         15     16
## 5                   EBS   AK_Knight-tagged          4      4
## 6                   EBS Vesteraalen-tagged          1      1
## 7                   EBS           Zhemchug         16     16
## 8                   EBS           Pribilof         16     17
## 9                  eGOA       HecateStrait         23     24
## 10                 eGOA          LynnCanal         23     23
## 11                  NBS       AmchitkaPass          1      1
## 12                  NBS          Pervenets          3      3
## 13                  NBS   AK_Knight-tagged          6      6
## 14                  NBS Vesteraalen-tagged          3      4
## 15                  NBS             Russia          6      6
## 16                  NBS           Zhemchug          1      1
## 17                  NBS           Pribilof          2      2
## 18                 wGOA             Unimak         20     21
## 19                 wGOA             Kodiak         23     23
## 20                 wGOA          CookInlet         12     13
## 21                 wGOA                PWS         16     17
## 22                 wGOA         WestKodiak         19     19
## 23                 <NA>         WestKodiak          3      3
## 24                 <NA>               Adak         20     21
## 25                Total                  -        287    304
# IMPORTANT 
# I forgot to use set.seed() above, so each time I run this chunk it will come up with new random sets of training/test fish. So, here's the actual list I used! 
refs.training.final <- read_delim(file = "../analysis-20240606/wgsassign/snp-training/all-locations/training_bams-list-pops.txt",
            delim = "\t", col_names = c("path", "bam", "ablg", "id", "location", "MR", "MR2", "MRIS"))
refs.training.final %>% group_by(location) %>% tally()
## # A tibble: 14 × 2
##    location         n
##    <chr>        <int>
##  1 Adak            20
##  2 AmchitkaPass    19
##  3 CookInlet       12
##  4 HecateStrait    23
##  5 Kodiak          23
##  6 LynnCanal       23
##  7 NearIslands     16
##  8 PWS             17
##  9 Pervenets       16
## 10 Pribilof        16
## 11 TanagaIsland    20
## 12 Unimak          20
## 13 WestKodiak      22
## 14 Zhemchug        16

Check out PC1 x PC2 using just training set

ggplotly( #this enables interactive plots; remove to plot to PDF figure 
      ggplot(pc.scores.all %>% filter(sample.id %in% refs.training$sample),
             aes(col=treatment, #shape=haplo.exons.lcwgs,
#             aes(col=marine_region2, # Alternative line for the grouping variable  
                    text=sample.id)) +  #text=, 
        geom_point(aes_string(x=axes[1,"pc.x"], y=axes[1,"pc.y"]), size=1.5, alpha=0.85) +  
        theme_minimal() + ggtitle("Training fish") + 
        ylab(paste(axes[1, "pc.y"], " (", round(variance.all[variance.all$axis==axes[1, "pc.y"], "variance"], digits = 1), "%)", sep="")) + 
        xlab(paste(axes[1, "pc.x"], " (", round(variance.all[variance.all$axis==axes[1, "pc.x"], "variance"], digits = 1), "%)", sep="")) + 
        theme(legend.position = "right", legend.text=element_text(size=8), legend.title=element_text(size=9)) + 
        ggtitle(paste("Training fish", axes[1, "pc.y"], "x", axes[1, "pc.x"], sep=" ")), tooltip = c("treatment", "sample.id"))

Check out PC1 x PC2 using just test set

ggplotly( #this enables interactive plots; remove to plot to PDF figure 
      ggplot(pc.scores.all %>% filter(sample.id %in% refs.test$sample),
             aes(col=treatment, #shape=haplo.exons.lcwgs,
#             aes(col=marine_region2, # Alternative line for the grouping variable  
                    text=sample.id)) +  #text=, 
        geom_point(aes_string(x=axes[1,"pc.x"], y=axes[1,"pc.y"]), size=1.5, alpha=0.85) +  
        theme_minimal() + 
        ylab(paste(axes[1, "pc.y"], " (", round(variance.all[variance.all$axis==axes[1, "pc.y"], "variance"], digits = 1), "%)", sep="")) + 
        xlab(paste(axes[1, "pc.x"], " (", round(variance.all[variance.all$axis==axes[1, "pc.x"], "variance"], digits = 1), "%)", sep="")) + 
        theme(legend.position = "right", legend.text=element_text(size=8), legend.title=element_text(size=9)) + 
        ggtitle(paste("Test fish", axes[1, "pc.y"], "x", axes[1, "pc.x"], sep=" ")), tooltip = c("treatment", "sample.id")) 

0.5.2 Save training fish bam list

refs.training.bams <-
read_delim(file="../analysis-20240606/reference/pcod-refs_filtered_bamslist.txt", delim = "/t", col_names = "fullpath") %>%
  mutate(file=gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/bamtools/pcod-refs_", "", fullpath)) %>%
    mutate(sample.id=gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/bamtools/pcod-refs_|_sorted_dedup_clipped.bam","", fullpath)) %>% 
  mutate(sample.no=as.numeric(gsub("ABLG", "", sample.id))) %>%
  filter(sample.id %in% refs.training$sample.id) %>%
  left_join(sample.info.lcwgs, by=c("sample.no"="sample"))

head(refs.training.bams)
## # A tibble: 6 × 8
##   fullpath                     file  sample.id sample.no treatment marine_region
##   <chr>                        <chr> <chr>         <dbl> <fct>     <fct>        
## 1 /home/lspencer/pcod-lcwgs-2… ABLG… ABLG10408     10408 WestKodi… wGOA         
## 2 /home/lspencer/pcod-lcwgs-2… ABLG… ABLG10409     10409 WestKodi… wGOA         
## 3 /home/lspencer/pcod-lcwgs-2… ABLG… ABLG10413     10413 WestKodi… wGOA         
## 4 /home/lspencer/pcod-lcwgs-2… ABLG… ABLG10415     10415 WestKodi… wGOA         
## 5 /home/lspencer/pcod-lcwgs-2… ABLG… ABLG10418     10418 WestKodi… wGOA         
## 6 /home/lspencer/pcod-lcwgs-2… ABLG… ABLG10419     10419 WestKodi… wGOA         
## # ℹ 2 more variables: marine_region_ingrid <fct>, marine_region2 <fct>
write_delim(refs.training.bams,
            file = "../analysis-20240606/reference/training_bams-list-pops.txt",
            delim = "\t", col_names = F)

0.5.3 Save test fish bam list

refs.test.bams <-
read_delim(file="../analysis-20240606/reference/pcod-refs_filtered_bamslist.txt", delim = "/t", col_names = "fullpath") %>%
rowid_to_column("order") %>% 
  mutate(file=gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/bamtools/pcod-refs_", "", fullpath)) %>%
    mutate(sample.id=gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/bamtools/pcod-refs_|_sorted_dedup_clipped.bam","", fullpath)) %>% 
  mutate(sample.no=as.numeric(gsub("ABLG", "", sample.id))) %>%
  filter(sample.id %in% refs.test$sample.id) %>%
  left_join(sample.info.lcwgs, by=c("sample.no"="sample"))

head(refs.test.bams)
## # A tibble: 6 × 9
##   order fullpath               file  sample.id sample.no treatment marine_region
##   <int> <chr>                  <chr> <chr>         <dbl> <fct>     <fct>        
## 1     3 /home/lspencer/pcod-l… ABLG… ABLG10410     10410 WestKodi… wGOA         
## 2     4 /home/lspencer/pcod-l… ABLG… ABLG10411     10411 WestKodi… wGOA         
## 3     5 /home/lspencer/pcod-l… ABLG… ABLG10412     10412 WestKodi… wGOA         
## 4     7 /home/lspencer/pcod-l… ABLG… ABLG10414     10414 WestKodi… wGOA         
## 5     9 /home/lspencer/pcod-l… ABLG… ABLG10416     10416 WestKodi… wGOA         
## 6    10 /home/lspencer/pcod-l… ABLG… ABLG10417     10417 WestKodi… wGOA         
## # ℹ 2 more variables: marine_region_ingrid <fct>, marine_region2 <fct>
write_delim(refs.test.bams,
            file = "../analysis-20240606/reference/test_bams-list-pops.txt",
            delim = "\t", col_names = F)

I rsynced those bam-list.txt files to the wgsassign directories:

  • /home/lspencer/pcod-lcwgs-2023/analysis-20240606/wgsassign2/snp-training/all-locations/training_bams-list-pops.txt
  • /home/lspencer/pcod-lcwgs-2023/analysis-20240606/wgsassign2/snp-testing/all-locations/test_bams-list-pops.txt

0.5.4 Get Site Allele Frequencies for each population

As per step 1 of the “SNP screening” stage in the AMRE Adaptation tutorial, I now need to get allele frequencies by reference population. First, I copied the sites file with polymorphic & unlinked sites (2 columns: chr, pos) into my wgsassign/ directory and indexed it:

cp reference/gls_wgassign/ngsLD/pcod-refs_wholegenome_unlinked.sites wgsassign/pcod-refs_wholegenome_unlinked.sites
angsd sites index wgsassign/pcod-refs_wholegenome_unlinked.sites

Then, I created a slurm script using angsd to pull allele frequencies for each population, wgsassign/snp-training/pcod-training-SAFs.sh, which produced separate site allele frequency files in wgsassign/snp-training/pop-safs/ (eg wgsassign_refs-training.Adak.saf.gz) for each population.
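The per-population SAF calls look roughly like the sketch below (one population shown; the per-population bam-list name is a placeholder, and I'm assuming the reference fasta doubles as the "ancestral" sequence, the usual choice when no true ancestral genome is available):

# Sketch of one population's SAF step in pcod-training-SAFs.sh
pop=Adak
angsd -b bams-${pop}.txt \
      -ref /home/lspencer/references/pcod-ncbi/GCF_031168955.1_ASM3116895v1_genomic.fa \
      -anc /home/lspencer/references/pcod-ncbi/GCF_031168955.1_ASM3116895v1_genomic.fa \
      -sites pcod-refs_wholegenome_unlinked.sites \
      -GL 1 -doSaf 1 \
      -out pop-safs/wgsassign_refs-training.${pop}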

0.5.5 Get the 2d site frequency spectrum and FST for each pairwise comparison of populations.

As per steps 2 & 3 of the “SNP screening” stage in the AMRE Adaptation tutorial I now need to generate 2d site frequency spectrums AND calculate FST among each pairwise combination of populations. This step takes a while since there are SO many populations and thus SO many combinations of populations (didn’t write it as an array, need to!!!). Script is wgsassign/snp-training/pcod-training-2dsfs-fst.sh. The final step of the script filters sites that have Fst values > 0 (done so for each pairwise comparison) and sort them from sites with highest to lowest Fst.

0.5.6 Create subsets of SNPs for each pairwise comparison for testing purposes

I used the script wgsassign/snp-training/pcod-training-pull-top-snps.sh to pull the top n most-differentiated SNPs, based on FST, from each pairwise list of sites (one per population combination), where n = 10, 50, 100, 500, 1000, 5000, 10000, 15000, 20000, 25000, 35000, and 45000. Since SNPs overlapped among lists, filtering to unique SNPs resulted in the totals below (a sketch of the union-building logic follows the counts):

  # Sites
  316,661 training.all_sites
  300,846 training.top_45000_sites
  290,738 training.top_35000_sites
  271,401 training.top_25000_sites
  255,218 training.top_20000_sites
  230,919 training.top_15000_sites
  193,590 training.top_10000_sites
  132,136 training.top_5000_sites
   41,591 training.top_1000_sites
   23,408 training.top_500_sites
    5,438 training.top_100_sites
    2,779 training.top_50_sites
     565 training.top_10_sites
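For illustration, the union-building logic amounts to something like this sketch (directory and file names are placeholders):

# Sketch: union of the top-n sites across all pairwise FST-sorted lists
for n in 10 50 100 500 1000 5000 10000 15000 20000 25000 35000 45000; do
  for f in fst-sorted/*.fst.sorted; do
    head -n ${n} ${f} | awk '{print $1, $2}'   # keep chr and pos
  done | sort -u > training.top_${n}_sites     # de-duplicate across pairs
done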

0.5.6.1 Examine assignment accuracy with leave-one-out cross validation (using WGSassign)

I generated 12 different sets of SNPs that have high Fst for each pairwise population contrast. To determine the best set to represent the population structure in the reference fish, I needed to test each set on the reference fish that were NOT used in producing them (the testing set). This was done in WGSassign. Two input files were needed:

NOTE: This is where different experiments COULD enter the pipeline. Because different lcWGS datasets can have better/worse data at each site, all the following steps should only examine sites that are ALSO found in experimental fish.

0.5.6.1.1 WGSassign LOO input 1: Beagle file of the testing set of reference fish

I needed 12 separate beagle files with GLs for my testing set of reference fish only, one for each set of top-n SNPs identified in the last step, and filtered to ONLY include sites for which I also have good data in the experimental fish. First, I needed to prepare a master beagle file that contained all sites for only my test individuals. This is easier said than done! I created this file by selecting columns for test individuals from the refs-exp-merged.beagle.gz file containing GLs for all reference and experimental fish. Here are the steps I took:

Previously, I wrote a script (/home/lspencer/pcod-lcwgs-2023/analysis-20240606/wgsassign/join-beagles.sh) that joined my reference.beagle and experimental.beagle files at overlapping sites; it also renamed columns to sample IDs based on the order specified in the bamlist.txt file used by angsd (the program that produced the beagle files). I then selected columns from my merged beagle (wgsassign/refs-exp-merged.beagle.gz) with the script /home/lspencer/pcod-lcwgs-2023/analysis-20240606/wgsassign/snp-testing/subset-beagle.sh, which pulls columns matching a list of sample IDs (wgsassign/snp-testing/test-IDs.txt). I also created that ID list within subset-beagle.sh by pulling just the sample ID and population from /wgsassign/snp-testing/test_bams-list.txt. The resulting beagle file containing only GLs from test individuals is /wgsassign/snp-testing/testing.beagle.gz. A sketch of the column-subsetting logic follows.
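This is roughly the logic I believe subset-beagle.sh implements: a hedged sketch, assuming each individual’s three genotype-likelihood columns carry that individual’s sample ID in the beagle header (the actual script may differ).

# Subset columns of a merged beagle file to a set of sample IDs. Assumes the
# first 3 columns are marker/allele1/allele2 and each sample's 3 GL columns
# are headed by its sample ID (test-IDs.txt: one ID + population per line)
zcat wgsassign/refs-exp-merged.beagle.gz | \
awk -v ids=wgsassign/snp-testing/test-IDs.txt '
  BEGIN { while ((getline line < ids) > 0) { split(line, a, "\t"); keep[a[1]] = 1 } }
  FNR == 1 {
    # always keep the 3 site columns; keep GL columns whose header is a test ID
    for (i = 1; i <= NF; i++) if (i <= 3 || ($i in keep)) cols[++n] = i
  }
  {
    out = $(cols[1])
    for (j = 2; j <= n; j++) out = out "\t" $(cols[j])
    print out
  }' | gzip > wgsassign/snp-testing/testing.beagle.gz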

Finally, I used the /wgsassign/snp-testing/all-locations/top-n-beagles.sh script to filter testing.beagle.gz 12 times (creating 12 new beagle files), once for each set of top-n sites identified in the previous step. Line counts for each are below, followed by a sketch of the filtering:

232584   ../testing.beagle.gz
221507   testing-top-45000.beagle.gz
214138   testing-top-35000.beagle.gz
199910   testing-top-25000.beagle.gz
188035   testing-top-20000.beagle.gz
170092   testing-top-15000.beagle.gz
142425   testing-top-10000.beagle.gz
 96608   testing-top-5000.beagle.gz
 29410   testing-top-1000.beagle.gz
 16465   testing-top-500.beagle.gz
  3811   testing-top-100.beagle.gz
  1982   testing-top-50.beagle.gz
   422   testing-top-10.beagle.gz
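A hedged sketch of that per-set filtering (top-n-beagles.sh), assuming the sites files’ chr and pos columns join with “_” to match the beagle marker IDs:

# Keep the beagle header plus rows whose marker ID (chr_pos) appears in the
# top-n sites file; the sites-file format is an assumption
for N in 10 50 100 500 1000 5000 10000 15000 20000 25000 35000 45000; do
  awk 'NR == FNR { keep[$1"_"$2] = 1; next } FNR == 1 || ($1 in keep)' \
      training.top_${N}_sites <(zcat ../testing.beagle.gz) | \
    gzip > testing-top-${N}.beagle.gz
done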
0.5.6.1.2 WGSassign LOO input 2: Tab-delimited ID file

A tab-delimited file with 2 columns, the first being sample ID and the second being the known reference population. As mentioned above, this is the wgsassign/snp-testing/test-IDs.txt file created as part of the subset-beagle.sh script from /wgsassign/snp-testing/test_bams-list.txt (which was created in a previous code chunk in R).

I ran the leave-one-out testing (to see how well each SNP set does at predicting our reference test fish) using the script wgsassign/snp-testing/get-loo-WGSassign.sh. In addition to the 12 SNP subsets, I also ran it using all ~230k SNPs that we have data for in both reference and experimental fish.
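Based on the WGSassign README, the LOO runs likely look something like this sketch (the loop and thread count are my assumptions; the output names match the files read into R below):

# Hedged sketch of get-loo-WGSassign.sh: estimate reference population allele
# frequencies with leave-one-out log-likelihoods for each SNP set
for N in 10 50 100 500 1000 5000 10000 15000 20000 25000 35000 45000; do
  WGSassign --beagle testing-top-${N}.beagle.gz \
            --pop_af_IDs test-IDs.txt \
            --get_reference_af --loo \
            --out top-${N} --threads 20
done
# Outputs include top-<N>.pop_like_LOO.txt (per-fish LOO log-likelihoods by
# population) and top-<N>.pop_names.txt (population column order)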

0.5.7 Assign test individuals to spawning locations to get accuracy rates

# Population column order (from WGSassign output) and known origins of test fish
pops <- (read_delim("../analysis-20240606/wgsassign/snp-testing/all-locations/top-10.pop_names.txt", delim = "\t", col_names = "population"))$population 
testing.samples.pop <- read_delim("../analysis-20240606/wgsassign/snp-testing/all-locations/test-IDs.txt", col_names = c("sample", "population")) 

# One LOO log-likelihood file per SNP set
like_loo_files <- list.files(path="../analysis-20240606/wgsassign/snp-testing/all-locations/", pattern = "_LOO\\.txt$", full.names = TRUE)

# Extract n (top SNPs per pairwise comparison) from each file name
n.top <- vector()
for (i in 1:length(like_loo_files)) {
  n.top[i] <- as.numeric(gsub("../analysis-20240606/wgsassign/snp-testing/all-locations/top-|.pop_like_LOO.txt", "", like_loo_files[i]))
}

like_loo_AssAcc_pops <- vector("list", length(like_loo_files))
names(like_loo_AssAcc_pops) <- n.top

for(i in 1:length(like_loo_files)){
  infile <- like_loo_files[i]
  like_loo <- read_table(infile, col_names = pops) 
  testing.samples.assigned <- cbind(testing.samples.pop, like_loo)
  
  # Assign each test fish to the population with the maximum LOO log-likelihood
  testing.summary <- testing.samples.assigned %>%
    pivot_longer(cols = Adak:Zhemchug,
                 names_to = "AssignedPop",
                 values_to = "AssignedLike") %>%
    group_by(sample) %>%
    filter(AssignedLike == max(AssignedLike)) %>%
    ungroup() %>%
    mutate(Correct = if_else(population == AssignedPop, 1, 0))
  
  # Per-population (and total) assignment accuracy for this SNP set
  like_loo_AssAcc_pops[[i]] <- 
    testing.summary %>% group_by(population) %>% summarise(n.correct=sum(Correct), n.total=n()) %>%
    adorn_totals() %>% mutate(accuracy=signif(100*n.correct/n.total, digits = 2)) %>% 
    mutate(n.top=n.top[i])
}

like_loo_summary <- bind_rows(like_loo_AssAcc_pops, .id = "n.top")  %>% as.data.frame() %>% mutate(n.top=factor(n.top, ordered=T, levels=c(10,50,100,500,1000,5000,10000,15000,20000,25000,35000,45000))) %>%
# add actual number of markers used 
  mutate(n.sites=case_when(
    n.top==10 ~ 422,
    n.top==50 ~ 1982,
    n.top==100 ~ 3811,
    n.top==500 ~ 16465,
    n.top==1000 ~ 29410,
    n.top==5000 ~ 96608,
    n.top==10000 ~ 142425,
    n.top==15000 ~ 170092,
    n.top==20000 ~ 188035,
    n.top==25000 ~ 199910,
    n.top==35000 ~ 214138,
    n.top==45000 ~ 221507))

What’s the overall assignment accuracy for each number of SNPs pulled?

like_loo_summary %>%  filter(population =="Total") %>% arrange(desc(accuracy)) %>% select(population, n.top, accuracy) %>% kable()
population n.top accuracy
Total 50 31
Total 100 27
Total 10 26
Total 1000 26
Total 500 26
Total 5000 26
Total 15000 25
Total 10000 24
Total 20000 24
Total 25000 24
Total 35000 24
Total 45000 24
cluster_colors <-  c(Adak="#a6cee3", AmchitkaPass="#1f78b4", CookInlet="#b2df8a", 
                     HecateStrait="#33a02c", Kodiak="#fb9a99", LynnCanal="#e31a1c", 
                     NearIslands="#fdbf6f", Pervenets="#ff7f00", Pribilof="#cab2d6", 
                     PWS="#6a3d9a", Shumagins="#ffff99", TanagaIsland="#b15928", 
                     Total="black", Unimak="gray70", WestKodiak="darkgreen", 
                     Zhemchug="magenta")

like_loo_summary %>% filter(population =="Total") %>% ggplot() +
#  geom_point(aes(x = log10(snps), y = accuracy,  text=topn), size=2.5) + 
  geom_label_repel(aes(x = log10(n.sites), y = accuracy,  label=n.top), size=4) + 
  theme_minimal() 

like_loo_summary %>% filter(population!="Total") %>%
  ggplot() +
#  geom_point(aes(x = log10(snps), y = accuracy,  text=topn), size=2.5) + 
  geom_bar(aes(x = n.top, y = n.correct,  fill=population), position="stack", stat="identity", color="black") + 
  theme_minimal() + ggtitle("Number correct population assignments by number SNPs, test fish") + 
  xlab("Number top SNPs used per population (as per Fst)") + ylab("Number accurate assignments (280 fish total)") +
  scale_fill_manual(values=cluster_colors)

ggplotly(like_loo_summary %>% #filter(population!="Total") %>%
  ggplot() +
  geom_line(aes(x = n.sites, y = accuracy,  color=population, text=n.top), cex=.75) + 
  theme_minimal() + ggtitle("Accuracy per population by number SNPs, test fish") + 
  xlab("Number SNPs") + ylab("Percent accurate assignments (279 fish total)") +
  scale_color_manual(values=cluster_colors))
# Zoom in 
ggplotly(like_loo_summary %>% filter(n.sites < 30000) %>%
  ggplot() +
  geom_line(aes(x = n.sites, y = accuracy,  color=population), cex=.75) + 
  theme_minimal() + ggtitle("Accuracy per population by number SNPs, test fish") + 
  xlab("Number SNPs") + ylab("Percent accurate assignments (279 fish total)") +
  scale_color_manual(values=cluster_colors))

The two best options are to use the top 50 and top 5,000 SNPs from each pairwise population comparison to generate our list of population-predicting markers (1,982 and 96,608 unique markers, respectively). I’ll do both!

like_loo_summary %>%  filter(n.top %in% c(50, 5000)) %>% 
  select(population, n.top, accuracy) %>% pivot_wider(names_from = n.top, values_from = accuracy) %>%
#  write_clip()
  kable()
population 50 5000
Adak 4.8 0
AmchitkaPass 21.0 47
CookInlet 0.0 0
HecateStrait 50.0 0
Kodiak 61.0 61
LynnCanal 88.0 100
NearIslands 31.0 0
PWS 53.0 24
Pervenets 0.0 0
Pribilof 0.0 0
TanagaIsland 38.0 0
Unimak 43.0 90
WestKodiak 4.5 0
Zhemchug 6.2 0
Total 31.0 26

0.6 Assign experimental fish to a spawning location

Finally, we come to the step where we can assign our experimental fish! This is done in WGSassign using “the numpy binary file of the allele frequencies from the testing individuals as the reference file for assignment” and “a beagle file of just [experimental] individuals as input.” I need to create that beagle file from our list of 126k sites (generated from the top 5k differentiating sites as per each pairwise Fst calc). Using the script /wgsassign/assignment/population-assignment.sh I create a new beagle.gz file that contains genotype likelihoods for my experimental fish at those sites, then use WGSassign to identify their populations of origin!
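The commands in population-assignment.sh presumably resemble this sketch (flags per the WGSassign README; the training beagle and reference-AF file names are my assumptions, while the .pop_like output name matches the file read below):

# Hedged sketch of the assignment step
# 1) Reference allele frequencies per population (produces *.pop_af.npy)
WGSassign --beagle training-top-5000.beagle.gz --pop_af_IDs training-IDs.txt \
          --get_reference_af --out refs-top-5000 --threads 20
# 2) Assign experimental fish against those frequencies (produces *.pop_like.txt)
WGSassign --beagle pcod-exp-top-5000.beagle.gz \
          --pop_af_file refs-top-5000.pop_af.npy \
          --get_pop_like --out pcod-experimental-assign_top-5000-snps --threads 20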

0.6.1 Experimental fish assignment results, using top 50 sites for all spawning locations:

like_loo_exp.50 <- read_table("../analysis-20240606/wgsassign/assignment/pcod-experimental-assign_top-50-snps.pop_like.txt", col_names = pops) %>%   mutate(across(Adak:Zhemchug, as.numeric)) 
# Set -Inf/NaN likelihoods (unassignable samples, e.g. from missing data) to NA so they can be filtered out
like_loo_exp.50[like_loo_exp.50 == "-Inf"] <- NA
like_loo_exp.50[like_loo_exp.50 == "NaN"] <- NA

exp.assigned.50 <- cbind(
  read_delim("../analysis-20240606/wgsassign/assignment/pcod-exp_filtered_bamslist.txt", delim = "\t", col_names = "sample.full") %>%
  mutate(sample=as.numeric(gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/experimental/bamtools/pcod-exp_GM|_sorted_dedup_clipped.bam", "", sample.full))) %>%
#  mutate(sample=as.numeric(gsub("GM", "", sample.full))) %>% 
  left_join(sample.info.lcwgs), 
  like_loo_exp.50)

# how many samples have NA likelihoods? Those will be removed 
exp.assigned.50 %>%  filter(if_any(Adak:Zhemchug, is.na)) %>%  group_by(treatment) %>% tally() %>% kable()
treatment n
0 1
5 1
16 3
na.50 <- (exp.assigned.50 %>%  filter(if_any(Adak:Zhemchug, is.na)) %>%  group_by(treatment))$sample

exp.assign.summary.50 <- exp.assigned.50 %>%
  mutate(across(Adak:Zhemchug, as.numeric)) %>% 
  pivot_longer(cols = Adak:Zhemchug,
               names_to = "AssignedPop",
               values_to = "AssignedLike") %>%
  group_by(sample) %>%
  filter(sample %!in% na.50) %>%
  # Convert log-likelihoods to assignment probabilities (softmax; subtract the max for numerical stability)
  mutate(AssignedProb = round(exp(AssignedLike - max(AssignedLike)) / sum(exp(AssignedLike - max(AssignedLike))),2 )) %>%
  filter(AssignedLike == max(AssignedLike)) %>%
  ungroup() 
# write_csv(x = nonbreeding.summary,
#           file = "./output/amre.nonbreeding.ind148.ds_2x.sites-filter.top_50_each.assignment_summary.csv")

# Summarize assignment probabilities by assigned population

exp.assign.summary.50 %>% group_by(AssignedPop) %>% 
  summarize(min.prob=min(AssignedProb),
            max.prob=max(AssignedProb),
            mean.prob=mean(AssignedProb)) %>%
  kable()
AssignedPop min.prob max.prob mean.prob
AmchitkaPass 0.92 0.99 0.9550000
Kodiak 0.49 1.00 0.9150000
PWS 0.85 1.00 0.9600000
Pervenets 0.97 1.00 0.9850000
TanagaIsland 0.80 1.00 0.9000000
Unimak 0.55 1.00 0.9866337
Zhemchug 0.63 1.00 0.9071429
exp.assign.summary.50 %>%
  group_by(AssignedPop) %>%
  summarize(N = n()) %>%
  ungroup() %>%
  ggplot(aes(x="", y=N, fill = AssignedPop)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  scale_fill_manual(values = cluster_colors) +
  theme_void() +
  ggtitle("Experimental Fish Assignments\nusing top 50 Fst markers (1,982 total)")

#### But how accurate are these assignments for each population, and for erroneous ones, which population are they assigned to by accident?

like_loo_pop_50 <- read_table("../analysis-20240606/wgsassign/snp-testing/all-locations/top-50.pop_like_LOO.txt", col_names = pops) 
testing.summary.50 <- cbind(testing.samples.pop, like_loo_pop_50) %>%  
  pivot_longer(cols = Adak:Zhemchug,
               names_to = "AssignedPop",
               values_to = "AssignedLike") %>%
  group_by(sample) %>%
  filter(AssignedLike == max(AssignedLike)) %>%
  ungroup() %>%
  mutate(Correct = if_else(population == AssignedPop, 1, 0))

assign.acc.pop.50 <- like_loo_summary %>% filter(n.top == "50") %>% 
  arrange(desc(accuracy)) %>% select(population, accuracy) %>%
  filter(population!="Total") %>% 
  left_join(testing.summary.50 %>% 
  group_by(population) %>% tally())

testing.summary.50 %>% 
  group_by(population, AssignedPop) %>% tally() %>%
  ggplot() +
  geom_bar(aes(x=population, y=n, fill=AssignedPop), 
           position="stack", stat="identity", color="black") +
      geom_text(data=assign.acc.pop.50, aes(x = as.factor(population), y = n + 1, label = paste0(accuracy, "%")), 
            size = 3, vjust = 0) + 
    theme_minimal() + ggtitle("What % of each population was assigned correctly?\n(Using top 50 SNPs per pairwise, 1,982 markers total") +
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  scale_fill_manual(values=cluster_colors, name="Assigned Population") + 
  xlab("Actual Population of Origin")

# # Replot with relevant populations
# testing.summary.50 %>% 
#   filter(population %in% c("AmchitkaPass", "Kodiak", "Pervenets", "PWS", "TanagaIsland", "Unimak", "Zhemchug")) %>% 
#   group_by(population, AssignedPop) %>% tally() %>%
#   ggplot() +
#   geom_bar(aes(x=population, y=n, fill=AssignedPop), 
#            position="stack", stat="identity", color="black") +
#     theme_minimal() + ggtitle("How test fish were assigned\n(Using top 50 SNPs per pairwise, 1,982 markers total\n(Relevant populations)") +
#    theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
#   scale_fill_manual(values=cluster_colors)

0.6.1.1 How many assignments do I believe are accurate?

The below figure shows the % of assignments to each population that were accurate; the complement is the Type I error / false positive rate:

testing.summary.50.type1 <- testing.summary.50 %>% 
  group_by(population, AssignedPop) %>% tally()  %>% 
  mutate(correct=as.factor(case_when(
    AssignedPop == population ~ "correct", TRUE ~ "incorrect"))) %>% 
  group_by(AssignedPop, correct) %>% summarize(n=sum(n)) %>%
  pivot_wider(names_from = correct, values_from = n) %>%
  replace(is.na(.), 0) %>% 
  mutate(total=correct+incorrect) %>% 
  mutate(accuracy = percent(correct/(total)))

testing.summary.50 %>% 
  group_by(population, AssignedPop) %>% tally() %>% 
  ggplot() +
  geom_bar(aes(x=AssignedPop, y=n, fill=population), 
           position="stack", stat="identity", color="black") +

     geom_text(data=testing.summary.50.type1, # accuracy is already %-formatted by scales::percent()
               aes(x=AssignedPop, y = total + 2, label = accuracy),
            size = 3, vjust = 0) +
  
    theme_minimal() + 
  ggtitle("Percent of population assignments that are likely accurate") +
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  scale_fill_manual(values=cluster_colors, name="Actual population\nof Origin") + 
  ylab("Number of fish assigned") + xlab("Assigned Population") +
  theme(title = element_text(size=10))

0.6.1.2 Summary of how many experimental fish were assigned to each population using top 50 SNPs, and assignment accuracy for those populations based on LOO testing

exp.assign.summary.50 %>%
  group_by(AssignedPop) %>% tally() %>%
  left_join(like_loo_summary %>% filter(n.top==50) %>% dplyr::select(population, accuracy),
            by=c("AssignedPop"="population")) %>% arrange(desc(accuracy)) %>%
  rename("Number assigned"="n") %>% kable()
AssignedPop Number assigned accuracy
Kodiak 34 61.0
PWS 4 53.0
Unimak 101 43.0
TanagaIsland 2 38.0
AmchitkaPass 2 21.0
Zhemchug 7 6.2
Pervenets 2 0.0

0.6.2 Experimental fish assignment results, using top 5,000 sites for all spawning locations:

like_loo_exp.5k <- read_table("../analysis-20240606/wgsassign/assignment/pcod-experimental-assign_top-5000-snps.pop_like.txt", col_names = pops) %>%   mutate(across(Adak:Zhemchug, as.numeric)) 
like_loo_exp.5k[like_loo_exp.5k == "-Inf"] <- NA
like_loo_exp.5k[like_loo_exp.5k == "NaN"] <- NA

exp.assigned.5k <- cbind(
  read_delim("../analysis-20240606/wgsassign/assignment/pcod-exp_filtered_bamslist.txt", delim = "\t", col_names = "sample.full") %>%
  mutate(sample=as.numeric(gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/experimental/bamtools/pcod-exp_GM|_sorted_dedup_clipped.bam", "", sample.full))) %>%
#  mutate(sample=as.numeric(gsub("GM", "", sample.full))) %>% 
  left_join(sample.info.lcwgs), 
  like_loo_exp.5k)

# Five samples (per the tally below) had -Inf likelihood values, so we cannot assign them. I need to figure out why. Perhaps missing data? That would make sense, since more are from the 5C treatment and we have less data from those samples for some reason. 
exp.assigned.5k %>%  filter(if_any(Adak:Zhemchug, is.na)) %>%  group_by(treatment) %>% tally()
## # A tibble: 2 × 2
##   treatment     n
##   <fct>     <int>
## 1 5             4
## 2 9             1
na.5k <- (exp.assigned.5k %>%  filter(if_any(Adak:Zhemchug, is.na)) %>%  group_by(treatment))$sample

exp.assign.summary.5k <- exp.assigned.5k %>%
  mutate(across(Adak:Zhemchug, as.numeric)) %>% 
  pivot_longer(cols = Adak:Zhemchug,
               names_to = "AssignedPop",
               values_to = "AssignedLike") %>%
  group_by(sample) %>%
  filter(sample %!in% na.5k) %>%
  mutate(AssignedProb = round(exp(AssignedLike - max(AssignedLike)) / sum(exp(AssignedLike - max(AssignedLike))),2 )) %>%
  filter(AssignedLike == max(AssignedLike)) %>%
  ungroup() 
# write_csv(x = nonbreeding.summary,
#           file = "./output/amre.nonbreeding.ind148.ds_2x.sites-filter.top_50_each.assignment_summary.csv")

# Summarize assignment probabilities by assigned population

exp.assign.summary.5k %>% group_by(AssignedPop) %>% 
  summarize(min.prob=min(AssignedProb),
            max.prob=max(AssignedProb),
            mean.prob=mean(AssignedProb))
## # A tibble: 1 × 4
##   AssignedPop min.prob max.prob mean.prob
##   <chr>          <dbl>    <dbl>     <dbl>
## 1 Unimak             1        1         1
exp.assign.summary.5k %>%
 # filter(AssignedProb > 0.8) %>%
#  nrow() %>%
  group_by(AssignedPop) %>% tally()# %>% write_clip()
## # A tibble: 1 × 2
##   AssignedPop     n
##   <chr>       <int>
## 1 Unimak        152
exp.assign.summary.5k %>%
  group_by(AssignedPop) %>%
  summarize(N = n()) %>%
  ungroup() %>%
  ggplot(aes(x="", y=N, fill = AssignedPop)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  scale_fill_manual(values = cluster_colors) +
  theme_void() + 
  ggtitle("Experimental Fish Assignments\nusing top 5,000 Fst markers (96,608 total)")

0.6.2.1 But how accurate are these assignments for each population, and for erroneous ones, which population are they assigned to by accident?

like_loo_pop_5k <- read_table("../analysis-20240606/wgsassign/snp-testing/all-locations/top-5000.pop_like_LOO.txt", col_names = pops) 
testing.summary.5k <- cbind(testing.samples.pop, like_loo_pop_5k) %>%  
  pivot_longer(cols = Adak:Zhemchug,
               names_to = "AssignedPop",
               values_to = "AssignedLike") %>%
  group_by(sample) %>%
  filter(AssignedLike == max(AssignedLike)) %>%
  ungroup() %>%
  mutate(Correct = if_else(population == AssignedPop, 1, 0))

testing.summary.5k %>% 
  group_by(population, AssignedPop) %>% tally() %>%
  ggplot() +
  geom_bar(aes(x=population, y=n, fill=AssignedPop), 
           position="stack", stat="identity", color="black") +
    theme_minimal() + ggtitle("How test fish were assigned\n(Using top 5k SNPs per pairwise, 96,608 markers total") +
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  scale_fill_manual(values=cluster_colors)

0.6.2.2 Summary of how many experimental fish were assigned to each population using top 5,000 SNPs, and assignment accuracy for those populations based on LOO testing

exp.assign.summary.5k %>%
  group_by(AssignedPop) %>% tally() %>%
  left_join(like_loo_summary %>% filter(n.top==5000) %>% dplyr::select(population, accuracy),
            by=c("AssignedPop"="population")) %>% arrange(desc(accuracy)) %>%
  rename("Number assigned"="n") %>% kable()
AssignedPop Number assigned accuracy
Unimak 152 90

Looking above at the combined PCA - do I see a high degree of overlap between just Unimak and experimental fish in my combined PCA?

Not necessarily. It’s possible that there are indeed some fish from Kodiak, Amchitka Pass, PWS, Zhemchug, Tanaga, and Pervenets, as per the top 50 SNP analysis.

0.7 Assign experimental fish to Marine Regions

Here I re-run the pipeline after grouping the spawning locations into broader marine regions, and use those regions to identify SNPs and assign our experimental fish.

Files are saved in the pcod-lcwgs-2023/analysis-20240606/wgsassign/snp-training/regions/ and wgsassign/snp-testing/regions/ directories

Summary of to-dos:

- Identify reference fish (decision: DO include Japan & Korea)
- Group them into our MarineRegion2; make sure grouping is correct!
- Create training & test sets of bam lists for each marine region
- Downsample regions with very high numbers of reference fish (as that can sway assignments - more likely to assign fish to those regions)
- Get SAFs, and for each pairwise marine region combination calculate 2dSFS & Fst
- Subset top n SNPs based on Fst
- Assign test individuals, assess accuracy to identify how many top n SNPs to use
- Assign experimental fish to marine regions

0.7.1 Filter SNPs for those that predict population structure (high Fst)

0.7.1.1 Subset samples/bams for training and test purposes

set.seed(seed = 100) # So I get the same (random) IDs when I knit this, etc.; remove if new sets of training & testing individuals are needed 

# ## Use ALL reference samples (don't downsample)
# # Randomly sample half of each spawning population for training purposes
# refs.training_MR <- sample.info.lcwgs %>% filter(sample %in% gsub("ABLG", "", pops.refs)) %>% #use all reference samples except japan/korea
#   mutate(sample.id=paste("ABLG", sample, sep = "")) %>%
#    group_by(marine_region2) %>% slice_sample(prop=0.51)
# 
# # The rest are for testing purposes
# refs.test_MR <-  sample.info.lcwgs %>% filter(sample %in% gsub("ABLG", "", pops.refs)) %>% #use all reference samples except japan/korea
#   mutate(sample.id=paste("ABLG", sample, sep = "")) %>%
#   filter(sample %!in% refs.training_MR$sample)
# 
# # Double check that the number of training and testing samples are equal
# refs.training_MR %>% group_by(marine_region2) %>% tally() %>% adorn_totals() %>% as.data.frame() %>%
#   dplyr::rename("n.training"="n") %>% left_join(
#     refs.test_MR %>% group_by(marine_region2) %>% tally() %>% adorn_totals() %>% as.data.frame() %>%
#   dplyr::rename("n.test"="n"))

## Use ~equivalent number of samples per Marine Region (downsample)

refs.training_MR <- 
  # #Randomly sample up to 55 fish per marine region for all but NBS 
  sample.info.lcwgs %>% filter(sample %in% gsub("ABLG", "", pops.refs)) %>% #use all reference samples except japan/korea
  mutate(sample.id=paste("ABLG", sample, sep = "")) %>%
  filter(marine_region2 != "NBS") %>%
   group_by(marine_region2) %>% slice_sample(n = 55) %>% ungroup() %>%

  # Add NBS - here randomly sample 51% of fish 
    add_row(sample.info.lcwgs %>% filter(sample %in% gsub("ABLG", "", pops.refs)) %>% #use all reference samples except japan/korea
  mutate(sample.id=paste("ABLG", sample, sep = "")) %>%
  filter(marine_region2 == "NBS") %>%
   group_by(marine_region2) %>% slice_sample(prop = 0.51)) %>% ungroup()


# The rest are for testing purposes
refs.test_MR <-  sample.info.lcwgs %>% filter(sample %in% gsub("ABLG", "", pops.refs)) %>% #use all reference samples except japan/korea
  mutate(sample.id=paste("ABLG", sample, sep = "")) %>% 
  filter(sample %!in% refs.training_MR$sample)

# Double check the training/testing split counts per marine region 
refs.training_MR %>% group_by(marine_region2) %>% tally() %>% adorn_totals() %>% as.data.frame() %>%
  dplyr::rename("n.training"="n") %>% left_join(
    refs.test_MR %>% group_by(marine_region2) %>% tally() %>% adorn_totals() %>% as.data.frame() %>%
  dplyr::rename("n.test"="n"))
##   marine_region2 n.training n.test
## 1      Aleutians         55     98
## 2            EBS         55     52
## 3           eGOA         55     41
## 4            NBS         23     23
## 5           wGOA         55    153
## 6          Total        243    367

0.7.1.2 Save training bam list

(refs.training.bams_MR <-
read_delim(file="../analysis-20240606/reference/pcod-refs_filtered_bamslist.txt", delim = "/t", col_names = "fullpath") %>%
  mutate(file=gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/bamtools/pcod-refs_", "", fullpath)) %>%
    mutate(sample.id=gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/bamtools/pcod-refs_|_sorted_dedup_clipped.bam","", fullpath)) %>% 
  mutate(sample.no=as.numeric(gsub("ABLG", "", sample.id))) %>%
  filter(sample.id %in% refs.training_MR$sample.id) %>%
  left_join(sample.info.lcwgs, by=c("sample.no"="sample")) %>%
  dplyr::select(fullpath,sample.id,marine_region2))
## # A tibble: 243 × 3
##    fullpath                                             sample.id marine_region2
##    <chr>                                                <chr>     <fct>         
##  1 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10409 wGOA          
##  2 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10410 wGOA          
##  3 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10413 wGOA          
##  4 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10421 wGOA          
##  5 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10422 wGOA          
##  6 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10424 wGOA          
##  7 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10430 wGOA          
##  8 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10432 wGOA          
##  9 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10436 wGOA          
## 10 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10443 wGOA          
## # ℹ 233 more rows
write_delim(refs.training.bams_MR,
            file = "../analysis-20240606/wgsassign/snp-training/regions/training_bams-list_MR.txt",
            delim = "\t", col_names = F)

0.7.1.3 Save test bam list

(refs.test.bams_MR <-
read_delim(file="../analysis-20240606/reference/pcod-refs_filtered_bamslist.txt", delim = "/t", col_names = "fullpath") %>%
rowid_to_column("order") %>% 
  mutate(file=gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/bamtools/pcod-refs_", "", fullpath)) %>%
    mutate(sample.id=gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/bamtools/pcod-refs_|_sorted_dedup_clipped.bam","", fullpath)) %>% 
  mutate(sample.no=as.numeric(gsub("ABLG", "", sample.id))) %>%
  filter(sample.id %in% refs.test_MR$sample.id) %>%
  left_join(sample.info.lcwgs, by=c("sample.no"="sample")) %>%
  dplyr::select(fullpath,sample.id,marine_region2))
## # A tibble: 367 × 3
##    fullpath                                             sample.id marine_region2
##    <chr>                                                <chr>     <fct>         
##  1 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10408 wGOA          
##  2 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10411 wGOA          
##  3 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10412 wGOA          
##  4 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10414 wGOA          
##  5 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10415 wGOA          
##  6 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10416 wGOA          
##  7 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10417 wGOA          
##  8 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10418 wGOA          
##  9 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10419 wGOA          
## 10 /home/lspencer/pcod-lcwgs-2023/analysis-20240606/re… ABLG10420 wGOA          
## # ℹ 357 more rows
write_delim(refs.test.bams_MR,
            file = "../analysis-20240606/wgsassign/snp-testing/regions/test_bams-list_MR.txt",
            delim = "\t", col_names = F)

I rsynced the files test_bams-list_MR.txt and training_bams-list_MR.txt to the /wgsassign/snp-testing/regions/ and /wgsassign/snp-training/regions/ directories, respectively.

0.7.2 Screen SNPs for those that predict each region

  1. Get Site Allele Frequencies for each region: Edited and ran wgsassign/snp-training/regions/pcod-training-SAFs.sh
  2. Get the 2d site frequency spectrum and FST for each pairwise comparison of regions: Edited and ran wgsassign/snp-training/regions/pcod-training-2dsfs-fst.sh.
  3. Create subsets of SNPs for each pairwise comparison for testing purposes: wgsassign/snp-training/regions/pcod-training-pull-top-snps.sh, which produced these sites:
  N sites   File
  187536    training.top_50000_sites
  150156    training.top_35000_sites
  100624    training.top_20000_sites
   57378    training.top_10000_sites
   31067    training.top_5000_sites
    8406    training.top_1250_sites
    6733    training.top_1000_sites
    3369    training.top_500_sites
    1657    training.top_250_sites
     652    training.top_100_sites
     481    training.top_75_sites
     308    training.top_50_sites
      65    training.top_10_sites

0.7.3 Assess regional assignment accuracy with leave-one-out cross validation (using WGSassign)

  1. Pull GLs/columns for JUST test fish using the script wgsassign/snp-training/all-locations/subset-columns-beagle.sh, producing testing.beagle.gz
  2. Generate separate beagle files containing top N sites for test individuals using wgsassign/snp-training/all-locations/top-n-sites.sh. Here are the resulting beagle files, confirming the number of markers in each:
for file in *.gz; do
    echo "$file: $(zcat "$file" | wc -l)"
done

testing.beagle.gz: 232584
testing-top-10000.beagle.gz: 42157
testing-top-1000.beagle.gz: 4890
testing-top-100.beagle.gz: 481
testing-top-10.beagle.gz: 58
testing-top-1250.beagle.gz: 6101
testing-top-20000.beagle.gz: 74123
testing-top-250.beagle.gz: 1214
testing-top-35000.beagle.gz: 110702
testing-top-50000.beagle.gz: 138291
testing-top-5000.beagle.gz: 22752
testing-top-500.beagle.gz: 2451
testing-top-50.beagle.gz: 239
testing-top-75.beagle.gz: 362
  3. Run leave-one-out testing (to see how well each of the 13 SNP sets does at predicting our reference test fish) using the script wgsassign/snp-testing/regions/get-loo-WGSassign.sh.
  4. Transfer log-likelihood results to my local computer to explore in R, using rsync --progress --verbose --archive -r lspencer@sedna.nwfsc2.noaa.gov:/home/lspencer/pcod-lcwgs-2023/analysis-20240606/wgsassign2/snp-testing/regions/testing-LOOs/*_LOO.txt .
  5. Assess accuracy of WGSassign region assignments using our test fish of known regional origin:
regions <- (read_delim("../analysis-20240606/wgsassign/snp-testing/regions/top-10.pop_names.txt", delim = "\t", col_names = "region"))$region
testing.samples.reg <- read_delim("../analysis-20240606/wgsassign/snp-testing/regions/test-IDs.txt", col_names = c("sample", "region"))
like_loo_files.reg <- list.files(path="../analysis-20240606/wgsassign/snp-testing/regions/", pattern = "_LOO\\.txt$", full.names = TRUE)

n.top.reg <- vector()
for (i in 1:length(like_loo_files.reg)) {
  n.top.reg[i] <- as.numeric(gsub("../analysis-20240606/wgsassign/snp-testing/regions/top-|.pop_like_LOO.txt", "", like_loo_files.reg[i]))
}

like_loo_AssAcc.reg <- vector("list", length(like_loo_files.reg))
names(like_loo_AssAcc.reg) <- n.top.reg

for(i in 1:length(like_loo_files.reg)){
  infile <- like_loo_files.reg[i]
  like_loo <- read_table(infile, col_names = regions) 
  testing.samples.assigned <- cbind(testing.samples.reg, like_loo)
  testing.summary <- testing.samples.assigned %>%
  pivot_longer(cols = Aleutians:wGOA,
               names_to = "AssignedPop",
               values_to = "AssignedLike") %>%
  group_by(sample) %>%
  filter(AssignedLike == max(AssignedLike)) %>%
  ungroup() %>%
  mutate(Correct = if_else(region == AssignedPop, 1, 0))

    like_loo_AssAcc.reg[[i]] <- 
      testing.summary %>% group_by(region) %>% summarise(n.correct=sum(Correct), n.total=n()) %>%
      adorn_totals() %>% mutate(accuracy=signif(100*n.correct/n.total,digits = 2)) %>% 
      mutate(n.top=n.top.reg[i])
}

like_loo_summary.reg <- bind_rows(like_loo_AssAcc.reg, .id = "n.top")  %>% as.data.frame() %>% mutate(n.top=factor(n.top, ordered=T, levels=c(10,50,75,100,250,500,1000,1250,5000,10000,20000,35000,50000))) %>%
# add actual number of markers used 
  mutate(n.sites=case_when(
    n.top==10 ~ 58,
    n.top==50 ~ 239,
    n.top==75 ~ 362,
    n.top==100 ~ 481,
    n.top==250 ~ 1214,
    n.top==500 ~ 2451,
    n.top==1000 ~ 4890,
    n.top==1250 ~ 6101,
    n.top==5000 ~ 22752,
    n.top==10000 ~ 42157,
    n.top==20000 ~ 74123,
    n.top==35000 ~ 110702,
    n.top==50000 ~ 138291))

0.7.3.1 Here is how each marker set performed

cluster_colors.reg <- c(Aleutians="#1f78b4",
                     NBS="#33a02c",
                     eGOA="#e31a1c",
                     wGOA="orange",
                     EBS="#6a3d9a")

#require(ggrepel)
like_loo_summary.reg %>% filter(region =="Total") %>% ggplot() +
#  geom_point(aes(x = log10(snps), y = accuracy,  text=topn), size=2.5) +
  geom_label_repel(aes(x = log10(n.sites), y = accuracy,  label=n.top), size=4) +
  theme_minimal()

# Top total accuracy is 82%, using 1,250 SNPs per pair (see table below)

like_loo_summary.reg %>% filter(region!="Total") %>%
  ggplot() +
#  geom_point(aes(x = log10(snps), y = accuracy,  text=topn), size=2.5) +
  geom_bar(aes(x = n.top, y = n.correct,  fill=region), position="stack", stat="identity", color="black") +
  theme_minimal() + ggtitle("Number correct region assignments by number SNPs, test fish") +
  scale_fill_manual(values=cluster_colors.reg) + 
  xlab("Number top SNPs used per region (as per Fs)") + ylab("Number accurate assignments (361 fish total)")

# The 1,250-SNP set yields the most correct assignments overall, BUT EBS is still very bad!


ggplotly(
  like_loo_summary.reg %>% #filter(n.top!="all") %>% mutate(n.top=as.numeric(as.character(n.top))) %>%
#  mutate(n.top = as.numeric(as.character(n.top))) %>%
  ggplot(aes(x = n.sites, y = accuracy,  color=region)) +
  geom_line(cex=.75) +
  
  theme_minimal() + ggtitle("Accuracy per region by number SNPs, test fish") +
  ylab("Percent accurate assignments (361 fish total)") +
  scale_color_manual(values=cluster_colors.reg) + 
    
  scale_x_continuous(
    name = "Number of top Fst markers used", breaks=unique(like_loo_summary.reg$n.sites) %>% sort(), labels=unique(like_loo_summary.reg$n.top) %>% sort())
  )

The highest total accuracy is achieved using the top 1,250 Fst sites per region-region combo, for a total of 6,101 unique sites. The next-highest accuracy comes from the top 1,000 and top 5,000 sets (tied at 81%).

like_loo_summary.reg %>% filter(region =="Total") %>% arrange(desc(accuracy)) %>% kable()
region n.correct n.total accuracy n.top n.sites
Total 295 361 82 1250 6101
Total 292 361 81 1000 4890
Total 292 361 81 5000 22752
Total 283 361 78 500 2451
Total 276 361 76 10000 42157
Total 275 361 76 250 1214
Total 264 361 73 100 481
Total 260 361 72 50 239
Total 260 361 72 75 362
Total 249 361 69 20000 74123
Total 242 361 67 10 58
Total 236 361 65 35000 110702
Total 224 361 62 50000 138291

Here is the accuracy breakdown by Marine Region for those top three sets of markers. It looks like the top 1,250 marker set is indeed best.

like_loo_summary.reg %>% filter(n.top %in% c("1250", "1000", "5000")) %>% arrange(desc(accuracy)) %>% select(region, n.top, accuracy) %>%
  pivot_wider(names_from = n.top, values_from = accuracy) %>% 
  kable()
region 1000 1250 5000
eGOA 100 100 95
wGOA 95 97 99
NBS 94 94 82
Aleutians 91 92 90
Total 81 82 81
EBS 0 0 0

0.7.3.2 Here is a breakdown of how each fish of known Marine Region was assigned, and for erroneous ones, to which Marine Region were they assigned?

like_loo_reg.1250 <- read_table("../analysis-20240606/wgsassign/snp-testing/regions/top-1250.pop_like_LOO.txt", col_names = regions) 
testing.summary.1250 <- cbind(testing.samples.reg, like_loo_reg.1250) %>%  
  pivot_longer(cols = Aleutians:wGOA,
               names_to = "AssignedPop",
               values_to = "AssignedLike") %>%
  group_by(sample) %>%
  filter(AssignedLike == max(AssignedLike)) %>%
  ungroup() %>%
  mutate(Correct = if_else(region == AssignedPop, 1, 0))

testing.summary.1250.plot <- testing.summary.1250 %>% 
  group_by(region, AssignedPop) %>% tally()

0.7.3.3 How many assignments are accurate?

The below figure shows the % of assignments that were accurate. Here are the Type I error / false positive rates:

- Aleutians: 6%
- EBS: 0%
- NBS: 0%
- wGOA: 28% (most of that 28% is EBS fish)
- eGOA: 7%

testing.summary.1250.type1 <- testing.summary.1250.plot %>% 
  mutate(correct=as.factor(case_when(
    AssignedPop == region ~ "correct", TRUE ~ "incorrect"))) %>% 
  group_by(AssignedPop, correct) %>% summarize(n=sum(n)) %>%
  pivot_wider(names_from = correct, values_from = n) %>%
  replace(is.na(.), 0) %>% 
  mutate(total=correct+incorrect) %>% 
  mutate(accuracy = percent(correct/(total)))


testing.summary.1250.plot %>% 
  ggplot() +
  geom_bar(aes(x=AssignedPop, y=n, fill=region), 
           position="stack", stat="identity", color="black") +

     geom_text(data=testing.summary.1250.type1, # accuracy is already %-formatted by scales::percent()
               aes(x=AssignedPop, y = total + 7, label = accuracy),
            size = 3, vjust = 0) +
  
    theme_minimal() + 
  ggtitle("Percent of regional assignments that are likely accurate") +
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  scale_fill_manual(values=cluster_colors.reg, name="Actual Region\nof Origin") + 
  ylab("Number of fish assigned") + xlab("Assigned Region") +
  theme(title = element_text(size=10))

testing.summary.1250.plot %>% select(AssignedPop, region, n) %>% arrange(AssignedPop) %>%
  rename("Assignment"="AssignedPop", "Origin"="region", "Number"="n") %>% kable() 
Assignment Origin Number
Aleutians Aleutians 90
Aleutians EBS 3
Aleutians NBS 1
Aleutians wGOA 2
NBS NBS 16
eGOA eGOA 41
eGOA wGOA 3
wGOA Aleutians 8
wGOA EBS 49
wGOA wGOA 148

0.7.3.4 How many fish from each region were correctly assigned?

The below figure shows the % of fish that were assigned correctly. Here are the Type II error / false negative rates:

- Aleutians: 8%
- EBS: 100% (most were assigned to wGOA)
- eGOA: 0%
- NBS: 6%
- wGOA: 3%

assign.acc.1250 <-like_loo_summary.reg %>% filter(n.top == "1250") %>% 
  arrange(desc(accuracy)) %>% select(region, accuracy) %>%
  filter(region!="Total") %>% 
  left_join(testing.summary.1250.plot %>% group_by(region) %>%
              summarize(n=sum(n)))

testing.summary.1250.plot %>% 
  ggplot() +
  geom_bar(aes(x=as.factor(region), y=n, fill=AssignedPop), 
           position="stack", stat="identity", color="black") +
    
     geom_text(data=assign.acc.1250, aes(x = as.factor(region), y = n + 7, label = paste0(accuracy, "%")), 
            size = 3, vjust = 0) + 
  
    theme_minimal() + 
  ggtitle("Percent of fish from each region that were correctly assigned\nThis uses 1,250 top Fst SNPs from each combo, 6,101 markers total") +
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  scale_fill_manual(values=cluster_colors.reg, name="Assigned Region\nof Origin") + 
  ylab("Number of fish assigned") + xlab("Actual Region of Origin") +
  theme(title = element_text(size=10))

Conclusion: The set of 6,101 markers assigns fish to the Aleutians, NBS, wGOA, and eGOA with high accuracy, but it is NOT capable of assigning EBS fish correctly; most EBS fish were assigned to wGOA.

0.8 Assign experimental fish to Marine Region

I realize now that I should NOT use the experimental.beagle.gz file to run this analysis, as some of its GLs likely have major/minor alleles swapped relative to the reference data (which would probably mess up assignments). So, before running WGSassign I create an experimental beagle file from my merged file:

# get experimental sample IDs to subset 
cat refs-exp-merged_samples-order.txt | grep "GM" > exp_IDs.txt
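From there, the remaining steps are presumably the same subsetting/filtering used for the test fish; here is a hedged sketch (the subset-beagle.sh invocation and all file names except the .pop_like output are hypothetical):

# Subset the merged beagle to experimental-fish columns (same awk logic as
# subset-beagle.sh, sketched earlier), keep the top-1250 regional sites, then assign
./subset-beagle.sh wgsassign/refs-exp-merged.beagle.gz exp_IDs.txt > exp-merged.beagle.gz
awk 'NR == FNR { keep[$1"_"$2] = 1; next } FNR == 1 || ($1 in keep)' \
    training.top_1250_sites <(zcat exp-merged.beagle.gz) | \
  gzip > pcod-exp-top-1250.beagle.gz
WGSassign --beagle pcod-exp-top-1250.beagle.gz \
          --pop_af_file refs-regions-top-1250.pop_af.npy \
          --get_pop_like --out pcod-experimental-assign_top-1250-snps --threads 20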
like_loo_exp.reg <- read_table("../analysis-20240606/wgsassign/assignment/pcod-experimental-assign_top-1250-snps.pop_like.txt", col_names = regions) %>%  mutate(across(Aleutians:wGOA, as.numeric)) 
like_loo_exp.reg[like_loo_exp.reg == "-Inf"] <- NA
like_loo_exp.reg[like_loo_exp.reg == "NaN"] <- NA

exp.assigned.reg <- cbind(
  read_delim("../analysis-20240606/wgsassign/assignment/exp_IDs.txt", delim = "\t", col_names = "sample.full") %>%
  mutate(sample=as.numeric(gsub("GM", "", sample.full))) %>% 
  left_join(sample.info.lcwgs), 
  like_loo_exp.reg)

# How many samples have NA values? 
exp.assigned.reg %>%  filter(if_any(Aleutians:wGOA, is.na)) %>%  group_by(treatment) %>% tally()
## # A tibble: 3 × 2
##   treatment     n
##   <fct>     <int>
## 1 0             1
## 2 5             2
## 3 16            1
na.reg <- (exp.assigned.reg %>%  filter(if_any(Aleutians:wGOA, is.na)) %>%  group_by(treatment))$sample

exp.assign.summary.reg <- exp.assigned.reg %>%
  mutate(across(Aleutians:wGOA, as.numeric)) %>% 
  pivot_longer(cols = Aleutians:wGOA,
               names_to = "AssignedPop",
               values_to = "AssignedLike") %>%
  group_by(sample) %>%
  filter(sample %!in% na.reg) %>%
  mutate(AssignedProb = round(exp(AssignedLike - max(AssignedLike)) / sum(exp(AssignedLike - max(AssignedLike))),2 )) %>%
  filter(AssignedLike == max(AssignedLike)) %>%
  ungroup() 
# write_csv(x = nonbreeding.summary,
#           file = "./output/amre.nonbreeding.ind148.ds_2x.sites-filter.top_50_each.assignment_summary.csv")

# Summarize assignment probabilities by assigned region

exp.assign.summary.reg %>% group_by(AssignedPop) %>% 
  summarize(min.prob=min(AssignedProb),
            max.prob=max(AssignedProb),
            mean.prob=mean(AssignedProb))
## # A tibble: 1 × 4
##   AssignedPop min.prob max.prob mean.prob
##   <chr>          <dbl>    <dbl>     <dbl>
## 1 wGOA               1        1         1
exp.assign.summary.reg %>%
 # filter(AssignedProb > 0.8) %>%
#  nrow() %>%
  group_by(AssignedPop) %>% tally()# %>% write_clip()
## # A tibble: 1 × 2
##   AssignedPop     n
##   <chr>       <int>
## 1 wGOA          153
exp.assign.summary.reg %>%
  group_by(AssignedPop) %>%
  summarize(N = n()) %>%
  ungroup() %>%
  ggplot(aes(x="", y=N, fill = AssignedPop)) +
  geom_bar(stat = "identity", width = 1) +
  coord_polar("y", start = 0) +
  scale_fill_manual(values = cluster_colors.reg, name="Assigned Marine Region") +
  xlab(NULL) + ylab(NULL) + #theme_void() +
  ggtitle("Experimental Fish Assignments\nusing top 1,250 Fst markers (6,101 total)")

0.8.1 Deriving micro-haplotypes from lcWGS-derived genotype likelihoods

0.8.1.1 ZP3

I’m interested in the genetic differences among our juvenile Pacific cod specifically on the ZP3 gene. So, here I pull haplotypes (bp consensus sequences) for each fish for only the ZP3 gene region using angsd -dohaplocall.

Here I load samtools, create a variable with sample IDs, extract alignment data for the ZP3 gene, then sort and index the zp3.bam files. Then I load the angsd program and generate a text file listing the zp3.bam files. Finally, I run angsd -dohaplocall with the option -dohaplocall 1 to generate single base calls per sample, -doCounts 1 to “count the number A,C,G,T. All sites, All samples,” -r NC_082390.1:1867790-1875256 to look only at the ZP3 region on NC_082390.1 (Chromosome 9) (I could have specified the ZP3 gene region directly with angsd and didn’t have to extract bams, BUT having the alignment files is nice for viewing in IGV), and -minMinor 1 to set the minimum number of observed minor alleles.

base=/home/lspencer/pcod-lcwgs-2023/analysis-20240606
ref_list=${base}/reference/pcod-refs_filtered_bamslist.txt
exp_list=${base}/experimental/pcod-exp_filtered_bamslist.txt

module load bio/angsd
# Single-base haplotype calls across the ZP3 region: reference fish, then experimental fish
angsd -bam ${ref_list} -dohaplocall 1 -doCounts 1 -r NC_082390.1:1867790-1875256 -minMinor 1 -out ${base}/reference/refs-zp3-haps
angsd -bam ${exp_list} -dohaplocall 1 -doCounts 1 -r NC_082390.1:1867790-1875256 -minMinor 1 -out ${base}/experimental/exps-zp3-haps

I rsynced the resulting files, experimental/exps-zp3-haps.haplo.gz and reference/refs-zp3-haps.haplo.gz, gunzipped them, and also rsynced the bamslists (for sample order) to my local computer.

refs.order <- read_delim("../analysis-20240606/microhaps/pcod-refs_filtered_bamslist.txt", delim = "\t", col_names = "bam") %>% 
  mutate(sample=gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/reference/bamtools/pcod-refs_|_sorted_dedup_clipped.bam", "", bam))

exp.order <- read_delim("../analysis-20240606/microhaps/pcod-exp_filtered_bamslist.txt", delim = "\t", col_names = "bam") %>% 
  mutate(sample=gsub("/home/lspencer/pcod-lcwgs-2023/analysis-20240606/experimental/bamtools/pcod-exp_|_sorted_dedup_clipped.bam", "", bam))
  
zp3.haplos <- read_delim("../analysis-20240606/microhaps/refs-zp3-haps.haplo", delim = "\t", skip = 1, col_names = c("chr", "pos", "major", refs.order$sample)) %>% #read in reference haplos 
  left_join(read_delim("../analysis-20240606/microhaps/exps-zp3-haps.haplo", delim = "\t", skip = 1, col_names = c("chr", "pos", "major", exp.order$sample)), by=c("chr", "pos", "major"))

But where exactly are the polymorphic sites? I searched Ingrid’s large list of ~2M SNPs in reference fish using her .beagle GLs file:

cat /share/afsc/temporary/AAcod_wholegenome_polymorphic_filtered2.beagle | cut -f1 | sed 's/_/\t/g' | grep "082390.1" | awk -F'\t' '$3 >= 1867790 && $3 <= 1875256'
NC      082390.1        1870572
NC      082390.1        1870654
NC      082390.1        1871844
NC      082390.1        1872046
NC      082390.1        1872249
NC      082390.1        1872275
NC      082390.1        1872571
NC      082390.1        1872663
NC      082390.1        1872670
NC      082390.1        1872800
NC      082390.1        1872816
NC      082390.1        1872915
NC      082390.1        1872935
NC      082390.1        1872938
NC      082390.1        1873061
NC      082390.1        1873661

I don’t know if these markers correspond to Ingrid’s 9 SNPs, and if so which ones. Figure that out!

zp3.haplos.summ <- zp3.haplos %>%  filter(pos %in% c("1870572", "1870654", "1871844", "1872046", "1872249","1872275","1872571","1872663","1872670","1872800","1872816","1872915","1872935","1872938","1873061","1873661")) %>%
  tidyr::unite(chr, pos, col="marker", sep = "_", remove = T) %>% column_to_rownames("marker") %>% t() %>% as.data.frame() %>% rownames_to_column("sample") %>% 

  pivot_longer(cols = -sample, names_to = "locus", values_to = "allele") %>%
  #mutate(locus=gsub("-", "loc.", locus)) %>%
  mutate_at("allele", function(x) {
    case_when(
    x == "A" ~ "T",
    x == "T" ~ "A",
    x == "C" ~ "G",
    x == "G" ~ "C",
    x == "N" ~ "N")}) %>%
  pivot_wider(names_from = "locus", values_from = "allele") %>% 
  unite("haplo.lcwgs", 2:ncol(.), sep="", remove = F) %>% 
  mutate(haplo.lcwgs=as.factor(haplo.lcwgs)) %>% mutate(sample.simple=as.numeric(gsub("ABLG|GM", "", sample))) %>%
  left_join(sample.info.lcwgs %>% select(sample, treatment, marine_region2), by=c("sample.simple"="sample")) %>%
  mutate(group=case_when(
    treatment %in% c(0,5,9,16) ~ "experimental",
    sample=="major" ~ "major_haplo",
    TRUE ~ marine_region2)) %>% 
  mutate(group=factor(group, ordered=T, levels=c("Aleutians", "NBS", "EBS", "wGOA", "eGOA", "experimental", "major_haplo")))
  
zp3.haplos.summ %>% 
  group_by(group, haplo.lcwgs) %>% tally() %>% 
  filter(n>2) %>% 
  ggplot() +
  geom_bar(aes(x=as.factor(group), y=n, fill=haplo.lcwgs), 
           position="stack", stat="identity", color="black") +
    
  theme_minimal() + 
  ggtitle("ZP3 polymorphic microhaps by region/experimental fish\n(microhaps with 3+ fish)") +
   theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust=1)) + 
  scale_fill_manual(values=c("#1f78b4", "#33a02c", "#e31a1c", "#ff7f00",
                     "#6a3d9a", "#b15928","black", "gray70", "darkgreen",
                     "magenta", "yellow"), name="Microhap,\npolymorphic sites in ref fish") +
  ylab("Number of fish") + xlab("Region of Origin") +
  theme(title = element_text(size=10)) +
  annotate("text", label=paste("Major microhap: ", "\n", as.character((zp3.haplos.summ %>% filter(sample=="major"))$haplo.lcwgs), sep=""), x=2, y=50, col=darken("#ff7f00", amount = .2))