---
title: "Creating a workflow for genome-environment-association analyses with RADseq data"
author: "Lara Breitkreutz"
format:
  revealjs:
    incremental: true 
editor: visual
---

## Project Goal

- create a workflow for conducting genotype-environment association (GEA) analyses using RAD-seq data

- in the future, I will use this workflow for identifying candidate loci associated with components of thermal variation

## Methods: alignment

-   Aligned sequences to a reference *Zostera marina* genome using the BWA MEM function in the Burrow-Wheeler Aligner software (Li and Durbin, 2009).

## Methods: alignment

``` bash
## Aligning

# a for loop that aligns each .fq file according to name of sample and outputs a .sam file
for i in ${ind_ID};
do
  bwa mem -t 20 -T 30 $REF $INPUT/${i}.fq > $OUTPUT1/${i}.sam 2>> $REPORT
done

conda deactivate #deactivate conda env

## Sorting
conda activate samtools_env

# a for loop that sorts each .sam file (using samtools) and creates a .bam file
for i in ${ind_ID};
do
  samtools view -b -q 30 $OUTPUT1/${i}.sam > $OUTPUT1/${i}.bam
  samtools sort $OUTPUT1/${i}.bam > $OUTPUT2/${i}.bam
  samtools index $OUTPUT2/${i}.bam
```

## Methods: genotyping

-   Genotyping polymorphic loci using the ref_map function in STACKS.
-   Important input file for this script is the text file that stores the samples and their corresponding collection site and region

## Methods: genotyping

| sample | collection site |    region |
|:-------|:---------------:|----------:|
| B001   |     Belfair     | HoodCanal |
| B002   |     Belfair     | HoodCanal |
| ...    |       ...       |       ... |
| W049   |   Willapa Bay   |     Coast |
| W050   |   Willapa Bay   |     Coast |

## Methods: genotyping

``` bash
# Run the reference-based Stacks pipeline
ref_map.pl \
--samples ${ALL_DEDUP} \
--popmap ${POPMAP} \
-o ${REFMAP_DIR} \
-T 20 \
-X "populations: --vcf" \
-X "populations: --ordered-export"
#&>> $REPORT/refmap_RAD_${DATASET}.txt
```

## Methods: environmental data simulation

- Awaiting arrival of environmental data I have requested (and request has been granted)
- In the mean time, simulating some...

## Methods: environmental data simulation

``` r
# function for simulating temperatures 
simulate_temps <- function() {
  base_temps <- c(
    6.5, 7.1, 8.3, 10.2, 12.8, 15.5,  # Monthly averages for nearshore areas
    17.5, 18.3, 16.1, 12.5, 9.1, 7.2   # Based on Salish Sea observations
  )
  
  tibble(date = seq.Date(as.Date("2025-01-01"), by = "day", length.out = 365)) %>%
    mutate(
      month = month(date),
      temp = base_temps[month] + rnorm(365, 0, 1.0)  # Simplified noise
    )
}
```

## Methods: environmental data simulation

``` r
# calculate seasonal temp metrics
seasonal_data <- sites %>%
  mutate(
    temp_data = map(1:n(), ~simulate_temps()),
    metrics = map(temp_data, ~{
      .x %>%
        mutate(season = case_when(
          month(date) %in% 3:5 ~ "Spring",
          month(date) %in% 6:8 ~ "Summer",
          month(date) %in% 9:11 ~ "Fall",
          TRUE ~ "Winter"
        )) %>%
        group_by(season) %>%
        summarise(
          avg_temp = mean(temp),
          max_temp = max(temp),
          min_temp = min(temp),
          temp_range = max_temp - min_temp,
          warm_anomaly_days = sum(temp > quantile(temp, 0.9)),
          cold_anomaly_days = sum(temp < quantile(temp, 0.1)),
          cumulative_gdd = sum(pmax(temp - 5, 0))
        )
    })
  ) %>%
  unnest(metrics) %>%
  select(-temp_data)
```

## Preliminary results: alignment

-   .sam (sequence alignment map) files stores information for sequences mapped to the reference genome.
-   contain a header with reference sequence name and length:

## Preliminary results: alignment

![](images/samheader.png){width="50%"}

## Preliminary results: alignment

-   .bam (binary alignment map) files are the compressed binary version of .sam files stores information for sequences mapped to the reference genome.
-   their reduced file size improves runtime for gstacks.

## Preliminary results: genotyping

-   gstacks is currently running! ...

## Preliminary results: genotyping

![](images/gstackstail.png){width="50%"} 

## Preliminary results: simulating environmental data

```{r}
library(psych)    # Used to investigate correlations among predictors
library(vegan)    # Used to run RDA
library(dplyr)
library(purrr)
library(lubridate) 

# create sites
sites <- tibble(site_id = paste0("Site_", 1:15)) #15 sites; change once you know how many sites Bryan is providing 

# function for simulating temperatures 
simulate_temps <- function() {
  base_temps <- c(
    6.5, 7.1, 8.3, 10.2, 12.8, 15.5,  # Monthly averages for nearshore areas
    17.5, 18.3, 16.1, 12.5, 9.1, 7.2   # Based on Salish Sea observations
  )
  
  tibble(date = seq.Date(as.Date("2025-01-01"), by = "day", length.out = 365)) %>%
    mutate(
      month = month(date),
      temp = base_temps[month] + rnorm(365, 0, 1.0)  # Simplified noise
    )
}

# calculate seasonal temp metrics
seasonal_data <- sites %>%
  mutate(
    temp_data = map(1:n(), ~simulate_temps()),
    metrics = map(temp_data, ~{
      .x %>%
        mutate(season = case_when(
          month(date) %in% 3:5 ~ "Spring",
          month(date) %in% 6:8 ~ "Summer",
          month(date) %in% 9:11 ~ "Fall",
          TRUE ~ "Winter"
        )) %>%
        group_by(season) %>%
        summarise(
          avg_temp = mean(temp),
          max_temp = max(temp),
          min_temp = min(temp),
          temp_range = max_temp - min_temp,
          warm_anomaly_days = sum(temp > quantile(temp, 0.9)),
          cold_anomaly_days = sum(temp < quantile(temp, 0.1)),
          cumulative_gdd = sum(pmax(temp - 5, 0))
        )
    })
  ) %>%
  tidyr::unnest(metrics) %>%
  select(-temp_data)

# Preview data
#head(seasonal_data)

pairs.panels(seasonal_data[,3:9], scale=T)
```

## Next four weeks

-   Filtering

    -   gstacks will produce .vcf files (variant calls and genotypes)
    -   remove low-quality variants and genotypes

-   Calculate population allele frequencies

## ...continued

-   Process and summarize environmental data at the population level (awaiting access, but I have simulated data if need be)

-   Build GEA analyses! Likely 2 separate analyses for cross-validation

-   Build workflow for identifing candidate loci