---
title: "11-pav"
subtitle: "Present-Absent Variation Analysis"
output: html_document
date: "2025-12-07"
---

# Present-Absent Variation (PAV) Analysis

This notebook identifies genomic regions that are present or absent in each sample 
relative to the reference genome (GCF_016432855.1_SaNama_1.0). PAV analysis uses 
coverage depth from aligned PacBio HiFi reads to identify:

1. **Absent regions**: Genomic regions with zero/very low coverage (potential deletions)
2. **Present regions**: Genomic regions with sufficient coverage
3. **Insertions**: Novel sequences in samples not in reference (from CIGAR analysis)

## Setup

Define samples and paths:

```{r setup}
# Define samples
samples <- c("bc2041", "bc2068", "bc2069", "bc2070", "bc2071", "bc2072", "bc2096")

# Paths
bam_dir <- "../analyses/05-pacbio-align"
output_dir <- "../analyses/11-pav"
reference <- "../data/GCF_016432855.1_SaNama_1.0_genomic.fa"
genes_bed <- "../data/20220818-snam-GCF_016432855.1_SaNama_1.0_genes.bed"

# Create output directory
dir.create(output_dir, showWarnings = FALSE, recursive = TRUE)
```

## Step 1: Generate Reference Genome Index and Chromosome Sizes

```{bash create-genome-files}
# Index reference if not already done
if [ ! -f ../data/GCF_016432855.1_SaNama_1.0_genomic.fa.fai ]; then
    samtools faidx ../data/GCF_016432855.1_SaNama_1.0_genomic.fa
fi

# Create chromosome sizes file
cut -f1,2 ../data/GCF_016432855.1_SaNama_1.0_genomic.fa.fai > ../analyses/11-pav/genome.chrom.sizes

echo "Chromosome sizes file created:"
head ../analyses/11-pav/genome.chrom.sizes
```

## Step 2: Calculate Per-Base Coverage with mosdepth

Run mosdepth for each sample to get coverage information:

```{bash mosdepth-bc2041}
SAMPLE=bc2041
BAM=../analyses/05-pacbio-align/${SAMPLE}.bam
OUT_PREFIX=../analyses/11-pav/${SAMPLE}

# Run mosdepth - generates per-base coverage
mosdepth \
  --threads 16 \
  --no-per-base \
  --by 1000 \
  ${OUT_PREFIX} \
  ${BAM}

echo "mosdepth complete for ${SAMPLE}"
ls -la ../analyses/11-pav/${SAMPLE}*
```

```{bash mosdepth-bc2068}
SAMPLE=bc2068
BAM=../analyses/05-pacbio-align/${SAMPLE}.bam
OUT_PREFIX=../analyses/11-pav/${SAMPLE}

mosdepth \
  --threads 16 \
  --no-per-base \
  --by 1000 \
  ${OUT_PREFIX} \
  ${BAM}

echo "mosdepth complete for ${SAMPLE}"
```

```{bash mosdepth-bc2069}
SAMPLE=bc2069
BAM=../analyses/05-pacbio-align/${SAMPLE}.bam
OUT_PREFIX=../analyses/11-pav/${SAMPLE}

mosdepth \
  --threads 16 \
  --no-per-base \
  --by 1000 \
  ${OUT_PREFIX} \
  ${BAM}

echo "mosdepth complete for ${SAMPLE}"
```

```{bash mosdepth-bc2070}
SAMPLE=bc2070
BAM=../analyses/05-pacbio-align/${SAMPLE}.bam
OUT_PREFIX=../analyses/11-pav/${SAMPLE}

mosdepth \
  --threads 16 \
  --no-per-base \
  --by 1000 \
  ${OUT_PREFIX} \
  ${BAM}

echo "mosdepth complete for ${SAMPLE}"
```

```{bash mosdepth-bc2071}
SAMPLE=bc2071
BAM=../analyses/05-pacbio-align/${SAMPLE}.bam
OUT_PREFIX=../analyses/11-pav/${SAMPLE}

mosdepth \
  --threads 16 \
  --no-per-base \
  --by 1000 \
  ${OUT_PREFIX} \
  ${BAM}

echo "mosdepth complete for ${SAMPLE}"
```

```{bash mosdepth-bc2072}
SAMPLE=bc2072
BAM=../analyses/05-pacbio-align/${SAMPLE}.bam
OUT_PREFIX=../analyses/11-pav/${SAMPLE}

mosdepth \
  --threads 16 \
  --no-per-base \
  --by 1000 \
  ${OUT_PREFIX} \
  ${BAM}

echo "mosdepth complete for ${SAMPLE}"
```

```{bash mosdepth-bc2096}
SAMPLE=bc2096
BAM=../analyses/05-pacbio-align/${SAMPLE}.bam
OUT_PREFIX=../analyses/11-pav/${SAMPLE}

mosdepth \
  --threads 16 \
  --no-per-base \
  --by 1000 \
  ${OUT_PREFIX} \
  ${BAM}

echo "mosdepth complete for ${SAMPLE}"
```

## Step 3: Identify Absent Regions (Zero Coverage)

Extract regions with zero coverage (absent/deleted in sample):

```{bash extract-absent-regions}
OUTPUT_DIR=../analyses/11-pav

for SAMPLE in bc2041 bc2068 bc2069 bc2070 bc2071 bc2072 bc2096; do
    echo "Processing ${SAMPLE}..."
    
    # Decompress and filter for zero coverage regions (column 4 = mean coverage)
    zcat ${OUTPUT_DIR}/${SAMPLE}.regions.bed.gz | \
    awk -v OFS='\t' '$4 == 0 {print $1, $2, $3, "absent", $4, "."}' > \
    ${OUTPUT_DIR}/${SAMPLE}.absent_regions.bed
    
    # Merge adjacent zero-coverage windows into continuous absent regions
    bedtools merge -i ${OUTPUT_DIR}/${SAMPLE}.absent_regions.bed | \
    awk -v OFS='\t' '{print $1, $2, $3, "absent_region", 0, "."}' > \
    ${OUTPUT_DIR}/${SAMPLE}.absent_regions_merged.bed
    
    echo "${SAMPLE} absent regions:"
    wc -l ${OUTPUT_DIR}/${SAMPLE}.absent_regions_merged.bed
done
```

## Step 4: Identify Present Regions (With Coverage)

Extract regions with coverage > 0 (present in sample):

```{bash extract-present-regions}
OUTPUT_DIR=../analyses/11-pav
MIN_COVERAGE=1  # Minimum coverage to consider region "present"

for SAMPLE in bc2041 bc2068 bc2069 bc2070 bc2071 bc2072 bc2096; do
    echo "Processing ${SAMPLE}..."
    
    # Filter for regions with coverage >= MIN_COVERAGE
    zcat ${OUTPUT_DIR}/${SAMPLE}.regions.bed.gz | \
    awk -v min_cov=${MIN_COVERAGE} -v OFS='\t' '$4 >= min_cov {print $1, $2, $3, "present", $4, "."}' > \
    ${OUTPUT_DIR}/${SAMPLE}.present_regions.bed
    
    # Merge adjacent present windows
    bedtools merge -i ${OUTPUT_DIR}/${SAMPLE}.present_regions.bed | \
    awk -v OFS='\t' '{print $1, $2, $3, "present_region", 0, "."}' > \
    ${OUTPUT_DIR}/${SAMPLE}.present_regions_merged.bed
    
    echo "${SAMPLE} present regions:"
    wc -l ${OUTPUT_DIR}/${SAMPLE}.present_regions_merged.bed
done
```

## Step 5: Identify Insertions from BAM Alignments

Extract insertions (novel sequences in samples not in reference) using CIGAR strings.
Insertions >= 50bp are considered significant:

```{bash extract-insertions}
OUTPUT_DIR=../analyses/11-pav
BAM_DIR=../analyses/05-pacbio-align
MIN_INS_SIZE=50  # Minimum insertion size to report

for SAMPLE in bc2041 bc2068 bc2069 bc2070 bc2071 bc2072 bc2096; do
    echo "Extracting insertions for ${SAMPLE}..."
    
    # Extract insertions from CIGAR strings
    # Using samtools and awk to parse CIGAR for insertions (I operations)
    samtools view ${BAM_DIR}/${SAMPLE}.bam | \
    awk -v min_size=${MIN_INS_SIZE} -v OFS='\t' '
    BEGIN { 
        # No header needed
    }
    {
        chrom = $3
        pos = $4
        cigar = $6
        
        # Skip unmapped reads
        if (chrom == "*") next
        
        # Parse CIGAR string for insertions
        ref_pos = pos
        while (match(cigar, /([0-9]+)([MIDNSHP=X])/, arr)) {
            len = arr[1]
            op = arr[2]
            
            if (op == "I" && len >= min_size) {
                # Report insertion at current reference position
                print chrom, ref_pos, ref_pos+1, "insertion_" len "bp", len, "+"
            }
            
            # Advance reference position for M, D, N, =, X operations
            if (op == "M" || op == "D" || op == "N" || op == "=" || op == "X") {
                ref_pos += len
            }
            
            # Remove matched portion from cigar
            cigar = substr(cigar, RSTART + RLENGTH)
        }
    }' | \
    sort -k1,1 -k2,2n > ${OUTPUT_DIR}/${SAMPLE}.insertions_raw.bed
    
    # Merge nearby insertions and count
    bedtools merge -i ${OUTPUT_DIR}/${SAMPLE}.insertions_raw.bed -c 4,5 -o count,sum | \
    awk -v OFS='\t' '{print $1, $2, $3, "insertion_cluster", $4, "+", $5}' > \
    ${OUTPUT_DIR}/${SAMPLE}.insertions.bed
    
    echo "${SAMPLE} insertions (>=${MIN_INS_SIZE}bp):"
    wc -l ${OUTPUT_DIR}/${SAMPLE}.insertions.bed
done
```

## Step 6: Identify Deletions from BAM Alignments

Extract deletions (sequences in reference absent from sample) using CIGAR strings:

```{bash extract-deletions}
OUTPUT_DIR=../analyses/11-pav
BAM_DIR=../analyses/05-pacbio-align
MIN_DEL_SIZE=50  # Minimum deletion size to report

for SAMPLE in bc2041 bc2068 bc2069 bc2070 bc2071 bc2072 bc2096; do
    echo "Extracting deletions for ${SAMPLE}..."
    
    # Extract deletions from CIGAR strings (D operations)
    samtools view ${BAM_DIR}/${SAMPLE}.bam | \
    awk -v min_size=${MIN_DEL_SIZE} -v OFS='\t' '
    {
        chrom = $3
        pos = $4
        cigar = $6
        
        # Skip unmapped reads
        if (chrom == "*") next
        
        # Parse CIGAR string for deletions
        ref_pos = pos
        while (match(cigar, /([0-9]+)([MIDNSHP=X])/, arr)) {
            len = arr[1]
            op = arr[2]
            
            if (op == "D" && len >= min_size) {
                # Report deletion spanning reference coordinates
                print chrom, ref_pos, ref_pos + len, "deletion_" len "bp", len, "+"
            }
            
            # Advance reference position for M, D, N, =, X operations
            if (op == "M" || op == "D" || op == "N" || op == "=" || op == "X") {
                ref_pos += len
            }
            
            # Remove matched portion from cigar
            cigar = substr(cigar, RSTART + RLENGTH)
        }
    }' | \
    sort -k1,1 -k2,2n > ${OUTPUT_DIR}/${SAMPLE}.deletions_raw.bed
    
    # Merge overlapping deletions
    bedtools merge -i ${OUTPUT_DIR}/${SAMPLE}.deletions_raw.bed -c 4,5 -o count,sum | \
    awk -v OFS='\t' '{print $1, $2, $3, "deletion_cluster", $4, "+", $5}' > \
    ${OUTPUT_DIR}/${SAMPLE}.deletions.bed
    
    echo "${SAMPLE} deletions (>=${MIN_DEL_SIZE}bp):"
    wc -l ${OUTPUT_DIR}/${SAMPLE}.deletions.bed
done
```

## Step 7: Create Combined PAV Feature Files

Combine all PAV features (absent, present, insertions, deletions) into single files per sample:

```{bash create-combined-pav}
OUTPUT_DIR=../analyses/11-pav

for SAMPLE in bc2041 bc2068 bc2069 bc2070 bc2071 bc2072 bc2096; do
    echo "Creating combined PAV file for ${SAMPLE}..."
    
    # Create header
    echo -e "#chrom\tstart\tend\tfeature_type\tscore\tstrand" > ${OUTPUT_DIR}/${SAMPLE}.pav_features.bed
    
    # Add absent regions
    awk -v OFS='\t' '{print $1, $2, $3, "ABSENT", $3-$2, "."}' \
        ${OUTPUT_DIR}/${SAMPLE}.absent_regions_merged.bed >> ${OUTPUT_DIR}/${SAMPLE}.pav_features.bed
    
    # Add insertions
    if [ -s ${OUTPUT_DIR}/${SAMPLE}.insertions.bed ]; then
        awk -v OFS='\t' '{print $1, $2, $3, "INSERTION", $5, $6}' \
            ${OUTPUT_DIR}/${SAMPLE}.insertions.bed >> ${OUTPUT_DIR}/${SAMPLE}.pav_features.bed
    fi
    
    # Add deletions  
    if [ -s ${OUTPUT_DIR}/${SAMPLE}.deletions.bed ]; then
        awk -v OFS='\t' '{print $1, $2, $3, "DELETION", $5, $6}' \
            ${OUTPUT_DIR}/${SAMPLE}.deletions.bed >> ${OUTPUT_DIR}/${SAMPLE}.pav_features.bed
    fi
    
    # Sort the combined file (excluding header)
    (head -1 ${OUTPUT_DIR}/${SAMPLE}.pav_features.bed && \
     tail -n +2 ${OUTPUT_DIR}/${SAMPLE}.pav_features.bed | sort -k1,1 -k2,2n) > \
     ${OUTPUT_DIR}/${SAMPLE}.pav_features_sorted.bed
    
    mv ${OUTPUT_DIR}/${SAMPLE}.pav_features_sorted.bed ${OUTPUT_DIR}/${SAMPLE}.pav_features.bed
    
    echo "${SAMPLE} combined PAV features:"
    grep -v "^#" ${OUTPUT_DIR}/${SAMPLE}.pav_features.bed | wc -l
done
```

## Step 8: Intersect PAV Features with Gene Annotations

Identify which genes overlap with PAV regions:

```{bash intersect-genes}
OUTPUT_DIR=../analyses/11-pav
GENES_BED=../data/20220818-snam-GCF_016432855.1_SaNama_1.0_genes.bed

for SAMPLE in bc2041 bc2068 bc2069 bc2070 bc2071 bc2072 bc2096; do
    echo "Intersecting ${SAMPLE} PAV with genes..."
    
    # Get PAV features without header
    grep -v "^#" ${OUTPUT_DIR}/${SAMPLE}.pav_features.bed > ${OUTPUT_DIR}/${SAMPLE}.pav_noheader.bed
    
    # Intersect absent regions with genes
    bedtools intersect \
        -a ${GENES_BED} \
        -b ${OUTPUT_DIR}/${SAMPLE}.absent_regions_merged.bed \
        -wa -wb > ${OUTPUT_DIR}/${SAMPLE}.genes_in_absent_regions.bed
    
    # Intersect deletions with genes (if file exists and is not empty)
    if [ -s ${OUTPUT_DIR}/${SAMPLE}.deletions.bed ]; then
        bedtools intersect \
            -a ${GENES_BED} \
            -b ${OUTPUT_DIR}/${SAMPLE}.deletions.bed \
            -wa -wb > ${OUTPUT_DIR}/${SAMPLE}.genes_with_deletions.bed
    fi
    
    # Intersect insertions with genes (if file exists and is not empty)
    if [ -s ${OUTPUT_DIR}/${SAMPLE}.insertions.bed ]; then
        bedtools intersect \
            -a ${GENES_BED} \
            -b ${OUTPUT_DIR}/${SAMPLE}.insertions.bed \
            -wa -wb > ${OUTPUT_DIR}/${SAMPLE}.genes_with_insertions.bed
    fi
    
    # Clean up temp file
    rm -f ${OUTPUT_DIR}/${SAMPLE}.pav_noheader.bed
    
    echo "${SAMPLE} genes affected:"
    echo "  In absent regions: $(wc -l < ${OUTPUT_DIR}/${SAMPLE}.genes_in_absent_regions.bed)"
    if [ -f ${OUTPUT_DIR}/${SAMPLE}.genes_with_deletions.bed ]; then
        echo "  With deletions: $(wc -l < ${OUTPUT_DIR}/${SAMPLE}.genes_with_deletions.bed)"
    fi
    if [ -f ${OUTPUT_DIR}/${SAMPLE}.genes_with_insertions.bed ]; then
        echo "  With insertions: $(wc -l < ${OUTPUT_DIR}/${SAMPLE}.genes_with_insertions.bed)"
    fi
done
```

## Step 9: Generate Summary Statistics

```{r summary-stats}
library(tidyverse)

output_dir <- "../analyses/11-pav"
samples <- c("bc2041", "bc2068", "bc2069", "bc2070", "bc2071", "bc2072", "bc2096")

# Function to safely read bed files
read_bed_safe <- function(file) {
  if (file.exists(file) && file.info(file)$size > 0) {
    tryCatch({
      read_tsv(file, col_names = c("chrom", "start", "end", "name", "score", "strand"),
               col_types = "ciicdc", comment = "#")
    }, error = function(e) tibble())
  } else {
    tibble()
  }
}

# Collect summary stats
summary_stats <- map_dfr(samples, function(sample) {
  
  absent <- read_bed_safe(file.path(output_dir, paste0(sample, ".absent_regions_merged.bed")))
  deletions <- read_bed_safe(file.path(output_dir, paste0(sample, ".deletions.bed")))
  insertions <- read_bed_safe(file.path(output_dir, paste0(sample, ".insertions.bed")))
  
  tibble(
    sample = sample,
    n_absent_regions = nrow(absent),
    absent_total_bp = if(nrow(absent) > 0) sum(absent$end - absent$start) else 0,
    n_deletions = nrow(deletions),
    deletion_total_bp = if(nrow(deletions) > 0) sum(deletions$end - deletions$start) else 0,
    n_insertions = nrow(insertions)
  )
})

print(summary_stats)

# Save summary
write_csv(summary_stats, file.path(output_dir, "pav_summary_stats.csv"))
```

## Step 10: Visualize PAV Distribution

```{r visualize-pav, fig.width=12, fig.height=8}
library(ggplot2)

# Plot absent regions per sample
if(exists("summary_stats") && nrow(summary_stats) > 0) {
  
  # Absent regions count
  p1 <- ggplot(summary_stats, aes(x = sample, y = n_absent_regions, fill = sample)) +
    geom_bar(stat = "identity") +
    theme_minimal() +
    labs(title = "Number of Absent Regions per Sample",
         x = "Sample", y = "Count") +
    theme(legend.position = "none")
  
  # Absent regions total bp
  p2 <- ggplot(summary_stats, aes(x = sample, y = absent_total_bp/1e6, fill = sample)) +
    geom_bar(stat = "identity") +
    theme_minimal() +
    labs(title = "Total Absent Sequence per Sample",
         x = "Sample", y = "Megabases (Mb)") +
    theme(legend.position = "none")
  
  # Deletions count
  p3 <- ggplot(summary_stats, aes(x = sample, y = n_deletions, fill = sample)) +
    geom_bar(stat = "identity") +
    theme_minimal() +
    labs(title = "Number of Deletions per Sample",
         x = "Sample", y = "Count") +
    theme(legend.position = "none")
  
  # Insertions count
  p4 <- ggplot(summary_stats, aes(x = sample, y = n_insertions, fill = sample)) +
    geom_bar(stat = "identity") +
    theme_minimal() +
    labs(title = "Number of Insertions per Sample",
         x = "Sample", y = "Count") +
    theme(legend.position = "none")
  
  # Combine plots
  library(patchwork)
  combined_plot <- (p1 + p2) / (p3 + p4)
  print(combined_plot)
  
  ggsave(file.path(output_dir, "pav_summary_plot.png"), combined_plot, 
         width = 12, height = 8, dpi = 150)
}
```

## Step 11: Compare PAV Across Samples

Identify regions that are consistently absent or present across samples:

```{bash compare-samples}
OUTPUT_DIR=../analyses/11-pav

# Create a multi-sample intersection of absent regions
# First, create a list of all absent region files
ls ${OUTPUT_DIR}/*.absent_regions_merged.bed > ${OUTPUT_DIR}/absent_files.txt

# Use multiinter to find regions absent in multiple samples
bedtools multiinter \
    -i ${OUTPUT_DIR}/bc2041.absent_regions_merged.bed \
       ${OUTPUT_DIR}/bc2068.absent_regions_merged.bed \
       ${OUTPUT_DIR}/bc2069.absent_regions_merged.bed \
       ${OUTPUT_DIR}/bc2070.absent_regions_merged.bed \
       ${OUTPUT_DIR}/bc2071.absent_regions_merged.bed \
       ${OUTPUT_DIR}/bc2072.absent_regions_merged.bed \
       ${OUTPUT_DIR}/bc2096.absent_regions_merged.bed \
    -header \
    -names bc2041 bc2068 bc2069 bc2070 bc2071 bc2072 bc2096 \
    > ${OUTPUT_DIR}/absent_regions_multiinter.bed

# Filter for regions absent in all samples (potential reference errors or common deletions)
awk '$4 == 7' ${OUTPUT_DIR}/absent_regions_multiinter.bed > ${OUTPUT_DIR}/absent_in_all_samples.bed

# Filter for regions absent in only one sample (sample-specific deletions)
awk '$4 == 1' ${OUTPUT_DIR}/absent_regions_multiinter.bed > ${OUTPUT_DIR}/absent_in_one_sample.bed

echo "Regions absent in all samples: $(wc -l < ${OUTPUT_DIR}/absent_in_all_samples.bed)"
echo "Regions absent in only one sample: $(wc -l < ${OUTPUT_DIR}/absent_in_one_sample.bed)"
```

## Output Files Summary

The following files are generated for each sample in `../analyses/11-pav/`:

| File | Description |
|------|-------------|
| `{sample}.regions.bed.gz` | mosdepth coverage per 1kb window |
| `{sample}.absent_regions_merged.bed` | Merged regions with zero coverage |
| `{sample}.present_regions_merged.bed` | Merged regions with coverage |
| `{sample}.insertions.bed` | Insertions >=50bp from CIGAR |
| `{sample}.deletions.bed` | Deletions >=50bp from CIGAR |
| `{sample}.pav_features.bed` | Combined PAV features |
| `{sample}.genes_in_absent_regions.bed` | Genes overlapping absent regions |
| `{sample}.genes_with_deletions.bed` | Genes overlapping deletions |
| `{sample}.genes_with_insertions.bed` | Genes overlapping insertions |

Cross-sample comparison files:
- `absent_regions_multiinter.bed` - Multi-sample intersection of absent regions
- `absent_in_all_samples.bed` - Regions absent in all samples
- `absent_in_one_sample.bed` - Sample-specific absent regions
- `pav_summary_stats.csv` - Summary statistics per sample

```{r session-info}
sessionInfo()
```