---
title: "Retrieving Bivalve Genomes"
author: "Megan Ewing"
date: "2025-04-07"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

Here I am retrieving the CDS files from NCBI and saving them to my subdirectory `data/genomes_refseq`. As of Apr. 7, 2025, there are 17 bivalve RefSeq genomes on NCBI, so I will be using a loop.

First, I need to load in some info about my genomes so I have the accession IDs readily available for our loop.

### Saving Accessions as Variable for Loop

```{r}
# load in genome info (from NCBI table)

genome_info <- read.delim("../data/genome_info-20250407.tsv")
head(genome_info) 

```

```{r}
# store accession ids as variable -- need it to be space separated to work in loop

accessions <- paste(genome_info$Assembly.Accession, collapse = " ")
accessions

```
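
As an alternative to building the string in R, the same space-separated list could be produced directly in the shell. This is only a minimal sketch: the function name `accessions_from_tsv` is hypothetical, and it assumes the TSV header is literally `Assembly Accession` (`read.delim` replaces the space with a dot, which is why the R column is `Assembly.Accession`).

```bash
# Sketch: print one column of a TSV as a single space-separated line.
# Assumption: the column header in the file is "Assembly Accession".
accessions_from_tsv() {
  local tsv="$1"
  local column="${2:-Assembly Accession}"
  awk -F'\t' -v col="$column" '
    NR == 1 {
      # find the index of the requested column in the header row
      for (i = 1; i <= NF; i++) if ($i == col) idx = i
      if (!idx) { print "column not found: " col > "/dev/stderr"; exit 1 }
      next
    }
    # append every non-empty value, separated by single spaces
    $idx != "" { out = out (out == "" ? "" : " ") $idx }
    END { if (idx) print out }
  ' "$tsv"
}
```

It could then be used as `accessions=$(accessions_from_tsv ../data/genome_info-20250407.tsv)`, so the list always matches the table on disk.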

Now we try our loop.

### Loop to Download CDS files

```{bash}

# Path to datasets tool
DATASETS_CMD="/home/shared/datasets"

# Output directory (created if it does not exist yet)
OUTPUT_DIR="../data/genomes_refseq"
mkdir -p "$OUTPUT_DIR"

# copied and pasted from output above
accessions="GCF_002022765.2 GCF_026914265.1 GCF_963853765.1 GCF_041381155.1 GCF_963676685.1 GCF_902652985.1 GCF_025612915.1 GCF_947568905.1 GCF_021730395.1 GCF_020536995.1 GCF_036588685.1 GCF_026571515.1 GCF_002113885.1 GCF_021869535.1 GCF_033153115.1 GCF_032062105.1 GCF_031769215.1"

# Loop over the list
for accession in $accessions
do
  echo "Downloading CDS for $accession"
  "$DATASETS_CMD" download genome accession "$accession" --include cds --filename "$OUTPUT_DIR/${accession}_cds.zip"
done

```
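
With 17 downloads, one failed or interrupted request means rerunning the whole loop. The sketch below shows one way the loop above might be made rerunnable: it skips archives that already exist and collects failures instead of stopping silently. The function name `download_cds` is hypothetical, and it assumes `DATASETS_CMD` is set as in the chunk above and that the tool returns a non-zero exit status on failure.

```bash
# Sketch: rerunnable version of the download loop.
# Assumption: DATASETS_CMD points at the NCBI datasets executable.
download_cds() {
  local out_dir="$1"; shift
  local accession zip failed=""
  mkdir -p "$out_dir"
  for accession in "$@"; do
    zip="$out_dir/${accession}_cds.zip"
    # a non-empty archive from an earlier run is left alone
    if [ -s "$zip" ]; then
      echo "Skipping $accession (archive already present)"
      continue
    fi
    echo "Downloading CDS for $accession"
    if ! "$DATASETS_CMD" download genome accession "$accession" \
         --include cds --filename "$zip"; then
      failed="$failed $accession"
      rm -f "$zip"   # drop any partial archive so a rerun retries it
    fi
  done
  if [ -n "$failed" ]; then
    echo "Failed:$failed" >&2
    return 1
  fi
}
```

Called as `download_cds "$OUTPUT_DIR" $accessions` (unquoted on purpose, so the string splits into one argument per accession), it can be rerun until nothing is reported as failed.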

Unzip the files.

```{r}
# Set directory path
zip_dir <- "../data/genomes_refseq"

# List all .zip files in the directory
zip_files <- list.files(zip_dir, pattern = "\\.zip$", full.names = TRUE)

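# Unzip each file
# Note: every archive extracts into the same directory, so files that share a
# name across archives (e.g. README.md) are overwritten by the last one unzipped
for (zip_file in zip_files) {
  unzip(zip_file, exdir = zip_dir)
}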

```
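
After unzipping, it may be worth checking that every accession actually produced a CDS FASTA. The sketch below is one way to do that. The function name `missing_cds` is hypothetical, and the path layout `ncbi_dataset/data/<accession>/cds_from_genomic.fna` is an assumption about how the archives unpack, so it should be checked against the extracted folders before relying on it.

```bash
# Sketch: list accessions with no (or an empty) CDS FASTA after unzipping.
# Assumption: archives unpack to ncbi_dataset/data/<accession>/cds_from_genomic.fna
missing_cds() {
  local root="$1"; shift
  local accession status=0
  for accession in "$@"; do
    if [ ! -s "$root/ncbi_dataset/data/$accession/cds_from_genomic.fna" ]; then
      echo "$accession"
      status=1
    fi
  done
  # non-zero return when at least one accession is missing
  return "$status"
}
```

Running `missing_cds ../data/genomes_refseq $accessions` prints nothing and returns 0 when every accession has its file; otherwise it prints the accessions that need another look.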