---
title: "Project-Exploring data"
author: "Hannia Larino"
date: "2025-04-21"
output: html_document
---

Location of folder: /home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq
Fastq files: raw data generated from RNA-seq experiments are typically stored in fastq files, which contain millions of short sequencing reads that need to be preprocessed before performing differential gene expression analysis.
Using Kallisto to analyze RNA seq data and ID DEG. 

Components of fastq files: 
A FASTQ file has four line-separated fields per sequence:
*Field 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line).
*Field 2 is the raw sequence letters.
*Field 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again.
*Field 4 encodes the quality values for the sequence in Field 2, and must contain the same number of symbols as letters in the sequence.
---------
Step 1. Downlaod FastQC software from https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc
#downloaded and imported manually using "upload". Next time do it on the command line. 

Professor downloaded this data into my directory: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA848687/


About the sample ID of .fastq files:  https://www.ncbi.nlm.nih.gov/biosample?LinkName=bioproject_biosample_all&from_uid=848687

1. SRR19635628 = Control 1 (2 samples/ID)

2.  SRR19635629 = Control 2

3. SRR19635630 = Control 3

4. SRR19635631 = 26 deg Cels 

5. SRR19635632 = 26 deg Cels 

6. SRR19635633 = 26 deg Cels

7. SRR19635634 = 26 deg Cels

8. SRR19635635 = 26 deg Cels

9. SRR19635636 = 26 deg Cels

10. SRR19635637 = 30

11. SRR19635638 = 30
 
12. SRR19635639 = 30 ***dont have this one*** 


**RUN FastQC**
Running FastQC program on PRJNA848687_fastq folder. This will generate a HTML report that can be viewed in a web browser.
There are 22 samples total. 11 treatment, 11 control. 
Line 28- this is how you run the fastqc program, similar to blast. 

```{bash}
/home/shared/8TB_HDD_02/hannia/SeaCucumber/FastQC/fastqc \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/*.fastq \
-o /home/shared/8TB_HDD_02/hannia/SeaCucumber/output

```
#The QC results show that all sampels have a red x for per base sequence content and sequence duplication levels. 

#2. Get a reference genome of sea cucumber (Apostichopus japonicus) from:
https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_037975245.1/


##curl into the SeaCucumber folder
```{bash}
URL="https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/GCF_037975245.1/download?include_annotation_type=GENOME_FASTA&include_annotation_type=GENOME_GFF&include_annotation_type=RNA_FASTA&include_annotation_type=CDS_FASTA&include_annotation_type=PROT_FASTA&include_annotation_type=SEQUENCE_REPORT&hydrated=FULLY_HYDRATED"

curl -L "$URL" -o /home/shared/8TB_HDD_02/hannia/SeaCucumber/GCF_037975245.1_ref.zip
```
###unizp GCF_037975245.1_ref
```{bash}
cd /home/shared/8TB_HDD_02/hannia/SeaCucumber
unzip GCF_037975245.1_ref.zip -d GCF_037975245.1_ref 
```
####explore contents
```{bash}
cd /home/shared/8TB_HDD_02/hannia/SeaCucumber/GCF_037975245.1_ref/ncbi_dataset/data/GCF_037975245.1
ls

```


**2. (Pseudo - alignment)**
checking if single end or paired
```{bash}
head -4 /home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/*.fastq          

```

#making Kallisto index first b4 pseudo alignment (needed for Kallisto quant)
#making index from ref transcriptome (rna.fna file)
#answers above show that these files are paired-end FASTQ files 
#index file: compressed data structure that allows fast pseudoalignment of RNA-Seq reads to a reference transcriptome.This is for faster alignment, does not align entire genome but instead matching transcripts.

```{bash}
/home/shared/kallisto/kallisto index \
-i /home/shared/8TB_HDD_02/hannia/SeaCucumber/index.idx \
/home/shared/8TB_HDD_02/hannia/SeaCucumber/GCF_037975245.1_ref/ncbi_dataset/data/GCF_037975245.1/rna.fna
```

#completing pseudo alignment 
```{bash}
/home/shared/kallisto/kallisto quant \
-i /home/shared/8TB_HDD_02/hannia/SeaCucumber/index.idx \
-o /home/shared/8TB_HDD_02/hannia/SeaCucumber/output /home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/*.fastq
```

accesing hyak: ssh hannia@klone.hyak.uw.edu

cd /gscratch/scrubbed/fish546

**3. Differential Expression Analysis (Kallisto)**


Already created index, next step making output folder for kallisto data

```{r, engine='bash'}
cd /home/shared/8TB_HDD_02/hannia/SeaCucumber 
pwd
mkdir /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01 
#mkdir makes new folder called kallisto01 up one directory level 
```
Creating sub-directory
```{r, engine='bash'}
find /home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/*fastq.gz \

| xargs basename -s _L001_R1_001.fastq.gz | xargs -I{} /home/shared/kallisto/kallisto \
quant -i /home/shared/8TB_HDD_02/hannia/2.RNASeq/data/cgigas_roslin_rna.index \
-o /home/shared/8TB_HDD_02/hannia/2.RNASeq/output/kallisto_01/{} \
-t 40 \
--single -l 100 -s 10 /home/shared/8TB_HDD_02/hannia/2.RNASeq/data/{}_L001_R1_001.fastq.gz
```

SRR19635628_1.fastq 

```{bash}
#!/bin/bash

# Set input and output directories
INPUT_DIR="/home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq"
OUTPUT_DIR="/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01"
INDEX="/home/shared/8TB_HDD_02/hannia/SeaCucumber/index.idx"
KALLISTO="/home/shared/kallisto/kallisto"

# Loop through all forward reads (_1.fastq)
for R1 in ${INPUT_DIR}/*_1.fastq; do
    # Extract the base sample ID (e.g., SRR19635628)
    SAMPLE=$(basename "$R1" _1.fastq)

    # Define the reverse read
    R2="${INPUT_DIR}/${SAMPLE}_2.fastq"

    # Create output directory for this sample
    SAMPLE_OUT="${OUTPUT_DIR}/${SAMPLE}"
    mkdir -p "$SAMPLE_OUT"

    # Run kallisto quant
    "$KALLISTO" quant -i "$INDEX" -o "$SAMPLE_OUT" -t 40 "$R1" "$R2"
done

```

Creating abundance estimates to gene expression matrix

```{r, engine='bash'}
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
  --est_method kallisto \
  --gene_trans_map none \
  --out_prefix /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01 \
  --name_sample_by_basedir \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635628/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635629/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635630/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635631/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635632/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635633/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635634/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635635/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635636/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635637/abundance.tsv \
  /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635638/abundance.tsv
```

#chunk above gives error but created .matrix anyway 

**PART II. RUNNING DESeq2**

**8. Reading in count matrix** \*Switching to R language and displaying first few rows.

```{r}
library(knitr)
```

```{r, engine='r'}
setwd("/home/shared/8TB_HDD_02/hannia/SeaCucumber/output")
countmatrix <- read.delim("/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)
```


**9. Round integers up to hole numbers for further analysis.**

```{r, engine='r'}
countmatrix <- round(countmatrix, 0)
str(countmatrix) #like summary()
```
##RUNNING DESeq2

```{r}
library(DESeq2)
library(tidyverse)
library(pheatmap)
library(RColorBrewer)
library(data.table)
```

```{r}
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("DESeq2")
```

1. SRR19635628 = Control 1 (2 samples/ID)

2.  SRR19635629 = Control 2

3. SRR19635630 = Control 3

4. SRR19635631 = 26 deg Cels 

5. SRR19635632 = 26 deg Cels 

6. SRR19635633 = 26 deg Cels

7. SRR19635634 = 26 deg Cels

8. SRR19635635 = 26 deg Cels

9. SRR19635636 = 26 deg Cels

10. SRR19635637 = 30

11. SRR19635638 = 30
 
12. SRR19635639 = 30 ***dont have this one*** 


Getting DEG based on Desication. Run this table using 2 control, 2 26 deg, and 2 30 deg. 


```{r, engine='r'}
deseq2.colData <- data.frame(condition=factor(c(rep("control", 2), rep("26.C", 2), rep("30.C", 2))), 
                             type=factor(rep("paired", 24)))
rownames(deseq2.colData) <- colnames(data#/change this to name of folder, look up how to do)
deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix,
                                     colData = deseq2.colData, 
                                     design = ~ condition)
kable(head(deseq2.res), digits = 3, caption = "Top 6 Differential Expression Results")
```