--- title: "Project-Exploring data" author: "Hannia Larino" date: "2025-04-21" output: html_document --- Location of folder: /home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq Fastq files: raw data generated from RNA-seq experiments are typically stored in fastq files, which contain millions of short sequencing reads that need to be preprocessed before performing differential gene expression analysis. Using Kallisto to analyze RNA seq data and ID DEG. Components of fastq files: A FASTQ file has four line-separated fields per sequence: *Field 1 begins with a ‘@’ character and is followed by a sequence identifier and an optional description (like a FASTA title line). *Field 2 is the raw sequence letters. *Field 3 begins with a ‘+’ character and is optionally followed by the same sequence identifier (and any description) again. *Field 4 encodes the quality values for the sequence in Field 2, and must contain the same number of symbols as letters in the sequence. --------- Step 1. Downlaod FastQC software from https://www.bioinformatics.babraham.ac.uk/projects/download.html#fastqc #downloaded and imported manually using "upload". Next time do it on the command line. Professor downloaded this data into my directory: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA848687/ About the sample ID of .fastq files: https://www.ncbi.nlm.nih.gov/biosample?LinkName=bioproject_biosample_all&from_uid=848687 1. SRR19635628 = Control 1 (2 samples/ID) 2. SRR19635629 = Control 2 3. SRR19635630 = Control 3 4. SRR19635631 = 26 deg Cels 5. SRR19635632 = 26 deg Cels 6. SRR19635633 = 26 deg Cels 7. SRR19635634 = 26 deg Cels 8. SRR19635635 = 26 deg Cels 9. SRR19635636 = 26 deg Cels 10. SRR19635637 = 30 11. SRR19635638 = 30 12. SRR19635639 = 30 ***dont have this one*** **RUN FastQC** Running FastQC program on PRJNA848687_fastq folder. This will generate a HTML report that can be viewed in a web browser. There are 22 samples total. 11 treatment, 11 control. Line 28- this is how you run the fastqc program, similar to blast. ```{bash} /home/shared/8TB_HDD_02/hannia/SeaCucumber/FastQC/fastqc \ /home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/*.fastq \ -o /home/shared/8TB_HDD_02/hannia/SeaCucumber/output ``` #The QC results show that all sampels have a red x for per base sequence content and sequence duplication levels. #2. Get a reference genome of sea cucumber (Apostichopus japonicus) from: https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_037975245.1/ ##curl into the SeaCucumber folder ```{bash} URL="https://api.ncbi.nlm.nih.gov/datasets/v2/genome/accession/GCF_037975245.1/download?include_annotation_type=GENOME_FASTA&include_annotation_type=GENOME_GFF&include_annotation_type=RNA_FASTA&include_annotation_type=CDS_FASTA&include_annotation_type=PROT_FASTA&include_annotation_type=SEQUENCE_REPORT&hydrated=FULLY_HYDRATED" curl -L "$URL" -o /home/shared/8TB_HDD_02/hannia/SeaCucumber/GCF_037975245.1_ref.zip ``` ###unizp GCF_037975245.1_ref ```{bash} cd /home/shared/8TB_HDD_02/hannia/SeaCucumber unzip GCF_037975245.1_ref.zip -d GCF_037975245.1_ref ``` ####explore contents ```{bash} cd /home/shared/8TB_HDD_02/hannia/SeaCucumber/GCF_037975245.1_ref/ncbi_dataset/data/GCF_037975245.1 ls ``` **2. (Pseudo - alignment)** checking if single end or paired ```{bash} head -4 /home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/*.fastq ``` #making Kallisto index first b4 pseudo alignment (needed for Kallisto quant) #making index from ref transcriptome (rna.fna file) #answers above show that these files are paired-end FASTQ files #index file: compressed data structure that allows fast pseudoalignment of RNA-Seq reads to a reference transcriptome.This is for faster alignment, does not align entire genome but instead matching transcripts. ```{bash} /home/shared/kallisto/kallisto index \ -i /home/shared/8TB_HDD_02/hannia/SeaCucumber/index.idx \ /home/shared/8TB_HDD_02/hannia/SeaCucumber/GCF_037975245.1_ref/ncbi_dataset/data/GCF_037975245.1/rna.fna ``` #completing pseudo alignment ```{bash} /home/shared/kallisto/kallisto quant \ -i /home/shared/8TB_HDD_02/hannia/SeaCucumber/index.idx \ -o /home/shared/8TB_HDD_02/hannia/SeaCucumber/output /home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/*.fastq ``` accesing hyak: ssh hannia@klone.hyak.uw.edu cd /gscratch/scrubbed/fish546 **3. Differential Expression Analysis (Kallisto)** Already created index, next step making output folder for kallisto data ```{r, engine='bash'} cd /home/shared/8TB_HDD_02/hannia/SeaCucumber pwd mkdir /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01 #mkdir makes new folder called kallisto01 up one directory level ``` Creating sub-directory ```{r, engine='bash'} find /home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq/*fastq.gz \ | xargs basename -s _L001_R1_001.fastq.gz | xargs -I{} /home/shared/kallisto/kallisto \ quant -i /home/shared/8TB_HDD_02/hannia/2.RNASeq/data/cgigas_roslin_rna.index \ -o /home/shared/8TB_HDD_02/hannia/2.RNASeq/output/kallisto_01/{} \ -t 40 \ --single -l 100 -s 10 /home/shared/8TB_HDD_02/hannia/2.RNASeq/data/{}_L001_R1_001.fastq.gz ``` SRR19635628_1.fastq ```{bash} #!/bin/bash # Set input and output directories INPUT_DIR="/home/shared/8TB_HDD_02/hannia/SeaCucumber/PRJNA848687_fastq" OUTPUT_DIR="/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01" INDEX="/home/shared/8TB_HDD_02/hannia/SeaCucumber/index.idx" KALLISTO="/home/shared/kallisto/kallisto" # Loop through all forward reads (_1.fastq) for R1 in ${INPUT_DIR}/*_1.fastq; do # Extract the base sample ID (e.g., SRR19635628) SAMPLE=$(basename "$R1" _1.fastq) # Define the reverse read R2="${INPUT_DIR}/${SAMPLE}_2.fastq" # Create output directory for this sample SAMPLE_OUT="${OUTPUT_DIR}/${SAMPLE}" mkdir -p "$SAMPLE_OUT" # Run kallisto quant "$KALLISTO" quant -i "$INDEX" -o "$SAMPLE_OUT" -t 40 "$R1" "$R2" done ``` Creating abundance estimates to gene expression matrix ```{r, engine='bash'} perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \ --est_method kallisto \ --gene_trans_map none \ --out_prefix /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01 \ --name_sample_by_basedir \ /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635628/abundance.tsv \ /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635629/abundance.tsv \ /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635630/abundance.tsv \ /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635631/abundance.tsv \ /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635632/abundance.tsv \ /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635633/abundance.tsv \ /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635634/abundance.tsv \ /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635635/abundance.tsv \ /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635636/abundance.tsv \ /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635637/abundance.tsv \ /home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01/SRR19635638/abundance.tsv ``` #chunk above gives error but created .matrix anyway **PART II. RUNNING DESeq2** **8. Reading in count matrix** \*Switching to R language and displaying first few rows. ```{r} library(knitr) ``` ```{r, engine='r'} setwd("/home/shared/8TB_HDD_02/hannia/SeaCucumber/output") countmatrix <- read.delim("/home/shared/8TB_HDD_02/hannia/SeaCucumber/output/kallisto_01.isoform.counts.matrix", header = TRUE, sep = '\t') rownames(countmatrix) <- countmatrix$X countmatrix <- countmatrix[,-1] head(countmatrix) ``` **9. Round integers up to hole numbers for further analysis.** ```{r, engine='r'} countmatrix <- round(countmatrix, 0) str(countmatrix) #like summary() ``` ##RUNNING DESeq2 ```{r} library(DESeq2) library(tidyverse) library(pheatmap) library(RColorBrewer) library(data.table) ``` ```{r} if (!require("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("DESeq2") ``` 1. SRR19635628 = Control 1 (2 samples/ID) 2. SRR19635629 = Control 2 3. SRR19635630 = Control 3 4. SRR19635631 = 26 deg Cels 5. SRR19635632 = 26 deg Cels 6. SRR19635633 = 26 deg Cels 7. SRR19635634 = 26 deg Cels 8. SRR19635635 = 26 deg Cels 9. SRR19635636 = 26 deg Cels 10. SRR19635637 = 30 11. SRR19635638 = 30 12. SRR19635639 = 30 ***dont have this one*** Getting DEG based on Desication. Run this table using 2 control, 2 26 deg, and 2 30 deg. ```{r, engine='r'} deseq2.colData <- data.frame(condition=factor(c(rep("control", 2), rep("26.C", 2), rep("30.C", 2))), type=factor(rep("paired", 24))) rownames(deseq2.colData) <- colnames(data#/change this to name of folder, look up how to do) deseq2.dds <- DESeqDataSetFromMatrix(countData = countmatrix, colData = deseq2.colData, design = ~ condition) kable(head(deseq2.res), digits = 3, caption = "Top 6 Differential Expression Results") ```