--- title: "kallisto-summer21-phelgenomegenelist" output: html_document --- Run this on Raven. Rmd to align summer 2021 trimmed and QC'ed _Pycnopodia helianthoides_ coelomocyte RNAseq libraries to the _Pycnopodia helianthoides_ genome gene list that Steven got. Genome: Lauren M Schiebelhut, Melissa B DeBiasse, Lars Gabriel, Katharina J Hoff, Michael N Dawson, A reference genome for ecological restoration of the sunflower sea star, Pycnopodia helianthoides, Journal of Heredity, 2023;, esad054, https://doi.org/10.1093/jhered/esad054 Can't use the genome fasta because it contains introns. `HISAT2` is able to handle those, but `kallisto` is not, so I'll be using a fasta of the genes. # Confirm `kallisto` location on Raven: ```{bash} /home/shared/kallisto/kallisto ``` ## Print working directory ```{bash} pwd ``` # Make the 2023 *P. helianthoides* fasta of genes an index: Get the fasta of genes on Raven: ```{bash} /home/shared/kallisto_linux-v0.50.1/kallisto index \ -t 40 \ -i /home/shared/8TB_HDD_02/graceac9/GitHub/paper-pycno-sswd-2021/code/2023_phel_genomefasta.index \ /home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-sizeclass-2022/data/augustus.hints.codingseq ``` # Get `quant` info: ```{bash} /home/shared/kallisto/kallisto \ quant ``` I want all kallisto files to go into: `project-pycno-sizeclass-2021/analyses/03-kallisto` Trimmed summer 2021 RNAseq reads live: `/home/shared/8TB_HDD_02/graceac9/data/pycno2021` ```{bash} pwd ``` ```{bash} #list all files in directory, get count of how many files DATA_DIRECTORY="../../../data/pycno2021" ls -1 "$DATA_DIRECTORY"/*.fq.gz | wc -l ``` Should be 64 --> two files per each of the 32 libraries. # Kallisto quanitification ```{bash} # Set the paths DATA_DIRECTORY="../../../data/pycno2021" KALLISTO_INDEX="2023_phel_genomefasta.index" OUTPUT_DIRECTORY="../analyses/03-kallisto" pwd echo $DATA_DIRECTORY # Iterate over all .fq.gz files in the data directory for FILE in "$DATA_DIRECTORY"/*_R1_001.fastq.gz.fastp-trim.20220810.fq.gz; do # Extract the base name of the file for naming the output folder BASENAME=$(basename "$FILE" _R1_001.fastq.gz.fastp-trim.20220810.fq.gz) # Create output directory for this sample3 SAMPLE_OUTPUT="$OUTPUT_DIRECTORY/$BASENAME" mkdir -p "$SAMPLE_OUTPUT" # Run Kallisto quantification /home/shared/kallisto_linux-v0.50.1/kallisto quant \ -i "$KALLISTO_INDEX" \ -o "$SAMPLE_OUTPUT" \ -t 40 \ "$DATA_DIRECTORY"/"$BASENAME"_R1_001.fastq.gz.fastp-trim.20220810.fq.gz \ "$DATA_DIRECTORY"/"$BASENAME"_R2_001.fastq.gz.fastp-trim.20220810.fq.gz done echo "Kallisto quantification complete." ``` # Creating count matrix ```{bash} pwd ``` ```{bash} perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \ --est_method kallisto \ --gene_trans_map none \ --out_prefix ../analyses/03-kallisto/kallisto_20240118 \ --name_sample_by_basedir \ ../analyses/03-kallisto/*/abundance.tsv ``` ```{bash} head ../analyses/03-kallisto/kallisto_20240118.isoform.counts.matrix ``` ```{r} countmatrix <- read.delim("../analyses/03-kallisto/kallisto_20240118.isoform.counts.matrix", header = TRUE, sep = '\t') rownames(countmatrix) <- countmatrix$X countmatrix <- countmatrix[,-1] head(countmatrix) ``` ```{r} countmatrix <- round(countmatrix, 0) head(countmatrix) ``` write out count matrix (not rounded): ```{r} #write.table(countmatrix, "../data/kallisto_count_matrix_rounded.tab", quote = FALSE, sep = '\t') ```` Wrote out 2024-01-18