---
title: "17-kallisto-2021-2022"
output: html_document
date: "2024-09-18"
---
Rmd to get count matrices for the 2021 and 2022 libraries using `kallisto`.
# Confirm `kallisto` location on Raven:
```{bash}
/home/shared/kallisto/kallisto
```
## Print working directory
```{bash}
pwd
```
# Build a `kallisto` index from the 2023 *P. helianthoides* fasta of genes:
Index the fasta of genes stored on Raven:
```{bash}
/home/shared/kallisto_linux-v0.50.1/kallisto index \
-t 40 \
-i /home/shared/8TB_HDD_02/graceac9/GitHub/paper-pycno-sswd-2021-2022/code/2023_phel_genomefasta.index \
/home/shared/8TB_HDD_02/graceac9/GitHub/paper-pycno-sswd-2021-2022/data/augustus.hints.codingseq
```
# Get `quant` info:
```{bash}
/home/shared/kallisto/kallisto \
quant
```
I want all kallisto files to go into:
`paper-pycno-sswd-2021-2022/analyses/17-kallisto-2021-2022`
Trimmed summer 2021 RNAseq reads live in: `/home/shared/8TB_HDD_02/graceac9/data/pycno2021`
Trimmed summer 2022 RNAseq reads live in: `/home/shared/8TB_HDD_02/graceac9/data/pycno2022`
```{bash}
pwd
```
```{bash}
#list all files in directory, get count of how many files
DATA_DIRECTORY="../../../data/pycno2021"
ls -1 "$DATA_DIRECTORY"/*.fq.gz | wc -l
```
should be 64 --> it is
```{bash}
#list all files in directory, get count of how many files
DATA_DIRECTORY="../../../data/pycno2022"
ls -1 "$DATA_DIRECTORY"/*.fq.gz | wc -l
```
should be 64 --> it is
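The file counts confirm the totals, but not that every R1 file has an R2 mate. A minimal sketch of such a check, assuming paired files differ only by `_R1_001` versus `_R2_001` in their names; `check_pairs` is a hypothetical helper, not part of the original workflow:

```bash
# Hypothetical helper: report R1 files in a directory that lack an R2 mate.
# Assumes mates differ only by "_R1_001" vs "_R2_001" in the file name.
check_pairs() {
  local dir="$1"
  local missing=0
  local r1 base r2
  for r1 in "$dir"/*_R1_001*.fq.gz; do
    # Skip the literal pattern when nothing matches
    [ -e "$r1" ] || continue
    # Work on the file name only, so the directory path is never altered
    base="${r1##*/}"
    r2="$dir/$(printf '%s\n' "$base" | sed 's/_R1_001/_R2_001/')"
    if [ ! -e "$r2" ]; then
      echo "Missing mate for: $r1"
      missing=$((missing + 1))
    fi
  done
  echo "Unpaired R1 files: $missing"
  # Return success only when every R1 file has a mate
  [ "$missing" -eq 0 ]
}
```

It could be called as `check_pairs "../../../data/pycno2021"` and again for `pycno2022` before running quantification.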
# Kallisto quantification 2021 libraries
```{bash}
# Set the paths
DATA_DIRECTORY="../../../data/pycno2021"
KALLISTO_INDEX="2023_phel_genomefasta.index"
OUTPUT_DIRECTORY="../analyses/17-kallisto-2021-2022"
pwd
echo "$DATA_DIRECTORY"
# Iterate over all R1 .fq.gz files in the data directory (R2 mates are matched by name)
for FILE in "$DATA_DIRECTORY"/*_R1_001.fastq.gz.fastp-trim.20220810.fq.gz; do
# Extract the base name of the file for naming the output folder
BASENAME=$(basename "$FILE" _R1_001.fastq.gz.fastp-trim.20220810.fq.gz)
# Create output directory for this sample
SAMPLE_OUTPUT="$OUTPUT_DIRECTORY/$BASENAME"
mkdir -p "$SAMPLE_OUTPUT"
# Run Kallisto quantification
/home/shared/kallisto_linux-v0.50.1/kallisto quant \
-i "$KALLISTO_INDEX" \
-o "$SAMPLE_OUTPUT" \
-t 40 \
"$DATA_DIRECTORY"/"$BASENAME"_R1_001.fastq.gz.fastp-trim.20220810.fq.gz \
"$DATA_DIRECTORY"/"$BASENAME"_R2_001.fastq.gz.fastp-trim.20220810.fq.gz
done
echo "Kallisto quantification complete."
```
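The loop derives each sample name by stripping the R1 suffix with `basename`, then rebuilds the R2 file name from it. A small illustration with a made-up file name (`PSC-0001` is a hypothetical sample ID, used only to show the string handling):

```bash
# Made-up file name for illustration only; real sample IDs will differ
SUFFIX="_R1_001.fastq.gz.fastp-trim.20220810.fq.gz"
FILE="../../../data/pycno2021/PSC-0001${SUFFIX}"

# basename drops the directory, then removes the given suffix
BASENAME=$(basename "$FILE" "$SUFFIX")
echo "$BASENAME"
# prints PSC-0001

# The matching R2 file name is rebuilt from the same sample name
echo "${BASENAME}_R2_001.fastq.gz.fastp-trim.20220810.fq.gz"
# prints PSC-0001_R2_001.fastq.gz.fastp-trim.20220810.fq.gz
```

Because the sample name becomes the output folder name, it also becomes the column name in the count matrix when `--name_sample_by_basedir` is used.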
# Kallisto quantification 2022 libraries
```{bash}
# Set the paths
DATA_DIRECTORY="../../../data/pycno2022"
KALLISTO_INDEX="2023_phel_genomefasta.index"
OUTPUT_DIRECTORY="../analyses/17-kallisto-2021-2022"
pwd
echo "$DATA_DIRECTORY"
# Iterate over all R1 .fq.gz files in the data directory (R2 mates are matched by name)
for FILE in "$DATA_DIRECTORY"/*_R1_001.fastq.gz.fastp-trim.20231101.fq.gz; do
# Extract the base name of the file for naming the output folder
BASENAME=$(basename "$FILE" _R1_001.fastq.gz.fastp-trim.20231101.fq.gz)
# Create output directory for this sample
SAMPLE_OUTPUT="$OUTPUT_DIRECTORY/$BASENAME"
mkdir -p "$SAMPLE_OUTPUT"
# Run Kallisto quantification
/home/shared/kallisto_linux-v0.50.1/kallisto quant \
-i "$KALLISTO_INDEX" \
-o "$SAMPLE_OUTPUT" \
-t 40 \
"$DATA_DIRECTORY"/"$BASENAME"_R1_001.fastq.gz.fastp-trim.20231101.fq.gz \
"$DATA_DIRECTORY"/"$BASENAME"_R2_001.fastq.gz.fastp-trim.20231101.fq.gz
done
echo "Kallisto quantification complete."
```
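`kallisto quant` writes an `abundance.tsv` into each sample's output folder, and the matrix step depends on those files. A minimal sketch for confirming that every sample folder has one before building the matrix; `check_quant_outputs` is a hypothetical helper, not part of the original workflow:

```bash
# Hypothetical helper: list sample folders that lack a non-empty abundance.tsv
check_quant_outputs() {
  local outdir="$1"
  local total=0
  local missing=0
  local sample
  for sample in "$outdir"/*/; do
    # Skip the literal pattern when there are no subfolders
    [ -d "$sample" ] || continue
    total=$((total + 1))
    # -s is true only when the file exists and is not empty
    if [ ! -s "${sample}abundance.tsv" ]; then
      echo "No abundance.tsv in: $sample"
      missing=$((missing + 1))
    fi
  done
  echo "Sample folders: $total, missing abundance.tsv: $missing"
  # Return success only when no folder is missing its abundance file
  [ "$missing" -eq 0 ]
}
```

It could be run as `check_quant_outputs "../analyses/17-kallisto-2021-2022"`. If each year's 64 files are 32 R1/R2 pairs, 64 sample folders would be expected in total; that number is an inference from the file counts above, not something the notebook states.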
# Creating count matrix
```{bash}
pwd
```
```{bash}
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
--gene_trans_map none \
--out_prefix ../analyses/17-kallisto-2021-2022/kallisto_20240918 \
--name_sample_by_basedir \
../analyses/17-kallisto-2021-2022/*/abundance.tsv
```
```{bash}
head ../analyses/17-kallisto-2021-2022/kallisto_20240918.isoform.counts.matrix
```
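Beyond eyeballing `head`, the number of sample columns in the matrix can be counted from its header line. A minimal sketch, assuming the matrix is tab-separated and its header starts with an empty field for the transcript ID column (consistent with R reading that column as `X` below); `count_matrix_samples` is a hypothetical helper:

```bash
# Hypothetical helper: count sample columns in a tab-separated count matrix.
# Assumes the header's first field is the (empty) transcript ID column.
count_matrix_samples() {
  head -n 1 "$1" | awk -F'\t' '{ print NF - 1 }'
}
```

Calling `count_matrix_samples ../analyses/17-kallisto-2021-2022/kallisto_20240918.isoform.counts.matrix` should then report one column per sample folder that contained an `abundance.tsv`.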
```{r}
countmatrix <- read.delim("../analyses/17-kallisto-2021-2022/kallisto_20240918.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)
```
```{r}
countmatrix <- round(countmatrix, 0)
head(countmatrix)
```
Write out the count matrix (rounded):
```{r}
#write.table(countmatrix, "../data/2021-2022_kallisto_count_matrix_rounded.tab", quote = FALSE, sep = '\t')
```
Wrote out 2024-09-18