---
title: "07-kallisto-summer2022-phelgenome.Rmd"
output: html_document
date: "2023-12-12"
---
Rmd to run `kallisto` to get read counts for summer 2022 RNAseq libraries. Instead of making the index the Phel Transcriptome, I'll make the Phel Genome the index. 

Can't use the genome fasta because it contains introns. `HISAT2` is able to handle those, but `kallisto` is not, so I'll be using a fasta of the genes. 

Run this on Raven. 

# Confirm `kallisto` location on Raven:
```{bash}
/home/shared/kallisto/kallisto
```

## Print working directory
```{bash}
pwd
```

# Make the 2023 *P. helianthoides* fasta of genes an index:
Get the fasta of genes on Raven: 


```{bash}
/home/shared/kallisto_linux-v0.50.1/kallisto index \
-t 40 \
-i /home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-sizeclass-2022/code/2023_phel_genomefasta.index \
/home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-sizeclass-2022/data/augustus.hints.codingseq
```

# Get `quant` info:

```{bash}
/home/shared/kallisto/kallisto \
quant
```


I want all kallisto files to go into: 

`project-pycno-sizeclass-2022/analyses/07-kallisto-phelgenome

Trimmed summer 2022 RNAseq reads live: `/home/shared/8TB_HDD_02/graceac9/data/pycno2022`

```{bash}
pwd
```

```{bash}
#list all files in directory, get count of how many files
DATA_DIRECTORY="../../../data/pycno2022"
ls -1 "$DATA_DIRECTORY"/*.fq.gz | wc -l
```

# Kallisto quanitification 

```{bash}
# Set the paths
DATA_DIRECTORY="../../../data/pycno2022"
KALLISTO_INDEX="2023_phel_genomefasta.index"
OUTPUT_DIRECTORY="../analyses/07-kallisto-phelgenome"

pwd
echo $DATA_DIRECTORY

# Iterate over all .fq.gz files in the data directory
for FILE in "$DATA_DIRECTORY"/*_R1_001.fastq.gz.fastp-trim.20231101.fq.gz; do
    # Extract the base name of the file for naming the output folder
    BASENAME=$(basename "$FILE" _R1_001.fastq.gz.fastp-trim.20231101.fq.gz)

    # Create output directory for this sample3
    SAMPLE_OUTPUT="$OUTPUT_DIRECTORY/$BASENAME"
    mkdir -p "$SAMPLE_OUTPUT"

    # Run Kallisto quantification
    /home/shared/kallisto_linux-v0.50.1/kallisto quant \
    -i "$KALLISTO_INDEX" \
    -o "$SAMPLE_OUTPUT" \
    -t 40 \
    "$DATA_DIRECTORY"/"$BASENAME"_R1_001.fastq.gz.fastp-trim.20231101.fq.gz \
    "$DATA_DIRECTORY"/"$BASENAME"_R2_001.fastq.gz.fastp-trim.20231101.fq.gz
done

echo "Kallisto quantification complete."
```


# Creating count matrix 

```{bash}
pwd
```

```{bash}
perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \
--est_method kallisto \
    --gene_trans_map none \
    --out_prefix ../analyses/07-kallisto-phelgenome/kallisto_20231204 \
    --name_sample_by_basedir \
     ../analyses/07-kallisto-phelgenome/*/abundance.tsv
```

     
```{bash}     
head ../analyses/07-kallisto-phelgenome/kallisto_20231204.isoform.counts.matrix
```

```{r}
countmatrix <- read.delim("../analyses/07-kallisto-phelgenome/kallisto_20231204.isoform.counts.matrix", header = TRUE, sep = '\t')
rownames(countmatrix) <- countmatrix$X
countmatrix <- countmatrix[,-1]
head(countmatrix)
```

```{r}
countmatrix <- round(countmatrix, 0)
str(countmatrix)
```