--- title: "07-kallisto-summer2022-phelgenome.Rmd" output: html_document date: "2023-12-12" --- Rmd to run `kallisto` to get read counts for summer 2022 RNAseq libraries. Instead of making the index the Phel Transcriptome, I'll make the Phel Genome the index. Can't use the genome fasta because it contains introns. `HISAT2` is able to handle those, but `kallisto` is not, so I'll be using a fasta of the genes. Run this on Raven. # Confirm `kallisto` location on Raven: ```{bash} /home/shared/kallisto/kallisto ``` ## Print working directory ```{bash} pwd ``` # Make the 2023 *P. helianthoides* fasta of genes an index: Get the fasta of genes on Raven: ```{bash} /home/shared/kallisto_linux-v0.50.1/kallisto index \ -t 40 \ -i /home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-sizeclass-2022/code/2023_phel_genomefasta.index \ /home/shared/8TB_HDD_02/graceac9/GitHub/project-pycno-sizeclass-2022/data/augustus.hints.codingseq ``` # Get `quant` info: ```{bash} /home/shared/kallisto/kallisto \ quant ``` I want all kallisto files to go into: `project-pycno-sizeclass-2022/analyses/07-kallisto-phelgenome Trimmed summer 2022 RNAseq reads live: `/home/shared/8TB_HDD_02/graceac9/data/pycno2022` ```{bash} pwd ``` ```{bash} #list all files in directory, get count of how many files DATA_DIRECTORY="../../../data/pycno2022" ls -1 "$DATA_DIRECTORY"/*.fq.gz | wc -l ``` # Kallisto quanitification ```{bash} # Set the paths DATA_DIRECTORY="../../../data/pycno2022" KALLISTO_INDEX="2023_phel_genomefasta.index" OUTPUT_DIRECTORY="../analyses/07-kallisto-phelgenome" pwd echo $DATA_DIRECTORY # Iterate over all .fq.gz files in the data directory for FILE in "$DATA_DIRECTORY"/*_R1_001.fastq.gz.fastp-trim.20231101.fq.gz; do # Extract the base name of the file for naming the output folder BASENAME=$(basename "$FILE" _R1_001.fastq.gz.fastp-trim.20231101.fq.gz) # Create output directory for this sample3 SAMPLE_OUTPUT="$OUTPUT_DIRECTORY/$BASENAME" mkdir -p "$SAMPLE_OUTPUT" # Run Kallisto quantification /home/shared/kallisto_linux-v0.50.1/kallisto quant \ -i "$KALLISTO_INDEX" \ -o "$SAMPLE_OUTPUT" \ -t 40 \ "$DATA_DIRECTORY"/"$BASENAME"_R1_001.fastq.gz.fastp-trim.20231101.fq.gz \ "$DATA_DIRECTORY"/"$BASENAME"_R2_001.fastq.gz.fastp-trim.20231101.fq.gz done echo "Kallisto quantification complete." ``` # Creating count matrix ```{bash} pwd ``` ```{bash} perl /home/shared/trinityrnaseq-v2.12.0/util/abundance_estimates_to_matrix.pl \ --est_method kallisto \ --gene_trans_map none \ --out_prefix ../analyses/07-kallisto-phelgenome/kallisto_20231204 \ --name_sample_by_basedir \ ../analyses/07-kallisto-phelgenome/*/abundance.tsv ``` ```{bash} head ../analyses/07-kallisto-phelgenome/kallisto_20231204.isoform.counts.matrix ``` ```{r} countmatrix <- read.delim("../analyses/07-kallisto-phelgenome/kallisto_20231204.isoform.counts.matrix", header = TRUE, sep = '\t') rownames(countmatrix) <- countmatrix$X countmatrix <- countmatrix[,-1] head(countmatrix) ``` ```{r} countmatrix <- round(countmatrix, 0) str(countmatrix) ```