--- title: "07-Peve-sRNAseq-MirMachine" author: "Sam White" date: "2023-11-15" output: bookdown::html_document2: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true github_document: toc: true number_sections: true html_document: theme: cosmo toc: true toc_float: true number_sections: true code_folding: show code_download: true bibliography: references.bib link-citations: true --- ```{r setup, include=FALSE} library(knitr) library(kableExtra) library(dplyr) library(reticulate) knitr::opts_chunk$set( echo = TRUE, # Display code chunks eval = FALSE, # Evaluate code chunks warning = FALSE, # Hide warnings message = FALSE, # Hide messages comment = "" # Prevents appending '##' to beginning of lines in code output ) ``` Use [MirMachine](https://github.com/sinanugur/MirMachine) [@umu2022] to identify potential miRNA homologs in the *Pevermanni* genome. Genome FastA originally downloaded from on 20230629. ------------------------------------------------------------------------ Inputs: - Genome FastA. Outputs: - Confidence-filtered GFF of predicted miRNAs. - Unfiltered FastA of predicted miRNAs. - Unfiltered GFF of predicted miRNAs. Software requirements: - Requires a [MirMachine](https://github.com/sinanugur/MirMachine) Conda/Mamba environment, per the installation instructions. ------------------------------------------------------------------------ # Set R variables Replace with name of your MirMachine environment and the path to the corresponding conda installation (find this *after* you've activated the environment). E.g. ```{r conda-env-example, eval=FALSE} # Activate environment conda activate mirmachine_env # Find conda path which conda ``` ```{r R-variables, eval=TRUE} mirmachine_conda_env_name <- c("mirmachine_env") mirmachine_cond_path <- c("/home/sam/programs/mambaforge/condabin/conda") ``` # Create a Bash variables file This allows usage of Bash variables across R Markdown chunks. ```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE} { echo "#### Assign Variables ####" echo "" echo "# Data directories" echo 'export deep_dive_dir=/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive' echo 'export output_dir_top=${deep_dive_dir}/E-Peve/output/07-Peve-sRNAseq-MirMachine' echo "" echo 'export mirmarchine_output_prefix="Peve-MirMachine"' echo "# Input/Output files" echo 'export genome_fasta_dir=${deep_dive_dir}/E-Peve/data' echo 'export genome_fasta_name="Porites_evermanni_v1.fa"' echo 'export genome_fasta="${genome_fasta_dir}/${genome_fasta_name}"' echo "" echo "# External data URLs" echo 'export genome_fasta_url="https://gannet.fish.washington.edu/seashell/snaps/"' echo "" echo "# Set number of CPUs to use" echo 'export threads=40' echo "" } > .bashvars cat .bashvars ``` # Load [MirMachine](https://github.com/sinanugur/MirMachine) conda environment If this is successful, the first line of output should show that the Python being used is the one in your [MirMachine](https://github.com/sinanugur/MirMachine) conda environment path. E.g. `python: /home/sam/programs/mambaforge/envs/mirmachine_env/bin/python` ```{r load-mirmachine-conda-env, eval=TRUE} use_condaenv(condaenv = mirmachine_conda_env_name, conda = mirmachine_cond_path) # Check successful env loading py_config() ``` # Download *P.evermanni* genome from our server ```{r download-genome-fasta, engine='bash', eval=TRUE} # Load bash variables into memory source .bashvars wget \ --directory-prefix ${genome_fasta_dir} \ --recursive \ --no-check-certificate \ --continue \ --no-host-directories \ --no-directories \ --no-parent \ --quiet \ --execute robots=off \ --accept "${genome_fasta_name}" ${genome_fasta_url} ls -lh "${genome_fasta_dir}" ``` # Run [MirMachine](https://github.com/sinanugur/MirMachine) The `--species` option specifies output naming structure ```{r mirmachine, engine='bash', cache=TRUE} # Load bash variables into memory source .bashvars cd "${output_dir_top}" time \ MirMachine.py \ --node Metazoa \ --species ${mirmarchine_output_prefix} \ --genome ${genome_fasta} \ --cpu 40 \ --add-all-nodes \ &> mirmachine.log ``` # Results ## View [MirMachine](https://github.com/sinanugur/MirMachine) output structure ```{r mirmachine-output-tree, eval=TRUE, engine='bash'} # Load bash variables into memory source .bashvars tree -h ${output_dir_top}/results/predictions ``` ## Check FastA ```{r check-mirmachine-fasta, eval=TRUE, engine='bash'} # Load bash variables into memory source .bashvars echo "Total number of predicted miRNAs (high & low confidence):" grep --count "^>" "${output_dir_top}/results/predictions/fasta/${mirmarchine_output_prefix}.PRE.fasta" echo "" echo "----------------------------------------------------------------------------------" echo "" head "${output_dir_top}/results/predictions/fasta/${mirmarchine_output_prefix}.PRE.fasta" ``` ## Check filtered GFF Per [MirMachine](https://github.com/sinanugur/MirMachine) recommendations: > `filtered_gff/` High confidence miRNA family predictions after bitscore filtering. (This file is what you need in most cases) So, we'll stick with that. ```{r check-filtered-gff, eval=TRUE, engine='bash'} # Load bash variables into memory source .bashvars # Avoid 10th line - lengthy list of all miRNA families examined head -n 9 "${output_dir_top}/results/predictions/filtered_gff/${mirmarchine_output_prefix}.PRE.gff" echo "" # Pull out lines that do _not_ begin with a '#'. grep "^[^#]" "${output_dir_top}/results/predictions/filtered_gff/${mirmarchine_output_prefix}.PRE.gff" \ | head ``` ## Counts of predicted high-confidence miRNAs ```{r count-predicted-hc-mirnas, eval=TRUE, engine='bash'} # Load bash variables into memory source .bashvars # Skip lines which begin with '#' (i.e. the header components) echo "Predicted loci:" grep -c "^[^#]" "${output_dir_top}/results/predictions/filtered_gff/${mirmarchine_output_prefix}.PRE.gff" echo "" echo "Unique miRNA families:" grep "^[^#]" "${output_dir_top}/results/predictions/filtered_gff/${mirmarchine_output_prefix}.PRE.gff" \ | awk -F"[\t=;]" '{print $10}' \ | sort -u \ | wc -l ``` ------------------------------------------------------------------------ # Citations