Use miRTrace (Kang et al. 2018) to identify taxonomic origins of miRNA sequencing data.
NOTE: This requires you to have previously run 06.2-Peve-sRNAseq-trimming-31bp-fastp-merged.Rmd, as the code relies on the trimmed and merged reads output from that code.
Inputs:
-   Trimmed sRNAseq FastQs generated by `06.2-Peve-sRNAseq-trimming-31bp-fastp-merged.Rmd`
    -   Filenames formatted: `*-31bp-merged.fq.gz`
Outputs:
-   `mirtrace.config`: A miRTrace config file. A comma-separated file with this layout (one FastQ per line): `/path/to/fastq,custom_sample_name`
-   “Collapsed” (i.e. unique sequences only) FastA for each corresponding input FastQ.
-   `mirtrace-report.html`: HTML-formatted report generated by miRTrace.
-   `mirtrace-stats-contamination_basic.tsv`: Tab-delimited report with counts of sequences from each collapsed FastA having matches to known miRNAs within each of the miRTrace Clades.
-   `mirtrace-stats-contamination_detailed.tsv`: Tab-delimited report of only Clades with which sequences were matched, along with the corresponding miRNA families in each clade, and the sequence counts.
Create a Bash variables file
This allows usage of Bash variables across R Markdown chunks.
{
echo "#### Assign Variables ####"
echo ""
echo "# Data directories"
echo 'export deep_dive_dir=/home/shared/8TB_HDD_02/shedurkin/deep-dive'
echo 'export output_dir_top=${deep_dive_dir}/E-Peve/output/09.2-Peve-sRNAseq-miRTrace-31bp-fastp-merged'
echo 'export trimmed_reads_dir=${deep_dive_dir}/E-Peve/output/06.2-Peve-sRNAseq-trimming-31bp-fastp-merged/trimmed-reads'
echo ""
echo "# Paths to programs"
echo 'export mirtrace=/home/sam/programs/mambaforge/envs/miRTrace_env/bin/mirtrace'
echo ""
echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""
echo "export fastq_pattern='*-31bp-merged.fq.gz'"
echo "# Programs associative array"
echo "declare -A programs_array"
echo "programs_array=("
echo '[mirtrace]="${mirtrace}"'
echo ")"
} > .bashvars
cat .bashvars
#### Assign Variables ####
# Data directories
export deep_dive_dir=/home/shared/8TB_HDD_02/shedurkin/deep-dive
export output_dir_top=${deep_dive_dir}/E-Peve/output/09.2-Peve-sRNAseq-miRTrace-31bp-fastp-merged
export trimmed_reads_dir=${deep_dive_dir}/E-Peve/output/06.2-Peve-sRNAseq-trimming-31bp-fastp-merged/trimmed-reads
# Paths to programs
export mirtrace=/home/sam/programs/mambaforge/envs/miRTrace_env/bin/mirtrace
# Set number of CPUs to use
export threads=40
export fastq_pattern='*-31bp-merged.fq.gz'
# Programs associative array
declare -A programs_array
programs_array=(
[mirtrace]="${mirtrace}"
)
Create miRTrace config file
# Load bash variables into memory
source .bashvars
# Populate array with the trimmed, merged FastQs
fastq_array=(${trimmed_reads_dir}/${fastq_pattern})
# Make sure the output directory exists before writing to it
mkdir -p "${output_dir_top}"
# Reads are already merged, so each sample has a single FastQ.
# Write one config line per FastQ.
if [ -f "${output_dir_top}/mirtrace.config" ]; then
echo "mirtrace.config already exists. Nothing to do."
else
for fastq in "${fastq_array[@]}"
do
# Use first three parts of filename to create short sample name
sample_name=$(echo "${fastq##*/}" | awk -F "-" '{print $1"-"$2"-"$3}')
echo "${fastq},${sample_name}"
done >> "${output_dir_top}/mirtrace.config"
fi
cat "${output_dir_top}/mirtrace.config"
mirtrace.config already exists. Nothing to do.
/home/shared/8TB_HDD_02/shedurkin/deep-dive/E-Peve/output/06.2-Peve-sRNAseq-trimming-31bp-fastp-merged/trimmed-reads/sRNA-POR-73-S1-TP2-fastp-adapters-polyG-31bp-merged.fq.gz,sRNA-POR-73_1
/home/shared/8TB_HDD_02/shedurkin/deep-dive/E-Peve/output/06.2-Peve-sRNAseq-trimming-31bp-fastp-merged/trimmed-reads/sRNA-POR-79-S1-TP2-fastp-adapters-polyG-31bp-merged.fq.gz,sRNA-POR-79_2
/home/shared/8TB_HDD_02/shedurkin/deep-dive/E-Peve/output/06.2-Peve-sRNAseq-trimming-31bp-fastp-merged/trimmed-reads/sRNA-POR-82-S1-TP2-fastp-adapters-polyG-31bp-merged.fq.gz,sRNA-POR-82_1
,--_2
Run miRTrace
# Load bash variables into memory
source .bashvars
time \
${programs_array[mirtrace]} trace \
--config ${output_dir_top}/mirtrace.config \
--write-fasta \
--num-threads ${threads} \
--output-dir ${output_dir_top} \
--force
tree -h ${output_dir_top}
miRTrace version 1.0.1 starting. Processing 3 sample(s).
NOTE: reusing existing output directory, outdated files may be present.
Run complete. Processed 3 sample(s) in 23 s.
Reports written to /home/shared/8TB_HDD_02/shedurkin/deep-dive/E-Peve/output/09.2-Peve-sRNAseq-miRTrace-31bp-fastp-merged/
For information about citing our paper, run miRTrace in mode "cite".
real 0m23.795s
user 1m5.348s
sys 0m2.158s
/home/shared/8TB_HDD_02/shedurkin/deep-dive/E-Peve/output/09.2-Peve-sRNAseq-miRTrace-31bp-fastp-merged
├── [ 573] mirtrace.config
├── [298K] mirtrace-report.html
├── [ 269] mirtrace-stats-contamination_basic.tsv
├── [ 365] mirtrace-stats-contamination_detailed.tsv
└── [4.0K] qc_passed_reads.all.collapsed
    ├── [ 68M] sRNA-POR-73-S1-TP2-fastp-adapters-polyG-31bp-merged.fasta
    ├── [108M] sRNA-POR-79-S1-TP2-fastp-adapters-polyG-31bp-merged.fasta
    └── [129M] sRNA-POR-82-S1-TP2-fastp-adapters-polyG-31bp-merged.fasta
1 directory, 7 files
Results
Read in table as data frame
mirtrace.detailed.df <- read.csv("../output/09-Peve-sRNAseq-miRTrace/mirtrace-stats-contamination_detailed.tsv", sep = "\t", header = TRUE)
str(mirtrace.detailed.df)
'data.frame': 1 obs. of 10 variables:
$ CLADE : chr "insects"
$ FAMILY_ID : int 14
$ MIRBASE_IDS : chr "aae-miR-14,aga-miR-14,ame-miR-14,api-miR-14,bmo-miR-14,cqu-miR-14,dan-miR-14,der-miR-14,dgr-miR-14,dme-miR-14,d"| __truncated__
$ SEQ : chr "TCAGTCTTTTTCTCTCTCCT"
$ sRNA.POR.73_1: int 0
$ sRNA.POR.73_2: int 0
$ sRNA.POR.79_1: int 0
$ sRNA.POR.79_2: int 0
$ sRNA.POR.82_1: int 1
$ sRNA.POR.82_2: int 0
Number of samples with matches
IMPORTANT: Change starts_with() to match prefix of input sRNAseq reads!
# Select columns corresponding to sample names
sample_columns <- mirtrace.detailed.df %>%
select(starts_with("sRNA.POR."))
# Calculate the sum for each column
sample_sums <- colSums(sample_columns)
# Count the number of columns with a sum greater than 0
samples_with_sum_gt_0 <- sum(sample_sums > 0)
paste("Number of samples with matches: ", samples_with_sum_gt_0)
[1] "Number of samples with matches: 1"
Percentage of samples with matches
# Total number of samples (columns)
total_samples <- ncol(sample_columns)
# Percentage of samples with sums greater than 0
percentage_samples_gt_0 <- (samples_with_sum_gt_0 / total_samples) * 100
paste("Percentage of samples with matches: ", percentage_samples_gt_0)
[1] "Percentage of samples with matches: 16.6666666666667"
Number of clades with matches
unique_clade_count <- mirtrace.detailed.df %>%
distinct(CLADE) %>% # Get unique entries in CLADE column
count() # Count the number of unique entries
paste("Number of clades with matches:", unique_clade_count)
[1] "Number of clades with matches: 1"
miRTrace table
Counts > 0 are highlighted in green to make them easier to see.
mirtrace.detailed.df %>%
mutate(
across(
starts_with("sRNA"),
~cell_spec(
.,
background = ifelse(
. > 0,
"lightgreen",
"white"
)
)
)
) %>%
kable(escape = FALSE, caption = "Clades identified as having sRNAseq matches.") %>%
kable_styling("striped") %>%
scroll_box(width = "100%", height = "500px")
Table 4.1: Clades identified as having sRNAseq matches.

| CLADE | FAMILY_ID | MIRBASE_IDS | SEQ | sRNA.POR.73_1 | sRNA.POR.73_2 | sRNA.POR.79_1 | sRNA.POR.79_2 | sRNA.POR.82_1 | sRNA.POR.82_2 |
|---|---|---|---|---|---|---|---|---|---|
| insects | 14 | aae-miR-14,aga-miR-14,ame-miR-14,api-miR-14,bmo-miR-14,cqu-miR-14,dan-miR-14,der-miR-14,dgr-miR-14,dme-miR-14,dmo-miR-14,dpe-miR-14,dps-miR-14,dse-miR-14,dsi-miR-14,dvi-miR-14,dwi-miR-14,dya-miR-14,hme-miR-14,mse-miR-14,ngi-miR-14,nvi-miR-14,tca-miR-14 | TCAGTCTTTTTCTCTCTCCT | 0 | 0 | 0 | 0 | <span style="background-color: lightgreen;">1</span> | 0 |

Citations
Kang, Wenjing, Yrin Eldfjell, Bastian Fromm, Xavier Estivill, Inna Biryukova, and Marc R. Friedländer. 2018. “miRTrace Reveals the Organismal Origins of microRNA Sequencing Data.” Genome Biology 19 (1). https://doi.org/10.1186/s13059-018-1588-9.
---
title: "09.2-Peve-sRNAseq-miRTrace-31bp-fastp-merged"
author: "Kathleen Durkin"
date: "2024-04-11"
output: 
  bookdown::html_document2:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
  github_document:
    toc: true
    number_sections: true
  html_document:
    theme: cosmo
    toc: true
    toc_float: true
    number_sections: true
    code_folding: show
    code_download: true
always_allow_html: true
bibliography: references.bib
link-citations: true
---

```{r setup, include=FALSE}
library(knitr)
library(kableExtra)
library(dplyr)
library(reticulate)
knitr::opts_chunk$set(
  echo = TRUE,         # Display code chunks
  eval = FALSE,        # Do not evaluate code chunks by default; chunks opt in with eval=TRUE
  warning = FALSE,     # Hide warnings
  message = FALSE,     # Hide messages
  comment = ""         # Prevents appending '##' to beginning of lines in code output
)
```

Use [miRTrace](https://github.com/friedlanderlab/mirtrace) [@kang2018] to identify taxonomic origins of miRNA sequencing data.

NOTE: This requires you to have previously run [`06.2-Peve-sRNAseq-trimming-31bp-fastp-merged.Rmd`](https://github.com/urol-e5/deep-dive/blob/main/E-Peve/code/06.2-Peve-sRNAseq-trimming-31bp-fastp-merged.Rmd), as the code relies on the trimmed and merged reads output from that code.

------------------------------------------------------------------------

Inputs:

-   Trimmed sRNAseq FastQs generated by [`06.2-Peve-sRNAseq-trimming-31bp-fastp-merged.Rmd`](https://github.com/urol-e5/deep-dive/blob/main/E-Peve/code/06.2-Peve-sRNAseq-trimming-31bp-fastp-merged.Rmd)

    -   Filenames formatted: `*-31bp-merged.fq.gz`

Outputs:

-   `mirtrace.config`: A [miRTrace](https://github.com/friedlanderlab/mirtrace) config file. A comma-separated file with this layout (one FastQ per line): `/path/to/fastq,custom_sample_name`

-   "Collapsed" (i.e. unique sequences only) FastA for each corresponding input FastQ.

-   `mirtrace-report.html`: HTML-formatted report generated by [miRTrace](https://github.com/friedlanderlab/mirtrace).

-   `mirtrace-stats-contamination_basic.tsv`: Tab-delimited report with counts of sequences from each collapsed FastA having matches to known miRNAs within each of the [miRTrace](https://github.com/friedlanderlab/mirtrace) Clades.

-   `mirtrace-stats-contamination_detailed.tsv`: Tab-delimited report of *only* Clades with which sequences were matched, along with the corresponding miRNA families in each clade, and the sequence counts.
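
As a minimal sketch of the `mirtrace.config` layout described above, the hypothetical helper below builds one config line from a FastQ path. The function name `make_config_line` and the `/path/to/` prefix are illustrative assumptions, not part of miRTrace; the sample name is taken from the first three `-`-delimited fields of the filename.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of miRTrace): build one config line
# in the form /path/to/fastq,custom_sample_name
make_config_line() {
  local fastq="$1"
  local base="${fastq##*/}"   # strip the directory part of the path
  local sample_name
  # First three "-"-delimited fields of the filename become the sample name
  sample_name=$(echo "${base}" | awk -F "-" '{print $1"-"$2"-"$3}')
  echo "${fastq},${sample_name}"
}

# Example with a placeholder directory
make_config_line "/path/to/sRNA-POR-73-S1-TP2-fastp-adapters-polyG-31bp-merged.fq.gz"
```

For the placeholder path above, the sample name resolves to `sRNA-POR-73`.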

# Create a Bash variables file

This allows usage of Bash variables across R Markdown chunks.

```{r save-bash-variables-to-rvars-file, engine='bash', eval=TRUE}
{
echo "#### Assign Variables ####"
echo ""

echo "# Data directories"
echo 'export deep_dive_dir=/home/shared/8TB_HDD_02/shedurkin/deep-dive'
echo 'export output_dir_top=${deep_dive_dir}/E-Peve/output/09.2-Peve-sRNAseq-miRTrace-31bp-fastp-merged'
echo 'export trimmed_reads_dir=${deep_dive_dir}/E-Peve/output/06.2-Peve-sRNAseq-trimming-31bp-fastp-merged/trimmed-reads'
echo ""

echo "# Paths to programs"
echo 'export mirtrace=/home/sam/programs/mambaforge/envs/miRTrace_env/bin/mirtrace'
echo ""

echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""

echo "export fastq_pattern='*-31bp-merged.fq.gz'"

echo "# Programs associative array"
echo "declare -A programs_array"
echo "programs_array=("
echo '[mirtrace]="${mirtrace}"'
echo ")"
} > .bashvars

cat .bashvars
```
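
The variables file is needed because each `bash` chunk runs in its own shell process, so a variable set in one chunk is gone in the next. A minimal sketch of that behavior follows, using separate `bash -c` calls to stand in for chunks. The variable `demo_threads` and the temporary file are illustrative assumptions and are not part of this workflow.

```bash
#!/usr/bin/env bash

# Temporary stand-in for .bashvars (hypothetical, for illustration only)
vars_file=$(mktemp)
echo 'export demo_threads=4' > "${vars_file}"

# A fresh shell (like a new chunk) does not know the variable...
without_source=$(bash -c 'echo "${demo_threads:-unset}"')

# ...until it sources the variables file
with_source=$(bash -c 'source "$1"; echo "${demo_threads:-unset}"' _ "${vars_file}")

echo "Without sourcing: ${without_source}"
echo "With sourcing:    ${with_source}"

rm -f "${vars_file}"
```

This is why every `bash` chunk in this document begins with `source .bashvars`.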

# Create [miRTrace](https://github.com/friedlanderlab/mirtrace) config file

```{r create-config-file, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

# Populate array with the trimmed, merged FastQs
fastq_array=(${trimmed_reads_dir}/${fastq_pattern})

# Make sure the output directory exists before writing to it
mkdir -p "${output_dir_top}"

# Reads are already merged, so each sample has a single FastQ.
# Write one config line per FastQ.
if [ -f "${output_dir_top}/mirtrace.config" ]; then
  echo "mirtrace.config already exists. Nothing to do."

else

  for fastq in "${fastq_array[@]}"
  do
    # Use first three parts of filename to create short sample name
    sample_name=$(echo "${fastq##*/}" | awk -F "-" '{print $1"-"$2"-"$3}')
    echo "${fastq},${sample_name}"
  done >> "${output_dir_top}/mirtrace.config"

fi

cat "${output_dir_top}/mirtrace.config"
```
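
Because an existing `mirtrace.config` is reused as-is, a quick sanity check on its contents can be worthwhile before running miRTrace. The sketch below is a hypothetical check, not a miRTrace feature: it lists any line that does not have exactly two non-empty, comma-separated fields. The function name and the demo file contents are assumptions for illustration.

```bash
#!/usr/bin/env bash

# Hypothetical check: print "line_number: line" for each malformed config line
find_bad_config_lines() {
  local config="$1"
  awk -F "," 'NF != 2 || $1 == "" || $2 == "" { print NR": "$0 }' "${config}"
}

# Demo on a temporary config with one well-formed and one malformed line
demo_config=$(mktemp)
printf '%s\n' "/path/to/sample-A-1-merged.fq.gz,sample-A-1" ",--_2" > "${demo_config}"

bad_lines=$(find_bad_config_lines "${demo_config}")
echo "${bad_lines}"

rm -f "${demo_config}"
```

In the real workflow the call would be `find_bad_config_lines "${output_dir_top}/mirtrace.config"`; no output means every line has both a FastQ path and a sample name.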

# Run [miRTrace](https://github.com/friedlanderlab/mirtrace)

```{r run-mirtrace, engine='bash', eval=TRUE}
# Load bash variables into memory
source .bashvars

time \
${programs_array[mirtrace]} trace \
--config ${output_dir_top}/mirtrace.config \
--write-fasta \
--num-threads ${threads} \
--output-dir ${output_dir_top} \
--force

tree -h ${output_dir_top}
```
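
After the run, it can help to confirm that the report files listed under Outputs were actually written. The following is a sketch only: `check_mirtrace_outputs` is a hypothetical helper, and the demo uses a temporary directory with placeholder files rather than real miRTrace output.

```bash
#!/usr/bin/env bash

# Hypothetical helper: report expected miRTrace report files that are
# missing or empty; the return status is the number of problem files.
check_mirtrace_outputs() {
  local out_dir="$1"
  local missing=0
  local f
  for f in mirtrace-report.html \
           mirtrace-stats-contamination_basic.tsv \
           mirtrace-stats-contamination_detailed.tsv
  do
    if [ ! -s "${out_dir}/${f}" ]; then
      echo "Missing or empty: ${f}"
      missing=$((missing + 1))
    fi
  done
  return "${missing}"
}

# Demo: a temporary directory holding only two of the three reports
demo_dir=$(mktemp -d)
echo "placeholder" > "${demo_dir}/mirtrace-report.html"
echo "placeholder" > "${demo_dir}/mirtrace-stats-contamination_basic.tsv"

report=$(check_mirtrace_outputs "${demo_dir}" || true)
echo "${report}"

rm -rf "${demo_dir}"
```

In this workflow the helper would be pointed at `${output_dir_top}` after sourcing `.bashvars`.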

# Results

## Read in table as data frame

```{r read-table, eval=TRUE}
mirtrace.detailed.df <- read.csv("../output/09.2-Peve-sRNAseq-miRTrace-31bp-fastp-merged/mirtrace-stats-contamination_detailed.tsv", sep = "\t", header = TRUE)

str(mirtrace.detailed.df)
```

## Number of samples with matches

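IMPORTANT: Change the prefix passed to `starts_with()` to match the sample names of your input sRNAseq reads! Note that `read.csv()` converts `-` to `.` in column names, so a sample named `sRNA-POR-73` appears as a column beginning with `sRNA.POR.73`.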

```{r counts-samples-with-matches, eval=TRUE}
# Select columns corresponding to sample names
sample_columns <- mirtrace.detailed.df %>%
  select(starts_with("sRNA.POR."))

# Calculate the sum for each column
sample_sums <- colSums(sample_columns)

# Count the number of columns with a sum greater than 0
samples_with_sum_gt_0 <- sum(sample_sums > 0)

paste("Number of samples with matches: ", samples_with_sum_gt_0)
```

## Percentage of samples with matches

```{r percentage-samples-with-matches, eval=TRUE}
# Total number of samples (columns)
total_samples <- ncol(sample_columns)

# Percentage of samples with sums greater than 0
percentage_samples_gt_0 <- (samples_with_sum_gt_0 / total_samples) * 100

paste("Percentage of samples with matches: ", percentage_samples_gt_0)
```
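
The same tallies can be cross-checked from the shell. This is a sketch, assuming the detailed report keeps its four annotation columns (`CLADE`, `FAMILY_ID`, `MIRBASE_IDS`, `SEQ`) first and the per-sample counts from column 5 onward, as reported by `str()`. The function name and the demo table are made up for illustration.

```bash
#!/usr/bin/env bash

# Hypothetical cross-check: count sample columns (column 5 onward) whose
# summed counts are > 0, and report that as a percentage of all samples.
count_samples_with_matches() {
  local tsv="$1"
  awk -F "\t" '
    NR == 1 { n_cols = NF; next }                 # header row: record column count
    { for (i = 5; i <= NF; i++) sums[i] += $i }   # accumulate per-sample totals
    END {
      total = (n_cols >= 5) ? n_cols - 4 : 0
      hits = 0
      for (i = 5; i <= n_cols; i++) if (sums[i] > 0) hits++
      pct = (total > 0) ? 100 * hits / total : 0
      printf "%d of %d samples with matches (%.1f%%)\n", hits, total, pct
    }
  ' "${tsv}"
}

# Demo table with placeholder sample names and counts
demo_tsv=$(mktemp)
printf 'CLADE\tFAMILY_ID\tMIRBASE_IDS\tSEQ\tsample_A\tsample_B\tsample_C\n' > "${demo_tsv}"
printf 'clade_A\t1\tid-1\tACGT\t0\t1\t0\n' >> "${demo_tsv}"

summary=$(count_samples_with_matches "${demo_tsv}")
echo "${summary}"

rm -f "${demo_tsv}"
```

Unlike the R code, this does not depend on a sample-name prefix, but it does depend on the column order staying fixed.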

## Number of clades with matches

```{r distinct-clades, eval=TRUE, class.source = 'fold-hide'}
unique_clade_count <- mirtrace.detailed.df %>%
  distinct(CLADE) %>%    # Get unique entries in CLADE column
  count()                # Count the number of unique entries

paste("Number of clades with matches:", unique_clade_count)
```
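
A shell equivalent of the distinct-clade count, offered as a sketch: it assumes `CLADE` is the first column of the tab-delimited detailed report and that the first row is a header. The function name and the demo rows are hypothetical.

```bash
#!/usr/bin/env bash

# Hypothetical helper: count distinct values in column 1, skipping the header
count_clades_with_matches() {
  local tsv="$1"
  awk -F "\t" 'NR > 1 && !seen[$1]++ { n++ } END { print n + 0 }' "${tsv}"
}

# Demo table: three rows covering two placeholder clades
demo_tsv=$(mktemp)
printf 'CLADE\tFAMILY_ID\n' > "${demo_tsv}"
printf 'clade_A\t1\nclade_A\t2\nclade_B\t3\n' >> "${demo_tsv}"

n_clades=$(count_clades_with_matches "${demo_tsv}")
echo "Number of clades with matches: ${n_clades}"

rm -f "${demo_tsv}"
```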

## [miRTrace](https://github.com/friedlanderlab/mirtrace) table

Counts \> 0 are highlighted in green to make them easier to see.

```{r mirtrace-output-table, eval=TRUE, class.source = 'fold-hide'}

mirtrace.detailed.df %>%
  mutate(
    across(
      starts_with("sRNA"),
      ~cell_spec(
        .,
        background = ifelse(
          . > 0,
          "lightgreen",
          "white"
          )
        )
      )
    ) %>%
  kable(escape = FALSE, caption = "Clades identified as having sRNAseq matches.") %>%
  kable_styling("striped") %>% 
  scroll_box(width = "100%", height = "500px")
```

------------------------------------------------------------------------

# Citations