Use miRDeep2 (Friedländer et al. 2011) to identify potential miRNAs using A.pulchra sRNAseq reads. The A.millepora genome will be used as the reference genome for A.pulchra, as A.pulchra does not currently have a sequenced genome and A.millepora had highest alignment rates for standard RNAseq data compared to other published genomes tested. The mature miRBase miRNA database will also be used.
Inputs:
Requires collapsed reads (i.e. concatenated, unique reads) in FastA format. See 10.1-Apul-sRNAseq-BLASTn-31bp-fastp-merged-cnidarian_miRBase.Rmd for code.
A.millepora genome FastA. See 12-Apul-sRNAseq-MirMachine.Rmd for download info if needed.
Utilizes a modified version, which includes cnidarian miRNA culled from literature by Jill Ahsley.
Outputs:
Primary outputs are a result table in BED, CSV (tab-delimited), and HTML formats.
Due to the nature of mirDeep2’s naming, trying to use variable names is challenging. As such, the chunks processing those files will require manual intervention to identify and provide the output filename(s); they are not handled at in the .bashvars
file at the top of this script.
Other output files are too large for GitHub (and some (all?) are not needed for this analysis). Please find a full backup here:
This allows usage of Bash variables across R Markdown chunks.
{
echo "#### Assign Variables ####"
echo ""
echo "# Trimmed FastQ naming pattern"
echo "export trimmed_fastqs_pattern='*fastp-adapters-polyG-31bp-merged.fq.gz'"
echo ""
echo "# Data directories"
echo 'export deep_dive_dir=/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive'
echo 'export deep_dive_data_dir="${deep_dive_dir}/data"'
echo 'export output_dir_top=${deep_dive_dir}/D-Apul/output/11.1-Apul-sRNAseq-miRdeep2-31bp-fastp-merged'
echo 'export genome_fasta_dir=${deep_dive_dir}/D-Apul/data/Amil/ncbi_dataset/data/GCF_013753865.1'
echo 'export trimmed_fastqs_dir="${deep_dive_dir}/D-Apul/output/08.2-Apul-sRNAseq-trimming-31bp-fastp-merged/trimmed-reads"'
echo 'export collapsed_reads_dir="${deep_dive_dir}/D-Apul/output/10.1-Apul-sRNAseq-BLASTn-31bp-fastp-merged-cnidarian_miRBase"'
echo ""
echo "# Input/Output files"
echo 'export collapsed_reads_fasta="collapsed-reads-all.fasta"'
echo 'export collapsed_reads_mirdeep2="collapsed-reads-all-mirdeep2.fasta"'
echo 'export collapsed_reads_gt17bp_mirdeep2="collapsed_reads_gt17bp_mirdeep2.fasta"'
echo 'export concatenated_trimmed_reads_fastq="concatenated-trimmed-reads-all.fastq.gz"'
echo 'export genome_fasta_name="GCF_013753865.1_Amil_v2.1_genomic.fna"'
echo 'export genome_fasta_no_spaces="GCF_013753865.1_Amil_v2.1_genomic-no_spaces.fna"'
echo 'export mirdeep2_mapping_file="Apul-mirdeep2-mapping.arf"'
echo 'export mirbase_mature_fasta_name="cnidarian-mirbase-mature-v22.1.fasta"'
echo 'export mirbase_mature_fasta_no_spaces="cnidarian-mirbase-mature-v22.1-no_spaces.fa"'
echo ""
echo "# Paths to programs"
echo 'export mirdeep2_mapper="mapper.pl"'
echo 'export mirdeep2="miRDeep2.pl"'
echo 'export mirdeep2_fastaparse="fastaparse.pl"'
echo 'export bowtie_build="/home/shared/bowtie-1.3.1-linux-x86_64/bowtie-build"'
echo ""
echo "# Set number of CPUs to use"
echo 'export threads=40'
echo ""
} > .bashvars
cat .bashvars
#### Assign Variables ####
# Trimmed FastQ naming pattern
export trimmed_fastqs_pattern='*fastp-adapters-polyG-31bp-merged.fq.gz'
# Data directories
export deep_dive_dir=/home/shared/8TB_HDD_01/sam/gitrepos/deep-dive
export deep_dive_data_dir="${deep_dive_dir}/data"
export output_dir_top=${deep_dive_dir}/D-Apul/output/11.1-Apul-sRNAseq-miRdeep2-31bp-fastp-merged
export genome_fasta_dir=${deep_dive_dir}/D-Apul/data/Amil/ncbi_dataset/data/GCF_013753865.1
export trimmed_fastqs_dir="${deep_dive_dir}/D-Apul/output/08.2-Apul-sRNAseq-trimming-31bp-fastp-merged/trimmed-reads"
export collapsed_reads_dir="${deep_dive_dir}/D-Apul/output/10.1-Apul-sRNAseq-BLASTn-31bp-fastp-merged-cnidarian_miRBase"
# Input/Output files
export collapsed_reads_fasta="collapsed-reads-all.fasta"
export collapsed_reads_mirdeep2="collapsed-reads-all-mirdeep2.fasta"
export collapsed_reads_gt17bp_mirdeep2="collapsed_reads_gt17bp_mirdeep2.fasta"
export concatenated_trimmed_reads_fastq="concatenated-trimmed-reads-all.fastq.gz"
export genome_fasta_name="GCF_013753865.1_Amil_v2.1_genomic.fna"
export genome_fasta_no_spaces="GCF_013753865.1_Amil_v2.1_genomic-no_spaces.fna"
export mirdeep2_mapping_file="Apul-mirdeep2-mapping.arf"
export mirbase_mature_fasta_name="cnidarian-mirbase-mature-v22.1.fasta"
export mirbase_mature_fasta_no_spaces="cnidarian-mirbase-mature-v22.1-no_spaces.fa"
# Paths to programs
export mirdeep2_mapper="mapper.pl"
export mirdeep2="miRDeep2.pl"
export mirdeep2_fastaparse="fastaparse.pl"
export bowtie_build="/home/shared/bowtie-1.3.1-linux-x86_64/bowtie-build"
# Set number of CPUs to use
export threads=40
Per miRDeep2 documentation:
The readID must end with _xNumber and is not allowed to contain whitespaces. has to have the format name_uniqueNumber_xnumber
# Load bash variables into memory
source .bashvars
# Append miRDeep2 to system PATH and set PERL5LIB
export PATH=$PATH:/home/shared/mirdeep2/bin
export PERL5LIB=$PERL5LIB:/home/shared/mirdeep2/lib/perl5
mkdir --parents "${output_dir_top}"
sed '/^>/ s/-/_x/g' "${collapsed_reads_dir}/${collapsed_reads_fasta}" \
| sed '/^>/ s/>/>seq_/' \
> "${output_dir_top}/${collapsed_reads_mirdeep2}"
# Filter for reads at least 17 bases long
# Min. read length required for MirDeep2
${mirdeep2_fastaparse} \
"${output_dir_top}/${collapsed_reads_mirdeep2}" \
-a 17 \
> "${output_dir_top}/${collapsed_reads_gt17bp_mirdeep2}" \
2> /dev/null
grep "^>" ${collapsed_reads_dir}/${collapsed_reads_fasta} \
| head
echo ""
echo "--------------------------------------------------"
echo ""
grep "^>" ${output_dir_top}/${collapsed_reads_gt17bp_mirdeep2} \
| head
>1-1843196
>2-1083086
>3-967720
>4-832824
>5-750080
>6-691325
>7-659957
>8-606529
>9-571297
>10-528956
--------------------------------------------------
>seq_1_x1843196
>seq_2_x1083086
>seq_3_x967720
>seq_4_x832824
>seq_5_x750080
>seq_6_x691325
>seq_7_x659957
>seq_8_x606529
>seq_9_x571297
>seq_10_x528956
miRDeep2 can’t process genome FastAs with spaces in the description lines.
So, I’m replacing spaces with underscores.
And, for aesthetics, I’m also removing commas.
# Load bash variables into memory
source .bashvars
sed '/^>/ s/ /_/g' "${genome_fasta_dir}/${genome_fasta_name}" \
| sed '/^>/ s/,//g' \
> "${genome_fasta_dir}/${genome_fasta_no_spaces}"
grep "^>" ${genome_fasta_dir}/${genome_fasta_name} \
| head
echo ""
echo "--------------------------------------------------"
echo ""
grep "^>" ${genome_fasta_dir}/${genome_fasta_no_spaces} \
| head
>NC_058066.1 Acropora millepora isolate JS-1 chromosome 1, Amil_v2.1, whole genome shotgun sequence
>NC_058067.1 Acropora millepora isolate JS-1 chromosome 2, Amil_v2.1, whole genome shotgun sequence
>NC_058068.1 Acropora millepora isolate JS-1 chromosome 3, Amil_v2.1, whole genome shotgun sequence
>NC_058069.1 Acropora millepora isolate JS-1 chromosome 4, Amil_v2.1, whole genome shotgun sequence
>NC_058070.1 Acropora millepora isolate JS-1 chromosome 5, Amil_v2.1, whole genome shotgun sequence
>NC_058071.1 Acropora millepora isolate JS-1 chromosome 6, Amil_v2.1, whole genome shotgun sequence
>NC_058072.1 Acropora millepora isolate JS-1 chromosome 7, Amil_v2.1, whole genome shotgun sequence
>NC_058073.1 Acropora millepora isolate JS-1 chromosome 8, Amil_v2.1, whole genome shotgun sequence
>NC_058074.1 Acropora millepora isolate JS-1 chromosome 9, Amil_v2.1, whole genome shotgun sequence
>NC_058075.1 Acropora millepora isolate JS-1 chromosome 10, Amil_v2.1, whole genome shotgun sequence
--------------------------------------------------
>NC_058066.1_Acropora_millepora_isolate_JS-1_chromosome_1_Amil_v2.1_whole_genome_shotgun_sequence
>NC_058067.1_Acropora_millepora_isolate_JS-1_chromosome_2_Amil_v2.1_whole_genome_shotgun_sequence
>NC_058068.1_Acropora_millepora_isolate_JS-1_chromosome_3_Amil_v2.1_whole_genome_shotgun_sequence
>NC_058069.1_Acropora_millepora_isolate_JS-1_chromosome_4_Amil_v2.1_whole_genome_shotgun_sequence
>NC_058070.1_Acropora_millepora_isolate_JS-1_chromosome_5_Amil_v2.1_whole_genome_shotgun_sequence
>NC_058071.1_Acropora_millepora_isolate_JS-1_chromosome_6_Amil_v2.1_whole_genome_shotgun_sequence
>NC_058072.1_Acropora_millepora_isolate_JS-1_chromosome_7_Amil_v2.1_whole_genome_shotgun_sequence
>NC_058073.1_Acropora_millepora_isolate_JS-1_chromosome_8_Amil_v2.1_whole_genome_shotgun_sequence
>NC_058074.1_Acropora_millepora_isolate_JS-1_chromosome_9_Amil_v2.1_whole_genome_shotgun_sequence
>NC_058075.1_Acropora_millepora_isolate_JS-1_chromosome_10_Amil_v2.1_whole_genome_shotgun_sequence
# Load bash variables into memory
source .bashvars
sed '/^>/ s/ /_/g' "${deep_dive_data_dir}/${mirbase_mature_fasta_name}" \
| sed '/^>/ s/,//g' \
> "${deep_dive_data_dir}/${mirbase_mature_fasta_no_spaces}"
grep "^>" ${deep_dive_data_dir}/${mirbase_mature_fasta_name} \
| head
echo ""
echo "--------------------------------------------------"
echo ""
grep "^>" ${deep_dive_data_dir}/${mirbase_mature_fasta_no_spaces} \
| head
>cel-let-7-5p MIMAT0000001 Caenorhabditis elegans let-7-5p
>cel-let-7-3p MIMAT0015091 Caenorhabditis elegans let-7-3p
>cel-lin-4-5p MIMAT0000002 Caenorhabditis elegans lin-4-5p
>cel-lin-4-3p MIMAT0015092 Caenorhabditis elegans lin-4-3p
>cel-miR-1-5p MIMAT0020301 Caenorhabditis elegans miR-1-5p
>cel-miR-1-3p MIMAT0000003 Caenorhabditis elegans miR-1-3p
>cel-miR-2-5p MIMAT0020302 Caenorhabditis elegans miR-2-5p
>cel-miR-2-3p MIMAT0000004 Caenorhabditis elegans miR-2-3p
>cel-miR-34-5p MIMAT0000005 Caenorhabditis elegans miR-34-5p
>cel-miR-34-3p MIMAT0015093 Caenorhabditis elegans miR-34-3p
--------------------------------------------------
>cel-let-7-5p_MIMAT0000001_Caenorhabditis_elegans_let-7-5p
>cel-let-7-3p_MIMAT0015091_Caenorhabditis_elegans_let-7-3p
>cel-lin-4-5p_MIMAT0000002_Caenorhabditis_elegans_lin-4-5p
>cel-lin-4-3p_MIMAT0015092_Caenorhabditis_elegans_lin-4-3p
>cel-miR-1-5p_MIMAT0020301_Caenorhabditis_elegans_miR-1-5p
>cel-miR-1-3p_MIMAT0000003_Caenorhabditis_elegans_miR-1-3p
>cel-miR-2-5p_MIMAT0020302_Caenorhabditis_elegans_miR-2-5p
>cel-miR-2-3p_MIMAT0000004_Caenorhabditis_elegans_miR-2-3p
>cel-miR-34-5p_MIMAT0000005_Caenorhabditis_elegans_miR-34-5p
>cel-miR-34-3p_MIMAT0015093_Caenorhabditis_elegans_miR-34-3p
miRDeep2 requires a Bowtie v1 genome index - cannot use Bowtie2 index
# Load bash variables into memory
source .bashvars
# Check for existence of genome index first
if [ ! -f "${genome_fasta_no_spaces%.*}.[0-9].ebwt" ]; then
${bowtie_build} \
${genome_fasta_dir}/${genome_fasta_no_spaces} \
${genome_fasta_dir}/${genome_fasta_no_spaces%.*} \
--threads ${threads} \
--quiet
fi
Requires genome to be previously indexed with Bowtie.
Additionally, requires user to enter path to their mirdeep2
directory as well
as their perl5
installation.
# Load bash variables into memory
source .bashvars
# Append miRDeep2 to system PATH and set PERL5LIB
export PATH=$PATH:/home/shared/mirdeep2/bin
export PERL5LIB=$PERL5LIB:/home/shared/mirdeep2/lib/perl5
# Run miRDeep2 mapping
time \
${mirdeep2_mapper} \
${output_dir_top}/${collapsed_reads_gt17bp_mirdeep2} \
-c \
-p ${genome_fasta_dir}/${genome_fasta_no_spaces%.*} \
-t ${output_dir_top}/${mirdeep2_mapping_file} \
-o ${threads}
Recommendation is to use the closest related species in the miRDeep2 options, even if the species isn’t very closely related. The documentation indicates that miRDeep2 is always more accurate when at least a species is provided.
The options provided to the command are as follows:
none
: Known miRNAs of the species being analyzed.-t S.pupuratus
: Related species.none
: Known miRNA precursors in this species.-P
: Specifies miRBase version > 18.-v
: Remove temporary files after completion.-g -1
: Number of precursors to analyze. A setting of -1
will analyze all. Default is 50,000
I set this to -1
after multiple attempts to run using the default kept failing.NOTE: This will take an extremely long time to run (days). Could possibly be shortened by
excluding randfold
analysis.
# Load bash variables into memory
source .bashvars
# Append miRDeep2 to system PATH and set PERL5LIB
export PATH=$PATH:/home/shared/mirdeep2/bin
export PERL5LIB=$PERL5LIB:/home/shared/mirdeep2/lib/perl5
time \
${mirdeep2} \
${output_dir_top}/${collapsed_reads_gt17bp_mirdeep2} \
${genome_fasta_dir}/${genome_fasta_no_spaces} \
${output_dir_top}/${mirdeep2_mapping_file} \
none \
${deep_dive_data_dir}/${mirbase_mature_fasta_no_spaces} \
none \
-t S.purpuratus \
-P \
-v \
-g -1 \
2>${output_dir_top}/miRDeep2-S.purpuratus-report.log
# Load bash variables into memory
source .bashvars
tail -n 6 ${output_dir_top}/miRDeep2-S.purpuratus-report.log
miRDeep runtime:
started: 13:0:39
ended: 12:37:10
total:23h:36m:31s
MiRDeep2 outputs all files to the current working directly with no way to redirect so want to move to intended output directory.
Output files will be in the format of result_*
and error_*
# Load bash variables into memory
source .bashvars
for file in result_* error_*
do
mv "${file}" "${output_dir_top}/"
done
# Load bash variables into memory
source .bashvars
ls -l | grep "^d"
drwxrwxr-x 3 sam sam 4096 Oct 30 14:00 07-Apul-lncRNA-dist_files
drwxr-xr-x 4 sam sam 4096 Feb 12 10:38 08.1-Dapul-sRNAseq-trimming-R1-only_cache
drwxr-xr-x 4 sam sam 4096 Apr 1 14:06 10.1-Apul-sRNAseq-BLASTn-31bp-fastp-merged-cnidarian_miRBase_cache
drwxr-xr-x 4 sam sam 4096 Nov 7 14:29 10-Apul-sRNAseq-BLASTn_cache
drwxr-xr-x 4 sam sam 4096 Apr 19 12:22 11.1-Apul-sRNAseq-miRdeep2-31bp-fastp-merged_cache
drwxr-xr-x 3 sam sam 4096 Nov 2 13:10 12-Apul-sRNAseq-MirMachine_cache
drwxr-xr-x 4 sam sam 4096 Feb 15 13:35 13.1.1-Apul-sRNAseq-ShortStack-R1-reads-cnidarian_miRBase_cache
drwxr-xr-x 4 sam sam 4096 Feb 13 11:20 13.1-Apul-sRNAseq-ShortStack-R1-reads_cache
drwxr-xr-x 4 sam sam 4096 Feb 15 12:27 13.2.1-Apul-sRNAseq-ShortStack-31bp-fastp-merged-cnidarian_miRBase_cache
drwxr-xr-x 4 sam sam 4096 Feb 15 08:39 13.2-Apul-sRNAseq-ShortStack-31bp-fastp-merged_cache
drwxr-xr-x 4 sam sam 4096 Nov 5 17:39 13-Apul-sRNAseq-ShortStack_cache
drwxrwxr-x 3 sam sam 4096 May 5 2023 rsconnect
# Load bash variables into memory
source .bashvars
rsync -aP . --include='dir***' --exclude='*' --quiet "${output_dir_top}/"
rsync -aP . --include='map***' --exclude='*' --quiet "${output_dir_top}/"
rsync -aP . --exclude='mirgene*' --include='mir***' --exclude='*' --quiet "${output_dir_top}/"
rsync -aP . --include='pdfs***' --exclude='*' --quiet "${output_dir_top}/"
echo ""
echo "Check new location:"
echo ""
ls -l "${output_dir_top}/" | grep "^d"
FYI - eval=FALSE
is set because the following command will only work
once…
ls --directory dir_* mapper_logs mirdeep_runs mirna_results* pdfs_*
# Load bash variables into memory
source .bashvars
rm -rf dir_* mapper_logs mirdeep_runs mirna_results* pdfs_*
ls --directory dir_* mapper_logs mirdeep_runs mirna_results* pdfs_*
ls: cannot access 'dir_*': No such file or directory
ls: cannot access 'mapper_logs': No such file or directory
ls: cannot access 'mirdeep_runs': No such file or directory
ls: cannot access 'mirna_results*': No such file or directory
ls: cannot access 'pdfs_*': No such file or directory
The formatting of this CSV is terrible. It has a 3-column table on top of a 17-column table. This makes parsing a bit of a pain in its raw format.
# Load bash variables into memory
source .bashvars
head -n 30 "${output_dir_top}/result_03_04_2024_t_13_00_39.csv"
miRDeep2 score estimated signal-to-noise excision gearing
10 7.3 1
9 7.2 1
8 7.3 1
7 7.3 1
6 7.2 1
5 6.8 1
4 5.8 1
3 4.7 1
2 3.1 1
1 2.4 1
0 2.1 1
-1 1.6 1
-2 1.2 1
-3 1 1
-4 1 1
-5 1 1
-6 1 1
-7 1 1
-8 1 1
-9 1 1
-10 1 1
novel miRNAs predicted by miRDeep2
provisional id miRDeep2 score estimated probability that the miRNA candidate is a true positive rfam alert total read count mature read count loop read count star read count significant randfold p-value miRBase miRNA example miRBase miRNA with the same seed UCSC browser NCBI blastn consensus mature sequence consensus star sequence consensus precursor sequence precursor coordinate
NC_058066.1_Acropora_millepora_isolate_JS-1_chromosome_1_Amil_v2.1_whole_genome_shotgun_sequence_49320 127017.5 - 249134 248655 0 479 no - nve-miR-9437_MIMAT0035398_Nematostella_vectensis_miR-9437 - - uuaacgaguagauaaaugaagaga cuucguuuauucacucguucaua cuucguuuauucacucguucauauuuauuauuaacgaguagauaaaugaagaga NC_058066.1_Acropora_millepora_isolate_JS-1_chromosome_1_Amil_v2.1_whole_genome_shotgun_sequence:20346247..20346301:-
NC_058070.1_Acropora_millepora_isolate_JS-1_chromosome_5_Amil_v2.1_whole_genome_shotgun_sequence_208678 40616.1 - 79661 79121 0 540 no - hsa-miR-4330_MIMAT0016924_Homo_sapiens_miR-4330 - - ucucagauuacaguaguuaagu cuuaacuacugcaaucugaacau ucucagauuacaguaguuaaguaauuaaauacuuaacuacugcaaucugaacau NC_058070.1_Acropora_millepora_isolate_JS-1_chromosome_5_Amil_v2.1_whole_genome_shotgun_sequence:11598966..11599020:+
NC_058068.1_Acropora_millepora_isolate_JS-1_chromosome_3_Amil_v2.1_whole_genome_shotgun_sequence_123927 40339.4 - 79122 73935 0 5187 no - bdi-miR7733-3p_MIMAT0030201_Brachypodium_distachyon_miR7733-3p - - ucugcguuaucggugaaauugu auuucacuagauaagcgcuaac auuucacuagauaagcgcuaacuguuauuguucugcguuaucggugaaauugu NC_058068.1_Acropora_millepora_isolate_JS-1_chromosome_3_Amil_v2.1_whole_genome_shotgun_sequence:597846..597899:+
This will match the line beginning with provisional id
and print to
the end of the file (represented by the $p
. $
= end, p
= print)
# Load bash variables into memory
source .bashvars
sed --quiet '/provisional id/,$p' "${output_dir_top}/result_03_04_2024_t_13_00_39.csv" \
> "${output_dir_top}/parsable-result_03_04_2024_t_13_00_39.csv"
head "${output_dir_top}/parsable-result_03_04_2024_t_13_00_39.csv"
provisional id miRDeep2 score estimated probability that the miRNA candidate is a true positive rfam alert total read count mature read count loop read count star read count significant randfold p-value miRBase miRNA example miRBase miRNA with the same seed UCSC browser NCBI blastn consensus mature sequence consensus star sequence consensus precursor sequence precursor coordinate
NC_058066.1_Acropora_millepora_isolate_JS-1_chromosome_1_Amil_v2.1_whole_genome_shotgun_sequence_49320 127017.5 - 249134 248655 0 479 no - nve-miR-9437_MIMAT0035398_Nematostella_vectensis_miR-9437 - - uuaacgaguagauaaaugaagaga cuucguuuauucacucguucaua cuucguuuauucacucguucauauuuauuauuaacgaguagauaaaugaagaga NC_058066.1_Acropora_millepora_isolate_JS-1_chromosome_1_Amil_v2.1_whole_genome_shotgun_sequence:20346247..20346301:-
NC_058070.1_Acropora_millepora_isolate_JS-1_chromosome_5_Amil_v2.1_whole_genome_shotgun_sequence_208678 40616.1 - 79661 79121 0 540 no - hsa-miR-4330_MIMAT0016924_Homo_sapiens_miR-4330 - - ucucagauuacaguaguuaagu cuuaacuacugcaaucugaacau ucucagauuacaguaguuaaguaauuaaauacuuaacuacugcaaucugaacau NC_058070.1_Acropora_millepora_isolate_JS-1_chromosome_5_Amil_v2.1_whole_genome_shotgun_sequence:11598966..11599020:+
NC_058068.1_Acropora_millepora_isolate_JS-1_chromosome_3_Amil_v2.1_whole_genome_shotgun_sequence_123927 40339.4 - 79122 73935 0 5187 no - bdi-miR7733-3p_MIMAT0030201_Brachypodium_distachyon_miR7733-3p - - ucugcguuaucggugaaauugu auuucacuagauaagcgcuaac auuucacuagauaagcgcuaacuguuauuguucugcguuaucggugaaauugu NC_058068.1_Acropora_millepora_isolate_JS-1_chromosome_3_Amil_v2.1_whole_genome_shotgun_sequence:597846..597899:+
NC_058071.1_Acropora_millepora_isolate_JS-1_chromosome_6_Amil_v2.1_whole_genome_shotgun_sequence_256047 39769.6 - 78001 65898 0 12103 no - nve-miR-2023-3p_MIMAT0009756_Nematostella_vectensis_miR-2023-3p - - aaagaaguacaagugguaggg cugcuacuuggacuuuuuuca cugcuacuuggacuuuuuucacguuuaucgaugaaagaaguacaagugguaggg NC_058071.1_Acropora_millepora_isolate_JS-1_chromosome_6_Amil_v2.1_whole_genome_shotgun_sequence:8840769..8840823:+
NC_058068.1_Acropora_millepora_isolate_JS-1_chromosome_3_Amil_v2.1_whole_genome_shotgun_sequence_123929 38710.3 - 75924 73935 0 1989 no - bdi-miR7733-3p_MIMAT0030201_Brachypodium_distachyon_miR7733-3p - - ucugcguuaucggugaaauugu auuucacuagaugagcgcuaac auuucacuagaugagcgcuaacuguuauuguucugcguuaucggugaaauugu NC_058068.1_Acropora_millepora_isolate_JS-1_chromosome_3_Amil_v2.1_whole_genome_shotgun_sequence:598142..598195:+
NC_058073.1_Acropora_millepora_isolate_JS-1_chromosome_8_Amil_v2.1_whole_genome_shotgun_sequence_341478 25654.9 - 50316 49080 0 1236 no - mtr-miR2651_MIMAT0013423_Medicago_truncatula_miR2651 - - auugauuguagacaagccucuga aggguauugucuaugaucaaaaa aggguauugucuaugaucaaaaauuuuauuauugauuguagacaagccucuga NC_058073.1_Acropora_millepora_isolate_JS-1_chromosome_8_Amil_v2.1_whole_genome_shotgun_sequence:11426819..11426872:-
NC_058075.1_Acropora_millepora_isolate_JS-1_chromosome_10_Amil_v2.1_whole_genome_shotgun_sequence_406883 20352.5 - 39917 39897 0 20 no - mmu-miR-7664-5p_MIMAT0029834_Mus_musculus_miR-7664-5p - - uaguuuacaggcucaacuuugcu ggaguugaucuuguaaac uaguuuacaggcucaacuuugcuuuaaagggaguugaucuuguaaac NC_058075.1_Acropora_millepora_isolate_JS-1_chromosome_10_Amil_v2.1_whole_genome_shotgun_sequence:15858707..15858754:-
NC_058067.1_Acropora_millepora_isolate_JS-1_chromosome_2_Amil_v2.1_whole_genome_shotgun_sequence_81558 14705.5 - 28847 27971 0 876 no - - - - uaggcguauuuccgauugucc gacagccggaacuacgucugu uaggcguauuuccgauuguccuuuuacuacgacagccggaacuacgucugu NC_058067.1_Acropora_millepora_isolate_JS-1_chromosome_2_Amil_v2.1_whole_genome_shotgun_sequence:21549048..21549099:+
NC_058077.1_Acropora_millepora_isolate_JS-1_chromosome_12_Amil_v2.1_whole_genome_shotgun_sequence_459486 12860.5 - 25227 25162 0 65 no - - - - agugcacuuuucucaggaugaa cuucuugagaaaauuucaccca agugcacuuuucucaggaugaauugaauucuucuugagaaaauuucaccca NC_058077.1_Acropora_millepora_isolate_JS-1_chromosome_12_Amil_v2.1_whole_genome_shotgun_sequence:21629096..21629147:+
This chunk provides a more concise overview of the data and it’s columns.
mirdeep_result.df <- read.csv("../output/11.1-Apul-sRNAseq-miRdeep2-31bp-fastp-merged/parsable-result_03_04_2024_t_13_00_39.csv",
header = TRUE,
sep = "\t")
str(mirdeep_result.df)
'data.frame': 896 obs. of 17 variables:
$ provisional.id : chr "NC_058066.1_Acropora_millepora_isolate_JS-1_chromosome_1_Amil_v2.1_whole_genome_shotgun_sequence_49320" "NC_058070.1_Acropora_millepora_isolate_JS-1_chromosome_5_Amil_v2.1_whole_genome_shotgun_sequence_208678" "NC_058068.1_Acropora_millepora_isolate_JS-1_chromosome_3_Amil_v2.1_whole_genome_shotgun_sequence_123927" "NC_058071.1_Acropora_millepora_isolate_JS-1_chromosome_6_Amil_v2.1_whole_genome_shotgun_sequence_256047" ...
$ miRDeep2.score : num 127018 40616 40339 39770 38710 ...
$ estimated.probability.that.the.miRNA.candidate.is.a.true.positive: logi NA NA NA NA NA NA ...
$ rfam.alert : chr "-" "-" "-" "-" ...
$ total.read.count : int 249134 79661 79122 78001 75924 50316 39917 28847 25227 24755 ...
$ mature.read.count : int 248655 79121 73935 65898 73935 49080 39897 27971 25162 24435 ...
$ loop.read.count : int 0 0 0 0 0 0 0 0 0 0 ...
$ star.read.count : int 479 540 5187 12103 1989 1236 20 876 65 320 ...
$ significant.randfold.p.value : chr "no" "no" "no" "no" ...
$ miRBase.miRNA : chr "-" "-" "-" "-" ...
$ example.miRBase.miRNA.with.the.same.seed : chr "nve-miR-9437_MIMAT0035398_Nematostella_vectensis_miR-9437" "hsa-miR-4330_MIMAT0016924_Homo_sapiens_miR-4330" "bdi-miR7733-3p_MIMAT0030201_Brachypodium_distachyon_miR7733-3p" "nve-miR-2023-3p_MIMAT0009756_Nematostella_vectensis_miR-2023-3p" ...
$ UCSC.browser : chr "-" "-" "-" "-" ...
$ NCBI.blastn : chr "-" "-" "-" "-" ...
$ consensus.mature.sequence : chr "uuaacgaguagauaaaugaagaga" "ucucagauuacaguaguuaagu" "ucugcguuaucggugaaauugu" "aaagaaguacaagugguaggg" ...
$ consensus.star.sequence : chr "cuucguuuauucacucguucaua" "cuuaacuacugcaaucugaacau" "auuucacuagauaagcgcuaac" "cugcuacuuggacuuuuuuca" ...
$ consensus.precursor.sequence : chr "cuucguuuauucacucguucauauuuauuauuaacgaguagauaaaugaagaga" "ucucagauuacaguaguuaaguaauuaaauacuuaacuacugcaaucugaacau" "auuucacuagauaagcgcuaacuguuauuguucugcguuaucggugaaauugu" "cugcuacuuggacuuuuuucacguuuaucgaugaaagaaguacaagugguaggg" ...
$ precursor.coordinate : chr "NC_058066.1_Acropora_millepora_isolate_JS-1_chromosome_1_Amil_v2.1_whole_genome_shotgun_sequence:20346247..20346301:-" "NC_058070.1_Acropora_millepora_isolate_JS-1_chromosome_5_Amil_v2.1_whole_genome_shotgun_sequence:11598966..11599020:+" "NC_058068.1_Acropora_millepora_isolate_JS-1_chromosome_3_Amil_v2.1_whole_genome_shotgun_sequence:597846..597899:+" "NC_058071.1_Acropora_millepora_isolate_JS-1_chromosome_6_Amil_v2.1_whole_genome_shotgun_sequence:8840769..8840823:+" ...
This provides some rudimentary numbers for the miRDeep2 output.
Further analysis is possibly desired to evaluate score thresholds, miRNA families, etc.
# Load bash variables into memory
source .bashvars
# Total predicted miRNAS
total_miRNAs=$(awk 'NR > 1' ${output_dir_top}/parsable-result_03_04_2024_t_13_00_39.csv \
| wc -l
)
echo "Total of predicted miRNAs: ${total_miRNAs}"
echo ""
# Matches to known mature miRNAs
mature_miRNAs=$(awk -F'\t' '$11 != "-" && $11 != "" {print $11}' ${output_dir_top}/parsable-result_03_04_2024_t_13_00_39.csv \
| wc -l
)
echo "Number of seed matches to known miRNAS: ${mature_miRNAs}"
echo ""
# Novel miRNAs
novel_miRNAs=$(awk -F "\t" '$11 == "-" || $11 == "" {print $11}' ${output_dir_top}/parsable-result_03_04_2024_t_13_00_39.csv \
| awk 'NR > 1' \
| wc -l
)
echo "Number of novel miRNAs: ${novel_miRNAs}"
Total of predicted miRNAs: 896
Number of seed matches to known miRNAS: 777
Number of novel miRNAs: 119