Nov17-update

lncRNA Discovery

  • Aligned RNA-seq reads to genome -> SAM files
  • Converted SAM files to BAM
  • Assembled BAMs into GTFs (StringTie), merged into a single GTF
  • Identified non-coding transcripts (GffCompare, CPC2)
  • Created FASTA and BED files for each species

lncRNA by the numbers

  • D-Apul total transcripts / lncRNA = 115492 / 16206 (14%)
  • E-Peve total transcripts / lncRNA = 74686 / 7378 (10%)
  • F-Pmea total transcripts / lncRNA = 77592 / 14307 (18%)

*Total = transcript count from the merged GTF
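
The percentages can be re-derived from the two counts on each line. A minimal sketch, where pct is a hypothetical helper name and not part of the pipeline:

```shell
# lncRNA share of merged-GTF transcripts, using the counts listed above.
# pct is a hypothetical helper name, not part of the pipeline.
pct() {
  # $1 = lncRNA count, $2 = total transcript count
  awk -v l="$1" -v t="$2" 'BEGIN { printf "%.0f%%\n", 100 * l / t }'
}

pct 16206 115492   # Apul
pct 7378 74686     # Peve
pct 14307 77592    # Pmea
```

This prints 14%, 10%, and 18%, matching the rounded values listed.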

lncRNA by the numbers

Apul

grep -v '^#' ../../D-Apul/output/05.33-lncRNA-discovery/stringtie_merged.gtf | cut -f3 | sort | uniq -c
 627332 exon
 115492 transcript

Peve

grep -v '^#' ../../E-Peve/output/05-lncRNA-discovery/stringtie_merged.gtf | cut -f3 | sort | uniq -c 
 467494 exon
  74686 transcript

Pmea

grep -v '^#' ../../F-Pmea/output/02-lncRNA-discovery/stringtie_merged.gtf | cut -f3 | sort | uniq -c 
 482050 exon
  77592 transcript
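
The three counts above come from the same command run on different paths. A sketch that wraps it in a loop, assuming it is run from the same working directory as the commands above; feature_counts is a hypothetical helper name:

```shell
# Count feature types (column 3) in a GTF/GFF, skipping comment lines.
# feature_counts is a hypothetical helper; the paths are the ones used above.
feature_counts() {
  grep -v '^#' "$1" | cut -f3 | sort | uniq -c
}

for gtf in \
  ../../D-Apul/output/05.33-lncRNA-discovery/stringtie_merged.gtf \
  ../../E-Peve/output/05-lncRNA-discovery/stringtie_merged.gtf \
  ../../F-Pmea/output/02-lncRNA-discovery/stringtie_merged.gtf
do
  # Skip quietly if run outside the repository layout.
  [ -f "$gtf" ] || continue
  echo "$gtf"
  feature_counts "$gtf"
done
```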

About the GFFs

Apul

grep -v '^#' ../../D-Apul/data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff | cut -f3 | sort | uniq
cDNA_match
CDS
exon
gene
guide_RNA
lnc_RNA
mRNA
pseudogene
region
rRNA
snoRNA
snRNA
transcript
tRNA

Peve

grep -v '^#' ../../E-Peve/data/Porites_evermanni_v1.annot.gff | cut -f3 | sort | uniq
CDS
mRNA
UTR

Pmea

grep -v '^#' ../../F-Pmea/data/Pocillopora_meandrina_HIv1.genes.gff3 | cut -f3 | sort | uniq
CDS
exon
transcript

What about this?

Apul

grep -v '^#' ../../D-Apul/data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff | cut -f3 | sort | uniq -c
  22387 cDNA_match
 317969 CDS
 390533 exon
  36904 gene
      1 guide_RNA
   6128 lnc_RNA
  41860 mRNA
   5871 pseudogene
    854 region
    283 rRNA
     62 snoRNA
    170 snRNA
   2066 transcript
   1413 tRNA
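
The reference annotation already carries lnc_RNA features, so it may be useful to pull them out for comparison with the discovered set. A minimal sketch, assuming a standard tab-delimited GFF3 with the feature type in column 3; extract_feature is a hypothetical helper name:

```shell
# Print all rows of a given feature type from a GFF, skipping comment lines.
# extract_feature is a hypothetical helper, not part of the pipeline.
extract_feature() {
  # $1 = GFF path, $2 = feature type (column 3), e.g. lnc_RNA
  awk -F'\t' -v f="$2" '!/^#/ && $3 == f' "$1"
}

gff=../../D-Apul/data/Amil/ncbi_dataset/data/GCF_013753865.1/genomic.gff
if [ -f "$gff" ]; then
  # Should match the lnc_RNA count reported above.
  extract_feature "$gff" lnc_RNA | awk 'END { print NR }'
fi
```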

sRNAseq Analysis

  • Trim reads to 25 bp
  • miRTrace
    • Identify taxonomic origins (clade level) of sRNAseq data
  • MirMachine
    • Identify potential miRNA homologs in genome (no sRNAseq data used)
  • ShortStack
    • Align sRNAseq data and annotate sRNA-producing genes
    • Uses sRNAseq, genome, and miRNA database (miRBase)
  • BLASTn
    • Align sRNAseq to miRNA databases (miRBase, MirGene)
  • miRDeep2
    • Uses sRNAseq, genome, and miRNA database (miRBase)
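
The notes do not say which trimmer was used for the 25 bp step. As a stand-in only, a fixed-length trim of an uncompressed FASTQ can be sketched in plain awk, assuming unwrapped 4-line records; hard_trim is a hypothetical helper name:

```shell
# Truncate sequence and quality lines of a 4-line-per-record FASTQ to N bases.
# hard_trim is a hypothetical stand-in, not the trimmer used in the pipeline.
hard_trim() {
  # $1 = maximum read length; FASTQ is read from stdin, written to stdout
  awk -v n="$1" '
    NR % 4 == 2 || NR % 4 == 0 { print substr($0, 1, n); next }  # seq / qual
    { print }                                                    # header / "+"
  '
}

# Example with one 30-base read:
printf '@r1\nACGTACGTACGTACGTACGTACGTACGTAC\n+\nIIIIIIIIIIIIIIIIIIIIIIIIIIIIII\n' |
  hard_trim 25
```

Reads shorter than the limit pass through unchanged.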

miRTrace

A.pulchra

Clades identified as having sRNAseq matches.
CLADE FAMILY_ID MIRBASE_IDS SEQ sRNA.ACR.140_1 sRNA.ACR.140_2 sRNA.ACR.145_1 sRNA.ACR.145_2 sRNA.ACR.150_1 sRNA.ACR.150_2 sRNA.ACR.173_1 sRNA.ACR.173_2 sRNA.ACR.178_1 sRNA.ACR.178_2
lophotrochozoa 1994 cla-miR-1994,cte-miR-1994,lgi-miR-1994a,lgi-miR-1994b TGAGACAGTGTGTCCTCCCT 0 0 0 0 1 0 3 0 0 0
lophotrochozoa 1985 hru-miR-1985,lgi-miR-1985 TGCCATTTTTATCAGTCACT 0 0 0 0 11 0 15 0 0 0
lophotrochozoa 1984 hru-miR-1984,lgi-miR-1984 TGCCCTATCCGTCAGGAACT 0 0 0 0 10 0 16 0 0 0
rodents 351 mmu-miR-351,rno-miR-351 TCCCTGAGGAGCCCTTTGAG 0 0 0 0 0 0 0 0 2 0
primates 618 hsa-miR-618,mml-miR-618,ppy-miR-618,ptr-miR-618 AAACTCTACTTGTCCTTCTG 0 0 0 0 0 0 1 0 0 0
primates 576 hsa-miR-576,mml-miR-576,ppy-miR-576,ptr-miR-576 AAGATGTGGAAAAATTGGAA 0 0 0 0 1 0 0 0 0 0
primates 576 hsa-miR-576 ATTCTAATTTCTCCACGTCT 0 0 0 0 0 0 1 0 0 0
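
The per-sample counts are easier to compare when summed by clade. A sketch, assuming the table is saved with a header row, whitespace-separated columns, and counts starting in column 5 as shown above; clade_totals and the file name in the comment are hypothetical:

```shell
# Sum read counts across all sample columns (5..NF) for each clade.
# clade_totals is a hypothetical helper; the table is read from stdin.
clade_totals() {
  awk 'NR > 1 { for (i = 5; i <= NF; i++) sum[$1] += $i }
       END    { for (c in sum) print c, sum[c] }' | sort
}

# Usage (hypothetical file name):
#   clade_totals < apul_mirtrace_clades.txt
```

For the A.pulchra table above this gives lophotrochozoa 56, primates 3, and rodents 2.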

MirMachine

A.millepora

Predicted loci: 109

Unique families: 11

miRDeep2

A.pulchra

Predicted loci: 4553

Matches to mature miRNAs (seeds): 4137

Novel miRNAs: 416

Further analysis is likely needed to evaluate score thresholds, miRNA families, etc.

BLASTn

-task blastn_short

A.pulchra

Total query seqs: 19,185,356

E-value = 1000 (default)

  • miRBase: 19,185,356
  • MirGene: 19,185,356

E-value = 10

  • miRBase: 19,120,159
  • MirGene: 19,037,617
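
The effect of the E-value cutoffs above can be checked from a single permissive run instead of re-running BLAST. A sketch under two assumptions the notes do not confirm: the output is tabular (-outfmt 6 with default columns, E-value in column 11), and the figures above count distinct queries with at least one hit. queries_with_hits and the file name in the comment are hypothetical:

```shell
# Count distinct query IDs with at least one hit at or below an E-value cutoff.
# Assumes BLAST tabular output (-outfmt 6, default columns): qseqid is
# column 1 and evalue is column 11. queries_with_hits is a hypothetical helper.
queries_with_hits() {
  # $1 = BLAST tabular file, $2 = maximum E-value
  awk -F'\t' -v t="$2" '
    ($11 + 0) <= (t + 0) && !seen[$1]++ { n++ }
    END { print n + 0 }
  ' "$1"
}

# Usage (hypothetical file name):
#   queries_with_hits apul_vs_mirbase.tab 1000
#   queries_with_hits apul_vs_mirbase.tab 10
```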

ShortStack

A.pulchra

Potential loci: 18,772

miRBase matches: 46

Number of loci characterized as miRNA: 0
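
Counts like these can be tallied from the ShortStack results table by column name. A sketch, assuming a tab-delimited Results.txt with a header row that includes a MIRNA column holding Y/N values (check the header of the file produced by the installed ShortStack version); count_col_value is a hypothetical helper name:

```shell
# Count rows whose named column equals a given value in a tab-delimited table.
# count_col_value is a hypothetical helper; the MIRNA column name and Y/N
# values in the usage line are assumptions about ShortStack's Results.txt.
count_col_value() {
  # $1 = table path, $2 = column name from the header, $3 = value to match
  awk -F'\t' -v col="$2" -v val="$3" '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == col) c = i; next }
    c && $c == val { n++ }
    END { print n + 0 }
  ' "$1"
}

# Usage (path is hypothetical):
#   count_col_value Results.txt MIRNA Y
```

Looking the column up by name keeps the count valid if the column order changes; if the named column is absent the function reports 0.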

miRTrace

P.evermanni

Clades identified as having sRNAseq matches.
CLADE FAMILY_ID MIRBASE_IDS SEQ sRNA.POR.73_1 sRNA.POR.73_2 sRNA.POR.79_1 sRNA.POR.79_2 sRNA.POR.82_1 sRNA.POR.82_2
insects 14 aae-miR-14,aga-miR-14,ame-miR-14,api-miR-14,bmo-miR-14,cqu-miR-14,dan-miR-14,der-miR-14,dgr-miR-14,dme-miR-14,dmo-miR-14,dpe-miR-14,dps-miR-14,dse-miR-14,dsi-miR-14,dvi-miR-14,dwi-miR-14,dya-miR-14,hme-miR-14,mse-miR-14,ngi-miR-14,nvi-miR-14,tca-miR-14 TCAGTCTTTTTCTCTCTCCT 0 0 0 0 1 0

MirMachine

P.evermanni

Predicted loci: 83

Unique families: 15

BLASTn

-task blastn_short

P.evermanni

Total query seqs: 8,870,343

E-value = 1000 (default)

  • miRBase: 8,870,343
  • MirGene: 8,870,343

E-value = 10

  • miRBase: 8,824,359
  • MirGene: 8,783,

ShortStack

P.evermanni

Potential loci: 15,040

miRBase matches: 25

Number of loci characterized as miRNA: 0