---
author: Sam White
toc-title: Contents
toc-depth: 5
toc-location: left
layout: post
title: lncRNA Identification - P.generosa lncRNAs using CPC2 and bedtools
date: '2023-05-02 08:29'
tags:
  - lncRNA
  - Panopea generosa
  - Pacific geoduck
  - CPC2
  - bedtools
categories:
  - 2023
  - Miscellaneous
---

After [trimming _P.generosa_ RNA-seq reads on 20230426](https://robertslab.github.io/sams-notebook/posts/2023/2023-04-26-FastQ-Trimming-and-QC-P.generosa-RNA-seq-Data-from-20220323-on-Mox.html) and then [aligning and annotating them to the Panopea-generosa-v1.0 genome on 20230426](https://robertslab.github.io/sams-notebook/2023/04/26/Transcript-Alignments-P.generosa-RNA-seq-Alignments-for-lncRNA-Identification-Using-Hisat2-StingTie-and-gffcompare-on-Mox.html), I proceeded with the final step of lncRNA identification. To do this, I used [Zach's notebook entry on lncRNA identification](https://zbengt.github.io/2023-04-20-LncRNA-Discovery/) for guidance.

I used the annotated GTF generated by [`gffcompare`](https://ccb.jhu.edu/software/stringtie/gffcompare.shtml) during the [alignment/annotation step on 20230426](https://robertslab.github.io/sams-notebook/2023/04/26/Transcript-Alignments-P.generosa-RNA-seq-Alignments-for-lncRNA-Identification-Using-Hisat2-StingTie-and-gffcompare-on-Mox/). To identify lncRNAs, I used [`bedtools getfasta`](https://bedtools.readthedocs.io/en/latest/content/tools/getfasta/) and [`CPC2`](https://github.com/gao-lab/CPC2_standalone) with an arbitrary 200bp minimum length. All of this was done in a Jupyter Notebook (links below).
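The final filtering idea (keep transcripts that CPC2 calls noncoding and that meet the 200bp minimum) can be sketched in a few lines of Python. This is a minimal illustration, not the code from the notebook. It assumes a tab-delimited CPC2 results table with `#ID`, `transcript_length`, and `label` columns and a `noncoding` label value. The function name `select_lncrna_ids` and the example rows are made up for illustration.

```python
import csv
import io

MIN_LENGTH = 200  # bp; the arbitrary minimum length used in this post


def select_lncrna_ids(cpc2_table, min_length=MIN_LENGTH):
    """Return IDs of transcripts labeled noncoding and at least min_length bp long.

    cpc2_table is any iterable of lines (e.g. an open file handle).
    Column names are an assumption about the CPC2 output table.
    """
    reader = csv.DictReader(cpc2_table, delimiter="\t")
    ids = []
    for row in reader:
        is_noncoding = row["label"] == "noncoding"
        is_long_enough = int(row["transcript_length"]) >= min_length
        if is_noncoding and is_long_enough:
            ids.append(row["#ID"])
    return ids


# Tiny made-up table for illustration only (not real P.generosa results).
example = io.StringIO(
    "#ID\ttranscript_length\tlabel\n"
    "tx1\t150\tnoncoding\n"
    "tx2\t200\tnoncoding\n"
    "tx3\t950\tcoding\n"
    "tx4\t1200\tnoncoding\n"
)
lncrna_ids = select_lncrna_ids(example)
print(lncrna_ids)
```

With a real results file, the same function could be called on an open file handle instead of the in-memory example.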

Jupyter Notebook (GitHub):

- [20230502-pgen-lncRNA-identification.ipynb](https://github.com/RobertsLab/code/blob/master/notebooks/sam/20230502-pgen-lncRNA-identification.ipynb)

Jupyter Notebook (NB Viewer):

- [20230502-pgen-lncRNA-identification.ipynb](https://nbviewer.org/github/RobertsLab/code/blob/master/notebooks/sam/20230502-pgen-lncRNA-identification.ipynb)

---

# RESULTS

Some very brief "stats":

- Total _P.generosa_ transcripts ID'd by HiSat2/StringTie: `79,269`

- Total _P.generosa_ lncRNAs ID'd by CPC2 (>= 200bp): `13,606`

- Percentage of transcripts which are lncRNAs: `17%`

Output folder:

- [20230426-pgen-HISAT2-stringtie-gffcompare-RNAseq/](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-HISAT2-stringtie-gffcompare-RNAseq/)

  #### lncRNA GTF

  - [20230502-pgen-lncRNA-IDs.gtf](https://gannet.fish.washington.edu/Atumefaciens/20230426-pgen-HISAT2-stringtie-gffcompare-RNAseq/20230502-pgen-lncRNA-IDs.gtf) (2.2M)

    - MD5: `9adb7efc18fe1bfedcad24c86da1161f`
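
To confirm that a downloaded copy of the GTF matches the MD5 listed above, something like the following should work. It is a minimal sketch using Python's standard `hashlib` module. The helper names `md5_of_file` and `matches_expected` are hypothetical, and the local file path is whatever you saved the download as.

```python
import hashlib

EXPECTED_MD5 = "9adb7efc18fe1bfedcad24c86da1161f"  # checksum listed in this post


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_MD5):
    """Return True if the file's MD5 matches the expected checksum."""
    return md5_of_file(path) == expected.lower()
```

For example, `matches_expected("20230502-pgen-lncRNA-IDs.gtf")` would return `True` for an intact download, assuming the file was saved under that name in the working directory.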