--- author: Sam White toc-title: Contents toc-depth: 5 toc-location: left layout: post title: Daily Bits - January 2023 date: '2023-01-03 07:02' tags: - daily bits categories: - 2023 - Daily Bits --- 20230131 - Updated coral GTFs: - [Data-Wrangling-P.acuta-Genome-GFF-to-GTF-Conversion-Using-gffread](https://robertslab.github.io/sams-notebook/posts/2023/2023-01-26-Data-Wrangling---P.acuta-Genome-GFF-to-GTF-Conversion-Using-gffread/) - [Data-Wrangling-M.capitata-Genome-GFF-to-GTF-Using-gffread](https://robertslab.github.io/sams-notebook/posts/2023/2023-01-27-Data-Wrangling---M.capitata-Genome-GFF-to-GTF-Using-gffread/) - Added GTFs to [Genomic Resources handbook page](https://robertslab.github.io/resources/Genomic-Resources/). - Added updated coral genome [`HISAT2`](https://daehwankimlab.github.io/hisat2/) indexes to [Genomic Resources handbook page](https://robertslab.github.io/resources/Genomic-Resources/). 20230130 - Resolved [the issue with the coral GTFs](https://github.com/RobertsLab/resources/issues/1575) (GitHub Issue)!!! - Lab meeting. - Added additional coral GFFs/genomes to [Roberts Lab Handbook](https://robertslab.github.io/resources/Genomic-Resources/). 20230127 - Encountered [an issue with coral GTFs](https://github.com/RobertsLab/resources/issues/1575) (GitHub Issue) not being compatible with [`HISAT2`](https://daehwankimlab.github.io/hisat2/) `extract_exons.py` script. Working on fixing... 20230126 - Created [`HISAT2`](https://daehwankimlab.github.io/hisat2/) genome index for: - [_P.acuta_](https://robertslab.github.io/sams-notebook/posts/2023/2023-01-31-Genome-Indexing---P.acuta-HIv2-Assembly-with-HiSat2-on-Mox/) (Notebook entry) - Added _M.capitata_, _P.acuta_, and _P.verrucosa_ genomes and corresponding [`HISAT2`](https://daehwankimlab.github.io/hisat2/) indexes to The Roberts Lab Handbook [Genomic Resources page](https://robertslab.github.io/resources/Genomic-Resources/). 20230125 - Downloaded genome data for the following species. For those hosted on NCBI, I got the commands using NCBI's very useful `curl` commands now provided on the beta page for a given genome. - _P.verrucosa_: `curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_014529365.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,SEQUENCE_REPORT&filename=GCA_014529365.1.zip" -H "Accept: application/zip"` - NOTE: No annotation file(s) (i.e. no GFF) available. - _M.capitata_: `curl -OJX GET "https://api.ncbi.nlm.nih.gov/datasets/v2alpha/genome/accession/GCA_006542545.1/download?include_annotation_type=GENOME_FASTA,GENOME_GFF,SEQUENCE_REPORT&filename=GCA_006542545.1.zip" -H "Accept: application/zip"` - NOTE: No annotation file(s) (i.e. no GFF) available. - _P.acuta_: - Genome FastA: `wget http://cyanophora.rutgers.edu/Pocillopora_acuta/Pocillopora_acuta_HIv2.assembly.fasta.gz` - Genome GFF: `http://cyanophora.rutgers.edu/Pocillopora_acuta/Pocillopora_acuta_HIv2.genes.gff3.gz` - Created [`HISAT2`](https://daehwankimlab.github.io/hisat2/) genome indexes for: - [_M.captita_](https://robertslab.github.io/sams-notebook/posts/2023/2023-01-25-Genome-Indexing---M.capitata-NCBI-GCA_006542545.1-with-HiSat2-on-Mox/) (Notebook entry) - [P.verrucosa_](https://robertslab.github.io/sams-notebook/posts/2023/2023-01-25-Genome-Indexing---P.verrucosa-NCBI-GCA_014529365.1-with-HiSat2-on-Mox/) (Notebook entry) - Helped resolve Grace's [Trinity issue](https://github.com/RobertsLab/resources/issues/1476#issuecomment-1404328510) (GitHub Issue). 20230124 - Put together full notebook entry regarding coral SRA BioProject [PRJNA74403](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA744403) download/QC/trimming: https://robertslab.github.io/sams-notebook/posts/2022/2022-07-06-SRA-Data---S.namaycush-SRA-BioProject-PRJNA674328-Download-and-QC/ 20230120 - Re-ran QC/trimming SLURM script for coral SRA data (BioProject [PRJNA74403](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA744403)) a couple of times. Still dealing with some minor file output organization issues... Also, seems like `fastp` is trimming data (i.e. output reports indicate average read length _after_ filtering is ~140bp, but the resulting graphs of read lengths after trimming still show 150bp...). 20230119 - Ran QC/trimming SLURM script for coral SRA data (BioProject [PRJNA74403](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA744403)). - Took ~ 10.5hrs, but needed some adjustments... - Pub-a-thon 20230118 - Continued work on [this Issue](https://github.com/RobertsLab/resources/issues/1569) (GitHub Issue) for dealing with coral SRA data (BioProject [PRJNA74403](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA744403)). - `gzip`-ed all the data. - Continued working on script to trim data. 20230117 - Lab meeting. - Continued work on [this Issue](https://github.com/RobertsLab/resources/issues/1569) (GitHub Issue) for dealing with coral SRA data (BioProject [PRJNA74403](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA744403)). - Continued working on script to trim data. 20230113 - Made some scatter plots from the CEABIGR mean gene methylation coefficients of variation calcs. Not sure if they're useful or not, but here are some of them (animated GIF - changes every ~4s). Animated GIF cycling through scatter plots of different comparisions of mean gene methylation coefficients of variation (CoV). Point are colored by absolute value of differences (delta) between mean gene methylation CoV. Purple are lowest differences, while red are greatest differences. Blue line is the linear model regression line, while the redline is a an artificial regression line with a slope of `1`. ![Animated GIF cycling through scatter plots of different comparisions of mean gene methylation coefficients of variation (CoV). Point are colored by absolute value of differences (delta) between mean gene methylation CoV. Purple are lowest differences, while red are greatest differences. Blue line is the linear model regression line, while the redline is a an artificial regression line with a slope of `1`.](https://github.com/RobertsLab/sams-notebook/blob/master/images/screencaps/20230113-ceabigr-scatter_plots-mean_gene_methylation_CoV.gif?raw=true) - Began work on [this Issue](https://github.com/RobertsLab/resources/issues/1569) to download/QC some coral BS-seq and RNA-seq data from NCBI SRA BioProject [PRJNA74403](https://www.ncbi.nlm.nih.gov/bioproject/?term=PRJNA744403)). Downloads and conversion from SRA to FastQ took > 12hrs. 20230112 - Spent a _very_ long time trying to update CEABIGR mean gene methylation CoV data frames in list so that I could add a delta of CoVs between comparison groups. Had to resort to using ChatGPT (OpenAI) and the bot solved it in less than a minute! Here was the successful solution: ```r methylation.transposed.rownames.list <- lapply(methylation.transposed.rownames.list, function(df) { df$delta <- abs(df[,1] - df[,2]) return(df) }) ``` I had something _very_ similar to this, but didn't have the `return()` aspect of the data the function. I think that was crucial, as I was getting the `delta` column by itself as the result, but wasn't getting the full data frames with the new `delta` column added to them. 20230111 - Messed around with plotting CEABIGR mean gene methylation CoV, via scatter plots. Trying to decide if this method provides any info or not. Also tried to figure out how to plot all data frames, as they are stored in a list. 20230110 - Worked extensively on troubleshooting "missing" row names in a list of data frames in CEABIGR project for coefficients of variaton of mean DNA methylation. - Turns out, the row names _were_ present the entire time (i.e. I had written the code correctly from the start), but I couldn't figure out how to view the data frames within a list so that the row names would be visible. - Additionally, the primary problem was that I wanted row names written in the output files. I in the `write.csv()`, I had the argument `row.names = FALSE`! Doh! - Answered [Yaamini's question regarding retrieving FastA sequences from NCBI](https://github.com/RobertsLab/resources/discussions/1565). 20230109 - Read Ch.11 of "The Disordered Cosmos" - Lab meeting - Discussed Ch.11 of "The Disordered Cosmos" - Wrote recommendation letter draft for Dorothy. - Updated Owl. 20230106 - Worked on Dorothy's recommendation letter. - Science Hour. 20230105 - Long lab meeting discussing ways to improve lab "life" with suggestions from everyone on what they'd like to see. Really interesting/informative session! - Continued to work on the [Roberts Lab Handbook transcriptome annotation](https://robertslab.github.io/resources/bio-Annotation/#transcriptome-trinity). 20230104 - Worked on Linda Rhodes project on figuring out how to use a pre-built SILVA 138 QIIME classifier. - ``` qiime feature-classifier classify-sklearn \ --i-reads nonchimeras.qza \ --i-classifier silva-138.1-ssu-nr99-515f-806r-classifier.qza \ --o-classification classifier-taxonomy-test.qza ``` - This uses a _lot_ of memory. For the testing I was doing on Linda's marine mammals data set, it required 25GB of RAM, _per CPU_! As such, I couldn't run this with more than a single CPU without the command crashing. - Runtime was somewhere between 1.5 - 2 _days_.