07-CG-motifs ================ Kathleen Durkin 2025-05-16 [Assignment details](https://sr320.github.io/course-fish546-2025/assignments/07-CG.html) Read in one of my fastas (if necessary) ``` bash curl -L https://gannet.fish.washington.edu/gitrepos/ceasmallr/output/00.00-trimming-fastp/CF02-CM02-Zygote_R2_001.fastp-trim.fq.gz -o ../../project/data/trimmed-WGBS-reads/CF02-CM02-Zygote_R2_001.fastp-trim.fq.gz ``` Unzip fasta file and select a subset (the full file is prohibitively large) ``` bash gunzip ../../project/data/trimmed-WGBS-reads/CF02-CM02-Zygote_R2_001.fastp-trim.fq.gz head -n 4000 ../../project/data/trimmed-WGBS-reads/CF02-CM02-Zygote_R2_001.fastp-trim.fq.gz > ../data/07-CG-motifs/input.fq ``` Load R packages ``` r library(seqinr) library(ShortRead) ``` ## Loading required package: BiocGenerics ## ## Attaching package: 'BiocGenerics' ## The following objects are masked from 'package:stats': ## ## IQR, mad, sd, var, xtabs ## The following objects are masked from 'package:base': ## ## anyDuplicated, aperm, append, as.data.frame, basename, cbind, ## colnames, dirname, do.call, duplicated, eval, evalq, Filter, Find, ## get, grep, grepl, intersect, is.unsorted, lapply, Map, mapply, ## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int, ## Position, rank, rbind, Reduce, rownames, sapply, setdiff, sort, ## table, tapply, union, unique, unsplit, which.max, which.min ## Loading required package: BiocParallel ## Loading required package: Biostrings ## Loading required package: S4Vectors ## Loading required package: stats4 ## ## Attaching package: 'S4Vectors' ## The following objects are masked from 'package:base': ## ## expand.grid, I, unname ## Loading required package: IRanges ## Loading required package: XVector ## Loading required package: GenomeInfoDb ## ## Attaching package: 'Biostrings' ## The following object is masked from 'package:seqinr': ## ## translate ## The following object is masked from 'package:base': ## ## strsplit ## Loading required package: Rsamtools ## Loading required package: GenomicRanges ## Loading required package: GenomicAlignments ## Loading required package: SummarizedExperiment ## Loading required package: MatrixGenerics ## Loading required package: matrixStats ## ## Attaching package: 'matrixStats' ## The following object is masked from 'package:seqinr': ## ## count ## ## Attaching package: 'MatrixGenerics' ## The following objects are masked from 'package:matrixStats': ## ## colAlls, colAnyNAs, colAnys, colAvgsPerRowSet, colCollapse, ## colCounts, colCummaxs, colCummins, colCumprods, colCumsums, ## colDiffs, colIQRDiffs, colIQRs, colLogSumExps, colMadDiffs, ## colMads, colMaxs, colMeans2, colMedians, colMins, colOrderStats, ## colProds, colQuantiles, colRanges, colRanks, colSdDiffs, colSds, ## colSums2, colTabulates, colVarDiffs, colVars, colWeightedMads, ## colWeightedMeans, colWeightedMedians, colWeightedSds, ## colWeightedVars, rowAlls, rowAnyNAs, rowAnys, rowAvgsPerColSet, ## rowCollapse, rowCounts, rowCummaxs, rowCummins, rowCumprods, ## rowCumsums, rowDiffs, rowIQRDiffs, rowIQRs, rowLogSumExps, ## rowMadDiffs, rowMads, rowMaxs, rowMeans2, rowMedians, rowMins, ## rowOrderStats, rowProds, rowQuantiles, rowRanges, rowRanks, ## rowSdDiffs, rowSds, rowSums2, rowTabulates, rowVarDiffs, rowVars, ## rowWeightedMads, rowWeightedMeans, rowWeightedMedians, ## rowWeightedSds, rowWeightedVars ## Loading required package: Biobase ## Welcome to Bioconductor ## ## Vignettes contain introductory material; view with ## 'browseVignettes()'. To cite Bioconductor, see ## 'citation("Biobase")', and for packages 'citation("pkgname")'. ## ## Attaching package: 'Biobase' ## The following object is masked from 'package:MatrixGenerics': ## ## rowMedians ## The following objects are masked from 'package:matrixStats': ## ## anyMissing, rowMedians ``` r # Replace 'input.fasta' with the name of your multi-sequence fasta file input_file <- "../data/07-CG-motifs/input.fq" # Need to use silghtly different code to accomodate an .fq file instead of a .fasta # sequences <- read.fasta(input_file) fq <- readFastq(input_file) sequences <- sread(fq) ``` ``` r # Set the seed for reproducibility (optional) set.seed(42) number_of_sequences_to_select <- 100 if (length(sequences) < number_of_sequences_to_select) { warning("There are fewer than 10 sequences in the fasta file. All sequences will be selected.") number_of_sequences_to_select <- length(sequences) } selected_indices <- sample(length(sequences), number_of_sequences_to_select) selected_sequences <- sequences[selected_indices] ``` ``` r # Replace 'output.fasta' with your desired output file name output_file <- "../output/07-CG-motifs/100_seqs.fasta" write.fasta(selected_sequences, names(selected_sequences), output_file, open = "w") ``` Issues with IGV visualization because this generated fasta doesn’t have matching scaffold name. Fix this here ``` bash sed -i '1s/.*/>EMBOSS_001/' ../output/07-CG-motifs/100_seqs.fasta ``` ``` bash #needed downstream for IGV /home/shared/samtools-1.12/samtools faidx \ ../output/07-CG-motifs/100_seqs.fasta ``` ``` bash fuzznuc -sequence ../output/07-CG-motifs/100_seqs.fasta -pattern CG -rformat gff -outfile ../output/07-CG-motifs/CGoutput.gff ``` ## Search for patterns in nucleotide sequences IGV visualization: ![](../output/07-CG-motifs/IGV.png)