In this notebook I identify differentially methylated genes (DMGs) between two Olympia oyster populations, Hood Canal and South Sound. First I prepare my data to be in the correct format / shape, then test for DMGs using a binomial GLM. The genes are also annotated using a gene feature file and BEDtools, and biological functions associated with GO terms are visualized with REVIGO.
list.of.packages <- c("tidyverse", "reshape2","dplyr", "tidyr", "readr", "stringr", "plotly", "car", "Pstat") #add new libraries here
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)
# Load all libraries
lapply(list.of.packages, FUN = function(X) {
do.call("require", list(X))
})
load(here::here("analyses", "methylation-filtered", "R-objects", "meth_filter_reshaped"))
head(meth_filter_reshaped)
## chr start end strand sample coverage numCs numTs percMeth population
## 1 Contig0 39226 39226 + 1 21 11 10 52.38095 HC
## 2 Contig0 39234 39234 + 1 24 10 14 41.66667 HC
## 3 Contig0 64179 64179 + 1 10 9 1 90.00000 HC
## 4 Contig0 71523 71523 + 1 13 13 0 100.00000 HC
## 5 Contig0 71533 71533 + 1 17 17 0 100.00000 HC
## 6 Contig0 71542 71542 + 1 16 15 1 93.75000 HC
slop
to include 2kb regions on either side of gene regionsI want to identify methylated loci that are within genes but also 2kb upstream and downstream of gene regions. Therefore, before intersecting methylated loci with genes, I need to create a gene region +/- 2kb file. I will do this using slopBed
.
First, generate a “genome file”, which defines size of each chromosome. This is so the slop
function restricts results to within a contig. I can use the indexed FASTA file that I created for a blast.
Extract column 1 (contig name) and column 2 (# bases in the contig)
head -n 2 "../resources/Olurida_v081.fa.fai"
cut -f 1,2 "../resources/Olurida_v081.fa.fai" > "../resources/Olurida_chrom.sizes"
head -n 2 "../resources/Olurida_chrom.sizes"
## Contig0 116746 9 116746 116747
## Contig1 87411 116765 87411 87412
## Contig0 116746
## Contig1 87411
head -n 4 "../genome-features/Olurida_v081-20190709.gene.gff"
## ##gff-version 3
## Contig61093 maker gene 7493 7946 . + . ID=OLUR_00020575;Name=OLUR_00020575;Alias=maker-Contig61093-snap-gene-0.2;Note=Protein of unknown function;
## Contig1111 maker gene 24968 28696 . - . ID=OLUR_00006628;Name=OLUR_00006628;Alias=maker-Contig1111-snap-gene-0.1;Note=Similar to Spag6: Sperm-associated antigen 6 (Mus musculus OX%3D10090);Dbxref=Gene3D:G3DSA:1.25.10.10,InterPro:IPR000225,InterPro:IPR000357,InterPro:IPR011989,InterPro:IPR016024,Pfam:PF02985,SMART:SM00185,SUPERFAMILY:SSF48371;Ontology_term=GO:0005515;
## Contig214118 maker gene 201 926 . + . ID=OLUR_00032161;Name=OLUR_00032161;Alias=maker-Contig214118-snap-gene-0.0;Note=Protein of unknown function;Dbxref=Gene3D:G3DSA:3.10.450.10;
slopBed \
-i "../genome-features/Olurida_v081-20190709.gene.gff" \
-g "../resources/Olurida_chrom.sizes" \
-b 2000 \
> "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff"
head -n 3 "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff"
## Contig61093 maker gene 5493 9946 . + . ID=OLUR_00020575;Name=OLUR_00020575;Alias=maker-Contig61093-snap-gene-0.2;Note=Protein of unknown function;
## Contig1111 maker gene 22968 28792 . - . ID=OLUR_00006628;Name=OLUR_00006628;Alias=maker-Contig1111-snap-gene-0.1;Note=Similar to Spag6: Sperm-associated antigen 6 (Mus musculus OX%3D10090);Dbxref=Gene3D:G3DSA:1.25.10.10,InterPro:IPR000225,InterPro:IPR000357,InterPro:IPR011989,InterPro:IPR016024,Pfam:PF02985,SMART:SM00185,SUPERFAMILY:SSF48371;Ontology_term=GO:0005515;
## Contig214118 maker gene 1 1068 . + . ID=OLUR_00032161;Name=OLUR_00032161;Alias=maker-Contig214118-snap-gene-0.0;Note=Protein of unknown function;Dbxref=Gene3D:G3DSA:3.10.450.10;
wb: Print all lines in the second file a: input data file, which was created in previous code chunk b: save annotated gene list
intersectBed \
-wb \
-a "../analyses/methylation-filtered/meth_filter_long.tab" \
-b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" \
> "../analyses/methylation-filtered/meth_filter_gene.2kbslop.tab"
wc -l "../analyses/methylation-filtered/meth_filter_gene.2kbslop.tab"
## 379314 ../analyses/methylation-filtered/meth_filter_gene.2kbslop.tab
head -n 2 "../analyses/methylation-filtered/meth_filter_gene.2kbslop.tab"
## Contig0 39226 39226 + 1 21 11 10 52.38095238095239 HC Contig0 maker gene 10497 95068 . + . ID=OLUR_00000039;Name=OLUR_00000039;Alias=maker-Contig0-snap-gene-0.8;Note=Similar to WDR87: WD repeat-containing protein 87 (Homo sapiens OX%3D9606);Dbxref=Coils:Coil,Gene3D:G3DSA:1.25.10.10,Gene3D:G3DSA:2.130.10.10,InterPro:IPR001680,InterPro:IPR011989,InterPro:IPR015943,InterPro:IPR016024,InterPro:IPR017986,InterPro:IPR036322,MobiDBLite:mobidb-lite,Pfam:PF00400,ProSiteProfiles:PS50082,ProSiteProfiles:PS50294,SMART:SM00320,SUPERFAMILY:SSF48371,SUPERFAMILY:SSF50978;Ontology_term=GO:0005515;
## Contig0 39234 39234 + 1 24 10 14 41.66666666666667 HC Contig0 maker gene 10497 95068 . + . ID=OLUR_00000039;Name=OLUR_00000039;Alias=maker-Contig0-snap-gene-0.8;Note=Similar to WDR87: WD repeat-containing protein 87 (Homo sapiens OX%3D9606);Dbxref=Coils:Coil,Gene3D:G3DSA:1.25.10.10,Gene3D:G3DSA:2.130.10.10,InterPro:IPR001680,InterPro:IPR011989,InterPro:IPR015943,InterPro:IPR016024,InterPro:IPR017986,InterPro:IPR036322,MobiDBLite:mobidb-lite,Pfam:PF00400,ProSiteProfiles:PS50082,ProSiteProfiles:PS50294,SMART:SM00320,SUPERFAMILY:SSF48371,SUPERFAMILY:SSF50978;Ontology_term=GO:0005515;
Return the number of loci associated with gene regions. Not sure if this will be used, but it’s good to save it just in case.
intersectBed \
-wb \
-a "../analyses/methylation-filtered/meth_filter_long.tab" \
-b "../genome-features/Olurida_v081-20190709.gene.gff" \
> "../analyses/methylation-filtered/meth_filter_gene.tab"
wc -l "../analyses/methylation-filtered/meth_filter_gene.tab"
## 336384 ../analyses/methylation-filtered/meth_filter_gene.tab
head -n 3 "../analyses/methylation-filtered/meth_filter_gene.2kbslop.tab"
## Contig0 39226 39226 + 1 21 11 10 52.38095238095239 HC Contig0 maker gene 10497 95068 . + . ID=OLUR_00000039;Name=OLUR_00000039;Alias=maker-Contig0-snap-gene-0.8;Note=Similar to WDR87: WD repeat-containing protein 87 (Homo sapiens OX%3D9606);Dbxref=Coils:Coil,Gene3D:G3DSA:1.25.10.10,Gene3D:G3DSA:2.130.10.10,InterPro:IPR001680,InterPro:IPR011989,InterPro:IPR015943,InterPro:IPR016024,InterPro:IPR017986,InterPro:IPR036322,MobiDBLite:mobidb-lite,Pfam:PF00400,ProSiteProfiles:PS50082,ProSiteProfiles:PS50294,SMART:SM00320,SUPERFAMILY:SSF48371,SUPERFAMILY:SSF50978;Ontology_term=GO:0005515;
## Contig0 39234 39234 + 1 24 10 14 41.66666666666667 HC Contig0 maker gene 10497 95068 . + . ID=OLUR_00000039;Name=OLUR_00000039;Alias=maker-Contig0-snap-gene-0.8;Note=Similar to WDR87: WD repeat-containing protein 87 (Homo sapiens OX%3D9606);Dbxref=Coils:Coil,Gene3D:G3DSA:1.25.10.10,Gene3D:G3DSA:2.130.10.10,InterPro:IPR001680,InterPro:IPR011989,InterPro:IPR015943,InterPro:IPR016024,InterPro:IPR017986,InterPro:IPR036322,MobiDBLite:mobidb-lite,Pfam:PF00400,ProSiteProfiles:PS50082,ProSiteProfiles:PS50294,SMART:SM00320,SUPERFAMILY:SSF48371,SUPERFAMILY:SSF50978;Ontology_term=GO:0005515;
## Contig0 64179 64179 + 1 10 9 1 90 HC Contig0 maker gene 10497 95068 . + . ID=OLUR_00000039;Name=OLUR_00000039;Alias=maker-Contig0-snap-gene-0.8;Note=Similar to WDR87: WD repeat-containing protein 87 (Homo sapiens OX%3D9606);Dbxref=Coils:Coil,Gene3D:G3DSA:1.25.10.10,Gene3D:G3DSA:2.130.10.10,InterPro:IPR001680,InterPro:IPR011989,InterPro:IPR015943,InterPro:IPR016024,InterPro:IPR017986,InterPro:IPR036322,MobiDBLite:mobidb-lite,Pfam:PF00400,ProSiteProfiles:PS50082,ProSiteProfiles:PS50294,SMART:SM00320,SUPERFAMILY:SSF48371,SUPERFAMILY:SSF50978;Ontology_term=GO:0005515;
head -n 1 "../analyses/methylation-filtered/meth_filter_gene.tab"
## Contig0 39226 39226 + 1 21 11 10 52.38095238095239 HC Contig0 maker gene 12497 93068 . + . ID=OLUR_00000039;Name=OLUR_00000039;Alias=maker-Contig0-snap-gene-0.8;Note=Similar to WDR87: WD repeat-containing protein 87 (Homo sapiens OX%3D9606);Dbxref=Coils:Coil,Gene3D:G3DSA:1.25.10.10,Gene3D:G3DSA:2.130.10.10,InterPro:IPR001680,InterPro:IPR011989,InterPro:IPR015943,InterPro:IPR016024,InterPro:IPR017986,InterPro:IPR036322,MobiDBLite:mobidb-lite,Pfam:PF00400,ProSiteProfiles:PS50082,ProSiteProfiles:PS50294,SMART:SM00320,SUPERFAMILY:SSF48371,SUPERFAMILY:SSF50978;Ontology_term=GO:0005515;
Genes plus 2kb downstream (3’) (using the -r
option to indicate “add”)
slopBed \
-i "../genome-features/Olurida_v081-20190709.gene.gff" \
-g "../resources/Olurida_chrom.sizes" \
-r 2000 \
-l 0 \
> "../genome-features/Olurida_v081-20190709.gene.2kb-down.gff"
head -n 3 "../genome-features/Olurida_v081-20190709.gene.2kb-down.gff"
## Contig61093 maker gene 7493 9946 . + . ID=OLUR_00020575;Name=OLUR_00020575;Alias=maker-Contig61093-snap-gene-0.2;Note=Protein of unknown function;
## Contig1111 maker gene 24968 28792 . - . ID=OLUR_00006628;Name=OLUR_00006628;Alias=maker-Contig1111-snap-gene-0.1;Note=Similar to Spag6: Sperm-associated antigen 6 (Mus musculus OX%3D10090);Dbxref=Gene3D:G3DSA:1.25.10.10,InterPro:IPR000225,InterPro:IPR000357,InterPro:IPR011989,InterPro:IPR016024,Pfam:PF02985,SMART:SM00185,SUPERFAMILY:SSF48371;Ontology_term=GO:0005515;
## Contig214118 maker gene 201 1068 . + . ID=OLUR_00032161;Name=OLUR_00032161;Alias=maker-Contig214118-snap-gene-0.0;Note=Protein of unknown function;Dbxref=Gene3D:G3DSA:3.10.450.10;
Genes plus 2kb upstream (5’) (using the -l
option to indicate “subract”)
slopBed \
-i "../genome-features/Olurida_v081-20190709.gene.gff" \
-g "../resources/Olurida_chrom.sizes" \
-l 2000 \
-r 0 \
> "../genome-features/Olurida_v081-20190709.gene.2kb-up.gff"
head -n 3 "../genome-features/Olurida_v081-20190709.gene.2kb-up.gff"
## Contig61093 maker gene 5493 7946 . + . ID=OLUR_00020575;Name=OLUR_00020575;Alias=maker-Contig61093-snap-gene-0.2;Note=Protein of unknown function;
## Contig1111 maker gene 22968 28696 . - . ID=OLUR_00006628;Name=OLUR_00006628;Alias=maker-Contig1111-snap-gene-0.1;Note=Similar to Spag6: Sperm-associated antigen 6 (Mus musculus OX%3D10090);Dbxref=Gene3D:G3DSA:1.25.10.10,InterPro:IPR000225,InterPro:IPR000357,InterPro:IPR011989,InterPro:IPR016024,Pfam:PF02985,SMART:SM00185,SUPERFAMILY:SSF48371;Ontology_term=GO:0005515;
## Contig214118 maker gene 1 926 . + . ID=OLUR_00032161;Name=OLUR_00032161;Alias=maker-Contig214118-snap-gene-0.0;Note=Protein of unknown function;Dbxref=Gene3D:G3DSA:3.10.450.10;
– gene = contig number + start/end locus
– group = sample number + gene
– population = HC for Hood Canal, or SS for South Sound
meth_filter_genes_2kbslop <-
read_delim(file = here::here("analyses", "methylation-filtered", "meth_filter_gene.2kbslop.tab"), delim = "\t", col_names = c(colnames(meth_filter_reshaped[-10]), "population", "contig_gene", "source_gene", "feature_gene", "start_gene_2kb", "end_gene_2kb", "unknown1_gene", "strand_gene", "unknown2_gene", "notes_gene")) %>%
mutate(gene=paste(contig_gene, start_gene_2kb, end_gene_2kb, sep="_")) %>%
mutate(group=paste(sample, gene, sep="_"))
## Parsed with column specification:
## cols(
## chr = col_character(),
## start = col_double(),
## end = col_double(),
## strand = col_character(),
## sample = col_double(),
## coverage = col_double(),
## numCs = col_double(),
## numTs = col_double(),
## percMeth = col_double(),
## population = col_character(),
## contig_gene = col_character(),
## source_gene = col_character(),
## feature_gene = col_character(),
## start_gene_2kb = col_double(),
## end_gene_2kb = col_double(),
## unknown1_gene = col_character(),
## strand_gene = col_character(),
## unknown2_gene = col_character(),
## notes_gene = col_character()
## )
length(unique(meth_filter_genes_2kbslop$gene))
## [1] 3754
min.filt_2kbslop <- dplyr::count(meth_filter_genes_2kbslop, vars = c(group))
newdata <- min.filt_2kbslop[which(min.filt_2kbslop$n > 4), ]
sub_meth_table_2kbslop <- meth_filter_genes_2kbslop[meth_filter_genes_2kbslop$group %in% newdata$vars,]
save(sub_meth_table_2kbslop, file="../analyses/DMGs/R-objects/sub_meth_table_2kbslop")
head(sub_meth_table_2kbslop)
## # A tibble: 6 x 21
## chr start end strand sample coverage numCs numTs percMeth population
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Cont… 39226 39226 + 1 21 11 10 52.4 HC
## 2 Cont… 39234 39234 + 1 24 10 14 41.7 HC
## 3 Cont… 64179 64179 + 1 10 9 1 90 HC
## 4 Cont… 71523 71523 + 1 13 13 0 100 HC
## 5 Cont… 71533 71533 + 1 17 17 0 100 HC
## 6 Cont… 71542 71542 + 1 16 15 1 93.8 HC
## # … with 11 more variables: contig_gene <chr>, source_gene <chr>,
## # feature_gene <chr>, start_gene_2kb <dbl>, end_gene_2kb <dbl>,
## # unknown1_gene <chr>, strand_gene <chr>, unknown2_gene <chr>,
## # notes_gene <chr>, gene <chr>, group <chr>
length(unique(sub_meth_table_2kbslop$gene))
## [1] 1393
Note: this script was created by Hollie Putnam (GM.Rmd); there are minor revisions below. I retained some commented out lines (notably-testing for position w/n gene, such as intron & exon) in case we want to include those as factors in the future.
# create data frame to stored results
results_2kbslop <- data.frame()
gs <- unique(sub_meth_table_2kbslop$gene)
#first subset the unique dataframes and second run the GLMs
for(i in 1:length(gs)){
#subset the dataframe gene by gene
sub_meth_table_2kbslop1 <- subset(sub_meth_table_2kbslop, gene ==gs[i])
# fit glm position model
fit <- glm(matrix(c(numCs, numTs), ncol=2) ~ as.factor(population) + (1|sample),
data=sub_meth_table_2kbslop1, family=binomial)
a <- anova(fit, test="Chisq")
# capture summary stats to data frame
df <- data.frame(sub_meth_table_2kbslop1[c("population", "sample", "contig_gene", "start_gene_2kb", "end_gene_2kb", "gene", "chr", "start", "sample", "coverage", "numCs", "numTs", "percMeth", "notes_gene")],
pval.treatment = a$`Pr(>Chi)`[2],
#pval.position = a$`Pr(>Chi)`[3], #uncomment if you want to include position of CpG within a gene
#pval.treatment_x_position = a$`Pr(>Chi)`[4], #uncomment if you want to include position of CpG within a gene interaction with treatment
stringsAsFactors = F)
# bind rows of temporary data frame to the results data frame
results_2kbslop <- rbind(results_2kbslop, df)
}
results_2kbslop[is.na(results_2kbslop)] <- 0
results_2kbslop$adj.pval.pop <- p.adjust(results_2kbslop$pval.treatment, method='BH')
#results_2kbslop$adj.pval.position <- p.adjust(results_2kbslop$pval.position, method='BH') #uncomment if you want to include position of CpG within a gene
#results_2kbslop$adj.pval.treatment_x_position <- p.adjust(results_2kbslop$pval.treatment_x_position, method='BH') #uncomment if you want to include position of CpG within a gene interaction with treatment
#save df with differentially methylated genes
DMGs_2kbslop <- subset(results_2kbslop, adj.pval.pop < 0.05) #%>%
# mutate(contig_gene_start=paste(contig_gene, start_gene_2kb, sep="_"))
save(DMGs_2kbslop, file="../analyses/DMGs/R-objects/DMGs_2kbslop")
DMGs_2kbslop.genes <- DMGs_2kbslop[!duplicated(DMGs_2kbslop$gene), c("contig_gene", "start_gene_2kb", "end_gene_2kb", "notes_gene", "pval.treatment", "adj.pval.pop")]
# Save .bed file for gene enrichment and visualization
write_delim(DMGs_2kbslop.genes[,c("contig_gene", "start_gene_2kb", "end_gene_2kb")], "../analyses/DMGs/DMGs_2kbslop.bed", delim = '\t', col_names = FALSE)
# Save .bed file for all genes included in DMG analysis
write_delim(results_2kbslop[!duplicated(results_2kbslop$gene), c("contig_gene", "start_gene_2kb", "end_gene_2kb")], "../analyses/DMGs/AllGenes-DMGs_2kbslop.bed", delim = '\t', col_names = FALSE)
length(unique(DMGs_2kbslop$gene))
## [1] 279
# split gene data in "notes_gene" column into separate columns
DMGs_2kbslop.genes <- DMGs_2kbslop.genes %>%
mutate(ID=str_extract(notes_gene, "ID=(.*?);"),
Parent=str_extract(notes_gene, "Parent=(.*?);"),
Name=str_extract(notes_gene, "Name=(.*?);"),
Alias=str_extract(notes_gene, "Alias=(.*?);"),
AED=str_extract(notes_gene, "AED=(.*?);"),
eAED=str_extract(notes_gene, "eAED=(.*?);"),
Note=str_extract(notes_gene, "Note=(.*?);"),
Ontology_term=str_extract(notes_gene, "Ontology_term=(.*?);"),
Dbxref=str_extract(notes_gene, "Dbxref=(.*?);")
)
write_csv(DMGs_2kbslop.genes, path = here::here("analyses", "DMGs", "DMGs_2kbslop.genes.csv"))
#Extract GO terms
DMGs_2kbslop.genes.GO <- DMGs_2kbslop.genes %>%
mutate(Ontology_term = str_replace(Ontology_term, pattern="Ontology_term=",replacement = "")) %>%
mutate(Ontology_term = str_replace(Ontology_term, pattern=";",replacement = "")) %>%
separate(Ontology_term, sep=",", into=paste("GO", 1:11, sep="_")) %>%
pivot_longer(cols=c("GO_1","GO_2","GO_3","GO_4","GO_5","GO_6","GO_7","GO_8","GO_9","GO_10","GO_11"), names_to = "GO_number", values_to = "GO_term") %>%
dplyr::select(-GO_number) %>%
filter(!is.na(Note) & !is.na(GO_term))
## Warning: Expected 11 pieces. Missing pieces filled with `NA` in 127 rows [1, 3,
## 4, 5, 9, 10, 11, 12, 16, 17, 19, 20, 24, 26, 27, 28, 29, 30, 31, 32, ...].
write_delim(DMGs_2kbslop.genes.GO[,c("GO_term","adj.pval.pop")], path = here::here("analyses/", "DMGs/", "DMGs_2kbslop.GO.txt"), delim = '\t', col_names = F) #write out df with just GO terms and p-adj values
The file “../analyses/DMGs/DMGs_2kbslop.GO.txt” was opened outside of RMarkdown and the GO terms and p-values were pasted into REVIGO. I then exported the R script to generate the Biological Processes scatterplot. Here is the REVIGO table:
term_ID | description | frequency | plot_X | plot_Y | plot_size | log10 p-value | uniqueness | dispensability | representative | eliminated |
---|---|---|---|---|---|---|---|---|---|---|
GO:0006260 | DNA replication | 1.58% | 5.949 | 2.822 | 5.306 | -6.1605 | 0.719 | 0 | 6260 | 0 |
GO:0007155 | cell adhesion | 0.54% | -6.544 | 1.248 | 4.844 | -6.4135 | 0.914 | 0 | 7155 | 0 |
GO:0007156 | homophilic cell adhesion via plasma membrane adhesion molecules | 0.10% | null | null | 4.087 | -6.4135 | 0.914 | 0.85 | 7155 | 1 |
GO:0023052 | signaling | 6.77% | -3.032 | -0.496 | 5.939 | -7.3332 | 0.956 | 0 | 23052 | 0 |
GO:0070588 | calcium ion transmembrane transport | 0.16% | -3.153 | -4.713 | 4.305 | -2.1079 | 0.868 | 0 | 70588 | 0 |
GO:0009058 | biosynthetic process | 31.61% | -1.225 | -0.762 | 6.608 | -2.4968 | 0.943 | 0.026 | 9058 | 0 |
GO:0006334 | nucleosome assembly | 0.09% | -4.686 | 5.561 | 4.057 | -3.6686 | 0.881 | 0.029 | 6334 | 0 |
GO:0007264 | small GTPase mediated signal transduction | 0.49% | 1.48 | 6.771 | 4.794 | -2.1354 | 0.743 | 0.034 | 7264 | 0 |
GO:0006457 | protein folding | 0.90% | -4.69 | 1.661 | 5.064 | -1.7376 | 0.912 | 0.037 | 6457 | 0 |
GO:0007154 | cell communication | 7.22% | -6.016 | 3.233 | 5.967 | -4.0935 | 0.904 | 0.048 | 7154 | 0 |
GO:0005975 | carbohydrate metabolic process | 5.26% | -3.347 | 3.585 | 5.829 | -3.2718 | 0.888 | 0.081 | 5975 | 0 |
GO:0006030 | chitin metabolic process | 0.08% | 2.981 | -3.369 | 3.994 | -3.9452 | 0.855 | 0.103 | 6030 | 0 |
GO:0032958 | inositol phosphate biosynthetic process | 0.01% | 4.623 | -5.39 | 2.943 | -6.0631 | 0.827 | 0.121 | 32958 | 0 |
GO:0030154 | cell differentiation | 1.13% | 1.041 | -6.177 | 5.162 | -5.21 | 0.858 | 0.133 | 30154 | 0 |
GO:0016567 | protein ubiquitination | 0.52% | 7.405 | 0.126 | 4.827 | -2.6487 | 0.717 | 0.15 | 16567 | 0 |
GO:0016579 | protein deubiquitination | 0.20% | null | null | 4.398 | -1.8804 | 0.726 | 0.829 | 16567 | 1 |
GO:0006006 | glucose metabolic process | 0.43% | 2.958 | -6.194 | 4.741 | -3.4042 | 0.835 | 0.213 | 6006 | 0 |
GO:0055114 | oxidation-reduction process | 15.06% | 1.16 | -1.152 | 6.286 | -6.1605 | 0.859 | 0.214 | 55114 | 0 |
GO:0019427 | acetyl-CoA biosynthetic process from acetate | 0.03% | 4.516 | -4.71 | 3.604 | -2.0321 | 0.81 | 0.237 | 19427 | 0 |
GO:0006886 | intracellular protein transport | 1.20% | -4.141 | -3.625 | 5.187 | -1.5764 | 0.862 | 0.263 | 6886 | 0 |
GO:0009408 | response to heat | 0.17% | 0.479 | 7.378 | 4.328 | -1.7376 | 0.834 | 0.315 | 9408 | 0 |
GO:0006396 | RNA processing | 3.21% | 6.772 | 2.794 | 5.615 | -2.1891 | 0.746 | 0.319 | 6396 | 0 |
GO:0006511 | ubiquitin-dependent protein catabolic process | 0.58% | 7.678 | -0.832 | 4.874 | -1.8804 | 0.781 | 0.321 | 6511 | 0 |
GO:0006541 | glutamine metabolic process | 0.47% | 3.071 | -4.558 | 4.782 | -1.9271 | 0.785 | 0.33 | 6541 | 0 |
GO:0001522 | pseudouridine synthesis | 0.35% | 6.93 | 1.68 | 4.652 | -2.1891 | 0.709 | 0.384 | 1522 | 0 |
GO:0000154 | rRNA modification | 0.35% | null | null | 4.658 | -1.3136 | 0.678 | 0.701 | 1522 | 1 |
GO:0006811 | ion transport | 5.34% | -3.91 | -4.255 | 5.836 | -2.1079 | 0.866 | 0.389 | 6811 | 0 |
GO:0046600 | negative regulation of centriole replication | 0.00% | -2.59 | 5.971 | 2.477 | -2.1569 | 0.824 | 0.407 | 46600 | 0 |
GO:0006486 | protein glycosylation | 0.32% | 5.491 | -1.021 | 4.61 | -2.0586 | 0.675 | 0.429 | 6486 | 0 |
GO:0051103 | DNA ligation involved in DNA repair | 0.04% | 3.562 | 5.637 | 3.703 | -2.8009 | 0.703 | 0.454 | 51103 | 0 |
GO:0035023 | regulation of Rho protein signal transduction | 0.13% | 1.394 | 7.1 | 4.206 | -1.5602 | 0.764 | 0.515 | 35023 | 0 |
GO:0055085 | transmembrane transport | 8.92% | -4.515 | -3.723 | 6.058 | -1.5468 | 0.863 | 0.535 | 55085 | 0 |
GO:0006355 | regulation of transcription, DNA-templated | 9.92% | 4.282 | 3.292 | 6.105 | -1.7475 | 0.633 | 0.537 | 6355 | 0 |
GO:0006468 | protein phosphorylation | 4.14% | 6.56 | -0.302 | 5.725 | -2.3389 | 0.67 | 0.56 | 6468 | 0 |
GO:0034220 | ion transmembrane transport | 3.53% | -3.536 | -4.332 | 5.656 | -1.6655 | 0.846 | 0.566 | 34220 | 0 |
GO:0006470 | protein dephosphorylation | 0.59% | 6.774 | -1.091 | 4.875 | -1.5439 | 0.715 | 0.568 | 6470 | 0 |
GO:0009451 | RNA modification | 1.78% | 6.493 | 1.45 | 5.358 | -2.1891 | 0.686 | 0.581 | 9451 | 0 |
GO:0015914 | phospholipid transport | 0.08% | -4.251 | -3.327 | 3.987 | -1.3062 | 0.821 | 0.584 | 15914 | 0 |
GO:0035556 | intracellular signal transduction | 4.00% | 1.211 | 6.286 | 5.71 | -1.705 | 0.704 | 0.593 | 35556 | 0 |
GO:0007186 | G-protein coupled receptor signaling pathway | 0.88% | 0.818 | 6.368 | 5.054 | -1.6167 | 0.736 | 0.638 | 7186 | 0 |
GO:0007166 | cell surface receptor signaling pathway | 0.92% | 0.957 | 6.609 | 5.072 | -1.4923 | 0.735 | 0.641 | 7166 | 0 |
GO:0006281 | DNA repair | 2.23% | 3.916 | 4.695 | 5.457 | -2.8009 | 0.608 | 0.684 | 6281 | 0 |
GO:0006310 | DNA recombination | 1.64% | 6.285 | 3.557 | 5.323 | -2.8009 | 0.732 | 0.688 | 6310 | 0 |
# A plotting R script produced by the REVIGO server at http://revigo.irb.hr/
# If you found REVIGO useful in your work, please cite the following reference:
# Supek F et al. "REVIGO summarizes and visualizes long lists of Gene Ontology
# terms" PLoS ONE 2011. doi:10.1371/journal.pone.0021800
# --------------------------------------------------------------------------
# If you don't have the ggplot2 package installed, uncomment the following line:
# install.packages( "ggplot2" );
library( ggplot2 );
# --------------------------------------------------------------------------
# If you don't have the scales package installed, uncomment the following line:
# install.packages( "scales" );
library( scales );
# --------------------------------------------------------------------------
# Here is your data from REVIGO. Scroll down for plot configuration options.
revigo.names <- c("term_ID","description","frequency_%","plot_X","plot_Y","plot_size","log10_p_value","uniqueness","dispensability");
revigo.data <- rbind(c("GO:0006260","DNA replication", 1.577, 5.949, 2.822, 5.306,-6.1605,0.719,0.000),
c("GO:0007155","cell adhesion", 0.544,-6.544, 1.248, 4.844,-6.4135,0.914,0.000),
c("GO:0023052","signaling", 6.765,-3.032,-0.496, 5.939,-7.3332,0.956,0.000),
c("GO:0070588","calcium ion transmembrane transport", 0.157,-3.153,-4.713, 4.305,-2.1079,0.868,0.000),
c("GO:0009058","biosynthetic process",31.611,-1.225,-0.762, 6.608,-2.4968,0.943,0.026),
c("GO:0006334","nucleosome assembly", 0.089,-4.686, 5.561, 4.057,-3.6686,0.881,0.029),
c("GO:0007264","small GTPase mediated signal transduction", 0.485, 1.480, 6.771, 4.794,-2.1354,0.743,0.034),
c("GO:0006457","protein folding", 0.903,-4.690, 1.661, 5.064,-1.7376,0.912,0.037),
c("GO:0007154","cell communication", 7.219,-6.016, 3.233, 5.967,-4.0935,0.904,0.048),
c("GO:0005975","carbohydrate metabolic process", 5.260,-3.347, 3.585, 5.829,-3.2718,0.888,0.081),
c("GO:0006030","chitin metabolic process", 0.077, 2.981,-3.369, 3.994,-3.9452,0.855,0.103),
c("GO:0032958","inositol phosphate biosynthetic process", 0.007, 4.623,-5.390, 2.943,-6.0631,0.827,0.121),
c("GO:0030154","cell differentiation", 1.133, 1.041,-6.177, 5.162,-5.2100,0.858,0.133),
c("GO:0016567","protein ubiquitination", 0.523, 7.405, 0.126, 4.827,-2.6487,0.717,0.150),
c("GO:0006006","glucose metabolic process", 0.430, 2.958,-6.194, 4.741,-3.4042,0.835,0.213),
c("GO:0055114","oxidation-reduction process",15.060, 1.160,-1.152, 6.286,-6.1605,0.859,0.214),
c("GO:0019427","acetyl-CoA biosynthetic process from acetate", 0.031, 4.516,-4.710, 3.604,-2.0321,0.810,0.237),
c("GO:0006886","intracellular protein transport", 1.199,-4.141,-3.625, 5.187,-1.5764,0.862,0.263),
c("GO:0009408","response to heat", 0.166, 0.479, 7.378, 4.328,-1.7376,0.834,0.315),
c("GO:0006396","RNA processing", 3.210, 6.772, 2.794, 5.615,-2.1891,0.746,0.319),
c("GO:0006511","ubiquitin-dependent protein catabolic process", 0.584, 7.678,-0.832, 4.874,-1.8804,0.781,0.321),
c("GO:0006541","glutamine metabolic process", 0.472, 3.071,-4.558, 4.782,-1.9271,0.785,0.330),
c("GO:0001522","pseudouridine synthesis", 0.350, 6.930, 1.680, 4.652,-2.1891,0.709,0.384),
c("GO:0006811","ion transport", 5.344,-3.910,-4.255, 5.836,-2.1079,0.866,0.389),
c("GO:0046600","negative regulation of centriole replication", 0.002,-2.590, 5.971, 2.477,-2.1569,0.824,0.407),
c("GO:0006486","protein glycosylation", 0.317, 5.491,-1.021, 4.610,-2.0586,0.675,0.429),
c("GO:0051103","DNA ligation involved in DNA repair", 0.039, 3.562, 5.637, 3.703,-2.8009,0.703,0.454),
c("GO:0035023","regulation of Rho protein signal transduction", 0.125, 1.394, 7.100, 4.206,-1.5602,0.764,0.515),
c("GO:0055085","transmembrane transport", 8.916,-4.515,-3.723, 6.058,-1.5468,0.863,0.535),
c("GO:0006355","regulation of transcription, DNA-templated", 9.917, 4.282, 3.292, 6.105,-1.7475,0.633,0.537),
c("GO:0006468","protein phosphorylation", 4.137, 6.560,-0.302, 5.725,-2.3389,0.670,0.560),
c("GO:0034220","ion transmembrane transport", 3.528,-3.536,-4.332, 5.656,-1.6655,0.846,0.566),
c("GO:0006470","protein dephosphorylation", 0.585, 6.774,-1.091, 4.875,-1.5439,0.715,0.568),
c("GO:0009451","RNA modification", 1.778, 6.493, 1.450, 5.358,-2.1891,0.686,0.581),
c("GO:0015914","phospholipid transport", 0.076,-4.251,-3.327, 3.987,-1.3062,0.821,0.584),
c("GO:0035556","intracellular signal transduction", 4.000, 1.211, 6.286, 5.710,-1.7050,0.704,0.593),
c("GO:0007186","G-protein coupled receptor signaling pathway", 0.882, 0.818, 6.368, 5.054,-1.6167,0.736,0.638),
c("GO:0007166","cell surface receptor signaling pathway", 0.920, 0.957, 6.609, 5.072,-1.4923,0.735,0.641),
c("GO:0006281","DNA repair", 2.234, 3.916, 4.695, 5.457,-2.8009,0.608,0.684),
c("GO:0006310","DNA recombination", 1.641, 6.285, 3.557, 5.323,-2.8009,0.732,0.688));
one.data <- data.frame(revigo.data);
names(one.data) <- revigo.names;
one.data <- one.data [(one.data$plot_X != "null" & one.data$plot_Y != "null"), ];
one.data$plot_X <- as.numeric( as.character(one.data$plot_X) );
one.data$plot_Y <- as.numeric( as.character(one.data$plot_Y) );
one.data$plot_size <- as.numeric( as.character(one.data$plot_size) );
one.data$log10_p_value <- as.numeric( as.character(one.data$log10_p_value) );
one.data$frequency <- as.numeric( as.character(one.data$frequency) );
one.data$uniqueness <- as.numeric( as.character(one.data$uniqueness) );
one.data$dispensability <- as.numeric( as.character(one.data$dispensability) );
#head(one.data);
# --------------------------------------------------------------------------
# Names of the axes, sizes of the numbers and letters, names of the columns,
# etc. can be changed below
p1 <- ggplot( data = one.data );
p1 <- p1 + geom_point( aes( plot_X, plot_Y, colour = log10_p_value, size = plot_size), alpha = I(0.6) ) + scale_size_area();
p1 <- p1 + scale_colour_gradientn( colours = c("blue", "green", "yellow", "red"), limits = c( min(one.data$log10_p_value), 0) );
p1 <- p1 + geom_point( aes(plot_X, plot_Y, size = plot_size), shape = 21, fill = "transparent", colour = I (alpha ("black", 0.6) )) + scale_size_area();
p1 <- p1 + scale_size( range=c(5, 30)) + theme_bw(); # + scale_fill_gradientn(colours = heat_hcl(7), limits = c(-300, 0) );
#ex <- one.data [ one.data$dispensability < 0.15, ];
p1 <- p1 + geom_text( data = one.data, aes(plot_X, plot_Y, label = description), colour = I(alpha("black", 0.85)), size = 3 );
p1 <- p1 + labs (y = "semantic space x", x = "semantic space y");
p1 <- p1 + theme(legend.key = element_blank()) ;
one.x_range = max(one.data$plot_X) - min(one.data$plot_X);
one.y_range = max(one.data$plot_Y) - min(one.data$plot_Y);
p1 <- p1 + xlim(min(one.data$plot_X)-one.x_range/10,max(one.data$plot_X)+one.x_range/10);
p1 <- p1 + ylim(min(one.data$plot_Y)-one.y_range/10,max(one.data$plot_Y)+one.y_range/10);
# --------------------------------------------------------------------------
# Output the plot to screen
p1;
Create bed files to visualze as a track of DMGs_2kbslop in IGV
DMGs_2kbslop.bed <- meth_filter_genes_2kbslop %>%
filter(contig_gene %in% DMGs_2kbslop.genes$contig_gene &
start_gene_2kb %in% DMGs_2kbslop.genes$start_gene &
end_gene_2kb %in% DMGs_2kbslop.genes$end_gene) %>%
dplyr::select(contig_gene, start_gene_2kb, end_gene_2kb,
unknown1_gene, strand_gene, unknown2_gene, notes_gene) %>%
distinct(contig_gene, start_gene_2kb, end_gene_2kb)
#DMGs_2kbslop.bed <- DMGs_2kbslop.bed[!duplicated(test$contig_gene),]
readr::write_delim(DMGs_2kbslop.bed, "../analyses/DMGs/DMGs_2kbslop.bed", delim = '\t', col_names = FALSE)
NOTE: this code is commented out as it’s not needed and prints too many bar plots
# ggplotly(meth_filter_genes_2kbslop %>%
# filter(contig_gene %in% DMGs_2kbslop.genes$contig_gene &
# start_gene %in% DMGs_2kbslop.genes$start_gene &
# end_gene %in% DMGs_2kbslop.genes$end_gene) %>%
# group_by(population, gene) %>%
# summarise(allCs_percent = 100*(sum(numCs)/sum(coverage)),
# mean_percentMeth = mean(percMeth)) %>%
# ggplot(aes(x = population, y = mean_percentMeth, fill = population)) + geom_bar(stat="identity") + facet_wrap(~gene) + theme_light() + scale_fill_manual(values=c("firebrick3","dodgerblue3")))
#
# #checking to make sure numCs + numTs = coverage; should be 1:1 line
# plot(meth_filter_genes_2kbslop$numCs + meth_filter_genes_2kbslop$numTs ~ meth_filter_genes_2kbslop$coverage)
#
# ## Look at coverage for each DMG by population (mean % methylation across samples)
# ggplotly(meth_filter_genes_2kbslop %>%
# filter(contig_gene %in% DMGs_2kbslop.genes$contig_gene &
# start_gene %in% DMGs_2kbslop.genes$start_gene &
# end_gene %in% DMGs_2kbslop.genes$end_gene) %>%
# group_by(population, gene) %>%
# summarise(sum_cov = sum(coverage),
# mean_cov = mean(coverage)) %>%
# ggplot(aes(x = population, y = mean_cov, fill = population)) +
# geom_bar(stat="identity") +
# facet_wrap(~gene) + theme_light() + scale_fill_manual(values=c("firebrick3","dodgerblue3")))
#
# DMG_counts <- meth_filter_genes_2kbslop %>%
# filter(contig_gene %in% DMGs_2kbslop.genes$contig_gene &
# start_gene %in% DMGs_2kbslop.genes$start_gene &
# end_gene %in% DMGs_2kbslop.genes$end_gene)
#
# # Look at coverage for each DMG locus, by population
# # mean % methylation across samples
# DMG_genes_unique <- unique(DMG_counts$gene)
# for (i in 1:length(DMG_genes_unique)) {
# temp <- DMG_counts %>%
# filter(chr == "Contig22489") %>%
# group_by(population, chr, start) %>%
# summarise(allCs_percent = 100*(sum(numCs)/sum(coverage)),
# mean_percentMeth = mean(percMeth))
# print(ggplotly(ggplot(temp, aes(x = population, y = mean_percentMeth, fill = population)) +
# geom_bar(stat="identity") +
# facet_wrap(~start) +
# theme_light() + ggtitle(paste("gene = ", "Contig22489", sep="")) +
# scale_fill_manual(values=c("firebrick3","dodgerblue3"))))
# }
intersectBed \
-wb \
-a "../analyses/DMGs/DMGs_2kbslop.bed" \
-b "../analyses/DMLs/dml25.bed" \
> "../analyses/DMGs/DMGs_2kbslop-with-DMLs.tab"
wc -l "../analyses/DMGs/DMGs_2kbslop-with-DMLs.tab"
## 96 ../analyses/DMGs/DMGs_2kbslop-with-DMLs.tab
dml25 <- read_delim(file = here::here("analyses", "DMLs", "dml25.bed"), delim = "\t", col_names = TRUE)
## Parsed with column specification:
## cols(
## Contig100994 = col_character(),
## `3427` = col_double(),
## `3429` = col_double(),
## `-31.801994301994306` = col_double()
## )
DMLs.in.DMGs_2kbslop <-
read_delim(file = here::here("analyses", "DMGs", "DMGs_2kbslop-with-DMLs.tab"), delim = "\t", col_names = c(colnames(DMGs_2kbslop.bed), colnames(dml25))) #%>%
## Parsed with column specification:
## cols(
## contig_gene = col_character(),
## start_gene_2kb = col_double(),
## end_gene_2kb = col_double(),
## Contig100994 = col_character(),
## `3427` = col_double(),
## `3429` = col_double(),
## `-31.801994301994306` = col_double()
## )
#mutate(gene=paste(contig_gene, start_gene, sep="_"))
write.csv(DMLs.in.DMGs_2kbslop, file = "../analyses/DMGs/DMLS.in.DMGs_2kbslop.csv")
save(DMLs.in.DMGs_2kbslop, file="../analyses/DMGs/R-objects/DMLs.in.DMGs_2kbslop")
# Barplots of all DMLs also located in DMGs_2kbslop
DMLs.in.DMGs_2kbslop.calcs <- meth_filter_reshaped %>%
filter(chr %in% DMLs.in.DMGs_2kbslop$contig_gene &
start %in% DMLs.in.DMGs_2kbslop$start_gene+1 &
end %in% DMLs.in.DMGs_2kbslop$end_gene-1) %>%
group_by(population, chr, start) %>%
dplyr::summarise(
mean_percMeth = mean(percMeth, na.rm=TRUE),
sd_percMeth=sd(percMeth, na.rm=TRUE),
n())
## Warning: Unknown or uninitialised column: `start_gene`.
## Warning: Unknown or uninitialised column: `end_gene`.
DMLs.in.DMGs_2kbslop.calcs %>% ungroup() %>% dplyr::select(chr, start) %>% distinct()
## # A tibble: 1,399 x 2
## chr start
## <chr> <int>
## 1 Contig103503 7021
## 2 Contig103503 7028
## 3 Contig103503 7054
## 4 Contig103503 7067
## 5 Contig103503 7072
## 6 Contig1245 9214
## 7 Contig1245 9223
## 8 Contig1245 9229
## 9 Contig1245 9246
## 10 Contig1245 9269
## # … with 1,389 more rows
#Plots don't work when knitted
# DMLs_in_DMGs_2kbslop_plots <- list()
# for (i in 1:nrow(DMLs.in.DMGs_2kbslop)) {
# DMLs_in_DMGs_2kbslop_plots[[i]] <-
# DMLs.in.DMGs_2kbslop.calcs %>%
# filter(chr==DMLs.in.DMGs_2kbslop$contig_gene[i] &
# start==DMLs.in.DMGs_2kbslop$start_gene[i]+1) %>%
# ggplot(aes(x = population, y = mean_percMeth, fill = population,
# label=paste0(round(mean_percMeth, digits = 2), "%"))) +
# geom_bar(stat="identity", width = 0.5) + ylim(0,110) +
# geom_pointrange(aes(ymin=mean_percMeth,
# ymax=mean_percMeth+sd_percMeth)) +
# geom_text(size=3, vjust=-0.5, hjust=1.25) +
# theme_light() + ggtitle(paste("Contig = ", DMLs.in.DMGs_2kbslop[i, "contig_gene"], ", Locus = ",
# DMLs.in.DMGs_2kbslop[i+1, "start_gene"], sep="")) +
# scale_fill_manual(values=c("firebrick3","dodgerblue3"))
# }
# DMLs_in_DMGs_2kbslop_plots[1:6]
# How many gene regions are there after filtering for those with 5 methylated loci?
sub_meth_table_2kbslop %>%
distinct(gene) %>%
nrow()
## [1] 1393
head(sub_meth_table_2kbslop)
## # A tibble: 6 x 21
## chr start end strand sample coverage numCs numTs percMeth population
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Cont… 39226 39226 + 1 21 11 10 52.4 HC
## 2 Cont… 39234 39234 + 1 24 10 14 41.7 HC
## 3 Cont… 64179 64179 + 1 10 9 1 90 HC
## 4 Cont… 71523 71523 + 1 13 13 0 100 HC
## 5 Cont… 71533 71533 + 1 17 17 0 100 HC
## 6 Cont… 71542 71542 + 1 16 15 1 93.8 HC
## # … with 11 more variables: contig_gene <chr>, source_gene <chr>,
## # feature_gene <chr>, start_gene_2kb <dbl>, end_gene_2kb <dbl>,
## # unknown1_gene <chr>, strand_gene <chr>, unknown2_gene <chr>,
## # notes_gene <chr>, gene <chr>, group <chr>
head(meth_filter_reshaped) #data frame that contains coverage and % methylation info
## chr start end strand sample coverage numCs numTs percMeth population
## 1 Contig0 39226 39226 + 1 21 11 10 52.38095 HC
## 2 Contig0 39234 39234 + 1 24 10 14 41.66667 HC
## 3 Contig0 64179 64179 + 1 10 9 1 90.00000 HC
## 4 Contig0 71523 71523 + 1 13 13 0 100.00000 HC
## 5 Contig0 71533 71533 + 1 17 17 0 100.00000 HC
## 6 Contig0 71542 71542 + 1 16 15 1 93.75000 HC
# I think this filtering step doesn't account for the 2kb +/- start and stop
perc_meth_genes_2kbslop <- sub_meth_table_2kbslop %>%
group_by(population, sample, contig_gene, start_gene_2kb, end_gene_2kb) %>%
dplyr::summarise(
mean_percMeth = mean(percMeth, na.rm=TRUE),
sd_percMeth=sd(percMeth, na.rm=TRUE),
n())
# check to make sure % methylation is calculated separately for each sample and gene region
perc_meth_genes_2kbslop %>% filter(contig_gene=="Contig0", start_gene_2kb==10497, end_gene_2kb==95068) %>%
ggplot(aes(x=sample, y=mean_percMeth)) + geom_bar(stat="identity")
# How many unique gene regions?
perc_meth_genes_2kbslop %>%
ungroup() %>%
dplyr::select(contig_gene, start_gene_2kb, end_gene_2kb) %>%
distinct(contig_gene, start_gene_2kb, end_gene_2kb) %>%
nrow()
## [1] 1393
# Reshape data. Need to have one row per sample, one column with the population, and separate columns with each gene region with % methylation.
perc_meth_genes_2kbslop_wide <- perc_meth_genes_2kbslop %>%
ungroup() %>%
tidyr::unite("gene_region", c("contig_gene", "start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
dplyr::select(population, sample, gene_region, mean_percMeth) %>%
spread(gene_region, mean_percMeth) %>%
tibble::column_to_rownames(var = "sample")
head(perc_meth_genes_2kbslop_wide[1:4]) #confirm correct format
## population Contig0_10497_95068 Contig100188_1_4123 Contig100396_1_4908
## 1 HC 88.86439 97.79135 93.75000
## 2 HC 88.54722 91.18234 93.47985
## 3 HC 89.21439 88.03419 93.99802
## 4 HC 87.40210 90.34792 95.97155
## 5 HC 84.91479 94.96795 87.76224
## 6 HC 84.48810 100.00000 100.00000
ncol(perc_meth_genes_2kbslop_wide) #1394 gene regions
## [1] 1394
#Now run the following line and it will provide Pst estimates for every gene.
genes_2kbslop_5loci_Pst <- Pst(perc_meth_genes_2kbslop_wide)
## [1] "Populations sizes are:"
## HC SS
## 9 9
# Check out Pst distribution
hist(genes_2kbslop_5loci_Pst$Pst_Values)
summary(genes_2kbslop_5loci_Pst$Pst_Values)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000003 0.0603303 0.2234150 0.2948533 0.5093137 0.9200812
nrow(genes_2kbslop_5loci_Pst)
## [1] 1393
# format of dataframe that I will save
head(genes_2kbslop_5loci_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE))
## Quant_Varia contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1 Contig0_10497_95068 Contig0 10497 95068 0.01933289
## 2 Contig100188_1_4123 Contig100188 1 4123 0.73496201
## 3 Contig100396_1_4908 Contig100396 1 4908 0.18520587
## 4 Contig100994_1_8602 Contig100994 1 8602 0.70114357
## 5 Contig1013_14071_19514 Contig1013 14071 19514 0.01722196
## 6 Contig1013_16204_21663 Contig1013 16204 21663 0.03959333
# Write out Pst results
write.table(genes_2kbslop_5loci_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE), file = here::here("analyses", "DMGs", "Pst_gene_2kbslop_5loci.tab"), sep = '\t', na = "NA", row.names = FALSE, col.names = TRUE)
head(genes_2kbslop_5loci_Pst %>%
separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
filter(Pst_Values>0.5))
## Quant_Varia contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1 Contig100188_1_4123 Contig100188 1 4123 0.7349620
## 2 Contig100994_1_8602 Contig100994 1 8602 0.7011436
## 3 Contig101333_3371_8675 Contig101333 3371 8675 0.7122709
## 4 Contig1024_1_14330 Contig1024 1 14330 0.9200812
## 5 Contig103503_4845_9852 Contig103503 4845 9852 0.8293130
## 6 Contig10373_1_3003 Contig10373 1 3003 0.7098571
#Merge # (gene region list+Pst values) with (gene region list + adjusted p-values (<0.05 indicates significant))
genes_2kbslop_Pst_DMGpvalue <- merge(x=genes_2kbslop_5loci_Pst,
y=results_2kbslop[!duplicated(results_2kbslop$gene), c("contig_gene", "start_gene_2kb", "end_gene_2kb", "notes_gene", "gene", "adj.pval.pop")],
by.x="Quant_Varia",
by.y="gene")
save(genes_2kbslop_Pst_DMGpvalue, file = "../analyses/DMGs/R-objects/genes_2kbslop_Pst_DMGpvalue")
# Assess relationship between Pst values and P-adjusted for DMG regions
hist(genes_2kbslop_Pst_DMGpvalue$Pst_Values^0.5)
hist(genes_2kbslop_Pst_DMGpvalue$adj.pval.pop)
summary(lm(Pst_Values^.5 ~ adj.pval.pop, data=genes_2kbslop_Pst_DMGpvalue))
##
## Call:
## lm(formula = Pst_Values^0.5 ~ adj.pval.pop, data = genes_2kbslop_Pst_DMGpvalue)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.62760 -0.12509 0.03018 0.14487 0.49766
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.707490 0.008561 82.64 <2e-16 ***
## adj.pval.pop -0.533980 0.015650 -34.12 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1939 on 1391 degrees of freedom
## Multiple R-squared: 0.4556, Adjusted R-squared: 0.4552
## F-statistic: 1164 on 1 and 1391 DF, p-value: < 2.2e-16
library(ggpmisc)
formula <- genes_2kbslop_Pst_DMGpvalue$adj.pval.pop ~ genes_2kbslop_Pst_DMGpvalue$Pst_Values
ggplot(genes_2kbslop_Pst_DMGpvalue, aes(x=Pst_Values, y=adj.pval.pop)) +
geom_point(aes(colour = cut(adj.pval.pop, c(-Inf, 0.05, Inf)))) +
scale_color_manual(name = "DMG region significance",
values = c("(-Inf,0.05]" = "red",
"(0.05, Inf]" = "black"),
labels = c("significant", "non-significant")) +
ylab("P-Adjusted from diff. methylated gene region analysis") +
xlab("Pst values, gene regions") +
ggtitle("Gene Region P-Adjusted ~ Pst") +
geom_smooth(method = "lm", se = F) +
stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")),
label.x.npc = "right", label.y.npc = 0.85,
formula = formula, parse = TRUE, size = 3, col="blue") +
ylim(c(0,1))
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 13 rows containing missing values (geom_smooth).
# Read in data file with gene bodies that contain methylated loci
meth_filter_genes <-
read_delim(file = here::here("analyses", "methylation-filtered", "meth_filter_gene.tab"), delim = "\t", col_names = c(colnames(meth_filter_reshaped[-10]), "population", "contig_gene", "source_gene", "feature_gene", "start_gene", "end_gene", "unknown1_gene", "strand_gene", "unknown2_gene", "notes_gene")) %>%
mutate(gene=paste(contig_gene, start_gene, end_gene, sep="_")) %>%
mutate(group=paste(sample, gene, sep="_"))
## Parsed with column specification:
## cols(
## chr = col_character(),
## start = col_double(),
## end = col_double(),
## strand = col_character(),
## sample = col_double(),
## coverage = col_double(),
## numCs = col_double(),
## numTs = col_double(),
## percMeth = col_double(),
## population = col_character(),
## contig_gene = col_character(),
## source_gene = col_character(),
## feature_gene = col_character(),
## start_gene = col_double(),
## end_gene = col_double(),
## unknown1_gene = col_character(),
## strand_gene = col_character(),
## unknown2_gene = col_character(),
## notes_gene = col_character()
## )
# Filter for gene bodies that have 5 or more methylated loci
min.filt_genes <- dplyr::count(meth_filter_genes, vars = c(group))
newdata <- min.filt_genes[which(min.filt_genes$n > 4), ]
sub_meth_table <- meth_filter_genes[meth_filter_genes$group %in% newdata$vars,]
# Calculate average % methylation within gene bodies by sample
perc_meth_genes <- sub_meth_table %>%
group_by(population, sample, contig_gene, start_gene, end_gene) %>%
dplyr::summarise(
mean_percMeth = mean(percMeth, na.rm=TRUE),
sd_percMeth=sd(percMeth, na.rm=TRUE),
n())
# check to make sure % methylation is calculated separately for each sample and gene region
perc_meth_genes %>% filter(contig_gene=="Contig0", start_gene==12497, end_gene==93068) %>%
ggplot(aes(x=sample, y=mean_percMeth)) + geom_bar(stat="identity")
# Reshape data. Need to have one row per sample, one column with the population, and separate columns with each gene region with % methylation.
perc_meth_genes_wide <- perc_meth_genes %>%
ungroup() %>%
tidyr::unite("gene_region", c("contig_gene", "start_gene", "end_gene"), sep = "_", remove = FALSE) %>%
dplyr::select(population, sample, gene_region, mean_percMeth) %>%
spread(gene_region, mean_percMeth) %>%
column_to_rownames(var = "sample")
head(perc_meth_genes_wide[1:4]) #confirm correct format
## population Contig0_12497_93068 Contig100188_461_2761 Contig100396_517_2908
## 1 HC 88.86439 97.79135 93.75000
## 2 HC 88.54722 91.18234 93.47985
## 3 HC 89.21439 88.03419 93.99802
## 4 HC 87.40210 90.34792 95.97155
## 5 HC 84.91479 94.96795 87.76224
## 6 HC 84.48810 100.00000 100.00000
#Now run the following line and it will provide Pst estimates for every gene.
genes_Pst <- Pst(perc_meth_genes_wide)
## [1] "Populations sizes are:"
## HC SS
## 9 9
hist(genes_Pst$Pst_Values)
summary(genes_Pst$Pst_Values)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000003 0.0653852 0.2306499 0.2976943 0.5102395 0.9200812
nrow(genes_Pst)
## [1] 1248
# Write out Pst results
write.table(genes_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene", "end_gene"), sep = "_", remove = FALSE), file = here::here("analyses", "DMGs", "Pst_gene.tab"), sep = '\t', na = "NA", row.names = FALSE, col.names = TRUE)
DMG.ratios <- DMGs_2kbslop %>%
#mutate(contig_gene_start=paste(contig_gene, start_gene_2kb, sep="_")) %>%
group_by(population, gene) %>%
summarise(mean_percentMeth = mean(percMeth, na.rm = TRUE)) %>%
dcast(gene ~ population) %>%
mutate(ratio_HC.SS = HC/SS) %>%
arrange(ratio_HC.SS) #in this ratio column, values <1 = HC hypomethylated, values >1 = SS hypomethylated
## Using mean_percentMeth as value column: use value.var to override.
ggplot(DMGs_2kbslop, aes(sample, gene, fill= percMeth)) + xlab("Sample") + geom_tile(na.rm = T) +
scale_y_discrete(limits=(DMG.ratios[order(DMG.ratios$ratio_HC.SS),]$gene)) +
#scale_fill_viridis(discrete=FALSE)
theme(axis.title.y=element_blank(),
axis.text.y=element_blank(),
axis.ticks.y=element_blank()) +
scale_fill_distiller(palette = "YlGnBu", direction = 1)
#scale_fill_gradient(low="gray5", high="white")
newdata_4loci <- min.filt_2kbslop[which(min.filt_2kbslop$n > 3), ]
sub_meth_table_2kbslop_4loci <- meth_filter_genes_2kbslop[meth_filter_genes_2kbslop$group %in% newdata_4loci$vars,]
save(sub_meth_table_2kbslop_4loci, file="../analyses/DMGs/R-objects/sub_meth_table_2kbslop_4loci")
length(unique(sub_meth_table_2kbslop_4loci$gene))
## [1] 1723
Note: this script was created by Hollie Putnam (GM.Rmd); there are minor revisions below. I retained some commented out lines (notably-testing for position w/n gene, such as intron & exon) in case we want to include those as factors in the future.
# create data frame to stored results
results_2kbslop_4loci <- data.frame()
gs <- unique(sub_meth_table_2kbslop_4loci$gene)
#first subset the unique dataframes and second run the GLMs
for(i in 1:length(gs)){
#subset the dataframe gene by gene
sub_meth_table_2kbslop1 <- subset(sub_meth_table_2kbslop_4loci, gene ==gs[i])
# fit glm position model
fit <- glm(matrix(c(numCs, numTs), ncol=2) ~ as.factor(population) + (1|sample),
data=sub_meth_table_2kbslop1, family=binomial)
a <- anova(fit, test="Chisq")
# capture summary stats to data frame
df <- data.frame(sub_meth_table_2kbslop1[c("population", "sample", "contig_gene", "start_gene_2kb", "end_gene_2kb", "gene", "chr", "start", "sample", "coverage", "numCs", "numTs", "percMeth", "notes_gene")],
pval.treatment = a$`Pr(>Chi)`[2],
#pval.position = a$`Pr(>Chi)`[3], #uncomment if you want to include position of CpG within a gene
#pval.treatment_x_position = a$`Pr(>Chi)`[4], #uncomment if you want to include position of CpG within a gene interaction with treatment
stringsAsFactors = F)
# bind rows of temporary data frame to the results data frame
results_2kbslop_4loci <- rbind(results_2kbslop_4loci, df)
}
results_2kbslop_4loci[is.na(results_2kbslop_4loci)] <- 0
results_2kbslop_4loci$adj.pval.pop <- p.adjust(results_2kbslop_4loci$pval.treatment, method='BH')
#results_2kbslop$adj.pval.position <- p.adjust(results_2kbslop$pval.position, method='BH') #uncomment if you want to include position of CpG within a gene
#results_2kbslop$adj.pval.treatment_x_position <- p.adjust(results_2kbslop$pval.treatment_x_position, method='BH') #uncomment if you want to include position of CpG within a gene interaction with treatment
#save df with differentially methylated genes
DMGs_2kbslop_4loci <- subset(results_2kbslop_4loci, adj.pval.pop < 0.05) #%>%
# mutate(contig_gene_start=paste(contig_gene, start_gene_2kb, sep="_"))
DMGs_2kbslop_4loci.genes <- DMGs_2kbslop_4loci[!duplicated(DMGs_2kbslop_4loci$gene), c("contig_gene", "start_gene_2kb", "end_gene_2kb", "notes_gene", "pval.treatment", "adj.pval.pop")]
save(DMGs_2kbslop_4loci, file="../analyses/DMGs/R-objects/DMGs_2kbslop_4loci")
length(unique(DMGs_2kbslop_4loci$gene))
## [1] 342
# split gene data in "notes_gene" column into separate columns
DMGs_2kbslop_4loci.genes <- DMGs_2kbslop_4loci.genes %>%
mutate(ID=str_extract(notes_gene, "ID=(.*?);"),
Parent=str_extract(notes_gene, "Parent=(.*?);"),
Name=str_extract(notes_gene, "Name=(.*?);"),
Alias=str_extract(notes_gene, "Alias=(.*?);"),
AED=str_extract(notes_gene, "AED=(.*?);"),
eAED=str_extract(notes_gene, "eAED=(.*?);"),
Note=str_extract(notes_gene, "Note=(.*?);"),
Ontology_term=str_extract(notes_gene, "Ontology_term=(.*?);"),
Dbxref=str_extract(notes_gene, "Dbxref=(.*?);")
)
write_csv(DMGs_2kbslop_4loci.genes, path = here::here("analyses", "DMGs", "DMGs_2kbslop_4loci.genes.csv"))
#Extract GO terms
DMGs_2kbslop_4loci.genes.GO <- DMGs_2kbslop_4loci.genes %>%
mutate(Ontology_term = str_replace(Ontology_term, pattern="Ontology_term=",replacement = "")) %>%
mutate(Ontology_term = str_replace(Ontology_term, pattern=";",replacement = "")) %>%
separate(Ontology_term, sep=",", into=paste("GO", 1:11, sep="_")) %>%
pivot_longer(cols=c("GO_1","GO_2","GO_3","GO_4","GO_5","GO_6","GO_7","GO_8","GO_9","GO_10","GO_11"), names_to = "GO_number", values_to = "GO_term") %>%
dplyr::select(-GO_number) %>%
filter(!is.na(Note) & !is.na(GO_term))
## Warning: Expected 11 pieces. Missing pieces filled with `NA` in 156 rows [1, 3,
## 4, 5, 9, 10, 11, 12, 14, 15, 17, 19, 20, 22, 23, 29, 31, 32, 33, 34, ...].
write_delim(DMGs_2kbslop_4loci.genes.GO[,c("GO_term","adj.pval.pop")], path = here::here("analyses/", "DMGs/", "DMGs_2kbslop_4loci.GO.txt"), delim = '\t', col_names = F) #write out df with just GO terms and p-adj values
# How many gene regions are there after filtering for those with 4 methylated loci?
sub_meth_table_2kbslop_4loci %>%
distinct(gene) %>%
nrow()
## [1] 1723
# I think this filtering step doesn't account for the 2kb +/- start and stop
perc_meth_genes_2kbslop_4loci <- sub_meth_table_2kbslop_4loci %>%
group_by(population, sample, contig_gene, start_gene_2kb, end_gene_2kb) %>%
dplyr::summarise(
mean_percMeth = mean(percMeth, na.rm=TRUE),
sd_percMeth=sd(percMeth, na.rm=TRUE),
n())
# check to make sure % methylation is calculated separately for each sample and gene region
perc_meth_genes_2kbslop_4loci %>% filter(contig_gene=="Contig0", start_gene_2kb==10497, end_gene_2kb==95068) %>%
ggplot(aes(x=sample, y=mean_percMeth)) + geom_bar(stat="identity")
# How many unique gene regions?
perc_meth_genes_2kbslop_4loci %>%
ungroup() %>%
dplyr::select(contig_gene, start_gene_2kb, end_gene_2kb) %>%
distinct(contig_gene, start_gene_2kb, end_gene_2kb) %>%
nrow()
## [1] 1723
# Reshape data. Need to have one row per sample, one column with the population, and separate columns with each gene region with % methylation.
perc_meth_genes_2kbslop_4loci_wide <- perc_meth_genes_2kbslop_4loci %>%
ungroup() %>%
tidyr::unite("gene_region", c("contig_gene", "start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
dplyr::select(population, sample, gene_region, mean_percMeth) %>%
spread(gene_region, mean_percMeth) %>%
tibble::column_to_rownames(var = "sample")
head(perc_meth_genes_2kbslop_4loci_wide[1:4]) #confirm correct format
## population Contig0_10497_95068 Contig100188_1_4123 Contig100396_1_4908
## 1 HC 88.86439 97.79135 93.75000
## 2 HC 88.54722 91.18234 93.47985
## 3 HC 89.21439 88.03419 93.99802
## 4 HC 87.40210 90.34792 95.97155
## 5 HC 84.91479 94.96795 87.76224
## 6 HC 84.48810 100.00000 100.00000
ncol(perc_meth_genes_2kbslop_4loci_wide) #1724 gene regions
## [1] 1724
#Now run the following line and it will provide Pst estimates for every gene.
genes_2kbslop_4loci_Pst <- Pst(perc_meth_genes_2kbslop_4loci_wide)
## [1] "Populations sizes are:"
## HC SS
## 9 9
# Check out Pst distribution
hist(genes_2kbslop_4loci_Pst$Pst_Values)
summary(genes_2kbslop_4loci_Pst$Pst_Values)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000003 0.0660358 0.2423961 0.3014315 0.5164664 0.9306490
nrow(genes_2kbslop_4loci_Pst)
## [1] 1723
# format of dataframe that I will save
head(genes_2kbslop_4loci_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE))
## Quant_Varia contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1 Contig0_10497_95068 Contig0 10497 95068 0.01933289
## 2 Contig100188_1_4123 Contig100188 1 4123 0.73496201
## 3 Contig100396_1_4908 Contig100396 1 4908 0.18520587
## 4 Contig100994_1_8602 Contig100994 1 8602 0.70114357
## 5 Contig10117_48_10916 Contig10117 48 10916 0.05127876
## 6 Contig1013_14071_19514 Contig1013 14071 19514 0.01722196
# Write out Pst results
write.table(genes_2kbslop_4loci_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE), file = here::here("analyses", "DMGs", "Pst_gene_2kbslop_4loci.tab"), sep = '\t', na = "NA", row.names = FALSE, col.names = TRUE)
head(genes_2kbslop_4loci_Pst %>%
separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
filter(Pst_Values>0.5))
## Quant_Varia contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1 Contig100188_1_4123 Contig100188 1 4123 0.7349620
## 2 Contig100994_1_8602 Contig100994 1 8602 0.7011436
## 3 Contig101333_3371_8675 Contig101333 3371 8675 0.7122709
## 4 Contig1024_1_14330 Contig1024 1 14330 0.9200812
## 5 Contig103503_4845_9852 Contig103503 4845 9852 0.8293130
## 6 Contig10373_1_3003 Contig10373 1 3003 0.7098571
newdata_3loci <- min.filt_2kbslop[which(min.filt_2kbslop$n > 2), ]
sub_meth_table_2kbslop_3loci <- meth_filter_genes_2kbslop[meth_filter_genes_2kbslop$group %in% newdata_3loci$vars,]
save(sub_meth_table_2kbslop_3loci, file="../analyses/DMGs/R-objects/sub_meth_table_2kbslop_3loci")
length(unique(sub_meth_table_2kbslop_3loci$gene))
## [1] 2177
Note: this script was created by Hollie Putnam (GM.Rmd); there are minor revisions below. I retained some commented out lines (notably-testing for position w/n gene, such as intron & exon) in case we want to include those as factors in the future.
# create data frame to stored results
results_2kbslop_3loci <- data.frame()
gs <- unique(sub_meth_table_2kbslop_3loci$gene)
#first subset the unique dataframes and second run the GLMs
for(i in 1:length(gs)){
#subset the dataframe gene by gene
sub_meth_table_2kbslop1 <- subset(sub_meth_table_2kbslop_3loci, gene ==gs[i])
# fit glm position model
fit <- glm(matrix(c(numCs, numTs), ncol=2) ~ as.factor(population) + (1|sample),
data=sub_meth_table_2kbslop1, family=binomial)
a <- anova(fit, test="Chisq")
# capture summary stats to data frame
df <- data.frame(sub_meth_table_2kbslop1[c("population", "sample", "contig_gene", "start_gene_2kb", "end_gene_2kb", "gene", "chr", "start", "sample", "coverage", "numCs", "numTs", "percMeth", "notes_gene")],
pval.treatment = a$`Pr(>Chi)`[2],
#pval.position = a$`Pr(>Chi)`[3], #uncomment if you want to include position of CpG within a gene
#pval.treatment_x_position = a$`Pr(>Chi)`[4], #uncomment if you want to include position of CpG within a gene interaction with treatment
stringsAsFactors = F)
# bind rows of temporary data frame to the results data frame
results_2kbslop_3loci <- rbind(results_2kbslop_3loci, df)
}
results_2kbslop_3loci[is.na(results_2kbslop_3loci)] <- 0
results_2kbslop_3loci$adj.pval.pop <- p.adjust(results_2kbslop_3loci$pval.treatment, method='BH')
#results_2kbslop$adj.pval.position <- p.adjust(results_2kbslop$pval.position, method='BH') #uncomment if you want to include position of CpG within a gene
#results_2kbslop$adj.pval.treatment_x_position <- p.adjust(results_2kbslop$pval.treatment_x_position, method='BH') #uncomment if you want to include position of CpG within a gene interaction with treatment
#save df with differentially methylated genes
DMGs_2kbslop_3loci <- subset(results_2kbslop_3loci, adj.pval.pop < 0.05) #%>%
# mutate(contig_gene_start=paste(contig_gene, start_gene_2kb, sep="_"))
DMGs_2kbslop_3loci.genes <- DMGs_2kbslop_3loci[!duplicated(DMGs_2kbslop_3loci$gene), c("contig_gene", "start_gene_2kb", "end_gene_2kb", "notes_gene", "pval.treatment", "adj.pval.pop")]
save(DMGs_2kbslop_3loci, file="../analyses/DMGs/R-objects/DMGs_2kbslop_3loci")
length(unique(DMGs_2kbslop_3loci$gene))
## [1] 401
# split gene data in "notes_gene" column into separate columns
DMGs_2kbslop_3loci.genes <- DMGs_2kbslop_3loci.genes %>%
mutate(ID=str_extract(notes_gene, "ID=(.*?);"),
Parent=str_extract(notes_gene, "Parent=(.*?);"),
Name=str_extract(notes_gene, "Name=(.*?);"),
Alias=str_extract(notes_gene, "Alias=(.*?);"),
AED=str_extract(notes_gene, "AED=(.*?);"),
eAED=str_extract(notes_gene, "eAED=(.*?);"),
Note=str_extract(notes_gene, "Note=(.*?);"),
Ontology_term=str_extract(notes_gene, "Ontology_term=(.*?);"),
Dbxref=str_extract(notes_gene, "Dbxref=(.*?);")
)
write_csv(DMGs_2kbslop_3loci.genes, path = here::here("analyses", "DMGs", "DMGs_2kbslop_3loci.genes.csv"))
#Extract GO terms
DMGs_2kbslop_3loci.genes.GO <- DMGs_2kbslop_3loci.genes %>%
mutate(Ontology_term = str_replace(Ontology_term, pattern="Ontology_term=",replacement = "")) %>%
mutate(Ontology_term = str_replace(Ontology_term, pattern=";",replacement = "")) %>%
separate(Ontology_term, sep=",", into=paste("GO", 1:11, sep="_")) %>%
pivot_longer(cols=c("GO_1","GO_2","GO_3","GO_4","GO_5","GO_6","GO_7","GO_8","GO_9","GO_10","GO_11"), names_to = "GO_number", values_to = "GO_term") %>%
dplyr::select(-GO_number) %>%
filter(!is.na(Note) & !is.na(GO_term))
## Warning: Expected 11 pieces. Missing pieces filled with `NA` in 179 rows [1, 3,
## 4, 5, 10, 11, 12, 13, 15, 16, 18, 20, 21, 23, 24, 30, 31, 33, 36, 37, ...].
write_delim(DMGs_2kbslop_3loci.genes.GO[,c("GO_term","adj.pval.pop")], path = here::here("analyses/", "DMGs/", "DMGs_2kbslop_3loci.GO.txt"), delim = '\t', col_names = F) #write out df with just GO terms and p-adj values
# How many gene regions are there after filtering for those with 3 methylated loci?
sub_meth_table_2kbslop_3loci %>%
distinct(gene) %>%
nrow()
## [1] 2177
# I think this filtering step doesn't account for the 2kb +/- start and stop
perc_meth_genes_2kbslop_3loci <- sub_meth_table_2kbslop_3loci %>%
group_by(population, sample, contig_gene, start_gene_2kb, end_gene_2kb) %>%
dplyr::summarise(
mean_percMeth = mean(percMeth, na.rm=TRUE),
sd_percMeth=sd(percMeth, na.rm=TRUE),
n())
# check to make sure % methylation is calculated separately for each sample and gene region
perc_meth_genes_2kbslop_3loci %>% filter(contig_gene=="Contig0", start_gene_2kb==10497, end_gene_2kb==95068) %>%
ggplot(aes(x=sample, y=mean_percMeth)) + geom_bar(stat="identity")
# How many unique gene regions?
perc_meth_genes_2kbslop_3loci %>%
ungroup() %>%
dplyr::select(contig_gene, start_gene_2kb, end_gene_2kb) %>%
distinct(contig_gene, start_gene_2kb, end_gene_2kb) %>%
nrow()
## [1] 2177
# Reshape data. Need to have one row per sample, one column with the population, and separate columns with each gene region with % methylation.
perc_meth_genes_2kbslop_3loci_wide <- perc_meth_genes_2kbslop_3loci %>%
ungroup() %>%
tidyr::unite("gene_region", c("contig_gene", "start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
dplyr::select(population, sample, gene_region, mean_percMeth) %>%
spread(gene_region, mean_percMeth) %>%
tibble::column_to_rownames(var = "sample")
head(perc_meth_genes_2kbslop_3loci_wide[1:4]) #confirm correct format
## population Contig0_10497_95068 Contig100188_1_4123 Contig100396_1_4908
## 1 HC 88.86439 97.79135 93.75000
## 2 HC 88.54722 91.18234 93.47985
## 3 HC 89.21439 88.03419 93.99802
## 4 HC 87.40210 90.34792 95.97155
## 5 HC 84.91479 94.96795 87.76224
## 6 HC 84.48810 100.00000 100.00000
ncol(perc_meth_genes_2kbslop_3loci_wide) #1724 gene regions
## [1] 2178
#Now run the following line and it will provide Pst estimates for every gene.
genes_2kbslop_3loci_Pst <- Pst(perc_meth_genes_2kbslop_3loci_wide)
## [1] "Populations sizes are:"
## HC SS
## 9 9
# Check out Pst distribution
hist(genes_2kbslop_3loci_Pst$Pst_Values)
summary(genes_2kbslop_3loci_Pst$Pst_Values)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000003 0.0661707 0.2558255 0.3043360 0.5174845 0.9306490
nrow(genes_2kbslop_3loci_Pst)
## [1] 2177
# format of dataframe that I will save
head(genes_2kbslop_3loci_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE))
## Quant_Varia contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1 Contig0_10497_95068 Contig0 10497 95068 0.01933289
## 2 Contig100188_1_4123 Contig100188 1 4123 0.73496201
## 3 Contig100396_1_4908 Contig100396 1 4908 0.18520587
## 4 Contig100743_7752_12849 Contig100743 7752 12849 0.51317776
## 5 Contig100994_1_8602 Contig100994 1 8602 0.70114357
## 6 Contig10117_48_10916 Contig10117 48 10916 0.05127876
# Write out Pst results
write.table(genes_2kbslop_3loci_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE), file = here::here("analyses", "DMGs", "Pst_gene_2kbslop_3loci.tab"), sep = '\t', na = "NA", row.names = FALSE, col.names = TRUE)
head(genes_2kbslop_3loci_Pst %>%
separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
filter(Pst_Values>0.5))
## Quant_Varia contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1 Contig100188_1_4123 Contig100188 1 4123 0.7349620
## 2 Contig100743_7752_12849 Contig100743 7752 12849 0.5131778
## 3 Contig100994_1_8602 Contig100994 1 8602 0.7011436
## 4 Contig101333_3371_8675 Contig101333 3371 8675 0.7122709
## 5 Contig1024_1_14330 Contig1024 1 14330 0.9200812
## 6 Contig103503_4845_9852 Contig103503 4845 9852 0.8293130
meth_filter_genes_2kbslop %>%
distinct(gene) %>%
nrow()
## [1] 3754
perc_meth_genes_2kbslop <- meth_filter_genes_2kbslop %>%
group_by(population, sample, contig_gene, start_gene_2kb, end_gene_2kb) %>%
dplyr::summarise(
mean_percMeth = mean(percMeth, na.rm=TRUE),
sd_percMeth=sd(percMeth, na.rm=TRUE),
n())
# check a couple loci to make sure % methylation is calculated separately for each sample and gene region
perc_meth_genes_2kbslop %>% filter(contig_gene=="Contig0", start_gene_2kb==10497, end_gene_2kb==95068) %>%
ggplot(aes(x=sample, y=mean_percMeth)) + geom_bar(stat="identity")
perc_meth_genes_2kbslop %>% filter(contig_gene=="Contig1050", start_gene_2kb==4735, end_gene_2kb==27978) %>%
ggplot(aes(x=sample, y=mean_percMeth)) + geom_bar(stat="identity")
# Reshape data. Need to have one row per sample, one column with the population, and separate columns with each gene region with % methylation.
perc_meth_genes_2kbslop_wide <- perc_meth_genes_2kbslop %>%
ungroup() %>%
tidyr::unite("gene_region", c("contig_gene", "start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
dplyr::select(population, sample, gene_region, mean_percMeth) %>%
spread(gene_region, mean_percMeth) %>%
tibble::column_to_rownames(var = "sample")
head(perc_meth_genes_2kbslop_wide[1:4]) #confirm correct format
## population Contig0_10497_95068 Contig100_48322_58076 Contig100188_1_4123
## 1 HC 88.86439 90 97.79135
## 2 HC 88.54722 90 91.18234
## 3 HC 89.21439 80 88.03419
## 4 HC 87.40210 100 90.34792
## 5 HC 84.91479 NaN 94.96795
## 6 HC 84.48810 NaN 100.00000
ncol(perc_meth_genes_2kbslop_wide) # 3754 gene regions
## [1] 3755
#Now run the following line and it will provide Pst estimates for every gene.
genes_2kbslop_Pst <- Pst(perc_meth_genes_2kbslop_wide)
## [1] "Populations sizes are:"
## HC SS
## 9 9
# This object contains Pst values for all genes that contain methylation data.
save(genes_2kbslop_Pst, file = "../analyses/methylation-filtered/R-objects/genes_2kbslop_Pst") #save object to file
# Check out Pst distribution & summary statistics
hist(genes_2kbslop_Pst$Pst_Values)
mean(genes_2kbslop_Pst$Pst_Values)
## [1] 0.2938804
min(genes_2kbslop_Pst$Pst_Values)
## [1] 3.096144e-32
nrow(genes_2kbslop_Pst) #
## [1] 3754
# format of dataframe that I will save
head(genes_2kbslop_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE))
## Quant_Varia contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1 Contig0_10497_95068 Contig0 10497 95068 0.01933289
## 2 Contig100_48322_58076 Contig100 48322 58076 0.06700810
## 3 Contig100188_1_4123 Contig100188 1 4123 0.73496201
## 4 Contig100199_5411_27437 Contig100199 5411 27437 0.06825011
## 5 Contig100396_1_4908 Contig100396 1 4908 0.18520587
## 6 Contig100499_508_4044 Contig100499 508 4044 0.15042326
# Write out Pst results
write.table(genes_2kbslop_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE), file = here::here("analyses", "DMGs", "Pst_gene_2kbslop.tab"), sep = '\t', na = "NA", row.names = FALSE, col.names = TRUE)
head(genes_2kbslop_Pst %>%
separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
filter(Pst_Values>0.5))
## Quant_Varia contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1 Contig100188_1_4123 Contig100188 1 4123 0.7349620
## 2 Contig100743_7752_12849 Contig100743 7752 12849 0.5131778
## 3 Contig100844_1_4149 Contig100844 1 4149 0.5276145
## 4 Contig100994_1_8602 Contig100994 1 8602 0.7011436
## 5 Contig1010_1_34210 Contig1010 1 34210 0.5606926
## 6 Contig101333_3371_8675 Contig101333 3371 8675 0.7122709
# Methylation data within genes
meth_filter_genes
## # A tibble: 336,384 x 21
## chr start end strand sample coverage numCs numTs percMeth population
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Cont… 39226 39226 + 1 21 11 10 52.4 HC
## 2 Cont… 39234 39234 + 1 24 10 14 41.7 HC
## 3 Cont… 64179 64179 + 1 10 9 1 90 HC
## 4 Cont… 71523 71523 + 1 13 13 0 100 HC
## 5 Cont… 71533 71533 + 1 17 17 0 100 HC
## 6 Cont… 71542 71542 + 1 16 15 1 93.8 HC
## 7 Cont… 71546 71546 + 1 14 14 0 100 HC
## 8 Cont… 71558 71558 + 1 17 17 0 100 HC
## 9 Cont… 71563 71563 + 1 16 16 0 100 HC
## 10 Cont… 71617 71617 + 1 19 15 4 78.9 HC
## # … with 336,374 more rows, and 11 more variables: contig_gene <chr>,
## # source_gene <chr>, feature_gene <chr>, start_gene <dbl>, end_gene <dbl>,
## # unknown1_gene <chr>, strand_gene <chr>, unknown2_gene <chr>,
## # notes_gene <chr>, gene <chr>, group <chr>
# Methylation data within genes +/- 2kb
meth_filter_genes_2kbslop
## # A tibble: 379,314 x 21
## chr start end strand sample coverage numCs numTs percMeth population
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 Cont… 39226 39226 + 1 21 11 10 52.4 HC
## 2 Cont… 39234 39234 + 1 24 10 14 41.7 HC
## 3 Cont… 64179 64179 + 1 10 9 1 90 HC
## 4 Cont… 71523 71523 + 1 13 13 0 100 HC
## 5 Cont… 71533 71533 + 1 17 17 0 100 HC
## 6 Cont… 71542 71542 + 1 16 15 1 93.8 HC
## 7 Cont… 71546 71546 + 1 14 14 0 100 HC
## 8 Cont… 71558 71558 + 1 17 17 0 100 HC
## 9 Cont… 71563 71563 + 1 16 16 0 100 HC
## 10 Cont… 71617 71617 + 1 19 15 4 78.9 HC
## # … with 379,304 more rows, and 11 more variables: contig_gene <chr>,
## # source_gene <chr>, feature_gene <chr>, start_gene_2kb <dbl>,
## # end_gene_2kb <dbl>, unknown1_gene <chr>, strand_gene <chr>,
## # unknown2_gene <chr>, notes_gene <chr>, gene <chr>, group <chr>
(distance from start of gene / length of gene) = (Methylated locus - start_gene) / (end_gene - start_gene)
meth_filter_genes <- meth_filter_genes %>%
mutate(gen_location = ((start-start_gene)/(end_gene-start_gene)))
# Histogram showing frequency of methylation data across gene (relative location)
meth_filter_genes %>%
group_by(gene, gen_location, population) %>%
summarise(mean_meth=mean(percMeth)) %>%
ggplot(., aes(x=gen_location)) +
stat_bin(bins=150, col="gray30", fill="lavenderblush3") +
xlab("Relative location along gene") +
ylab("Frequency of methylation data") +
ggtitle("Methylation frequency relative to gene location") +
theme_minimal()
# Percent methylation across gene
meth_filter_genes$gen_location_round <- round(meth_filter_genes$gen_location, digits=2)
meth_filter_genes %>%
ungroup() %>%
group_by(gen_location_round) %>%
summarise(mean_meth=mean(percMeth, na.rm=TRUE), sd_meth=sd(percMeth, na.rm=TRUE)) %>%
ggplot(., aes(x=gen_location_round, y=mean_meth)) +
geom_bar(stat="identity", col="gray30", fill="lavenderblush3") +
# geom_errorbar(aes(ymin=mean_meth-sd_meth, ymax=mean_meth+sd_meth), width=.02,
# position=position_dodge(.9), ) +
scale_y_continuous(limits = c(0, 100)) +
xlab("Relative location along gene") +
ylab("Mean % methylation") +
ggtitle("% methylation by relative gene location") +
theme_minimal()
# Plot methylation variance by gene location
meth_filter_genes %>%
ungroup() %>%
group_by(gen_location_round) %>%
summarise(var_meth=var(percMeth, na.rm=TRUE)) %>%
ggplot(., aes(x=gen_location_round, y=var_meth)) +
geom_bar(stat="identity", col="gray30", fill="lavenderblush3") +
xlab("Relative location along gene") +
ylab("Variance in % methylation") +
ggtitle("Variance in % methylation by relative gene location") +
theme_minimal()
meth_filter_genes_2kbslop <- meth_filter_genes_2kbslop %>%
mutate(gen_location = ((start-start_gene_2kb)/(end_gene_2kb-start_gene_2kb)))
hist(meth_filter_genes_2kbslop$gen_location, breaks = 100, main= "Frequency of methylation data by relative genome location")
# Histogram showing frequency of methylation data across genes with 2kb flanking regions (relative location)
meth_filter_genes_2kbslop %>%
group_by(gene, gen_location, population) %>%
summarise(mean_meth=mean(percMeth)) %>%
ggplot(., aes(x=gen_location)) +
stat_bin(bins=200, col="gray30", fill="lavenderblush3") +
xlab("Relative location along gene") +
ylab("Frequency of methylation data") +
ggtitle("Methylation frequency along genes +/- 2kb flanks") +
theme_minimal()
Plot methylation Pst values (mean) calculated at the loci level across a gene
load("../analyses/methylation-filtered/R-objects/perc.meth.Pst.done")
# split loci location info (contig, start, end) into three columns, merge with methylation by gene object, drop loci w/o Pst values, and plot
perc.meth.Pst.done %>%
separate(Quant_Varia, c("chr", "start", "end"), "\\.") %>%
mutate(start=as.numeric(start), end=as.numeric(end)) %>%
right_join(.,
meth_filter_genes %>%
dplyr::select(chr, start, end, start_gene, end_gene,
gen_location, gen_location_round)) %>%
drop_na(Pst_Values) %>%
#mutate(gen_location_round=as.factor(gen_location_round)) %>%
ungroup() %>%
group_by(gen_location_round) %>%
summarise(mean_Pst=mean(Pst_Values, na.rm=TRUE), sd_Pst=sd(Pst_Values, na.rm=TRUE)) %>%
ggplot(., aes(x=gen_location_round, y=mean_Pst)) +
geom_bar(stat="identity", col="gray30", fill="lavenderblush3") +
# geom_errorbar(aes(ymin=mean_Pst, ymax=mean_Pst+sd_Pst), width=.01,
# position=position_dodge(.9), ) +
xlab("Relative location along gene") +
ylab("Methylation Pst") +
ggtitle("Mean Methylation Pst (comparing population) by relative gene location") +
theme_minimal() #+
## Joining, by = c("chr", "start", "end")
# theme(axis.text.x=element_blank())