02-DMG-analysis

In this notebook I identify differentially methylated genes (DMGs) between two Olympia oyster populations, Hood Canal and South Sound. First I prepare my data to be in the correct format / shape, then test for DMGs using a binomial GLM. The genes are also annotated using a gene feature file and BEDtools, and biological functions associated with GO terms are visualized with REVIGO.

Load libraries

list.of.packages <- c("tidyverse", "reshape2","dplyr", "tidyr", "readr", "stringr", "plotly", "car", "Pstat") #add new libraries here 
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages)

# Load all libraries 
lapply(list.of.packages, FUN = function(X) {
  do.call("require", list(X)) 
})

Load filtered methylation data object with sample info which was created in the notebook “01-methylkit”

load(here::here("analyses", "methylation-filtered", "R-objects", "meth_filter_reshaped")) 
head(meth_filter_reshaped)

##       chr start   end strand sample coverage numCs numTs  percMeth population
## 1 Contig0 39226 39226      +      1       21    11    10  52.38095         HC
## 2 Contig0 39234 39234      +      1       24    10    14  41.66667         HC
## 3 Contig0 64179 64179      +      1       10     9     1  90.00000         HC
## 4 Contig0 71523 71523      +      1       13    13     0 100.00000         HC
## 5 Contig0 71533 71533      +      1       17    17     0 100.00000         HC
## 6 Contig0 71542 71542      +      1       16    15     1  93.75000         HC

Use bedtools `slop` to include 2kb regions on either side of gene regions

I want to identify methylated loci that are within genes but also 2kb upstream and downstream of gene regions. Therefore, before intersecting methylated loci with genes, I need to create a gene region +/- 2kb file. I will do this using slopBed.

First, generate a “genome file”, which defines size of each chromosome. This is so the slop function restricts results to within a contig. I can use the indexed FASTA file that I created for a blast.

Extract column 1 (contig name) and column 2 (# bases in the contig)

head -n 2 "../resources/Olurida_v081.fa.fai"
cut -f 1,2 "../resources/Olurida_v081.fa.fai" > "../resources/Olurida_chrom.sizes"
head -n 2 "../resources/Olurida_chrom.sizes"

## Contig0  116746  9   116746  116747
## Contig1  87411   116765  87411   87412
## Contig0  116746
## Contig1  87411

Run slopBed with gene feature file

head -n 4 "../genome-features/Olurida_v081-20190709.gene.gff"

## ##gff-version 3
## Contig61093  maker   gene    7493    7946    .   +   .   ID=OLUR_00020575;Name=OLUR_00020575;Alias=maker-Contig61093-snap-gene-0.2;Note=Protein of unknown function;
## Contig1111   maker   gene    24968   28696   .   -   .   ID=OLUR_00006628;Name=OLUR_00006628;Alias=maker-Contig1111-snap-gene-0.1;Note=Similar to Spag6: Sperm-associated antigen 6 (Mus musculus OX%3D10090);Dbxref=Gene3D:G3DSA:1.25.10.10,InterPro:IPR000225,InterPro:IPR000357,InterPro:IPR011989,InterPro:IPR016024,Pfam:PF02985,SMART:SM00185,SUPERFAMILY:SSF48371;Ontology_term=GO:0005515;
## Contig214118 maker   gene    201 926 .   +   .   ID=OLUR_00032161;Name=OLUR_00032161;Alias=maker-Contig214118-snap-gene-0.0;Note=Protein of unknown function;Dbxref=Gene3D:G3DSA:3.10.450.10;

slopBed \
-i "../genome-features/Olurida_v081-20190709.gene.gff" \
-g "../resources/Olurida_chrom.sizes" \
-b 2000 \
> "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff"
head -n 3 "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff"

## Contig61093  maker   gene    5493    9946    .   +   .   ID=OLUR_00020575;Name=OLUR_00020575;Alias=maker-Contig61093-snap-gene-0.2;Note=Protein of unknown function;
## Contig1111   maker   gene    22968   28792   .   -   .   ID=OLUR_00006628;Name=OLUR_00006628;Alias=maker-Contig1111-snap-gene-0.1;Note=Similar to Spag6: Sperm-associated antigen 6 (Mus musculus OX%3D10090);Dbxref=Gene3D:G3DSA:1.25.10.10,InterPro:IPR000225,InterPro:IPR000357,InterPro:IPR011989,InterPro:IPR016024,Pfam:PF02985,SMART:SM00185,SUPERFAMILY:SSF48371;Ontology_term=GO:0005515;
## Contig214118 maker   gene    1   1068    .   +   .   ID=OLUR_00032161;Name=OLUR_00032161;Alias=maker-Contig214118-snap-gene-0.0;Note=Protein of unknown function;Dbxref=Gene3D:G3DSA:3.10.450.10;

Use intersectBed to find where loci and genes intersect, allowing loci to be mapped to annotated genes

wb: Print all lines in the second file a: input data file, which was created in previous code chunk b: save annotated gene list

intersectBed \
  -wb \
  -a "../analyses/methylation-filtered/meth_filter_long.tab" \
  -b "../genome-features/Olurida_v081-20190709.gene.2kbslop.gff" \
  > "../analyses/methylation-filtered/meth_filter_gene.2kbslop.tab"

Here is the number of loci associated with gene regions +/- 2kb:

wc -l "../analyses/methylation-filtered/meth_filter_gene.2kbslop.tab"

##   379314 ../analyses/methylation-filtered/meth_filter_gene.2kbslop.tab

Check out format of resulting gene regions

head -n 2  "../analyses/methylation-filtered/meth_filter_gene.2kbslop.tab"

## Contig0  39226   39226   +   1   21  11  10  52.38095238095239   HC  Contig0 maker   gene    10497   95068   .   +   .   ID=OLUR_00000039;Name=OLUR_00000039;Alias=maker-Contig0-snap-gene-0.8;Note=Similar to WDR87: WD repeat-containing protein 87 (Homo sapiens OX%3D9606);Dbxref=Coils:Coil,Gene3D:G3DSA:1.25.10.10,Gene3D:G3DSA:2.130.10.10,InterPro:IPR001680,InterPro:IPR011989,InterPro:IPR015943,InterPro:IPR016024,InterPro:IPR017986,InterPro:IPR036322,MobiDBLite:mobidb-lite,Pfam:PF00400,ProSiteProfiles:PS50082,ProSiteProfiles:PS50294,SMART:SM00320,SUPERFAMILY:SSF48371,SUPERFAMILY:SSF50978;Ontology_term=GO:0005515;
## Contig0  39234   39234   +   1   24  10  14  41.66666666666667   HC  Contig0 maker   gene    10497   95068   .   +   .   ID=OLUR_00000039;Name=OLUR_00000039;Alias=maker-Contig0-snap-gene-0.8;Note=Similar to WDR87: WD repeat-containing protein 87 (Homo sapiens OX%3D9606);Dbxref=Coils:Coil,Gene3D:G3DSA:1.25.10.10,Gene3D:G3DSA:2.130.10.10,InterPro:IPR001680,InterPro:IPR011989,InterPro:IPR015943,InterPro:IPR016024,InterPro:IPR017986,InterPro:IPR036322,MobiDBLite:mobidb-lite,Pfam:PF00400,ProSiteProfiles:PS50082,ProSiteProfiles:PS50294,SMART:SM00320,SUPERFAMILY:SSF48371,SUPERFAMILY:SSF50978;Ontology_term=GO:0005515;

Now run intersectBed to NOT include the 2kb flanks

Return the number of loci associated with gene regions. Not sure if this will be used, but it’s good to save it just in case.

intersectBed \
  -wb \
  -a "../analyses/methylation-filtered/meth_filter_long.tab" \
  -b "../genome-features/Olurida_v081-20190709.gene.gff" \
  > "../analyses/methylation-filtered/meth_filter_gene.tab"
wc -l "../analyses/methylation-filtered/meth_filter_gene.tab"

##   336384 ../analyses/methylation-filtered/meth_filter_gene.tab

Check out resulting files which contain methylated loci that are within gene regions (+/- 2kb), and gene bodies:

head -n 3 "../analyses/methylation-filtered/meth_filter_gene.2kbslop.tab"

## Contig0  39226   39226   +   1   21  11  10  52.38095238095239   HC  Contig0 maker   gene    10497   95068   .   +   .   ID=OLUR_00000039;Name=OLUR_00000039;Alias=maker-Contig0-snap-gene-0.8;Note=Similar to WDR87: WD repeat-containing protein 87 (Homo sapiens OX%3D9606);Dbxref=Coils:Coil,Gene3D:G3DSA:1.25.10.10,Gene3D:G3DSA:2.130.10.10,InterPro:IPR001680,InterPro:IPR011989,InterPro:IPR015943,InterPro:IPR016024,InterPro:IPR017986,InterPro:IPR036322,MobiDBLite:mobidb-lite,Pfam:PF00400,ProSiteProfiles:PS50082,ProSiteProfiles:PS50294,SMART:SM00320,SUPERFAMILY:SSF48371,SUPERFAMILY:SSF50978;Ontology_term=GO:0005515;
## Contig0  39234   39234   +   1   24  10  14  41.66666666666667   HC  Contig0 maker   gene    10497   95068   .   +   .   ID=OLUR_00000039;Name=OLUR_00000039;Alias=maker-Contig0-snap-gene-0.8;Note=Similar to WDR87: WD repeat-containing protein 87 (Homo sapiens OX%3D9606);Dbxref=Coils:Coil,Gene3D:G3DSA:1.25.10.10,Gene3D:G3DSA:2.130.10.10,InterPro:IPR001680,InterPro:IPR011989,InterPro:IPR015943,InterPro:IPR016024,InterPro:IPR017986,InterPro:IPR036322,MobiDBLite:mobidb-lite,Pfam:PF00400,ProSiteProfiles:PS50082,ProSiteProfiles:PS50294,SMART:SM00320,SUPERFAMILY:SSF48371,SUPERFAMILY:SSF50978;Ontology_term=GO:0005515;
## Contig0  64179   64179   +   1   10  9   1   90  HC  Contig0 maker   gene    10497   95068   .   +   .   ID=OLUR_00000039;Name=OLUR_00000039;Alias=maker-Contig0-snap-gene-0.8;Note=Similar to WDR87: WD repeat-containing protein 87 (Homo sapiens OX%3D9606);Dbxref=Coils:Coil,Gene3D:G3DSA:1.25.10.10,Gene3D:G3DSA:2.130.10.10,InterPro:IPR001680,InterPro:IPR011989,InterPro:IPR015943,InterPro:IPR016024,InterPro:IPR017986,InterPro:IPR036322,MobiDBLite:mobidb-lite,Pfam:PF00400,ProSiteProfiles:PS50082,ProSiteProfiles:PS50294,SMART:SM00320,SUPERFAMILY:SSF48371,SUPERFAMILY:SSF50978;Ontology_term=GO:0005515;

head -n 1 "../analyses/methylation-filtered/meth_filter_gene.tab"

## Contig0  39226   39226   +   1   21  11  10  52.38095238095239   HC  Contig0 maker   gene    12497   93068   .   +   .   ID=OLUR_00000039;Name=OLUR_00000039;Alias=maker-Contig0-snap-gene-0.8;Note=Similar to WDR87: WD repeat-containing protein 87 (Homo sapiens OX%3D9606);Dbxref=Coils:Coil,Gene3D:G3DSA:1.25.10.10,Gene3D:G3DSA:2.130.10.10,InterPro:IPR001680,InterPro:IPR011989,InterPro:IPR015943,InterPro:IPR016024,InterPro:IPR017986,InterPro:IPR036322,MobiDBLite:mobidb-lite,Pfam:PF00400,ProSiteProfiles:PS50082,ProSiteProfiles:PS50294,SMART:SM00320,SUPERFAMILY:SSF48371,SUPERFAMILY:SSF50978;Ontology_term=GO:0005515;

For the fun of it, re-run slopBed with gene feature file to create 1) genes+2kb (downstream, 3’), and 2) genes-kb (upstream, 5’)

Genes plus 2kb downstream (3’) (using the -r option to indicate “add”)

slopBed \
-i "../genome-features/Olurida_v081-20190709.gene.gff" \
-g "../resources/Olurida_chrom.sizes" \
-r 2000 \
-l 0 \
> "../genome-features/Olurida_v081-20190709.gene.2kb-down.gff"
head -n 3 "../genome-features/Olurida_v081-20190709.gene.2kb-down.gff"

## Contig61093  maker   gene    7493    9946    .   +   .   ID=OLUR_00020575;Name=OLUR_00020575;Alias=maker-Contig61093-snap-gene-0.2;Note=Protein of unknown function;
## Contig1111   maker   gene    24968   28792   .   -   .   ID=OLUR_00006628;Name=OLUR_00006628;Alias=maker-Contig1111-snap-gene-0.1;Note=Similar to Spag6: Sperm-associated antigen 6 (Mus musculus OX%3D10090);Dbxref=Gene3D:G3DSA:1.25.10.10,InterPro:IPR000225,InterPro:IPR000357,InterPro:IPR011989,InterPro:IPR016024,Pfam:PF02985,SMART:SM00185,SUPERFAMILY:SSF48371;Ontology_term=GO:0005515;
## Contig214118 maker   gene    201 1068    .   +   .   ID=OLUR_00032161;Name=OLUR_00032161;Alias=maker-Contig214118-snap-gene-0.0;Note=Protein of unknown function;Dbxref=Gene3D:G3DSA:3.10.450.10;

Genes plus 2kb upstream (5’) (using the -l option to indicate “subract”)

slopBed \
-i "../genome-features/Olurida_v081-20190709.gene.gff" \
-g "../resources/Olurida_chrom.sizes" \
-l 2000 \
-r 0 \
> "../genome-features/Olurida_v081-20190709.gene.2kb-up.gff"
head -n 3 "../genome-features/Olurida_v081-20190709.gene.2kb-up.gff"

## Contig61093  maker   gene    5493    7946    .   +   .   ID=OLUR_00020575;Name=OLUR_00020575;Alias=maker-Contig61093-snap-gene-0.2;Note=Protein of unknown function;
## Contig1111   maker   gene    22968   28696   .   -   .   ID=OLUR_00006628;Name=OLUR_00006628;Alias=maker-Contig1111-snap-gene-0.1;Note=Similar to Spag6: Sperm-associated antigen 6 (Mus musculus OX%3D10090);Dbxref=Gene3D:G3DSA:1.25.10.10,InterPro:IPR000225,InterPro:IPR000357,InterPro:IPR011989,InterPro:IPR016024,Pfam:PF02985,SMART:SM00185,SUPERFAMILY:SSF48371;Ontology_term=GO:0005515;
## Contig214118 maker   gene    1   926 .   +   .   ID=OLUR_00032161;Name=OLUR_00032161;Alias=maker-Contig214118-snap-gene-0.0;Note=Protein of unknown function;Dbxref=Gene3D:G3DSA:3.10.450.10;

Binomial GLMs to test for differentially methylated genes regions.

Add columns for organization / filtering:

– gene = contig number + start/end locus
– group = sample number + gene
– population = HC for Hood Canal, or SS for South Sound

meth_filter_genes_2kbslop <- 
  read_delim(file = here::here("analyses", "methylation-filtered", "meth_filter_gene.2kbslop.tab"), delim = "\t", col_names = c(colnames(meth_filter_reshaped[-10]), "population", "contig_gene", "source_gene", "feature_gene", "start_gene_2kb", "end_gene_2kb", "unknown1_gene", "strand_gene", "unknown2_gene", "notes_gene")) %>%
  mutate(gene=paste(contig_gene, start_gene_2kb, end_gene_2kb, sep="_")) %>%
  mutate(group=paste(sample, gene, sep="_"))

## Parsed with column specification:
## cols(
##   chr = col_character(),
##   start = col_double(),
##   end = col_double(),
##   strand = col_character(),
##   sample = col_double(),
##   coverage = col_double(),
##   numCs = col_double(),
##   numTs = col_double(),
##   percMeth = col_double(),
##   population = col_character(),
##   contig_gene = col_character(),
##   source_gene = col_character(),
##   feature_gene = col_character(),
##   start_gene_2kb = col_double(),
##   end_gene_2kb = col_double(),
##   unknown1_gene = col_character(),
##   strand_gene = col_character(),
##   unknown2_gene = col_character(),
##   notes_gene = col_character()
## )

Here is the number of genes that are represented in our methylation data set:

length(unique(meth_filter_genes_2kbslop$gene))

## [1] 3754

Filter for genes with at minimum 5 methylated loci

min.filt_2kbslop <- dplyr::count(meth_filter_genes_2kbslop, vars = c(group))
newdata <- min.filt_2kbslop[which(min.filt_2kbslop$n > 4), ]
sub_meth_table_2kbslop <- meth_filter_genes_2kbslop[meth_filter_genes_2kbslop$group %in% newdata$vars,]
save(sub_meth_table_2kbslop, file="../analyses/DMGs/R-objects/sub_meth_table_2kbslop")

head(sub_meth_table_2kbslop)

## # A tibble: 6 x 21
##   chr   start   end strand sample coverage numCs numTs percMeth population
##   <chr> <dbl> <dbl> <chr>   <dbl>    <dbl> <dbl> <dbl>    <dbl> <chr>     
## 1 Cont… 39226 39226 +           1       21    11    10     52.4 HC        
## 2 Cont… 39234 39234 +           1       24    10    14     41.7 HC        
## 3 Cont… 64179 64179 +           1       10     9     1     90   HC        
## 4 Cont… 71523 71523 +           1       13    13     0    100   HC        
## 5 Cont… 71533 71533 +           1       17    17     0    100   HC        
## 6 Cont… 71542 71542 +           1       16    15     1     93.8 HC        
## # … with 11 more variables: contig_gene <chr>, source_gene <chr>,
## #   feature_gene <chr>, start_gene_2kb <dbl>, end_gene_2kb <dbl>,
## #   unknown1_gene <chr>, strand_gene <chr>, unknown2_gene <chr>,
## #   notes_gene <chr>, gene <chr>, group <chr>

Here is the number of genes that remain after filtering for those with 5 or more methylated loci within each gene region:

length(unique(sub_meth_table_2kbslop$gene))

## [1] 1393

Run GLM to test for differences among population for each gene individually

Note: this script was created by Hollie Putnam (GM.Rmd); there are minor revisions below. I retained some commented out lines (notably-testing for position w/n gene, such as intron & exon) in case we want to include those as factors in the future.

# create data frame to stored results
results_2kbslop <- data.frame()

gs <- unique(sub_meth_table_2kbslop$gene)

#first subset the unique dataframes and second run the GLMs
for(i in 1:length(gs)){
  
  #subset the dataframe gene by gene
  sub_meth_table_2kbslop1 <- subset(sub_meth_table_2kbslop, gene ==gs[i])
  
  # fit glm position model
  fit <- glm(matrix(c(numCs, numTs), ncol=2) ~ as.factor(population) + (1|sample), 
             data=sub_meth_table_2kbslop1, family=binomial)
  a <- anova(fit, test="Chisq")
  
  # capture summary stats to data frame
  df <- data.frame(sub_meth_table_2kbslop1[c("population", "sample", "contig_gene", "start_gene_2kb", "end_gene_2kb", "gene", "chr", "start", "sample", "coverage", "numCs", "numTs", "percMeth", "notes_gene")],
                   pval.treatment = a$`Pr(>Chi)`[2],
                   #pval.position = a$`Pr(>Chi)`[3], #uncomment if you want to include position of CpG within a gene
                   #pval.treatment_x_position = a$`Pr(>Chi)`[4], #uncomment if you want to include position of CpG within a gene interaction with treatment
                   stringsAsFactors = F)
  
  # bind rows of temporary data frame to the results data frame
  results_2kbslop <- rbind(results_2kbslop, df)
  
}
results_2kbslop[is.na(results_2kbslop)] <- 0
results_2kbslop$adj.pval.pop <- p.adjust(results_2kbslop$pval.treatment, method='BH')
#results_2kbslop$adj.pval.position <- p.adjust(results_2kbslop$pval.position, method='BH') #uncomment if you want to include position of CpG within a gene
#results_2kbslop$adj.pval.treatment_x_position <- p.adjust(results_2kbslop$pval.treatment_x_position, method='BH') #uncomment if you want to include position of CpG within a gene interaction with treatment

Extract only genes that were differentially methylated (p-adj < 0.05),

#save df with differentially methylated genes 
DMGs_2kbslop <- subset(results_2kbslop, adj.pval.pop < 0.05) #%>%
#  mutate(contig_gene_start=paste(contig_gene, start_gene_2kb, sep="_"))
save(DMGs_2kbslop, file="../analyses/DMGs/R-objects/DMGs_2kbslop")


DMGs_2kbslop.genes <- DMGs_2kbslop[!duplicated(DMGs_2kbslop$gene), c("contig_gene", "start_gene_2kb", "end_gene_2kb", "notes_gene", "pval.treatment", "adj.pval.pop")]

# Save .bed file for gene enrichment and visualization 
write_delim(DMGs_2kbslop.genes[,c("contig_gene", "start_gene_2kb", "end_gene_2kb")], "../analyses/DMGs/DMGs_2kbslop.bed",  delim = '\t', col_names = FALSE)

# Save .bed file for all genes included in DMG analysis 
write_delim(results_2kbslop[!duplicated(results_2kbslop$gene), c("contig_gene", "start_gene_2kb", "end_gene_2kb")], "../analyses/DMGs/AllGenes-DMGs_2kbslop.bed",  delim = '\t', col_names = FALSE)

Here is the number of differentially methylated genes (5 loci per gene):

length(unique(DMGs_2kbslop$gene))

## [1] 279

Extract GO terms for DMGs_2kbslop and save to file

# split gene data in "notes_gene" column into separate columns 
DMGs_2kbslop.genes <- DMGs_2kbslop.genes %>%
  mutate(ID=str_extract(notes_gene, "ID=(.*?);"),
       Parent=str_extract(notes_gene, "Parent=(.*?);"),
       Name=str_extract(notes_gene, "Name=(.*?);"),
       Alias=str_extract(notes_gene, "Alias=(.*?);"),
       AED=str_extract(notes_gene, "AED=(.*?);"),
       eAED=str_extract(notes_gene, "eAED=(.*?);"),
       Note=str_extract(notes_gene, "Note=(.*?);"),
       Ontology_term=str_extract(notes_gene, "Ontology_term=(.*?);"),
       Dbxref=str_extract(notes_gene, "Dbxref=(.*?);")
       )
write_csv(DMGs_2kbslop.genes, path = here::here("analyses", "DMGs", "DMGs_2kbslop.genes.csv"))

#Extract GO terms 
DMGs_2kbslop.genes.GO <- DMGs_2kbslop.genes %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern="Ontology_term=",replacement = "")) %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern=";",replacement = "")) %>%
  separate(Ontology_term, sep=",", into=paste("GO", 1:11, sep="_")) %>%
  pivot_longer(cols=c("GO_1","GO_2","GO_3","GO_4","GO_5","GO_6","GO_7","GO_8","GO_9","GO_10","GO_11"), names_to = "GO_number", values_to = "GO_term") %>%
  dplyr::select(-GO_number) %>%
  filter(!is.na(Note) & !is.na(GO_term))

## Warning: Expected 11 pieces. Missing pieces filled with `NA` in 127 rows [1, 3,
## 4, 5, 9, 10, 11, 12, 16, 17, 19, 20, 24, 26, 27, 28, 29, 30, 31, 32, ...].

write_delim(DMGs_2kbslop.genes.GO[,c("GO_term","adj.pval.pop")], path = here::here("analyses/", "DMGs/", "DMGs_2kbslop.GO.txt"), delim = '\t', col_names = F) #write out df with just GO terms and p-adj values

The file “../analyses/DMGs/DMGs_2kbslop.GO.txt” was opened outside of RMarkdown and the GO terms and p-values were pasted into REVIGO. I then exported the R script to generate the Biological Processes scatterplot. Here is the REVIGO table:

term_ID	description	frequency	plot_X	plot_Y	plot_size	log10 p-value	uniqueness	dispensability	representative	eliminated
GO:0006260	DNA replication	1.58%	5.949	2.822	5.306	-6.1605	0.719	0	6260	0
GO:0007155	cell adhesion	0.54%	-6.544	1.248	4.844	-6.4135	0.914	0	7155	0
GO:0007156	homophilic cell adhesion via plasma membrane adhesion molecules	0.10%	null	null	4.087	-6.4135	0.914	0.85	7155	1
GO:0023052	signaling	6.77%	-3.032	-0.496	5.939	-7.3332	0.956	0	23052	0
GO:0070588	calcium ion transmembrane transport	0.16%	-3.153	-4.713	4.305	-2.1079	0.868	0	70588	0
GO:0009058	biosynthetic process	31.61%	-1.225	-0.762	6.608	-2.4968	0.943	0.026	9058	0
GO:0006334	nucleosome assembly	0.09%	-4.686	5.561	4.057	-3.6686	0.881	0.029	6334	0
GO:0007264	small GTPase mediated signal transduction	0.49%	1.48	6.771	4.794	-2.1354	0.743	0.034	7264	0
GO:0006457	protein folding	0.90%	-4.69	1.661	5.064	-1.7376	0.912	0.037	6457	0
GO:0007154	cell communication	7.22%	-6.016	3.233	5.967	-4.0935	0.904	0.048	7154	0
GO:0005975	carbohydrate metabolic process	5.26%	-3.347	3.585	5.829	-3.2718	0.888	0.081	5975	0
GO:0006030	chitin metabolic process	0.08%	2.981	-3.369	3.994	-3.9452	0.855	0.103	6030	0
GO:0032958	inositol phosphate biosynthetic process	0.01%	4.623	-5.39	2.943	-6.0631	0.827	0.121	32958	0
GO:0030154	cell differentiation	1.13%	1.041	-6.177	5.162	-5.21	0.858	0.133	30154	0
GO:0016567	protein ubiquitination	0.52%	7.405	0.126	4.827	-2.6487	0.717	0.15	16567	0
GO:0016579	protein deubiquitination	0.20%	null	null	4.398	-1.8804	0.726	0.829	16567	1
GO:0006006	glucose metabolic process	0.43%	2.958	-6.194	4.741	-3.4042	0.835	0.213	6006	0
GO:0055114	oxidation-reduction process	15.06%	1.16	-1.152	6.286	-6.1605	0.859	0.214	55114	0
GO:0019427	acetyl-CoA biosynthetic process from acetate	0.03%	4.516	-4.71	3.604	-2.0321	0.81	0.237	19427	0
GO:0006886	intracellular protein transport	1.20%	-4.141	-3.625	5.187	-1.5764	0.862	0.263	6886	0
GO:0009408	response to heat	0.17%	0.479	7.378	4.328	-1.7376	0.834	0.315	9408	0
GO:0006396	RNA processing	3.21%	6.772	2.794	5.615	-2.1891	0.746	0.319	6396	0
GO:0006511	ubiquitin-dependent protein catabolic process	0.58%	7.678	-0.832	4.874	-1.8804	0.781	0.321	6511	0
GO:0006541	glutamine metabolic process	0.47%	3.071	-4.558	4.782	-1.9271	0.785	0.33	6541	0
GO:0001522	pseudouridine synthesis	0.35%	6.93	1.68	4.652	-2.1891	0.709	0.384	1522	0
GO:0000154	rRNA modification	0.35%	null	null	4.658	-1.3136	0.678	0.701	1522	1
GO:0006811	ion transport	5.34%	-3.91	-4.255	5.836	-2.1079	0.866	0.389	6811	0
GO:0046600	negative regulation of centriole replication	0.00%	-2.59	5.971	2.477	-2.1569	0.824	0.407	46600	0
GO:0006486	protein glycosylation	0.32%	5.491	-1.021	4.61	-2.0586	0.675	0.429	6486	0
GO:0051103	DNA ligation involved in DNA repair	0.04%	3.562	5.637	3.703	-2.8009	0.703	0.454	51103	0
GO:0035023	regulation of Rho protein signal transduction	0.13%	1.394	7.1	4.206	-1.5602	0.764	0.515	35023	0
GO:0055085	transmembrane transport	8.92%	-4.515	-3.723	6.058	-1.5468	0.863	0.535	55085	0
GO:0006355	regulation of transcription, DNA-templated	9.92%	4.282	3.292	6.105	-1.7475	0.633	0.537	6355	0
GO:0006468	protein phosphorylation	4.14%	6.56	-0.302	5.725	-2.3389	0.67	0.56	6468	0
GO:0034220	ion transmembrane transport	3.53%	-3.536	-4.332	5.656	-1.6655	0.846	0.566	34220	0
GO:0006470	protein dephosphorylation	0.59%	6.774	-1.091	4.875	-1.5439	0.715	0.568	6470	0
GO:0009451	RNA modification	1.78%	6.493	1.45	5.358	-2.1891	0.686	0.581	9451	0
GO:0015914	phospholipid transport	0.08%	-4.251	-3.327	3.987	-1.3062	0.821	0.584	15914	0
GO:0035556	intracellular signal transduction	4.00%	1.211	6.286	5.71	-1.705	0.704	0.593	35556	0
GO:0007186	G-protein coupled receptor signaling pathway	0.88%	0.818	6.368	5.054	-1.6167	0.736	0.638	7186	0
GO:0007166	cell surface receptor signaling pathway	0.92%	0.957	6.609	5.072	-1.4923	0.735	0.641	7166	0
GO:0006281	DNA repair	2.23%	3.916	4.695	5.457	-2.8009	0.608	0.684	6281	0
GO:0006310	DNA recombination	1.64%	6.285	3.557	5.323	-2.8009	0.732	0.688	6310	0

Here is a screen shot of the REVIGO plot showing GO terms associated with DMGs_2kbslop (5 methylated loci per gene)

DMG_REVIGO_biological-functions.png

Here is a script to re-create that REVIGO plot

# A plotting R script produced by the REVIGO server at http://revigo.irb.hr/
# If you found REVIGO useful in your work, please cite the following reference:
# Supek F et al. "REVIGO summarizes and visualizes long lists of Gene Ontology
# terms" PLoS ONE 2011. doi:10.1371/journal.pone.0021800


# --------------------------------------------------------------------------
# If you don't have the ggplot2 package installed, uncomment the following line:
# install.packages( "ggplot2" );
library( ggplot2 );
# --------------------------------------------------------------------------
# If you don't have the scales package installed, uncomment the following line:
# install.packages( "scales" );
library( scales );


# --------------------------------------------------------------------------
# Here is your data from REVIGO. Scroll down for plot configuration options.

revigo.names <- c("term_ID","description","frequency_%","plot_X","plot_Y","plot_size","log10_p_value","uniqueness","dispensability");
revigo.data <- rbind(c("GO:0006260","DNA replication", 1.577, 5.949, 2.822, 5.306,-6.1605,0.719,0.000),
c("GO:0007155","cell adhesion", 0.544,-6.544, 1.248, 4.844,-6.4135,0.914,0.000),
c("GO:0023052","signaling", 6.765,-3.032,-0.496, 5.939,-7.3332,0.956,0.000),
c("GO:0070588","calcium ion transmembrane transport", 0.157,-3.153,-4.713, 4.305,-2.1079,0.868,0.000),
c("GO:0009058","biosynthetic process",31.611,-1.225,-0.762, 6.608,-2.4968,0.943,0.026),
c("GO:0006334","nucleosome assembly", 0.089,-4.686, 5.561, 4.057,-3.6686,0.881,0.029),
c("GO:0007264","small GTPase mediated signal transduction", 0.485, 1.480, 6.771, 4.794,-2.1354,0.743,0.034),
c("GO:0006457","protein folding", 0.903,-4.690, 1.661, 5.064,-1.7376,0.912,0.037),
c("GO:0007154","cell communication", 7.219,-6.016, 3.233, 5.967,-4.0935,0.904,0.048),
c("GO:0005975","carbohydrate metabolic process", 5.260,-3.347, 3.585, 5.829,-3.2718,0.888,0.081),
c("GO:0006030","chitin metabolic process", 0.077, 2.981,-3.369, 3.994,-3.9452,0.855,0.103),
c("GO:0032958","inositol phosphate biosynthetic process", 0.007, 4.623,-5.390, 2.943,-6.0631,0.827,0.121),
c("GO:0030154","cell differentiation", 1.133, 1.041,-6.177, 5.162,-5.2100,0.858,0.133),
c("GO:0016567","protein ubiquitination", 0.523, 7.405, 0.126, 4.827,-2.6487,0.717,0.150),
c("GO:0006006","glucose metabolic process", 0.430, 2.958,-6.194, 4.741,-3.4042,0.835,0.213),
c("GO:0055114","oxidation-reduction process",15.060, 1.160,-1.152, 6.286,-6.1605,0.859,0.214),
c("GO:0019427","acetyl-CoA biosynthetic process from acetate", 0.031, 4.516,-4.710, 3.604,-2.0321,0.810,0.237),
c("GO:0006886","intracellular protein transport", 1.199,-4.141,-3.625, 5.187,-1.5764,0.862,0.263),
c("GO:0009408","response to heat", 0.166, 0.479, 7.378, 4.328,-1.7376,0.834,0.315),
c("GO:0006396","RNA processing", 3.210, 6.772, 2.794, 5.615,-2.1891,0.746,0.319),
c("GO:0006511","ubiquitin-dependent protein catabolic process", 0.584, 7.678,-0.832, 4.874,-1.8804,0.781,0.321),
c("GO:0006541","glutamine metabolic process", 0.472, 3.071,-4.558, 4.782,-1.9271,0.785,0.330),
c("GO:0001522","pseudouridine synthesis", 0.350, 6.930, 1.680, 4.652,-2.1891,0.709,0.384),
c("GO:0006811","ion transport", 5.344,-3.910,-4.255, 5.836,-2.1079,0.866,0.389),
c("GO:0046600","negative regulation of centriole replication", 0.002,-2.590, 5.971, 2.477,-2.1569,0.824,0.407),
c("GO:0006486","protein glycosylation", 0.317, 5.491,-1.021, 4.610,-2.0586,0.675,0.429),
c("GO:0051103","DNA ligation involved in DNA repair", 0.039, 3.562, 5.637, 3.703,-2.8009,0.703,0.454),
c("GO:0035023","regulation of Rho protein signal transduction", 0.125, 1.394, 7.100, 4.206,-1.5602,0.764,0.515),
c("GO:0055085","transmembrane transport", 8.916,-4.515,-3.723, 6.058,-1.5468,0.863,0.535),
c("GO:0006355","regulation of transcription, DNA-templated", 9.917, 4.282, 3.292, 6.105,-1.7475,0.633,0.537),
c("GO:0006468","protein phosphorylation", 4.137, 6.560,-0.302, 5.725,-2.3389,0.670,0.560),
c("GO:0034220","ion transmembrane transport", 3.528,-3.536,-4.332, 5.656,-1.6655,0.846,0.566),
c("GO:0006470","protein dephosphorylation", 0.585, 6.774,-1.091, 4.875,-1.5439,0.715,0.568),
c("GO:0009451","RNA modification", 1.778, 6.493, 1.450, 5.358,-2.1891,0.686,0.581),
c("GO:0015914","phospholipid transport", 0.076,-4.251,-3.327, 3.987,-1.3062,0.821,0.584),
c("GO:0035556","intracellular signal transduction", 4.000, 1.211, 6.286, 5.710,-1.7050,0.704,0.593),
c("GO:0007186","G-protein coupled receptor signaling pathway", 0.882, 0.818, 6.368, 5.054,-1.6167,0.736,0.638),
c("GO:0007166","cell surface receptor signaling pathway", 0.920, 0.957, 6.609, 5.072,-1.4923,0.735,0.641),
c("GO:0006281","DNA repair", 2.234, 3.916, 4.695, 5.457,-2.8009,0.608,0.684),
c("GO:0006310","DNA recombination", 1.641, 6.285, 3.557, 5.323,-2.8009,0.732,0.688));

one.data <- data.frame(revigo.data);
names(one.data) <- revigo.names;
one.data <- one.data [(one.data$plot_X != "null" & one.data$plot_Y != "null"), ];
one.data$plot_X <- as.numeric( as.character(one.data$plot_X) );
one.data$plot_Y <- as.numeric( as.character(one.data$plot_Y) );
one.data$plot_size <- as.numeric( as.character(one.data$plot_size) );
one.data$log10_p_value <- as.numeric( as.character(one.data$log10_p_value) );
one.data$frequency <- as.numeric( as.character(one.data$frequency) );
one.data$uniqueness <- as.numeric( as.character(one.data$uniqueness) );
one.data$dispensability <- as.numeric( as.character(one.data$dispensability) );
#head(one.data);


# --------------------------------------------------------------------------
# Names of the axes, sizes of the numbers and letters, names of the columns,
# etc. can be changed below

p1 <- ggplot( data = one.data );
p1 <- p1 + geom_point( aes( plot_X, plot_Y, colour = log10_p_value, size = plot_size), alpha = I(0.6) ) + scale_size_area();
p1 <- p1 + scale_colour_gradientn( colours = c("blue", "green", "yellow", "red"), limits = c( min(one.data$log10_p_value), 0) );
p1 <- p1 + geom_point( aes(plot_X, plot_Y, size = plot_size), shape = 21, fill = "transparent", colour = I (alpha ("black", 0.6) )) + scale_size_area();
p1 <- p1 + scale_size( range=c(5, 30)) + theme_bw(); # + scale_fill_gradientn(colours = heat_hcl(7), limits = c(-300, 0) );
#ex <- one.data [ one.data$dispensability < 0.15, ]; 
p1 <- p1 + geom_text( data = one.data, aes(plot_X, plot_Y, label = description), colour = I(alpha("black", 0.85)), size = 3 );
p1 <- p1 + labs (y = "semantic space x", x = "semantic space y");
p1 <- p1 + theme(legend.key = element_blank()) ;
one.x_range = max(one.data$plot_X) - min(one.data$plot_X);
one.y_range = max(one.data$plot_Y) - min(one.data$plot_Y);
p1 <- p1 + xlim(min(one.data$plot_X)-one.x_range/10,max(one.data$plot_X)+one.x_range/10);
p1 <- p1 + ylim(min(one.data$plot_Y)-one.y_range/10,max(one.data$plot_Y)+one.y_range/10);

# --------------------------------------------------------------------------
# Output the plot to screen

p1;

IGV

Create bed files to visualze as a track of DMGs_2kbslop in IGV

DMGs_2kbslop.bed <- meth_filter_genes_2kbslop %>%
  filter(contig_gene %in% DMGs_2kbslop.genes$contig_gene & 
           start_gene_2kb %in% DMGs_2kbslop.genes$start_gene & 
           end_gene_2kb %in% DMGs_2kbslop.genes$end_gene) %>%
  dplyr::select(contig_gene, start_gene_2kb, end_gene_2kb, 
                unknown1_gene, strand_gene, unknown2_gene, notes_gene) %>%
  distinct(contig_gene, start_gene_2kb, end_gene_2kb)
#DMGs_2kbslop.bed <- DMGs_2kbslop.bed[!duplicated(test$contig_gene),]
readr::write_delim(DMGs_2kbslop.bed, "../analyses/DMGs/DMGs_2kbslop.bed",  delim = '\t', col_names = FALSE)

Barplots showing % methylation of all methylated genes by population

NOTE: this code is commented out as it’s not needed and prints too many bar plots

# ggplotly(meth_filter_genes_2kbslop %>%
#   filter(contig_gene %in% DMGs_2kbslop.genes$contig_gene & 
#            start_gene %in% DMGs_2kbslop.genes$start_gene & 
#            end_gene %in% DMGs_2kbslop.genes$end_gene) %>% 
#   group_by(population, gene) %>%
#   summarise(allCs_percent = 100*(sum(numCs)/sum(coverage)), 
#             mean_percentMeth = mean(percMeth)) %>%
#   ggplot(aes(x = population, y = mean_percentMeth, fill = population)) + geom_bar(stat="identity") + facet_wrap(~gene) + theme_light() + scale_fill_manual(values=c("firebrick3","dodgerblue3")))
# 
# #checking to make sure numCs + numTs = coverage; should be 1:1 line 
# plot(meth_filter_genes_2kbslop$numCs + meth_filter_genes_2kbslop$numTs ~ meth_filter_genes_2kbslop$coverage)  
# 
# ## Look at coverage for each DMG by population (mean % methylation across samples)
# ggplotly(meth_filter_genes_2kbslop %>%
#   filter(contig_gene %in% DMGs_2kbslop.genes$contig_gene & 
#            start_gene %in% DMGs_2kbslop.genes$start_gene & 
#            end_gene %in% DMGs_2kbslop.genes$end_gene) %>% 
#   group_by(population, gene) %>%
#   summarise(sum_cov = sum(coverage), 
#             mean_cov = mean(coverage)) %>%
#   ggplot(aes(x = population, y = mean_cov, fill = population)) + 
#     geom_bar(stat="identity") + 
#     facet_wrap(~gene) + theme_light() + scale_fill_manual(values=c("firebrick3","dodgerblue3")))
# 
# DMG_counts <- meth_filter_genes_2kbslop %>%
#   filter(contig_gene %in% DMGs_2kbslop.genes$contig_gene & 
#            start_gene %in% DMGs_2kbslop.genes$start_gene & 
#            end_gene %in% DMGs_2kbslop.genes$end_gene)
# 
# # Look at coverage for each DMG locus, by population 
# # mean % methylation across samples  
# DMG_genes_unique <- unique(DMG_counts$gene)
# for (i in 1:length(DMG_genes_unique)) {
#   temp <-  DMG_counts %>% 
#   filter(chr == "Contig22489") %>%
#   group_by(population, chr, start) %>%
#   summarise(allCs_percent = 100*(sum(numCs)/sum(coverage)), 
#             mean_percentMeth = mean(percMeth))
#     print(ggplotly(ggplot(temp, aes(x = population, y = mean_percentMeth, fill = population)) + 
#     geom_bar(stat="identity") + 
#     facet_wrap(~start) + 
#     theme_light() + ggtitle(paste("gene = ", "Contig22489", sep="")) +
#     scale_fill_manual(values=c("firebrick3","dodgerblue3"))))
#   }

Identify DMGs_2kbslop that contain DMLs

intersectBed \
  -wb \
  -a "../analyses/DMGs/DMGs_2kbslop.bed" \
  -b "../analyses/DMLs/dml25.bed" \
  > "../analyses/DMGs/DMGs_2kbslop-with-DMLs.tab"

Here is the number of DML loci associated with DM gene regions:

wc -l "../analyses/DMGs/DMGs_2kbslop-with-DMLs.tab"

##       96 ../analyses/DMGs/DMGs_2kbslop-with-DMLs.tab

dml25 <- read_delim(file = here::here("analyses", "DMLs", "dml25.bed"), delim = "\t", col_names = TRUE)

## Parsed with column specification:
## cols(
##   Contig100994 = col_character(),
##   `3427` = col_double(),
##   `3429` = col_double(),
##   `-31.801994301994306` = col_double()
## )

DMLs.in.DMGs_2kbslop <- 
  read_delim(file = here::here("analyses", "DMGs", "DMGs_2kbslop-with-DMLs.tab"), delim = "\t", col_names = c(colnames(DMGs_2kbslop.bed), colnames(dml25))) #%>%

## Parsed with column specification:
## cols(
##   contig_gene = col_character(),
##   start_gene_2kb = col_double(),
##   end_gene_2kb = col_double(),
##   Contig100994 = col_character(),
##   `3427` = col_double(),
##   `3429` = col_double(),
##   `-31.801994301994306` = col_double()
## )

  #mutate(gene=paste(contig_gene, start_gene, sep="_")) 
write.csv(DMLs.in.DMGs_2kbslop, file = "../analyses/DMGs/DMLS.in.DMGs_2kbslop.csv")
save(DMLs.in.DMGs_2kbslop, file="../analyses/DMGs/R-objects/DMLs.in.DMGs_2kbslop")

# Barplots of all DMLs also located in DMGs_2kbslop 

DMLs.in.DMGs_2kbslop.calcs <- meth_filter_reshaped %>% 
  filter(chr %in% DMLs.in.DMGs_2kbslop$contig_gene & 
           start %in% DMLs.in.DMGs_2kbslop$start_gene+1 & 
           end %in% DMLs.in.DMGs_2kbslop$end_gene-1) %>% 
  group_by(population, chr, start) %>% 
  dplyr::summarise(
    mean_percMeth = mean(percMeth, na.rm=TRUE),
    sd_percMeth=sd(percMeth, na.rm=TRUE),
    n())

## Warning: Unknown or uninitialised column: `start_gene`.

## Warning: Unknown or uninitialised column: `end_gene`.

DMLs.in.DMGs_2kbslop.calcs %>% ungroup() %>% dplyr::select(chr, start) %>% distinct()

## # A tibble: 1,399 x 2
##    chr          start
##    <chr>        <int>
##  1 Contig103503  7021
##  2 Contig103503  7028
##  3 Contig103503  7054
##  4 Contig103503  7067
##  5 Contig103503  7072
##  6 Contig1245    9214
##  7 Contig1245    9223
##  8 Contig1245    9229
##  9 Contig1245    9246
## 10 Contig1245    9269
## # … with 1,389 more rows

#Plots don't work when knitted
# DMLs_in_DMGs_2kbslop_plots <- list()
# for (i in 1:nrow(DMLs.in.DMGs_2kbslop)) {
#   DMLs_in_DMGs_2kbslop_plots[[i]] <-
#     DMLs.in.DMGs_2kbslop.calcs %>%
#     filter(chr==DMLs.in.DMGs_2kbslop$contig_gene[i] &
#              start==DMLs.in.DMGs_2kbslop$start_gene[i]+1) %>%
#     ggplot(aes(x = population, y = mean_percMeth, fill = population,
#                          label=paste0(round(mean_percMeth, digits = 2), "%"))) +
#       geom_bar(stat="identity", width = 0.5) + ylim(0,110) +
#       geom_pointrange(aes(ymin=mean_percMeth,
#                         ymax=mean_percMeth+sd_percMeth)) +
#       geom_text(size=3, vjust=-0.5, hjust=1.25) +
#       theme_light() + ggtitle(paste("Contig = ", DMLs.in.DMGs_2kbslop[i, "contig_gene"], ", Locus = ",
#                                    DMLs.in.DMGs_2kbslop[i+1, "start_gene"], sep="")) +
#     scale_fill_manual(values=c("firebrick3","dodgerblue3"))
# }
# DMLs_in_DMGs_2kbslop_plots[1:6]

Calculate Pst for each gene

First, calculate Pst of average % methylation within genes and +/- 2kb. Only look genes that have at minimum 5 methylated loci (this was done in a previous chunk). I need the % methylation data for each of the gene regions, so I’ll first create that dataframe.

# How many gene regions are there after filtering for those with 5 methylated loci? 
sub_meth_table_2kbslop %>% 
       distinct(gene) %>%
  nrow()

## [1] 1393

head(sub_meth_table_2kbslop)

## # A tibble: 6 x 21
##   chr   start   end strand sample coverage numCs numTs percMeth population
##   <chr> <dbl> <dbl> <chr>   <dbl>    <dbl> <dbl> <dbl>    <dbl> <chr>     
## 1 Cont… 39226 39226 +           1       21    11    10     52.4 HC        
## 2 Cont… 39234 39234 +           1       24    10    14     41.7 HC        
## 3 Cont… 64179 64179 +           1       10     9     1     90   HC        
## 4 Cont… 71523 71523 +           1       13    13     0    100   HC        
## 5 Cont… 71533 71533 +           1       17    17     0    100   HC        
## 6 Cont… 71542 71542 +           1       16    15     1     93.8 HC        
## # … with 11 more variables: contig_gene <chr>, source_gene <chr>,
## #   feature_gene <chr>, start_gene_2kb <dbl>, end_gene_2kb <dbl>,
## #   unknown1_gene <chr>, strand_gene <chr>, unknown2_gene <chr>,
## #   notes_gene <chr>, gene <chr>, group <chr>

head(meth_filter_reshaped) #data frame that contains coverage and % methylation info

##       chr start   end strand sample coverage numCs numTs  percMeth population
## 1 Contig0 39226 39226      +      1       21    11    10  52.38095         HC
## 2 Contig0 39234 39234      +      1       24    10    14  41.66667         HC
## 3 Contig0 64179 64179      +      1       10     9     1  90.00000         HC
## 4 Contig0 71523 71523      +      1       13    13     0 100.00000         HC
## 5 Contig0 71533 71533      +      1       17    17     0 100.00000         HC
## 6 Contig0 71542 71542      +      1       16    15     1  93.75000         HC

# I think this filtering step doesn't account for the 2kb +/- start and stop 
perc_meth_genes_2kbslop <- sub_meth_table_2kbslop %>% 
   group_by(population, sample, contig_gene, start_gene_2kb, end_gene_2kb) %>%
   dplyr::summarise(
    mean_percMeth = mean(percMeth, na.rm=TRUE),
    sd_percMeth=sd(percMeth, na.rm=TRUE),
    n())  

# check to make sure % methylation is calculated separately for each sample and gene region 
perc_meth_genes_2kbslop %>% filter(contig_gene=="Contig0", start_gene_2kb==10497, end_gene_2kb==95068) %>%
  ggplot(aes(x=sample, y=mean_percMeth)) + geom_bar(stat="identity")

# How many unique gene regions? 
perc_meth_genes_2kbslop %>% 
       ungroup() %>% 
  dplyr::select(contig_gene, start_gene_2kb, end_gene_2kb) %>%
       distinct(contig_gene, start_gene_2kb, end_gene_2kb) %>%
  nrow()

## [1] 1393

# Reshape data. Need to have one row per sample, one column with the population, and separate columns with each gene region with % methylation. 
perc_meth_genes_2kbslop_wide <- perc_meth_genes_2kbslop %>% 
  ungroup() %>%
  tidyr::unite("gene_region", c("contig_gene", "start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
  dplyr::select(population, sample, gene_region, mean_percMeth) %>%
  spread(gene_region, mean_percMeth) %>%
  tibble::column_to_rownames(var = "sample")

head(perc_meth_genes_2kbslop_wide[1:4]) #confirm correct format

##   population Contig0_10497_95068 Contig100188_1_4123 Contig100396_1_4908
## 1         HC            88.86439            97.79135            93.75000
## 2         HC            88.54722            91.18234            93.47985
## 3         HC            89.21439            88.03419            93.99802
## 4         HC            87.40210            90.34792            95.97155
## 5         HC            84.91479            94.96795            87.76224
## 6         HC            84.48810           100.00000           100.00000

ncol(perc_meth_genes_2kbslop_wide) #1394 gene regions

## [1] 1394

Run Pst

#Now run the following line and it will provide Pst estimates for every gene.
genes_2kbslop_5loci_Pst <- Pst(perc_meth_genes_2kbslop_wide)

## [1] "Populations sizes are:"
## HC SS 
##  9  9

# Check out Pst distribution 
hist(genes_2kbslop_5loci_Pst$Pst_Values)

summary(genes_2kbslop_5loci_Pst$Pst_Values)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000003 0.0603303 0.2234150 0.2948533 0.5093137 0.9200812

nrow(genes_2kbslop_5loci_Pst)

## [1] 1393

# format of dataframe that I will save 
head(genes_2kbslop_5loci_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE))

##              Quant_Varia  contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1    Contig0_10497_95068      Contig0          10497        95068 0.01933289
## 2    Contig100188_1_4123 Contig100188              1         4123 0.73496201
## 3    Contig100396_1_4908 Contig100396              1         4908 0.18520587
## 4    Contig100994_1_8602 Contig100994              1         8602 0.70114357
## 5 Contig1013_14071_19514   Contig1013          14071        19514 0.01722196
## 6 Contig1013_16204_21663   Contig1013          16204        21663 0.03959333

# Write out Pst results 
write.table(genes_2kbslop_5loci_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE), file = here::here("analyses", "DMGs", "Pst_gene_2kbslop_5loci.tab"), sep = '\t', na = "NA", row.names = FALSE, col.names = TRUE)

head(genes_2kbslop_5loci_Pst %>% 
  separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
  filter(Pst_Values>0.5))

##              Quant_Varia  contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1    Contig100188_1_4123 Contig100188              1         4123  0.7349620
## 2    Contig100994_1_8602 Contig100994              1         8602  0.7011436
## 3 Contig101333_3371_8675 Contig101333           3371         8675  0.7122709
## 4     Contig1024_1_14330   Contig1024              1        14330  0.9200812
## 5 Contig103503_4845_9852 Contig103503           4845         9852  0.8293130
## 6     Contig10373_1_3003  Contig10373              1         3003  0.7098571

Merge DMG list with Pst values

#Merge # (gene region list+Pst values) with (gene region list + adjusted p-values (<0.05 indicates significant)) 

genes_2kbslop_Pst_DMGpvalue <- merge(x=genes_2kbslop_5loci_Pst,  
      y=results_2kbslop[!duplicated(results_2kbslop$gene), c("contig_gene", "start_gene_2kb", "end_gene_2kb", "notes_gene", "gene", "adj.pval.pop")],
      by.x="Quant_Varia", 
      by.y="gene")
save(genes_2kbslop_Pst_DMGpvalue, file = "../analyses/DMGs/R-objects/genes_2kbslop_Pst_DMGpvalue")

# Assess relationship between Pst values and P-adjusted for DMG regions 
hist(genes_2kbslop_Pst_DMGpvalue$Pst_Values^0.5)

hist(genes_2kbslop_Pst_DMGpvalue$adj.pval.pop)

summary(lm(Pst_Values^.5 ~ adj.pval.pop, data=genes_2kbslop_Pst_DMGpvalue))

## 
## Call:
## lm(formula = Pst_Values^0.5 ~ adj.pval.pop, data = genes_2kbslop_Pst_DMGpvalue)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.62760 -0.12509  0.03018  0.14487  0.49766 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)   0.707490   0.008561   82.64   <2e-16 ***
## adj.pval.pop -0.533980   0.015650  -34.12   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1939 on 1391 degrees of freedom
## Multiple R-squared:  0.4556, Adjusted R-squared:  0.4552 
## F-statistic:  1164 on 1 and 1391 DF,  p-value: < 2.2e-16

library(ggpmisc)
formula <- genes_2kbslop_Pst_DMGpvalue$adj.pval.pop ~ genes_2kbslop_Pst_DMGpvalue$Pst_Values

ggplot(genes_2kbslop_Pst_DMGpvalue, aes(x=Pst_Values, y=adj.pval.pop)) +
  geom_point(aes(colour = cut(adj.pval.pop, c(-Inf, 0.05, Inf)))) + 
  scale_color_manual(name = "DMG region significance",
                     values = c("(-Inf,0.05]" = "red",
                                  "(0.05, Inf]" = "black"),
                     labels = c("significant", "non-significant")) +
  ylab("P-Adjusted from diff. methylated gene region analysis") +
  xlab("Pst values, gene regions") + 
  ggtitle("Gene Region P-Adjusted ~ Pst") +
  geom_smooth(method = "lm", se = F) +
  stat_poly_eq(aes(label = paste(..eq.label.., ..rr.label.., sep = "~~~")), 
               label.x.npc = "right", label.y.npc = 0.85,
               formula = formula, parse = TRUE, size = 3, col="blue") +
  ylim(c(0,1))

## `geom_smooth()` using formula 'y ~ x'

## Warning: Removed 13 rows containing missing values (geom_smooth).

Second, calculate Pst of average % methylation only within genes. Only look genes that have at minimum 5 methylated loci (this was done in a previous chunk). I need the % methylation data for each of the gene regions, so I’ll first create that dataframe.

# Read in data file with gene bodies that contain methylated loci 
meth_filter_genes <- 
  read_delim(file = here::here("analyses", "methylation-filtered", "meth_filter_gene.tab"), delim = "\t", col_names = c(colnames(meth_filter_reshaped[-10]), "population", "contig_gene", "source_gene", "feature_gene", "start_gene", "end_gene", "unknown1_gene", "strand_gene", "unknown2_gene", "notes_gene")) %>%
  mutate(gene=paste(contig_gene, start_gene, end_gene, sep="_")) %>%
  mutate(group=paste(sample, gene, sep="_"))

## Parsed with column specification:
## cols(
##   chr = col_character(),
##   start = col_double(),
##   end = col_double(),
##   strand = col_character(),
##   sample = col_double(),
##   coverage = col_double(),
##   numCs = col_double(),
##   numTs = col_double(),
##   percMeth = col_double(),
##   population = col_character(),
##   contig_gene = col_character(),
##   source_gene = col_character(),
##   feature_gene = col_character(),
##   start_gene = col_double(),
##   end_gene = col_double(),
##   unknown1_gene = col_character(),
##   strand_gene = col_character(),
##   unknown2_gene = col_character(),
##   notes_gene = col_character()
## )

# Filter for gene bodies that have 5 or more methylated loci 
min.filt_genes <- dplyr::count(meth_filter_genes, vars = c(group))
newdata <- min.filt_genes[which(min.filt_genes$n > 4), ]
sub_meth_table <- meth_filter_genes[meth_filter_genes$group %in% newdata$vars,]

# Calculate average % methylation within gene bodies by sample 
perc_meth_genes <- sub_meth_table %>% 
   group_by(population, sample, contig_gene, start_gene, end_gene) %>%
   dplyr::summarise(
    mean_percMeth = mean(percMeth, na.rm=TRUE),
    sd_percMeth=sd(percMeth, na.rm=TRUE),
    n())  

# check to make sure % methylation is calculated separately for each sample and gene region 
perc_meth_genes %>% filter(contig_gene=="Contig0", start_gene==12497, end_gene==93068) %>%
  ggplot(aes(x=sample, y=mean_percMeth)) + geom_bar(stat="identity")

# Reshape data. Need to have one row per sample, one column with the population, and separate columns with each gene region with % methylation. 
perc_meth_genes_wide <- perc_meth_genes %>% 
  ungroup() %>%
  tidyr::unite("gene_region", c("contig_gene", "start_gene", "end_gene"), sep = "_", remove = FALSE) %>%
  dplyr::select(population, sample, gene_region, mean_percMeth) %>%
  spread(gene_region, mean_percMeth) %>%
  column_to_rownames(var = "sample")
head(perc_meth_genes_wide[1:4]) #confirm correct format

##   population Contig0_12497_93068 Contig100188_461_2761 Contig100396_517_2908
## 1         HC            88.86439              97.79135              93.75000
## 2         HC            88.54722              91.18234              93.47985
## 3         HC            89.21439              88.03419              93.99802
## 4         HC            87.40210              90.34792              95.97155
## 5         HC            84.91479              94.96795              87.76224
## 6         HC            84.48810             100.00000             100.00000

#Now run the following line and it will provide Pst estimates for every gene.
genes_Pst <- Pst(perc_meth_genes_wide)

## [1] "Populations sizes are:"
## HC SS 
##  9  9

hist(genes_Pst$Pst_Values)

summary(genes_Pst$Pst_Values)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000003 0.0653852 0.2306499 0.2976943 0.5102395 0.9200812

nrow(genes_Pst)

## [1] 1248

# Write out Pst results 
write.table(genes_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene", "end_gene"), sep = "_", remove = FALSE), file = here::here("analyses", "DMGs", "Pst_gene.tab"), sep = '\t', na = "NA", row.names = FALSE, col.names = TRUE)

Heatmap of DMGs

DMG.ratios <- DMGs_2kbslop %>% 
  #mutate(contig_gene_start=paste(contig_gene, start_gene_2kb, sep="_")) %>%
  group_by(population, gene) %>%
  summarise(mean_percentMeth = mean(percMeth, na.rm = TRUE)) %>% 
  dcast(gene ~ population) %>% 
  mutate(ratio_HC.SS = HC/SS) %>%
  arrange(ratio_HC.SS) #in this ratio column, values <1 = HC hypomethylated, values >1 = SS hypomethylated

## Using mean_percentMeth as value column: use value.var to override.

ggplot(DMGs_2kbslop, aes(sample, gene, fill= percMeth)) + xlab("Sample") + geom_tile(na.rm = T) +
  scale_y_discrete(limits=(DMG.ratios[order(DMG.ratios$ratio_HC.SS),]$gene)) + 
  #scale_fill_viridis(discrete=FALSE) 
  theme(axis.title.y=element_blank(),
        axis.text.y=element_blank(),
        axis.ticks.y=element_blank()) + 
  scale_fill_distiller(palette = "YlGnBu", direction = 1)

  #scale_fill_gradient(low="gray5", high="white")

DMG analysis with relaxed loci settings

Filter for genes with at minimum 4 methylated loci

newdata_4loci <- min.filt_2kbslop[which(min.filt_2kbslop$n > 3), ]
sub_meth_table_2kbslop_4loci <- meth_filter_genes_2kbslop[meth_filter_genes_2kbslop$group %in% newdata_4loci$vars,]
save(sub_meth_table_2kbslop_4loci, file="../analyses/DMGs/R-objects/sub_meth_table_2kbslop_4loci")

Here is the number of genes that remain after filtering for those with 4 or more methylated loci within each gene region:

length(unique(sub_meth_table_2kbslop_4loci$gene))

## [1] 1723

Run GLM to test for differences among population for each gene individually

# create data frame to stored results
results_2kbslop_4loci <- data.frame()

gs <- unique(sub_meth_table_2kbslop_4loci$gene)

#first subset the unique dataframes and second run the GLMs
for(i in 1:length(gs)){
  
  #subset the dataframe gene by gene
  sub_meth_table_2kbslop1 <- subset(sub_meth_table_2kbslop_4loci, gene ==gs[i])
  
  # fit glm position model
  fit <- glm(matrix(c(numCs, numTs), ncol=2) ~ as.factor(population) + (1|sample), 
             data=sub_meth_table_2kbslop1, family=binomial)
  a <- anova(fit, test="Chisq")
  
  # capture summary stats to data frame
  df <- data.frame(sub_meth_table_2kbslop1[c("population", "sample", "contig_gene", "start_gene_2kb", "end_gene_2kb", "gene", "chr", "start", "sample", "coverage", "numCs", "numTs", "percMeth", "notes_gene")],
                   pval.treatment = a$`Pr(>Chi)`[2],
                   #pval.position = a$`Pr(>Chi)`[3], #uncomment if you want to include position of CpG within a gene
                   #pval.treatment_x_position = a$`Pr(>Chi)`[4], #uncomment if you want to include position of CpG within a gene interaction with treatment
                   stringsAsFactors = F)
  
  # bind rows of temporary data frame to the results data frame
  results_2kbslop_4loci <- rbind(results_2kbslop_4loci, df)
}

results_2kbslop_4loci[is.na(results_2kbslop_4loci)] <- 0
results_2kbslop_4loci$adj.pval.pop <- p.adjust(results_2kbslop_4loci$pval.treatment, method='BH')
#results_2kbslop$adj.pval.position <- p.adjust(results_2kbslop$pval.position, method='BH') #uncomment if you want to include position of CpG within a gene
#results_2kbslop$adj.pval.treatment_x_position <- p.adjust(results_2kbslop$pval.treatment_x_position, method='BH') #uncomment if you want to include position of CpG within a gene interaction with treatment

Extract only genes that were differentially methylated (4 loci min) (p-adj < 0.05):

#save df with differentially methylated genes 
DMGs_2kbslop_4loci <- subset(results_2kbslop_4loci, adj.pval.pop < 0.05) #%>%
#  mutate(contig_gene_start=paste(contig_gene, start_gene_2kb, sep="_"))

DMGs_2kbslop_4loci.genes <- DMGs_2kbslop_4loci[!duplicated(DMGs_2kbslop_4loci$gene), c("contig_gene", "start_gene_2kb", "end_gene_2kb", "notes_gene", "pval.treatment", "adj.pval.pop")]
save(DMGs_2kbslop_4loci, file="../analyses/DMGs/R-objects/DMGs_2kbslop_4loci")

Here is the number of differentially methylated genes (4 loci min):

length(unique(DMGs_2kbslop_4loci$gene))

## [1] 342

Extract GO terms for DMGs_2kbslop_4loci and save to file

# split gene data in "notes_gene" column into separate columns 
DMGs_2kbslop_4loci.genes <- DMGs_2kbslop_4loci.genes %>%
  mutate(ID=str_extract(notes_gene, "ID=(.*?);"),
       Parent=str_extract(notes_gene, "Parent=(.*?);"),
       Name=str_extract(notes_gene, "Name=(.*?);"),
       Alias=str_extract(notes_gene, "Alias=(.*?);"),
       AED=str_extract(notes_gene, "AED=(.*?);"),
       eAED=str_extract(notes_gene, "eAED=(.*?);"),
       Note=str_extract(notes_gene, "Note=(.*?);"),
       Ontology_term=str_extract(notes_gene, "Ontology_term=(.*?);"),
       Dbxref=str_extract(notes_gene, "Dbxref=(.*?);")
       )
write_csv(DMGs_2kbslop_4loci.genes, path = here::here("analyses", "DMGs", "DMGs_2kbslop_4loci.genes.csv"))

#Extract GO terms 
DMGs_2kbslop_4loci.genes.GO <- DMGs_2kbslop_4loci.genes %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern="Ontology_term=",replacement = "")) %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern=";",replacement = "")) %>%
  separate(Ontology_term, sep=",", into=paste("GO", 1:11, sep="_")) %>%
  pivot_longer(cols=c("GO_1","GO_2","GO_3","GO_4","GO_5","GO_6","GO_7","GO_8","GO_9","GO_10","GO_11"), names_to = "GO_number", values_to = "GO_term") %>%
  dplyr::select(-GO_number) %>%
  filter(!is.na(Note) & !is.na(GO_term))

## Warning: Expected 11 pieces. Missing pieces filled with `NA` in 156 rows [1, 3,
## 4, 5, 9, 10, 11, 12, 14, 15, 17, 19, 20, 22, 23, 29, 31, 32, 33, 34, ...].

write_delim(DMGs_2kbslop_4loci.genes.GO[,c("GO_term","adj.pval.pop")], path = here::here("analyses/", "DMGs/", "DMGs_2kbslop_4loci.GO.txt"), delim = '\t', col_names = F) #write out df with just GO terms and p-adj values

Calculate Pst for each gene (min 4 loci per gene)

First, calculate Pst of average % methylation within genes and +/- 2kb. Only look genes that have at minimum 4 methylated loci (this was done in a previous chunk). I need the % methylation data for each of the gene regions, so I’ll first create that dataframe.

# How many gene regions are there after filtering for those with 4 methylated loci? 
sub_meth_table_2kbslop_4loci %>% 
       distinct(gene) %>%
  nrow()

## [1] 1723

# I think this filtering step doesn't account for the 2kb +/- start and stop 
perc_meth_genes_2kbslop_4loci <- sub_meth_table_2kbslop_4loci %>% 
   group_by(population, sample, contig_gene, start_gene_2kb, end_gene_2kb) %>%
   dplyr::summarise(
    mean_percMeth = mean(percMeth, na.rm=TRUE),
    sd_percMeth=sd(percMeth, na.rm=TRUE),
    n())  

# check to make sure % methylation is calculated separately for each sample and gene region 
perc_meth_genes_2kbslop_4loci %>% filter(contig_gene=="Contig0", start_gene_2kb==10497, end_gene_2kb==95068) %>%
  ggplot(aes(x=sample, y=mean_percMeth)) + geom_bar(stat="identity")

# How many unique gene regions? 
perc_meth_genes_2kbslop_4loci %>% 
       ungroup() %>% 
  dplyr::select(contig_gene, start_gene_2kb, end_gene_2kb) %>%
       distinct(contig_gene, start_gene_2kb, end_gene_2kb) %>%
  nrow()

## [1] 1723

# Reshape data. Need to have one row per sample, one column with the population, and separate columns with each gene region with % methylation. 
perc_meth_genes_2kbslop_4loci_wide <- perc_meth_genes_2kbslop_4loci %>% 
  ungroup() %>%
  tidyr::unite("gene_region", c("contig_gene", "start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
  dplyr::select(population, sample, gene_region, mean_percMeth) %>%
  spread(gene_region, mean_percMeth) %>%
  tibble::column_to_rownames(var = "sample")

head(perc_meth_genes_2kbslop_4loci_wide[1:4]) #confirm correct format

##   population Contig0_10497_95068 Contig100188_1_4123 Contig100396_1_4908
## 1         HC            88.86439            97.79135            93.75000
## 2         HC            88.54722            91.18234            93.47985
## 3         HC            89.21439            88.03419            93.99802
## 4         HC            87.40210            90.34792            95.97155
## 5         HC            84.91479            94.96795            87.76224
## 6         HC            84.48810           100.00000           100.00000

ncol(perc_meth_genes_2kbslop_4loci_wide) #1724 gene regions

## [1] 1724

Run Pst

#Now run the following line and it will provide Pst estimates for every gene.
genes_2kbslop_4loci_Pst <- Pst(perc_meth_genes_2kbslop_4loci_wide)

## [1] "Populations sizes are:"
## HC SS 
##  9  9

# Check out Pst distribution 
hist(genes_2kbslop_4loci_Pst$Pst_Values)

summary(genes_2kbslop_4loci_Pst$Pst_Values)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000003 0.0660358 0.2423961 0.3014315 0.5164664 0.9306490

nrow(genes_2kbslop_4loci_Pst)

## [1] 1723

# format of dataframe that I will save 
head(genes_2kbslop_4loci_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE))

##              Quant_Varia  contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1    Contig0_10497_95068      Contig0          10497        95068 0.01933289
## 2    Contig100188_1_4123 Contig100188              1         4123 0.73496201
## 3    Contig100396_1_4908 Contig100396              1         4908 0.18520587
## 4    Contig100994_1_8602 Contig100994              1         8602 0.70114357
## 5   Contig10117_48_10916  Contig10117             48        10916 0.05127876
## 6 Contig1013_14071_19514   Contig1013          14071        19514 0.01722196

# Write out Pst results 
write.table(genes_2kbslop_4loci_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE), file = here::here("analyses", "DMGs", "Pst_gene_2kbslop_4loci.tab"), sep = '\t', na = "NA", row.names = FALSE, col.names = TRUE)

head(genes_2kbslop_4loci_Pst %>% 
  separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
  filter(Pst_Values>0.5))

##              Quant_Varia  contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1    Contig100188_1_4123 Contig100188              1         4123  0.7349620
## 2    Contig100994_1_8602 Contig100994              1         8602  0.7011436
## 3 Contig101333_3371_8675 Contig101333           3371         8675  0.7122709
## 4     Contig1024_1_14330   Contig1024              1        14330  0.9200812
## 5 Contig103503_4845_9852 Contig103503           4845         9852  0.8293130
## 6     Contig10373_1_3003  Contig10373              1         3003  0.7098571

DMG analysis with relaxed loci settings

Filter for genes with at minimum 3 methylated loci

newdata_3loci <- min.filt_2kbslop[which(min.filt_2kbslop$n > 2), ]
sub_meth_table_2kbslop_3loci <- meth_filter_genes_2kbslop[meth_filter_genes_2kbslop$group %in% newdata_3loci$vars,]
save(sub_meth_table_2kbslop_3loci, file="../analyses/DMGs/R-objects/sub_meth_table_2kbslop_3loci")

Here is the number of genes that remain after filtering for those with 4 or more methylated loci within each gene region:

length(unique(sub_meth_table_2kbslop_3loci$gene))

## [1] 2177

Run GLM to test for differences among population for each gene individually

# create data frame to stored results
results_2kbslop_3loci <- data.frame()

gs <- unique(sub_meth_table_2kbslop_3loci$gene)

#first subset the unique dataframes and second run the GLMs
for(i in 1:length(gs)){
  
  #subset the dataframe gene by gene
  sub_meth_table_2kbslop1 <- subset(sub_meth_table_2kbslop_3loci, gene ==gs[i])
  
  # fit glm position model
  fit <- glm(matrix(c(numCs, numTs), ncol=2) ~ as.factor(population) + (1|sample), 
             data=sub_meth_table_2kbslop1, family=binomial)
  a <- anova(fit, test="Chisq")
  
  # capture summary stats to data frame
  df <- data.frame(sub_meth_table_2kbslop1[c("population", "sample", "contig_gene", "start_gene_2kb", "end_gene_2kb", "gene", "chr", "start", "sample", "coverage", "numCs", "numTs", "percMeth", "notes_gene")],
                   pval.treatment = a$`Pr(>Chi)`[2],
                   #pval.position = a$`Pr(>Chi)`[3], #uncomment if you want to include position of CpG within a gene
                   #pval.treatment_x_position = a$`Pr(>Chi)`[4], #uncomment if you want to include position of CpG within a gene interaction with treatment
                   stringsAsFactors = F)
  
  # bind rows of temporary data frame to the results data frame
  results_2kbslop_3loci <- rbind(results_2kbslop_3loci, df)
}

results_2kbslop_3loci[is.na(results_2kbslop_3loci)] <- 0
results_2kbslop_3loci$adj.pval.pop <- p.adjust(results_2kbslop_3loci$pval.treatment, method='BH')
#results_2kbslop$adj.pval.position <- p.adjust(results_2kbslop$pval.position, method='BH') #uncomment if you want to include position of CpG within a gene
#results_2kbslop$adj.pval.treatment_x_position <- p.adjust(results_2kbslop$pval.treatment_x_position, method='BH') #uncomment if you want to include position of CpG within a gene interaction with treatment

Extract only genes that were differentially methylated (3 loci min) (p-adj < 0.05):

#save df with differentially methylated genes 
DMGs_2kbslop_3loci <- subset(results_2kbslop_3loci, adj.pval.pop < 0.05) #%>%
#  mutate(contig_gene_start=paste(contig_gene, start_gene_2kb, sep="_"))

DMGs_2kbslop_3loci.genes <- DMGs_2kbslop_3loci[!duplicated(DMGs_2kbslop_3loci$gene), c("contig_gene", "start_gene_2kb", "end_gene_2kb", "notes_gene", "pval.treatment", "adj.pval.pop")]
save(DMGs_2kbslop_3loci, file="../analyses/DMGs/R-objects/DMGs_2kbslop_3loci")

Here is the number of differentially methylated genes (3 loci min):

length(unique(DMGs_2kbslop_3loci$gene))

## [1] 401

Extract GO terms for DMGs_2kbslop_3loci and save to file

# split gene data in "notes_gene" column into separate columns 
DMGs_2kbslop_3loci.genes <- DMGs_2kbslop_3loci.genes %>%
  mutate(ID=str_extract(notes_gene, "ID=(.*?);"),
       Parent=str_extract(notes_gene, "Parent=(.*?);"),
       Name=str_extract(notes_gene, "Name=(.*?);"),
       Alias=str_extract(notes_gene, "Alias=(.*?);"),
       AED=str_extract(notes_gene, "AED=(.*?);"),
       eAED=str_extract(notes_gene, "eAED=(.*?);"),
       Note=str_extract(notes_gene, "Note=(.*?);"),
       Ontology_term=str_extract(notes_gene, "Ontology_term=(.*?);"),
       Dbxref=str_extract(notes_gene, "Dbxref=(.*?);")
       )
write_csv(DMGs_2kbslop_3loci.genes, path = here::here("analyses", "DMGs", "DMGs_2kbslop_3loci.genes.csv"))

#Extract GO terms 
DMGs_2kbslop_3loci.genes.GO <- DMGs_2kbslop_3loci.genes %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern="Ontology_term=",replacement = "")) %>%
  mutate(Ontology_term = str_replace(Ontology_term, pattern=";",replacement = "")) %>%
  separate(Ontology_term, sep=",", into=paste("GO", 1:11, sep="_")) %>%
  pivot_longer(cols=c("GO_1","GO_2","GO_3","GO_4","GO_5","GO_6","GO_7","GO_8","GO_9","GO_10","GO_11"), names_to = "GO_number", values_to = "GO_term") %>%
  dplyr::select(-GO_number) %>%
  filter(!is.na(Note) & !is.na(GO_term))

## Warning: Expected 11 pieces. Missing pieces filled with `NA` in 179 rows [1, 3,
## 4, 5, 10, 11, 12, 13, 15, 16, 18, 20, 21, 23, 24, 30, 31, 33, 36, 37, ...].

write_delim(DMGs_2kbslop_3loci.genes.GO[,c("GO_term","adj.pval.pop")], path = here::here("analyses/", "DMGs/", "DMGs_2kbslop_3loci.GO.txt"), delim = '\t', col_names = F) #write out df with just GO terms and p-adj values

Calculate Pst for each gene (min 3 loci per gene)

First, calculate Pst of average % methylation within genes and +/- 2kb. Only look genes that have at minimum 3 methylated loci (this was done in a previous chunk). I need the % methylation data for each of the gene regions, so I’ll first create that dataframe.

# How many gene regions are there after filtering for those with 3 methylated loci? 
sub_meth_table_2kbslop_3loci %>% 
       distinct(gene) %>%
  nrow()

## [1] 2177

# I think this filtering step doesn't account for the 2kb +/- start and stop 
perc_meth_genes_2kbslop_3loci <- sub_meth_table_2kbslop_3loci %>% 
   group_by(population, sample, contig_gene, start_gene_2kb, end_gene_2kb) %>%
   dplyr::summarise(
    mean_percMeth = mean(percMeth, na.rm=TRUE),
    sd_percMeth=sd(percMeth, na.rm=TRUE),
    n())  

# check to make sure % methylation is calculated separately for each sample and gene region 
perc_meth_genes_2kbslop_3loci %>% filter(contig_gene=="Contig0", start_gene_2kb==10497, end_gene_2kb==95068) %>%
  ggplot(aes(x=sample, y=mean_percMeth)) + geom_bar(stat="identity")

# How many unique gene regions? 
perc_meth_genes_2kbslop_3loci %>% 
       ungroup() %>% 
  dplyr::select(contig_gene, start_gene_2kb, end_gene_2kb) %>%
       distinct(contig_gene, start_gene_2kb, end_gene_2kb) %>%
  nrow()

## [1] 2177

# Reshape data. Need to have one row per sample, one column with the population, and separate columns with each gene region with % methylation. 
perc_meth_genes_2kbslop_3loci_wide <- perc_meth_genes_2kbslop_3loci %>% 
  ungroup() %>%
  tidyr::unite("gene_region", c("contig_gene", "start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
  dplyr::select(population, sample, gene_region, mean_percMeth) %>%
  spread(gene_region, mean_percMeth) %>%
  tibble::column_to_rownames(var = "sample")

head(perc_meth_genes_2kbslop_3loci_wide[1:4]) #confirm correct format

##   population Contig0_10497_95068 Contig100188_1_4123 Contig100396_1_4908
## 1         HC            88.86439            97.79135            93.75000
## 2         HC            88.54722            91.18234            93.47985
## 3         HC            89.21439            88.03419            93.99802
## 4         HC            87.40210            90.34792            95.97155
## 5         HC            84.91479            94.96795            87.76224
## 6         HC            84.48810           100.00000           100.00000

ncol(perc_meth_genes_2kbslop_3loci_wide) #1724 gene regions

## [1] 2178

Run Pst

#Now run the following line and it will provide Pst estimates for every gene.
genes_2kbslop_3loci_Pst <- Pst(perc_meth_genes_2kbslop_3loci_wide)

## [1] "Populations sizes are:"
## HC SS 
##  9  9

# Check out Pst distribution 
hist(genes_2kbslop_3loci_Pst$Pst_Values)

summary(genes_2kbslop_3loci_Pst$Pst_Values)

##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## 0.0000003 0.0661707 0.2558255 0.3043360 0.5174845 0.9306490

nrow(genes_2kbslop_3loci_Pst)

## [1] 2177

# format of dataframe that I will save 
head(genes_2kbslop_3loci_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE))

##               Quant_Varia  contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1     Contig0_10497_95068      Contig0          10497        95068 0.01933289
## 2     Contig100188_1_4123 Contig100188              1         4123 0.73496201
## 3     Contig100396_1_4908 Contig100396              1         4908 0.18520587
## 4 Contig100743_7752_12849 Contig100743           7752        12849 0.51317776
## 5     Contig100994_1_8602 Contig100994              1         8602 0.70114357
## 6    Contig10117_48_10916  Contig10117             48        10916 0.05127876

# Write out Pst results 
write.table(genes_2kbslop_3loci_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE), file = here::here("analyses", "DMGs", "Pst_gene_2kbslop_3loci.tab"), sep = '\t', na = "NA", row.names = FALSE, col.names = TRUE)

head(genes_2kbslop_3loci_Pst %>% 
  separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
  filter(Pst_Values>0.5))

##               Quant_Varia  contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1     Contig100188_1_4123 Contig100188              1         4123  0.7349620
## 2 Contig100743_7752_12849 Contig100743           7752        12849  0.5131778
## 3     Contig100994_1_8602 Contig100994              1         8602  0.7011436
## 4  Contig101333_3371_8675 Contig101333           3371         8675  0.7122709
## 5      Contig1024_1_14330   Contig1024              1        14330  0.9200812
## 6  Contig103503_4845_9852 Contig103503           4845         9852  0.8293130

Calculate Pst for all genes represented in our methylation data

First, how many gene regions are there?

meth_filter_genes_2kbslop %>% 
       distinct(gene) %>%
  nrow()

## [1] 3754

I need the % methylation data for each of the gene regions, so I’ll first create that dataframe.

perc_meth_genes_2kbslop <- meth_filter_genes_2kbslop %>% 
   group_by(population, sample, contig_gene, start_gene_2kb, end_gene_2kb) %>%
   dplyr::summarise(
    mean_percMeth = mean(percMeth, na.rm=TRUE),
    sd_percMeth=sd(percMeth, na.rm=TRUE),
    n())  

# check a couple loci to make sure % methylation is calculated separately for each sample and gene region 
perc_meth_genes_2kbslop %>% filter(contig_gene=="Contig0", start_gene_2kb==10497, end_gene_2kb==95068) %>%
  ggplot(aes(x=sample, y=mean_percMeth)) + geom_bar(stat="identity")

perc_meth_genes_2kbslop %>% filter(contig_gene=="Contig1050", start_gene_2kb==4735, end_gene_2kb==27978) %>%
  ggplot(aes(x=sample, y=mean_percMeth)) + geom_bar(stat="identity")

# Reshape data. Need to have one row per sample, one column with the population, and separate columns with each gene region with % methylation. 
perc_meth_genes_2kbslop_wide <- perc_meth_genes_2kbslop %>% 
  ungroup() %>%
  tidyr::unite("gene_region", c("contig_gene", "start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
  dplyr::select(population, sample, gene_region, mean_percMeth) %>%
  spread(gene_region, mean_percMeth) %>%
  tibble::column_to_rownames(var = "sample")

head(perc_meth_genes_2kbslop_wide[1:4]) #confirm correct format

##   population Contig0_10497_95068 Contig100_48322_58076 Contig100188_1_4123
## 1         HC            88.86439                    90            97.79135
## 2         HC            88.54722                    90            91.18234
## 3         HC            89.21439                    80            88.03419
## 4         HC            87.40210                   100            90.34792
## 5         HC            84.91479                   NaN            94.96795
## 6         HC            84.48810                   NaN           100.00000

ncol(perc_meth_genes_2kbslop_wide) # 3754 gene regions

## [1] 3755

Run Pst

#Now run the following line and it will provide Pst estimates for every gene.
genes_2kbslop_Pst <- Pst(perc_meth_genes_2kbslop_wide)

## [1] "Populations sizes are:"
## HC SS 
##  9  9

# This object contains Pst values for all genes that contain methylation data.  
save(genes_2kbslop_Pst, file = "../analyses/methylation-filtered/R-objects/genes_2kbslop_Pst") #save object to file

# Check out Pst distribution & summary statistics 
hist(genes_2kbslop_Pst$Pst_Values)

mean(genes_2kbslop_Pst$Pst_Values)

## [1] 0.2938804

min(genes_2kbslop_Pst$Pst_Values)

## [1] 3.096144e-32

nrow(genes_2kbslop_Pst) #

## [1] 3754

# format of dataframe that I will save 
head(genes_2kbslop_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE))

##               Quant_Varia  contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1     Contig0_10497_95068      Contig0          10497        95068 0.01933289
## 2   Contig100_48322_58076    Contig100          48322        58076 0.06700810
## 3     Contig100188_1_4123 Contig100188              1         4123 0.73496201
## 4 Contig100199_5411_27437 Contig100199           5411        27437 0.06825011
## 5     Contig100396_1_4908 Contig100396              1         4908 0.18520587
## 6   Contig100499_508_4044 Contig100499            508         4044 0.15042326

# Write out Pst results 
write.table(genes_2kbslop_Pst %>% separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE), file = here::here("analyses", "DMGs", "Pst_gene_2kbslop.tab"), sep = '\t', na = "NA", row.names = FALSE, col.names = TRUE)

head(genes_2kbslop_Pst %>% 
  separate(Quant_Varia, into=c("contig_gene","start_gene_2kb", "end_gene_2kb"), sep = "_", remove = FALSE) %>%
  filter(Pst_Values>0.5))

##               Quant_Varia  contig_gene start_gene_2kb end_gene_2kb Pst_Values
## 1     Contig100188_1_4123 Contig100188              1         4123  0.7349620
## 2 Contig100743_7752_12849 Contig100743           7752        12849  0.5131778
## 3     Contig100844_1_4149 Contig100844              1         4149  0.5276145
## 4     Contig100994_1_8602 Contig100994              1         8602  0.7011436
## 5      Contig1010_1_34210   Contig1010              1        34210  0.5606926
## 6  Contig101333_3371_8675 Contig101333           3371         8675  0.7122709

Examine methylation patters by gene location

# Methylation data within genes 
meth_filter_genes

## # A tibble: 336,384 x 21
##    chr   start   end strand sample coverage numCs numTs percMeth population
##    <chr> <dbl> <dbl> <chr>   <dbl>    <dbl> <dbl> <dbl>    <dbl> <chr>     
##  1 Cont… 39226 39226 +           1       21    11    10     52.4 HC        
##  2 Cont… 39234 39234 +           1       24    10    14     41.7 HC        
##  3 Cont… 64179 64179 +           1       10     9     1     90   HC        
##  4 Cont… 71523 71523 +           1       13    13     0    100   HC        
##  5 Cont… 71533 71533 +           1       17    17     0    100   HC        
##  6 Cont… 71542 71542 +           1       16    15     1     93.8 HC        
##  7 Cont… 71546 71546 +           1       14    14     0    100   HC        
##  8 Cont… 71558 71558 +           1       17    17     0    100   HC        
##  9 Cont… 71563 71563 +           1       16    16     0    100   HC        
## 10 Cont… 71617 71617 +           1       19    15     4     78.9 HC        
## # … with 336,374 more rows, and 11 more variables: contig_gene <chr>,
## #   source_gene <chr>, feature_gene <chr>, start_gene <dbl>, end_gene <dbl>,
## #   unknown1_gene <chr>, strand_gene <chr>, unknown2_gene <chr>,
## #   notes_gene <chr>, gene <chr>, group <chr>

# Methylation data within genes +/- 2kb 
meth_filter_genes_2kbslop

## # A tibble: 379,314 x 21
##    chr   start   end strand sample coverage numCs numTs percMeth population
##    <chr> <dbl> <dbl> <chr>   <dbl>    <dbl> <dbl> <dbl>    <dbl> <chr>     
##  1 Cont… 39226 39226 +           1       21    11    10     52.4 HC        
##  2 Cont… 39234 39234 +           1       24    10    14     41.7 HC        
##  3 Cont… 64179 64179 +           1       10     9     1     90   HC        
##  4 Cont… 71523 71523 +           1       13    13     0    100   HC        
##  5 Cont… 71533 71533 +           1       17    17     0    100   HC        
##  6 Cont… 71542 71542 +           1       16    15     1     93.8 HC        
##  7 Cont… 71546 71546 +           1       14    14     0    100   HC        
##  8 Cont… 71558 71558 +           1       17    17     0    100   HC        
##  9 Cont… 71563 71563 +           1       16    16     0    100   HC        
## 10 Cont… 71617 71617 +           1       19    15     4     78.9 HC        
## # … with 379,304 more rows, and 11 more variables: contig_gene <chr>,
## #   source_gene <chr>, feature_gene <chr>, start_gene_2kb <dbl>,
## #   end_gene_2kb <dbl>, unknown1_gene <chr>, strand_gene <chr>,
## #   unknown2_gene <chr>, notes_gene <chr>, gene <chr>, group <chr>

Calculate the relative location of methylation sites across a gene

(distance from start of gene / length of gene) = (Methylated locus - start_gene) / (end_gene - start_gene)

meth_filter_genes <- meth_filter_genes %>% 
  mutate(gen_location = ((start-start_gene)/(end_gene-start_gene)))

# Histogram showing frequency of methylation data across gene (relative location)
meth_filter_genes %>% 
  group_by(gene, gen_location, population) %>% 
  summarise(mean_meth=mean(percMeth)) %>%
  ggplot(., aes(x=gen_location)) +
  stat_bin(bins=150, col="gray30", fill="lavenderblush3") +
  xlab("Relative location along gene") + 
  ylab("Frequency of methylation data") + 
  ggtitle("Methylation frequency relative to gene location") +
  theme_minimal()

# Percent methylation across gene 
meth_filter_genes$gen_location_round <- round(meth_filter_genes$gen_location, digits=2)

meth_filter_genes %>% 
  ungroup() %>%
  group_by(gen_location_round) %>% 
  summarise(mean_meth=mean(percMeth, na.rm=TRUE), sd_meth=sd(percMeth, na.rm=TRUE)) %>%
  ggplot(., aes(x=gen_location_round, y=mean_meth)) +
  geom_bar(stat="identity", col="gray30", fill="lavenderblush3") +
  # geom_errorbar(aes(ymin=mean_meth-sd_meth, ymax=mean_meth+sd_meth), width=.02,
  #                position=position_dodge(.9), ) +
  scale_y_continuous(limits = c(0, 100)) +
  xlab("Relative location along gene") +
  ylab("Mean % methylation") +
  ggtitle("% methylation by relative gene location") +
  theme_minimal()

# Plot methylation variance by gene location  
meth_filter_genes %>% 
  ungroup() %>%
  group_by(gen_location_round) %>% 
  summarise(var_meth=var(percMeth, na.rm=TRUE)) %>%
  ggplot(., aes(x=gen_location_round, y=var_meth)) +
  geom_bar(stat="identity", col="gray30", fill="lavenderblush3") +
  xlab("Relative location along gene") +
  ylab("Variance in % methylation") +
  ggtitle("Variance in % methylation by relative gene location") +
  theme_minimal()

meth_filter_genes_2kbslop <- meth_filter_genes_2kbslop %>% 
  mutate(gen_location = ((start-start_gene_2kb)/(end_gene_2kb-start_gene_2kb)))

hist(meth_filter_genes_2kbslop$gen_location, breaks = 100, main= "Frequency of methylation data by relative genome location")

# Histogram showing frequency of methylation data across genes with 2kb flanking regions (relative location)
meth_filter_genes_2kbslop %>% 
  group_by(gene, gen_location, population) %>% 
  summarise(mean_meth=mean(percMeth)) %>%
  ggplot(., aes(x=gen_location)) +
  stat_bin(bins=200, col="gray30", fill="lavenderblush3") +
  xlab("Relative location along gene") + 
  ylab("Frequency of methylation data") + 
  ggtitle("Methylation frequency along genes +/- 2kb flanks") +
  theme_minimal()

Plot methylation Pst values (mean) calculated at the loci level across a gene

load("../analyses/methylation-filtered/R-objects/perc.meth.Pst.done")

# split loci location info (contig, start, end) into three columns, merge with methylation by gene object, drop loci w/o Pst values, and plot 

perc.meth.Pst.done %>% 
  separate(Quant_Varia, c("chr", "start", "end"), "\\.") %>%
  mutate(start=as.numeric(start), end=as.numeric(end)) %>%
  right_join(., 
            meth_filter_genes %>% 
              dplyr::select(chr, start, end, start_gene, end_gene, 
                     gen_location, gen_location_round)) %>%
  drop_na(Pst_Values) %>% 
  #mutate(gen_location_round=as.factor(gen_location_round)) %>%
  ungroup() %>%
  group_by(gen_location_round) %>% 
  summarise(mean_Pst=mean(Pst_Values, na.rm=TRUE), sd_Pst=sd(Pst_Values, na.rm=TRUE)) %>%
  ggplot(., aes(x=gen_location_round, y=mean_Pst)) +
  geom_bar(stat="identity", col="gray30", fill="lavenderblush3") +
  # geom_errorbar(aes(ymin=mean_Pst, ymax=mean_Pst+sd_Pst), width=.01,
  #               position=position_dodge(.9), ) +
  xlab("Relative location along gene") +
  ylab("Methylation Pst") +
  ggtitle("Mean Methylation Pst (comparing population) by relative gene location") +
  theme_minimal() #+

## Joining, by = c("chr", "start", "end")

#  theme(axis.text.x=element_blank())

02-DMG-analysis

Laura H Spencer

1/30/2020

Load libraries

Load filtered methylation data object with sample info which was created in the notebook “01-methylkit”

Use bedtools slop to include 2kb regions on either side of gene regions

Run slopBed with gene feature file

Use intersectBed to find where loci and genes intersect, allowing loci to be mapped to annotated genes

Here is the number of loci associated with gene regions +/- 2kb:

Check out format of resulting gene regions

Now run intersectBed to NOT include the 2kb flanks

Check out resulting files which contain methylated loci that are within gene regions (+/- 2kb), and gene bodies:

For the fun of it, re-run slopBed with gene feature file to create 1) genes+2kb (downstream, 3’), and 2) genes-kb (upstream, 5’)

Binomial GLMs to test for differentially methylated genes regions.

Add columns for organization / filtering:

Here is the number of genes that are represented in our methylation data set:

Filter for genes with at minimum 5 methylated loci

Here is the number of genes that remain after filtering for those with 5 or more methylated loci within each gene region:

Run GLM to test for differences among population for each gene individually

Extract only genes that were differentially methylated (p-adj < 0.05),

Here is the number of differentially methylated genes (5 loci per gene):

Extract GO terms for DMGs_2kbslop and save to file

Here is a screen shot of the REVIGO plot showing GO terms associated with DMGs_2kbslop (5 methylated loci per gene)

Here is a script to re-create that REVIGO plot

IGV

Barplots showing % methylation of all methylated genes by population

Identify DMGs_2kbslop that contain DMLs

Here is the number of DML loci associated with DM gene regions:

Calculate Pst for each gene

First, calculate Pst of average % methylation within genes and +/- 2kb. Only look genes that have at minimum 5 methylated loci (this was done in a previous chunk). I need the % methylation data for each of the gene regions, so I’ll first create that dataframe.

Run Pst

Merge DMG list with Pst values

Second, calculate Pst of average % methylation only within genes. Only look genes that have at minimum 5 methylated loci (this was done in a previous chunk). I need the % methylation data for each of the gene regions, so I’ll first create that dataframe.

Heatmap of DMGs

DMG analysis with relaxed loci settings

Filter for genes with at minimum 4 methylated loci

Here is the number of genes that remain after filtering for those with 4 or more methylated loci within each gene region:

Run GLM to test for differences among population for each gene individually

Extract only genes that were differentially methylated (4 loci min) (p-adj < 0.05):

Here is the number of differentially methylated genes (4 loci min):

Extract GO terms for DMGs_2kbslop_4loci and save to file

Calculate Pst for each gene (min 4 loci per gene)

First, calculate Pst of average % methylation within genes and +/- 2kb. Only look genes that have at minimum 4 methylated loci (this was done in a previous chunk). I need the % methylation data for each of the gene regions, so I’ll first create that dataframe.

Run Pst

DMG analysis with relaxed loci settings

Filter for genes with at minimum 3 methylated loci

Here is the number of genes that remain after filtering for those with 4 or more methylated loci within each gene region:

Run GLM to test for differences among population for each gene individually

Extract only genes that were differentially methylated (3 loci min) (p-adj < 0.05):

Here is the number of differentially methylated genes (3 loci min):

Extract GO terms for DMGs_2kbslop_3loci and save to file

Calculate Pst for each gene (min 3 loci per gene)

First, calculate Pst of average % methylation within genes and +/- 2kb. Only look genes that have at minimum 3 methylated loci (this was done in a previous chunk). I need the % methylation data for each of the gene regions, so I’ll first create that dataframe.

Run Pst

Calculate Pst for all genes represented in our methylation data

First, how many gene regions are there?

I need the % methylation data for each of the gene regions, so I’ll first create that dataframe.

Run Pst

Examine methylation patters by gene location

Calculate the relative location of methylation sites across a gene

Use bedtools `slop` to include 2kb regions on either side of gene regions