1 Background

This Rmd file uses Bismark (Krueger and Andrews 2011) to create a bisulfite-converted genome from the Porites_evermanni_v1.fa file, for use in later Bismark alignments. The genome FastA was taken from the Genoscope corals webpage (https://www.genoscope.cns.fr/corals/genomes.html).

Because the output files are large, they cannot be synced to GitHub. Instead, the output directory is compressed into a gzipped tarball, which is available here:

  • https://gannet.fish.washington.edu/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome.tar.gz (1.5GB)
  • https://gannet.fish.washington.edu/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome.tar.gz.md5

2 Inputs

  • Directory containing a FastA file with the extension .fa or .fasta (optionally gzip-compressed, i.e. .fa.gz or .fasta.gz).
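
Before running the genome preparation, it can help to confirm that the input directory actually holds a file with one of the accepted extensions. The following is a minimal sketch only: the helper name `find_fasta` and the throwaway demo directory are assumptions for illustration and are not part of this notebook's workflow.

```shell
# Hypothetical helper: list files in a directory that carry an accepted FastA extension
find_fasta() {
  find "$1" -maxdepth 1 -type f \
    \( -name '*.fa' -o -name '*.fasta' -o -name '*.fa.gz' -o -name '*.fasta.gz' \) \
    | sort
}

# Demonstration on a throwaway directory (empty placeholder files, not real genomes)
demo_dir="$(mktemp -d)"
touch "${demo_dir}/Porites_evermanni_v1.fa" "${demo_dir}/notes.txt"

fasta_files="$(find_fasta "${demo_dir}")"
# Count non-empty lines, i.e. the number of matching files
fasta_count="$(printf '%s\n' "${fasta_files}" | grep -c .)"

echo "FastA files found: ${fasta_count}"
```

In the real workflow the same check could be pointed at `${genome_dir}` after sourcing `.bashvars`.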

3 Outputs

  • CT conversion

    • Bowtie2 index files.
    • CT conversion FastA.
  • GA conversion

    • Bowtie2 index files.
    • GA conversion FastA.

4 Create a Bash variables file

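Each Bash chunk in an R Markdown document runs in its own shell, so variables set in one chunk are not visible in the next. Writing the variables to a .bashvars file lets every later chunk load them with source.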

{
echo "#### Assign Variables ####"
echo ""

echo "# Data directories"
echo 'export timeseries_dir=/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular'
echo 'export output_dir_top=${timeseries_dir}/E-Peve/data'
echo 'export genome_dir=${timeseries_dir}/E-Peve/data'
echo ""

echo "# Paths to programs"
echo 'export programs_dir="/home/shared"'
echo 'export bismark_dir="${programs_dir}/Bismark-0.24.0"'
echo 'export bowtie2_dir="${programs_dir}/bowtie2-2.4.4-linux-x86_64"'
echo ""

echo "# Set number of CPUs to use"
echo 'export threads=20'
echo ""

echo "# Print formatting"
echo 'export line="--------------------------------------------------------"'
echo ""
} > .bashvars

cat .bashvars
#### Assign Variables ####

# Data directories
export timeseries_dir=/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular
export output_dir_top=${timeseries_dir}/E-Peve/data
export genome_dir=${timeseries_dir}/E-Peve/data

# Paths to programs
export programs_dir="/home/shared"
export bismark_dir="${programs_dir}/Bismark-0.24.0"
export bowtie2_dir="${programs_dir}/bowtie2-2.4.4-linux-x86_64"

# Set number of CPUs to use
export threads=20

# Print formatting
export line="--------------------------------------------------------"
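
The single quotes in the `echo` lines above matter: they write `${timeseries_dir}` into the file as literal text, and the variable is only expanded when the file is sourced. A small sketch of that behavior follows, using placeholder names (`demo_top`, `demo_data`) and a temporary file that are assumptions for illustration only.

```shell
# Write a small variables file to a temporary location (placeholder paths)
vars_file="$(mktemp)"
{
echo 'export demo_top=/tmp/demo_project'
echo 'export demo_data=${demo_top}/data'
} > "${vars_file}"

# The file holds the literal text ${demo_top}; nothing has been expanded yet
literal_line="$(grep 'demo_data' "${vars_file}")"

# A fresh shell (like a new R Markdown chunk) resolves the value when it sources the file
resolved="$(bash -c 'source "$1"; echo "${demo_data}"' _ "${vars_file}")"

echo "Stored:   ${literal_line}"
echo "Resolved: ${resolved}"
```

Because expansion is deferred, changing `timeseries_dir` in `.bashvars` is enough to update every path derived from it.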

5 Bisulfite conversion

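# Load bash variables into memory
source .bashvars

# Build the CT- and GA-converted genomes and their Bowtie2 indexes.
# Note: `1>` redirects stdout to the log file below, despite its .stderr name;
# messages written to stderr are not captured by this redirect.
${bismark_dir}/bismark_genome_preparation \
${genome_dir} \
--parallel ${threads} \
--bowtie2 \
--path_to_aligner ${bowtie2_dir} \
1> ${genome_dir}/Peve-bs-genome.stderr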
Using 20 threads for the top and bottom strand indexing processes each, so using 40 cores in total
Writing bisulfite genomes out into a single MFA (multi FastA) file

Bisulfite Genome Indexer version v0.24.0 (last modified: 19 May 2022)

Step I - Prepare genome folders - completed



Step II - Genome bisulfite conversions - completed


Bismark Genome Preparation - Step III: Launching the Bowtie 2 indexer
Building a SMALL index
Building a SMALL index

=========================================

Parallel genome indexing complete. Enjoy!
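
As the first line of the log above shows, the `--parallel` value is applied to the top and bottom strand indexing processes separately, so the total core usage is twice the value given (20 threads became 40 cores). A rough sketch for choosing a value that stays within a core budget is below; the helper name `parallel_for_cores` is hypothetical and not part of Bismark.

```shell
# Hypothetical helper: largest --parallel value whose doubled core use fits the budget
parallel_for_cores() {
  cores="$1"
  n=$(( cores / 2 ))
  # Always allow at least one thread per strand
  if [ "${n}" -lt 1 ]; then
    n=1
  fi
  echo "${n}"
}

threads_for_40="$(parallel_for_cores 40)"
threads_for_1="$(parallel_for_cores 1)"

echo "40 cores -> --parallel ${threads_for_40}"
echo "1 core   -> --parallel ${threads_for_1}"
```

On a Linux machine with GNU coreutils, `parallel_for_cores "$(nproc)"` would base the value on the number of available cores.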

5.1 Inspect BS output

# Load bash variables into memory
source .bashvars

tree -h ${genome_dir}/Bisulfite_Genome
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome
├── [4.0K]  CT_conversion
│   ├── [184M]  BS_CT.1.bt2
│   ├── [134M]  BS_CT.2.bt2
│   ├── [442K]  BS_CT.3.bt2
│   ├── [134M]  BS_CT.4.bt2
│   ├── [184M]  BS_CT.rev.1.bt2
│   ├── [134M]  BS_CT.rev.2.bt2
│   └── [586M]  genome_mfa.CT_conversion.fa
└── [4.0K]  GA_conversion
    ├── [184M]  BS_GA.1.bt2
    ├── [134M]  BS_GA.2.bt2
    ├── [442K]  BS_GA.3.bt2
    ├── [134M]  BS_GA.4.bt2
    ├── [184M]  BS_GA.rev.1.bt2
    ├── [134M]  BS_GA.rev.2.bt2
    └── [586M]  genome_mfa.GA_conversion.fa

2 directories, 14 files
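
The listing above can also be checked programmatically. The sketch below assumes the file names shown in the tree output (small `.bt2` indexes; a genome large enough to need a large index would use a different extension) and runs against placeholder files in a temporary directory. The helper names `expected_bs_files` and `count_missing_bs_files` are hypothetical.

```shell
# Hypothetical helper: list the 14 expected files, relative to Bisulfite_Genome/
expected_bs_files() {
  for conv in CT GA; do
    echo "${conv}_conversion/genome_mfa.${conv}_conversion.fa"
    for suffix in 1 2 3 4 rev.1 rev.2; do
      echo "${conv}_conversion/BS_${conv}.${suffix}.bt2"
    done
  done
}

# Hypothetical helper: print how many expected files are missing or empty
count_missing_bs_files() {
  bs_dir="$1"
  missing=0
  for rel in $(expected_bs_files); do
    if [ ! -s "${bs_dir}/${rel}" ]; then
      echo "Missing or empty: ${bs_dir}/${rel}" >&2
      missing=$((missing + 1))
    fi
  done
  echo "${missing}"
}

# Demonstration with small placeholder files standing in for the real outputs
demo_dir="$(mktemp -d)"
mkdir -p "${demo_dir}/CT_conversion" "${demo_dir}/GA_conversion"
for rel in $(expected_bs_files); do
  echo "placeholder" > "${demo_dir}/${rel}"
done

expected_count="$(expected_bs_files | wc -l | tr -d ' ')"
missing_complete="$(count_missing_bs_files "${demo_dir}")"

# Remove one index file to show that the check notices it
rm "${demo_dir}/GA_conversion/BS_GA.rev.2.bt2"
missing_after_rm="$(count_missing_bs_files "${demo_dir}" 2> /dev/null)"

echo "Expected files: ${expected_count}"
echo "Missing (complete set): ${missing_complete}"
echo "Missing (one removed):  ${missing_after_rm}"
```

For the real output, the check would be `count_missing_bs_files "${genome_dir}/Bisulfite_Genome"`.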

5.2 Compress output folder

source .bashvars

tar -czvf ${genome_dir}/Bisulfite_Genome.tar.gz ${genome_dir}/Bisulfite_Genome
tar: Removing leading `/' from member names
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/CT_conversion/
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/CT_conversion/BS_CT.4.bt2
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/CT_conversion/BS_CT.rev.1.bt2
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/CT_conversion/genome_mfa.CT_conversion.fa
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/CT_conversion/BS_CT.1.bt2
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/CT_conversion/BS_CT.2.bt2
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/CT_conversion/BS_CT.3.bt2
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/CT_conversion/BS_CT.rev.2.bt2
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/GA_conversion/
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/GA_conversion/BS_GA.rev.1.bt2
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/GA_conversion/BS_GA.2.bt2
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/GA_conversion/BS_GA.4.bt2
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/GA_conversion/genome_mfa.GA_conversion.fa
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/GA_conversion/BS_GA.rev.2.bt2
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/GA_conversion/BS_GA.1.bt2
/home/shared/8TB_HDD_01/sam/gitrepos/urol-e5/timeseries_molecular/E-Peve/data/Bisulfite_Genome/GA_conversion/BS_GA.3.bt2
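
Because the command above passes an absolute path, tar strips the leading `/` and stores the full `home/shared/...` directory chain in each member name. One alternative, shown here only as a sketch on a throwaway directory and not as the command that was run for this notebook, is to use `-C` so the archive contains just `Bisulfite_Genome/` and its contents.

```shell
# Build a tiny stand-in for the Bisulfite_Genome directory (placeholder content)
work_dir="$(mktemp -d)"
mkdir -p "${work_dir}/Bisulfite_Genome/CT_conversion"
echo "placeholder" > "${work_dir}/Bisulfite_Genome/CT_conversion/BS_CT.1.bt2"

# -C changes into the parent directory first, so member names are stored as relative paths
tar -czf "${work_dir}/Bisulfite_Genome.tar.gz" -C "${work_dir}" Bisulfite_Genome

# List the archive members to confirm how they are named
members="$(tar -tzf "${work_dir}/Bisulfite_Genome.tar.gz")"
echo "${members}"
```

An archive made this way extracts to `Bisulfite_Genome/` in the current directory rather than recreating the original directory tree.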

5.3 Create MD5sum

source .bashvars

cd ${genome_dir}

md5sum Bisulfite_Genome.tar.gz | tee Bisulfite_Genome.tar.gz.md5
6e51804f328dff149acd17fababc0272  Bisulfite_Genome.tar.gz
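
The saved `.md5` file lets anyone who downloads the tarball confirm that the copy is intact using `md5sum -c`. The sketch below demonstrates this on a small placeholder file in a temporary directory; the file name `demo.tar.gz` is an assumption for illustration.

```shell
demo_dir="$(mktemp -d)"
cd "${demo_dir}"

# Placeholder standing in for Bisulfite_Genome.tar.gz
echo "placeholder archive contents" > demo.tar.gz
md5sum demo.tar.gz > demo.tar.gz.md5

# md5sum -c recomputes the checksum and compares it with the stored value
if md5sum -c demo.tar.gz.md5; then
  intact_status="match"
else
  intact_status="mismatch"
fi

# Simulate a corrupted download by altering the file, then verify again
echo "extra bytes" >> demo.tar.gz
if md5sum -c demo.tar.gz.md5 2> /dev/null; then
  altered_status="match"
else
  altered_status="mismatch"
fi

echo "Before alteration: ${intact_status}"
echo "After alteration:  ${altered_status}"
```

For the real files, the equivalent check is `md5sum -c Bisulfite_Genome.tar.gz.md5`, run in the directory that holds both the tarball and the checksum file.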

REFERENCES

Krueger, Felix, and Simon R. Andrews. 2011. “Bismark: A Flexible Aligner and Methylation Caller for Bisulfite-Seq Applications.” Bioinformatics 27 (11): 1571–72. https://doi.org/10.1093/bioinformatics/btr167.