{
"metadata": {
"name": "",
"signature": "sha256:31db6795116b3a0c014e696531ca6a470dba3f41c0d2f296a468f2f2ff275495"
},
"nbformat": 3,
"nbformat_minor": 0,
"worksheets": [
{
"cells": [
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"GOAL: Figure out how I made the following file"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!head /Volumes/web/trilobite/Crassostrea_gigas_v9_tracks/Cgigas_v9_TE.gff"
],
"language": "python",
"metadata": {},
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Once I determine this, I might update the software and databases."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"via Oyster Genome Paper"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cd /Volumes/Bay3/Software"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"/Volumes/Bay3/Software\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cd RepeatMasker"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"/Volumes/Bay3/Software/RepeatMasker\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ls"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"ArrayList.pm* RepeatMaskerConfig.tmpl*\r\n",
"ArrayListIterator.pm* RepeatProteinMask*\r\n",
"CrossmatchSearchEngine.pm* SearchEngineI.pm*\r\n",
"DateRepeats* SearchResult.pm*\r\n",
"DeCypherSearchEngine.pm* SearchResultCollection.pm*\r\n",
"DupMasker* SeqDBI.pm*\r\n",
"FastaDB.pm* SimpleBatcher.pm*\r\n",
"HTMLAnnotHeader.html TRF.pm*\r\n",
"INSTALL TRFResult.pm*\r\n",
"Libraries/ Taxonomy.pm*\r\n",
"LineHash.pm* WUBlastSearchEngine.pm*\r\n",
"Matrices/ WUBlastXSearchEngine.pm*\r\n",
"NCBIBlastSearchEngine.pm* bluegrad.jpg\r\n",
"ProcessRepeats* configure*\r\n",
"PubRef.pm* daterepeats.help\r\n",
"README license.txt\r\n",
"RM_15585.FriNov301316572012/ out/\r\n",
"RM_28601.FriNov301852042012/ oyster.v9_pls.fa\r\n",
"RepbaseEMBL.pm* oysterv9_90.fa\r\n",
"RepbaseRecord.pm* repeatmasker.help\r\n",
"RepeatAnnotationData.pm* rmblastdb.log\r\n",
"RepeatMasker* taxonomy.dat\r\n",
"RepeatMaskerConfig.pm util/\r\n"
]
}
],
"prompt_number": 5
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Running RepeatMasker"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!./RepeatMasker -help"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"RepeatMasker version open-3.3.0\r\n",
"######################################################################\r\n",
"RepeatMasker\r\n",
"Developed by Arian Smit and Robert Hubley\r\n",
"Please refer to: Smit, AFA, Hubley, R. & Green, P \"RepeatMasker\" at\r\n",
"http://www.repeatmasker.org\r\n",
" \r\n",
"The interspersed repeat databases are modified versions of \r\n",
"those found in \"RepBase Update\" (http://www.girinst.org/)\r\n",
"######################################################################\r\n",
"\r\n",
"\r\n",
"RepeatMasker is a program that screens DNA sequences for interspersed\r\n",
"repeats and low complexity DNA sequences. The output of the program is\r\n",
"a detailed annotation of the repeats that are present in the query\r\n",
"sequence as well as a modified version of the query sequence in which\r\n",
"all the annotated repeats have been masked (default: replaced by\r\n",
"Ns). Sequence comparisons in RepeatMasker are performed by the program\r\n",
"cross_match, an efficient implementation of the Smith-Waterman-Gotoh\r\n",
"algorithm developed by Phil Green, or by WU-Blast developed by Warren\r\n",
"Gish.\r\n",
"\r\n",
"\r\n",
"This help file discusses the following topics:\r\n",
"\r\n",
"0 Basic input and output\r\n",
"\r\n",
"1 Options\r\n",
"1.1 Species and contamination check options\r\n",
"1.2 Options effecting which repeats get masked\r\n",
"1.3 Speed and search parameters\r\n",
"1.4 Output and formatting\r\n",
"1.5 ProcessRepeats options\r\n",
"\r\n",
"2 Methodology and quality of output\r\n",
"2.1 Methodology\r\n",
"2.2 Scoring matrices\r\n",
"2.3 Databases\r\n",
"2.4 Sensitivity and speed\r\n",
"2.5 Selectivity and matches to coding sequences\r\n",
"2.6 Low complexity DNA and simple repeats\r\n",
"\r\n",
"3 How to read the results\r\n",
"3.1 The annotation (.out) file\r\n",
"3.2 Alignments\r\n",
"3.3 The summary (.tbl) file\r\n",
"\r\n",
"4 Applications\r\n",
"4.1 Use in database searches\r\n",
"4.2 Identification of DNA source and bacterial insertions\r\n",
"4.3 DateRepeats - Masking lineage-specific repeats for genomic alignments\r\n",
"4.4 Use with gene prediction programs and other applications\r\n",
"\r\n",
"5 References\r\n",
"\r\n",
"\r\n",
"0 INPUT and OUTPUT\r\n",
"\r\n",
"Input format:\r\n",
"\r\n",
"Sequences have to be in the ' FASTA format':\r\n",
"\r\n",
">sequencename all kind of info\r\n",
"AGCGATCGCATCGAGCGCATTCGCATGGGG\r\n",
">sequencename2 all kind of info\r\n",
"GCCCATGCGATCGAGCTTCGCTAGCATAGCGATCA\r\n",
"\r\n",
"The program accepts FASTA format with errors and raw sequence files,\r\n",
"but does not work with other formats like GenBank, Staden, etc..\r\n",
"\r\n",
"You can use RepeatMasker on a file containing multiple FASTA format\r\n",
"sequences and on multiple sequence files at the same time:\r\n",
"\r\n",
"RepeatMasker *.fasta\r\n",
"\r\n",
"This command will mask all files that end with .fasta in the current\r\n",
"directory and give separate reports for each file. Note that if you\r\n",
"have multiple small sequences it is considerably faster to run\r\n",
"RepeatMasker on one batch file than on many single sequence files. The\r\n",
"summary file will be more informative as well. However, analysis on\r\n",
"single files (when larger than 2 kb each) can be slightly more\r\n",
"accurate, since GC levels for each sequence will be calculated and\r\n",
"used to choose appropriate parameters.\r\n",
"\r\n",
"\r\n",
"Standard output:\r\n",
"\r\n",
"RepeatMasker returns a .masked file containing the query sequence(s)\r\n",
"with all identified repeats and low complexity sequences masked. These\r\n",
"masked sequences are listed and annotated in the .out file. The masked\r\n",
"sequences are printed in the same order as they are in the submitted\r\n",
"file, whereas the sequences are presented alphabetically in the\r\n",
"annotation table. The .tbl file is a summary of the repeat content of\r\n",
"the analyzed sequence.\r\n",
"\r\n",
"\r\n",
"\r\n",
"1 OPTIONS\r\n",
"\r\n",
"1.1 Species options\r\n",
"\r\n",
"-species Indicate source species of query DNA\r\n",
"\r\n",
"-lib [filename] Allows the use of a custom library\r\n",
"\r\n",
"contamination checking options\r\n",
"-is_only only clips E coli insertion elements out of FASTA and .qual files\r\n",
"-is_clip clips IS elements before analysis (default: IS only reported)\r\n",
"-no_is skips bacterial insertion element check\r\n",
"-rodspec only checks for rodent specific repeats (no RepeatMasker run)\r\n",
"-primspec only checks for primate specific repeats (no RepeatMasker run)\r\n",
"\r\n",
"For detailed explanation of the contamination detection options, see\r\n",
"\"4.2 Identification of DNA source\" below.\r\n",
"\r\n",
"\r\n",
"-spec\r\n",
"\r\n",
"Interspersed repeats mostly are copies of transposable elements in\r\n",
"different states of erosion. Thus, dependent on the time of activity\r\n",
"of the source transposable element, interspersed repeats generally are\r\n",
"specific to a (clade of) species, and different redatabase\r\n",
"(http://www.ncbi.nlm.nih.gov/Taxonomy/taxonomyhome.html). In\r\n",
"principal, all unique clade names occurring in this database can be\r\n",
"used. Examples are:\r\n",
"\r\n",
"-species \"sus scrofa\"\r\n",
"-species chimpanzee\r\n",
"-species arabidopsis\r\n",
"-species canidae\r\n",
"-species mammals\r\n",
"\r\n",
"Capitalization is ignored, multiple words need to bound by apostrophes.\r\n",
"\r\n",
"RepeatMasker builds one or more repeat consensus files the first time\r\n",
"a species/group has been chosen, or when a new database has been\r\n",
"downloaded. These will be written in a subdirectory of the Libraries\r\n",
"directory named after the date of the repeat database version and the\r\n",
"Latin name of the clade. For example, \"-species monocotyledons\"\r\n",
"creates the file\r\n",
"\"..../RepeatMasker/Libraries/20040616/liliopsida/specieslib\". \r\n",
"Currently, only for mammalian species multiple files are created,\r\n",
"bearing names like \"shortcutlib\" and \"longlib\", which the queries are\r\n",
"compared to sequentially.\r\n",
"\r\n",
"The creation of these files takes some time (a few seconds sometimes),\r\n",
"but the next times RepeatMasker is run on the same species these\r\n",
"existing files will be used. When Wu-BlAST is used as the search\r\n",
"engeine (see 1.3), blastable libraries are built, again as a one time\r\n",
"event for each species.\r\n",
"\r\n",
"After multiple database updates, the libraries could hog some space,\r\n",
"and you may consider deleting the older\r\n",
"\"..../RepeatMasker/Libraries/\" directories.\r\n",
"\r\n",
"The files contain all repeats of the RepeatMasker database that have\r\n",
"been found in the genome of the given species, or have been found in a\r\n",
"related species and are thought to predate the speciation time of the\r\n",
"two. For example, -species gorilla, will create a gorilla repeat file\r\n",
"that is almost as big as the human file, because almost all repeats in\r\n",
"human predate the 6-10 million years that separates us from the\r\n",
"gorilla, though none of the consensus sequences have been derived from\r\n",
"Gorilla DNA. A repeat file for hyraxes, for which order no repeats\r\n",
"have been submitted to the database yet, will contain all repeats\r\n",
"found in the human genome that are thought to be older than the origin\r\n",
"of most mammalian orders.\r\n",
"\r\n",
"If a group of species is indicated, all repeats are included that are\r\n",
"found in any species belonging to this clade. Thus, \"-species diptera\"\r\n",
"leads to comparison against repeats found in the genomes of any\r\n",
"diptera species, currently primarily represented by fruitfly and\r\n",
"mosquitoes, and \"-species murinae\" compares the query to all known\r\n",
"murine repeats, including rat and mouse.\r\n",
"\r\n",
"Not all \"common\" English names occur in the taxonomy database. For\r\n",
"example, \"chimp\", \"squirrels\", \"grasses\", or \"carnivores\" are not\r\n",
"present. The program will suggest functional names using Soundex, with\r\n",
"oftentimes unexpected results. Using Latin names is always safest.\r\n",
"\r\n",
"\r\n",
"util/queryRepeatDatabase.pl\r\n",
"\r\n",
"The script queryRepeatDatabase.pl in the util subdirectory of the\r\n",
"RepeatMasker directory allows you to check if a species is covered and\r\n",
"which repeats the query will be compared to if the species indication\r\n",
"is used. For example, \r\n",
"\r\n",
"util/queryRepeatDatabase.pl -species sorghum -stat\r\n",
"\r\n",
"shows that, besides the universal simple repeat and bacterial\r\n",
"insertion elements contamination checks the query will be compared to\r\n",
"only 4 sorghum specific repeats (some of the many maize/corn specific\r\n",
"repeats may also occur in sorghum, but this has not yet been studied).\r\n",
"Type queryRepeatDatabase.pl for further options with this script.\r\n",
"\r\n",
"These are the numbers and bp of repeat consensus sequences (excluding\r\n",
"simple repeats and RNAs) as of May 2009 for the best represented clades\r\n",
"\r\n",
"species # of consensi total bp\r\n",
"All mammals combined 3081 4253979\r\n",
"Primates * 585 902148\r\n",
"Rodents * 606 931299\r\n",
"Carnivores * 130 158362\r\n",
"Perissodactyls * 130 220814\r\n",
"Ruminants * 112 130320\r\n",
"Bats * 131 112724\r\n",
"Marsupials 554 863923\r\n",
"Monotremes 102 159182\r\n",
"Birds 425 644078\r\n",
"Amphibia (mostly frog) 230 428828 \r\n",
"Teleost fish 1140 2807233\r\n",
"Tunicates 134 368438 \r\n",
"Sea urchins 211 560185\r\n",
"Flies 306 906766\r\n",
"Mosquitos 363 914943\r\n",
"Other insects 356 1080649\r\n",
"Nematodes 461 698036\r\n",
"Flatworms 209 641758\r\n",
"Cnidarians 911 3057775 \r\n",
"Fungi 256 695278\r\n",
"Arabidopsis 544 1460558\r\n",
"Other dicot plants 742 2563646\r\n",
"Rice 575 1430176\r\n",
"Maize / corn 439 1566688\r\n",
"Other monocot plants 303 912057\r\n",
"Algae 186 533952\r\n",
"\r\n",
"* Only order-specific elements; these genomes are also matched to 400+\r\n",
"consensus sequences for elements active before the origin of orders.\r\n",
"\r\n",
"\r\n",
"-lib \r\n",
"\r\n",
"The majority of species are of course not yet covered in the repeat\r\n",
"databases and many are far from complete, but you may have your own\r\n",
"collection. At other times you may want to mask or study only a\r\n",
"particular type of repeat.\r\n",
"\r\n",
"For these types of siutations, you can use the -lib option to\r\n",
"specify a custom library of sequences to be masked in the query. The\r\n",
"library file needs to contain sequences in FASTA format. Unless a full\r\n",
"path is given on the command line the file is assumed to be in the\r\n",
"same directory as the sequence file. \r\n",
"\r\n",
"The recommended format for IDs in a custom library is:\r\n",
"\r\n",
">repeatname#class/subclass\r\n",
"or simply\r\n",
">repeatname#class\r\n",
"\r\n",
"In this format, the data will be processed (overlapping repeats are\r\n",
"merged etc), alternative output (.ace or .gff) can be created and an\r\n",
"overview .tbl file will be created. Classes that will be displayed in\r\n",
"the .tbl file are 'SINE', 'LINE', 'LTR', 'DNA', 'Satellite', anything\r\n",
"with 'RNA' in it, 'Simple_repeat', and 'Other' or 'Unknown' (the\r\n",
"latter defaults when class is missing). Subclasses are plentiful. They\r\n",
"are not all tabulated in the .tbl file or necessarily spelled\r\n",
"identically as in the repeat files, so check the RepeatMasker.embl\r\n",
"file for names that can be parsed into the .tbl file.\r\n",
"\r\n",
"You can combine the repeats available in the RepeatMasker library \r\n",
"with a custom set of consensus sequences. To accomplish this \r\n",
"use the queryRepeatDatabase.pl tool provided in the util \r\n",
"directory of the RepeatMasker distribution. [ Running the program\r\n",
"without any options will print the documentation to the screen. ]\r\n",
"Use this tool to extract RepeatMasker sequences and concatenate\r\n",
"them to your custom sequences in a new library file.\r\n",
" \r\n",
"\r\n",
"1.2 Masking options (options that determine what kind of repeats are masked)\r\n",
"\r\n",
"-cutoff [number] sets cutoff score for masking repeats when using -lib\r\n",
" (default cutoff 225)\r\n",
"-nolow does not mask low complexity DNA or simple repeats\r\n",
"-l(ow) same as nolow (historical)\r\n",
"-(no)int only masks low complex/simple repeats (no interspersed repeats)\r\n",
"-alu only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)\r\n",
"-div [number] masks only those repeats that are less than [number] percent\r\n",
" diverged from the consensus sequence\r\n",
"\r\n",
"-cutoff\r\n",
"When using a local library you may want to change the minimum score\r\n",
"for reporting a match. The default is 225, lowering it below 200 will\r\n",
"usually start to give you significant numbers of false matches,\r\n",
"raising it to 250 will guarantee that all matches are real. Note that\r\n",
"low complexity regions in otherwise complex repeat sequences in your\r\n",
"library are most likely to give false matches.\r\n",
"\r\n",
"\r\n",
"-nolow / -l(ow)\r\n",
"With the option -nolow or -l(ow) only interspersed repeats are\r\n",
"masked. By default simple tandem repeats and low complexity\r\n",
"(polypurine, AT-rich) regions are masked besides the interspersed\r\n",
"repeats. For database searches the default setting is recommended, but\r\n",
"sometimes, e.g. when using the masked sequence to predict the presence\r\n",
"of exons, it may be better to skip the low complexity masking.\r\n",
"\r\n",
"\r\n",
"-noint / -int\r\n",
"\r\n",
"When using the -noint or -int option only low complexity DNA and\r\n",
"simple repeats will be masked in the query sequence.\r\n",
"Inexact simple repeats may be spanned and hidden by an interspersed\r\n",
"repeat annotation. In particular, most A-rich simple repeats derived\r\n",
"from the poly A tails of SINEs and LINES are merged with the\r\n",
"annotation of the SINE or LINE (i.e. you can't tell there is a simple\r\n",
"repeat). Thus, if you're interested in finding the location of\r\n",
"potentially polymorphic simple repeats, this option is recommended.\r\n",
"\r\n",
"\r\n",
"-norna\r\n",
"Because of their close similarity to SINEs and the abundance of some\r\n",
"of their pseudogenes, RepeatMasker by default screens for matches to\r\n",
"small pol III transcribed RNAs (mostly tRNAs and snRNAs). When you're\r\n",
"interested in small RNA genes, you should use the -norna option that\r\n",
"leaves these sequences unmasked, while still masking SINEs.\r\n",
"\r\n",
"\r\n",
"-alu\r\n",
"-div\r\n",
"You can limit the masking and annotation to (primate) Alu repeats with\r\n",
"the -alu option and to a subset of less diverged (younger) repeats\r\n",
"with the option -div. For example,\r\n",
"\r\n",
"\"RepeatMasker -div 20 -mus mysequence\"\r\n",
"\r\n",
"will mask only those rodent repeats and simple repeats that are less\r\n",
"than 20% diverged from the consensus sequence and\r\n",
"\r\n",
"\"RepeatMasker -div 10 -alu mysequence\"\r\n",
"\r\n",
"will mask Alus that are less than 10% diverged from the Alu consensus\r\n",
"sequences and no other repeats.\r\n",
"\r\n",
"The -div option may be used to limit the masking to those repeats that\r\n",
"are specific to a species group for use in subsequent comparison of\r\n",
"orthologous genomic loci. Notice that a more sophisticated method to\r\n",
"mask lineage-specific repeats (currently only in mammals) is now\r\n",
"available with the script DateRepeats (4.3).\r\n",
"\r\n",
"\r\n",
"\r\n",
"1.3 Options effecting speed and search parameters\r\n",
"\r\n",
"-q Quick search; 5-10% less sensitive, 3-4 times faster than default\r\n",
"-qq Rush job; about 10% less sensitive,\r\n",
"-s Slow search; 0-5% more sensitive, 2.5 times slower than default.\r\n",
"-pa(rallel) [number] \r\n",
" Number of processors to use in parallel (only works for \r\n",
" batch files or sequences larger than 50 kb)\r\n",
"-engine [crossmatch|wublast|decypher] \r\n",
" Select a non-default search engine to use. If not specified\r\n",
" RepeatMasker will use the default configured at install time.\r\n",
"-w(ublast) Use WU-blast, rather than cross_match as engine\r\n",
" **DEPRECATED** Use -engine [crossmatch|wublast|decypher] now.\r\n",
"-frag [number] Maximum sequence length masked without fragmenting \r\n",
" (default 40000).\r\n",
"-maxsize [nr] Maximum length for which IS- or repeat clipped sequences \r\n",
" can be produced (default 4000000). Memory requirements go \r\n",
" up with higher maxsize.\r\n",
"-gc [number] Use matrices calculated for 'number' percentage background \r\n",
" GC level.\r\n",
"-gccalc Program calculates the GC content even for batch files/small \r\n",
" sequences.\r\n",
"-nocut Skips the steps in which repeats are excised.\r\n",
"-noisy Prints cross_match progress report to screen (defaults to \r\n",
" .stderr file)\r\n",
"\r\n",
"-s -q -qq\r\n",
"RepeatMasker can be run at four different sensitivity/speed levels,\r\n",
"with the option -q providing quick (less sensitive) and -s slow\r\n",
"(sensitive) results compared to default. The option -qq has been added\r\n",
"for when you're in a frightful hurry. Each higher gear is about 2-3\r\n",
"times faster, and 90% as sensitive as the next lower gear. See \"2.4\r\n",
"Sensitivity and Speed\" below for details\r\n",
"\r\n",
"\r\n",
"-w(ublast)\r\n",
" **DEPRECATED** See -engine.\r\n",
"\r\n",
"-engine [crossmatch|wublast|decypher]\r\n",
"By default, RepeatMasker uses the search engine configured\r\n",
"during installation as the default. To use the non-default\r\n",
"search engine you can specify it with the -engine parameter.\r\n",
"\r\n",
"Before June 2004, the script MaskerAid (written by Joey Bedell, Ian\r\n",
"Korf and Warren Gish at the St Louis Washington University Genome\r\n",
"Center) was necessary to use WU-BLAST with RepeatMasker, but that\r\n",
"functionality is now built in. RepeatMasker includes a search engine\r\n",
"object that allows relatively straightforward integration of other\r\n",
"search engines. Currently only WU-BLAST has the flexibility to accept\r\n",
"all cross_match options.\r\n",
"\r\n",
"For longer sequences, default RepeatMasker runs with WU-BLAST take\r\n",
"about as long as cross_match powered runs at -qq settings (see \"2.4\r\n",
"Sensitivity and speed\"). The speed settings have relatively little\r\n",
"effect on the speed when using WU-BLAST, with the fastest settings\r\n",
"1.25-1.75 as fast as the slowest settings, while the sensitivity\r\n",
"increases significantly. Thus, I recommend to always run RepeatMasker\r\n",
"in sensitive (-s) or default mode when using WU-BLAST. I've made the\r\n",
"difference in parameters between sensitive and default settings larger\r\n",
"at -w settings, to make these speed options more meaningful and gain\r\n",
"more sensitivity (with little cost in speed).\r\n",
"\r\n",
"Even with these more extreme parameters, the sensitivity can't quite\r\n",
"reach that of the sensitive settings using cross_match, but it comes\r\n",
"very close, and the huge difference in speed make this option\r\n",
"very attractive.\r\n",
"\r\n",
"The output format with the -w option is identical to default and\r\n",
"scores are comparable, as the same complexity adjustment is applied.\r\n",
"The only difference is that, when using the wublast option, hyphens\r\n",
"in the sequence are retained (in default mode all non-letters were\r\n",
"deleted from the sequence). WU-BLAST uses hyphens to indicate\r\n",
"insurmountable barriers and alignments will not span hyphens.\r\n",
"\r\n",
"\r\n",
"-pa(rallel)\r\n",
"For sequences over 50 kb long or files wit multiple sequences,\r\n",
"RepeatMasker can use multiple processors. When you type:\r\n",
"\r\n",
"RepeatMasker -par 10 \r\n",
"\r\n",
"A batch file of sequences will run with up to 10 sequences at the\r\n",
"time, until all sequences are done, while a file with one large\r\n",
"sequence will analyze the sequence in up to 10 fragments at the same\r\n",
"time. The minimum fragment size is 25 or 33 kb, the maximum 66 kb (all\r\n",
"sequences over 100 kb are divided in 33-66 kb fragments). For the\r\n",
"batch files no minimum size exists. Thus,\r\n",
"\r\n",
"If contains: RM runs in parallel:\r\n",
"one 60 kb sequence two 30 kb fragments\r\n",
"one 400 kb sequence ten 40 kb fragments\r\n",
"one 1 Mb sequence ten 50 kb fragments, twice\r\n",
"ten 500 bp sequences ten 500 bp sequences\r\n",
"two 500 kb sequences ten 50 kb fragments, twice\r\n",
"\r\n",
"Processing of the detected matches takes place after all batches or\r\n",
"fragments have been cross-matched with the databases. \r\n",
"Beware that, generally, you have a limited number of processor IDs\r\n",
"allotted. RepeatMasker uses 4 PIDs for each parallel job, so if you're\r\n",
"allotted 64 user PIDs, you can 'only' run 16 fragments/batches in\r\n",
"parallel.\r\n",
"\r\n",
"\r\n",
"-frag \r\n",
"Even when the -par option is not used, RepeatMasker transparently\r\n",
"fragments sequences over 40 kb in fragments of equal sizes with 1 kb\r\n",
"overlaps. Similarly, sequence batches containing more than 51 kb are\r",
"\r\n",
"subdivided in batches of 40 kb or less. The -frag option sets the\r\n",
"maximum fragment and batch size\r\n",
"\r\n",
"The only visible effect of the fragmentation is in the alignment\r\n",
"files, where alignments at the edges of the fragments can be\r\n",
"duplicated and/or truncated. The 1 kb overlap between fragments\r\n",
"almost guarantees that there is no loss in sensitivity at the\r\n",
"edges. Fragmentation initially was implemented to allow the size of\r\n",
"sequences and sequence batches to be unlimited. Cross_match can be\r\n",
"very memory intensive when SW alignments have to be performed in large\r\n",
"matrices. This may happen with short minmatch and large bandwidth\r\n",
"settings. Note that RepeatMasker should not croak when cross_match\r\n",
"runs out of memory; it will redo the failed search with a higher word\r\n",
"length or smaller bandwidth until it succeeds. However, this will lead\r\n",
"to gradually less sensitive comparisons.\r\n",
"\r\n",
"Fragmentation also can improve repeat detection when a genomic\r\n",
"sequence contains large regions of DNA with significantly different GC\r\n",
"levels (isochores), since sets of scoring matrices are chosen based on\r\n",
"the GC level of a fragment.\r\n",
"\r\n",
"Since April 2002 the maximum fragment size is hardwired to be half of\r\n",
"\"maxsize\" (see below).\r\n",
"\r\n",
"\r\n",
"-maxsize\r\n",
"To limit the memory requirements of the script an upper boundary to\r\n",
"the amount of sequence stored in a single array in the script is set\r\n",
"to 4 million bp. This parameter can be reduced with the -maxsize\r\n",
"option to a minimum of 500000, for severely memory-impaired computers.\r\n",
"\r\n",
"The size of maxsize further determines the largest length single\r\n",
"sequence from which E. coli insertion sequences and full-length\r\n",
"repeats can be clipped. Increase the size of maxsize to allow removal\r\n",
"of IS elements from larger sequences, like: RepeatMasker -is_clip\r\n",
"-maxsize 9999999999 muntjakchromosome1\r\n",
"\r\n",
"\r\n",
"-gc\r\n",
"-gccalc\r\n",
"Neutral mutation patterns differ significantly depending on the GC\r\n",
"richness of a locus and we have calculated optimal scoring matrices\r\n",
"for the alignment to consensus sequences in a range of background GC\r\n",
"levels (see 2.2). Usually, RepeatMasker calculates the percentage of\r\n",
"the sequence consisting of Gs and Cs and uses the appropriate\r\n",
"matrices. However, the program defaults to using 'average' 43% GC\r\n",
"matrices when the query is shorter than 2000 bp or a batch file is\r\n",
"analyzed. This is because short sequences can diverge greatly from the\r\n",
"GC level of the locus. For example, CpG islands and exons are more GC\r\n",
"rich than the surrounding DNA, whereas a LINE-1 element can be more AT\r\n",
"rich than the background. In a batch file, RepeatMasker analyses all\r\n",
"sequences together with the same matrices. The percentage GC in all\r\n",
"the sequences combined may be inappropriate for some sequence entries;\r\n",
"using high GC level matrices in AT rich sequences (and vice versa) may\r\n",
"result in false masking.\r\n",
"\r\n",
"One can override this behavior in two ways:\r\n",
"With the option -gc you can set the GC level to a certain percentage:\r\n",
"\r\n",
"RepeatMasker -gc 37 mybatchofsequences.fa\r\n",
"\r\n",
"lets the program use matrices appropriate for 37% GC background. The\r\n",
"batch could, for example, contain ESTs from a single locus with a\r\n",
"known GC level. \r\n",
"Alternatively, the -gccalc option forces RepeatMasker to use the\r\n",
"actual GC level of a short sequence or the average GC level of a batch\r\n",
"of sequences. The latter sequences, for example, may be contigs in a\r\n",
"sequencing project.\r\n",
"\r\n",
"\r\n",
"-nocut\r\n",
"The option -nocut skips a step in the default procedure for human and\r\n",
"rodent queries, in which full-length younger insert are spliced out of\r\n",
"the query to reconstruct a pre-insertion situation. RepeatMasker is\r\n",
"generally more sensitive and efficient including the deletion step as\r\n",
"it can unearth older repeats that were interrupted by these younger\r\n",
"elements.\r\n",
"\r\n",
"\r\n",
"\r\n",
"1.4 Output options\r\n",
"\r\n",
"-a shows the alignments in a .align output file; -ali(gnments) also works\r\n",
"-inv alignments are presented in the orientation of the repeat (with option -a)\r\n",
"\r\n",
"-cut saves a sequence (in file.cut) from which full-length repeats are excised\r\n",
" (temporarily disfunctional)\r\n",
"-small returns complete .masked sequence in lower case\r\n",
"-xsmall returns repetitive regions in lowercase (rest capitals) rather than masked\r\n",
"-x returns repetitive regions masked with Xs rather than Ns\r\n",
"\r\n",
"-poly reports simple repeats that may be polymorphic (in file.poly)\r\n",
"-ace creates an additional output file in ACeDB format\r\n",
"-gff creates an additional General Feature Finding format output\r\n",
"-u creates an untouched annotation file besides the manipulated file\r\n",
"-xm creates an additional output file in cross_match format (for parsing)\r\n",
"\r\n",
"-fixed creates an (old style) annotation file with fixed width columns\r\n",
"-no_id leaves out final column with unique ID for each element\r\n",
"-e(xcln) calculates repeat densities (in .tbl) excluding runs of >25 Ns in query\r\n",
"\r\n",
"-noisy prints cross_match progress report to screen (defaults to .stderr file)\r\n",
"\r\n",
"\r\n",
"-a / -ali(gnments) \r\n",
"-inv\r\n",
"Alignments are saved in a .align file when using the option -a. They\r\n",
"are shown in the orientation of the query sequence, unless you use the\r\n",
"option -inv as well, which will return alignments in the orientation\r\n",
"of the repeats (see 3.2 Alignments).\r\n",
"\r\n",
"\r\n",
"-cut\r\n",
"The -cut option to RepeatMasker is not supported in this release. It\r\n",
"will be rolled into a new annotation utility in the near future. If\r\n",
"you need this functionality sooner please send an email to Robert\r\n",
"Hubley ( rhubley@systemsbiology.org ). Thanks for your patience.\r\n",
"\r\n",
"The option made the program save a file \"file.cut\" which contains\r\n",
"an intermediate sequence in the masking progress. In this sequence all\r\n",
"full-length elements, young LINE-1 3' ends, and close to perfect simple\r\n",
"repeats were deleted.\r\n",
"\r\n",
"\r\n",
"-x\r\n",
"When -x is used the repeat sequences are replaced by Xs instead of\r\n",
"Ns. The latter allows one to distinguish the masked areas from\r\n",
"possibly existing ambiguous bases in the original sequence. However,\r\n",
"when running BLAST searches (and maybe other programs) Xs are deleted\r\n",
"out of the query and the returned BLAST matches will have position\r\n",
"numbers not necessarily corresponding to that of the original\r\n",
"sequence.\r\n",
"\r\n",
"\r\n",
"-xsmall\r\n",
"When the option -xsmall is used a sequence is returned in the .masked\r\n",
"file in which repeat regions are in lower case and non-repetitive\r\n",
"regions are in capitals.\r\n",
"\r\n",
"\r\n",
"-poly\r\n",
"You can get a list of potentially polymorphic microsatellites with the\r\n",
"option -poly. This is simply a subset of the list in .out, with\r\n",
"dimeric to tetrameric repeats less than 10 % diverged from perfection.\r\n",
"\r\n",
"\r\n",
"-xm\r\n",
"When using the -xm option an additional output file (.out.xm) is\r\n",
"created that contains the same information as the .out file (excluding\r\n",
"the low-complexity/simple DNA), but then in the original cross_match\r\n",
"format. This output is harder to read but there are programs that\r\n",
"require the exact cross_match output format.\r\n",
"\r\n",
"\r\n",
"-u\r\n",
"The script ProcessRepeats adjusts the original RepeatMasker output so\r\n",
"that the annotation more closely reflects reality. With the option -u\r\n",
"a .ori.out file is created that contains the original (but sorted)\r\n",
"cross_match summary lines.\r\n",
"\r\n",
"\r\n",
"-ace\r\n",
"With the -ace option the script creates an .ace file. This is merely a\r\n",
"suggestion. The columns in the table currently are:\r\n",
"\r\n",
"Motif_homol RepeatMasker(method) \r\n",
" \r\n",
"\r\n",
"\r\n",
"\r\n",
"-gff\r\n",
"The script creates a .gff file with the annotation in 'General Feature\r\n",
"Finding' format. See http://www.sanger.ac.uk/Software/GFF for\r\n",
"details. The current output follows a Sanger convention:\r\n",
"\r\n",
" RepeatMasker Similarity \r\n",
" . Target \"Motif:\"\r\n",
" \r\n",
"\r\n",
"In this line, 'RepeatMasker' becomes 'RepeatMasker_SINE' if the match\r\n",
"is against an Alu. I don't know why.\r\n",
"\r\n",
"\r\n",
"\r\n",
"-fixed\r\n",
"Since April 1999 the column widths in the annotation table are\r\n",
"adjusted to the maximum length of any string occurring in a column;\r\n",
"this allows long sequence names to be spelled out completely.\r\n",
"Previously, a fixed column width table was returned, which can still\r\n",
"be obtained by using the -fixed option. Parsing should not be effected\r\n",
"by this change of default behavior, as the same number of columns with\r\n",
"the same formatted text are still separated by white space.\r\n",
"\r\n",
"\r\n",
"-no_id\r\n",
"Since September 2000 a column displaying a unique number (ID) for each\r\n",
"integrated element is printed by default. This used to be optional\r\n",
"(-id). Fragments of a single element, separated from each other by\r\n",
"subsequent insertions of other elements, deletions or recombinations,\r\n",
"carry the same number. This feature allows better interpretation of\r\n",
"the data and should greatly help proper graphical display of the\r\n",
"repeats.\r\n",
"\r\n",
"The column follows all other columns, except for the (rare) indication\r\n",
"that an annotation overlaps another annotation (*). This change, which\r\n",
"was announced in the previous release, should not hinder most parsing\r\n",
"scripts. If it causes problems, the old format can be retrieved with\r\n",
"the option -no_id.\r\n",
"\r\n",
"\r\n",
"-excln\r\n",
"The percentages displayed in the .tbl file are calculated using a\r\n",
"total sequence length excluding runs of 25 Ns or more. This is useful\r\n",
"when analyzing draft sequences that are often concatenated contigs\r\n",
"separated by (sometimes very) long stretches of Ns. This option can\r\n",
"be used with ProcessRepeats as well. The number of Ns in long runs in\r\n",
"the query are apparent in the .tbl file, and you only need to run\r\n",
"ProcessRepeats with the option on the .cat file.\r\n",
"\r\n",
"\r\n",
"-noisy\r\n",
"RepeatMasker used to print the voluminous cross_match progress reports\r\n",
"to the screen. Since the Dec 1998 version this output is stored in a\r\n",
".stderr file and a more informative much smaller progress report is\r\n",
"printed to the screen. The option -noisy allows one to see the\r\n",
"cross-match reports coming by on the screen (yeah).\r\n",
"\r\n",
"\r\n",
"\r\n",
"1.5 ProcessRepeats options\r\n",
"\r\n",
"When you have already run RepeatMasker and want to recreate the .out\r\n",
"or .tbl file, you only need to rerun ProcessRepeats on the .cat\r\n",
"file(s), which will take just a small fraction of the time required to\r\n",
"rerun RepeatMasker. Such a situation can occur when you've\r\n",
"accidentally deleted the .out or .tbl file or want additional or\r\n",
"differentially formatted output files. Note that alignment files\r\n",
"cannot be created unless RepeatMasker was run with the -a option and\r\n",
"that the original .tbl and .out file will be overwritten unless you\r\n",
"rename them.\r\n",
"\r\n",
"ProcessRepeats -species mus -nolow -gff -excln myhumongousmousesequence.cat \r\n",
"\r\n",
"Repeat matches are processed differently for different query species,\r\n",
"so the -species mus option is necessary. With the -nolow option, the\r\n",
".out file will not contain information on simple repeats and low\r\n",
"complexity DNA anymore. The -gff option creates an additional output\r\n",
"file in GFF format, and the -excln option displays the density of\r\n",
"repeats in the .tbl file as a percentage of those bp that are not\r\n",
"contained in long stretches of Ns.\r\n",
"\r\n",
"\r\n",
"The options/flags for ProcessRepeats are:\r\n",
"\r\n",
"-species Identical as for the RepeatMasker script\r\n",
"\r\n",
"-lib skips most of processing, does not produce a .tbl file unless the\r\n",
" custom library is in the >name#class format.\r\n",
"\r\n",
"-nolow does not display simple repeats or low_complexity DNA in the annotation\r\n",
"-noint skips steps specific to interspersed repeats, saving lots of time \r\n",
"-u creates an untouched annotation file besides the manipulated file\r\n",
"-xm creates an additional output file in cross_match format (for parsing)\r\n",
"-ace creates an additional output file in ACeDB format\r\n",
"-gff creates an additional Gene Feature Finding format\r\n",
"-poly creates an output file listing only potentially polymorphic simple repeats\r\n",
"-no_id leaves out final column with unique number for each element (was default)\r\n",
"-fixed creates an (old style) annotation file with fixed width columns\r\n",
"-excln calculates repeat densities excluding long stretches of Ns in the query\r\n",
"-orf2 results in sometimes negative coordinates for L1 elements; all L1 subfamilies\r\n",
" are aligned over the ORF2 region, sometimes improving interpretation of data\r\n",
"-a shows the alignments in a .align output file\r\n",
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"2 METHODOLOGY AND QUALITY OF OUTPUT\r\n",
"\r\n",
"2.1 Methodology\r\n",
"\r\n",
"RepeatMasker compares the query sequence against one or more files of\r\n",
"FASTA sequences. The sequences in the libraries provided with\r\n",
"RepeatMasker are consensus sequences derived from alignment of\r\n",
"multiple copies of interspersed or satellite repeats. For interspersed\r\n",
"repeats, a consensus tends to approach the sequence of the\r\n",
"transposable element from which the repeat is derived.\r\n",
"\r\n",
"Both cross_match and WU-blast perform their Smith-Waterman (SW)\r\n",
"alignments by first identifying exact word matches and restricting the\r\n",
"alignment to a band or matrix surrounding this exact\r\n",
"match(es). Overlapping matrices are merged. The speed settings of\r\n",
"RepeatMasker are purely changes in the minimum word length from which\r\n",
"an alignment can be seeded and, in some cases, changes in the width of\r\n",
"the band. A wider bandwidth allows more gaps in the alignment and,\r\n",
"more importantly, increases the likelihood that neighboring matrices\r\n",
"overlap.\r\n",
"\r\n",
"Cross_match does a low complexity adjustment of the raw SW score. When\r\n",
"WU-blast is used, the RepeatMasker script performs this adjustment. Low\r\n",
"complexity matches are the primary cause of false matches, and this\r\n",
"adjustment contributes significantly to the high selectivity of\r\n",
"RepeatMasker (see 2.5)\r\n",
"\r\n",
"As a result of the existence of many related consensus sequences in\r\n",
"the database, usually multiple repeats match one region in the query\r\n",
"at the same time. Generally, cross_match and WU-blast report to the\r\n",
"script only those matches that are less than 80-90% overlapped by a\r\n",
"higher scoring match. This implies that, at first approximation, names\r\n",
"are assigned to repeats based on the highest SW score. Given\r\n",
"appropriate consensus sequences and alignment parameters, this is\r\n",
"intuitively correct as well. However, the scripts have a lot of code\r\n",
"to improve on this first approximation, primarily to deal with partial\r\n",
"matches.\r\n",
"\r\n",
"The cut-off SW score above which matches are reported is empirically\r\n",
"derived (see '2.5 selectivity' below). Note that there is no cut-off\r\n",
"divergence level; reported matches can be less than 60% identical.\r\n",
"\r\n",
"The alignments parameters -substitution matrices, and gap initiation\r\n",
"and extension penalties- are derived from data harbored in multiple\r\n",
"alignments of a special subset of interspersed repeats. The derived\r\n",
"matrices are theoretically optimal for a series of conditions (see\r\n",
"below). The gap penalties are sub-optimal, primarily because gap\r\n",
"lengths have a non-linear distribution and are poorly represented by a\r\n",
"single gap-extension penalty.\r\n",
"\r\n",
"For primate, rodent and other mammalian DNA, the query is compared to\r\n",
"consecutive subsets of repeat libraries. For primates, perfect simple\r\n",
"repeats, full-length Alus, full-length short interspersed repeats, and\r\n",
"young L1 3' ends are first (and in that order) clipped from the\r\n",
"sequence to expose underlying older elements. Subsequently, the query\r\n",
"is compared to most repeats, a set of ancient elements under\r\n",
"especially sensitive settings, a large set of long retroviral\r\n",
"sequences under faster settings (to save time), and AT-rich L1 3' ends\r\n",
"that may have been discarded earlier as low complexity\r\n",
"matches. Finally, simple repeats and low complexity regions are\r\n",
"masked.\r\n",
"\r\n",
"\r\n",
"\r\n",
"2.2 Scoring matrices\r\n",
"\r\n",
"We have calculated statistically optimal scoring matrices for the\r\n",
"alignment of neutrally diverging (non-selected) sequences in human DNA\r\n",
"to their original sequence. These matrices have been in use since the\r\n",
"May 1998 release. The matrices were derived from alignments of DNA\r\n",
"transposon fossils to their consensus sequences. A series of different\r\n",
"matrices are used dependent on the divergence level (14-25%) of the\r\n",
"repeats and the background GC level (35-53%, neutral mutation patterns\r\n",
"differ significantly in different isochores).\r\n",
"\r\n",
"These matrices are (close to) optimal for human genomic sequences\r\n",
"longer than 10 kb, for which length the GC level usually is\r\n",
"representative of the isochore in which the sequence lives. However,\r\n",
"the GC level of small fragments can diverge a lot from the surrounding\r\n",
"(e.g. a fragment spanning a CpG island, a GC rich exon or an AT-rich\r\n",
"LINE-1 element) and RepeatMasker defaults to using matrices derived for\r\n",
"a 43% GC background when a sequence is shorter than 2000 bp or when a\r\n",
"batch file is submitted. When the appropriate background GC level is\r\n",
"known, this can be entered with the -gc option.\r\n",
"\r\n",
"(Note that these matrices are an integral portion of RepeatMasker and\r\n",
"are covered under the same restrictions as the scripts and databases\r\n",
"as described in the signed software agreement).\r\n",
"\r\n",
"\r\n",
"\r\n",
"2.3 Repeat databases\r\n",
"\r\n",
"The interspersed repeat databases provided in the RepeatMasker package\r\n",
"are maintained in synch with the repeat databases (RepBase Update)\r\n",
"copyrighted by the Genetic Information Research Institute\r\n",
"(G.I.R.I.). Whereas non-mammalian libraries currently are identical to\r\n",
"the RepBase Update FASTA files except for formatting and corrections,\r\n",
"mammalian databases are extensively modified. The modification\r\n",
"primarily entails inclusion of complete sets of subfamilies for Alu\r\n",
"and L1, modifications to avoid false matches and false annotations,\r\n",
"and subdivision in multiple sets for optimization of the analysis.\r\n",
"\r\n",
"We transformed the RepBase database from a set of prototypes to a set\r\n",
"of consensus sequences (described in my dissertation) to allow both\r\n",
"determination of the origin of these repeats and improved detection. A\r\n",
"consensus properly derived from a multiple alignment of copies closely\r\n",
"approaches the original transposable element, since substitutions\r\n",
"accumulate by-and-large unselected in copies of transposable\r\n",
"elements. Because of the latter, a copy is on average twice as close\r\n",
"to the consensus as to any other copy. Consensus sequences are also\r\n",
"more sensitive search tools because directional substitution matrices\r\n",
"can be used (see above).\r\n",
"\r\n",
"Consensus sequences would be identical to the original transposable\r\n",
"element if all copies were inserted at about the same time from a\r\n",
"single source. DNA transposon copies approach this ideal, but\r\n",
"retroposons (giving rise to most repeats in our genome) live for long\r\n",
"periods in a genome and evolve doing so. Thus, over time the sequence\r\n",
"of the transposable element has changed, and a single consensus does\r\n",
"not describe the original sequence of each copy. Also, usually at any\r\n",
"time multiple distinct sequences with a common origin, cousins if you\r\n",
"will, were active. This situation is reflected by the presence in the\r\n",
"databases of multiple subfamilies for the more common retroposons\r\n",
"(usually having the same name ending in a different number or letter.\r\n",
"\r\n",
"The mammalian repeat libraries contain, besides consensus sequences\r\n",
"for transposon derived repeats, consensus satellite units, and a set\r\n",
"of *small structural RNA sequences*. The latter have created a large\r\n",
"amount of processed pseudogenes in our genome, and in that way are\r\n",
"interspersed repeats.\r\n",
"\r\n",
"\r\n",
"\r\n",
"2.4 Sensitivity and speed \r\n",
"\r\n",
"The program can be run at four levels of sensitivity. The only\r\n",
"difference between these settings is the minimum match or word length\r\n",
"in the initial (not quite) hashing step of the cross_match program\r\n",
"(see the cross_match/PHRAP documentation). For mammalian queries, he\r\n",
"\"slow\" setting will find and mask 0-5% more repetitive DNA sequences\r\n",
"than by default, whereas the \"quick\" settings miss 5-10%, and the\r\n",
"\"rush\" (-qq) settings may miss 10-25% of the sequences masked by\r\n",
"default. The alignments may extend more or be somewhat more accurate\r\n",
"in the more sensitive settings as well.\r\n",
"\r\n",
"Following are benchmark times for random 1 Mbp of sequences of a\r\n",
"variety of different species run in parallel on 4 Pentium4 2.4Ghz\r\n",
"processors with 3 GB RAM with June 2004 RepeatMasker databases. The\r\n",
"percentage of the query masked is given in parentheses.\r\n",
"\r\n",
" ------------------------ cross_match ------------------------\r\n",
"Species WUBlast (Def) Rush Quick Default Slow\r\n",
"------- ------------- ------------- ------------- ------------- -------------\r\n",
"Human 02:54 (39.26) 01:54 (33.91) 05:05 (36.85) 22:15 (39.92) 57:54 (40.58)\r\n",
"Human-reversed 01:09 ( 1.98) 01:05 ( 2.00) 03:39 ( 2.06) 18:44 ( 2.07) 53:37 ( 2.09)\r\n",
"Chimpanzee 03:00 (40.83) 01:50 (35.24) 04:45 (38.70) 20:22 (41.59) 53:14 (42.24)\r\n",
"Mouse 03:31 (54.02) 01:47 (48.65) 04:21 (51.74) 18:54 (54.15) 47:26 (55.18)\r\n",
"Rat 04:46 (66.07) 02:05 (62.07) 04:32 (63.84) 19:41 (65.97) 48:23 (67.20)\r\n",
"Dog 02:24 (34.62) 01:32 (29.15) 03:07 (32.44) 12:29 (35.09) 30:14 (35.69)\r\n",
"Arabidopsis 01:01 ( 3.02) 00:51 ( 2.95) 04:41 ( 3.00) 46:52 ( 3.12) 1:46:53 ( 3.13)\r\n",
"Ciona savigny 01:25 (15.64) 01:02 (13.12) 01:30 (14.45) 06:13 (15.90) 15:24 (16.30)\r\n",
"C. elegans 02:35 (22.63) 01:38 (20.84) 02:39 (22.52) 12:12 (23.21) 25:15 (23.59)\r\n",
"Drosophila 01:59 (47.21) 01:23 (43.08) 02:30 (45.60) 15:49 (47.51) 39:24 (48.38)\r\n",
"Chicken 00:42 ( 6.52) 00:35 ( 6.18) 00:58 ( 6.42) 04:59 ( 6.53) 11:48 ( 6.58)\r\n",
"Fugu 00:35 ( 5.89) 00:34 ( 5.40) 00:49 ( 5.70) 03:51 ( 5.89) 09:20 ( 6.05)\r\n",
"\r\n",
"\r\n",
"The human-reversed sequence is the \"human\" sequence reversed but not\r\n",
"complemented. 2% of this sequence is (properly) masked as simple\r\n",
"repeats or low complexity DNA.\r\n",
"\r\n",
"Note that for many non-mammalian species the slower settings do not\r\n",
"dramatically increase the percentage recognized as interspersed\r\n",
"repeats. Most of the repeats in the databases for these species are\r\n",
"relatively young and thus are easily detected. This particular 1Mbp\r\n",
"Arabidopsis sequence is an extreme example, where at slow settings in\r\n",
"almost two hours only 1800 bp more is masked than at rush settings in\r\n",
"51 seconds (the Arabidopsis database is large).\r\n",
"\r\n",
"The speed is also dependent on the repeat content of the sequence. For\r\n",
"human sequences, Alu rich sequences are analyzed fastest, LINE rich\r\n",
"sequences somewhat slower, repeat poor regions slower still, and long\r\n",
"satellite regions can take a while.\r\n",
"\r\n",
"\r\n",
"If you have several shorter sequences it is much faster to run\r\n",
"RepeatMasker on a batch file (all sequences in one file). On above\r\n",
"computer, in the rush mode (cross_match), a batch of 10 5 kb sequences\r\n",
"is analyzed in 23 seconds, 20 5kb in 34 sec., etc.\r\n",
"\r\n",
"The user time for larger sequences or sequence batches (50 kb and up)\r\n",
"is linearly related to the length of the query due to the\r\n",
"fragmentation of the query sequence.\r\n",
"\r\n",
"The increase in speed by using multiple processors is dependent on the\r\n",
"usage of the computer and the above-mentioned non-linear relationships\r\n",
"of sequence length and processing time. However, under the right\r\n",
"circumstances, using 2 processors can increase the speed close to\r\n",
"twofold, because the most time-consuming processes are performed in\r\n",
"parallel.\r\n",
"\r\n",
"\r\n",
"\r\n",
"2.5 Selectivity and matches to coding sequences \r\n",
"\r\n",
"The cutoff Smith-Waterman scores for masking interspersed repeats are\r\n",
"conservative, since masking of one short potentially interesting\r\n",
"region generally is more harmful than not masking a number of hard to\r\n",
"find matches. If there are any false matches, they tend to have\r\n",
"scores close to the cutoff, which is 225 for most repeats, 300 for the\r\n",
"low-complexity LINE-1 search*, and 180 for the very old MIR, LINE2 and\r\n",
"MER5 sequences.\r\n",
"\r\n",
"* most LINE-1s are detected with a 225 cut-off, but in one step in\r\n",
"RepeatMasker the low-complexity score adjustment is turned off to find\r\n",
"ancient A-rich L1 elements.\r\n",
"\r\n",
"With each release, we test for the occurrence of false matches in\r\n",
"randomized and in inverted (but not complemented) DNA including a\r\n",
"range of isochores from 36% to 54% GC. To retain seeds for Smith\r\n",
"Waterman alignments, sequences are randomized at the 10 bp word\r\n",
"level. Note that the inverted sequences retain the low complexity and\r\n",
"simple repeat patterns of the original sequences. Even at sensitive\r\n",
"settings, for which false matches are most likely, the 1998-2004\r\n",
"versions of RepeatMasker have reported no (false) matches at all to\r\n",
"interspersed repeats in the randomized or inverted sequences. No\r\n",
"simple repeats were reported in the randomized queries.\r\n",
"\r\n",
"In a 1999 test, RepeatMasker returned only a single probably false\r\n",
"match (71 bp) when analyzing a batch of 4440 coding regions in human\r\n",
"mRNAs (7.2 Mb) at sensitive settings. The coding regions were\r\n",
"collected from GenBank, based on annotations, filtered for the\r\n",
"presence of complete ORFs and initiator methionines, and made more or\r\n",
"less non-redundant. When each coding region was analyzed individually\r\n",
"using the -gccalc option, 5 matches (414 bp, 0.006%) were falsely\r\n",
"masked (156 bp at default speed, 76 bp at quick settings). In this\r\n",
"analysis each sequence was analyzed with matrices chosen based on the\r\n",
"actual GC level, even for very short sequences, while in the batch\r\n",
"analysis of the coding regions the 'average' 43% GC matrices were\r\n",
"used.\r\n",
"\r\n",
"The 1998 and later versions of RepeatMasker show somewhat more false\r\n",
"masking when a pre-1998 version of cross_match is used. These are\r\n",
"primarily the result of improper assumptions of the background\r\n",
"nucleotide frequency used in the scoring matrix calculation when\r\n",
"adjusting for the complexity of a match. Specifically, a very GC rich\r\n",
"region in an AT-rich isochore (like an exon) may improperly match a GC\r\n",
"rich repeat, since the scores for C/G matches are higher in the used\r\n",
"scoring matrix than for AT matches (calculated for this AT rich\r\n",
"background) whereas the old cross_match assumed that a 50% GC\r\n",
"background in these calculations and equal scores for A/T and G/C\r\n",
"matches have been given. The new version of cross_match reads the\r\n",
"correct nucleotide background level from the matrix used.\r\n",
"\r\n",
"\r\n",
"2.6 Simple repeats and low complexity DNA\r\n",
"\r\n",
"Low-complexity DNA \r\n",
"\r\n",
"By default, along with the interspersed repeats, RepeatMasker masks\r\n",
"low-complexity DNA. Simple repeats (micro-satellites) can originate at\r\n",
"any site in the genome, and therefore have an interspersed\r\n",
"character. Other low-complexity DNA, primarily poly-purine/\r\n",
"poly-pyrimidine stretches, or regions of extremely high AT or GC\r\n",
"content will result in spurious matches in some database searches as\r\n",
"well (especially in the ungapped BLASTN searches). For example,\r\n",
"extremely AT-rich regions consistently will give very low probability\r\n",
"matches to mitochondrial DNA in BLASTN searches. The settings are very\r\n",
"stringent, and we think that few if any sequences informative in\r\n",
"database searches are masked as low-complexity DNA. However, you can\r\n",
"skip the low-complexity DNA masking using the option -nolow or -l(ow).\r\n",
"\r\n",
"Under the current settings a 100 bp stretch of DNA is masked when it\r\n",
"is >87% AT or >89% GC, a 30 bp stretch has to contain 29 A/T (or GC)\r\n",
"nucleotides. The settings are slightly more stringent than the\r\n",
"original settings, partly because the gapped BLAST programs are less\r\n",
"sensitive to short regions of low complexity then the old gapless\r\n",
"BLAST. In coding regions I have not yet found extensive regions (>10\r\n",
"bp) masked as low complexity DNA that would not be masked by the\r\n",
"combined XNU and SEG filters routinely used in BLASTX.\r\n",
"\r\n",
"\r\n",
"Annotation of simple repeats \r\n",
"\r\n",
"Although RepeatMasker does a good job in masking simple repeats to\r\n",
"avoid spurious matches in database searches, it is not written to find\r\n",
"and indicate all possibly polymorphic simple repeat sequences. Only\r\n",
"di- to pentameric and some hexameric repeats are scanned for and\r\n",
"simple repeats shorter than 20 bp are ignored. The -poly option prints\r\n",
"out a separate list of simple repeats of < 10% divergence from a\r\n",
"perfect repeat. However, even long perfect repeats may not be\r\n",
"presented in this list; e.g. two perfect 40 bp long (CA)n repeats\r\n",
"interrupted by 10 Ts are aligned in one piece and may be reported as\r\n",
"having > 10% divergence from the consensus. Many perfect hexameric or\r\n",
"longer unit repeats will be listed as more or less diverged smaller\r\n",
"unit repeats and may not appear in the .polyout file.\r\n",
"\r\n",
"Also note that, in the default output, simple repeats expanded from\r\n",
"the poly A tails of Alus and LINE-1 are now included in the Alu or\r\n",
"LINE-1 annotation. This cleans up the annotation a bit and lets the\r\n",
"stand-alone poly A regions stand out (they may indicate the presence\r\n",
"of a processed pseudogene). However, even perfect simple repeats in\r\n",
"such tails will be hidden in the .out file.\r\n",
"\r\n",
"A program optimized to quickly find all dimeric to pentameric repeats\r\n",
"is sputnik, available at http://espressosoftware.com/pages/sputnik.jsp. \r\n",
"\r\n",
"\r\n",
"Local duplications, tandem repeats and satellites.\r\n",
"\r\n",
"Gary Benson's program \"Tandem Repeat Finder\" (another catchy name)\r\n",
"currently is the standard for finding satellites and all other direct\r\n",
"repeats (http://tandem.bu.edu/trf/trf.html). \r\n",
"Any local duplications (tandem, inverted, interrupted) can be detected\r\n",
"with the program miropeats (http://www.genome.ou.edu/miropeats.html),\r\n",
"which presents this similarity information graphically.\r\n",
"\r\n",
"\r\n",
"\r\n",
"3 HOW TO READ THE RESULTS\r\n",
"\r\n",
"3.1 The annotation (.out) file\r\n",
"\r\n",
"The annotation file contains the cross_match summary lines. It lists\r\n",
"all best matches (above a set minimum score) between the query\r\n",
"sequence and any of the sequences in the repeat database or with low\r\n",
"complexity DNA. The term \"best matches\" reflects that a match is not\r\n",
"shown if its domain is over 80% or 90% contained within the domain of\r\n",
"a higher scoring match, where the \"domain\" of a match is the region in\r\n",
"the query sequence that is defined by the alignment start and\r\n",
"stop. These domains have been masked in the returned masked sequence\r\n",
"file. In the output, matches are ordered by query name, and for each\r\n",
"query by position of the start of the alignment.\r\n",
"\r\n",
"Example: \r\n",
"\r\n",
" SW perc perc perc query position in query matching repeat position in repeat\r\n",
"score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID\r\n",
"...\r\n",
" 1320 15.6 6.2 0.0 HSU08988 6563 6781 (22462) C MER7A DNA/MER2_type (0) 337 104 20\r\n",
"12279 10.5 2.1 1.7 HSU08988 6782 7718 (21525) C Tigger1 DNA/MER2_type (0) 2418 1486 19\r\n",
" 1769 12.9 6.6 1.9 HSU08988 7719 8022 (21221) C AluSx SINE/Alu (0) 317 1 17\r\n",
"12279 10.5 2.1 1.7 HSU08988 8023 8694 (20549) C Tigger1 DNA/MER2_type (932) 1486 818 19\r\n",
" 2335 11.1 0.3 0.7 HSU08988 8695 9000 (20243) C AluSg SINE/Alu (5) 305 1 18\r\n",
"12279 10.5 2.1 1.7 HSU08988 9001 9695 (19548) C Tigger1 DNA/MER2_type (1600) 818 2 19\r\n",
" 721 21.2 1.4 0.0 HSU08988 9696 9816 (19427) C MER7A DNA/MER2_type (224) 122 2 20\r\n",
"\r\n",
"This is a sequence in which a Tigger1 DNA transposon has integrated into a MER7 DNA transposon copy. Subsequently two Alus integrated in the Tigger1 sequence. The first line is interpreted as such:\r\n",
"\r\n",
" 1320 = Smith-Waterman score of the match, usually complexity adjusted\r\n",
" The SW scores are not always directly comparable. Sometimes\r\n",
" the complexity adjustment has been turned off, and a variety of\r\n",
" scoring-matrices are used dependent on repeat age and GC level.\r\n",
"\r\n",
" 15.6 = % divergence = mismatches/(matches+mismatches) **\r\n",
" 6.2 = % of bases opposite a gap in the query sequence (deleted bp)\r\n",
" 0.0 = % of bases opposite a gap in the repeat consensus (inserted bp)\r\n",
" HSU08988 = name of query sequence\r\n",
" 6563 = starting position of match in query sequence\r\n",
" 6781 = ending position of match in query sequence\r\n",
" (22462) = no. of bases in query sequence past the ending position of match\r\n",
" C = match is with the Complement of the repeat consensus sequence\r\n",
" MER7A = name of the matching interspersed repeat\r\n",
" DNA/MER2_type = the class of the repeat, in this case a DNA transposon \r\n",
" fossil of the MER2 group (see below for list and references)\r\n",
" (0) = no. of bases in (complement of) the repeat consensus sequence \r\n",
" prior to beginning of the match (0 means that the match extended \r\n",
" all the way to the end of the repeat consensus sequence)\r\n",
" 337 = starting position of match in repeat consensus sequence\r\n",
" 104 = ending position of match in repeat consensus sequence\r\n",
" 20 = unique identifier for individual insertions \r\n",
"\r\n",
" An asterisk (*) following the final column (see below example)\r\n",
" indicates that there is a higher-scoring match whose domain partly\r\n",
" (<80%) includes the domain of the current match.\r\n",
"\r\n",
"** This has changed in August 2001: cross_match output gives the\r\n",
"percent mismatches/(matches+mismatches+unaligned bases in query). I\r\n",
"didn't think this definition is otherwise commonly used and most users\r\n",
"will assume the divergence level would be\r\n",
"mismatches/(matches+mismatches).\r\n",
"\r\n",
"Note that the SW score and divergence numbers for the three Tigger1\r\n",
"lines are identical. This is because the information is derived from a\r\n",
"single alignment (the Alus were deleted from the query before the\r\n",
"alignment with the Tigger element was performed). The ProcessRepeats\r\n",
"script makes educated guesses if any pair of fragments is derived from\r\n",
"the same element or not; if so, the fragments will have the same ID in\r\n",
"the last column, in this example it figured that the MER7A fragments\r\n",
"represent one insert.\r\n",
"\r\n",
"Here is another example that shows how much trouble ProcessRepeats\r\n",
"takes to defragment elements and how the ID can be useful in\r\n",
"interpreting the results:\r\n",
"\r\n",
" 7120 19.9 0.6 0.3 NT_001227 85631 87837 (19816) + L1PA16 LINE/L1 1 1885 (4964) 123 \r\n",
" 2503 14.9 6.5 0.7 NT_001227 87839 88241 (19412) + MSTA LTR/MaLR 1 428 (0) 100 \r\n",
" 867 12.9 2.7 0.0 NT_001227 88242 88388 (19265) + MSTA-int LTR/MaLR 1 151 (1500) 100 *\r\n",
" 5219 19.5 2.9 0.6 NT_001227 88386 89342 (18311) + MSTA-int LTR/MaLR 629 1607 (44) 100 \r\n",
" 8003 3.5 0.8 0.0 NT_001227 89362 90773 (16880) C L1PA3 LINE/L1 (0) 6155 4745 103 \r\n",
" 7677 3.5 0.0 0.0 NT_001227 90795 94059 (13594) C L1PA3 LINE/L1 (0) 6155 2872 104 \r\n",
" 9050 6.5 0.4 0.1 NT_001227 94060 95127 (12526) C MER11C LTR/ERVK (0) 1071 1 106 \r\n",
" 7677 3.5 0.0 0.0 NT_001227 95128 97101 (10552) C L1PA3 LINE/L1 (3282) 2873 900 104 \r\n",
" 5619 7.8 0.3 0.9 NT_001227 97097 97865 (9788) C L1PA3 LINE/L1 (5370) 776 13 104 *\r\n",
" 320 16.9 0.0 1.7 NT_001227 97876 97934 (9719) + MSTA-int LTR/MaLR 1594 1651 (0) 100 \r\n",
" 1475 19.0 4.8 5.6 NT_001227 97935 98255 (9398) + MSTA LTR/MaLR 1 323 (48) 100 \r\n",
" 2322 14.4 0.8 1.6 NT_001227 98256 98629 (9024) + THE1C LTR/MaLR 1 371 (0) 112 \r\n",
"10051 12.9 3.5 4.3 NT_001227 98630 100221 (7432) + THE1C-int LTR/MaLR 1 1580 (0) 112 \r\n",
" 2359 15.7 0.3 1.9 NT_001227 100224 100598 (7055) + THE1C LTR/MaLR 3 371 (0) 112 \r\n",
" 1475 19.0 4.8 5.6 NT_001227 100599 100646 (7007) + MSTA LTR/MaLR 323 371 (0) 100 \r\n",
" 1360 19.4 8.2 1.7 NT_001227 100662 100955 (6698) + MSTA LTR/MaLR 114 426 (0) 113 \r\n",
"11892 24.7 1.9 2.0 NT_001227 100968 101243 (6410) + L1PA16 LINE/L1 1881 2143 (4706) 123 \r\n",
" 2062 11.9 8.4 0.0 NT_001227 101244 101563 (6090) C L1PA12 LINE/L1 (10) 6164 5818 116 \r\n",
"11892 24.7 1.9 2.0 NT_001227 101564 105425 (2228) + L1PA16 LINE/L1 2137 5989 (860) 123 \r\n",
" 257 0.0 0.0 2.9 NT_001227 105436 105469 (2184) + (TAA)n Simple 2 34 (0) 118 \r\n",
" 2189 18.2 0.2 0.7 NT_001227 105470 105893 (1760) + L1PA16 LINE/L1 6062 6483 (386) 123 \r\n",
" 255 6.1 0.0 0.0 NT_001227 105896 105928 (1725) + (TA)n Simple 1 33 (0) 120 *\r\n",
" 369 0.0 0.0 0.0 NT_001227 105928 105968 (1685) + (GA)n Simple 2 42 (0) 121 \r\n",
" 305 18.8 0.0 1.0 NT_001227 105971 106066 (1587) + (TA)n Simple 2 96 (0) 122 \r\n",
" 1589 21.2 1.6 1.1 NT_001227 106068 106449 (1204) + L1PA16 LINE/L1 6485 6868 (1) 123 \r\n",
"\r\n",
"This entire 20,819 bp block of sequence is comprised by an L1PA16\r\n",
"(#123), in which 7 or 8 elements have integrated (it is unclear to me\r\n",
"if the MSTA #113 is a separate integration or a tandem\r\n",
"duplication). There are at least four layers, with MER11 (#106)\r\n",
"inserted in L1PA3 (#104) inserted in MSTA (#100, maybe in #113)\r\n",
"inserted in L1PA16. L1PA16 is already primate specific, so that all\r\n",
"these insertions took place during primate evolution.\r\n",
"\r\n",
"The ID column helps much in deciphering the events. It also should be\r\n",
"a basis for graphic display of RepeatMasker output.\r\n",
"\r\n",
"\r\n",
"\r\n",
"3.2 Alignments\r\n",
"\r\n",
"When using the -a option, a .align file is created that contains\r\n",
"alignments of your query sequence to the matching repeat consensus\r\n",
"sequences. The alignments are given in the same order as listed in the\r\n",
".out file. They are always in the orientation of the query; you can\r\n",
"use the -inv option to produce all alignments in the orientation of\r\n",
"the consensus sequence.\r\n",
"\r\n",
"The alignments are in the cross_match/SWAT format, in which mismatches\r\n",
"rather than matches are indicated (transitions with an i and\r\n",
"transversions with a v). The description line preceding the alignment\r\n",
"is similar to that seen in the .out file. In the example of an\r\n",
"alignment below, an old retrovirus-like LTR (MLT1H) has been\r\n",
"interrupted by the more recent insertion of a short DNA transposon\r\n",
"(MADE2):\r\n",
"\r\n",
"384 28.89 9.24 2.17 chr1_4622259_4622561 21 77 (225) MLT1H#LTR/MaLR 23 88 (461) 5\r\n",
"\r\n",
" chr1_4622259_ 21 TGGCC-CAATTCTTTACCTCTC--TGCCTCTTGTGCCTTTTG-------G 60\r\n",
" - ? ii i -- iv i-i i -------i\r\n",
" MLT1H#LTR/MaL 23 TGGCCACAATTMTCCACCCCTCCCTGTATCC-ATGCCCTTTGCAATGTGA 71\r\n",
"\r\n",
" chr1_4622259_ 61 CTTTGCCATTTCTTCTA 77\r\n",
" vii i i i \r\n",
" MLT1H#LTR/MaL 72 CTTTGCAGCTCCTCCCA 88\r\n",
"\r\n",
"Transitions / transversions = Transitions / transversions = Unknown\r\n",
"Gap_init rate = Unknown\r\n",
"\r\n",
"\r\n",
"557 11.25 0.00 0.00 chr1_4622259_4622561 78 157 (145) C MADE1#DNA/Mariner (0) 80 1 3\r\n",
"\r\n",
" chr1_4622259_ 78 TTAGGTTGGTGCAAAAGTAATTGTGGGTTTTAGCATTTAAAGTAATACCA 127\r\n",
" i v iv ? iv \r\n",
"C MADE1#DNA/Mar 80 TTAGGTTGGTGCAAAAGTAATTGCGGTTTTTGCCATTRAAAGTAATGGCA 31\r\n",
"\r\n",
" chr1_4622259_ 128 AAAACCACAACTACTTTTGCACCAACCTAA 157\r\n",
" i i \r\n",
"C MADE1#DNA/Mar 30 AAAACCGCAATTACTTTTGCACCAACCTAA 1\r\n",
"\r\n",
"Transitions / transversions = Transitions / transversions = Unknown\r\n",
"Gap_init rate = Unknown\r\n",
"\r\n",
"\r\n",
"384 28.89 9.24 2.17 chr1_4622259_4622561 158 283 (19) MLT1H#LTR/MaLR 89 218 (331) 5\r\n",
"\r\n",
" chr1_4622259_ 158 TAGTAAAAGCAGAGGATAAT-----ATTCCTTGTCTTTGGGTTTGTCATG 202\r\n",
" i -- i i ii vv v ----- ii vv i i v i v \r\n",
" MLT1H#LTR/MaL 89 CA--AGAGGTGGAGTCTATTTCCCCACCCCTTGAATCTGGGCTGGCCTTG 136\r\n",
"\r\n",
" chr1_4622259_ 203 TGACTCTCTTTGGCCATGGGAACATAGGCAAAAATGACT-TGTGCCCCTT 251\r\n",
" iv vvi ii - i i v- vv \r\n",
" MLT1H#LTR/MaL 137 TGACTTGCTTTGGCCAATAGAATGT-GGCAGAAGTGACGGTGTGCCAGTT 185\r\n",
"\r\n",
" chr1_4622259_ 252 CTGAGCCCCGGCCTTGAGAGGTCTT-CATGCTT 283\r\n",
" iv ii i - i \r\n",
" MLT1H#LTR/MaL 186 CTGAGCCTAGGCCTCAAGAGGCCTTGCACGCTT 218\r\n",
"\r\n",
"Transitions / transversions = Transitions / transversions = Unknown\r\n",
"Gap_init rate = Unknown\r\n",
"\r\n",
"Note that the description line is identical for the first and third\r\n",
"alignment. Before the query was compared to the MLT1H consensus,\r\n",
"RepeatMasker had recognized the MADE1 element and had removed it from\r\n",
"the query sequence to more or less reconstruct the pre-MADE1-insertion\r\n",
"situation. Thus, position 21 to 283 of the query could be aligned to\r\n",
"the MLT1H consensus in a single piece. Since 2004 (RepeatMasker3.0 and\r\n",
"up), such alignments are broken up to present all matches in serial\r\n",
"order.\r\n",
"\r\n",
"Alignments are especially useful for designing PCR primers in a region\r\n",
"full of repeats. It is possible to design primers contained in a\r\n",
"common repeat that still work in a whole genome, when the 3' end is in\r\n",
"a region that is very different from the consensus.\r\n",
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"Discrepancies between alignments and the .out file\r\n",
"\r\n",
"Discrepancies between alignments and annotation result from the\r\n",
"adjustments made by the ProcessRepeats script to produce more legible\r\n",
"annotation. This annotation also tends to be closer to the biological\r\n",
"reality than the raw cross_match output.\r\n",
"\r\n",
"For example, adjustments often are necessary when a repeat is\r\n",
"fragmented through deletions, insertions, or an inversion. Many\r\n",
"subfamilies of repeats closely resemble each other, and when a repeat\r\n",
"is fragmented these fragments can be assigned different subfamily\r\n",
"names in the raw output. ProcessRepeats often can decide if fragments\r\n",
"are derived from the same integrated transposable element and which\r\n",
"subfamily name is appropriate (subsequently given to all\r\n",
"fragments). This can result in discrepancies in the repeat name and\r\n",
"matching positions in the consensus sequence (subfamily consensus\r\n",
"sequences differ in length).\r\n",
"\r\n",
"In many cases matches are fused into one annotation. To give a few\r\n",
"common examples: \r\n",
"\r\n",
"- In large sequences that are analyzed in fragments consecutive\r\n",
"fragments overlap and repeats in these overlaps will appear twice\r\n",
"(partially or wholly) in the alignment file but are merged in the .out\r\n",
"file.\r\n",
"\r\n",
"- A-rich simple repeats originated from the poly A tail of Alus and\r\n",
"LINE-1s are incorporated in the annotation of the Alu or LINE-1.\r\n",
"\r\n",
"- There is an 'endless' number of subfamilies for retroposons which\r\n",
"can not all be represented in the databases and sometimes an element\r\n",
"is matched by overlapping pieces of two related subfamilies (which\r\n",
"will be merged).\r\n",
"\r\n",
"- You may find large discrepancies in position numbering if an element\r\n",
"includes tandem repeat units. For example, MER109 contains multiple\r\n",
"~300 bp repeat units that can lead to overlapping matches. In the\r\n",
"annotation such matches are fused.\r\n",
"\r\n",
"- Simple repeats or satellites that are longer than the number of\r\n",
"units represented in the repeat library will be represented by\r\n",
"multiple, genrally overlapping alignments in the .align file, but only\r\n",
"a single annotation line in the .out file.\r\n",
"\r\n",
"\r\n",
"Specific LINE problems:\r\n",
"\r\n",
"Some other discrepancies between alignments and annotations are\r\n",
"specific to LINE-like elements. These repeats usually do not appear as\r\n",
"complete elements in the consensus database. For LINE-1, this is\r\n",
"mostly due to the contrast in conservation over the length of its\r\n",
"sequence during its evolution in the mammalian genome; the ~3 kb ORF2\r\n",
"region of LINE-1 has been very conserved, whereas the untranslated\r\n",
"regions and ORF1 to a lesser degree have evolved very fast. Thus the\r\n",
"3' end or 5' end of an ancient LINE-1 does not even remotely resemble\r\n",
"that of the currently active LINE-1, whereas the coding region for\r\n",
"reverse transcriptase is closely related. Thus, many subfamilies have\r\n",
"been defined for both the 5' and 3' UTRs (48 and 55, resp.) of LINE-1\r\n",
"elements in human DNA, whereas only 11 ORF2 entries are present in the\r\n",
"database. Besides the fact that some 3' ends have multiple defined 5'\r\n",
"ends, and vice versa, the program would become very slow when each\r\n",
"query is compared to 55 full length (6 to 8 kb) LINE-1 elements.\r\n",
"Thus, LINE-1 elements are presented in the database in 3 pieces, and\r\n",
"the ProcessRepeats script puts these pieces together. As a result both\r\n",
"the names of the repeats and position numbering in the consensus\r\n",
"sequence are generally different in the alignments than in the output\r\n",
"file. The LINE2 elements are likewise broken up in 3' UTRs for\r\n",
"different subfamilies and one 5'UTR-ORF2 region.\r\n",
"\r\n",
"Between LINE-1 subfamilies, the 3' UTR ranges from 500 bp to over 2000\r\n",
"bp (in L1MC/D3), and the length of the 5' UTR is even more variable,\r\n",
"even between subfamilies that show strong similarity in the 3' UTR. To\r\n",
"allow the LINE-1 fragments to be put together, all position numbers in\r\n",
"older LINE-1 subfamilies are normalized relative to the position of\r\n",
"ORF2 (the conserved part of LINE-1) in a complete L1PA2 element. Since\r\n",
"some older elements have much longer 5' UTRs or ORF1-ORF2 linker\r\n",
"regions than L1PA2, this often results in the assignment of negative\r\n",
"position numbers for the 5' end of LINEs. Since the March2000 release,\r\n",
"such positions and all positions in fragments thought to be part of\r\n",
"the same LINE-1 insert are readjusted to count from the 5' end (which\r\n",
"is not necessarily the very 5' end of the LINE-1 source gene, as these\r\n",
"are hard to derive for old elements). One problem with this approach\r\n",
"is that positions are not adjusted in detached 3' fragments that are\r\n",
"somehow not recognized by the program as originating from the same\r\n",
"insertion. Thereby, the common origin of the 5' fragments and 3'\r\n",
"fragments may become completely obscured. You can use the option\r\n",
"'-orf2' of ProcessRepeats to retrieve an output in which all LINE-1s\r\n",
"are numbered so that position 1 of ORF2 is aligned (resulting in\r\n",
"occasionally negative positions).\r\n",
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"3.3 The summary (.tbl) file\r\n",
"\r\n",
"The summary file is pretty much self-explanatory. Below is an example.\r\n",
"\r\n",
"==================================================\r\n",
"file name: AC027410.fa\r\n",
"sequences: 1\r\n",
"total length: 152192 bp (148791 bp excl N-runs)\r\n",
"GC level: 39.59 %\r\n",
"bases masked: 88734 bp ( 59.64 %)\r\n",
"==================================================\r\n",
" number of length percentage\r\n",
" elements* occupied of sequence\r\n",
"--------------------------------------------------\r\n",
"SINEs: 195 45195 bp 30.37 %\r\n",
" ALUs 178 43249 bp 29.07 %\r\n",
" MIRs 17 1946 bp 1.31 %\r\n",
"\r\n",
"LINEs: 54 31173 bp 20.95 %\r\n",
" LINE1 36 24602 bp 16.53 %\r\n",
" LINE2 18 6571 bp 4.42 %\r\n",
" L3/CR1 0 0 bp 0.00 %\r\n",
"\r\n",
"LTR elements: 13 5833 bp 3.92 %\r\n",
" MaLRs 8 4079 bp 2.74 %\r\n",
" ERVL 0 0 bp 0.00 %\r\n",
" ERV_classI 5 1754 bp 1.18 %\r\n",
" ERV_classII 0 0 bp 0.00 %\r\n",
"\r\n",
"DNA elements: 17 4459 bp 3.00 %\r\n",
" MER1_type 12 1903 bp 1.28 %\r\n",
" MER2_type 4 2466 bp 1.66 %\r\n",
"\r\n",
"Unclassified: 0 0 bp 0.00 %\r\n",
"\r\n",
"Total interspersed repeats: 86660 bp 58.24 %\r\n",
"\r\n",
"\r\n",
"Small RNA: 2 124 bp 0.08 %\r\n",
"\r\n",
"Satellites: 0 0 bp 0.00 %\r\n",
"Simple repeats: 22 1151 bp 0.77 %\r\n",
"Low complexity: 22 799 bp 0.54 %\r\n",
"==================================================\r\n",
"\r\n",
"* most repeats fragmented by insertions or deletions\r\n",
" have been counted as one element\r\n",
"Runs of >20 Ns in query were excluded in % calcs\r\n",
"\r\n",
"The query species was assumed to be Pan troglodytes\r\n",
"RepeatMasker version 20040617 , default mode\r\n",
"run with cross_match version 0.990329\r\n",
"RepBase Update 9.04, RM database version 20040617\r\n",
"----------------------------------------------------\r\n",
"\r\n",
"AC027410 was a draft sequence, with individual contigs separated by\r\n",
"poly N linkers. In this case, the option -excln was used, so that\r\n",
"these strings of Ns were ignored for the percent calculations.\r\n",
"\r\n",
"\r\n",
"The classification in this table is well defined (see my reviews in\r\n",
"COGD) and forms a good basis for visual presentation and tabulation of\r\n",
"the repeats in your study.\r\n",
"\r\n",
"We've been able to classify almost all human repeats, most of them\r\n",
"even in subclasses. The totals for the classes often are higher than\r\n",
"the sum of the subclasses, because not all elements fit in a subclass\r\n",
"and minor subclasses are not listed separately in the table (e.g. for\r\n",
"the human table the Mariner, Tc2, Piggybac, Zaphod, and Arthur\r\n",
"families of DNA transposons). The HAL1 element, derived from LINE-1,\r\n",
"is added to the LINE-1 total in this table.\r\n",
"\r\n",
"Note that the \"MER\" subclasses have no relationship to each other. The\r\n",
"term MER (MEdium Reiterated repeats) was introduced for purely\r\n",
"administrative purposes to give the beast a name. The MER1 and MER2\r\n",
"groups were named after the first member of these groups identified as\r\n",
"an interspersed repeat in our genome. In the literature they're also\r\n",
"known as the Tigger and Charlie groups.\r\n",
"\r\n",
"The nomenclature of mammalian repeats derived from retrovirus-like\r\n",
"elements is different from older versions. I've now divided this class\r\n",
"up in the traditional class I, class II (ERVK), class III (ERVL)\r\n",
"retroviruses and the ERVL-derived but very distinct non-autonomous\r\n",
"MaLR elements. Since 'class III' is not an accepted classification\r\n",
"yet, for now this class is called ERVL. The large MER4-group of\r\n",
"non-autonomous LTR elements merges seamlessly with class I endogenous\r\n",
"retroviruses, making it hard to define, and is now incorporated in the\r\n",
"latter group. The ERV classes are most readily distinguished by the\r\n",
"size of the insertion site duplication: 4 in class I, 6 in class II,\r\n",
"and 5 in class III, though there are some exceptions to this rule. My\r\n",
"LTR classification is not based target size duplication sizes, but on\r\n",
"the encoded proteins in the internal sequences or, if these are not\r\n",
"known, on matches to LTRs with internal sequences.\r\n",
"\r\n",
"\r\n",
"As described above, the ProcessRepeats script tries very hard to find\r\n",
"out which repeat fragments were derived from the same insertion event\r\n",
"of a transposable element, but there still will be a slight\r\n",
"overestimate of the copy numbers.\r\n",
"\r\n",
"\r\n",
"There may be slight differences in the number of \"bases masked\" and\r\n",
"the sum of the bases annotated in this .tbl file. At this moment bases\r\n",
"are masked based on the unprocessed matches (as they are in the .cat\r\n",
"file) and most of the discrepancies are accounted for by unmasked\r\n",
"regions between flanking identical simple repeats, annotated as one\r\n",
"stretch if fewer than 10 bases separate them, and fragments of repeats\r\n",
"shorter than 10 bp which are not annotated but are masked.\r\n",
"\r\n",
"\r\n",
"\r\n",
"4 APPLICATIONS\r\n",
"\r\n",
"4.1 Use in database searches\r\n",
"\r\n",
"RepeatMasker is most commonly used to avoid spurious matches in\r\n",
"database searches. Generally this step is strongly recommended before\r\n",
"doing BLASTN or BLASTX equivalent searches with mammalian DNA\r\n",
"sequence.\r\n",
"\r\n",
"The most common concern is of course if RepeatMasker ever masks coding\r\n",
"regions. \r\n",
"We found that false matches in coding regions are extremely rare, but\r\n",
"did identify 38 genuine fragments of interspersed repeats (4214 bp) in\r\n",
"the (annotated) coding regions of the 4440 human mRNAs (7.2 Mb)\r\n",
"analyzed (excluding annotated coding sequences of LINE-1 elements and\r\n",
"endogenous retroviruses). We verified matches with lower scores by\r\n",
"comparing the translation products to close homologous or redundant\r\n",
"entries in the database (the repeat matching regions always were\r\n",
"exactly missing). In the majority of these cases, the sequences appear\r\n",
"to be improperly annotated or to represent either artificially or\r\n",
"naturally defective mRNAs (e.g. alternatively spliced exons comprised\r\n",
"of a small fragment of a repeat). Genuine overlaps of interspersed\r\n",
"repeats with coding sequences usually involve terminal regions of the\r\n",
"ORFs. Since the transposable element derived region is unique to the\r\n",
"protein in that (group of) species, the masking does not interfere\r\n",
"with database searches.\r\n",
"\r\n",
"However, some cautionary comments are necessary. First, a few active\r\n",
"cellular genes are derived from transposable elements (see my list of\r\n",
"50 in our genome in Lander et al. 2001). Some of these genes will be\r\n",
"partially masked by a (related) transposon in the repeat database. EST\r\n",
"and cDNA matches beyond the masked region should alert you.\r\n",
"\r\n",
"Also remember that, currently only for mammals, RepeatMasker screens\r\n",
"for small RNA (pseudo)genes because of their similarity to SINEs. The\r\n",
"number of matches to small RNAs are listed in the overview table;\r\n",
"(close to) exact matches are possibly active genes, although related\r\n",
"active genes not in the database may show diverged matches. If you're\r\n",
"interested in (small) RNA genes, you should use the -norna option to\r\n",
"leave these sequences unmasked, while SINEs will remain masked.\r\n",
"\r\n",
"A final caution relates to the fact that 3' UTRs of transcripts are\r\n",
"about as dense in interspersed repeats as intergenic regions\r\n",
"are. Thus, many ESTs are completely masked as repetitive DNA. I\r\n",
"recommend that, when you compare a genomic sequence against the EST\r\n",
"database or use ESTs as a query in nucleotide searches, you search\r\n",
"with the unmasked sequence as well; use a long minimum match (word\r\n",
"length/ word size) like 40 bp to identify exact matches and avoid most\r\n",
"background. Unfortunately the maximum word length that can be used in\r\n",
"the NCBI BLASTN program is 18 (due to memory limitations).\r\n",
"\r\n",
"\r\n",
"\r\n",
"4.2 Identification of DNA source (contamination detection)\r\n",
"\r\n",
"Bacterial insertion elements \r\n",
"\r\n",
"Bacterial insertion sequences (IS elements) often pop up in foreign\r\n",
"sequences, as their activity in the E. coli is not always successfully\r\n",
"suppressed during cloning. As late as 2002, human entries in the\r\n",
"'finished' section of GenBank contained over a hundred IS elements.\r\n",
"\r\n",
"With each run, RepeatMasker includes a quick check for bacterial\r\n",
"insertion elements that may have inserted during cloning. You can turn\r\n",
"this off with the -no_is option. The -is_only option limits the run to\r\n",
"this check only.\r\n",
"\r\n",
"When a full-length element is found and a target site duplication is\r\n",
"confirmed, its location is both reported to the screen and stored in a\r\n",
".alert file. The latter also contains information of possible\r\n",
"mouse<->human contamination.\r\n",
"\r\n",
"-is_clip, -is_only\r\n",
"\r\n",
"With the -is_only and is_clip options, the detected IS and one of the\r\n",
"flanking repeats is clipped out to restore the pre-cloning artifact\r\n",
"situation before comparison with the repeat databases. The original\r\n",
"query FASTA file will remain unchanged. An insertion-sequence-clipped,\r\n",
"but otherwise unmasked query sequence is printed to .withoutIS.\r\n",
"\r\n",
"For single sequences larger than 4 Mbp, the -maxsize option needs to\r\n",
"be set to a number larger than the sequence length to retrieve this\r\n",
"file.\r\n",
"\r\n",
"With either of these options, a properly adjusted quality string is\r\n",
"printed to a file with the suffix .qual.withoutIS when a corresponding\r\n",
"PHRED quality file (.qual) is in the same directory. Note that these\r\n",
"names won't be such that the clipped sequence and quality file form a\r\n",
"pair for subsequent cross_match/PHRAP work. They need to be renamed,\r\n",
"as I assume one wants to do anyway.\r\n",
"\r\n",
"\r\n",
"Most but not all IS elements can be precisely cut out. The element may\r\n",
"be at the edge of a sequence, or (rarely) the element may have\r\n",
"inserted improperly, lacking target site duplications or missing\r\n",
"terminal bases (internal deletion products are generally handled\r\n",
"okay). These matches are reported, but are left untouched even in\r\n",
"_is_only or is_clip mode.\r\n",
"\r\n",
"The location of any IS element is both reported to the screen and\r\n",
"stored in an .alert file. The latter also contains information of\r\n",
"possible mouse<->human contamination.\r\n",
"\r\n",
"Here are the specifics of IS element insertions:\r\n",
"\r\n",
"IS1 8-10 bp duplication\r\n",
"IS2 5 bp duplication; published sequence was too short\r\n",
"IS3 3 bp duplication\r\n",
"IS4 No examples of clonal artifacts; no dup site info\r\n",
"IS5 4 bp duplication; preferred target TCTAGA\r\n",
"IS10 9 bp duplication; extreme preference for CGCTNAGCN; published\r\n",
" sequence for IS5 & 10 were too long, included preferred target site\r\n",
"IS30 2 bp duplication\r\n",
"IS150 3 bp dup, with one exception (4 bp); strong pref for CAGNNTGGGGCY\r\n",
"IS186 10 or 11 bp dup\r\n",
"Tn1000 5 bp duplication;\r\n",
"\r\n",
"\r\n",
"\r\n",
"\r\n",
"Human, mouse, or rat sequence contamination or mix-up.\r\n",
"\r\n",
"A straightforward way to distinguish murine and human DNA is by\r\n",
"checking for either rodent-specific or primate specific\r\n",
"repeats. Likewise, rodent or primate contamination in any other\r\n",
"mammalian or non-mammalian background can be picked up as well. If\r\n",
"your lab has, say, a rat and a pink fairy armadillo sequencing\r\n",
"project, rat DNA in a supposedly armadillo sequence can be picked up\r\n",
"quite reliably, depending on the length of the query.\r\n",
"\r\n",
"When the option -rodspec or -primspec is used, RepeatMasker only\r\n",
"checks the query against a small library of repeats that have not\r\n",
"(yet) been observed in the 'other' species. The locations of the\r\n",
"matches are printed to .alert. This function will be expanded to\r\n",
"other mammals, when these species are starting to be sequenced in\r\n",
"earnest.\r\n",
"\r\n",
"I've checked for the specificity of the reported matches quite\r\n",
"extensively. Whenever two or more types of repeats are reported, the\r\n",
"odds are that the alert is correct. Very occasionally, a single\r\n",
"reported match could be a false alert. This is especially possible\r\n",
"when a 'new' mammalian species is analyzed, because, unbeknownst to\r\n",
"me, a related repeat may have amplified in such a genome.\r\n",
"\r\n",
"\r\n",
"Other species contamination.\r\n",
"\r\n",
"When a supposedly mammalian clone is of non-mammalian origin, very few\r\n",
"if any interspersed repeats will be reported by RepeatMasker. Rodent\r\n",
"or primate genomic sequences are on average 40-50% dense in\r\n",
"recognizable interspersed repeats, so that any stretch of genomic DNA\r\n",
"of significant length (say 30 kb or more) showing less than 10%\r\n",
"density in interspersed repeats is of suspect origin. An automated\r\n",
"alert for such a situation is not included, as query sequences of\r\n",
"coding regions or transcripts, generally of very low repeat density,\r\n",
"would constantly cause an alert.\r\n",
"\r\n",
"\r\n",
"\r\n",
"4.3 DateRepeats - Masking lineage-specific repeats for genomic alignments\r\n",
"\r\n",
"Since June 2003 each repeat consensus in the mammalian repeat\r\n",
"libraries has a phylogenetic label. The interspersed repeat is\r\n",
"expected to be found in all species belonging to the specified genus,\r\n",
"family, order, etc. The label is based on the presence or absence of\r\n",
"repeats at orthologous sites in different genomes and on the average\r\n",
"divergence of repeat copies from their derived consensus\r\n",
"sequence. This phylogeny will become much more accurate and refined\r\n",
"over time.\r\n",
"\r\n",
"The tag allows RepeatMasker to compare queries only to repeats\r\n",
"expected to be found in the query species, without having to provide a\r\n",
"library for each species. For example, a rat query currently is\r\n",
"compared to repeats tagged with Rattus, Murinae, Muridae, Rodentia,\r\n",
"Eutheria, and Mammalia.\r\n",
"\r\n",
"It also allows one to mask only those interspersed repeats that have\r\n",
"arrived in a genome after the speciation of two species. For optimal\r\n",
"alignment of genomic sequences of two species, 'ancestral repeats'\r\n",
"that are located at orthologous sites in both genomes (unless deleted)\r\n",
"should not be masked, whereas 'lineage-specific' repeats should be\r\n",
"masked or clipped out. An experimental version of this RepeatMasker\r\n",
"feature has been used in the alignment of the mouse to the human\r\n",
"genome (Waterston et al. 2002). By clipping out rather than masking\r\n",
"the lineage-specific repeats the aligning fraction for the mouse\r\n",
"genome could be increased from 35% to 40% (see also Schwartz et al\r\n",
"2003, NAR 31:3518-24, and Thomas et al 2003, Nature 424:788-93).\r\n",
"\r\n",
"As of the September 2003 version, the RepeatMasker package contains a\r\n",
"script \"DateRepeats\" that takes a RepeatMasker .out file and creates\r\n",
"annotation with added column(s) indicating if a repeat is expected to\r\n",
"be present in the indicated 'other species' as well as a sequence with\r\n",
"lineage-specific repeats masked only. The script currently works only\r\n",
"for a few mammalian species (human, mouse, rat, cat, dog, cow, pig,\r\n",
"horse, rabbit), but refinement is inevitable.\r\n",
"\r\n",
"DateRepeats -query -comp\r\n",
" [-comp -mask ]\r\n",
"\r\n",
"The required flags are:\r\n",
"-q -query the species that has been analyzed\r\n",
"-c -comp other mammalian species; can be used multiple times, adding extra \r\n",
" columns to the annotation in a \r\n",
"\r\n",
"Optional parameters are:\r\n",
"-m -mask produces a sequence file with all lineage specific repeats masked, i.e.\r\n",
" those predicted to be in the -query and absent in the -mask species\r\n",
" (sequence and RepeatMasker files must be in same directory)\r\n",
" must correspond to one of the -comp species\r\n",
"-a -aggressive also mask those repeats unclear to be lineage specific or ancestral\r\n",
"-n -nolow does not mask (micro)satellites or low complexity DNA, which are \r\n",
" generally lineage specific, but hard to date \r\n",
"(-a and -n have no effect unless -m is used)\r\n",
"\r\n",
"In the first release of this script the for -q, -c, and -m\r\n",
"are limited to human, mouse, rat, cat, dog, cow, pig, horse and\r\n",
"rabbit.\r\n",
"\r\n",
"\r\n",
"For example the command line:\r\n",
"DateRepeats chr3_4000001_4005000.out -q mouse -c rat -c human\r\n",
"prints the following output to chr3_4000001_4005000.out_rat_human:\r\n",
"\r\n",
" SW perc perc perc query position in query matching repeat position in repeat rat hum\r\n",
"score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID\r\n",
"\r\n",
"12436 19.0 1.7 7.7 chr3_400 9 536 (4464) + L1_Mur2 LINE/L1 1850 2408 (3469) 1 X 0 \r\n",
" 2728 5.2 0.0 3.9 chr3_400 537 896 (4104) + ORR1A0 LTR/MaLR 1 346 (0) 2 0 0 \r\n",
"12436 19.0 1.7 7.7 chr3_400 897 2441 (2559) + L1_Mur2 LINE/L1 2408 3665 (2212) 1 X 0 \r\n",
" 5229 7.2 0.0 1.1 chr3_400 2442 3134 (1866) + L1Md_F2 LINE/L1 5877 6580 (2) 3 0 0 \r\n",
"12436 19.0 1.7 7.7 chr3_400 3135 4259 (741) + L1_Mur2 LINE/L1 3665 4856 (1021) 1 X 0 \r\n",
" 394 20.9 0.0 0.9 chr3_400 4260 4351 (649) + Lx2 LINE/L1 6892 6996 (1) 4 X 0 \r\n",
"\r\n",
"The X indicates that the repeat is expected to be present at\r\n",
"orthologous sites, while the O predicts an absence. None of the above\r\n",
"repeats are found in the human genome. Notice that a mouse-specific\r\n",
"ORR1 (#2) and L1 (#4) has inserted into a rodent specific L1 (#1).\r\n",
"\r\n",
"\r\n",
"A few lines of the output for\r\n",
"DateRepeats chr19.fa.out -q rat -c mouse -c human -m mouse -a -n\r\n",
"are\r\n",
"1199 17.9 11.2 1.8 chr19 313706 313990 (58909535) C Lx9 LINE/L1 (9) 7635 7324 377 X 0\r\n",
" 23 0.0 0.0 0.0 chr19 314125 314147 (58909378) + AT_rich Low_complexity 1 23 (0) 378 - -\r\n",
" 726 17.5 1.3 8.9 chr19 314152 314308 (58909217) + B1_Rn SINE/Alu 2 146 (2) 379 ? 0\r\n",
" 228 32.8 0.0 7.4 chr19 314355 314476 (58909049) C B3 SINE/B2 (77) 139 27 380 X 0\r\n",
"\r\n",
"The chr19.fa.masked_vs_mouse file contains a sequence appropriately\r\n",
"masked for alignment against mouse. It has repeats 377 & 380\r\n",
"masked. The -n flag leaves the AT-rich region unmasked, while the -a\r\n",
"flag forced it to mask repeat 379 as well. B1_Rn is rat specific, but\r\n",
"the 17.5% divergence from the consensus is much higher than the\r\n",
"average divergence level of non-functional DNA since the rat-mouse\r\n",
"split. It therefore gets the \"?\" assignment. The rules for assigning\r\n",
"question marks are arbitrary (2-fold lower divergence than expected,\r\n",
"1.5x higher than expected for a repeat at the boundary).\r\n",
"\r\n",
"\r\n",
"4.4 Use in gene prediction and other applications\r\n",
"\r\n",
"Predicting genes from a masked sequence has several problems. First,\r\n",
"one should use the option -nolow to avoid masking low complexity\r\n",
"regions and trinucleotide repeats in coding regions. But even with\r\n",
"only interspersed repeats masked, gene prediction programs may fail to\r\n",
"identify exons correctly. As pointed out above, sometimes tail ends of\r\n",
"coding regions may have originated from transposable elements. Some\r\n",
"gene prediction programs suggest the extend of 3' UTRs. These will be\r\n",
"often overestimated in masked DNA, as many genuine poly A signals are\r\n",
"located in interspersed repeats. Finally, even if no coding regions\r\n",
"have been masked, splice sites may be compromised; e.g. the\r\n",
"polypyrimidine region that contributes to an acceptor splice site may\r\n",
"be contained within a repeat.\r\n",
"\r\n",
"Thus, I generally recommend to run a gene prediction program on\r\n",
"unmasked DNA (as well) and compare the predicted genes and exons with\r\n",
"the RepeatMasker output. Some gene prediction program allow you to\r\n",
"force certain exons out of the predictions (e.g. often the old ORFs of\r\n",
"LINE-1 elements and endogenous retroviruses are included in\r\n",
"genes). Work is also in progress at several sites to incorporate\r\n",
"RepeatMasker into gene prediction programs, in which cases matches to\r\n",
"repeats are weighted in along with the other parameters used.\r\n",
"\r\n",
"\r\n",
"Other uses\r\n",
"\r\n",
"Many people mask repeats before designing primers or oligo probes from\r\n",
"sequence data. I've often been told that primers/probes designed from\r\n",
"regions unmasked by RepeatMasker have a much better success rate. A\r\n",
"cautionary note here is that unmasked regions not necessarily are\r\n",
"unique in the genome (e.g. many lower copy repeats are not in the\r\n",
"database yet) and experiments should be performed as if no filtering\r\n",
"against repeats has been done. The alignments can help in designing\r\n",
"primers from sequences that are completely masked. Regions that\r\n",
"diverge much from the consensus are less likely to misbehave than\r\n",
"others.\r\n",
"\r\n",
"RepeatMasker is sometimes used during assembly of large genomic\r\n",
"sequences. This procedure probably is most useful in very Alu rich\r\n",
"regions; in that situation I recommend to only mask the Alus, and\r\n",
"maybe limit the masking to those Alus less than 15% diverged (-div\r\n",
"15).\r\n",
"\r\n",
"There are plenty of other uses, e.g. analysis of repeats can reveal a\r\n",
"lot about the evolution of a locus (deletions vs. insertions,\r\n",
"inversions, approximate time of these events). When you're doing that\r\n",
"you're a specialist and don't need any help from this help file (maybe\r\n",
"from some of the literature cited below though).\r\n",
"\r\n",
"\r\n",
"\r\n",
"5 REFERENCES\r\n",
"\r\n",
"Reference for RepeatMasker\r\n",
"\r\n",
"We appreciate it if you could refer to the web page (Smit,AFA &\r\n",
"Green,P RepeatMasker at http://www.repeatmasker.org) or otherwise to\r\n",
"Smit, AFA & Green, P., unpublished.\r\n",
"\r\n",
"The EMBL format of the RepBase Update database contains references for\r\n",
"individual repeats, as well as annotation with respect to divergence\r\n",
"level, affiliation, copy number, etc. Much if not most of the\r\n",
"information in this database is not published elsewhere. It can be\r\n",
"accessed at http://www.girinst.org/.\r\n",
"\r\n",
"We are trying to keep the nomenclature of the interspersed repeats in\r\n",
"the output of RepeatMasker identical to that of the reference\r\n",
"database. In most cases the names correspond to those most commonly\r\n",
"used in the literature.\r\n",
"\r\n",
"\r\n",
"There is much too much literature out there to list these days. My own\r\n",
"views on the repeat structure in mammalian genomes have most recently\r\n",
"been described in the following papers:\r\n",
"\r\n",
"Waterston et al. (2002) Initial sequencing and comparative analysis of\r\n",
"the mouse genome. Nature 420(6915): 520-62.\r\n",
"\r\n",
"Lander E. S., et al. (2001). Initial sequencing and analysis of the\r\n",
"human genome. Nature 409(6822): 860-921.\r\n",
"\r\n",
"Smit, A.F.A. (1999) Interspersed repeats and other mementos of\r\n",
"transposable elements in mammalian genomes. Curr Opin Genet Devel 9\r\n",
"(6), 657-663.\r\n",
"\r\n",
"If you have ideas for improvements or found a problem, drop a note\r\n",
"at rhubley@systemsbiology.org or asmit@systemsbiology.org\r\n",
"\r\n",
"/*****************************************************************************\r\n",
"* Copyright (C) University of Washington 1996-1999 Developed by Arian Smit,\r\n",
"* Philip Green and Colin Wilson of the University of Washington Department of\r\n",
"* Genomics.\r\n",
"*\r\n",
"* Copyright (C) Arian Smit 2000-2001\r\n",
"*\r\n",
"* Copyright (C) Institute for Systems Biology 2002-2009 Developed by\r\n",
"* Arian Smit and Robert Hubley.\r\n",
"*\r\n",
"* This work is licensed under the Open Source License v2.1. To view a copy\r\n",
"* of this license, visit http://www.opensource.org/licenses/osl-2.1.php or\r\n",
"* see the license.txt file contained in this distribution.\r\n",
"/*****************************************************************************\r\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!./RepeatMasker -dir /Volumes/Bay3/Software/RepeatMasker/out/ -gff oyster.v9_pls.fa "
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"RepeatMasker version open-3.3.0\r\n",
"Search Engine: NCBI/RMBLAST\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Master RepeatMasker Database: /Volumes/Bay3/Software/RepeatMasker/Libraries/RepeatMaskerLib.embl ( Complete Database: 20120418 )\r\n",
"\r\n",
"\r\n",
"\r\n",
"analyzing file oyster.v9_pls.fa\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\r\n",
"Some previous RepeatMasker output files were moved to the directory \r\n",
"/Volumes/Bay3/Software/RepeatMasker/out//oyster.v9_pls.fa.preTueDec91000372014.RMoutput \r\n",
"in order not to overwrite them.\r\n",
"\r\n"
]
}
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!head /Volumes/Bay3/Software/RepeatMasker/Libraries/RepeatMaskerLib.embl"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"CC ****************************************************************\r\n",
"CC *\r\n",
"CC RepeatMasker Database *\r\n",
"CC (C) 1997-2011 Genetic Information Research Institute *\r\n",
"CC All rights reserved *\r\n",
"CC *\r\n",
"CC Prepared by: Smit, A., Hubley, R. *\r\n",
"CC See accompanying README.html/README.txt for details. *\r\n",
"CC *\r\n",
"CC RELEASE 20120418; *\r\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!fgrep -c \"//\" /Volumes/Bay3/Software/RepeatMasker/Libraries/RepeatMaskerLib.embl"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"25053\r\n"
]
}
],
"prompt_number": 3
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
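"The library appears to contain about 25,053 records. This counts lines containing `//`, the EMBL record terminator; because `fgrep` matches the string anywhere on a line, the figure may be a slight overcount."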
]
},
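{
"cell_type": "markdown",
"metadata": {},
"source": [
"A stricter count, as a minimal sketch: EMBL flat files end each record with a line consisting solely of `//`, so counting only those lines avoids also counting lines where `//` appears inside a URL or comment. The helper name `count_embl_records` is hypothetical and not part of RepeatMasker."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def count_embl_records(path):\n",
"    # Count lines that consist solely of the EMBL record terminator '//'.\n",
"    n = 0\n",
"    with open(path) as handle:\n",
"        for line in handle:\n",
"            if line.strip() == '//':\n",
"                n += 1\n",
"    return n\n",
"\n",
"count_embl_records('/Volumes/Bay3/Software/RepeatMasker/Libraries/RepeatMaskerLib.embl')"
],
"language": "python",
"metadata": {},
"outputs": []
},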
{
"cell_type": "code",
"collapsed": false,
"input": [
"!fgrep -c \">\" /Volumes/Bay3/Software/RepeatMasker/oyster.v9_pls.fa"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"11969\r\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "heading",
"level": 2,
"metadata": {},
"source": [
"RepeatMasker output files"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!head /Volumes/Bay3/Software/RepeatMasker/out/oyster.v9_pls.fa.out.gff"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"##gff-version 2\r\n",
"##date 2014-12-10\r\n",
"##sequence-region oyster.v9_pls.fa\r\n",
"C10011\tRepeatMasker\tsimilarity\t3\t33\t 9.7\t+\t.\tTarget \"Motif:(TTTG)n\" 1 31\r\n",
"C10039\tRepeatMasker\tsimilarity\t114\t147\t 5.9\t+\t.\tTarget \"Motif:AT_rich\" 1 34\r\n",
"C10195\tRepeatMasker\tsimilarity\t121\t144\t 0.0\t+\t.\tTarget \"Motif:(GA)n\" 1 24\r\n",
"C10241\tRepeatMasker\tsimilarity\t3\t42\t 5.0\t+\t.\tTarget \"Motif:AT_rich\" 1 40\r\n",
"C10287\tRepeatMasker\tsimilarity\t29\t129\t21.4\t+\t.\tTarget \"Motif:(CTATG)n\" 1 98\r\n",
"C10325\tRepeatMasker\tsimilarity\t42\t62\t 0.0\t+\t.\tTarget \"Motif:(CATA)n\" 3 23\r\n",
"C10409\tRepeatMasker\tsimilarity\t5\t117\t18.4\t+\t.\tTarget \"Motif:(CATAG)n\" 1 109\r\n"
]
}
],
"prompt_number": 6
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!head /Volumes/Bay3/Software/RepeatMasker/out/oyster.v9_pls.fa.ref"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"1070 26.64 1.73 1.03 scaffold1522 126307 126595 (4145) 7SLRNA#SINE/Alu 2 292 (28) [b2368s1i0]\r\n",
"\r\n",
"Matrix = Unknown\r\n",
"Transitions / transversions = Transitions / transversions = Unknown\r\n",
"Gap_init rate = Unknown, avg. gap size = Unknown\r\n",
"\r\n",
"275 26.67 4.00 0.00 scaffold1522 126307 126381 (4359) AluJb#SINE/Alu 138 215 (97) [b2368s1i0]\r\n",
"\r\n",
"Matrix = Unknown\r\n",
"Transitions / transversions = Transitions / transversions = Unknown\r\n"
]
}
],
"prompt_number": 7
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!head -30 /Volumes/Bay3/Software/RepeatMasker/out/oyster.v9_pls.fa.tbl"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"==================================================\r\n",
"file name: oyster.v9_pls.fa \r\n",
"sequences: 11969\r\n",
"total length: 558601156 bp (493101115 bp excl N/X-runs)\r\n",
"GC level: 33.44 %\r\n",
"bases masked: 11116376 bp ( 1.99 %)\r\n",
"==================================================\r\n",
" number of length percentage\r\n",
" elements* occupied of sequence\r\n",
"--------------------------------------------------\r\n",
"SINEs: 458 35352 bp 0.01 %\r\n",
" ALUs 0 0 bp 0.00 %\r\n",
" MIRs 345 28122 bp 0.01 %\r\n",
"\r\n",
"LINEs: 1363 132656 bp 0.02 %\r\n",
" LINE1 229 32509 bp 0.01 %\r\n",
" LINE2 210 20105 bp 0.00 %\r\n",
" L3/CR1 877 75828 bp 0.01 %\r\n",
"\r\n",
"LTR elements: 231 102314 bp 0.02 %\r\n",
" ERVL 5 289 bp 0.00 %\r\n",
" ERVL-MaLRs 1 43 bp 0.00 %\r\n",
" ERV_classI 87 9362 bp 0.00 %\r\n",
" ERV_classII 7 412 bp 0.00 %\r\n",
"\r\n",
"DNA elements: 926 93229 bp 0.02 %\r\n",
" hAT-Charlie 16 834 bp 0.00 %\r\n",
" TcMar-Tigger 329 15568 bp 0.00 %\r\n",
"\r\n",
"Unclassified: 8 765 bp 0.00 %\r\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!head -40 /Volumes/Bay3/Software/RepeatMasker/out/oyster.v9_pls.fa.out"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" SW perc perc perc query position in query matching repeat position in repeat\r\n",
"score div. del. ins. sequence begin end (left) repeat class/family begin end (left) ID\r\n",
"\r\n",
" 192 9.7 0.0 0.0 C10011 3 33 (123) + (TTTG)n Simple_repeat 1 31 (0) 1 \r\n",
" 20 5.9 0.0 0.0 C10039 114 147 (10) + AT_rich Low_complexity 1 34 (0) 2 \r\n",
" 216 0.0 0.0 0.0 C10195 121 144 (15) + (GA)n Simple_repeat 1 24 (0) 3 \r\n",
" 26 5.0 0.0 0.0 C10241 3 42 (118) + AT_rich Low_complexity 1 40 (0) 4 \r\n",
" 272 21.4 0.0 3.1 C10287 29 129 (31) + (CTATG)n Simple_repeat 1 98 (0) 5 \r\n",
" 189 0.0 0.0 0.0 C10325 42 62 (99) + (CATA)n Simple_repeat 3 23 (0) 6 \r\n",
" 210 18.4 0.0 3.7 C10409 5 117 (45) + (CATAG)n Simple_repeat 1 109 (0) 7 \r\n",
" 25 3.1 0.0 0.0 C10589 5 36 (129) + AT_rich Low_complexity 1 32 (0) 8 \r\n",
" 230 8.1 2.7 0.0 C10655 37 73 (93) + (TCCG)n Simple_repeat 4 41 (0) 9 \r\n",
" 222 6.2 0.0 0.0 C10715 67 98 (68) + (G)n Simple_repeat 1 32 (0) 10 \r\n",
" 27 0.0 0.0 0.0 C10845 115 141 (27) + AT_rich Low_complexity 1 27 (0) 11 \r\n",
" 22 0.0 0.0 0.0 C1085 1 22 (82) + AT_rich Low_complexity 1 22 (0) 12 \r\n",
" 318 10.6 0.0 0.0 C10901 1 47 (122) + (TCCG)n Simple_repeat 4 50 (0) 13 \r\n",
" 529 0.0 0.0 0.0 C10907 110 169 (0) + tRNA-Pro-CCA tRNA 1 60 (15) 14 \r\n",
" 270 0.0 0.0 0.0 C10927 1 30 (139) + (TC)n Simple_repeat 1 30 (0) 15 \r\n",
" 46 0.0 0.0 0.0 C10927 44 89 (80) + AT_rich Low_complexity 1 46 (0) 16 \r\n",
" 25 3.1 0.0 0.0 C11015 101 132 (39) + AT_rich Low_complexity 1 32 (0) 17 \r\n",
" 230 8.1 2.7 0.0 C11027 48 84 (87) + (TCCG)n Simple_repeat 4 41 (0) 18 \r\n",
" 258 8.1 0.0 0.0 C11045 91 127 (44) + (G)n Simple_repeat 1 37 (0) 19 \r\n",
" 198 29.3 0.0 5.3 C11083 51 169 (3) + (TACTG)n Simple_repeat 2 114 (0) 20 \r\n",
" 24 5.3 0.0 0.0 C11097 34 71 (101) + AT_rich Low_complexity 1 38 (0) 21 \r\n",
" 23 0.0 0.0 0.0 C11099 136 158 (14) + AT_rich Low_complexity 1 23 (0) 22 \r\n",
" 23 7.8 0.0 0.0 C11113 50 100 (73) + AT_rich Low_complexity 1 51 (0) 23 \r\n",
" 21 3.6 0.0 0.0 C11163 3 30 (143) + AT_rich Low_complexity 1 28 (0) 24 \r\n",
" 249 3.3 0.0 0.0 C11215 36 65 (109) + (TCCG)n Simple_repeat 3 32 (0) 25 \r\n",
" 287 4.9 2.4 0.0 C11227 30 70 (104) + (TCCG)n Simple_repeat 4 45 (0) 26 \r\n",
" 23 9.2 0.0 0.0 C11264 110 174 (0) + AT_rich Low_complexity 1 65 (0) 27 \r\n",
" 305 10.0 2.0 0.0 C11292 88 137 (38) + (CAGA)n Simple_repeat 4 54 (0) 28 \r\n",
" 23 5.4 0.0 0.0 C11302 1 37 (138) + AT_rich Low_complexity 1 37 (0) 29 \r\n",
" 296 24.9 0.0 6.8 C11350 31 155 (21) + (TACTG)n Simple_repeat 3 119 (0) 30 \r\n",
" 29 0.0 0.0 0.0 C11462 88 116 (61) + AT_rich Low_complexity 1 29 (0) 31 \r\n",
" 180 0.0 0.0 0.0 C11464 6 25 (152) + (CA)n Simple_repeat 2 21 (0) 32 \r\n",
" 34 4.2 0.0 0.0 C11480 126 173 (5) + AT_rich Low_complexity 1 48 (0) 33 \r\n",
" 414 0.0 0.0 0.0 C11484 133 178 (0) + (TCTA)n Simple_repeat 2 47 (0) 34 \r\n",
" 311 6.5 2.2 0.0 C11574 49 94 (85) + (TCTG)n Simple_repeat 4 50 (0) 35 \r\n",
" 198 0.0 0.0 0.0 C1159 1 22 (82) + (TC)n Simple_repeat 1 22 (0) 36 \r\n",
" 243 7.9 0.0 0.0 C11614 15 52 (128) + (CATAG)n Simple_repeat 1 38 (0) 37 \r\n"
]
}
],
"prompt_number": 10
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!head /Volumes/Bay3/Software/RepeatMasker/out/oyster.v9_pls.fa.cat"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"23 5.41 0.00 0.00 scaffold18720 145 181 (447) C AT_rich#Low_complexity (0) 300 264 5\r\n",
"22 5.56 0.00 0.00 scaffold23246 290 325 (1600) C AT_rich#Low_complexity (254) 46 11 5\r\n",
"22 20.69 0.00 0.00 scaffold23246 290 318 (1607) AT_rich#Low_complexity 127 155 (145) 5\r\n",
"14 7.14 0.00 0.00 scaffold18356 343 370 (197) AT_rich#Low_complexity 46 73 (227) 5\r\n",
"21 0.00 0.00 0.00 scaffold18356 350 370 (197) C AT_rich#Low_complexity (79) 221 201 5\r\n",
"21 0.00 0.00 0.00 scaffold23246 368 388 (1537) AT_rich#Low_complexity 278 298 (2) 5\r\n",
"24 9.59 0.00 0.00 scaffold31684 635 707 (9200) C AT_rich#Low_complexity (14) 286 214 5\r\n",
"22 3.45 0.00 0.00 scaffold20428 680 708 (313) C AT_rich#Low_complexity (0) 300 272 5\r\n",
"21 0.00 0.00 0.00 scaffold350 685 705 (9342) AT_rich#Low_complexity 30 50 (250) 5\r\n",
"40 3.70 0.00 0.00 scaffold1040 693 746 (3693) AT_rich#Low_complexity 184 237 (63) 5\r\n"
]
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!head /Volumes/Bay3/Software/RepeatMasker/out/oyster.v9_pls.fa.alert"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"The following E coli IS elements could not be confidently clipped out:\r\n",
" IS186#ARTEFACT in scaffold960frag-7: 33085 - 33255\r\n",
" IS186#ARTEFACT in scaffold960frag-7: 33481 - 33575\r\n",
" IS186#ARTEFACT in scaffold960frag-7: 33816 - 33896\r\n",
"The following E coli IS elements could not be confidently clipped out:\r\n",
" IS186#ARTEFACT in scaffold72frag-2: 28834 - 28914\r\n",
"The following E coli IS elements could not be confidently clipped out:\r\n",
" IS10#ARTEFACT in scaffold563frag-9: 9335 - 9438\r\n",
"The following E coli IS elements could not be confidently clipped out:\r\n",
" IS10#ARTEFACT in scaffold1593: 1001 - 1104\r\n"
]
}
],
"prompt_number": 12
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Running RepeatProteinMask"
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!head /Volumes/web/cnidarian/TJGR_oyster.v9.fa.annot"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"pValue Score Method SeqID Begin End Repeat Type Begin End \r\n",
"9.50e-05 109 WUBlastX C10153 3 158 - BEL11_DR LTR/Pao 656 707\r\n",
"8.90e-05 97 WUBlastX C10177 2 157 - CR1-20_SP_pol LINE/L2 679 731\r\n",
"7.30e-12 174 WUBlastX C10191 2 157 - Copia3_OS LTR/Copia 1150 1201\r\n",
"2.60e-01 59 WUBlastX C10245 5 154 - Penelope-13_XT_ LINE/Penelope 435 484\r\n",
"2.00e-04 85 WUBlastX C10291 2 160 - COPMET LTR/Copia 75 125\r\n",
"3.20e-01 50 WUBlastX C10475 3 149 - Tx1-2_BF_pol LINE/L1-Tx1 1083 1131\r\n",
"6.30e-01 59 WUBlastX C10673 37 162 + DIRS-5_NV_pol LTR/DIRS 189 230\r\n",
"1.80e-05 132 WUBlastX C10675 1 165 + CR1-1_LG_pol LINE/L2 827 881\r\n",
"1.30e-02 100 WUBlastX C10805 1 168 - I-2_AC_pol LINE/I 498 553\r\n"
]
}
],
"prompt_number": 11
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cd /Volumes/Bay3/Software/RepeatMasker\n"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"/Volumes/Bay3/Software/RepeatMasker\n"
]
}
],
"prompt_number": 1
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"ls"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"\u001b[31mArrayList.pm\u001b[m\u001b[m* \u001b[31mRepeatMaskerConfig.tmpl\u001b[m\u001b[m*\r\n",
"\u001b[31mArrayListIterator.pm\u001b[m\u001b[m* \u001b[31mRepeatProteinMask\u001b[m\u001b[m*\r\n",
"\u001b[31mCrossmatchSearchEngine.pm\u001b[m\u001b[m* \u001b[31mSearchEngineI.pm\u001b[m\u001b[m*\r\n",
"\u001b[31mDateRepeats\u001b[m\u001b[m* \u001b[31mSearchResult.pm\u001b[m\u001b[m*\r\n",
"\u001b[31mDeCypherSearchEngine.pm\u001b[m\u001b[m* \u001b[31mSearchResultCollection.pm\u001b[m\u001b[m*\r\n",
"\u001b[31mDupMasker\u001b[m\u001b[m* \u001b[31mSeqDBI.pm\u001b[m\u001b[m*\r\n",
"\u001b[31mFastaDB.pm\u001b[m\u001b[m* \u001b[31mSimpleBatcher.pm\u001b[m\u001b[m*\r\n",
"HTMLAnnotHeader.html \u001b[31mTRF.pm\u001b[m\u001b[m*\r\n",
"INSTALL \u001b[31mTRFResult.pm\u001b[m\u001b[m*\r\n",
"\u001b[34mLibraries\u001b[m\u001b[m/ \u001b[31mTaxonomy.pm\u001b[m\u001b[m*\r\n",
"\u001b[31mLineHash.pm\u001b[m\u001b[m* \u001b[31mWUBlastSearchEngine.pm\u001b[m\u001b[m*\r\n",
"\u001b[34mMatrices\u001b[m\u001b[m/ \u001b[31mWUBlastXSearchEngine.pm\u001b[m\u001b[m*\r\n",
"\u001b[31mNCBIBlastSearchEngine.pm\u001b[m\u001b[m* bluegrad.jpg\r\n",
"\u001b[31mProcessRepeats\u001b[m\u001b[m* \u001b[31mconfigure\u001b[m\u001b[m*\r\n",
"\u001b[31mPubRef.pm\u001b[m\u001b[m* daterepeats.help\r\n",
"README license.txt\r\n",
"\u001b[34mRM_15585.FriNov301316572012\u001b[m\u001b[m/ \u001b[34mout\u001b[m\u001b[m/\r\n",
"\u001b[34mRM_28601.FriNov301852042012\u001b[m\u001b[m/ oyster.v9_pls.fa\r\n",
"\u001b[31mRepbaseEMBL.pm\u001b[m\u001b[m* oysterv9_90.fa\r\n",
"\u001b[31mRepbaseRecord.pm\u001b[m\u001b[m* repeatmasker.help\r\n",
"\u001b[31mRepeatAnnotationData.pm\u001b[m\u001b[m* rmblastdb.log\r\n",
"\u001b[31mRepeatMasker\u001b[m\u001b[m* taxonomy.dat\r\n",
"RepeatMaskerConfig.pm \u001b[34mutil\u001b[m\u001b[m/\r\n"
]
}
],
"prompt_number": 2
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!./RepeatProteinMask /Volumes/Bay3/Software/trf_data/in/oyster.v9.fa "
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Masking Simple and Low Complexity Repeats...\r\n"
]
}
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"pwd"
],
"language": "python",
"metadata": {},
"outputs": [
{
"metadata": {},
"output_type": "pyout",
"prompt_number": 3,
"text": [
"u'/Users/sr320/git-repos/nb-2015/TE'"
]
}
],
"prompt_number": 3
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"cd /Volumes/Bay3/Software/RepeatMasker"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"/Volumes/Bay3/Software/RepeatMasker\n"
]
}
],
"prompt_number": 4
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!./RepeatProteinMask -noLowSimple /Volumes/Bay3/Software/trf_data/in/oyster.v9.fa"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Identifying Simple and Low Complexity Repeats...(masking turned off)\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" - TRF : 66143\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" - RepeatMasker: 0\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Masking Repeat Proteins...\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
" - Protein Hits = 58468\r\n"
]
},
{
"output_type": "stream",
"stream": "stdout",
"text": [
"Done!\r\n"
]
}
],
"prompt_number": 9
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!tail /Volumes/Bay3/Software/trf_data/in/oyster.v9.fa.annot"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
"2.00e-06 63 WUBlastX scaffold998 145302 145862 - Gypsy-15_NV_pol LTR/Gypsy 324 514\r\n",
"2.40e-03 78 WUBlastX scaffold998 145314 145685 - Gypsy-1_RO_pol LTR/Gypsy 402 523\r\n",
"2.00e-06 27 WUBlastX scaffold998 145850 146155 - Gypsy-15_NV_pol LTR/Gypsy 223 328\r\n",
"7.10e-02 65 WUBlastX scaffold998 145931 146152 - Gyp3_Cis LTR/Gypsy 351 422\r\n",
"5.30e-07 31 WUBlastX scaffold998 146157 146279 - Gypsy-30_DR_pol LTR/Gypsy-Cigr 46 85\r\n",
"9.91e-01 61 WUBlastX scaffold998 146840 147142 - Ulysses_gag LTR/Gypsy 289 394\r\n",
"3.60e-12 40 WUBlastX scaffold999 165169 165276 + ISL2EU-4_HM_ex DNA/IS4EU 83 118\r\n",
"1.80e-03 23 WUBlastX scaffold999 165639 165812 + ISL2EU-2_HM_ex DNA/IS4EU 322 380\r\n",
"9.20e-05 93 WUBlastX scaffold999 165695 166144 + ISL2EU-5_HM_ex DNA/Kolobok-IS4 368 510\r\n",
"1.50e-10 97 WUBlastX scaffold999 165827 166144 + ISL2EU-1_HM_ex DNA/IS4EU 356 457\r\n"
]
}
],
"prompt_number": 16
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"!wc -l /Volumes/Bay3/Software/trf_data/in/oyster.v9.fa.annot"
],
"language": "python",
"metadata": {},
"outputs": [
{
"output_type": "stream",
"stream": "stdout",
"text": [
" 58469 /Volumes/Bay3/Software/trf_data/in/oyster.v9.fa.annot\r\n"
]
}
],
"prompt_number": 15
},
{
"cell_type": "heading",
"level": 1,
"metadata": {},
"source": [
"Create a BED file with just the protein hits"
]
},
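{
"cell_type": "markdown",
"metadata": {},
"source": [
"A minimal sketch of one way to do this, assuming the `.annot` columns are whitespace-separated in the order shown in the header above (with an unlabeled strand column after `End`) and that `Begin`/`End` are 1-based and inclusive, so the BED start is `Begin - 1`. The function name `annot_to_bed` and the output filename are hypothetical. The `Score` column is copied as-is and is not rescaled to the 0-1000 range that BED nominally expects."
]
},
{
"cell_type": "code",
"collapsed": false,
"input": [
"def annot_to_bed(annot_path, bed_path):\n",
"    # Write one 6-column BED line per protein hit; return the number of hits written.\n",
"    n = 0\n",
"    with open(annot_path) as src, open(bed_path, 'w') as dst:\n",
"        for line in src:\n",
"            fields = line.split()\n",
"            # Skip the header row and anything without the 11 expected columns.\n",
"            if len(fields) != 11 or fields[0] == 'pValue':\n",
"                continue\n",
"            seqid, begin, end, strand, repeat, rtype = fields[3:9]\n",
"            score = fields[1]\n",
"            # Assumption: Begin is 1-based, so subtract 1 for the 0-based BED start.\n",
"            bed = [seqid, str(int(begin) - 1), end, repeat + '#' + rtype, score, strand]\n",
"            dst.write('\\t'.join(bed) + '\\n')\n",
"            n += 1\n",
"    return n\n",
"\n",
"# Input path is from the run above; the output filename is a placeholder.\n",
"annot_to_bed('/Volumes/Bay3/Software/trf_data/in/oyster.v9.fa.annot',\n",
"             'oyster.v9.fa.annot.bed')"
],
"language": "python",
"metadata": {},
"outputs": []
},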
{
"cell_type": "code",
"collapsed": false,
"input": [],
"language": "python",
"metadata": {},
"outputs": []
}
],
"metadata": {}
}
]
}