# extrinsic information configuration file for AUGUSTUS
# include with --extrinsicCfgFile=filename
# date: 01.08.2006
# Mario Stanke (mario@soe.ucsc.edu)
# source of extrinsic information:
# M manual anchor (required)
# P protein database hit
# E EST database hit
# C combined EST/protein database hit
# D Dialign
# R retroposed genes
# T transMapped refSeqs
[SOURCES]
M RM
#
# individual_liability: Only unsatisfiable hints are disregarded. By default this flag is not set
# and the whole hint group is disregarded when one hint in it is unsatisfiable.
#
[SOURCE-PARAMETERS]
T individual_liability
#   feature    bonus    malus   gradelevelcolumns
#              r+/r-
#
# the gradelevel columns have the following format for each source
# sourcecharacter numscoreclasses boundary ... boundary gradequot ... gradequot
#
[GENERAL]
      start      1      1   M    1  1e+100   RM  1     1
       stop      1      1   M    1  1e+100   RM  1     1
        tss      1      1   M    1  1e+100   RM  1     1
        tts      1      1   M    1  1e+100   RM  1     1
        ass      1      1   M    1  1e+100   RM  1     1
        dss      1      1   M    1  1e+100   RM  1     1
   exonpart      1      1   M    1  1e+100   RM  1     1
       exon      1      1   M    1  1e+100   RM  1     1
 intronpart      1      1   M    1  1e+100   RM  1     1
     intron      1      1   M    1  1e+100   RM  1     1
    CDSpart      1      1   M    1  1e+100   RM  1     1
        CDS      1      1   M    1  1e+100   RM  1     1
    UTRpart      1      1   M    1  1e+100   RM  1     1
        UTR      1      1   M    1  1e+100   RM  1     1
     irpart      1      1   M    1  1e+100   RM  1     1
nonexonpart      1      1   M    1  1e+100   RM  1     1.15
  genicpart      1      1   M    1  1e+100   RM  1     1
#
# Explanation:
#
# The gff/gtf file containing the hints must contain somewhere in the last
# column an entry source=?, where ? is one of the source characters listed in
# the line after [SOURCES] above. You can use different sources when you have
# hints of different reliability of the same type, e.g. exon hints from ESTs
# and exon hints from evolutionary conservation information.
#
# In the [GENERAL] section the entry in the second column specifies a bonus for
# obeying a hint and the entry in the third column specifies a malus (penalty) for
# predicting a feature that is not supported by any hint.
# The bonus and the malus are factors by which the posterior probability of a
# gene structure is multiplied.
# Example:
#   CDS     1000  0.7  ....
# means that, when AUGUSTUS is searching for the most likely gene structure,
# every gene structure that has a CDS exactly as given in a hint gets
# a bonus factor of 1000. Also, for every CDS that is not supported the
# probability of the gene structure gets a malus of 0.7. Increase the bonus to
# make AUGUSTUS obey more hints, decrease the malus to make AUGUSTUS predict fewer
# features that are not supported by hints. The malus helps to increase
# specificity, e.g. when the exons predicted by AUGUSTUS are suspicious because
# there is no evidence from ESTs, mRNAs, protein databases, sequence
# conservation or transMapped expressed sequences.
# Setting the malus to 1.0 disables those penalties. Setting the bonus to 1.0
# disables the boni.
#
# start: translation start (start codon), specifies an interval that contains
#        the start codon. The interval can be larger than 3bp, in which case
#        every potential start codon in the interval gets a bonus. The highest bonus is given
#        to potential start codons in the middle of the interval, the bonus fades off towards the ends.
#        For most species every start codon is an ATG, but if an alternative translation_table is used,
#        then all potential start codons are considered.
# stop:  translation end (stop codon), see 'start'
# tss:   transcription start site, see 'start'
# tts:   transcription termination site, see 'start'
# ass:   acceptor (3') splice site, the last intron position
# dss:   donor (5') splice site, the first intron position
# exonpart: part of an exon in the biological sense. The bonus applies only
#           to exons that contain the interval from the hint. Just
#           overlapping means no bonus at all. The malus applies to every
#           base of an exon. Therefore the malus for an exon is exponential
#           in the length of an exon: malus=exonpartmalus^length.
#           Therefore the malus should be close to 1, e.g.
#           0.99.
# exon: exon in the biological sense. Only exons that exactly match the
#       hint get a bonus. Exception: The exons that contain the start
#       codon and stop codon. This malus applies to a complete exon
#       independent of its length.
# intronpart: introns both between coding and non-coding exons. The bonus
#             applies to every intronic base in the interval of the hint.
# intron: An intron gets the bonus if and only if it is exactly as in the hint.
# CDSpart: part of the coding part of an exon. (CDS = coding sequence)
# CDS: coding part of an exon with exact boundaries. For internal exons
#      of a multi exon gene this is identical to the biological
#      boundaries of the exon. For the first and the last coding exon
#      the boundaries are the boundaries of the coding sequence (start, stop).
# UTR: exact boundaries of a UTR exon or the untranslated part of a
#      partially coding exon.
# UTRpart: The hint interval must be included in the UTR part of an exon.
# irpart: The bonus applies to every base of the intergenic region. If UTR
#         prediction is turned on (--UTR=on) then UTR is considered
#         genic. If, contrary to the usual meaning, you choose the bonus of
#         irpart to be much smaller than 1 in the configuration file, you
#         can force AUGUSTUS not to predict an intergenic region in the
#         specified interval. This is useful if you want to tell AUGUSTUS
#         that two distant exons belong to the same gene, when AUGUSTUS
#         tends to split that gene into smaller genes.
# nonexonpart: intergenic region or intron. The bonus applies to every non-exon
#              base that overlaps with the interval from the hint. It is
#              geometric in the length of that overlap, so choose it close to
#              1.0. This is useful as a weak kind of masking, e.g. when it is
#              unlikely that a retroposed gene contains a coding region but you
#              do not want to completely forbid exons.
# genicpart: everything that is not intergenic region, i.e. intron or exon or UTR if
#            applicable.
#            The bonus applies to every genic base that overlaps with the
#            interval from the hint. This can be used in particular to make AUGUSTUS
#            predict one gene between positions a and b if a and b are experimentally
#            confirmed to be part of the same gene, e.g. through ESTs from the same clone.
#            alias: nonirpart
#
# Any hints of types dss, intron, exon, CDS, UTR that (implicitly) suggest a donor splice
# site allow AUGUSTUS to predict a donor splice site that has a GC instead of the much more common GT.
# AUGUSTUS does not predict a GC donor splice site unless there is a hint for one.
#
# Starting in column number 4 you can tell AUGUSTUS how to modify the bonus
# depending on the source of the hint and the score of the hint.
# The score of the hints is specified in the 6th column of the hint gff/gtf.
# If the score is used at all, it is not used directly through some
# conversion formula but by distinguishing different classes of scores, e.g. low
# score, medium score, high score. The format is the following:
# First, you specify the source character, then the number of classes (say n), then you
# specify the score boundaries that separate the classes (n-1 thresholds) and then you specify
# for each score class the multiplicative modifier to the bonus (n factors).
#
# Examples:
#
# M 1 1e+100
# means for the manual hint there is only one score class, the bonus for this
# type of hint is multiplied by 10^100. This practically forces AUGUSTUS to obey
# all manual hints.
#
# T 2 1.5 1 5e29
# For the transMap hints distinguish 2 classes: those with a score below 1.5 and
# those with a score above 1.5. The bonus of the lower score hints is unchanged and
# the bonus of the higher score hints is multiplied by 5x10^29.
#
# D 8 1.5 2.5 3.5 4.5 5.5 6.5 7.5 0.58 0.4 0.2 2.9 0.87 0.44 0.31 7.3
# Use 8 score classes for the DIALIGN hints. DIALIGN hints give a score, a strand and
# reading frame information for CDSpart hints.
# The strand and reading frame are often correct but not
# often enough to rely on them. To account for that I generated hints for all
# 6 combinations of a strand and reading frame and then used 2x2x2=8 different
# score classes:
# {low score, high score} x {DIALIGN strand, opposite strand} x {DIALIGN reading frame, other reading frame}
# This example shows that scores don't have to be monotonic. A higher score
# does not have to mean a higher bonus. They are merely a way of classifying the
# hints into categories as you wish. In particular, you could get the effect of
# having different sources by having just hints of one source and then distinguishing
# more score classes.
#
#
# Future plans:
# - Add fuzzy intron hints. Introns get a bonus only when they approximately
#   have the same boundaries as in the hint.
# - Make the splice site hints fuzzy also. Allow a hint interval that contains a
#   likely splice site, as opposed to only an individual position.
# - Write a program that automatically optimizes the boni and mali given an
#   annotated test set of genes and hints for that set of sequences.
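#
# Worked example of a [GENERAL] line (an illustrative sketch, not a tuned
# configuration). The choice of source E and all numbers are assumptions made
# only to show the format described in the explanation above. To use such a
# line, E would also have to be listed in the line after [SOURCES].
#
#   exonpart   2   0.99   M 1 1e+100   E 2 10 1 100
#
# Read from left to right: the bonus is 2 and the malus is 0.99 per exon base.
# Manual hints (M) form a single score class whose bonus is multiplied by 1e+100.
# EST hints (E) form 2 score classes separated at the score boundary 10: hints
# with a score below 10 keep the bonus unchanged (2 x 1 = 2) and hints with a
# score above 10 get a bonus of 2 x 100 = 200.
# With a malus of 0.99, an unsupported exon of length 100 gets a malus of
# 0.99^100, which is about 0.37.
#
# A hint that falls into the higher score class could look like the following
# gff line (the sequence name, program name and coordinates are hypothetical).
# The score is in the 6th column and the source in the last column:
#
#   chr1   myESTaligner   exonpart   1200   1350   25   +   .   source=E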