program alpro(protseq, alprop, symvec, sequ, output); (* alpro: frequency and information of aligned sequences Thomas D. Schneider, Ph.D. National Institutes of Health National Cancer Institute Gene Regulation and Chromosome Biology Laboratory Molecular Information Theory Group Frederick, Maryland 21702-1201 schneidt@mail.nih.gov http://alum.mit.edu/www/toms (permanent) *) label 1; (* end of program *) const (* begin module version *) version = 2.02; (* of alpro.p 2011 Apr 14 2011 Apr 14, 2.02: report number of gap columns 2011 Mar 09, 2.01: use key Shenkin.Mastrandrea1991 2011 Mar 09, 2.00: why blanks on output? extra writeln removed. 2010 Sep 07, 1.99: actually update the alprop! 2010 Sep 07, 1.98: create sequ file if requested 2007 Jul 16, 1.97: clean up 2007 Jul 16, 1.96: handle sequences with dots when in fasta format 2006 Aug 02, 1.95: cleanup 2006 Aug 02, 1.94: crashing case: eof in read of sequence 2006 Jan 30, 1.93: use name consistently: globin.protseq 2003 Aug 4, 1.92: tell which letter was not allowed, ' ' and '-' allowed 2002 Jan 4, 1.91: upgrade version number processing 2000 Dec 10, 1.90: clustalw2alpro.pl and msf2alpro.pl announced 2000 Dec 6, 1.89: autoupgrade of alprop 2000 Nov 28, 1.86: upgrade documentation, check gpc compatibility 2000 Jun 17, 1.85: maximum sequence set higher 2000 Mar 17, 1.84: bug fixd: 2 bits is not the maximum for proteins! 1999 Dec 30, 1.83: report range 1999 Dec 10, 1.81: bug removed: genomic composition now works for > kickover 1999 Nov 29, 1.71: new parameters for genomic composition. WARNING: old parameter files will be blocked, see examples for new. 1996 September 10: parameter for doing varlogs added. 1996 July 9: program now accepts standard fasta format origin 1990 May 7 edited by RMS June 1991 *) updateversion = 1.99; (* defines lowest acceptable current parameter file *) (* end module version *) (* begin module describe.alpro *) (* name alpro: frequency and information of aligned sequences synopsis alpro(protseq: in, alprop: inout, symvec: out, sequ: output, output: out) files protseq: Aligned sequences in one of two formats. The first line, intended for identification of the entire data set, is skipped. This header line must begin with an asterisk '*' or '>'. When the header begins with '>', fasta format is used, otherwise the original protseq format is used. In the original protseq format, the remaining lines are used for the sequences. They are divided into `entries'. The beginning of an entry has any (positive) number of identification lines, each of which begins with an asterisk '*'. The sequence follows. Gaps are indicated with dashes (-). The end of the sequence is indicated by a period. The program automatically figures out what the sequences are so that the correct kind of information calculation can be made. Sequences can be DNA (ACGT - 4 characters), RNA (with U - 4 characters), protein (20 characters) or alphabetic (26 characters). Fasta format has two differeces. First, all identification lines begin with '>'. Second, sequences do not end with a period. Instead, they end with the next sequence entry identifier (ie another '>') or the end of the file. In this format dashes '-' or dots '.' may be used as the alignment character. If fasta format is used then the dots represent bases of the first sequence. (New as of 2007 jul 16; previously the dot became a dash.) Spaces are allowed in the sequence, but they are ignored. alprop: parameters to control the program, a series of lines: 1. parameterversion: The version number of the program. This allows the user to be warned if an old parameter file is used. 2. alignment: alignment point for the sequences. This allows one to assign the numbering in the symvec. 3. normalization: 4 integers (a, c, g, t) giving the relative frequencies of random sequence to normalize against. Use "1 1 1 1" normally. If the data represent randomized chemical synthesis and there are biases in the bases, use the base biases. Normalization is performed on the frequencies using equations 1 and 2 of Schneider1989: fo(b,l) = rho(b) fi(b,l) (1) f'o(b,l) = f'i(b,l) rho(b) / sum_b [f'i(b,l) rho(b,l)] (2) For fi(b,l) being the frequencies defined by the second line, fo(b,l) should = 1/4 for DNA. This defines rho(b). (Note: rho is not a function of position for this prograam, so rho(b) not rho(b,l).) DO NOT USE THIS FEATURE UNLESS YOU HAVE GENERATED SYNTHETIC RANDOM SEQUENCE as in Schneider1989. USE 1:1:1:1 NORMALLY. See Schneider.ridebate1999 for further discussion. 4. varlogo: If the first letter is 'v' then the makelogo program will produce a 'varlogo'. This method was invented by Peter Shenkin (Shenkin.Mastrandrea1991). In a regular sequence logo the vertical scale is the information content. However in some systems, as in the immunoglobulin variable regions, one is not interested in the conservation, but rather the degree of variability. This is best expressed as the uncertainty Hafter rather than the information R = Hbefore - Hafter. Basically, it "turns over" the curve. 5. genomic composition: 4 integers (gna, gnc, gng, gnt) giving the numbers of A, C, G and T in the genome of the organism from which DNA or RNA sequences come from. This genomic composition is used to compute Hgenome. The information content is Rsequence(l) = Hgenome - (H(b,l) + e(n)), where e(n) is the small sample correction. You can use '1 1 1 1' to set an equiprobable genome. See Schneider.ridebate1999 for a discussion of relevant issues. 6. sequ: If the first letter is 's' then create a file called sequ that contains the full sequences followed by a period which can be used by makebk to create a Delila book. This allows one to convert periods from the protseq into the corresponding letter of the initial sequence. Old versions of alpro will be automatically upgraded to new versions if you set the version number to less than 1. symvec: Table of frequencies and information content. The information measure is corrected for small sample size (Schneider et al, 1986). The format of this file is the same as produced by dalvec. sequ: raw sequences followed by periods for creating a delila book. This is generated only if the 6th parameter is 's'. output: messages to the user description Take an aligned set of sequences and produce input to the makelogo program for producing a sequence logo. Small sample size and odd genomic composition are accounted for. The program will take lines that begin with '>' to accomodate fasta format. However, sequences still must end with a period. This program provides a 'short cut' for making logos. The "longer" route (in terms of numbers of programs and complexity, but not significantly time to compute) is formed by these Delila programs: dbbk.p, catal.p, delila.p, alist.p, encode.p, rseq.p, dalvec.p * dbbk converts from genbank to delila format * catal creates a delila library * delila extracts the precise sequences you want (powerful!) * alist shows the extracted, aligned sequences * encode converts the aligned sequences to binary vectors * rseq converts the binary vectors to a table of computed information * dalvec converts the table of computed information to a symvec Why use alpro? Because it is currently the *only* way to get a protein sequence logo, and it is currently the only way to handle sequences with gaps in them (someday Delila will do these things). Why use the above Delila programs? Because they provide much more flexibility for chosing the range of sites (via Delila) and interfacing with the sequence walker programs (via the information table, rsdata). examples * Example protseq file * This is an example sequence. AG-EGCTT. * This is the second example sequence. * It is the last one. YLREBS-A. Example parameter file (NOTE CHANGED FORMAT AS OF 1999 NOVEMBER 29!): 1.71 version of alpro that this parameter file is designed for. 1 alignment point 1 1 1 1 normalization bases normal a first letter 'v' will give varlogo 1 1 1 1 genomic composition The files globin.protseq (see below) and protseq.fasta are working examples. Use protseq.makelogop and colors.protein with makelogo. If you also use protein.wave as the wave file, you can see how much the logo corresponds to an alpha helix. documentation @article{Hein1990, author = "Jotun Hein", title = "Unified approach to alignment and phylogenies", journal = "Methods Enzymol", volume = "183", pages = "626-645", year = "1990"} @article{Schneider1986, author = "T. D. Schneider and G. D. Stormo and L. Gold and A. Ehrenfeucht", title = "Information content of binding sites on nucleotide sequences", journal = "J. Mol. Biol.", volume = "188", pages = "415-431", year = "1986"} @article{Schneider.Stephens.Logo, author = "T. D. Schneider and R. M. Stephens", title = "Sequence Logos: A New Way to Display Consensus Sequences", journal = "Nucl. Acids Res.", volume = "18", pages = "6097-6100", year = "1990"} @article{Schneider1989, author = "T. D. Schneider and G. D. Stormo", title = "Excess Information at Bacteriophage {T7} Genomic Promoters Detected by a Random Cloning Technique", year = "1989", journal = "Nucl. Acids Res.", volume = "17", pages = "659-674"} @article{Schneider.ridebate1999, author = "T. D. Schneider", title = "Measuring Molecular Information", journal = "Journal of Theoretical Biology", volume = "201", pages = "87-92", note = "\htmladdnormallink {http://www.lecb.ncifcrf.gov/\~{}toms/paper/ridebate/} {http://www.lecb.ncifcrf.gov/\~{}toms/paper/ridebate/}", year = "1999"} as: http://www.lecb.ncifcrf.gov/~toms/paper/ridebate/ @article{Shenkin.Mastrandrea1991, author = "P. S. Shenkin and B. Erman and L. D. Mastrandrea", title = "{Information-theoretical entropy as a measure of sequence variability}", journal = "Proteins", volume = "11", pages = "297--313", pmid = "1758884", comment = "was Shenkin1991", year = "1991"} see also {Standard parameter file: } alprop {PROTEIN EXAMPLE: THE GLOBINS} {To try the alpro program, use the standard} alprop {with a copy of these files as your protseq:} {Example input file for protseq: } globin.protseq {Example like globin.protseq but in fasta format: } globin.fasta {The symvec file generated by alpro with this globin data should be close to or identical with this symvec: } globin.symvec {Then you can use the program that makes the logo,} makelogo.p{, to create a logo. You will need these files:} {symvec (from above or from the archive): } globin.symvec {marks (currently empty): } globin.marks {colors file to use for proteins: } protein.colors {wave file to use for proteins:} protein.wave {makelogop (parameter file to use for this globin example):} globin.makelogop {NOTE: each file needs to have the name that makelogo expects. Get the file and rename it.} {After you run makelogo, the resulting sequence logo should be like this:} globin.logo.ps {Read the manual page on} makelogo.p {to learn how to control the display more.} {There is a more powerful way to make DNA logos. See:} http://www.lecb.ncifcrf.gov/~toms/logoprograms.html {Related programs:} dbbk.p, catal.p, delila.p, alist.p, encode.p, rseq.p, dalvec.p {What the heck is Pascal system error 0? See:} http://www.lecb.ncifcrf.gov/~toms/pascalp2c.html#system.error.0 {Michael Sauder has generously written two perl scripts to convert files into the protseq format that alpro uses:} {convert from CLUSTAL format to protseq:} clustalw2alpro.pl {convert from MSF format to protseq:} msf2alpro.pl author Dr. Thomas D. Schneider National Cancer Institute Laboratory of Experimental and Computational Biology Frederick, Maryland 21702-1201 toms@ncifcrf.gov permanent email: toms@alum.mit.edu http://www.lecb.ncifcrf.gov/~toms/ bugs technical notes Historical note: The program originally only created a vector that contained the characters of the alphabet, so the output was called an 'alvec'. To reflect the use of symbols, the name of the output file was changed to symvec, but I like 'alpro', and 'prosym' is awkward that I decided to keep the name alpro. Later I generalized the program to handle DNA or RNA or alphabetic sequences, but kept the name. Now it might be considered to be the 'alignment professional'. Oh well. The feature which adjusts the stack height when there is a small amounts of data, (described in the second paragraph of page 6100 of the logo paper), has been removed now because the ability to display the variance as a standard deviation by makelogo alerts the person that the position has little data in it. Thanks to Peter Shenkin for the suggestion. The original feature was described as follows: "Positions that contain mostly spacer characters for the alignment are also reduced in weight by multiplying the information by the maximum number of sequences and dividing it by the actual number at the spacer position. Thus if there are 10,000 sequences, a position with 200 A's would would be close to 2 bits of pattern. However, since the position only represents 2% of the sequences, this program would only give it a weight of 0.02*2 = 0.04 bits. A better method is not known. However, this prevents one from being fooled by positions that don't appear in most sequences." *) (* end module describe.alpro *) (* begin module alpro.const *) infofield = 8; (* size of field for printing information in bits *) infodecim = 5; (* number of decimal places for printing information *) kickover = 50; (* the kickover value for computing information *) nfield = 4; (* size of field for printing n, the number of sites *) maxpos = 20000; (* maximum number of positions the program can handle *) (* end module alpro.const *) (* begin module alpro.type *) type sequence = record {zzz} letters: array[1..maxpos] of char; (* one of the sequences *) length: integer; (* the length of the sequence *) end; (* end module alpro.type *) var protseq, (* input aligned sequences *) alprop, (* parameters to control the program *) sequ, (* filled in protseq sequences *) symvec: text; (* output frequency table *) (* begin module halt *) procedure halt; (* stop the program. the procedure performs a goto to the end of the program. you must have a label: label 1; declared, and also the end of the program must have this label: 1: end. examples are in the module libraries. this is the only goto in the delila system. *) begin writeln(output,' program halt.'); goto 1 end; (* end module halt version = 2.21; (@ of calhnb 1988 feb 24 *) (* begin module copyaline *) procedure copyaline(var fin, fout: text); (* copy a line from file fin to file fout *) begin (* copyaline *) while not eoln(fin) do begin fout^ := fin^; put(fout); get(fin) end; readln(fin); writeln(fout); end; (* copyaline *) (* end module copyaline version = 4.08; (@ of prgmod.p 1990 May 7 *) (* begin module alphabetic *) function alphabetic(c: char): boolean; (* Is c in the alphabet? *) begin alphabetic := ((ord(c) >= ord('a')) and (ord(c) <= ord('z'))) or ((ord(c) >= ord('A')) and (ord(c) <= ord('Z'))) end; (* end module alphabetic *) (* begin module alpro.readsequence *) procedure readsequence(var f: text; var s: sequence; var ok: boolean); (* Read in the sequence s from file. The sequence may have a header consisting of any number of lines that begin with an asterisk '*' or '>'. The sequence itself contains any character except space and period '.'. A period ends the sequence. Report if the sequence was read ok. *) var c: char; (* a character from f *) done: boolean; (* done reading the sequence *) i: integer; (* position on the sequence *) begin i := 0; ok := true; with s do begin done := false; while (not done) and ok do begin (* skip header lines *) while (not eof(f)) and ((f^ = '*') or (f^ = '>')) do readln(f); if eof(f) then begin writeln(output,'end of file without end of sequence'); ok := false end; (* read across one line of input *) while (not eoln(f)) and (not done) do begin read(f,c); if c = '.' then done := true (* ignore rest of line *) else if c <> ' ' then begin i := i + 1; if i > maxpos then begin writeln(output,'A sequence is longer than', ' constant maxpos (',maxpos:1,').'); ok := false end else s.letters[i] := c end else begin writeln(output,'no spaces allowed in the sequences'); ok := false end end; readln(f); end; if i = 0 then begin (* sequence will be zero long if there was none read *) writeln(output,'zero length sequence found!'); ok := false end else length := i; end; end; (* end module alpro.readsequence *) (* begin module alpro.readfasta *) procedure readfasta(var f: text; var s: sequence; firstsequence: sequence; var ok: boolean); (* Read in the sequence s from file. The sequence may have a header consisting of any number of lines that begin with '>'. The sequence itself contains any character except space and period '.'. A '>' or end of file ends the sequence. Report if the sequence was read ok. *) var c: char; (* a character from f *) done: boolean; (* done reading the sequence *) i: integer; (* position on the sequence *) begin i := 0; ok := true; with s do begin done := false; while (not done) and ok do begin (* skip header lines *) while (not eof(f)) and (f^ = '>') do readln(f); if eof(f) then begin done := true; end else while (not eoln(f)) and (not done) do begin (* read across one line of input *) read(f,c); if alphabetic(c) or (c = '-') or (c = '.') then begin i := i + 1; if i > maxpos then begin writeln(output,'A sequence is longer than', ' constant maxpos (',maxpos:1,').'); ok := false end else if c = '.' {yyy} { original code: then s.letters[i] := '-' } then begin s.letters[i] := firstsequence.letters[i] end else s.letters[i] := c end else begin if c <> ' ' then begin writeln(output,'only alphabetic characters are', ' allowed in the sequences,', ' not "',c,'"'); ok := false end end end; if eof(f) then done := true else begin if eoln(f) then readln(f); (* 2006 Aug 2: we could be eof now, so f^ bombs. protect it: *) if not eof(f) then begin if f^ = '>' then done := true end else done := true; end; end; if i = 0 then begin (* sequence will be zero long if there was none read *) writeln(output,'zero length sequence found!'); ok := false end else length := i; end; end; (* end module alpro.readfasta *) (* begin module alpro.writesequence *) procedure writesequence(var f: text; s: sequence; linewidth: integer); var {c: char; (* a character from f *)} {done: boolean; (* done reading the sequence *)} i: integer; (* position on the sequence *) line: integer; (* position on the line *) begin line := 0; for i := 1 to s.length do begin if line > linewidth then begin writeln(f); line := 0; end; write(f, s.letters[i]); line := line + 1; end; {yyy} end; (* end module alpro.writesequence *) (* begin module capitalize *) function capitalize(c: char): char; (* convert the character c to upper case *) var n: integer; (* c is the n'th letter of the alphabet *) begin n := ord(c); if (n >= ord('a')) and (n <= ord('z')) then c := chr( n - ord('a') + ord('A')); capitalize := c end; (* end module capitalize version = 4.09; (@ of prgmod.p 1990 May 18 *) (* begin module calehnb *) procedure calehnb(n: integer; gna,gnc,gng,gnt: integer; var hg, ehnb, varhnb: real (* ; debugging: boolean *)); (* calculate e(hnb) in bits/bp (ehnb) for a number (n) of example sequence sites. gna to gnt are the composition to use for the genome probabilities of a to t. the genomic entropy hg and the variance var(hnb) (=varhnb) are also calculated. if the variable debugging is passed to the procedure then the individual combinations of hnb are displayed. note: this procedure should not be broken into smaller procedures so that it remains efficient. version = 3.02; of procedure calehnb 1983 nov 23 *) const maxsize = 200; (* the largest n allowed *) accuracy = 10000; (* less than (1/accuracy) bits error is demanded for the sum of pnb (see variable 'total') at the end of the procedure *) var log2: real; (* natural log of 2, used to find log base 2 *) logn: real; (* log of n *) nlog2: real; (* n * log2 *) gn: integer; (* sum of gna..gnt *) logpa,logpc,logpg,logpt: real; (* logs of genome probabilities *) (* log of n factorial is the sum of i=1 to n of log(i). the array below represents these logs up to n *) logfact: array[0..maxsize] of real; (* precalculated values of -p*log2(p), where p=nb/n for nb = 0 .. n. m stands for minus *) mplog2p: array[0..maxsize] of real; i: integer; (* index for logfact and mplog2p *) logi: real; (* natural log of i *) na, nc, ng, nt: integer; (* numbers of bases in a site *) done: boolean; (* true when the loop is completed *) pnb: real; (* multinomial probability of a combination of na, nc, ng, nt *) hnb: real; (* entropy for a combination of na..nt *) pnbhnb: real; (* pnb*hnb, an intermediate result *) sshnb: real; (* sum of squares of hnb *) (* variables for testing program correctness: *) total: real; (* sum of pnb over all combinations of na..nt if this is not 1.00, the program is in error *) counter: integer; (* counts the number of times through the loop *) begin (* calehnb *) (* prevent access to outside the arrays: *) if n > maxsize then begin writeln(output,' procedure calehnb: n > maxsize (', n:1,'>',maxsize:1,')'); halt end; counter := 0; total := 0.0; log2 := ln(2); logn := ln(n); nlog2 := n * log2; (* get logs of genome probabilities *) gn := gna + gnc + gng + gnt; logpa := ln(gna/gn); logpc := ln(gnc/gn); logpg := ln(gng/gn); logpt := ln(gnt/gn); (* find genomic entropy *) hg := -(gna*logpa + gnc*logpc + gng*logpg + gnt*logpt)/(gn*log2); ehnb := 0.0; (* start error entropy at zero *) sshnb := 0.0; (* make table of log of n factorial up to n and entropies for nb/n *) logfact[0] := 0.0; (* factorial(0) = 0 *) mplog2p[0] := 0.0; for i := 1 to n do begin logi := ln(i); logfact[i] := logfact[i - 1] + logi; mplog2p[i] := - i * (logi - logn) / nlog2 end; (* begin by looking at the combination with all a: na = n *) na := n; nc := 0; ng := 0; nt := 0; (* the following loop simulates a number of nested loops of the form: for b1=a to t do for b2=b1 to t do for b3=b2 to t do ... for bn=b(n-1) to t do ... the resulting set of variables increase in alphabetic order since no inner loop variable can have a value less than any outer loop. the number of times through the inner-most loop is given by: o = (n + 1)*(n + 2)*(n + 3)/6 in the case where there are four symbols (a,c,g,t) and n is the number of nested loops. a recursive set of loops would be possible, but it would use up too much memory in practical cases (up to n=150 or higher). a second algorithm sequests the loop variables into an array and increments them there. however, the goal is to get all possible combinations for na, nc, ng, nt, where the sum of these is n. the nested loops provide all the combinations in alphabetic order, assuring that there can not be any duplicates. to find nb (one of na..nt) one would look at which of the variables b1 to bn were of value b. this is a wasteful operation. the loop below simulates the array of control variables by changing each nb directly. *) done := false; repeat (* pnb is calculated by taking the log of the expression fact(n) na nc ng nt pnb = ------------------- pa * pc * pg * pt . fact(na).. fact(nt) log(pnb) generates a series of sums, allowing the calculation to proceed by addition and multiplication rather than multiplication and exponentiation. the factorials become tractable in this way *) pnb := exp(logfact[n] (* n factorial *) - logfact[na] - logfact[nc] - logfact[ng] - logfact[nt] + na * logpa + nc * logpc + ng * logpg + nt * logpt); hnb := mplog2p[na] + mplog2p[nc] + mplog2p[ng] + mplog2p[nt]; pnbhnb := pnb * hnb; ehnb := ehnb + pnbhnb; sshnb := sshnb + pnbhnb * hnb; (* sum of squares of hnb *) (* the following section keeps track of the calculation and writes out the current set of nb. *) counter := succ(counter); (* if debugging then begin write(output,' ',counter:2,' '); for i := 1 to na do write(output,'a'); for i := 1 to nc do write(output,'c'); for i := 1 to ng do write(output,'g'); for i := 1 to nt do write(output,'t'); write(output,' ',na:3,nc:3,ng:3,nt:3); writeln(output,' pnb = ',pnb:10:5); end; *) total := total + pnb; (* the remaining portion of this repeat loop generates the values of na, nc, ng and nt. notice that there are 7 possibilities at each loop increment. other than the stop, in each case the sum of na+nc+ng+nt remains constant (=n). *) if nt > 0 then begin (* ending on a t - do outer loops *) if ng > 0 then begin (* turn g into t *) ng := ng - 1; nt := nt + 1 end else if nc > 0 then begin (* turn one c into g, and all t to g (note ng = 0 initially) *) nc := nc - 1; ng := nt + 1; nt := 0 end else if na > 0 then begin (* turn one a into c and all g and t to c. (note ng=nc=0 initially) *) na := na - 1; nc := nt + 1; nt := 0 end else done := true (* since nt = n *) end else begin (* no t - increment innermost loop *) if ng > 0 then begin (* turn g into t *) ng := ng - 1; nt := nt + 1 end else if nc > 0 then begin (* turn c into g *) nc := nc - 1; ng := ng + 1 end else begin (* na > 0; turn a into c *) na := na - 1; nc := nc + 1 end end until done; (* final adjustment: we only have the sum of squares so far *) varhnb := sshnb - ehnb*ehnb; (* if this message appears, there is either a bug in the code or the computer cannot be as accurate as requested *) if accuracy <> round(accuracy * total) then begin writeln(output,' procedure calehnb: the sum of probabilities is'); writeln(output,' not accurate to one part in ',accuracy:1); writeln(output,' the sum of the probabilities is ',total:10:8); end; (* if this message appear, then there is an error in the repeat-until loop: it did not repeat as many times as is expected from the algorithm *) if counter <> round((n+1)*(n+2)*(n+3)/6) then begin writeln(output,' procedure calehnb: program error, the number of'); writeln(output,' calculations is in error'); halt end; (* writeln(output, ' total: ',total:10:5); writeln(output,' count = ',counter:1); writeln(output,' (n+1)*(n+2)*(n+3)/6 = ', round((n+1)*(n+2)*(n+3)/6):1); *) end; (* calehnb *) (* end module calehnb version = 2.21; (@ of calhnb 1988 feb 24 *) (* begin module calaehnb *) procedure calaehnb(n: integer; gna,gnc,gng,gnt: integer; var hg, aehnb, avarhnb: real); (* calculate ae(hnb) in bits/bp (=aehnb) for a number (n) of example sequence sites. gna to gnt are the composition to use for the genome probabilities of a to t. the genomic entropy (=hg) and the variance avar(hnb) (=avarhnb) are also calculated. this procedure gives approximations for e(hnb) and var(hnb). it is based on a formula derived by jeff haemer. see also: g. p. basharin theory probability appl. 4(3): 333-336 (1959) 'on a statistical estimate for the entropy of a sequence of independent random variables' version = 3.01; of procedure calaehnb 1983 nov 23 *) var log2: real; (* natural log of 2 *) gn: integer; (* sum of gna..gnt *) pa, pc, pg, pt: real; (* genomic probabilities *) e: real; (* the approximate sampling error *) begin (* calaehnb *) log2 := ln(2); gn := gna + gnc + gng + gnt; pa := gna/gn; pc := gnc/gn; pg := gng/gn; pt := gnt/gn; hg := -( (pa*ln(pa)) +(pc*ln(pc)) +(pg*ln(pg)) +(pt*ln(pt)))/log2; e := 3/(2*log2*n); aehnb := hg - e; avarhnb := e * e end; (* calaehnb *) (* end module calaehnb version = 2.21; (@ of calhnb 1988 feb 24 *) (* begin module alpro.themain *) procedure themain(var protseq, alprop, symvec, sequ: text); (* the main procedure of the program *) const normalsum = 100000; (* the sum to which the normalized data should come to *) var alignment: integer; (* the alignment point of the sequences *) b: integer; (* index to the letters *) c: char; (* a character from protseq *) e: real; (* for calculating small sample bias and variance *) f: real; (* frequency of a letter *) fasta: boolean; (* if true, read sequence in fasta format *) fprimeo: array[0..26, 1..maxpos] of real; (* normalized frequencies *) gna,gnc,gng,gnt: integer; (* genomic composition *) i: integer; (* position on the sequence *) imax: integer; (* maximum position on the sequence *) ln2: real; (* natural log of 2 for calculating bits *) hg: real; (* uncertainty value returned by procedure calehnb *) Hgenomic: real; (* genomic uncertainty *) gaps: integer; (* number of gap columns *) ehnb: real; (* E(Hnb) *) Hmax: real; (* log2 of s *) lengthchange: boolean; (* a new sequence has different length *) n: integer; (* counting the input sequences *) na, nc, ng, nt: integer; (* describe bias in input frequencies of a randomization experiment *) normalizing: boolean; (* if true, normalize the frequencies by na, nc ... *) normsum: real; (* sum to compute normalization, sum of f'*rho *) ntotal: integer; (* sum of na, nc, ng, nt *) ntrue: integer; (* true count in a position, deleting blanks *) nbi: array[0..26, 1..maxpos] of integer; (* number of each each amino acid (b) at each position (i). zero is for blanks. *) parameterversion: real; (* parameter version number *) previouslength: integer; (* lenght of the previous sequence read in *) rho: array[0..26] of real; (* normalizing factor *) r: real; (* information at a position *) RNA: boolean; (* true of U was seen in the sequences and number of symbols s is 4 *) s: integer; (* number of symbols *) avarhnb: real; (* variance of the information r *) seq: sequence; (* one of the sequences *) seq1: sequence; (* the first sequence *) sequdo: boolean; (* parameter to control writing of the sequ file *) varlogo: boolean; (* if true, make data output for a varlogo *) willdie: boolean; (* the program will halt later on *) ok: boolean; (* a bad sequence poisons the program *) begin writeln(output,'alpro ',version:4:2); reset(protseq); rewrite(symvec); (* default values: *) alignment := 0; na := 1; nc := 1; ng := 1; nt := 1; varlogo := false; gna := 1; gnc := 1; gng := 1; gnt := 1; (* read parameters *) reset(alprop); if not eof(alprop) then begin readln(alprop, parameterversion); (* This mechanism may not have proper rounding; replaced on 2002 Jan 4 if (parameterversion < updateversion) or (round(parameterversion) = parameterversion) then begin *) if round(100*parameterversion) < round(100*updateversion) then begin writeln(output, 'You have an old parameter file!'); (* if the parameter given was less than 1.00, provide a new file *) if parameterversion > 1.00 then begin halt end else begin writeln(output,'WARNING: automatically upgrading alprop'); rewrite(alprop); writeln(alprop,version:4:2, ' version of alpro that this parameter file is designed for.'); writeln(alprop,'1 alignment point'); writeln(alprop,'1 1 1 1 normalization bases'); writeln(alprop,'normal a first letter ''v'' will give varlogo'); writeln(alprop,'1 1 1 1 genomic composition'); writeln(alprop,'sequ a first letter ''s'' will give the sequ file'); reset(alprop); readln(alprop); (* skip the version number *) end end; readln(alprop, alignment); if not eof(alprop) then readln(alprop, na, nc, ng, nt); if not eof(alprop) then if alprop^ = 'v' then varlogo := true else varlogo := false else begin writeln(output,'missing most of the alprop file'); halt; end; readln(alprop); if not eof(alprop) then readln(alprop, gna, gnc, gng, gnt); if not eof(alprop) then if alprop^ = 's' then sequdo := true else sequdo := false else sequdo := false; readln(alprop); end; if not ((na=nc) and (nc=ng) and (ng=nt)) then normalizing := true else normalizing := false; if varlogo then writeln(output,'Creating varlogo form of data.') else writeln(output,'Creating logo form of data.'); if (na < 1) or (nc < 1) or (ng < 1) or (nt < 0) then begin writeln(output, 'normalizing integers must be positive'); halt end; ntotal := na + nc + ng + nt; if ntotal < 1 then begin writeln(output, 'sum normalizing integers must be positive'); halt end; writeln(output,'alignment at ',alignment:1); writeln(output,'bases for normalization: ',na:1,' ',nc:1,' ',ng:1,' ',nt:1); writeln(output,'genomic composition: ',gna:1,' ',gnc:1,' ',gng:1,' ',gnt:1); if eof(protseq) then begin writeln(output,'empty of protseq file'); halt end; if protseq^ = '>' then fasta := true else if protseq^ = '*' then fasta := false else begin writeln(output,'protseq must be either fasta or original format'); halt end; if fasta then writeln(output, 'Fasta format') else writeln(output, 'protseq format'); (* start the header of the symvec from the protseq *) writeln(symvec,'* alpro ',version:4:2); if fasta then while not eof(protseq) and (protseq^ = '>') do begin get(protseq); (* skip the ">" *) write(symvec,'*'); copyaline(protseq, symvec); end else while not eof(protseq) and (protseq^ = '*') do copyaline(protseq,symvec); (* clear the table *) for i := 1 to maxpos do for b := 0 to 26 do nbi[b,i] := 0; (* 2010 Sep 07 *) if sequdo then begin writeln(output,'Writing sequ file'); rewrite(sequ); end; s := 4; n := 0; RNA := false; imax := 0; willdie := false; previouslength := -maxint; while not eof(protseq) do begin (* read one sequence *) n := n + 1; if fasta then readfasta(protseq,seq,seq1,ok) else readsequence(protseq,seq,ok); if n = 1 then begin seq1 := seq; writeln(output, 'copied sequence:'); writesequence(output, seq1, 60); end; if n = 2 then begin writeln(output, 'sequence ',n:1,':'); writesequence(output, seq, 60); end; (* 2010 Sep 07 *) if sequdo then begin writesequence(sequ, seq, 60); writeln(sequ, '.'); end; if not ok then willdie := true; { The old method was too verbose! Silence it. writeln(output,'sequence ',n:3,' is ',seq.length:3,' letters long'); } if seq.length <> previouslength then begin writeln(output,'sequence ',n:1,' is ',seq.length:3,' letters long'); previouslength := seq.length; lengthchange := true; end else begin if lengthchange then begin writeln(output,'... ', '(only sequences that differ in length will be noted)'); writeln(output, 'The range is: from ', (1-alignment):nfield, ' to ', (seq.length-alignment):nfield); lengthchange := false end end; for i := 1 to seq.length do begin c := capitalize(seq.letters[i]); if c = 'U' then RNA := true; if (ord(c) < ord('A')) or (ord(c) > ord('Z')) then b := 0 else b := ord(c) - ord('A') + 1; (* make b be 1 to 26 for A to Z *) (* determine how many symbols should be output *) if s <> 26 then begin if not(c in ['A','C','G','T','U','-']) then begin if c in ['B','J','O','U','X','Z','-'] then begin writeln(output,'The character ',c, ' is not valid for DNA, RNA or proteins'); s := 26; writeln(output,'Number of symbols has been changed to ',s:1); end else if s = 4 then begin writeln(output,'The character ', c,' is not valid for DNA or RNA'); s := 20; writeln(output,'Number of symbols has been changed to ',s:1); end; end; end; nbi[b,i] := succ(nbi[b,i]); end; if seq.length > imax then imax := seq.length end; if willdie then begin rewrite(symvec); halt end; ln2 := ln(2.0); Hmax := ln(s)/ln2; if s = 4 then begin (* for DNA/RNA, compute genomic uncertainty, Hgenomic *) ntrue := gna+gnc+gng+gnt; Hgenomic := -( + (gna/ntrue)*ln(gna/ntrue) + (gnc/ntrue)*ln(gnc/ntrue) + (gng/ntrue)*ln(gng/ntrue) + (gnt/ntrue)*ln(gnt/ntrue) )/ln2; writeln(output,'Hgenomic = ',(Hgenomic):infofield:infodecim, ' bits'); end else begin Hgenomic := Hmax; end; writeln(output,'Number of symbols, s = ',s:1); writeln(output,'Hmax = ',Hmax:5:2, ' bits'); gaps := 0; if s <> 4 then RNA := false; if s = 26 then writeln(symvec,'* ALPHABETIC ALIGNMENT'); if s = 20 then writeln(symvec,'* PROTEIN ALIGNMENT'); if (s = 4) and (not RNA) then writeln(symvec,'* DNA ALIGNMENT'); if (s = 4) and RNA then writeln(symvec,'* RNA ALIGNMENT'); if (s <> 4) and normalizing then begin writeln(output,'normalization is only possible for DNA/RNA'); halt end; writeln(symvec,'* position, samples, information, variance'); writeln(symvec,s:1,' number of symbols'); (* write out the array *) for i := 1 to imax do begin (* calculate the information *) (* ntrue := n - nbi[0,i]; (@ true number of samples @) Although fast, this caluclation is incorrect because it doesn't account for sequences being MISSING *) (* find the actual number of sequences passing this position *) (* this method guarantees a correct result, but is a tich slow. *) ntrue := 0; for b := 1 to 26 do ntrue := ntrue + nbi[b,i]; if normalizing then begin normsum := 0; for b := 1 to 26 do begin c := chr(b+ord('A')-1); if c in ['A','C','G','T'] then begin case c of 'A': rho[b] := 1/(s*(na/ntotal)); 'C': rho[b] := 1/(s*(nc/ntotal)); 'G': rho[b] := 1/(s*(ng/ntotal)); 'T': rho[b] := 1/(s*(nt/ntotal)); end; (* f'i(b,l) *) normsum := (nbi[b,i]/ntrue) * rho[b] + normsum end else rho[b] := 0.0; end end; if ntrue <= 0 then begin write(output,' WARNING: something is really STRANGE!'); writeln(output,' Position ',i-alignment:1, ' has ',ntrue:1,' sequences!'); gaps := gaps + 1; r := 0.0; avarhnb := 0.0; end else begin r := 0.0; for b := 1 to 26 do begin if normalizing then begin (* normalize *) (* f'o = f'i(b,l) * rho / sum (f'i * rho) *) fprimeo[b,i] := (nbi[b,i]/ntrue) * rho[b]/normsum; f := fprimeo[b,i] end else f := nbi[b,i]/ntrue; if f > 0 then r := r + f * ln(f); (* ie, - (f log f) *) end; r := Hmax + r / ln2; (* add in the Hbefore and convert to bits *) if (ntrue <= kickover) and (s = 4) then begin (* NEW as of 1999 Nov 29: genomic uncertainty computation. Also the old name 'e' was replaced by the correct 'ehnb'. *) calehnb(ntrue, gna,gnc,gng,gnt, hg,ehnb,avarhnb); (* hg - ehnb is the small sample error correction, e(n). *) (* Hmax - Hgenomic corrects for genomic uncertainty *) r := r - (hg - ehnb) - (Hmax - Hgenomic); end else begin (* a correction for small sample sizes *) e := (s-1) / (2 * ln2 * ntrue); r := r - e - (Hmax - Hgenomic); (* calculate small sample variance. The calculation is based on the procedure calaehnb in the rseq.p program. *) avarhnb := e * e; end; (* force positions with little data to count less: (arbitrary method). NOTE: this may not be necessary anymore with the calculation of the variance. *) (* This feature is removed now because the ability to display the variance as a standard deviation alerts the person that the position has little data in it. *) (* r := r * ntrue / n; *) end; if normalizing then begin ntrue := 0; for b := 1 to 26 do ntrue := ntrue + round(fprimeo[b,i]*normalsum); end; (* flip over to get varlogo - an easy way to do it! *) if varlogo then r := Hmax - r; (* display the data *) writeln(symvec,(i-alignment):nfield, ' ',ntrue:infofield, ' ', r:infofield:infodecim, (* variance of rsl in scientific notation: *) ' ', avarhnb:(infofield+3)); (* add 3 to variance, as in dalvec code *) if s = 20 then begin (* only write the 20 amino acids out *) for b := 1 to 26 do begin c := chr(b+ord('A')-1); if not (c in ['B','J','O','U','X','Z']) then writeln(symvec,c,' ',nbi[b,i]:4); end; end else if s = 4 then begin (* only write the 4 bases out *) for b := 1 to 26 do begin c := chr(b+ord('A')-1); (* This produces all 5 symbols, which is wrong... if (c in ['A','C','G','T','U']) then writeln(symvec,c,' ',nbi[b,i]:4); writeln(output,'c=',c); this worked but is awkward: if (c in ['A','C','G','T','U']) then case c of 'A','C','G': writeln(symvec,c,' ',nbi[b,i]:4); 'T': if not RNA then writeln(symvec,c,' ',nbi[b,i]:4); 'U': if RNA then writeln(symvec,c,' ',nbi[b,i]:4); end *) if (c in ['A','C','G']) or ((c = 'T') and (not RNA)) or ((c = 'U') and RNA) then begin if normalizing then writeln(symvec,c,' ',round(fprimeo[b,i]*normalsum):4) else writeln(symvec,c,' ',nbi[b,i]:4); end; end; end else begin (* write all symbols out *) for b := 1 to 26 do begin c := chr(b+ord('A')-1); writeln(symvec,c,' ',nbi[b,i]:4); end; end; end; if gaps > 0 then begin writeln(output,'There are ',gaps:1,' extra columns of gaps in your data.'); end; { writeln(output,' The number of symbols used is ',s:1); } end; (* end module alpro.themain *) begin themain(protseq, alprop, symvec, sequ); 1: end.