program consensus(symvec, consensusp, list, conmarks, output); (* consensus: convert symvec to consensus to point out flaws in consensus Dr. Thomas D. Schneider National Cancer Institute Laboratory of Experimental and Computational Biology Molecular Information Theory Group Frederick, Maryland 21702-1201 toms@ncifcrf.gov permanent email: toms@alum.mit.edu (use only if first address fails) http://www.lecb.ncifcrf.gov/~toms/ *) label 1; (* end of program *) const (* begin module version *) version = 1.17; (* of consensus.p 2005 May 26 2005 May 26, 1.17: cite paper: Consensus Sequence Zen. 2002 Feb 6, 1.16: report consensus so it can be grabbed. 2002 Feb 6, 1.15: report information content of consensus 2002 Feb 4, 1.14: upgrade documentation 2002 Feb 4, 1.12: compressed consensus 2002 Feb 3, 1.11: lower case letters for variations > cutoff 2002 Feb 3, 1.10: functional, improve documentation 2002 Feb 3, 1.05: produces written consensus sequence based on cutoff 2002 Feb 3, 1.04: release to web 2002 Feb 3, 1.02: upgrading 2001 Aug 31, 1.01: functional 2001 Aug 31, 1.00: origin *) updateversion = 1.05; (* defines lowest acceptable current parameter file *) (* end module version *) (* begin module describe.consensus *) (* name consensus: convert symvec to consensus to point out flaws in consensus synopsis consensus(symvec: in, consensusp: in, list: out, conmarks: out, output: out) files symvec: a file that contains base frequency information for a protein or DNA binding site. list: Processing of the symvec to show how the consensus model destroys data. conmarks: consensus sequences in marks format for putting onto a sequence logo. This requires the ShiftLettering function provided in marks.lettering. Concatenate marks.lettering to conmarks to get the marks file of makelogo. That is, in unix: cat marks.lettering conmarks > marks consensusp: parameters to control the program. The file must contain the following parameters, one per line: parameterversion: The version number of the program. This allows the user to be warned if an old parameter file is used. The second line contains two integers: fromrange: integer; the 5' or N terminus of the range to display torange: integer; the 3' or C terminus of the range to display The third line contains one real number for the cutoff below which a base is labeled 'n'. output: messages to the user description How bad is the TATAAT consensus? What are the relative fractions? The program is designed to allow one to look at a consensus sequence. Its purpose is to ATTACK THE CONCEPT OF CONSENSUS sequences. Do not use this program to publish a consensus sequence without indicating that the consensus sequence method is highly flawed! examples documentation T. D. Schneider, Consensus Sequence Zen, Applied Bioinformatics, 1 (3), 111-119, 2002. See link below. see also {example parameter file:} consensusp {Source of ShiftLettering function:} marks.lettering {Programs that make symvec files:} alpro.p dalvec.p {If you are tempted to use this program to publish something, DON'T!!! Read this:} http://www.lecb.ncifcrf.gov/~toms/glossary.html#consensus_sequence {and the Consensus Sequence Zen paper:} http://www.lecb.ncifcrf.gov/~toms/paper/zen/ {Use these programs instead:} {make a sequence logo:} makelogo.p {make sequence walkers:} makewalker.p {If you don't know what those programs do, "go look it up":} {sequence logos:} http://www.lecb.ncifcrf.gov/~toms/glossary.html#sequence_logo {sequence walkers:} http://www.lecb.ncifcrf.gov/~toms/glossary.html#sequence_walker {This is a sequence logo:} http://www.lecb.ncifcrf.gov/~toms/icons/donor.pure.gif {This is a sequence walker:} http://www.lecb.ncifcrf.gov/~toms/icons/donorwalkersmall.gif {Examples of What You Miss (!) If You Use A Consensus Sequence:} http://www.lecb.ncifcrf.gov/~toms/papers/splice/ http://www.lecb.ncifcrf.gov/~toms/papers/walker/walker.html#fig2 http://www.lecb.ncifcrf.gov/~toms/papers/baseflip/ http://www.lecb.ncifcrf.gov/~toms/papers/repan3/ author Thomas Dana Schneider bugs This program could be used by some foolish idiot to make a consensus and publish it! Well, ok, you can if your purpose is to attack the concensus sequence concept. Otherwise, please read the 'see also' section above. technical notes *) (* end module describe.consensus *) (* begin module string.const *) maxstring = 2000; (* the maximum string *) (* end module string.const version = 4.54; (@ of prgmod.p 2001 Aug 29 *) type (* begin module string.type *) stringptr = ^string; (* pointer to a string *) string = record (* a string of characters *) letters: array[1..maxstring] of char; (* the letters in the string *) length: integer; (* the number of characters in the string *) current: integer; (* the letter we are working on *) next: stringptr; (* the next string in a series *) end; (* end module string.type version = 4.54; (@ of prgmod.p 2001 Aug 29 *) var symvec, (* file used by this program *) consensusp, (* file used by this program *) list, (* file used by this program *) conmarks: text; (* file used by this program *) (* begin module halt *) procedure halt; (* stop the program. the procedure performs a goto to the end of the program. you must have a label: label 1; declared, and also the end of the program must have this label: 1: end. examples are in the module libraries. this is the only goto in the delila system. *) begin writeln(output,' program halt.'); goto 1 end; (* end module halt version = 'delmod 6.16 84 mar 12 tds/gds'; *) (* begin module copyaline *) procedure copyaline(var fin, fout: text); (* copy a line from file fin to file fout *) begin (* copyaline *) while not eoln(fin) do begin fout^ := fin^; put(fout); get(fin) end; readln(fin); writeln(fout); end; (* copyaline *) (* end module copyaline version = 4.54; (@ of prgmod.p 2001 Aug 29 *) (* begin module clearstring *) procedure clearstring(var ribbon: string); (* empty the string *) var index: integer; (* to the ribbon *) begin (* clearstring *) with ribbon do begin for index := 1 to maxstring do letters[index] := ' '; length := 0; current := 0; end end; (* clearstring *) procedure initializestring(var ribbon: string); (* start the string with a nil pointer. This routine should be called before doing linked list work. This allows the standard string routines to clear the string without killing the pointer. *) begin (* initializestring *) clearstring(ribbon); ribbon.next := nil; end; (* initializestring *) (* end module clearstring version = 4.54; (@ of prgmod.p 2001 Aug 29 *) (* begin module writestring *) procedure writestring(var tofile: text; var s: string); (* write the string s to file tofile, no writeln *) var i: integer; (* index to s *) begin (* writestring *) with s do for i := 1 to length do write(tofile, letters[i]) end; (* writestring *) (* end module writestring version = 4.54; (@ of prgmod.p 2001 Aug 29 *) (* begin module capitalize *) function capitalize(c: char): char; (* convert the character c to upper case *) var n: integer; (* c is the n'th letter of the alphabet *) begin n := ord(c); if (n >= ord('a')) and (n <= ord('z')) then c := chr( n - ord('a') + ord('A')); capitalize := c end; (* end module capitalize *) (* begin module decapitalize *) function decapitalize(c: char): char; (* convert the character c to lower case *) var n: integer; (* c is the n'th letter of the alphabet *) begin n := ord(c); if (n >= ord('A')) and (n <= ord('Z')) then c := chr( n - ord('A') + ord('a')) else c := chr(n); decapitalize := c end; (* end module decapitalize *) (* begin module twocode *) function twocode(a, b: char): char; (* return the two letter code for bases a and b. a and b can be upper or lower case, but upper case letters are returned. Uses: module decapitalize. *) begin a := decapitalize(a); b := decapitalize(b); case a of 'a': case b of 'a': twocode := 'A'; 'c': twocode := 'M'; 'g': twocode := 'R'; 't': twocode := 'W'; end; 'c': case b of 'a': twocode := 'M'; 'c': twocode := 'C'; 'g': twocode := 'S'; 't': twocode := 'Y'; end; 'g': case b of 'a': twocode := 'R'; 'c': twocode := 'S'; 'g': twocode := 'G'; 't': twocode := 'K'; end; 't': case b of 'a': twocode := 'W'; 'c': twocode := 'Y'; 'g': twocode := 'K'; 't': twocode := 'T'; end; end; end; (* end module twocode *) (* begin module threecode *) function threecode(a, c, g: char): char; (* return the three letter code for bases a, b and c. a and b can be upper or lower case, but upper case letters are returned. Uses: module decapitalize. THE LETTERS MUST BE ALPHABETIZED. *) begin a := decapitalize(a); c := decapitalize(c); g := decapitalize(g); (* who is not represented? There are only four cases if we know that! *) if a <> 'a' then begin (* it must be cgt *) threecode := 'B' end else begin (* a must be 'a' *) if c <> 'c' then begin (* it must be agt *) threecode := 'D' end else begin (* a ='a' and c='c' *) if g <> 'g' then begin (* it must be act *) threecode := 'H' end else threecode := 'V' (* it must be acg *) end; end end; (* end module threecode *) (* begin module consensus.themain *) procedure themain(var symvec, consensusp, list, conmarks: text); (* the main procedure of the program *) const wid = 5; (* width of numbers *) dec = 2; (* decimal places for numbers *) maxs = 4; (* maximum s allowed by current program *) var bits: real; (* the number of bits in the consensus *) c: char; (* one of the symbols *) conposition: integer; (* position in constring *) concount: integer; (* count of Ns in the consensus *) { conlevel: integer; (* number of bases that pass the cutoff at a position *) constring: string; (* resulting consensus string - ICK!!! *) } conlevel: array[1..maxstring] of integer; (* number of bases that pass the cutoff at a position *) constring: array [1..maxs] of string; (* multi-level consensus strings *) compressed: string; (* compressed consensus string *) cutoff: real; (* the lowest frequency for using a base in the consensus *) first: boolean; (* used to obtain the number of symbols from the start of symvec *) fromrange: integer; (* the 5' or N terminus of the range to display *) n: integer; (* number of sequences at a position l *) ncl: integer; (* number of c position l *) l: integer; (* position in the alignment *) d: integer; (* display index *) f: real; (* frequency of symbol c *) letter: integer; (* index count of the symbols *) mfsymbol: char; (* maximum frequency symbol *) mffraction: real; (* fraction of maximum frequency symbol *) parameterversion: real; (* parameter version number *) probability: real; (* probability of consensus *) Rsequence: real; (* information content *) Rvar: real; (* variance of Rsequence *) torange: integer; (* the 3' or C terminus of the range to display *) s: integer; (* number of symbols *) sindex: integer; (* index for s, the number of symbols *) begin writeln(output,'consensus ',version:4:2); reset(consensusp); readln(consensusp, parameterversion); if round(100*parameterversion) < round(100*updateversion) then begin if parameterversion <= 1.05 then begin writeln(output, 'You have an old parameter file!'); writeln(output, 'Upgrading ...'); readln(consensusp, fromrange, torange); rewrite(consensusp); writeln(consensusp,version:4:2, ' version of consensus that this parameter file', ' is designed for.'); writeln(consensusp, fromrange:3,' ',torange:3,' range to display'); cutoff := 0.5; writeln(consensusp, cutoff:4:2, ' the lowest frequency for using a base in the consensus'); reset(consensusp); readln(consensusp, parameterversion); end; end; readln(consensusp, fromrange, torange); readln(consensusp, cutoff); reset(symvec); rewrite(list); writeln(list,'* consensus ',version:4:2); writeln(list,'* range: ',fromrange:1,' to ',torange:1); writeln(list,'* cutoff: ',cutoff:4:2); (* header *) first := true; probability := 1.0; while not eof(symvec) do begin if symvec^ = '*' then copyaline(symvec, list) else begin if first then begin first := false; readln(symvec, s); writeln(list, s:wid,' number of symbols'); if s > maxs then begin writeln(output, ' number of symbols ',s:1, ' exceeds the allowed maximum,', ' maxs = ',maxs:1); halt end; for sindex := 1 to s do clearstring(constring[s]); end else begin readln(symvec,l,n,Rsequence,Rvar); mfsymbol := ' '; mffraction := -1.0; if (l >= fromrange) and (l <= torange) then begin (* start the consensus string *) conposition := l - fromrange +1; for sindex := 1 to s do begin constring[sindex].letters[conposition] := 'N'; constring[sindex].length := constring[sindex].length + 1; end; conlevel[conposition] := 0; for letter := 1 to s do begin readln(symvec, c, ncl); (* process the position *) f := ncl/n; if f >= mffraction then begin mffraction := f; mfsymbol := c; (* capture the letter *) if f >= cutoff then begin conlevel[conposition] := succ(conlevel[conposition]); constring[conlevel[conposition]].letters[conposition] := capitalize(c); end end; { if l = -19 then begin writeln(output, 'l = ',l:1); writeln(output, 'ncl = ',ncl:1); writeln(output, 'c = ',c:1); writeln(output, 'conlevel[conposition] = ',conlevel[conposition]:1); end; if l = -18 then begin writeln(output, 'l = ',l:1); writeln(output, 'conlevel[conposition] = ',conlevel[conposition]:1); halt end; } write(list, l:wid, ' ', n:wid, ' ', ncl:wid, ' ', f:wid:dec ); (* display the fraction of the base in a small graph *) write(list,' |'); for d := 0 to 10 do begin if f < d/10 then write(list,'-') else write(list,c) end; write(list,' |'); writeln(list); (* at the last letter, write results out *) if letter = s then begin writeln(list,'* maximum fraction: ',mfsymbol, ' at ',mffraction:wid:dec); writeln(list,'*'); probability := probability * mffraction end; end; end (* skip the entire position l quickly: *) else for letter := 1 to s do readln(symvec); end end end; writeln(list,'* probability of consensus: ',probability:wid:dec); writeln(list,'* Final consensus:'); writeln(list,'* '); for sindex := 1 to s do begin write (list,'* '); for l := fromrange to torange do begin conposition := l - fromrange +1; if (constring[sindex].letters[conposition] <> 'N') or (sindex = 1) then write(list, constring[sindex].letters[conposition]) else write(list, ' '); end; writeln(list); end; writeln(list,'* '); (* compress the consensus to single letter code *) bits := 0; clearstring(compressed); compressed.length := constring[1].length; for l := fromrange to torange do begin conposition := l - fromrange +1; { writeln(output,'conposition= ',conposition:1); writeln(output,'conlevel[conposition]= ',conlevel[conposition]:1); } case conlevel[conposition] of 0: compressed.letters[conposition] := 'N'; (* no consensus *) 1: begin compressed.letters[conposition] (* single consensus base *) := constring[1].letters[conposition]; bits := bits + 2.0; end; 2: begin (* use two letter code *) compressed.letters[conposition] {:= '2';} := twocode(constring[1].letters[conposition], constring[2].letters[conposition]); bits := bits + 1.0; end; 3: begin compressed.letters[conposition] {:= '3';} := threecode(constring[1].letters[conposition], constring[2].letters[conposition], constring[3].letters[conposition]); bits := bits + (2.0 - ln(3)/ln(2)); (* log base 2 of 3 *) end; 4: begin compressed.letters[conposition] := 'n'; end; end; end; write (list,'* '); writestring(list,compressed); writeln(list); writeln(list,'* '); writeln(list,'* DO NOT PUBLISH THIS!!!'); writeln(list,'* Well, ok, you can if your purpose'); writeln(list,'* is to attack the concensus sequence concept.'); writeln(list,'* Otherwise, read this:'); writeln(list,'* http://www.lecb.ncifcrf.gov/~toms/glossary.html#consensus_sequence'); rewrite(conmarks); concount := 0; for l := fromrange to torange do begin conposition := l - fromrange +1; write(list, l:5,' '); for sindex := 1 to s do begin if (constring[sindex].letters[conposition] <> 'N') or TRUE then write(list, constring[sindex].letters[conposition]); end; if constring[1].letters[conposition] <> 'N' then concount := succ(concount); write(list, ' ', compressed.letters[conposition]); writeln(list); (* The definition of shiftlettering and an example: * U TRIGGER IGNORED base bits sfactor (letters) R G B ShiftLettering * U 1.1 0 0 0 1 (A) 0 0 0 ShiftLettering *) writeln(conmarks, 'U 5 0 ', l:5 ,' -0.5 1.3 (', { constring[1].letters[conposition], (* first string *) } compressed.letters[conposition], ') 0 0 0 ShiftLettering'); end; writeln(list, concount:1,' out of ',(torange-fromrange+1):1, ' non-N positions in the consensus'); write(list, bits:wid:dec,' '); writestring(list, compressed); writeln(list); end; (* end module consensus.themain *) begin themain(symvec, consensusp, list, conmarks); 1: end.