program sites(database, standard, caps, latex, list, sorted, stats, tables, rsdata, sequ, makebkp, output); (* sites: analyse sites from randomized sequence data base Dr. Thomas D. Schneider National Cancer Institute Laboratory of Experimental and Computational Biology Frederick, Maryland 21702-1201 toms@ncifcrf.gov permanent email: toms@alum.mit.edu http://www.lecb.ncifcrf.gov/~toms/ *) label 1; (* end of program *) const (* begin module version *) version = 8.09; (* of sites.p 2002 Mar 6 2002 Mar 6, 8.09: upgrade documentation 1996 Jul 27, 8.08: final version for publication origin 1987 oct 27 *) (* end module version *) (* begin module describe.sites *) (* name sites: analyse sites from randomized sequence data base synopsis sites(database: in, standard: in, caps: out, latex: out, list: out, sorted: out, stats: out, tables: out, rsdata: out, sequ: out, makebkp: out, output: out) files database: database consisting of DNA sequence data. The first line is the name of the database. The remaining lines consist of experimental packages. The start of a package is a line like: @ -27 11 -21 5 0.85 The '@' must be left justified as the first character on the line. The numbers are defined to be: @ FROM.range TO.range FROM.random TO.random fraction.canonical FROM.range: the coordinate of the first base reported in the database TO.range: the coordinate of the last base reported in the database FROM.random: the coordinate of the first randomized base TO.random: the coordinate of the last randomized base fraction.canonical: the fraction of the canonical base during chemical synthesis. The next line defines the canonical sequence which was 'randomized'. It is in the format of the remaining sequences. The first sequence in the package is always the standard, so do not forget to include it! The sequences follow the standard. The format of the standard and the randomized sequences consists of: DNA sequence, plasmid name, primer, experiment, date (year, month, day) separated by one space each instead of commas. The sequence may contain any of the characters: "acgtxd.". "x" means that the base is not known. "d" means that that base was deleted. The program will reject these sequences (to make pure data), but this allows them to be stored in the database. "." means 'the same as the standard sequence in this position'. This allows one to enter sequences as a set of changes from the standard. The next experimental package begins with another '@'. The data from each experimental package are gathered as frequencies and normalized by using the given canonical base frequency. The normalized frequencies from all the packages are averaged to produce the final results. This allows one to combine several experiments together, however all experiments are given the same weight. This is reasonable if the experiments have similar canonical frequencies and numbers of sequences, but is probably not correct if one experiment carries more "importance" than another. A method to accounting for these different weightings is not known. standard: Use the rsdata output of the rseq program from the natural sequences as your standard. It is used for statistical comparison of the experiment to wild-type sequences. caps: listing of the database sorted and with capital letters showing changes from the standard and database errors. latex: just like list, but in a form that can be run through the typesetting program LaTeX. list: listing of the database in an easy-to-read format showing only the changes from the standard. Also gives the tables of numbers of bases. sorted: the list sorted by sequence stats: frequency statistics of the database differences. summary of information results. tables: frequency tables for various stages of the normalization. rsdata: This simulates the output of the rseq program by giving the numbers of bases (b) at each position (i). When the frequency tables are normalized in this program, the effective number of sequences is lost. To make sure that the numbers reported in rsdata are accurate, they are multiplied by constant scaleup. The table can be run through dalvec and makelogo to make a sequence logo. The variance, varhnb, is set to be negative to indicate that no method is known for how to calculate it. An earlier version of the program gave the minimum error based on the number of sequences in the database, but people tended to miss this fact when looking at the final sequence logo, so were unduely impressed by the data. sequ: raw sequences (after processing) ready for makebk makebkp: input for makebk to create the book output: messages to the user description The function of the sites program is to gather, collate and analyze data from a randomization experiment. See the reference given below. It was designed to help enter sequence data. One may enter several copies of a particular sequence, and they will be joined together by merging their data. Sequences of the same clone are identified by their common plasmid names. Inconsistent data are flagged. First the program sorts the data and checks that multiple entries are consistant with one another. If they are not, the program halts and you should look into the caps file to figure out what is wrong. The program converts the database into a more readable form in list, and provides statistical analysis. If the standard is: gaattcaaattaatacgactcactatagggagaaagctt pTS37 kc7 ex100 87 nov 2 and one of the data base lines is: gaattcaaattaattcgactcactttagggaaaaagctt pTS331 1204 ex394 87 nov 2 the program presents the data in file list as: ..............t.........t......a....... pTS331 1204 ex394 87 nov 2 which is more readable. This allows entry as a sequence, but display in a form that is easy to understand. If two primers are used, and data are found for both, then the name becomes 'both'. The stats file contains tables of the wild type frequencies and the experimental frequencies. examples See database.t7 and standard.t7. documentation @article{Schneider1989, author = "T. D. Schneider and G. D. Stormo", title = "Excess Information at Bacteriophage {T7} Genomic Promoters Detected by a Random Cloning Technique", year = "1989", journal = "Nucl. Acids Res.", volume = "17", pages = "659-674"} see also {Examples:} database.t7 {and} standard.t7 {Related programs:} siva.p, dalvec.p, makelogo.p, makebk.p author Tom Schneider bugs For sorting all plasmid initials are ignored, sorting is by the plasmid number only. A correction for small sample size is not known for the normalized experimental data. Certainly the method given in program Calhnb is not right. Therefore, the program does not report the expected variation. *) (* end module describe.sites *) (* begin module sites.const *) maxstring = 800; (* largest string allowed *) maxseq = 100; (* largest sequence allowed *) maxentries = 1000; (* largest number of data entries *) primerlength = 4; (* length of primers on output, to make them line up *) fillermax = 21; (* for filling a trigger *) (* end module sites.const *) (* begin module rseq.const *) negativeinfinity = -1000000; (* the definition of negative infinity = 100x scaleup *) scaleup = 10000; (* the amount that the frequencies are to be multiplied by. This determines the number of digits accuracy in the resulting table *) posfield = 4; (* field size for aligned sequence positions *) infofield = 8; (* field size for bits *) infodecim = 5; (* decimal places for bits *) nfield = 4; (* size of field for printing n, the number of sites *) (* end module rseq.const *) type (* begin module rs.type *) rstype = record (* types of data in the data table from rseq *) rstart, rstop: integer; (* range of the data *) l, (* position *) nal,ncl,ngl,ntl: integer; (* numbers of each base *) rsl, (* rsequence(l) *) rs, (* running sum of rs *) varhnb, (* variance estimate for small sample correction *) sumvar: (* sum of varhnb *) real; nl: integer; (* = a+c+g+t *) ehnb: real; (* hg-e(n) *) flag: char; (* e = exact, a = approximate sampling statistics *) end; (* end module rs.type version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module interact.type *) string = record (* a string of characters *) letters: array[1..maxstring] of char; (* the letters in the string *) length: integer; (* the number of characters in the string *) current: integer; (* the letter we are working on *) end; (* end module interact.type version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module trigger.type *) trigger = record (* an object to be searched for *) seek: string; (* the characters looked for *) state: integer; (* how close to triggering we are *) skip: boolean; (* trigger not found- skip the line *) found: boolean (* the trigger was found *) end; (* end module trigger.type version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module filler.type *) (* the following is an array used to fill a string. it is convenient to have it much shorter than the maxstring, so that it is easy to fill the string using procedure fillstring. the user must declare the value of constant fillermax. *) filler = packed array[1..fillermax] of char; (* end module filler.type version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module sites.type *) base = (a,c,g,t,x); (* the nucleotide bases and unknown base *) (* the standard entry format is: gaattcaaattaattcgactcactttagggaaaaagctt pTS331 1204-- ex394 87 nov 2 seq------------------------------------ name-- primer ex--- date------ *) entryptr = ^entry; entry = record seq: string; (* the sequence read in *) name: string; (* the plasmid *) number: integer; (* the number of the plasmid *) primer: string; (* the primer *) ex: string; (* the experiment number *) date: string; (* the date of the developed gel *) next: entryptr end; entries = array[1..maxentries] of entryptr; (* a set of "entry"s *) coordinates = record (* defines the overall region of a set of sequences, fromrange to torange and the subset of that region which is randomized, fromrandom to torandom. Since the fraction of canonical base is also associated with these numbers, they are included here too. *) fromrange: integer; (* first position in user coordinates *) torange: integer; (* last position in user coordinates *) lengthrange: integer; (* the region of data in use *) fromrandom: integer; (* start of random region in user coordinates *) torandom: integer; (* end of random region in user coordinates *) lengthrandom: integer; (* the region of data which was randomized *) fracan: real; (* fraction of canonical base in random region. Note that fracan is always associated with a set of coordinates. *) notfracan: real; (* fraction of non-canonical bases, (1-fracan *) end; ilarray = record data: array[1..maxseq] of integer; (* integer(l) *) coor: coordinates; end; iblarray = record data: array[a..t, 1..maxseq] of integer; (* integer(b,l) *) coor: coordinates; end; rlarray = record data: array[1..maxseq] of real; (* real(l) *) coor: coordinates; end; rblarray = record data: array[a..t, 1..maxseq] of real; (* real(b,l) *) coor: coordinates; end; (* end module sites.type *) var (* begin module sites.var *) database, (* sequence database *) tables, (* frequency tables *) caps, (* capitalized and sorted listing of the database *) latex, (* latex *) list, (* easy-to-read listing of the database, by name *) sorted, (* sorted listing of the database, by sequence *) standard, (* standard sequence *) stats, (* statistics of the database *) rsdata, (* f(b,l) table *) sequ, (* raw processed sequences *) makebkp: (* input file for makebk *) text; (* end module sites.var *) (* begin module halt *) procedure halt; (* stop the program. the procedure performs a goto to the end of the program. you must have a label: label 1; declared, and also the end of the program must have this label: 1: end. examples are in the module libraries. this is the only goto in the delila system. *) begin writeln(output,' program halt.'); goto 1 end; (* end module halt version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module copyaline *) procedure copyaline(var fin, fout: text); (* copy a line from file fin to file fout *) begin (* copyaline *) while not eoln(fin) do begin fout^ := fin^; put(fout); get(fin) end; readln(fin); writeln(fout); end; (* copyaline *) (* end module copyaline version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module capitalize *) function capitalize(c: char): char; (* convert the character c to upper case *) var n: integer; (* c is the n'th letter of the alphabet *) begin n := ord(c); if (n >= ord('a')) and (n <= ord('z')) then c := chr( n - ord('a') + ord('A')); capitalize := c end; (* end module capitalize version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module interact.clearstring *) procedure clearstring(var ribbon: string); (* empty the string *) var index: integer; (* to the ribbon *) begin (* clearstring *) with ribbon do begin for index := 1 to maxstring do letters[index] := ' '; length := 0; current := 0; end end; (* clearstring *) (* end module interact.clearstring version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module interact.writestring *) procedure writestring(var tofile: text; var s: string); (* write the string s to file tofile, no writeln *) var i: integer; (* index to s *) begin (* writestring *) with s do for i := 1 to length do write(tofile, letters[i]) end; (* writestring *) (* end module interact.writestring version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module trigger.proc *) (* this module allows one to scan a series of characters, as from an array or a file, and to "trigger" or detect a simple string in the series. the advantage of the trigger is that several triggers can "observe" a stream of characters at once, each looking for a different thing. some other modules required: interact.const, interact.type *) procedure resettrigger(var t: trigger); (* reset the trigger to ground state *) begin (* resettrigger *) with t do begin state := 0; skip := false; found := false end end; (* resettrigger *) procedure testfortrigger(ch: char; var t: trigger); (* look at the character ch. if it is part of the trigger (at the current trigger state), then the trigger state goes higher. if it is not part of the trigger then the trigger state is reset, skip is true and one should skip onward to find the trigger. if the trigger is found, found is true. *) begin (* testfortrigger *) with t do begin state := succ(state); (* if debugging then begin writestring(list,seek); writeln(list,'testfortrigger seek.letters[',state:1,']:', seek.letters[state],' ch:',ch); end;*) if seek.letters[state] = ch then begin skip := false; if state = seek.length then found := true else found := false end else begin (* reset trigger *) state := 0; skip := true; found := false end end end; (* testfortrigger *) (* end module trigger.proc version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module filler.fillstring *) procedure fillstring(var s: string; a: filler); (* this procedure makes it reasonably easy to fill the string s with characters. one calls the procedure as: *) (* 1 2 3 4 5 *) (* 12345678901234567890123456789012345678901234567890 *) (* fillstring(s, 'this-is-the-string '); the two comments make it easy to line the characters up. also, for this example, it was assumed that the length of filler as defined by the constant fillermax was 50. *) var length: integer; (* of the string without trailing blanks *) index: integer; (* of s *) begin (* fillstring *) clearstring(s); length := fillermax; while (length > 1) and (a[length] = ' ') do length := pred(length); if (length = 1) and (a[length] = ' ') then begin writeln(output, 'fillstring: the string is empty'); halt end; for index := 1 to length do s.letters[index] := a[index]; s.length := length; s.current := 1 end; (* fillstring *) (* end module filler.fillstring version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module filler.filltrigger *) procedure filltrigger(var t: trigger; a: filler); (* fill the trigger t *) begin (* filltrigger *) fillstring(t.seek,a) end; (* fillstring *) (* end module filler.filltrigger version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* ********************************************************************** *) (* ********************************************************************** *) (* begin module rs.readrsrange *) procedure readrsrange(var rsdata: text; var r: rstype); (* read the range data from rsdata to r. data is assumed to be the rsdata file from program rseq. *) var index: integer; (* for counting lines of rsdata *) skip: char; (* a character to skip the '*' on the line *) begin for index := 1 to 11 do readln(rsdata); readln(rsdata, skip, r.rstart, r.rstop); (* writeln(output, 'range: ',r.rstart:1,' ',r.rstop:1);*) end; (* end module rs.readrsrange version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module rs.getrsbegin *) procedure getrsbegin(var infile: text); (* skip to the beginning of the data in a data file from the rseq program *) var ch: char; (* a character read from infile *) begindata: trigger; (* a trigger to locate the beginning of the data *) begin (* 1 2 *) (* 123456789012345678901 *) filltrigger(begindata ,'l a c g t'); resettrigger(begindata); reset(infile); while not begindata.found do begin if eoln(infile) then readln(infile); if eof(infile) then begin writeln(output,'beginning of data not found'); halt; end; read(infile,ch); testfortrigger(ch,begindata); end; readln(infile); end; (* end module rs.getrsbegin version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module rs.readrsdata *) procedure readrsdata(var rsdata: text; var rdata: rstype); (* read data from the data file of program rseq into the datatype *) begin with rdata do begin read(rsdata,l,nal,ncl,ngl,ntl,rsl,rs,varhnb,sumvar,nl,ehnb); (* skip spaces to find the flag: *) while rsdata^=' ' do get(rsdata); readln(rsdata,flag); (* writeln(output,'readrsdata: l a c g t flag = ', l:1,' ',nal:1,' ',ncl:1,' ',ngl:1,' ',ntl:1,' ',flag); *) end end; (* end module rs.readrsdata version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* ********************************************************************** *) (* ********************************************************************** *) (* begin module package.numbar *) (* ************************************************************************ *) (* begin module numberdigit *) function numberdigit(number, logplace:integer): char; (* return the digit at the place value ('logplace') position of number. example: numberdigit(13625, 3) = 3 numberdigit(13625, 4) = 1 *) var place: integer; (* the exponent of logplace *) count: integer; (* used to make place *) absolute: integer; (* the absolute value of number *) acharacter: char; (* the character to be returned *) procedure digit; (* extract a digit at the place position *) var tenplace: integer; (* ten times place *) z: integer; (* an intermediate value *) d: integer; (* the digit extracted *) begin (* digit *) tenplace:=10*place; z:=absolute-((absolute div tenplace)*tenplace); if place = 1 then d:=z else d:= z div place; case d of 0: acharacter:='0'; 1: acharacter:='1'; 2: acharacter:='2'; 3: acharacter:='3'; 4: acharacter:='4'; 5: acharacter:='5'; 6: acharacter:='6'; 7: acharacter:='7'; 8: acharacter:='8'; 9: acharacter:='9'; end end; (* digit *) procedure sign; (* put a negative sign out or a positive sign *) begin (* sign *) if number <0 then acharacter:='-' else acharacter:='+' end; (* sign *) begin (* numberdigit *) place:=1; for count:=1 to logplace do place:=10*place; if number=0 then begin if place=1 then acharacter:='0' else acharacter:=' ' end else begin absolute:=abs(number); if absolute < (place div 10) then acharacter:=' ' else if absolute >= place then digit else sign end; numberdigit:=acharacter end; (* numberdigit *) (* end module numberdigit version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module numbersize *) function numbersize(n: integer):integer; (* calculate amount of space to be reserved for the integer n *) const ln10 = 2.30259; (* natural log of 10 - for conversion to log base 10 *) epsilon = 0.00001; (* a small number to correct log base 10 errors *) begin (* numbersize *) if n = 0 then numbersize:=1 else numbersize:=trunc(ln(abs(n))/ln10 + epsilon) + 2; (* the epsilon assures that we do not lose a place due to roundoff. eg, sometimes log base 10 of 10 would be 0.9999 instead of 1, and we would not do it right... note: this will fail for very large numbers on the order of 1/epsilon. *) (* the 2 is for the sign and last digit *) end; (* numbersize *) (* end module numbersize version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module numberbar *) procedure numberbar(var afile: text; spaces, firstnumber, lastnumber: integer; var linesused: integer); (* write a bar of numbers to a file, with several spaces before. the number of lines used is returned *) var logplace: integer; (* the log of the digit being looked at *) spacecount: integer; (* count of spaces *) number: integer; (* the current number being written *) begin if abs(firstnumber) > abs(lastnumber) then linesused:= numbersize(firstnumber) else linesused:= numbersize(lastnumber); for logplace:=linesused-1 downto 0 do begin for spacecount:=1 to spaces do write(afile,' '); for number:=firstnumber to lastnumber do write(afile,numberdigit(number,logplace)); writeln(afile) end end; (* end module numberbar version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* ************************************************************************ *) (* end module package.numbar version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module chartorange *) function chartorange(ch: char): base; begin if ch in ['a','c','g','t','x'] then case ch of 'a': chartorange := a; 'c': chartorange := c; 'g': chartorange := g; 't': chartorange := t; 'x': chartorange := x; end else begin writeln(output, 'chartorange: ',ch,' is not a base'); halt end end; (* end module chartorange *) (* begin module basetochar *) function basetochar(b: base): char; begin case b of a: basetochar := 'a'; c: basetochar := 'c'; g: basetochar := 'g'; t: basetochar := 't'; x: basetochar := 'x'; end end; (* end module basetochar *) (* begin module makenumber *) procedure makenumber(name: string; var number: integer; var found: boolean); (* make a integer number from the name. If a number was not detected, found is false. *) var l: integer; (* position in the string *) begin found := false; number := 0; for l := 1 to name.length do begin if name.letters[l] in ['0','1','2','3','4','5','6','7','8','9'] then begin found := true; number := number * 10; (* make room for the next digit *) case name.letters[l] of '0': number := number + 0; '1': number := number + 1; '2': number := number + 2; '3': number := number + 3; '4': number := number + 4; '5': number := number + 5; '6': number := number + 6; '7': number := number + 7; '8': number := number + 8; '9': number := number + 9; end end end end; (* end module makenumber version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module getstring *) procedure getstring(var f: text; var s: string; spaces: boolean); (* read in the string. if spaces is true, then there will be spaces in the string and the string ends at the end of the line. otherwise the string ends either at a space (and does not include it) or at the end of the line *) var c: char; (* a character read in *) begin with s do begin length := 0; c := 'x'; while ((c<>' ') or spaces) and (not eoln(f)) do begin read(f,c); if (c <> ' ') or spaces then begin length := length + 1; if length > maxstring then begin writeln(output,'a string being read is bigger than ',maxstring:1); writeln(output,'increase constant maxstring'); halt end; letters[length] := c end end end end; (* end module getstring *) (* begin module putstring *) procedure putstring(var f: text; var s: string); (* write the string to f *) var l: integer; (* index to the string *) begin for l := 1 to s.length do write(f,s.letters[l]) end; (* end module putstring *) (* begin module copystring *) procedure copystring(a: string; var b: string); (* copy string a to b *) var l: integer; (* index to the string *) begin b.length := a.length; for l := 1 to a.length do b.letters[l] := a.letters[l] end; (* end module copystring version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module equalstring *) function equalstring(a, b: string): boolean; (* test for equality between two strings at current positions *) var index: integer; (* index to both strings *) equal: boolean; (* are letters in a and b the same? *) begin (* equalstring *) if a.length = b.length then begin index := 1; repeat equal := (a.letters[index] = b.letters[index]); index := succ(index) until (not equal) or (index > a.length); equalstring := equal end else equalstring := false end; (* equalstring *) (* end module equalstring version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module lessstring *) function lessstring(a,b: string): boolean; (* is string a before string b in the alphabet? *) var done: boolean; (* done checking the strings *) l: integer; (* index to the string *) begin (* note: lengths assumed equal *) l := 0; done := false; repeat l := l + 1; if l > a.length (* the whole strings matched *) then begin done := true; lessstring := false end else if(b.letters[l] <> a.letters[l]) then begin done := true; lessstring := (ord(b.letters[l]) > ord(a.letters[l])) end until done end; (* end module lessstring version = 4.11; (@ of prgmod.p 1991 Apr 22 *) (* begin module equalseq *) function equalseq(a,b: string): boolean; (* are the sequence strings a and b equal? an 'x' means that they match *) var p: integer; (* position in the strings *) failure: boolean; (* true means the strings did not match *) begin if a.length <> b.length then equalseq := false else begin failure := false; p := 1; repeat if (a.letters[p] <> 'x') and (b.letters[p] <> 'x') then if a.letters[p]<>b.letters[p] then failure := true; p := p + 1 until (p > a.length) or failure; equalseq := not failure end end; (* end module equalseq *) (* begin module diffseq *) procedure diffseq(var f: text; a,b: string); (* show the differences between sequences a and b to file f *) var p: integer; (* position in the strings *) begin p := 1; if a.length <> b.length then begin writeln(f,'sequences are different in length'); end else begin for p := 1 to a.length do begin if (a.letters[p] <> 'x') and (b.letters[p] <> 'x') then if a.letters[p]<>b.letters[p] then write(f,'^') else write(f,' ') else write(f,' ') { else write(f,' ') else write(f,a.letters[p]) qqq } end; writeln(f) end; end; (* end module diffseq *) (* begin module sites.identify *) procedure identify(var f: text; e: entryptr); (* identify the entry data, write to file f *) begin with e^ do begin write(f,'seq: "'); putstring(f,seq); writeln(f,'"'); write(f,'name: "'); putstring(f,name); writeln(f,'"'); write(f,'primer: "'); putstring(f,primer); writeln(f,'"'); write(f,'experiment: "'); putstring(f,ex); writeln(f,'"'); write(f,'date: "'); putstring(f,date); writeln(f,'"'); end end; (* end module sites.identify *) (* begin module sites.getentry *) procedure getentry(var f: text; var e: entryptr; std: entryptr; var gotten: boolean); (* get the entry e from f, tell whether it is acceptable. If the base is a period '.', then use the standard sequence in std instead. When the standard is being read, std is empty. If the user illegally puts a period in the standard, then the routinge will halt because the user tried to use a period in the standard. *) var extra: integer; (* extra spaces to align primer names *) foundnumber: boolean; (* found a number in the string 'name' *) l: integer; (* index to a string *) begin with e^ do begin getstring(f, seq,false); getstring(f, name,false); makenumber(name,number,foundnumber); if not foundnumber then begin (* if this is the standard, it's ok, otherwise not *) if std = nil then number := 0 else begin writeln(output,'getentry: sequence:'); putstring(output, seq); writeln(output); writeln(output,'has a name:'); putstring(output, name); writeln(output); writeln(output,'which cannot be converted to a number.'); writeln(output,'The sites program requires a plasmid number for'); writeln(output,'every database entry except the canonical', ' standard.'); halt end end; getstring(f,primer,false); (* make the primers align *) if primer.length < primerlength then begin for extra := primer.length+1 to primerlength do primer.letters[extra] := ' '; primer.length := primerlength; end; getstring(f, ex,false); getstring(f, date,true); next := nil; (* check the entry *) gotten := true; l := 1; for l := 1 to seq.length do begin if not (seq.letters[l] in ['a','c','g','t','x','d','.']) then begin identify(output,e); writeln(output,'non "acgtxd." character found in sequence'); halt end; if seq.letters[l] = '.' then begin if std = nil then begin writeln(output,'A period (.) cannot be used', ' in a standard sequence.'); halt end; (* use the standard: *) seq.letters[l] := std^.seq.letters[l] end; if gotten then if seq.letters[l] = 'd' then begin (* it is the responsibility of the calling program to tell the user about dropped entries identify(output,e); *) gotten := false end; end; { allow person to have plasmid name that does not begin with p: if name.letters[1] <> 'p' then begin identify(output,e); writeln(output,'WARNING: The plasmid name does not begin with "p"'); end; } end; readln(f); end; (* end module sites.getentry *) (* begin module sites.putid *) procedure putid(var f: text; s: entryptr); (* put the identifier (without the sequence) of s to f *) begin with s^ do begin putstring(f, name); write(f,' '); putstring(f,primer); write(f,' '); putstring(f, ex); write(f,' '); putstring(f, date); end end; (* end module sites.putid *) (* begin module sites.putentry *) procedure putentry(var f: text; s: entryptr); (* put the entry in s to f *) begin putstring(f,s^.seq); write(f,' '); putid(f,s); writeln(f); end; (* end module sites.putentry *) (* begin module sites.clearcoor *) procedure clearcoor(var c: coordinates); (* clear the coordinates c *) begin with c do begin fromrange := 0; torange := 0; fromrandom := 0; torandom := 0; lengthrange := 0; lengthrandom := 0; fracan := 0; notfracan := 0; end; end; (* end module sites.clearcoor *) (* begin module sites.clearibl *) procedure clearibl(var fbl: iblarray); (* clear the fbl array *) var b: base; (* a base in the d sequence *) l: integer; (* position in the d sequence *) begin with fbl.coor do begin for l := 1 to maxseq do for b := a to t do fbl.data[b,l] := 0; end; clearcoor(fbl.coor); end; (* end module sites.clearibl *) (* begin module sites.clearrbl *) procedure clearrbl(var fbl: rblarray); (* clear the fbl array *) var b: base; (* a base in the d sequence *) l: integer; (* position in the d sequence *) begin with fbl.coor do begin for l := 1 to maxseq do for b := a to t do fbl.data[b,l] := 0.0; end; clearcoor(fbl.coor); end; (* end module sites.clearrbl *) (* begin module sites.copycoords *) procedure copycoords(a: coordinates; var b: coordinates); (* copy the coordinates a into b *) begin b.fromrange := a.fromrange; b.torange := a.torange; b.lengthrange := a.lengthrange; b.fromrandom := a.fromrandom; b.torandom := a.torandom; b.lengthrandom := a.lengthrandom; b.fracan := a.fracan; b.notfracan := a.notfracan; end; (* end module sites.copycoords *) (* begin module sites.copyrbl *) procedure copyrbl(Aarray: rblarray; var Barray: rblarray); (* copy a into b *) var b: base; (* index to array *) l: integer; (* index to array *) begin copycoords(Aarray.coor,Barray.coor); for l := 1 to Aarray.coor.lengthrange do for b:= a to t do Barray.data[b,l] := Aarray.data[b,l] end; (* end module sites.copyrbl *) (* begin module sites.makenotfracan *) procedure makenotfracan(fracan: real; var notfracan: real); (* make the proportion of non canonical bases (not fraction canonical = notfracan). This is so simple it is better than passing it all over the place *) begin notfracan := (1.0-fracan)/3.0; (* fraction of non canonical bases *) end; (* end module sites.makenotfracan *) (* begin module sites.setcoordinates *) procedure setcoordinates(var a: coordinates); (* determine the lengths of the coordinate set and the fraction canonical and its inverse. *) begin with a do begin lengthrange := torange - fromrange + 1; lengthrandom := torandom - fromrandom + 1; makenotfracan(fracan,notfracan); end; end; (* end module sites.setcoordinates *) (* begin module sites.showiblarray *) procedure showiblarray(var afile: text; ar: iblarray); (* show the iblarray ar to afile *) var b: base; (* index to array *) coordinate: integer; (* the coordinate relative to zero in the standard *) l: integer; (* index to array *) begin with ar.coor do begin for coordinate := fromrange to torange do begin l := coordinate - fromrange + 1; write(afile,coordinate:4,' '); if (coordinate >= fromrandom) and (coordinate <= torandom) then write(afile,'R') else write(afile,' '); for b:= a to t do begin write(afile,' ',ar.data[b,l]:5); end; writeln(afile); end; end; end; (* end module sites.showiblarray *) (* begin module sites.showrblarray *) procedure showrblarray(var afile: text; ar: rblarray); (* show the rblarray ar to afile *) var b: base; (* index to array *) coordinate: integer; (* the coordinate relative to zero in the standard *) l: integer; (* index to array *) begin with ar.coor do begin for coordinate := fromrange to torange do begin l := coordinate - fromrange + 1; write(afile,coordinate:4,' '); if (coordinate >= fromrandom) and (coordinate <= torandom) then write(afile,'R') else write(afile,' '); for b:= a to t do begin write(afile,' ',ar.data[b,l]:5:2); end; writeln(afile); end; end; end; (* end module sites.showrblarray *) (* begin module sites.getrsdata *) procedure getrsdata(var rsdata: text; var fbl: iblarray); (* read from the standard rsdata format the natural numbers of bases at each position in the sites, f(b,l) *) var l: integer; (* position in the fbl array *) rdata: rstype; (* data from rseq *) position: integer; (* a location in the aligned sequence *) begin reset(rsdata); clearibl(fbl); if eof(rsdata) then begin writeln(output,'empty standard rsdata file'); halt end else begin (* find the range of the graph in bases and prepare for reading the rsdata for graphing *) readrsrange(rsdata,rdata); getrsbegin(rsdata); (* set the size *) with fbl.coor do begin fromrange := rdata.rstart; torange := rdata.rstop; fromrandom := fromrange; torandom := torange; fracan := 0.25; end; setcoordinates(fbl.coor); (* create the fbl *) for position := rdata.rstart to rdata.rstop do begin (* skip lines with an '*' *) if rsdata^ <> '*' then begin readrsdata(rsdata,rdata); l := position - rdata.rstart + 1; (* move to the next rsdata line and give the data *) fbl.data[a,l] := rdata.nal; fbl.data[c,l] := rdata.ncl; fbl.data[g,l] := rdata.ngl; fbl.data[t,l] := rdata.ntl; { write(output,l:nfield, (* position *) ' ',rdata.nl:infofield, (* number sequences *) ' ',rdata.rsl:infofield:infodecim, (* information *) ' ',rdata.varhnb:infofield:infodecim); (* variance of rsl *) write(output,' a ',rdata.nal:nfield); write(output,' c ',rdata.ncl:nfield); write(output,' g ',rdata.ngl:nfield); write(output,' t ',rdata.ntl:nfield); writeln(output); } end; end; end end; (* end module sites.getrsdata *) (* begin module sites.announcement *) procedure announcement(var list: text; header: string); (* announce simple things to the list file *) begin writeln(list,'sites ',version:4:2); writeln(list,'the database is:'); putstring(list,header); writeln(list); end; (* end module sites.announcement *) (* begin module sites.mkheader *) procedure mkheader(var list: text; header: string; experiment: integer; coords: coordinates; std: entryptr); (* make the header to file list *) var l: integer; (* dummy variable *) begin if experiment = 1 then announcement(list,header); with coords do begin writeln(list); writeln(list,'Experiment ',experiment:1); writeln(list,' range: from ',fromrange:4, ' to ',torange:4); writeln(list,' random: from ',fromrandom:4,' to ',torandom:4); writeln(list,' fraction canonical: ',fracan:4:2); writeln(list); numberbar(list,0,fromrange,torange,l); putentry(list,std); end; end; (* end module sites.mkheader *) (* begin module sites.dataheader *) procedure dataheader(var list: text; header: string; std: entryptr); (* put out the header for data files *) begin writeln(list,'* sites ',version:4:2); write(list,'* the database is: '); putstring(list,header); writeln(list); write(list,'* '); putid(list,std); writeln(list); end; (* end module sites.dataheader *) (* begin module sites.crunch *) procedure crunch(var data: entries; numentries: integer; standard: entryptr; symbol: char); (* crush the data so that anything the same as the standard is now the symbol WARNING: this will affect statistics if any are done afterward! There are numentries of sequences to do this to. *) var l: integer; (* index to datum and standard *) n: integer; (* index to the entries *) begin for n := 1 to numentries do for l := 1 to data[n]^.seq.length do if data[n]^.seq.letters[l] = standard^.seq.letters[l] then data[n]^.seq.letters[l] := symbol end; (* end module sites.crunch *) (* begin module sites.uncrunch *) procedure uncrunch(var data: entries; numentries: integer; standard: entryptr; symbol: char); (* undo the crunch operation *) var l: integer; (* index to datum and standard *) n: integer; (* index to the entries *) begin for n := 1 to numentries do for l := 1 to data[n]^.seq.length do if data[n]^.seq.letters[l] = symbol then data[n]^.seq.letters[l] := standard^.seq.letters[l] end; (* end module sites.uncrunch *) (* begin module sites.sortdata *) procedure sortdata(var data: entries; numentries: integer; onname: boolean); (* sort the entries by name (onname is true) or by sequence (onname is false). *) type position = integer; function lessthan(a, b: position): boolean; (* is a less than b? *) begin if onname then lessthan := (data[a]^.number < data[b]^.number) else lessthan := lessstring(data[a]^.seq,data[b]^.seq) end; procedure swap(a, b: position); (* switch positions a and b *) var hold: entryptr; begin hold:=data[a]; data[a]:=data[b]; data[b]:=hold end; (* begin module quicksort *) procedure quicksort(left, right: position); (* quick sort a list between positions left and right, into ascending order. a position is simply a scalar of the form 0..max. the array to be sorted is dimensioned 1..max. (the difference in the ranges is important to the correct operation of the sort...) two external routines are used: function lessthan(a, b: position): boolean is a generalized test for value-at-a < value-at-b. procedure swap(a, b: position) switches the items at positions a and b. since these routines are external, the procedure is general. this procedure taken from the book 'algorithms + data structures = programs' by niklaus wirth, prentice-hall, inc., englewood cliffs, n.j.(1976), pp. 76-82 *) var lower, upper: position; (* the positions looked at currently *) center: position; (* the rough center of the region being sorted *) begin lower := left; center := (left + right) div 2; upper := right; repeat while lessthan(lower, center) do lower := succ(lower); while lessthan(center, upper) do upper := pred(upper); if lower <= upper then begin (* ho ho keep track of the center through the map: *) if lower = center then center:=upper else if upper = center then center:=lower; swap(lower, upper); lower := succ(lower); upper := pred(upper) end until lower > upper; if left < upper then quicksort(left, upper); if lower < right then quicksort(lower, right) end; (* end module quicksort version = 'prgmod 3.96 85 mar 18 tds'; *) begin (* sort the entries *) quicksort(1,numentries); end; (* end module sites.sortdata *) (* begin module sites.readheader *) procedure readheader(var database: text; var header: string); (* read the header from the database *) begin reset(database); getstring(database,header,true); readln(database); end; (* end module sites.readheader *) (* begin module sites.readexpt *) procedure readexpt(var database: text; var coords: coordinates; var std: entryptr); (* read from database the coordinates of the next experiment (expt), its coordinates, including thefraction of canonical base, calculate the fractions of the other bases, and the standard sequence. *) var gotten: boolean; (* got the standard and it has no deleted bases *) skip: char; (* charactyer for skipping the '@' of the database *) begin if database^ <> '@' then begin writeln(output,'PROGRAM ERROR: @ missing'); halt end; with coords do begin readln(database, skip, fromrange, torange, fromrandom, torandom, fracan); setcoordinates(coords); (* get the standard canonical sequence *) if std = nil then new(std); getentry(database,std,nil,gotten); if not gotten then begin writeln(output,'The standard must not contain deletion bases (d).'); halt end; if std^.seq.length <> torange - fromrange + 1 then begin writeln(output, 'the length of the standard is not consistent with', ' the given coordinates'); halt end; if fromrange > torange then begin writeln(output, 'fromrange (',fromrange:1, ') is greater than torange (', torange:1,')!'); halt; end; if fromrandom < fromrange then begin writeln(output, 'fromrandom (', fromrandom:1, ') is less than fromrange (', fromrange:1,')!'); writeln(output,'Randomized positions must be', ' within the region sequenced'); halt; end; if torandom > torange then begin writeln(output, 'torandom (', torandom:1, ') is greater than torange (', torange:1,')!'); writeln(output,'Randomized positions must be', ' within the region sequenced'); halt; end; if fromrandom > torandom then begin writeln(output, 'fromrandom (',fromrandom:1, ') is greater than torandom (', torandom:1,')!'); halt; end; if (fracan < 0.0) or (fracan > 1.0) then begin writeln(output, 'fraction canonical (',fracan:4:2, ') must be between 0 and 1!'); halt; end; end; end; (* end module sites.readexpt *) (* begin module sites.findextent *) procedure findextent(var database: text; var global: rblarray); (* find the extents of the range of the data and the random region *) var c: coordinates; (* coordinates of the experiment being read *) experiments: integer; (* the number of separate experiments performed *) std: entryptr; (* standard canonical sequence *) begin std := nil; (* allow multiple readings of std *) clearrbl(global); with global.coor do begin fromrange := maxint; torange := -maxint; fromrandom := maxint; torandom := -maxint; experiments := 0; writeln(output); writeln(output,'Checking Experiments'); while not eof(database) do begin if database^ = '@' then begin experiments := succ(experiments); writeln(output,'Experiment ',experiments:1); readexpt(database, c, std); (* find out the overall range of the data *) if c.fromrange < fromrange then fromrange := c.fromrange; if c.torange > torange then torange := c.torange; if c.fromrandom < fromrandom then fromrandom := c.fromrandom; if c.torandom > torandom then torandom := c.torandom; end else readln(database) end; if experiments = 0 then begin writeln(output,'No experiments found in database!'); writeln(output,'(perhaps you forgot the @ and standard lines?)'); halt end; if (fromrange = maxint) or (torange =-maxint) then begin writeln(output,'The experimental information is missing!'); halt; end; writeln(output, ' overall global range: ', fromrange:1,' to ',torange:1); writeln(output, 'overall randomized regions: ', fromrandom:1,' to ',torandom:1); end; (* now we can define the global length: *) global.coor.fracan := c.fracan; setcoordinates(global.coor); dispose(std); (* clear out what we don't need now *) end; (* end module sites.findextent *) (* begin module sites.readdata *) procedure readdata(var database: text; var data: entries; std: entryptr; var numentries: integer); (* Read the entry data for one experiment from the database, return number of them in numentries. *) var d: integer; (* actual number of sequence read in *) datum: entryptr; (* one of the entries *) done: boolean; (* done with reading one experimental set *) gotten: boolean; (* got the standard and it has no deleted bases *) begin numentries := 0; d := 0; done := false; while not done do begin if eof(database) then done := true; if not done then if database^ = '@' then done := true; if not done then begin (* create the space for the next one *) numentries := numentries + 1; d := d + 1; if numentries > maxentries then begin writeln(output,'The number of entries exceeds ',maxentries:1); writeln(output,'increase constant maxentries.'); halt end; { datum := data[numentries]; if datum = nil then new(data[numentries]); } if data[numentries] = nil then new(data[numentries]); datum := data[numentries]; getentry(database,datum,std,gotten); {here is the point to check the entry lengths } (* check that the length of the sequence matches the standard *) if datum^.seq.length <> std^.seq.length then begin writeln(output); writeln(output,'readdata: length of sequence datum ',d:1, ' does not equal the standard.'); writeln(output,'standard: ', std^.seq.length:5); writeln(output,' datum: ', datum^.seq.length:5); identify(output,datum); writeln(output,'WARNING: THIS SEQUENCE WILL BE SKIPPED'); writeln(output); numentries := numentries - 1; (* backpeddle! *) end; { writeln(output); writeln(output,'***********************************'); writeln(output,numentries:1); identify(output,data[numentries]); writeln(output,'***********************************'); writeln(output); } if not gotten then begin writeln(output,'WARNING: this entry was dropped', ' because of deletions ("d"):'); identify(output,data[numentries]); writeln(output); (* leave the space for the next one! *) numentries := numentries - 1; end end end end; (* end module sites.readdata *) (* begin module sites.copyentry *) procedure copyentry(a: entryptr; var b: entryptr); (* copy the entry a to b, except for the sequence itself *) begin b^.seq.length := a^.seq.length; copystring(a^. name, b^. name); b^.number := a^.number; copystring(a^.primer, b^.primer); copystring(a^. ex, b^. ex); copystring(a^. date, b^. date); end; (* end module sites.copyentry *) (* begin module sites.capsequence *) procedure capsequence(datum: entryptr; var naked: entryptr; standard: entryptr); (* using the standard, capitalize the sequence in datum whenever they are unequal, put the result in naked *) var l: integer; (* index to datum and standard *) begin copyentry(datum, naked); for l := 1 to datum^.seq.length do if datum^.seq.letters[l] <> standard^.seq.letters[l] then case datum^.seq.letters[l] of 'a': naked^.seq.letters[l] := 'A'; 'c': naked^.seq.letters[l] := 'C'; 'g': naked^.seq.letters[l] := 'G'; 't': naked^.seq.letters[l] := 'T'; 'x': naked^.seq.letters[l] := 'x'; (* nothing to note *) end else naked^.seq.letters[l] := datum^.seq.letters[l]; end; (* end module sites.capsequence *) (* begin module sites.strip *) procedure strip(coords: coordinates; datum: entryptr; var naked: entryptr; standard: entryptr; symbol, blank: char); (* strip the sequence in datum using the standard: whenever they are equal, put the symbol. outside of the random region use the blank character. put the result in naked *) var coordinate: integer; (* true coordinate system *) l: integer; (* index to datum and standard *) begin copyentry(datum, naked); with coords do for coordinate := fromrange to torange do begin l := coordinate - fromrange + 1; if datum^.seq.letters[l] = standard^.seq.letters[l] then if (coordinate >= fromrandom) and (coordinate <= torandom) then naked^.seq.letters[l] := symbol else naked^.seq.letters[l] := blank else naked^.seq.letters[l] := datum^.seq.letters[l]; end end; (* end module sites.strip *) (* begin module sites.checkdata *) procedure checkdata(var list: text; data: entries; d: integer; { standard: entryptr; } var failures: integer); (* check that the length of the sequence matches the standard. also, check that the sequence at d is identical with the sequence at the previous number if they have the same plasmid number. the routine assumes that the data are sorted. report results in list. increment the number of failures reported. *) { var datum: entryptr; (* one of the entries *) } begin { formerly the size check was done here, but move it out to readdata. (* check that the length of the sequence matches the standard *) datum := data[d]; if datum^.seq.length <> standard^.seq.length then begin writeln(output,'Checkdata: length of sequence datum ',d:1, ' does not equal the standard.'); writeln(output,'standard: ', standard^.seq.length:5); writeln(output,' datum: ', datum^.seq.length:5); identify(output,datum); there was a halt here end; } if d > 1 then begin if data[d]^.number = data[d-1]^.number then begin if not equalseq(data[d]^.seq,data[d-1]^.seq) then begin diffseq(list,data[d]^.seq,data[d-1]^.seq); failures := failures + 1; write(list,'DATA ERROR: sequence of "'); putid(list,data[d-1]); writeln(list,'"'); write(list,' does not equal "'); putid(list,data[d]); writeln(list,'"') end end end end; (* end module sites.checkdata *) (* begin module sites.figure *) procedure figure(var f: text; var data: entries; numentries: integer; {header: string; not anymore} experiment: integer; coords: coordinates; std: entryptr; kind, blank: char; testforequal: boolean); (* this procedure is similar to the display procedure, but it is designed just to put out a figure for a paper *) (* display the set of entries in data to file f. note: the entries are not destroyed if kind = 'c' then capitalize the differences to the standard; if kind = 's' then strip to '.' non differences to the standard. Regions outside the randomized area are set to 'blank'. If testforequal is true, then whenever a sequence is equal to the proceeding one and the strip option is chosen, equal symbols will be used in place of the periods. *) var datum: entryptr; (* one of the entries *) duplicates: integer; (* number of sequences duplicated *) dummy: integer; (* dummy variable *) naked: entryptr; (* datum changed relative to the std *) number: integer; (* the number of one datum *) begin writeln(f,'Experiment ',experiment:1); writeln(f); with coords do begin numberbar(f,0,fromrange,torange,dummy); putentry(f,std); end; duplicates := 0; new(naked); for number := 1 to numentries do begin datum := data[number]; if kind = 's' then if testforequal then begin if number > 1 then if equalstring(datum^.seq,data[number-1]^.seq) then begin strip(coords, datum,naked,std,'=',blank); duplicates := duplicates + 1 end else strip(coords, datum,naked,std,'.',blank) else strip(coords, datum,naked,std,'.',blank) end else strip(coords, datum,naked,std,'.',blank) else if kind = 'c' then capsequence(datum,naked,std); (* replacement for putentry(f,naked); *) putstring(f,naked^.seq); write(f,' '); (* replacement for putid(f,naked); *) with naked^ do begin putstring(f, name); write(f,' '); (* note: no primer name in this version of the program for the final latex figures: putstring(f,primer); write(f,' '); putstring(f, ex); write(f,' '); putstring(f, date); *) end; writeln(f); end; dispose(naked); end; (* end module sites.figure *) (* begin module sites.display *) procedure display(var f: text; var data: entries; numentries: integer; header: string; experiment: integer; coords: coordinates; std: entryptr; kind, blank: char; testforequal: boolean; var duplicates: integer); (* display the set of entries in data to file f. note: the entries are not destroyed if kind = 'c' then capitalize the differences to the standard; if kind = 's' then strip to '.' non differences to the standard. Regions outside the randomized area are set to 'blank'. If testforequal is true, then whenever a sequence is equal to the proceeding one and the strip option is chosen, equal symbols will be used in place of the periods. duplicates is the number of sequences duplicated *) const dashline = 80; (* number of characters in a dashed line *) var datum: entryptr; (* one of the entries *) failures: integer; (* number of failed checks *) naked: entryptr; (* datum changed relative to the std *) number: integer; (* the number of one datum *) procedure dashit; (* make a dashed line *) var dashcount: integer; (* for counting dashes *) begin for dashcount := 1 to dashline do write(f,'-'); writeln(f) end; begin mkheader(f, header, experiment, coords, std); failures := 0; duplicates := 0; new(naked); for number := 1 to numentries do begin { writeln(f,'debug: display, number=',number:1); } datum := data[number]; if kind = 's' then if testforequal then begin if number > 1 then if equalstring(datum^.seq,data[number-1]^.seq) then begin strip(coords, datum,naked,std,'=',blank); duplicates := duplicates + 1 end else strip(coords, datum,naked,std,'.',blank) else strip(coords, datum,naked,std,'.',blank) end else strip(coords, datum,naked,std,'.',blank) else if kind = 'c' then begin capsequence(datum,naked,std); (* now put a line if the plasmid number has changed *) if number = 1 then dashit else if not equalstring(datum^.name,data[number-1]^.name) then dashit end; putentry(f,naked); checkdata(f,data,number,{std,}failures); end; if failures > 0 then begin writeln(output); writeln(output,failures:1,' data errors:'); writeln(output,'inconsistant readings, no analysis will be done'); halt end; dispose(naked); end; (* end module sites.display *) (* begin module sites.compress *) procedure compress(var data: entries; var numentries: integer; std: entryptr; var caps: text); (* compress together multiple entries of the same plasmid number, change numentries to account for the compression. The standard sequence std allows the detection of cases where a mutant is chosen against an x. That is, if one sequence has 'x' and another has 'a' then 'a' should be chosen. If 'a' is the standard sequence this is not a problem. If however, 'a' is NOT the standard sequence, then the user should be warned, because this may be an error that goes down as a mutation. The warnings go into caps, where they are most useful *) var f: char; (* a character in the fillspot sequence *) fillspot: integer; (* the location to put the next entry *) l: integer; (* index into the sequences *) r: char; (* a character in the readspot sequence *) readspot: integer; (* the number of one datum that we are inspecting *) procedure death; (* put here to make code easier to read below *) begin writeln(output,'program error in compress:'); writeln(output,'position in sequence:',l:2); writeln(output,'filling at ',fillspot:2,', letter is: ',f); writeln(output,'reading at ',readspot:2,', letter is: ',r); writeln(output,'this sequence difference was missed'); halt end; procedure warning(b: char); (* see if the base b is standard, and if not, warn the user *) begin if b <> std^.seq.letters[l] then begin writeln(output,'check sequence',data[fillspot]^.number, ' - see caps file'); writeln(caps); write(caps,'WARNING: sequence: '); putid(caps,data[fillspot]); writeln(caps); write(caps,'and sequence : '); putid(caps,data[readspot]); writeln(caps); writeln(caps,'differ by an x, and the other base', ' is mutant (',b,'). Check that it IS mutant.') end end; begin fillspot := 1; readspot := 2; while readspot <= numentries do begin if data[fillspot]^.number = data[readspot]^.number then begin (* merge the readspot entry into the fillspot entry *) for l := 1 to data[fillspot]^.seq.length do begin f := data[fillspot]^.seq.letters[l]; r := data[readspot]^.seq.letters[l]; if f <> r then begin (* only act if they differ *) if f = 'x' then begin data[fillspot]^.seq.letters[l] := r; warning(r); end else if r <> 'x' then death else warning(f); end end; (* change the primer to indicate the merge *) if not equalstring(data[fillspot]^.primer,data[readspot]^.primer) then begin with data[fillspot]^.primer do begin letters[1] := 'b'; letters[2] := 'o'; letters[3] := 't'; letters[4] := 'h'; length := 4; end end end else begin (* move the readspot entry to fillspot *) fillspot := fillspot + 1; (* could not merge, so move to next spot *) data[fillspot] := data[readspot]; (* and fill it *) end; readspot := readspot + 1 (* continue to read through the entries *) end; numentries := fillspot; (* also account for the other way to detect the end of the data: *) if numentries < maxentries then data[numentries+1] := nil; writeln(output,'compressed data set: ',numentries:1,' sequences'); end; (* end module sites.compress *) (* begin module sites.detfi *) procedure detfi(fo: rblarray; std: entryptr; var fi: rblarray); (* determine fi(b,l), the input frequency array, from the standard sequence in std. *) var b: base; (* index to array *) coordinate: integer; (* the coordinate relative to zero in the standard *) inrandom: boolean; (* true if working within the randomized region *) l: integer; (* index to array *) begin clearrbl(fi); copycoords(fo.coor,fi.coor); with fo.coor do begin for coordinate := fromrandom to torandom do begin l := coordinate - fromrange + 1; inrandom := (coordinate >= fromrandom) and (coordinate <= torandom); for b:= a to t do begin if basetochar(b) = std^.seq.letters[l] then if inrandom then fi.data[b,l] := fracan else fi.data[b,l] := 1.0 else if inrandom then fi.data[b,l] := notfracan else fi.data[b,l] := 0.0 end end end end; (* end module sites.detfi *) (* begin module sites.detrho *) procedure detrho(fi: rblarray; fo: rblarray; var rho: rblarray); (* determine rho(b,l) *) var b: base; (* index to array *) coordinate: integer; (* the coordinate relative to zero in the standard *) l: integer; (* index to array *) begin clearrbl(rho); copycoords(fi.coor,rho.coor); with fi.coor do for coordinate := fromrandom to torandom do begin l := coordinate - fromrange + 1; for b:= a to t do begin if fi.data[b,l]=0.0 then writeln(output,'detrho: 7 zowie! PROGRAM ERROR!'); if fi.data[b,l] > 0.0 then rho.data[b,l] := fo.data[b,l]/fi.data[b,l] else if fo.data[b,l] = fi.data[b,l] then rho.data[b,l] := 1.0 else rho.data[b,l] := 0.0 end end end; (* end module sites.detrho *) (* begin module sites.detfpi *) procedure detfpi(fi: rblarray; var fpi: rblarray); (* determine fpi(b,l), as all 1/4 values. This represents the prime (normalized) input frequency values *) var b: base; (* index to array *) coordinate: integer; (* the coordinate relative to zero in the standard *) l: integer; (* index to array *) begin clearrbl(fpi); copycoords(fi.coor,fpi.coor); with fi.coor do for coordinate := fromrandom to torandom do begin l := coordinate - fromrange + 1; for b := a to t do fpi.data[b,l] := 0.25 end end; (* end module sites.detfpi *) (* begin module sites.detfpo *) procedure detfpo(fpi: rblarray; rho: rblarray; var fpo: rblarray); (* determine fpo(b,l), experimental frequencies, normalized *) var b: base; (* index to array *) coordinate: integer; (* the coordinate relative to zero in the standard *) l: integer; (* index to array *) sum: real; (* normalizing sum *) begin clearrbl(fpo); copycoords(fpi.coor,fpo.coor); with fpi.coor do for coordinate := fromrandom to torandom do begin l := coordinate - fromrange + 1; sum := 0.0; for b:= a to t do sum := sum + fpi.data[b,l] * rho.data[b,l]; if sum=0.0 then writeln(output,'detfpo: 6 zowie! PROGRAM ERROR!'); (* make sure it sums to 1 by dividing by sum *) for b:= a to t do fpo.data[b,l] := fpi.data[b,l] * rho.data[b,l] / sum end end; (* end module sites.detfpo *) (* begin module sites.detfpon *) procedure detfpon(wildtype: iblarray; fpi: rblarray; var fpon: rblarray); (* determine fpon(b,l), natural information frequencies output "normalized". "Normalized" because it simply assumes equiprobable before state. See molecular machines paper I. *) var b: base; (* index to array *) coordinate: integer; (* the coordinate relative to zero in the standard *) inrandom: boolean; (* true if working within the randomized region *) l: integer; (* index to array *) sum: real; (* normalizing sum *) begin clearrbl(fpon); copycoords(fpi.coor,fpon.coor); with fpi.coor do for coordinate := fromrange to torange do begin l := coordinate - fromrange + 1; inrandom := (coordinate >= fromrandom) and (coordinate <= torandom); if inrandom then begin sum := 0.0; for b:= a to t do sum := sum + wildtype.data[b,l]; if sum > 0.0 then for b:= a to t do fpon.data[b,l] := wildtype.data[b,l]/sum else begin writeln(output); writeln(output,'WARNING: wild type frequency table is odd'); writeln(output,'at coordinate ',coordinate:1, ' it does not have any data.'); writeln(output); fpon.data[b,l] := 1 end; { ;if sum=0.0 then writeln(output,'5 zowie! PROGRAM ERROR!'); } end else for b := a to t do fpon.data[b,l] := fpi.data[b,l] end end; (* end module sites.detfpon *) (* begin module sites.calcinfo *) procedure calcinfo({sample: samplestat; not useful, since method not known} fi, fo: rblarray; var i: rlarray); (* calculate information table from the input and output frequencies. correct the curve using the sample statistics *) var b: base; (* index to array *) coordinate: integer; (* the coordinate relative to zero in the standard *) l: integer; (* index to array *) ln2: real; (* natural log of 2 *) begin ln2 := ln(2); with fi.coor do for coordinate := fromrandom to torandom do begin l := coordinate - fromrange + 1; i.data[l] := 0.0; for b:= a to t do if fo.data[b,l] > 0 then i.data[l] := i.data[l] + fo.data[b,l] * ln(fo.data[b,l]/fi.data[b,l]); (* else no information is added *) i.data[l] := i.data[l]/ln2; (* convert to bits *) (* since we don't know how to calculate sample error, this correction is deleted for now: i.data[l] := i.data[l] - sample.en *) end end; (* end module sites.calcinfo *) (* begin module sites.detfproportion *) procedure detfproportion( nbl: iblarray; std: entryptr; var fproportion: rblarray); (* determine fi(b,l) by using the base proportions of output fo array, for statistical calculations *) var b: base; (* index to array *) coordinate: integer; (* the coordinate relative to zero in the standard *) frac: array[a..t] of real; (* proportions of bases as fractions *) inrandom: boolean; (* true if working within the randomized region *) l: integer; (* index to array *) prop: array[a..t] of integer; (* proportions of bases *) total: integer; (* total bases in output *) begin clearrbl(fproportion); copycoords(nbl.coor,fproportion.coor); for b := a to t do prop[b] := 0; (* clear array *) with nbl.coor do for coordinate := fromrandom to torandom do begin l := coordinate - fromrange + 1; for b:= a to t do begin prop[b] := prop[b] + nbl.data[b,l] end end; total := 0; for b := a to t do total := total + prop[b]; for b := a to t do frac[b] := prop[b]/total; if total=0.0 then writeln(output,'9 zowie! PROGRAM ERROR!'); write(stats,' base proportions in random region: '); write(stats,' a = ',prop[a]:1); write(stats,' c = ',prop[c]:1); write(stats,' g = ',prop[g]:1); write(stats,' t = ',prop[t]:1); writeln(stats); (* write(output,' base frac in random region: '); write(output,' a = ',frac[a]:10:5); write(output,' c = ',frac[c]:10:5); write(output,' g = ',frac[g]:10:5); write(output,' t = ',frac[t]:10:5); writeln(output); *) with nbl.coor do for coordinate := fromrange to torange do begin l := coordinate - fromrange + 1; inrandom := (coordinate >= fromrandom) and (coordinate <= torandom); for b:= a to t do begin if basetochar(b) = std^.seq.letters[l] then if inrandom then fproportion.data[b,l] := frac[b] else fproportion.data[b,l] := 1.0 else if inrandom then fproportion.data[b,l] := frac[b] else fproportion.data[b,l] := 0.0; end; end end; (* end module sites.detfproportion *) (* begin module sites.detfbottle *) procedure detfbottle( nbl: iblarray; var missing: boolean; std: entryptr; var fbotcon: rblarray; var fbottle: rblarray); (* determine fi(b,l) using the base proportions nbl. segregated by the bottles, for statistical calculations. fbottle: input frequencies using fo array proportions separated by base bottles fbotcon: input frequencies using fo array proportions separated by base bottles, assuming that the canonical is all that matters If a base is missing from a bottle, then missing is true and statistical calculations are impossible! *) var b: base; (* index to array *) badbottle: boolean; (* true when one bottle is bad because it has a missing base but should be a dirty bottle *) bottle: base; (* index to array *) coordinate: integer; (* the coordinate relative to zero in the standard *) frac: array[a..t] of real; (* proportions of bases as fractions *) inrandom: boolean; (* true if working within the randomized region *) l: integer; (* index to array *) prop: array[a..t] of integer; (* proportions of bases *) total: integer; (* total bases in output *) begin clearrbl(fbotcon); copycoords(nbl.coor,fbotcon.coor); clearrbl(fbottle); copycoords(nbl.coor,fbottle.coor); missing := false; writeln(stats,' base proportions in random region'); with nbl.coor do for bottle := a to t do begin for b := a to t do prop[b] := 0; (* clear array *) for coordinate := fromrandom to torandom do begin l := coordinate - fromrange + 1; if basetochar(bottle) = std^.seq.letters[l] then for b:= a to t do begin prop[b] := prop[b] + nbl.data[b,l] end end; total := 0; for b := a to t do total := total + prop[b]; if total=0.0 then writeln(output,'8 zowie! PROGRAM ERROR!'); for b := a to t do frac[b] := prop[b]/total; badbottle := false; for b := a to t do badbottle := badbottle or (prop[b] = 0); missing := missing or badbottle; write(stats,' for bottle ',basetochar(bottle),': '); write(stats,' a = ',prop[a]:5); write(stats,' c = ',prop[c]:5); write(stats,' g = ',prop[g]:5); write(stats,' t = ',prop[t]:5); if badbottle then write(stats, ' <= MISSING BASE(s)'); writeln(stats); write(stats,' for bottle ',basetochar(bottle),': '); write(stats,' a = ',frac[a]:5:2); write(stats,' c = ',frac[c]:5:2); write(stats,' g = ',frac[g]:5:2); write(stats,' t = ',frac[t]:5:2); writeln(stats); for coordinate := fromrange to torange do begin l := coordinate - fromrange + 1; inrandom := (coordinate >= fromrandom) and (coordinate <= torandom); if basetochar(bottle) = std^.seq.letters[l] then for b:= a to t do begin if inrandom then begin fbottle.data[b,l] := frac[b]; (* fbotcon is defined as the fraction of the canonical base, with equially likely bases for the other bases. I don't think this is a reasonable model for most nucleic acid recognizers, but Charles Lawrence suggested testing for it, so here it is. *) if b = bottle then fbotcon.data[b,l] := frac[b] else fbotcon.data[b,l] := (1.0-frac[bottle])/3.0 end (* outside the random region, just make the bottles be the fixed bases of the standard *) else if b=bottle then begin fbottle.data[b,l] := 1.0; fbotcon.data[b,l] := 1.0 end else begin fbottle.data[b,l] := 0.0; fbotcon.data[b,l] := 0.0 end end end end; if missing then begin writeln(output,'***********************************************'); writeln(output,'* WARNING WARNING WARNING WARNING WARNING *'); writeln(output,'* One of the dirty bottles is missing a base! *'); writeln(output,'* See the stats file for details. *'); writeln(output,'***********************************************'); end; end; (* end module sites.detfbottle *) (* begin module sites.signif *) procedure signif(var stats: text; nl: ilarray; fi, fo: rblarray; paramod: integer); (* figure out the significance of the difference between the two frequency arrays fi and fo. put results in stats. paramod is the number of parameters modeled. signif assumes identical sizes for the three arrays. *) var b: base; (* index to array *) coordinate: integer; (* the coordinate relative to zero in the standard *) degfree: integer; (* number of degrees of freedom *) exp: real; (* expected number of counts *) gsq: real; (* G-squared *) infinite: boolean; (* true for infinite significance *) l: integer; (* index to array *) obs: real; (* observed number of counts *) xsq: real; (* chi squared *) begin gsq := 0; xsq := 0; infinite := false; with nl.coor do begin for coordinate := fromrandom to torandom do begin l := coordinate - fromrange + 1; if not infinite then for b:= a to t do begin if fi.data[b,l]=0.0 then begin writeln(stats, 'fi(',basetochar(b),',',coordinate:1,') = 0', ' so the significance is infinite'); infinite := true; end; if not infinite then begin if fo.data[b,l] > 0 then gsq := gsq + nl.data[l] * fo.data[b,l] * ln(fo.data[b,l]/fi.data[b,l]); (* else don't add *) obs := nl.data[l] * fo.data[b,l]; exp := nl.data[l] * fi.data[b,l]; if exp > 0 then xsq := xsq + sqr(obs-exp)/exp; end; end; (* writeln(output,'n[',l:2,']=',nl.data[l]:3);*) end; if not infinite then begin gsq := 2*gsq; degfree := 3 * (torandom - fromrandom + 1) - paramod; writeln(stats,' ',degfree:1,' degrees of freedom'); writeln(stats,' G-squared = ',gsq:6:2); writeln(stats,' X-squared = ',xsq:6:2); writeln(stats); end; end; end; (* end module sites.signif *) (* begin module sites.mle *) procedure mle(var list: text; std: entryptr; nbl: iblarray); (* calculate the maximum-likelihood value for the canonical base *) var b: base; (* index to array *) coordinate: integer; (* the coordinate relative to zero in the standard *) l: integer; (* index to array *) sum: integer; (* sum of the canonical bases *) total: integer; (* total number of sequenced bases *) begin sum := 0; total := 0; with nbl.coor do begin for coordinate := fromrandom to torandom do begin l := coordinate - fromrange + 1; for b:= a to t do begin if basetochar(b) = std^.seq.letters[l] then sum := sum + nbl.data[b,l]; total := total + nbl.data[b,l] end; end; writeln(list,'number of canonical bases: ',sum:5); writeln(list,'number of bases sequenced: ',total:5); if total=0.0 then writeln(output,'2 zowie! PROGRAM ERROR!'); writeln(list,'maximum likelihood estimate of canonical: ', '(',sum:1,'/',total:1,')=',(sum/total):5:3); end; end; (* end module sites.mle *) {this is probably never going to be used: procedure showit(var list: text; std: entryptr; header: string; e,n: rlarray); (* show the information curves, experimental (e) and natural (n) into the list file *) var coordinate: integer; (* the coordinate relative to zero in the standard *) l: integer; (* index to array *) begin dataheader(list,header,std); writeln(list,'* information curves, ready for graphics'); writeln(list,'* n: natural, e: experimental (normalized) curve'); writeln(list,'* type, coordinate, info'); with e.coor do begin for coordinate := fromrange to torange do begin l := coordinate - fromrange + 1; writeln(list,'n ', coordinate:3,' ',n.data[l]:10:8); end; for coordinate := fromrange to torange do begin l := coordinate - fromrange + 1; writeln(list,'e ', coordinate:3,' ',e.data[l]:10:8); end end end; } (* begin module sites.tabulate *) procedure tabulate(data: entries; numentries: integer; co: coordinates; var nl: ilarray; var nbl: iblarray; var fbl: rblarray); (* tabulate the three arrays from the data, numentries is the number to tabulate *) var b: base; (* a base in the d sequence *) l: integer; (* position in the d sequence *) d: integer; (* data on a sequence *) begin clearrbl(fbl); copycoords(co,nl.coor); copycoords(co,nbl.coor); copycoords(co,fbl.coor); (* clear the nl and nbl arrays *) for l := 1 to maxseq do begin for b := a to t do nbl.data[b,l] := 0; nl.data[l] := 0; end; (* tabulate the nbl table *) for d := 1 to numentries do begin for l := 1 to data[d]^.seq.length do begin b := chartorange(data[d]^.seq.letters[l]); if b in [a,c,g,t] then nbl.data[b,l] := nbl.data[b,l] + 1; end end; (* create nl from nbl. note: all have the same length *) for l := 1 to data[1]^.seq.length do for b := a to t do nl.data[l] := nl.data[l] + nbl.data[b,l]; for l := 1 to data[1]^.seq.length do if nl.data[l]=0 then writeln(output,'13 zowie! PROGRAM ERROR!'); if data[1]^.seq.length <> nl.coor.lengthrange then writeln(output,'13.5 zowie!'); (* create fbl from nbl and nl *) for l := 1 to data[1]^.seq.length do for b := a to t do fbl.data[b,l] := nbl.data[b,l] / nl.data[l]; end; (* end module sites.tabulate *) (* begin module sites.makefreq *) procedure makefreq(var rbl: rblarray); (* make the rbl array a frequency table by normalizing *) var b: base; (* a base *) l: integer; (* a position *) sum: real; (* sum of rbl at l *) begin with rbl do for l := 1 to coor.lengthrange do begin sum := 0.0; for b := a to t do sum := sum + rbl.data[b,l]; if sum <> 0.0 then for b := a to t do rbl.data[b,l] := rbl.data[b,l]/sum; end end; (* end module sites.makefreq *) (* begin module sites.normalize *) procedure normalize(std: entryptr; var fi: rblarray; (* experimental input frequencies (0,fracan, notfracan) *) fo: rblarray; (* experimental output frequencies (raw sequence data) *) var rho: rblarray; (* the ratio of fo / fi *) var fpi: rblarray; (* the normalized input table (0.25's everywhere) *) var fpo: rblarray); (* the fi table normalized. *) (* normalize the fo array (the fbl array) and put the result in fpo (the fblnormalized array). Intermediates that are generated: rho: the ratio of fo / fi fpi: the normalized input table (0.25's everywhere) fpo: the normalized output table. *) begin detfi(fo, std, fi); (* determine fi(b,l), using fo as an example *) detrho(fi, fo, rho); (* determine rho(b,l), as fo/fi *) detfpi(fi, fpi); (* determine f'i(b,l), as all 0.25. *) detfpo(fpi, rho, fpo); (* determine f'o(b,l), see the paper for method *) end; (* end module sites.normalize *) (* begin module sites.addrbl *) procedure addrbl(ar: rblarray; var br: rblarray); (* Add the frequency data of array ar into array br, over the randomized region only. The reason for limiting it to this region is that otherwise one would add fixed base data into random base data, which would be a misrepresentation. Make sure that the randomized range of ar fits correctly into that of br. If it does not fit, then the size of br was defined incorrectly. That is, br must be larger than ar. *) var b: base; (* a base *) la: integer; (* coordinate in ar array *) lb: integer; (* coordinate in b array *) la1toN: integer; (* conversion of la to 1 to N coordinates *) lb1toN: integer; (* conversion of lb to 1 to N coordinates *) begin if ar.coor.fromrandom < br.coor.fromrandom then begin writeln(output, 'addrbl: attempt to add array ar out of bounds of br'); writeln(output,'ar.coor.fromrandom = ',ar.coor.fromrandom:1, ' br.coor.fromrandom = ',br.coor.fromrandom:1); halt end; if ar.coor.torandom > br.coor.torandom then begin writeln(output, 'addrbl: attempt to add array ar out of bounds of br'); writeln(output,'ar.coor.torandom = ',ar.coor.torandom:1, ' br.coor.torandom = ',br.coor.torandom:1); halt end; for la := ar.coor.fromrandom to ar.coor.torandom do begin la1toN := la - ar.coor.fromrange + 1; lb := la; lb1toN := lb - br.coor.fromrange + 1; (* program check *) if (lb < br.coor.fromrandom) or (lb > br.coor.torandom) or (lb < br.coor.fromrange) or (lb > br.coor.torange) then begin writeln(output,'addrbl: program error in array b'); halt end; for b := a to t do br.data[b,lb1toN] := br.data[b,lb1toN] + ar.data[b,la1toN]; end; end; (* end module sites.addrbl *) (* begin module sites.showstatistics *) procedure showstatistics(var stats: text; coords: coordinates; experiment: integer; header: string; (* database identifier *) std: entryptr; (* the canonical standard *) wildtype: iblarray; (* wild type frequencies *) nbl: iblarray; duplicates: integer); (* show the table of results *) var b: base; (* one of the bases at l *) coordinate: integer; (* the position relative to the zero base *) l: integer; (* position in the site *) begin mkheader(stats, header, experiment, coords, std); writeln(stats); { writeln(stats,' l | b | experimental | wildtype |'); uuu } writeln(stats,' | | experimental | wildtype |'); writeln(stats,' l | b | A C G T | A C G T |'); writeln(stats,'------------------------------------------------------'); for l := 1 to coords.lengthrange do begin coordinate := l-1+coords.fromrange; write(stats,coordinate:4); write(stats,' |'); (* canonical base *) if (coordinate < coords.fromrandom) or (coordinate > coords.torandom) then write(stats,' ',capitalize(std^.seq.letters[l])) else write(stats,' ',std^.seq.letters[l]); write(stats,' |'); for b := a to t do begin (* experimental frequencies *) write(stats,' ',nbl.data[b,l]:4); end; write(stats,' |'); for b := a to t do begin (* wild type frequencies *) write(stats,' ',wildtype.data[b,l]:4); end; write(stats,' |'); writeln(stats); end; writeln(stats); writeln(stats, 'number of duplicate sequences: ',duplicates:1,' sequences'); writeln(stats); end; (* end module sites.showstatistics *) (* begin module sites.makersdata *) procedure makersdata(var rsdata, database, standard: text; fpo: rblarray; ipe: rlarray; nlexpt: ilarray); (* make the fbl(b,l) table for use by the MakeLogo program for producing a logo. The result is an rsdata file. Assume that there are nfactor sequences. Use database and standard for identification. A bug: nlexpt is supposed to be the number of bases at position l in the experiment. However, if there are several experiments, they should not be weighted equally. Also, it is not correct to calculate the sampling error from these numbers since partial (fracan > 0.25) randomization implies that nl is effectively smaller than the number of sequences. This portion of the code is kept, but awaits better understanding of how to do the calculation. *) const nfactor = scaleup; (* number of sequences assumed for the table. This simply preserves the resolution to several decimal places *) var avarhnb: real; (* variance of the information r *) b: base; (* index to array *) coordinate: integer; (* the coordinate relative to zero in the standard *) e: real; (* for calculating small sample bias and variance *) l: integer; (* index to array *) ln2: real; (* natural log of 2 *) nbl: integer; (* number of bases b at position l *) nl: integer; (* number of bases at position l *) rsl: real; (* information at position l *) rs: real; (* total information *) s: integer; (* nubmer of symbols *) sumavarhnb: real; (* sum of variance of the information r *) begin (* simulate the header of a data file output from rseq *) writeln(rsdata,'* sites ',version:4:2); reset(database); write(rsdata,'* '); copyaline(database,rsdata); reset(standard); write(rsdata,'* '); copyaline(standard,rsdata); for l := 1 to 8 do writeln(rsdata,'*'); with fpo.coor do writeln(rsdata,'* ',fromrandom:nfield,' ',torandom:nfield); for l := 1 to 12 do writeln(rsdata,'*'); writeln(rsdata,'* l a c g t rs(l) rs', ' var(hnb) sum-var n(l) e(hnb) f'); s := 4; (* number of symbols *) ln2 := ln(2.0); rs := 0.0; sumavarhnb := 0.0; with fpo.coor do begin for coordinate := fromrandom to torandom do begin l := coordinate - fromrange + 1; (* l *) write(rsdata,' ',coordinate:nfield); (* a c g t *) nl := 0; for b := a to t do begin nbl := round(nfactor*fpo.data[b,l]); nl := nbl + nl; write(rsdata,' ',nbl:(nfield+1)); end; { If nlexpt does not represent the global numbers of bases at each position in the experiment, (ie, only represents the numbers of bases of one experiment), then the data outside the nlexpt will be missing and the program will detect the error: if nlexpt.data[l]=0 then begin writeln(output,'10 zowie! PROGRAM ERROR!'); writeln(output,'l = ',l:1); writeln(output,'coordinate = ',coordinate:1); for s := 1 to nlexpt.coor.lengthrange do writeln(output,'nlexpt.data[',s:1,']=',nlexpt.data[s]:1); end; (* a correction for small sample sizes *) } (* fool the compiler so that it thinks we are using nlexpt: *) e := (s-1) / (2 * ln2 * nlexpt.data[l]); (* override: we really don't know a method!! *) e := 0.0; (* rs(l) *) rsl := ipe.data[l]-e; write(rsdata,' ',rsl:infofield:infodecim); (* rs *) rs := rs + rsl; write(rsdata,' ',rs:infofield:infodecim); (* var(hnb) sum-var n(l) e(hnb) f *) (* calculate small sample variance. The calculation is based on the procedure calaehnb in the rseq.p program. *) avarhnb := e * e; avarhnb := -1.0; (* This is the mark in the rsdata file that indicates that the sample variation should not be displayed in the sequence logo (since we don't know a method). *) sumavarhnb := avarhnb + sumavarhnb; write(rsdata, ' ',avarhnb:infofield:infodecim, (* var(hnb) *) ' ',sumavarhnb:infofield:infodecim, (* sum-var *) ' ',nl:nfield, (* n(l) *) ' ',(2.0-e):infofield:infodecim, (* e(hnb) *) ' ','a'); (* f *) writeln(rsdata); end; writeln(rsdata,'* Rsequence = ',rs:infofield:infodecim, { ,' +/- ', sqrt(avarhnb):infofield:infodecim,} ' for range from ',fromrandom:1,' to ',torandom:1); writeln(output); writeln(output,'Rsequence = ',rs:infofield:infodecim, { ' +/- ',sqrt(avarhnb):infofield:infodecim,} ' for range from ',fromrandom:1,' to ',torandom:1); end; end; (* end module sites.makersdata *) (* begin module sites.dostats *) procedure dostats(var stats: text; experiment: integer; (* which experiment *) std: entryptr; (* the canonical standard *) wildtype: iblarray; (* wild type frequencies *) nl: ilarray; nbl: iblarray; fi,fo,fpi: rblarray); (* Calculate statistics of the base frequencies observed in the experiment, based on various kinds of models. *) var fbottle: rblarray; (* input frequencies using fo array proportions separated by base bottles *) fbotcon: rblarray; (* input frequencies using fo array proportions separated by base bottles, assuming that the canonical is all that matters *) fproportion: rblarray; (* input frequencies using fo array proportions *) fpon: rblarray; (* natural frequencies *) { probably never to be used: ie: rlarray; (* information of experimental data *) ipe: rlarray; (* information of experimental data normalized *) ipn: rlarray; (* information of natural data *) } missing: boolean; (* true if one of the dirty bottles had one of the bases missing! *) procedure bar(var stats: text; experiment: integer); begin writeln(stats); write(stats,'Experiment ',experiment:1); writeln(stats, '---------------------------------'); end; begin bar(stats,experiment); writeln(stats, 'significance for completely saturated model'); signif(stats, nl, fi, fo, 0); showrblarray(stats, fi); bar(stats,experiment); writeln(stats, 'significance for completely proportional model'); detfproportion(nbl, std, fproportion); (* determine proportional info *) signif(stats, nl, fproportion, fo, 3); showrblarray(stats, fproportion); bar(stats,experiment); writeln(stats, 'significance for completely proportional model, bottles'); (* determine proportional info *) detfbottle(nbl, missing, std, fbotcon, fbottle); if missing then writeln(stats,'Cannot calculate significance because of missing', ' bases in dirty bottle') else signif(stats, nl, fbottle, fo, 12); showrblarray(stats, fbottle); bar(stats,experiment); writeln(stats, 'significance for completely proportional model, bottles'); writeln(stats, 'canonical base, equally likely for other bases'); if missing then writeln(stats,'Cannot calculate significance because of missing', ' bases in dirty bottle') else signif(stats, nl, fbotcon, fo, 4); showrblarray(stats, fbotcon); bar(stats,experiment); writeln(stats,'calculate the maximum-likelihood value', ' for the canonical base'); mle(stats, std, nbl); bar(stats,experiment); detfpon(wildtype, fpi, fpon); (* determine f'on(b,l) *) writeln(stats, 'significance for wild-type sequences model'); signif(stats, nl, fpon, fo, 4*nl.coor.lengthrandom); showrblarray(stats, fpon); writeln(stats); writeln(stats, '---------------------------------------------'); { probably never to be used: showit(list, std, header, ipe, ipn); (* display info results *) } end; (* end module sites.dostats *) (* begin module sites.fancytable *) procedure fancytable(var list: text; experiment: integer; (* experiment number *) nl: ilarray; (* number of sequences at position l *) nbl: iblarray; (* number of bases at position l *) std: entryptr; LaTeX: boolean); (* show the n(b,l) table with the number of sequences at each position, n(l). Display the canonical sequence from std. If LaTeX is true, put the data out in LaTeX table format *) var b: base; (* one of the bases at l *) coordinate: integer; (* the position relative to the zero base *) l: integer; (* position in the site *) begin if LaTeX then begin writeln(list); writeln(list,'\newpage'); writeln(list,'\begin{table}[p] % h here; t top; b bottom;', ' p page of floats'); writeln(list,'\begin{center}'); (*drop nl: writeln(list,'\begin{tabular}{|rcrrrr|r|}'); *) writeln(list,'\begin{tabular}{|rcrrrr|}'); writeln(list,'\hline'); writeln(list,'Coor- & Canonical ', '& \multicolumn{4}{c|}{Numbers of} \\'); {,} { not in use ' & Number of \\');} writeln(list,'dinate & Base & A & C & G & T \\'); { & sequences \\');} writeln(list,'\hline'); end else begin writeln(list); page(list); writeln(list); writeln(list,' l | b | A | C | G | T | number'); writeln(list,'------------------------------------------------'); (* -27 | g | 0 | 0 | 53 | 0 | 53 *) end; with nbl.coor do begin for coordinate := fromrange to torange do begin l := coordinate - fromrange + 1; write(list,' ',coordinate: nfield); if LaTeX then write(list, ' &') else write(list,' |'); write(list,' ',std^.seq.letters[l]); for b := a to t do begin if LaTeX then write(list, ' &') else write(list,' |'); write(list,' ',nbl.data[b,l]:nfield); end; if not LaTeX then write(list,' |'); { modified to prevent nl on latex tables: } if not LaTeX then write(list,' ',nl.data[l]:nfield); if LaTeX then write(list, ' \\'); writeln(list); end; end; if LaTeX then begin writeln(list,'\hline'); writeln(list,'\end{tabular}'); writeln(list,'\end{center}'); writeln(list,'\caption{Number of bases observed at each position'); writeln(list,' in the randomized sequences', ' of experiment ',experiment:1,'}'); writeln(list,' \label{table.nbl',experiment:1,'}'); writeln(list,'\end{table}'); writeln(list); end; writeln(list); end; (* end module sites.fancytable *) (* begin module sites.pageheader *) procedure pageheader(var list: text; header: string; experiment: integer; std: entryptr); (* put a few things out *) begin writeln(list); announcement(list, header); putentry(list, std); writeln(list); writeln(list, 'Experiment ', experiment:1); end; (* end module sites.pageheader *) (* begin module sites.dotables *) procedure dotables(var list: text; header: string; experiment: integer; std: entryptr; nbl: iblarray; (* number of bases at position l *) fi, fo, rho, fpi, fpo: rblarray); (* write the arrays out to the list file *) procedure newpage; begin writeln(list); page(list); pageheader(list,header,experiment,std); end; begin if experiment > 1 then newpage else pageheader(list,header,experiment,std); writeln(list, 'n(b,l): input numbers'); showiblarray(list, nbl); newpage; writeln(list, 'fi(b,l), input frequency table', ' (fractions canonical and not canonical):'); showrblarray(list, fi); newpage; writeln(list,'fo(b,l), raw experimental frequencies:'); showrblarray(list, fo); newpage; writeln(list,'rho(b,l), ratio of fo to fi tables:'); showrblarray(list, rho); newpage; writeln(list,'f''i(b,l), normalized input frequencies (0.25)'); showrblarray(list, fpi); newpage; writeln(list,'f''o(b,l), normalized output frequency table:'); showrblarray(list, fpo); end; (* end module sites.dotables *) (* begin module putmaxstring *) procedure putmaxstring(var f: text; var s: string; var counter: integer; maxcounter: integer); (* write the string to f. Like procedure putstring, but the number of characters put out is counted in counter and if it exceeds maxcounter, no more characters are output. *) var l: integer; (* index to the string *) begin for l := 1 to s.length do begin counter := succ(counter); if counter < maxcounter then write(f,s.letters[l]) end; end; (* end module putmaxstring *) (* begin module sites.makesequ *) procedure makesequ(var sequ, makebkp: text; coords: coordinates; data: entries; numentries: integer; std: entryptr); (* standard canonical sequence *) (* make the raw sequence file in sequ, consisting of just the sequence followed by a period. Make instructions to feed makebk in makebkp *) const namemax = 20; (* longest length name allowed to be output *) var counter: integer; (* count of characters put out *) thenumber: integer; (* the number of one datum *) l: integer; (* index to the string *) x: boolean; (* an x was found in the sequence *) begin rewrite(sequ); rewrite(makebkp); writeln(makebkp,'sites ',version:4:2); writeln(makebkp,'b'); (* book created *) putstring(makebkp,std^.name); (* title *) writeln(makebkp); writeln(makebkp,'y'); (* pieces numbered *) counter := 0; putmaxstring(makebkp,std^.name,counter,namemax); writeln(makebkp,' organism key name'); putstring(makebkp,std^.name); writeln(makebkp,' organism full name'); writeln(makebkp,'from sites ',version:4:2); writeln(makebkp); (* no notes *) writeln(makebkp,'map units'); counter := 0; putmaxstring(makebkp,std^.name,counter,namemax); writeln(makebkp,' chromosome key name'); putstring(makebkp, std^.name); writeln(makebkp,' chromosome full name'); writeln(makebkp); (* no notes *) writeln(makebkp,'0'); (* chromosome map beginning *) writeln(makebkp,'100'); (* chromosome map ending *) for thenumber := 1 to numentries do begin { writeln(sequ,'debug: display, thenumber=',thenumber:1); } {zzz} with data[thenumber]^ do begin (* putstring(sequ,seq); would just put the string out. But we want to convert x's to the standard: *) x := false; for l := 1 to seq.length do with seq do if letters[l] = 'x' then begin write(sequ,std^.seq.letters[l]); x := true; writeln(output); writeln(output,'NOTE: x bases replaced by standard in sequ.'); writeln(output,'See notes in book created using makebk 1 then begin page(list); writeln(latex,'\newpage'); end; display(list,data,numentries,header,experiment, coords, std,'s',' ',false,duplicates); fancytable(list, experiment, nl, nbl, std, false); writeln(latex,'\twocolumn'); writeln(latex,'\scriptsize'); writeln(latex,'\begin{verbatim}'); figure(latex,data,numentries,{header,}experiment, coords, std,'s',' ',true); writeln(latex,'\end{verbatim}'); writeln(latex,'\normalsize'); fancytable(latex, experiment, nl, nbl, std, true); dotables(tables,header,experiment,std,nbl,fi,fo,rho,fpi,fpo); (* by putting a 'z' in the standard positions, unchanged positions are considered to be further down the alphabet than changed ones. *) (*if not onname then *) crunch(data,numentries,std,'z'); sortdata(data,numentries,false); (* sorted on sequence *) (* leave the data set as we found it, in case other operations are desired (no side effect) *) (*if not onname then *) uncrunch(data,numentries,std,'z'); display(sorted, data, numentries, header, experiment, coords, std, 's', ' ', true, duplicates); writeln(output,'number of duplicate sequences: ', duplicates:1); showstatistics(stats, coords, experiment,header, std, wildtype, nbl, duplicates); dostats(stats, experiment, std, wildtype, nl, nbl, fi,fo,fpi); end end; writeln(output,' Number of entries in ALL experiments: ', totalentries:5); (* convert global into a frequency table again. Note: this could have been done by dividing by 'experiment', but if the sum was not exactly 1 because of roundoff, error would propagate. By finding out what the sum is for every position, as is done in makefreq, the error should be minimized. *) makefreq(global); writeln(tables); page(tables); writeln(tables); pageheader(tables,header,experiment,std); writeln(tables, 'Average of normalized frequency tables', ' from all experiments:'); showrblarray(tables, global); calcinfo(fpi, global, ipe); (* calculate normalized information table for experiment *) makersdata(rsdata,database,standard, global,ipe,nl); (* logo of experimental results *) (* note: bug: In this call, instead of nl we should use the global nl, or the appropriate (unknown) function for the effective numbers of bases at each position in the experiments. It is not known how to find that however. *) (* finish the typesetting *) writeln(latex,'\end{document}'); (* make raw sequences *) makesequ(sequ,makebkp,coords,data,numentries,std); end; (* end module sites.themain *) begin themain(database, standard, caps, latex, list, sorted, stats, tables, rsdata, sequ, makebkp); 1: end.