program dalvec(rsdata, dalvecp, symvec, output); (* dalvec: converts Rseq rsdata file to symvec format by Thomas Schneider module libraries required: delman, prgmods *) label 1; (* end of program *) const (* begin module version *) version = 2.20; (* of dalvec.p 1995 June 24 origin 1990 May 7 from rsgra *) (* end module version *) (* begin module describe.dalvec *) (* name dalvec: converts Rseq rsdata file to symvec format synopsis dalvec(rsdata: in, dalvecp: in, symvec: out, output: out) files rsdata: data file from rseq program dalvecp: parameters to control dalvec If empty, then the normal sequence logo will be produced. If the first character of the first line is a 'c', then a chi-logo is produced. The height of this logo is the information. The heights of the individual letters are, however, not the frequencies, but rather their partial chi-square values. The expected value is 1/4 of the number of characters. This is compared to the observed value by: partial chi-square =(observed - expected)^2/expected These partial values are normalized and placed in symvec in place of the relative frequencies. Thus the significance of each letter is used. When the observed is less than expected, the reported value is made negative. Makelogo prints these characters upside down. symvec: reformating of the rsdata file for input to the makelogo program. A series of header lines begining with asterisk ("*") are produced. The next line contains one integer which is the number of symbols in the vector (4 for DNA or RNA, 20 for proteins). After this, the format of the file is a series of entries. Each entry has two parts. The first part is on one line and contains position total information position: the position number total: the sum of the values in the vector information: the information content of the vector. The remaining parameters on the line are from the rsdata file: rs: sum of rsl varhnb: variance of rsl sumvar: sum of varhnb ehnb: 2-e(n) The second part consists of a list of a series of symbol lines. The number of these matches the numer of symbols (4 in the case of DNA), representing the the numbers of bases or amino acids at the position in an aligned set of sequences. Each line begins with the character of the symbol, followed by the number of that symbols. output: messages to the user description Convert the rsdata file from rseq into a format that the makelogo program can use. The format is a 'symbol vector'. ChiLogos: If you leave the parameter file empty, then the standard sequence logo will be created. However, if the first letter of the file is a 'c', then a new kind of logo will emerge from makelogo: the chi-logo. The height is as it was before. The height of the individual letters is different, instead of being proportional to the frequency of the letter, it is proportional to the significance of the letter, based on the chi-square test. That is, the expected number of letters is the number of letters at that position, n(l) divided by 4 (for simplicity!). The observed number comes from the rsdata file. The partial-chi square is (observed-expected)^2/expected. Note that the sum of the partials is the normal chi-square. So bases that contribute strongly get big. Also, bases that are under represented are printed UPSIDE DOWN, so you can (usually) tell you have a chilogo at a glance. The chilogo allows one to see the importance of the infrequent letters. The technical mechanism for making a letter upside down is to have its number negative in the symvec file. examples see also rseq.p, makelogo.p author Thomas D. Schneider bugs The program originally only created a vector that contained the characters of the alphabet, so the output was called an 'alvec'. To reflect the use of symbols, the name of the output file was changed to symvec, but I like 'dalvec', and 'dsymvec' is so awkward that I decided to keep the name dalvec. *) (* end module describe.dalvec *) (* begin module dalvec.const *) infofield = 8; (* size of field for printing information in bits *) infodecim = 5; (* number of decimal places for printing information *) nfield = 4; (* size of field for printing n, the number of sites *) mapmax = 26; (* the number of bases or amino acids to be sorted *) pwid = 8; (* width in character places to print PostScript numbers *) pdec = 5; (* decimal places to print PostScript numbers *) pnum = 100000; (* 10^pdec for testing when something will round to zero *) cxmove = 30; (* units to move a character horizontally *) (* end module dalvec.const *) (* begin module interact.const *) maxstring = 150; (* the maximum string *) (* end module interact.const version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module dalvec.filler.const *) fillermax = 21; (* the size of the filler array for a string *) (* end module dalvec.filler.const *) type (* begin module rs.type *) rstype = record (* types of data in the data table from rseq *) rstart, rstop: integer; (* range of the data *) l, (* position *) nal,ncl,ngl,ntl: integer; (* numbers of each base *) rsl, (* rsequence(l) *) rs, (* running sum of rs *) varhnb, (* variance estimate for small sample correction *) sumvar: (* sum of varhnb *) real; nl: integer; (* = a+c+g+t *) ehnb: real; (* hg-e(n) *) flag: char; (* e = exact, a = approximate sampling statistics *) end; (* end module rs.type version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module interact.type *) string = record (* a string of characters *) letters: array[1..maxstring] of char; (* the letters in the string *) length: integer; (* the number of characters in the string *) current: integer; (* the letter we are working on *) end; (* end module interact.type version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module trigger.type *) trigger = record (* an object to be searched for *) seek: string; (* the characters looked for *) state: integer; (* how close to triggering we are *) skip: boolean; (* trigger not found- skip the line *) found: boolean (* the trigger was found *) end; (* end module trigger.type version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module filler.type *) (* the following is an array used to fill a string. it is convenient to have it much shorter than the maxstring, so that it is easy to fill the string using procedure fillstring. the user must declare the value of constant fillermax. *) filler = packed array[1..fillermax] of char; (* end module filler.type version = 4.75; (@ of rsgra.p 1990 Oct 2 *) var (* begin module dalvec.var *) rsdata, (* output of rseq program *) dalvecp, (* input of this program *) symvec: (* output of this program *) text; (* end module dalvec.var *) (* begin module halt *) procedure halt; (* stop the program. the procedure performs a goto to the end of the program. you must have a label: label 1; declared, and also the end of the program must have this label: 1: end. examples are in the module libraries. this is the only goto in the delila system. *) begin writeln(output,' program halt.'); goto 1 end; (* end module halt version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module interact.clearstring *) procedure clearstring(var ribbon: string); (* empty the string *) var index: integer; (* to the ribbon *) begin (* clearstring *) with ribbon do begin for index := 1 to maxstring do letters[index] := ' '; length := 0; current := 0; end end; (* clearstring *) (* end module interact.clearstring version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module interact.writestring *) procedure writestring(var tofile: text; var s: string); (* write the string s to file tofile, no writeln *) var i: integer; (* index to s *) begin (* writestring *) with s do for i := 1 to length do write(tofile, letters[i]) end; (* writestring *) (* end module interact.writestring version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module trigger.proc *) (* this module allows one to scan a series of characters, as from an array or a file, and to "trigger" or detect a simple string in the series. the advantage of the trigger is that several triggers can "observe" a stream of characters at once, each looking for a different thing. some other modules required: interact.const, interact.type *) procedure resettrigger(var t: trigger); (* reset the trigger to ground state *) begin (* resettrigger *) with t do begin state := 0; skip := false; found := false end end; (* resettrigger *) procedure testfortrigger(ch: char; var t: trigger); (* look at the character ch. if it is part of the trigger (at the current trigger state), then the trigger state goes higher. if it is not part of the trigger then the trigger state is reset, skip is true and one should skip onward to find the trigger. if the trigger is found, found is true. *) begin (* testfortrigger *) with t do begin state := succ(state); (* if debugging then begin writestring(list,seek); writeln(list,'testfortrigger seek.letters[',state:1,']:', seek.letters[state],' ch:',ch); end;*) if seek.letters[state] = ch then begin skip := false; if state = seek.length then found := true else found := false end else begin (* reset trigger *) state := 0; skip := true; found := false end end end; (* testfortrigger *) (* end module trigger.proc version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module filler.fillstring *) procedure fillstring(var s: string; a: filler); (* this procedure makes it reasonably easy to fill the string s with characters. one calls the procedure as: *) (* 1 2 3 4 5 *) (* 12345678901234567890123456789012345678901234567890 *) (* fillstring(s, 'this-is-the-string '); the two comments make it easy to line the characters up. also, for this example, it was assumed that the length of filler as defined by the constant fillermax was 50. *) var length: integer; (* of the string without trailing blanks *) index: integer; (* of s *) begin (* fillstring *) clearstring(s); length := fillermax; while (length > 1) and (a[length] = ' ') do length := pred(length); if (length = 1) and (a[length] = ' ') then begin writeln(output, 'fillstring: the string is empty'); halt end; for index := 1 to length do s.letters[index] := a[index]; s.length := length; s.current := 1 end; (* fillstring *) (* end module filler.fillstring version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module filler.filltrigger *) procedure filltrigger(var t: trigger; a: filler); (* fill the trigger t *) begin (* filltrigger *) fillstring(t.seek,a) end; (* fillstring *) (* end module filler.filltrigger version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module copyaline *) procedure copyaline(var fin, fout: text); (* copy a line from file fin to file fout *) begin (* copyaline *) while not eoln(fin) do begin fout^ := fin^; put(fout); get(fin) end; readln(fin); writeln(fout); end; (* copyaline *) (* end module copyaline version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* ********************************************************************** *) (* begin module dalvec.header *) procedure header(var infile: text; var outfile: text); (* do the header of the plot. Simply copy the output. Note: this depends on the fact that the input file already has the '*' marks *) var index: integer; (* to lines in the infile *) begin reset(infile); rewrite(outfile); writeln(outfile,'* dalvec ',version:4:2); (* copy header lines to outfile *) for index := 1 to 3 do copyaline(infile,outfile); writeln(outfile,'*'); writeln(outfile,'* position,', ' number of sequences,', ' information Rs,', ' variance of Rs'); end; (* end module dalvec.header *) (* ********************************************************************** *) (* ********************************************************************** *) (* begin module rs.readrsrange *) procedure readrsrange(var rsdata: text; var r: rstype); (* read the range data from rsdata to r. data is assumed to be the rsdata file from program rseq. *) var index: integer; (* for counting lines of rsdata *) skip: char; (* a character to skip the '*' on the line *) begin for index := 1 to 11 do readln(rsdata); readln(rsdata, skip, r.rstart, r.rstop); (* writeln(output, 'range: ',r.rstart:1,' ',r.rstop:1);*) end; (* end module rs.readrsrange version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module rs.getrsbegin *) procedure getrsbegin(var infile: text); (* skip to the beginning of the data in a data file from the rseq program *) var ch: char; (* a character read from infile *) begindata: trigger; (* a trigger to locate the beginning of the data *) begin (* 1 2 *) (* 123456789012345678901 *) filltrigger(begindata ,'l a c g t'); resettrigger(begindata); reset(infile); while not begindata.found do begin if eoln(infile) then readln(infile); if eof(infile) then begin writeln(output,'beginning of data not found'); halt; end; read(infile,ch); testfortrigger(ch,begindata); end; readln(infile); end; (* end module rs.getrsbegin version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* begin module rs.readrsdata *) procedure readrsdata(var rsdata: text; var rdata: rstype); (* read data from the data file of program rseq into the datatype *) begin with rdata do begin read(rsdata,l,nal,ncl,ngl,ntl,rsl,rs,varhnb,sumvar,nl,ehnb); (* skip spaces to find the flag: *) while rsdata^=' ' do get(rsdata); readln(rsdata,flag); (* writeln(output,'readrsdata: l a c g t flag = ', l:1,' ',nal:1,' ',ncl:1,' ',ngl:1,' ',ntl:1,' ',flag); *) end end; (* end module rs.readrsdata version = 4.75; (@ of rsgra.p 1990 Oct 2 *) (* ********************************************************************** *) (* ********************************************************************** *) (* begin module dalvec.sign *) function sign(a: real): real; (* return the sign of the number a *) begin if a >= 0 then sign := +1.0 else sign := -1.0 end; (* end module dalvec.sign *) (* begin module dalvec.themain *) procedure themain(var rsdata, dalvecp, symvec: text); (* the main procedure of the program *) const bignumber = pnum; (* number of digits that the chilogo is accurate to *) var chilogo: boolean; (* if true, do the chi logo *) chia, chic, chig, chit: real; (* partial chi square values *) chitotal: real; (* sum of the partial chi squares *) nl4: real; (* number of bases at l (nl) divided by 4 *) rdata: rstype; (* data from rseq *) parameter: char; (* parameter read from dalvecp *) position: integer; (* a location in the aligned sequence *) begin writeln(output,'dalvec ',version:4:2); reset(rsdata); rewrite(symvec); reset(dalvecp); chilogo := false; if not eof(dalvecp) then begin if not eoln(dalvecp) then readln(dalvecp,parameter); if parameter = 'c' then chilogo := true; end; if chilogo then writeln(output,'Chilogo will be produced') else writeln(output,'Normal sequence logo will be produced'); if eof(rsdata) then begin writeln(output,'empty rsdata file'); halt end else begin (* copy header stuff *) header(rsdata,symvec); writeln(symvec,'4 number of symbols in DNA or RNA'); (* find the range of the graph in bases *) reset(rsdata); readrsrange(rsdata,rdata); getrsbegin(rsdata); (* prepare for reading the rsdata for graphing *) getrsbegin(rsdata); (* create the symvec *) for position := rdata.rstart to rdata.rstop do begin (* skip lines with an '*' *) if rsdata^ <> '*' then begin readrsdata(rsdata,rdata); (* move to the next rsdata line and give the data *) with rdata do begin if chilogo then begin (* partial chi is (observed - expected)^2/expected. expected = nl/4. There is no need to divide by expected since it is renormalized afterwards anyway. *) nl4 := nl/4; chia := sqr(nal - nl4); chic := sqr(ncl - nl4); chig := sqr(ngl - nl4); chit := sqr(ntl - nl4); chitotal := chia+chic+chig+chit; if chitotal > 0 then begin nal := round(bignumber * (chia/chitotal)*sign(nal-nl4)); ncl := round(bignumber * (chic/chitotal)*sign(ncl-nl4)); ngl := round(bignumber * (chig/chitotal)*sign(ngl-nl4)); ntl := round(bignumber * (chit/chitotal)*sign(ntl-nl4)); nl := bignumber end else begin nal := 1; ncl := 1; ngl := 1; ntl := 1; nl := 4 end end; writeln(symvec,l:nfield, (* position *) ' ',nl:infofield, (* number sequences *) ' ',rsl:infofield:infodecim, (* information *) { ( ' ',varhnb:infofield:infodecim); (* variance of rsl *) } (* variance of rsl in scientific notation: *) ' ',varhnb:(infofield+3)); writeln(symvec,'a ',nal:nfield); writeln(symvec,'c ',ncl:nfield); writeln(symvec,'g ',ngl:nfield); writeln(symvec,'t ',ntl:nfield); end; end; end; end end; (* end module dalvec.themain *) begin themain(rsdata, dalvecp, symvec); 1: end.