program index (book, ind, indexp, output); (* index: make an alphabetized list of the sites in a book. by Gary Stormo and Thomas Schneider from modules: delman, delmods, prgmods Dr. Thomas D. Schneider National Cancer Institute Laboratory of Experimental and Computational Biology Frederick, Maryland 21702-1201 toms@ncifcrf.gov permanent email: toms@alum.mit.edu (use only if first address fails) http://www.lecb.ncifcrf.gov/~toms/ *) label 1; (* end of program *) const (* begin module version *) version = 9.31; (* of index.p 2003 Aug 11 2003 Aug 11, 9.29: indana requires complete columns! solution: columnfiller 2002 Oct 8, 9.28: make gpc compatible 1994 Nov 13, 9.27: previous changes origin before 1983 january 19 *) (* end module version *) (* begin module describe.index *) (* name index: make an alphabetic list of oligonucleotides in a book synopsis index(book: in, ind: out, indexp: in; output: out) files book: the book of sequences to be indexed ind: the alphabetized index to the book indexp: parameters to control index. if this file is empty, then default values are used. otherwise there may be 4 or 5 lines: first line: the number of bases in the alphabetizing window second line: the number of bases to print before the central window third line: the number of bases to print in the central window fourth line: the number of bases to print after the central window fifth line: if the first letter is a 't', then the index will run in a teaching mode. do not use this mode on large books. sixth line: if the first letter is 'f' then only the first oligo of each sequence is used for alphabitization. This produces a drastic reduction in the number of oligos sorted. It is meant to be used to sort aligned sequences, to see if there are identical copies. output: messages to the user description The index program generates an index of oligonucleotide fragments in a book. The first base of the alphabetizing window is stepped across all bases of the sequence, creating a list of overlapping oligos and their positions. The oligos are then sorted along with their positions. Three printing windows allow one to look at bases before the first base, from the first base some distance on (this is not the alphabetizing window) and a third set even further 3'. It is not inefficient to make the alphabetizing window large when there are no long repeats in the sequences (as when comparing two similar genes). Following the printing windows are: the sequence number of the piece in the book (provided by delila); the position of the first base; the orientation of the oligo; and the similarity. This last item is the number of bases that an oligo matches the previous oligo in the index, up to the point that they differ. High similarity means a repeat. examples The index can be used to locate restriction enzyme sites, by simply 'looking them up'. It has the advantage that when new enzymes become available, one does not need the computer to locate their sites. Direct repeats will show up as high similarity oligos, and if one gets the complement along with a sequence in a book (using delila) then inverted repeats can be found. The first column of the alphabetizing window contains all the mononucleotides; the first two, the di's, etc. documentation L. J. Korn, C. L. Queen and M. N. Wegman, PNAS 74: 4401-4405 (1977) see also search.p, helix.p, delila.p, delman.use.comparison author Gary Stormo and Thomas Schneider bugs One cannot sort more sequence than can fit into the computer memory. technical notes The constant mapmax determines the maximum number of bases indexed. *) (* end module describe.index *) (* more constants *) mapmax = 600000; (* largest number of bases that can be handled by the the index program. it determines the size of the map array. *) columnfiller = '-'; (* the character to put in columns to make sure that SOMETHING is in the column *) columnseparation = ' '; (* separation between columns *) pienumwid = 3; (* piece number width *) posnumwid = 8; (* position number width *) (* begin module book.const *) (* constants needed for book manipulations *) dnamax = 1024; (* length of dna arrays *) namelength = 100; (* maximum key name length *) linelength = 80; (* maximum line readable in book *) (* end module book.const version = 7.59; {of delmod.p 2002 Sep 5} *) type (* begin module book.type *) (* types needed for book manipulations *) chset = set of 'a'..'z'; (* types defined in book definition *) alpha = packed array[1..namelength] of char; (* this is not alfa *) (* name is a left justified string with blanks following the characters *) name = record letters: alpha; length: 0..namelength (* zero means an unspecified structure *) end; lineptr = ^line; line = record (* a line of characters *) letters: packed array [1..linelength] of char; length: 0..linelength; next: lineptr end; direction = (plus, minus, dircomplement, dirhomologous); configuration = (linear, circular); state = (on, off); header = record (* header of key *) keynam: name; (* key name of structure *) fulnam: lineptr; (* full name of structure *) note: lineptr (* note key *) end; (* begin module base.type *) (* define the four nucleotide bases *) base = (a,c,g,t); (* end module base.type version = 7.59; {of delmod.p 2002 Sep 5} *) (* sequence types *) dnaptr = ^dnastring; dnarange = 0..dnamax; seq = packed array[1..dnamax] of base; dnastring = record part: seq; length: dnarange; next: dnaptr end; orgkey = record (* organism key *) hea: header; mapunit: lineptr (* genetic map units *) end; chrkey = record (* chromosome key *) hea: header; mapbeg: real; (* number of genetic map beginning *) mapend: real (* number of genetic map ending *) end; pieceptr = ^piece; piekey = record (* piece key *) hea: header; mapbeg: real; (* genetic map beginning *) coocon: configuration; (* configruation (circular/linear) *) coodir: direction; (* direction (+/-) relative to genetic map *) coobeg: integer; (* beginning nucleotide *) cooend: integer; (* ending nucleotide *) piecon: configuration; (* configruation (circular/linear) *) piedir: direction; (* direction (+/-) relative to coordinates *) piebeg: integer; (* beginning nucleotide *) pieend: integer; (* ending nucleotide *) end; piece = record key: piekey; dna: dnaptr end; reference = record pienam : name; (* name of piece referred to *) mapbeg : real; (* genetic map beginning *) refdir : direction; (* direction relative to coordinates *) refbeg : integer; (* beginning nucleotide *) refend : integer; (* ending nucleotide *) end; genkey = record (* gene key *) hea : header; ref : reference; end; trakey = record (* transcript key *) hea : header; ref : reference; end; markerptr = ^marker; markey = record (* marker key *) hea : header; ref : reference; sta : state; phenotype : lineptr; next : markerptr; end; marker = record key : markey; dna : dnaptr; end; (* end module book.type version = 7.59; {of delmod.p 2002 Sep 5} *) position = 0..mapmax; (* somewhere on the map. note: position 0 is not used, but it allows the quicksort to function properly *) pdataptr = ^pdata; pdata = record (* information about a piece *) p: pieceptr; (* the piece *) n: integer (* the number of the piece *) end; place = packed record (* place of a base in the pieces *) pd: pdataptr; (* which piece *) dnalink: dnaptr; (* what dna link *) dnaloc: dnarange (* location in dnalink array *) end; var book, (* the book of sequences *) ind, (* the alphabetized index *) indexp: text; (* parameters file *) var theline: integer; (* current line of the book *) (* the parameters *) window: integer; (* the window size to alphabetize *) pbefore, pwindow, pafter: integer; (* the regions before, on and after the window start point to print *) teaching: boolean; (* are we teaching (or working) *) firstoligo: boolean; (* if true, only work with the first oligo *) numberofoligos: integer; (* in book *) numberofsequences: integer; (* in book *) numberoflinears: integer; (* in book *) (* the map is an ordered mapping of the bases in the book. all the pieces hang off the map, and the map points to all the bases in the pieces *) (* GPC gives this warning: "ISO Pascal forbids this use of packed array components" so this cannot be a packed array: *) map: array[1..mapmax] of place; (* the pointer situation looks something like this: map * * * * * * * * * (a bunch of place) : :.......... : : : v : : piece : : key v : *--->dna v ---->dna ---->dna part : part : part length : length : length *------: *------: * where * is a pointer. each map place points to one piece in three ways: 1. by pieceptr 2. by the dna part 3. by the base within the dna part the map is set up in readsequences, and the map is then sorted. in other words, we lied in the description of the program. sorting pointers into the sequence is equivalent to sorting a list of oligos. the latter is easier to describe, while the former is far more efficient in storage, and it allows any alphabetizing window width... *) (* begin module book.var *) (* ************************************************************************ *) (* global variables needed for book manipulations *) (* free storage: *) freeline: lineptr; (* unused lines *) freedna: dnaptr; (* unused dnas *) readnumber: boolean; (* whether to read a number from the notes, or to read in the notes *) number: integer; (* the number of the item just read *) numbered: boolean; (* true when the item just read is numbered *) skipunnum: boolean; (* a control variable to allow skipping of un-numbered items in the book *) (* ************************************************************************ *) (* end module book.var version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module package.primitive *) (* ************************************************************************ *) (* begin module halt *) procedure halt; (* stop the program. the procedure performs a goto to the end of the program. you must have a label: label 1; declared, and also the end of the program must have this label: 1: end. examples are in the module libraries. this is the only goto in the delila system. *) begin writeln(output,' program halt.'); goto 1 end; (* end module halt version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module copyaline *) procedure copyaline(var fin, fout: text); (* copy a line from file fin to file fout *) begin (* copyaline *) while not eoln(fin) do begin fout^ := fin^; put(fout); get(fin) end; readln(fin); writeln(fout); end; (* copyaline *) (* end module copyaline version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module copylines *) function copylines(var fin, fout: text; n: integer): integer; (* copy n lines of file fin to file fout. the actual number of lines copied is returned. *) var index: integer; (* the current line number *) begin (* copylines *) index := 0; while (not eof(fin)) and (index < n) do begin copyaline(fin, fout); index := succ(index) end; copylines := index end; (* copylines *) (* end module copylines version = 7.59; {of delmod.p 2002 Sep 5} *) (* ************************************************************************ *) (* end module package.primitive version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module package.getpiece *) (* ************************************************************************ *) (* begin module package.brpiece *) (* ************************************************************************ *) (* begin module book.basis *) (* procedures needed for book manipulations *) (* get procedures should be used for all linked lists of records *) procedure getline(var l: lineptr); (* obtain a line from the free line list or by making a new one *) begin if freeline<>nil then begin l:=freeline; freeline:=freeline^.next end else new(l); l^.length:=0; l^.next:=nil end; procedure getdna(var l: dnaptr); begin if freedna<>nil then begin l:=freedna; freedna:=freedna^.next end else new(l); l^.length:=0; l^.next:=nil end; (* clear procedures should be called each time the records are no longer needed failure to do this may result in a stack overflow. *) procedure clearline(var l: lineptr); (* return a line to the free line list *) var lptr: lineptr; begin if l<>nil then begin lptr:=l; l:=l^.next; lptr^.next:=freeline; freeline:=lptr end end; procedure writeline(var afile: text; l: lineptr; carriagereturn: boolean); (* write a line to a file, with carriage return if carriagereturn is true. *) var index: integer; (* index to characters in l *) begin with l^ do begin for index := 1 to length do write(afile, letters[index]); end; if carriagereturn then writeln(afile); end; procedure showfreedna; (* show the freedna list *) var counter: integer; (* count of freedna list *) l: dnaptr; (* pointer into freedna list *) begin l := freedna; counter := 0; while l <> nil do begin counter := succ(counter); write(output,counter:1); write(output, ', length = ',l^.length:1); { This is illegal according to gpc because one cannot write a pointer to a text file. It can be unearthed for debugging. write(output, ', pointer id: ',l:1); } writeln(output); l := l^.next end; end; procedure cleardna(var l: dnaptr); var lptr: dnaptr; begin if l<>nil then begin lptr:=l; l:=l^.next; lptr^.next:=freedna; freedna:=lptr end end; procedure clearheader(var h: header); (* clear the header h (remove lines to free storage) *) begin with h do begin clearline(fulnam); while note<>nil do clearline(note) end end; procedure clearpiece(var p: pieceptr); (* clear the dna of the piece *) begin while p^.dna<>nil do cleardna(p^.dna); clearheader(p^.key.hea) end; function chartobase(ch:char):base; (* convert a character into a base *) begin case ch of 'a': chartobase:=a; 'c': chartobase:=c; 'g': chartobase:=g; 't': chartobase:=t end end; function basetochar(ba:base):char; (* convert a base into a character *) begin case ba of a: basetochar:='a'; c: basetochar:='c'; g: basetochar:='g'; t: basetochar:='t'; end end; function complement(ba:base):base; (* take the complement of ba *) begin case ba of a: complement:=t; c: complement:=g; g: complement:=c; t: complement:=a; end end; function chomplement(b: char): char; (* create the character complement of base b. I must be getting hungry! *) begin chomplement := basetochar(complement(chartobase(b))); end; function pietoint(p: integer; pie: pieceptr): integer; (* p is a coordinate on the piece. we want to transform p into a number from 1 to n: an internal coordinate system for easy manipulation of piece coordinates *) (* Note: the dirhomologous and dircomplement are treated as plus and minus directions, which MIGHT NOT BE RIGHT! *) var i: integer; (* an intermediate value *) begin with pie^.key do begin case piedir of dirhomologous, plus: if p>=piebeg then i:=p-piebeg+1 else i:=(p-coobeg)+(cooend-piebeg)+2; dircomplement, minus: if p<=piebeg then i:=piebeg-p+1 else i:=(cooend-p)+(piebeg-coobeg)+2 end; pietoint:=i end end; function inttopie(i: integer; pie: pieceptr):integer; (* i is in the range 1 to some maximum. it is an internal coordinate system for the program. we want to do a coordinate transformation to obtain a value in the range of the piece called pie: i=1 corresponds to piebeg and i=its maximum corresponds to pieend *) (* Note: the dirhomologous and dircomplement are treated as plus and minus directions, which MIGHT NOT BE RIGHT! *) var p: integer; (* an intermediate value *) begin with pie^.key do begin case piedir of dirhomologous, plus: begin p:=piebeg+(i-1); if p>cooend then if coocon=circular then p:=p-(cooend-coobeg+1) end; dircomplement, minus: begin p:=piebeg-(i-1); if p '*' then begin writeln(output,' procedure skipstar: bad book'); writeln(output,' "*" expected as first character on the line, but "', thefile^,'" was found'); halt end; get(thefile); (* skip the star *) if thefile^ <> ' ' then begin writeln(output,' procedure skipstar: bad book'); writeln(output,' "* " expected on a line but "*', thefile^,'" was found'); halt end; get(thefile) (* skip the blank *) end end; (* skipstar *) (* end module book.skipstar version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module book.brreanum *) procedure brreanum(var thefile: text; var theline: integer; var reanum: real); (* read a real number from the file *) begin skipstar(thefile); readln(thefile,reanum); theline := succ(theline) end; (* end module book.brreanum version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module book.brnumber *) procedure brnumber(var thefile: text; var theline: integer; var num: integer); (* read a number from the file *) begin skipstar(thefile); readln(thefile,num); theline := succ(theline) end; (* end module book.brnumber version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module book.brname *) procedure brname(var thefile: text; var theline: integer; var nam: name); (* read a name from the file *) var i: integer; (* an index to the name *) c: char; (* a character read *) begin (* brname *) skipstar(thefile); with nam do begin length:=0; repeat length:=succ(length); read(thefile,c); letters[length] := c until (eoln(thefile)) or (length>=namelength) or (letters[length]=' '); if letters[length]=' ' then length:=length-1; if length 'n' then begin skipstar(thefile); if not eoln(thefile) then begin if thefile^ = '#' then begin numbered := true; get(thefile); (* move past the number symbol *) read(thefile,number); end end; repeat readln(thefile); theline := succ(theline) until thefile^ = 'n'; readln(thefile); theline := succ(theline) end else begin readln(thefile); theline := succ(theline) end end end; (* brnotenumber *) (* end module book.brnotenumber version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module book.brnote *) procedure brnote(var thefile: text; var theline: integer; var note: lineptr); (* read note key *) var newnote: lineptr; (* the new note *) previousnote: lineptr; (* the last line of the notes *) begin (* brnote *) note:=nil; if thefile^ = 'n' then begin (* enter note *) readln(thefile); theline := succ(theline); if thefile^ <> 'n' then begin (* abort null note (n/n) *) getline(note); newnote:=note; while thefile^ <> 'n' do begin (* wait until end of note *) brline(thefile,theline,newnote); previousnote:=newnote; (* get next note *) getline(newnote^.next); newnote:=newnote^.next; end; (* last note was not used, so: *) clearline(newnote); previousnote^.next:=nil; readln(thefile); theline := succ(theline); end else begin readln(thefile); theline := succ(theline); end; end end; (* brnote *) (* end module book.brnote version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module book.brheader *) procedure brheader(var thefile: text; var theline: integer; var hea: header); (* read the header of a key. *) begin with hea do begin readln(thefile); (* move past the object name - new definition 1999 Mar 13 *) theline := succ(theline); {bbb} (* read key name *) brname(thefile,theline,keynam); (* read full name *) getline(fulnam); brline(thefile,theline,fulnam); (* read note key *) if readnumber then brnotenumber(thefile,theline,note) else brnote(thefile,theline,note) end end; (* end module book.brheader version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module book.copyheader *) procedure copyheader(fromhea: header; var tohea: header); (* copy the header fromhea into tohea. Note that the linked objects are NOT copied, but merely pointed to. *) begin tohea.keynam.letters := fromhea.keynam.letters; tohea.keynam.length := fromhea.keynam.length; tohea.note := fromhea.note; tohea.fulnam := fromhea.fulnam; end; (* end module book.copyheader version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module book.brpiekey *) procedure brpiekey(var thefile: text; var theline: integer; var pie: piekey); (* read piece key, track the line number *) begin with pie do begin brheader(thefile,theline,hea); brreanum(thefile,theline,mapbeg); brconfig(thefile,theline,coocon); brdirect(thefile,theline,coodir); brnumber(thefile,theline,coobeg); brnumber(thefile,theline,cooend); brconfig(thefile,theline,piecon); brdirect(thefile,theline,piedir); brnumber(thefile,theline,piebeg); brnumber(thefile,theline,pieend); end end; (* end module book.brpiekey version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module book.brdna *) procedure brdna(var thefile: text; var theline: integer; var dna: dnaptr); (* read in dna from thefile, track the line *) (* note: if the dna were circularized, by linking the last dnastring to the first, then the cleardna routine could not clear properly, and would loop forever... there is no reason to do that, since a simple mod function will allow one to access the circle. *) var ch: char; workdna: dnaptr; begin getdna(dna); workdna:=dna; ch:=getto(thefile,theline,['d']); readln(thefile); theline := succ(theline); read(thefile,ch); (* skipstar *) while (ch = '*') do begin read(thefile,ch); (* skip blank *) repeat read(thefile,ch); if ch in ['a','c','g','t'] then begin if workdna^.length=dnamax then begin getdna(workdna^.next); workdna:=workdna^.next end; workdna^.length:=succ(workdna^.length); workdna^.part[workdna^.length]:=chartobase(ch) end until eoln(thefile); readln(thefile); (* go to next line *) theline := succ(theline); read(thefile,ch); (* ch is either '*' or 'd' *) end; readln(thefile); (* read past the d *) theline := succ(theline); end; (* end module book.brdna version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module book.brpiece *) procedure brpiece(var thefile: text; var theline: integer; var pie: pieceptr); (* read in a piece, change theline to reflect the lines traversed *) begin { readln(thefile); (* move past the word 'piece' - new definition 1999 Mar 13 *) theline := succ(theline); (* BUG: was below! *) bbb} brpiekey(thefile,theline,pie^.key); if numbered or (not skipunnum) then brdna(thefile,theline,pie^.dna); readln(thefile); (* move past the word 'piece' - new definition 1999 Mar 13 *) theline := succ(theline); end; (* end module book.brpiece version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module book.brinit *) procedure brinit(var book: text; var theline: integer); (* check that the book is ok to read, and set up the global variables for br routines *) begin (* brinit *) (* halt if the book is bad (first word is 'halt') or the first character is not * *) reset(book); if not eof(book) then begin (* check for the date line *) if book^ <> '*' then begin if book^ <> 'h' then writeln(output, ' this is not the first line of a book:') else writeln(output, ' bad book:'); write(output, ' '); while not (eoln(book) or eof(book)) do begin write(output, book^); get(book) end; writeln(output); halt end end else begin writeln(output, ' book is empty'); halt end; (* initialize free storage *) freeline:=nil; freedna:=nil; readnumber:=true; (* usually we read in numbers for items *) number:=0; (* arbitrary value *) numbered:=false; (* the piece has no number (none yet read in) *) skipunnum:=false; theline := 1; end; (* brinit *) (* end module book.brinit version = 7.59; {of delmod.p 2002 Sep 5} *) (* ************************************************************************ *) (* end module package.brpiece version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module book.getpiece *) procedure getpiece(var thefile: text; var theline: integer; var pie: pieceptr); (* move to and read in the next piece in the book *) var ch: char; begin ch:=getto(thefile,theline,['p']); (* get to the next p(iece) in the book *) if ch<>' ' then begin brpiece(thefile,theline,pie); { 1999 june 2: removed this: ch:=getto(thefile,theline,['p']); (* read to end of p *) } { bbb - now done in brpiece readln(thefile); (* read past piece *) theline := succ(theline); } end else clearpiece(pie); end; (* end module book.getpiece version = 7.59; {of delmod.p 2002 Sep 5} *) (* ************************************************************************ *) (* end module package.getpiece version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module book.stepbase *) function stepbase(startdna: dnaptr; var dna: dnaptr; var d: dnarange): base; (* advance d by one base in dna and then return the base at the new d. (this means that one should initialize d to zero) if we go past the last base, we restart at startdna. note: d is not the number of the base... it is used as a record for stepbase. do not mess with it, and do not use it to find out what base you are on. use a separate counter. *) begin if (d=dnamax) or (d=dna^.length) then begin d:=1; dna:=dna^.next; if dna=nil then dna:=startdna end else d:=succ(d); stepbase:=dna^.part[d] end; (* end module book.stepbase version = 7.59; {of delmod.p 2002 Sep 5} *) (* begin module missparam *) procedure missparam(var param: text); (* look at param to see if the next parameter is missing this is useful when reading in a series of parameters. use it just before readln of each parameter.*) begin (* missparam *) if eof(param) then begin writeln(output,' missing parameter'); halt end end; (* missparam *) (* end module missparam version = 7.59; {of delmod.p 2002 Sep 5} *) procedure readindexp; (* read the index parameters *) begin reset(indexp); if not eof(indexp) then begin missparam(indexp); readln(indexp, window); missparam(indexp); readln(indexp, pbefore); missparam(indexp); readln(indexp, pwindow); missparam(indexp); readln(indexp, pafter); end else begin window := 10; pbefore := 0; pwindow := 10; pafter := 0 end; (* check the parameters *) if window <= 0 then begin writeln(output, 'alphabetizing window must be positive integer'); halt end; if (pbefore < 0) or (pwindow < 0) or (pafter < 0) then begin writeln(output,'printing windows must be zero or positive integers'); halt end; if pwindow <> window then begin writeln(output,' warning: printing window size (',pwindow:1,')'); writeln(output,' is not equal to alphabetizing window size (', window:1,').') end; if indexp^ = 't' then teaching:=true else teaching:=false; readln(indexp); if indexp^ = 'f' then firstoligo := true else firstoligo := false; end; procedure initialize; begin writeln(output,' index ', version:4:2); brinit(book, theline); readnumber:=true; (* read numbers from the book *) skipunnum:=true; (* we will skip unnumbered pieces *) rewrite(ind); readindexp end; procedure readsequences; (* read in all the sequences from the book. construct the map with all the pieces hanging off it. notice that the pieces can be found only by looking through the map... *) var pie: pieceptr; (* a piece *) pied: pdataptr; (* information about the piece *) dn: dnaptr; (* a dnaptr in pie *) lo: dnarange; (* a location in dn *) length: integer; (* the length of pie *) m: position; (* index to map *) b: base; (* dummy to keep stepbase happy *) procedure fillmap; (* fill the map at location m *) begin with map[m] do begin pd := pied; dnalink := dn; dnaloc := lo end end; begin (* readsequences *) (* set globals *) numberofoligos:=0; numberofsequences:=0; numberoflinears:=0; while not eof(book) do begin new(pie); getpiece(book, theline, pie); if not eof(book) and (numbered or (not skipunnum)) then begin numberofsequences:=succ(numberofsequences); (* remove excess baggage from the piece *) clearheader(pie^.key.hea); (* fill in piece data structure *) new(pied); with pied^ do begin p:=pie; n:=number; if p^.key.piecon = linear (* record topology *) then numberoflinears:=succ(numberoflinears) end; (* start the fill *) m:=numberofoligos + 1; dn:=pie^.dna; lo:=1; fillmap; if firstoligo then begin numberofoligos:=m; if numberofoligos > mapmax then begin write (output, ' more than ', mapmax:1, ' bases in the book'); writeln(output, ' can not be handled by index.'); halt end; end else begin length:=piecelength(pie); if numberofoligos + length > mapmax then begin write (output, ' more than ', mapmax:1, ' bases in the book'); writeln(output, ' can not be handled by index.'); halt end; (* do the rest of the fill *) for m:=numberofoligos + 2 to numberofoligos + length do begin b:=stepbase(pie^.dna, dn, lo); fillmap end; numberofoligos:=numberofoligos + length end; end else dispose(pie) end end; procedure writeheader; (* write the header of the index. this header is read by procedure rhindex in program siman (and others) *) begin writeln(ind, '* index ', version:4:2, ' of '); reset(book); copyaline(book, ind); writeln(ind,'*'); writeln(ind,'* ',window:3,' is the length of the alphabetized window'); writeln(ind,'* ',pbefore:3,' bases before the window start are printed'); writeln(ind,'* ',pwindow:3, ' bases are in the central printed window'); writeln(ind,'* ',pafter:3,' bases after the window are printed'); writeln(ind,'* ', numberofoligos:3, ' oligos are in the book '); write (ind,'* ', numberofsequences:3, ' sequence'); if numberofsequences <> 1 then write(ind, 's'); writeln(ind,' indexed:'); writeln(ind,'* ', numberoflinears:3, ' linear and ', (numberofsequences - numberoflinears):1, ' circular'); if teaching then begin writeln(ind,'* teaching mode is on.'); writeln(ind); writeln(ind,'* I will first pass a window across all the'); writeln(ind,'* sequences in the book, so that every oligo is'); writeln(ind,'* under the window once.'); writeln(ind); write (ind,'* seq # is the number of the sequence in the book'); writeln(ind, ' that this view comes from;'); write (ind,'* pos is the position of the 5'' base of the view'); writeln(ind, ' in the sequence;'); write (ind,'* dir is the direction of this view relative'); writeln(ind, ' to the piece coordinates;'); write (ind,'* similarity is the number of bases that match'); writeln(ind, ' between this view'); writeln(ind,'* and the previous view', ' (only up to the first mismatch).'); end; if firstoligo then writeln(ind,'* first oligo is the only one used in', ' sorting') else writeln(ind,'* all oligos are used for sorting'); writeln(ind,'*'); writeln(ind,'*',' ':pbefore,'sequence ', ' ':(pwindow+pafter-8), ' ':pienumwid,'#', ' ':posnumwid-8, 'position', ' dir similarity name') end; function lessthan(alow, blow: position): boolean; (* is the sequence at alow less than (before, alphabetically) the sequence at blow? *) var (* pointers to a base *) adna, bdna: dnaptr; aloc, bloc: dnarange; abase, bbase: base; (* the bases at the position *) aline, bline: boolean; (* whether a or b piece is linear *) astart, bstart: dnaptr; (* the first dna link *) w: integer; (* current window width *) done: boolean; (* are we done yet? *) begin (* determine: if the sequence at map position alow is linear; what it's starting dna link is; what base it points to *) with map[alow] do begin with pd^.p^ do begin aline:=(key.piecon=linear); astart:=dna; end; adna:=dnalink; aloc:=dnaloc end; abase:=adna^.part[aloc]; (* determine: if the sequence at map position blow is linear; what it's starting dna link is; what base it points to *) with map[blow] do begin with pd^.p^ do begin bline:=(key.piecon=linear); bstart:=dna; end; bdna:=dnalink; bloc:=dnaloc end; bbase:=bdna^.part[bloc]; w:=1; done:=false; (* compare the positions until they differ *) while not done do begin if ord(abase) = ord(bbase) then begin (* identical bases means go to the next place *) w:=succ(w); (* move to next window position *) if w <= window then begin (* still in the window *) (* look at next position *) abase:=stepbase(astart, adna, aloc); bbase:=stepbase(bstart, bdna, bloc); (* if we just stepped back to the beginning, and we are not circular ... *) if aloc = 1 then if adna = astart then if aline then begin lessthan:=true; done:=true end; if bloc = 1 then if bdna = bstart then if bline then begin lessthan:=false; done:=true end; (* note: in the case when both a and b end together, lessthan should be false. since b follows a in the above code, that will happen. the order of the two if statements is critical... *) end else begin (* we are past the window *) lessthan:=false; done:=true end end else begin lessthan:= ord(abase) < ord(bbase); done:=true end end end; procedure swap(a, b: position); (* switch positions a and b *) var hold: place; begin hold:=map[a]; map[a]:=map[b]; map[b]:=hold end; (* begin module quicksort *) procedure quicksort(left, right: position); (* quick sort a list between positions left and right, into ascending order. a position is simply a scalar of the form 0..max. the array to be sorted is dimensioned 1..max. (the difference in the ranges is important to the correct operation of the sort...) two external routines are used: function lessthan(a, b: position): boolean is a generalized test for value-at-a < value-at-b. procedure swap(a, b: position) switches the items at positions a and b. since these routines are external, the procedure is general. this procedure taken from the book 'algorithms + data structures = programs' by niklaus wirth, prentice-hall, inc., englewood cliffs, n.j.(1976), pp. 76-82 *) var lower, upper: position; (* the positions looked at currently *) center: position; (* the rough center of the region being sorted *) begin lower := left; center := (left + right) div 2; upper := right; repeat while lessthan(lower, center) do lower := succ(lower); while lessthan(center, upper) do upper := pred(upper); if lower <= upper then begin (* ho ho keep track of the center through the map: *) if lower = center then center:=upper else if upper = center then center:=lower; swap(lower, upper); lower := succ(lower); upper := pred(upper) end until lower > upper; if left < upper then quicksort(left, upper); if lower < right then quicksort(lower, right) end; (* end module quicksort version = 'prgmod 3.96 85 mar 18 tds'; *) procedure writeindex; (* write out the index *) var current: position; (* bases in the window *) (* previous (p) and current (c) values *) (* the piece where the window was and is (previous and current) *) ppie, cpie: pieceptr; pdn, cdn: dnaptr; (* dna *) plo, clo: dnarange; (* location *) ppiestart, cpiestart: dnaptr; (* piece dna start *) pcircle, ccircle: boolean; (* the circularity of each piece *) (* where we are printing around the window *) dn: dnaptr; lo: dnarange; (* the location of the window in internal coordinates *) windowposition: integer; steps: integer; (* the number of bases yet to be printed *) offend: boolean; (* true means we have hit the end of a linear piece: stop printing *) procedure setup; (* set up for printing a line *) var (* an internal coordinate location *) int: integer; bk: integer; (* index for counting blanks *) (* an internal coordinate used to locate the first base of the before region *) j: integer; begin (* find out where we are *) with map[current] do begin cdn:=dnalink; clo:=dnaloc; cpie:=pd^.p; cpiestart:=cpie^.dna; ccircle:=(cpie^.key.piecon = circular) end; (* convert cdn and clo into internal coordinates *) dn:=cpiestart; int:=0; while dn <> cdn do begin int:=int + dnamax; dn:=dn^.next end; int:=int+clo; windowposition:=int; (* move the internal coordinate back to the start of the before section *) if pbefore > 0 then begin int:=int - pbefore; if int < 1 then if ccircle then begin steps:=pbefore; while int < 1 do int:=int+piecelength(cpie) end else begin (* put out blanks since there is nothing there *) for bk := 1 to 1-int do write(ind, columnfiller); int:=1; steps:=windowposition-int; end else steps:=pbefore end else steps:=0; (* set ourselves just before the startpoint indicated by the internal coordinates, so that stepbase will move us to the first base *) dn:=cpiestart; j:=0; while j+dnamax < int do begin j:=j+dnamax; dn:=dn^.next end; lo:=(int-j) - 1 end; (* setup *) procedure wacycle; (* print cycle for window and after: it stops if the piece is linear and we hit the end (offend) *) var ba: base; (* to hold the base until we decide if we print it *) counter: integer; (* counter of blanks *) begin while steps > 0 do begin ba := stepbase(cpiestart, dn, lo); steps:=pred(steps); (* write the base (ba) unless it is the first base of the first dna link of a linear sequence *) if lo = 1 then if dn = cpiestart then if not ccircle then begin offend:=true; (* steps was decremented, so we increment the blanks: *) if steps >= 0 then for counter := 1 to steps+1 do write(ind, columnfiller); steps:=0 end else write(ind, basetochar(ba)) else write(ind, basetochar(ba)) else write(ind, basetochar(ba)) end end; procedure writebefore; (* write out the region before the print window *) begin (* steps were calculated in setup *) while steps > 0 do begin write(ind, basetochar(stepbase(cpiestart, dn, lo))); steps:=pred(steps) end end; procedure writewindow; (* write out the printing window *) begin offend:=false; if pwindow > 0 then begin (* write the first base before checking in wacycle *) write(ind, basetochar(stepbase(cpiestart, dn, lo))); steps:=pwindow-1; wacycle end end; procedure writeafter; (* write out the region following the printing window *) var counter: integer; (* counts columnfiller characters *) begin if pafter > 0 then begin if offend then begin for counter := 1 to pafter do write(ind, columnfiller) end else begin steps:=pafter; wacycle end; end; end; procedure writepiename; (* write the piece name of cpie (type: pieceptr) *) var done: boolean; (* are we done writing the name? *) i: integer; (* index to piece name *) begin with cpie^.key.hea.keynam do begin i := 0; done := false; while not done do begin i := succ(i); write(ind, letters[i]); if i >= length then done := true else if letters[i] = ' ' then done := true; end; end; end; procedure writepiecenumber; (* write the piece number *) begin write(ind, map[current].pd^.n:pienumwid) end; procedure writeposition; (* write out the window position in piece coordinates *) begin write(ind, inttopie(windowposition, cpie):posnumwid) end; procedure writedirection; (* write out the direction of the window on the coordinates *) begin write(ind, ' '); if cpie^.key.piedir = plus then write(ind, '+') else write(ind, '-') end; procedure writesimilarity; (* compare the previous to the current window, print the length of the similarity *) var similarity: integer; (* the length of the run *) done: boolean; (* the finish line *) (* the racers: pointers to two bases a and b *) adn, bdn: dnaptr; alo, blo: dnarange; abase, bbase: base; (* the factors of judgement *) begin (* on your marks *) adn:=pdn; alo:=plo; bdn:=cdn; blo:=clo; (* get set *) done:=false; similarity := 0; abase:=adn^.part[alo]; bbase:=bdn^.part[blo]; (* go... *) while not done do begin (* the race is on ... *) if ord(abase) = ord(bbase) (* who is winning??? *) then begin (* ... still neck and neck *) similarity:=succ(similarity); (* the clock is ticking ... *) if similarity < window (* the flag is waving ... *) then begin abase:=stepbase(ppiestart, adn, alo); (* a races foreward, *) bbase:=stepbase(cpiestart, bdn, blo); (* ... and so does b *) if alo = 1 then if adn = ppiestart (* has a tripped?? *) then if not pcircle then done:=true; (* the contestant a just lost the race "." *) if blo = 1 then if bdn = cpiestart (* has b tripped?? *) then if not ccircle then done:=true; (* the contestant b just lost the race "." *) end else done:=true (* they ran off the track ... *) end else done:=true (* they both died ... *) end; (* announce the race time *) write(ind, similarity:6) end; procedure updateprevious; (* update the previous position for keeping track of the similarity *) begin ppie:=cpie; ppiestart:=cpiestart; pdn:=cdn; plo:=clo; pcircle:=ccircle; end; procedure writeline; (* write a line out to the index ind *) begin setup; (* write(ind, columnseparation); 2003 Aug 11: NOT printing this is a significant change in output! it allows the columnfiller to fuse to the sequence. *) writebefore; write(ind, columnseparation); writewindow; write(ind, columnseparation); writeafter; write(ind, columnseparation); writepiecenumber; write(ind, columnseparation); writeposition; write(ind, columnseparation); writedirection; write(ind, columnseparation); writesimilarity; write(ind, columnseparation); writepiename; writeln(ind); updateprevious end; begin (* writeindex *) (* by making the last map position the one previous to the first map position we have made the index formally circular for similarity purposes *) with map[numberofoligos] do begin pdn:=dnalink; plo:=dnaloc; ppie:=pd^.p; ppiestart:=ppie^.dna; pcircle:=(ppie^.key.piecon = circular) end; for current:=1 to numberofoligos do writeline end; begin (* index *) initialize; readsequences; writeheader; if numberofsequences = 0 then writeln(output, ' no numbered sequences in this book') else begin if teaching then begin writeindex; (* this is the unsorted index promised in the header *) page(ind); writeln(ind); writeln(ind, ' now i will sort the index alphabetically.') end; quicksort(1, numberofoligos); if teaching then begin writeln(ind, ' i am done sorting.'); writeln(ind, ' how long would it have taken you?'); writeln(ind, ' the sorted index is like a telephone directory.'); writeln(ind, ' it looks like this:'); writeln(ind) end; writeindex end; 1: end. (* index *)