program dbbk ( db, l1, changes, output ); (* dbbk: database into delila book conversion program by Matthew Yarus modified by: Dr. Thomas D. Schneider National Institutes of Health National Cancer Institute Center for Cancer Research Nanobiology Program Molecular Information Theory Group Frederick, Maryland 21702-1201 schneidt@mail.nih.gov permanent email: toms@alum.mit.edu http://www.ccrnp.ncifcrf.gov/~toms/ module libraries required: delman, delmods *) label 1; (* used to halt program *) const (* begin module version *) version = 3.46; (* of dbbk.p 2009 Apr 08 2009 Apr 08, 3.46: correct test for VERSION 2008 Nov 03, 3.45: use VERSION instead of ACCESSION to get the version number 2007 Dec 06, 3.44: make the piece circular in the l1 if in the db 2007 Apr 03, 3.43: allow underscores in ORGANISM name 2006 Oct 25, 3.42: allow for strain name in organism: ORGANISM Escherichia coli K12 gives E.coli-K12 2004 Jul 8, 3.41: upgrade to gpctime 2003 Aug 19, 3.40: unknown sequences are now marked by 'unknown' and '?'. 2003 Aug 19, 3.39: cleanup 2003 Aug 19, 3.38: changes reported in blocks 2003 Aug 19, 3.36: improve changes so that they are in blocks 2000 Sep 6, 3.35: name lengths increased 1999 Jun 15: fixed bug in copydna that was counting spaces as bases, resulting in misplacement of changes features. 1999 Feb 11: make absolutely sure that there are no weird control characters copied to the DNA sequence output. 1997 May 4: unrecorded changes 1996 August 17: the changes file now uses the features format of the lister program so that the changes can be easily observed. 1995 December 8: program does not write long names anymore. This allows the new feature of delila to give blanks and therefore the alist will not list accesion numbers for full names. 1994 June 10: program now uses ACCESSION numbers! 1992 sep 14: program now has input from db and output to l1 to fit catal. origin before 1983 july 19 *) (* end module version *) (* begin module describe.dbbk *) (* name dbbk: database to delila book conversion program synopsis dbbk(db: in, l1: out, changes: out, output: out) files db: contains one or more complete entries from either the EMBL or GenBank genetic sequence data bases. These entries may be obtained by using the original libraries or by using an entry extraction program. Dbpull is the delila program for data base accessing; to get complete entries the instruction 'all' must have been used in the dbpull fin file. (See delman.use.dbpull) l1: each db entry is represented in l1 by a delila style entry containing information extracted from the db entry. All of l1 has the biologically oriented structure of a standard delila book. The first line of l1 is not part of an entry, but contains the computer system date and the title of the book. changes: Delila programs cannot handle sequences that have ambiguities because Delila was designed on the assumption that people would finish their sequences. Unfortunately this is not true, and the databases contain bases other than acgt to indicate ambiguity. These are converted to "a" and the cases are reported in this file as "unknown". NOTE: "u" is converted to "t". The format is the one that the lister program uses as features. In the lister map the unknown region is marked by a string of question marks: "???????????". output: messages to the user. description This program converts GenBank and EMBL data base entries into a book of delila entries. The organism name is fused together with a period and is used for both organsim and chromosome names. Organism and chromosome only change if the name changes in db. The names of pieces were given by the ACCESSION number (1994 June 10) but this does not track the versions. So on 2008 Nov 03 I switched it to VERSION which looks like: J04553.1. This works with catal and delila. examples The changes file looks like: define "unknown:1220-4867" "?" "[]" "[]" 0 3646 @ AC012525 1220.0 +1 "unknown:1220-4867" "" Lister displays this as: * *1210 * *1220 5' c g t g g a a c a a g g a a g a a t t a a a a a 3' [????????? ... unknown:1220-4867 [for brevity the middle part is skipped] *4850 * *4860 * *4870 5' a a a a a a a a a a a a a a a a a a a a t a g a 3' ... ??????????????????????????????????] unknown:1220-4867 see also delila.p, dbpull.p, catal.p, libdef, lister.p author Matthew Yarus and Tom Schneider (modifications) bugs Databases do not have enough data on genes within each piece to make a book with gene sections. The changes file is a design bug in Delila. Genus names are limited to genuslimit (a constant) to avoid names longer than the standard Delila limit. If a name is larger than idlength the program simply stops reading the name and then dies when it reads the number of bases in the entry. This is currently fixed by making the name 100 characters but should be done better later. technical notes dbbk is known to convert GenBank entries from July 1989. It may not work on later versions. *) (* end module describe.dbbk *) (* begin module datetime.const *) datetimearraylength = 19; (* length of dataarray for dates, It is just long enough to include the 4 digit year - solving the year 2000 problem: 1980/06/09 18:49:11 123456789 123456789 1 2 *) (* end module datetime.const version = 1.08; (@ of gpctime.p 2004 Jan 21 *) (* begin module dbbk.constant *) (* more constants *) idlength = 100; (* length of identification code at beginning of each db library entry *) namelength = idlength; (* length of computer system date *) lclength = 3; (* length of codes identifying each library line, 'lc' stands for 'line codes' *) (* the following constants are used for string comparisons by the function lcequal. if your computer system does not do these packed array comparisons, lcequal must be rewritten and the constants replaced by series of assignment statements to arrays *) lcloc = 'LOC'; (* short for locus, line code for genb entry starting line. this line contains the entry id *) lcid = 'ID '; (* marks starting line of embl entry, also has id *) lc3spc = ' '; (* three space line code always marks sequence lines of embl type entry *) lcos = 'OS '; (* marks line of embl entry which contains a name for the organism where the sequence is *) (* lcdef = 'DEF'; (@ stands for 'definition', does for genb what 'os' does for embl *) lcdef = ' O'; (* stands for 'ORGANISM', does for genb what 'os' does for embl *) lcbas = 'BAS'; (* stands for 'BASE COUNT', the line just before ORIGIN in Genbank (hopefully...) *) lcsq = 'SQ '; (* marks line just above embl sequence. this line holds an origin and a base pair total *) lcori = 'ORI'; (* short for 'origin', code for line just above genb sequences. *) lcsit = 'SIT'; (* short for 'sites', line just below genb sequence *) lcterm = '// '; (* this code terminates every entry *) lcdat = 'DAT'; (* first 3 letters of 'date', the first word of a db dateline *) genuslimit = 1; (* maximum number of characters in the genus name to generate *) (* note: this used to be 4, but that gives Esch.coli, while 1 sets it to E.coli, which is much nicer!! *) (* end module dbbk.constant *) type (* begin module dbbk.type *) (* a 'u' after the 'lc' or 'id' in a type name indicates that the array is unpacked; a 'p' indicates that the array is packed; an 's' indicates that the array has more than one dimension *) idutype = array[1..idlength] of char; (* holds id for reading-writing before packing *) idptype = packed array[1..idlength] of char; (* holds id for string comparisons *) alpha = idptype; (* holds computer system date *) lcutype = array[1..lclength] of char; (* holds library line code for reading-writing before packing *) lcptype = packed array[1..lclength] of char; (* holds line codes for string comparisons *) libsused = ( embl, genb, none ); (* used to indicate whether an entry is of embl or genb type *) (* end module dbbk.type *) (* begin module datetime.type *) (* array for dates *) datetimearray = packed array[1..datetimearraylength] of char; (* end module datetime.type version = 1.08; (@ of gpctime.p 2004 Jan 21 *) var (* begin module dbbk.var *) db: text; (* contains complete embl or genbank data base entries *) l1: text; (* contains one converted delila type entry for each entry in the db file *) changes: text; (* changes made to the sequences *) notwarned: boolean; (* True if the user has not yet been warned about changes of bases *) (* end module dbbk.var *) (* begin module halt *) procedure halt; (* stop the program. the procedure performs a goto to the end of the program. you must have a label: label 1; declared, and also the end of the program must have this label: 1: end. examples are in the module libraries. this is the only goto in the delila system. *) begin writeln(output,' program halt.'); goto 1 end; (* end module halt version = 5.38; (@ of prgmod.p 2007 May 15 *) (* begin module decapitilize *) procedure decapitilize(var c: char); (* convert the character c to lower case *) var n: integer; (* c is the n'th letter of the alphabet *) begin n := ord(c); if (n >= ord('A')) and (n <= ord('Z')) then c := chr( n - ord('A') + ord('a')) end; (* end module decapitilize version = 'prgmod 4.01 89 Apr 14 tds'; *) (* begin module dbbk.writeidptype *) procedure writeidptype(var fout: text; writeasterisk, carriagereturn: boolean; thename: idptype ); (* holds the name of the organism in which the sequence occurs *) (* Write the thename name to file fout. If writeasterisk is true, then write "*" at the start. If carriagereturn is true, add a carriage return at the end. *) var index: integer; (* loop counter *) begin if writeasterisk then write ( fout, '* ' ); index := 1; repeat if thename[index] <> ' ' then write ( fout, thename[index] ); index := succ(index); until (index = idlength) or (thename[index] = ' '); if carriagereturn then writeln ( fout ) end; (* end module dbbk.writeidptype *) (* begin module dbbk.finishchanges *) procedure finishchanges(var changes: text; var inchanges: boolean; (* inside a set of changes? *) var changestart: integer; (* coordinate of start of changes *) entryname: idptype; (* holds the name of a fout entry *) basenumber: integer (* current coordinate *) ); (* finish changes *) begin (* finish writing the change *) inchanges := false; {zzz} writeln(changes, 'define "unknown:', changestart:1, '-', basenumber:1, '" "?" "[]" "[]" 0', ' ', (basenumber-changestart-1):1); write(changes,'@ '); writeidptype(changes,false,false,entryname); writeln(changes, ' ', changestart:1, '.0 +1 "unknown:', changestart:1, '-', basenumber:1, '" ""'); (* old format: define "change:" "-" "^" "^" 0 @ AC012525 1220.0 +1 "change:" "was n" *) end; (* end module dbbk.finishchanges *) (* begin module copydna *) procedure copydna(var fin, fout, changes: text; var basenumber: integer; (* count of bases written so far *) entryname: idptype; (* holds the name of a fout entry *) var inchanges: boolean; (* inside a set of changes? *) var changestart: integer (* coordinate of start of changes *) ); (* copy a line from file fin to file fout, converting the letters to lower case. Report changes to the changes file. *) var c: char; (* the character being manipulated *) begin (* copydna *) while not eoln(fin) do begin c := fin^; decapitilize(c); (* 1999 Feb 11: make absolutely sure that there are no weird control characters copied to the DNA sequence output. *) if ((ord(c) >= ord('a')) and (ord(c) <= ord('z'))) or (c = ' ') then begin if ((ord(c) >= ord('a')) and (ord(c) <= ord('z'))) then basenumber := basenumber + 1; (* convert 'u' to 't' *) if c = 'u' then c := 't'; (* see if it's ok sequence, otherwise, report to the boss *) if not (c in ['a','c','g','t',' ']) then begin (* this is not a regular base, so report the change *) if notwarned then begin notwarned := false; rewrite(changes); writeln(changes, '* dbbk ', version:4:2 ); { simple old method: writeln(changes, 'define "change:" "-" "^" "^" 0'); } end; if not inchanges then begin inchanges := true; changestart := basenumber; end; { simple old method: write(changes,'@ '); writeidptype(changes,false,false,entryname); writeln(changes, ' ', basenumber:6,'.0 +1 "change:" "was ', c,'"'); } c := 'a'; (* smash it *) end else begin if (c <> ' ') and inchanges then finishchanges(changes, inchanges, changestart, entryname, basenumber) {zzz} end; fout^ := c; put(fout); end; get(fin) end; readln(fin); writeln(fout); end; (* copydna *) (* end module copydna *) (* begin module dbbk.readlcu *) procedure readlcu ( var lib: text; var liblcu: lcutype; var liblcp: lcptype ); (* grabs a library line code and then packs it for string comparisons *) var index: integer; (* for loop counter *) begin (* readlcu *) if eof(lib) then begin (* this probably should not happen *) writeln(output,'in readlcu: hit end of file; last line type was', ' linetype: ',liblcp); halt end; for index := 1 to lclength do if eoln ( lib ) then liblcu[index] := ' ' else read ( lib, liblcu[index] ); pack ( liblcu, 1, liblcp ) end; (* readlcu *) (* end module dbbk.readlcu *) (* begin module dbpull.lcequal *) function lcequal ( lcp1, lcp2: lcptype ): boolean; (* tests two line codes for equality. if your computer system version of pascal does not do string comparisons simply change lcequal and the program will be fixed throughout *) begin (* lcequal *) lcequal := false; if lcp1 = lcp2 then lcequal := true end; (* lcequal *) (* end module dbpull.lcequal *) (* begin module dbpull.idequal *) function idequal(a, b: idptype): boolean; (* are id a and id b equal? *) var equal: boolean; (* are letters in a and b the same? *) index: integer; (* index *) begin index := 1; repeat equal := (a[index] = b[index]); index := succ(index) until (not equal) or (index > idlength); idequal := equal end; (* end module dbpull.idequal *) (* begin module dbpull.idclear *) procedure idclear(var ida: idptype); (* clear id a *) var i: integer; (* index *) begin for i := 1 to idlength do ida[i] := ' ' end; (* end module dbpull.idclear *) (* begin module dbpull.idcopy *) procedure idcopy(ida: idptype; var idb: idptype); (* copy id a into id b *) var i: integer; (* index *) begin for i := 1 to idlength do idb[i] := ida[i] end; (* end module dbpull.idcopy *) (* begin module dbpull.getid *) procedure getid ( var fin: text; (* see global *) var finidp: idptype ); (* holds requested library id *) (* finds the next string of non-space characters following the file cursor, adding spaces at the end if the string is too short, and then packs the string into the finidp array *) var index: integer; (* counter for loop *) finidu: idutype; (* holds id for reading in before packing *) begin (* getid *) while fin^ = ' ' do (* advances to first id character *) get ( fin ); index := 0; (* loop initialization value *) (* the following loop grabs all id characters *) while ( fin^ <> ' ' ) and ( index < idlength ) do begin index := succ ( index ); if eoln ( fin ) then finidu[index] := ' ' else read ( fin, finidu[index] ); (* decapitilize( finidu[index] ) (@ convert the name to lower case *) end; (* the following loop fills out id if id is too short *) while index < idlength do begin index := succ ( index ); finidu[index] := ' ' end; pack ( finidu, 1, finidp ) end; (* getid *) (* end module dbpull.getid *) (* begin module dbpull.getspecies *) procedure getspecies ( var fin: text; (* see global *) var finidp: idptype ); (* holds requested library id *) (* skip blanks, pick up the next two names from fin, put a period between them and return this as the species name. Add spaces at the end if the string is too short, and pack the string into the finidp array. Note that the genus name is not longer than genuslimit. 2007 Apr 3: interpret '_' as spaces in the name. *) var index: integer; (* counter for loop *) finidu: idutype; (* holds id for reading in before packing *) begin (* getspecies *) while fin^ = ' ' do (* advances to first id character *) get ( fin ); index := 0; (* loop initialization value *) (* grab the first part of the name *) while ( fin^ <> ' ' ) and ( fin^ <> '_' ) and ( index < idlength ) do begin index := succ ( index ); if eoln ( fin ) then finidu[index] := ' ' else read ( fin, finidu[index] ); if finidu[index] = '_' then finidu[index] := ' '; (* decapitilize( finidu[index] ) (@ convert the name to lower case *) end; (* now skip to the next part of the species name *) while (fin^ = ' ') or (fin^ = '_') do get ( fin ); (* advances to first id character *) (* force genus to be genuslimit characters long *) if index > genuslimit then index := genuslimit; (* and advance the index *) index := succ(index) ; (* loop initialization value *) (* now put the period into the name *) finidu[index] := '.'; (* grab the second part of the name *) { while ( fin^ <> ' ' ) and ( index < idlength ) do begin } (* 2006 Oct 25: read to the end of the line. Accept spaces but convert them to dashes. *) while not eoln(fin) do begin index := succ ( index ); if eoln ( fin ) then finidu[index] := ' ' else begin read ( fin, finidu[index] ); if finidu[index] = ' ' then finidu[index] := '-' end; (* decapitilize( finidu[index] ) (@ convert the name to lower case *) end; (* the following loop fills out id if id is too short *) while index < idlength do begin index := succ ( index ); finidu[index] := ' ' end; pack ( finidu, 1, finidp ) end; (* getspecies *) (* end module dbpull.getspecies *) (* begin module getdatetime *) procedure getdatetime(var adatetime: datetimearray); (* get the date and time into a single array from the system clock. adatetime contains the date: 1980/06/09 18:49:11 ye mo da ho mi se (year, month, day, hour, minute, second). As of 2000 February 18, the Sun Pascal compiler requires a formatting statement. This statement allows the date to be generated in this standard Delila format in a single call. Information about the formatting statement is available on the manual page for date in Unix. If a computer does not have this method, see the 'oldgetdatetime' routine in delmod.p (http://www.lecb.ncifcrf.gov/~toms/delila/delmod.html) for some conversion code. GPC Functions: function GetUnixTime (var MicroSecond : Integer) : UnixTimeType; http://agnes.dida.physik.uni-essen.de/~gnu-pascal/gpc_109.html#SEC109 7.10.8 Date And Time Routines procedure GetTimeStamp (var t : TimeStamp); function Date (t : TimeStamp) : packed array [1 .. DateLength] of Char; function Time (t : TimeStamp) : packed array [1 .. TimeLength] of Char; DateLength and TimeLength are implementation dependent constants. GetTimeStamp (t) fills the record `t' with values. If they are valid, the Boolean flags are set to True. TimeStamp is a predefined type in the Extended Pascal standard. It may be extended in an implementation, and is indeed extended in GPC. For the full definition of `TimeStamp', see section 8.255 TimeStamp. *) var t: TimeStamp; (* begin module pluckdigit *) function pluckdigit(number, logplace:integer): char; (* return the digit at the place value ('logplace') position of number. example: pluckdigit(13625, 3) = 3 pluckdigit(13625, 4) = 1 This routine was taken from module numberdigit in prgmod.p, but is modified so as not to give the sign. Instead it gives zeros above the digits. 'myabsolute' replaced 'absolute', which is apparently a keyword for GPC. The name is kept for to keep the code looking similar to its origin. *) var place: integer; (* the exponent of logplace *) count: integer; (* used to make place *) myabsolute: integer; (* the absolute value of number *) acharacter: char; (* the character to be returned *) procedure digit; (* extract a digit at the place position *) var tenplace: integer; (* ten times place *) z: integer; (* an intermediate value *) d: integer; (* the digit extracted *) begin (* digit *) tenplace:=10*place; z:=myabsolute-((myabsolute div tenplace)*tenplace); if place = 1 then d:=z else d:= z div place; case d of 0: acharacter:='0'; 1: acharacter:='1'; 2: acharacter:='2'; 3: acharacter:='3'; 4: acharacter:='4'; 5: acharacter:='5'; 6: acharacter:='6'; 7: acharacter:='7'; 8: acharacter:='8'; 9: acharacter:='9'; end end; (* digit *) begin (* pluckdigit *) place:=1; for count:=1 to logplace do place:=10*place; if number=0 then begin acharacter:='0' end else begin myabsolute:=number; if myabsolute >= place then digit else acharacter := '0' end; pluckdigit:=acharacter end; (* pluckdigit *) (* end module pluckdigit *) begin (* according to: http://agnes.dida.physik.uni-essen.de/~gnu-pascal/gpc_109.html#SEC109 *) GetTimeStamp(t); (* Predefined time stamp: http://agnes.dida.physik.uni-essen.de/~gnu-pascal/gpc_389.html#SEC389 TimeStamp = {@@packed} record DateValid, TimeValid : Boolean; Year : Integer; Month : 1 .. 12; Day : 1 .. 31; DayOfWeek : 0 .. 6; { 0 means Sunday } Hour : 0 .. 23; Minute : 0 .. 59; Second : 0 .. 61; { to allow for leap seconds } MicroSecond : 0 .. 999999 end; *) with t do begin if TimeValid then begin { writeln(output,'valid time'); writeln(output,'year =',year:4); writeln(output,'month =',month:2); writeln(output,'day =',day:2); writeln(output,'hour =',hour:2); writeln(output,'minute =',minute:2); writeln(output,'second =',second:2); } adatetime := 'year/mm/dd hh:mm:ss'; adatetime[ 1] := pluckdigit(year,3); adatetime[ 2] := pluckdigit(year,2); adatetime[ 3] := pluckdigit(year,1); adatetime[ 4] := pluckdigit(year,0); adatetime[ 6] := pluckdigit(month,1); adatetime[ 7] := pluckdigit(month,0); adatetime[ 9] := pluckdigit(day,1); adatetime[10] := pluckdigit(day,0); adatetime[12] := pluckdigit(hour,1); adatetime[13] := pluckdigit(hour,0); adatetime[15] := pluckdigit(minute,1); adatetime[16] := pluckdigit(minute,0); adatetime[18] := pluckdigit(second,1); adatetime[19] := pluckdigit(second,0); end else begin writeln(output,'getdatetime: invalid time!'); halt; end end; { Sun compiler method: date(adatetime,'%Y/%m/%d %H:%M:%S'); } end; (* end module getdatetime version = 1.08; (@ of gpctime.p 2004 Jan 21 *) (* begin module writedatetime *) procedure writedatetime(var thefile: text; adatetime: datetimearray); (* expand the date and time out and print in the file *) var index: integer; (* index of datetime *) begin for index:=1 to datetimearraylength do write(thefile,adatetime[index]) end; (* end module writedatetime version = 1.08; (@ of gpctime.p 2004 Jan 21 *) (* begin module readdatetime *) procedure readdatetime (var thefile: text; var adatetime: datetimearray); (* read the date and time from the file *) var index: integer; (* to the udatetime *) (* the following is an unpacked date time array, to avoid reading into a packed array. reading into a packed array is not transportable *) udatetime: array[1..datetimearraylength] of char; begin for index:=1 to datetimearraylength do read(thefile,udatetime[index]); pack(udatetime, 1, adatetime); if (adatetime[3]='/') and (adatetime[12]=':') then begin writeln(output,' old datetime (only 2 year digits) read: ', adatetime:datetimearraylength); end; end; (* end module readdatetime version = 1.08; (@ of gpctime.p 2004 Jan 21 *) (* begin module skipblanks *) (* 2003 July 31: tab is considered a blank character *) function isblank(c: char): boolean; (* is the character c blank or tab? *) const tab = 9; (* tab character *) begin isblank := (c = ' ') or (ord(c) = tab) end; procedure skipblanks(var thefile: text); (* skip over blanks until a non-blank, or end of line, is found *) begin while isblank(thefile^) and not eoln(thefile) do get(thefile); end; procedure skipnonblanks(var thefile: text); (* skip over nonblanks until a blank, or end of line, is found *) begin while (not isblank(thefile^)) and not eoln(thefile) do get(thefile); end; procedure skipcolumn(var thefile: text); (* skip over a data column *) begin skipblanks(thefile); skipnonblanks(thefile) end; (* end module skipblanks version = 5.38; (@ of prgmod.p 2007 May 15 *) (* begin module dbbk.note *) procedure note ( var fout: text; (* contains delila type entries converted from the entries in the fin file *) piecenum: integer ); (* delila pieces are numbered *) (* simulates the note section of a delila entry *) begin (* note *) writeln ( fout, 'note' ); writeln ( fout, '* #', piecenum:1 ); writeln ( fout, 'note' ); end; (* note *) (* end module dbbk.note *) (* begin module dbbk.dna *) procedure dna ( var fin: text; (* holds embl and genb entries *) var fout: text; (* contains delila type entries converted from the entries in the fin file *) libtitle: libsused ; (* indicates whether a fin entry is of the embl or genbank format *) entryname: idptype ); (* holds the name of a fout entry *) (* simulates the dna section of a delila entry *) var basenumber: integer; (* count of the bases written so far *) dumpint: integer; (* grabs an integer that is not used *) finlcu: lcutype; (* grabs fin line codes *) finlcp: lcptype; (* packs fin line codes for string comparisons *) inchanges: boolean; (* inside a set of changes? *) changestart: integer; (* coordinate of start of changes *) begin basenumber := 0; inchanges := false; changestart := -maxint; writeln ( fout, 'dna' ); case libtitle of none: begin writeln ( output, ' PROCEDURE DNA HAS BIZARRE VALUE OF' ); writeln ( output, ' LIBTITLE = NONE' ); halt; end; (* note: procedure getentry has placed file cursor at the beginning of sequence data before dna is called *) embl: repeat readlcu ( fin, finlcu, finlcp ); if lcequal ( finlcp, lc3spc ) (* lc3spc is used because embl sequence data lines have no line codes *) then begin write ( fout, '* ' ); (* all data lines in delila entries start with '*'s *) copydna ( fin, fout, changes, basenumber, entryname, inchanges, changestart ) end else readln ( fin ); until lcequal ( finlcp, lcterm ); genb: begin readlcu ( fin, finlcu, finlcp ); while ( not lcequal ( finlcp, lcsit )) and ( not lcequal ( finlcp, lcterm )) do begin (* sometimes no 'sites' section occurs after sequence, so we run into terminus code *) read ( fin, dumpint ); (* genb sequence data lines have no line code, but they do start with a cooridinate integer which is not needed here *) write ( fout, '* ' ); copydna ( fin, fout, changes, basenumber, entryname, inchanges, changestart ); readlcu ( fin, finlcu, finlcp ); end; end; end; (* final call if there are changes left *) if inchanges then finishchanges(changes, inchanges, changestart, entryname, basenumber); writeln ( fout, 'dna' ); end; (* end module dbbk.dna *) (* begin module dbbk.piece *) procedure piece ( var fin: text; (* holds embl and genbank entries *) var fout: text; (* contains delila style entries converted from fin entries *) piecenum: integer; (* delila pieces are numbered *) libtitle: libsused; (* indicates whether entry is of embl or genbank type *) bpint: integer; (* holds the number of base pairs found in the fin entry sequence data *) entryname: idptype; (* holds the name of a fout entry *) topology: char); (* linear or circular topology *) (* simulates the piece section of a delila entry *) var index: integer; (* loop counter *) begin writeln ( fout, 'piece' ); writeidptype(fout, true,true,entryname); writeln(fout, '* '); note ( fout, piecenum ); writeln ( fout, '* 1' ); (* this next loop is done twice because all sequence is assigned the value of 'linear' to avoid certain complications *) for index := 1 to 2 do begin if topology = 'l' (* new feature 2007 Dec 06 *) then writeln ( fout, '* linear' ) else writeln ( fout, '* circular' ); writeln ( fout, '* +' ); writeln ( fout, '* 1' ); writeln ( fout, '* ', bpint:1 ); end; dna ( fin, fout, libtitle, entryname ); (* this procedure and each of the next three in turn calls the preceding procedure. this hierarchy is meant to simulate the hierarchies of biological classification *) writeln ( fout, 'piece' ); end; (* end module dbbk.piece *) (* begin module dbbk.chromosome *) procedure chromosome ( var fin: text; (* holds embl and genb entries *) var fout: text; (* contains delila style entries converted from fin entries *) piecenum: integer; (* delila pieces are numbered *) libtitle: libsused; (* indicates whether fin entry is of an embl or genb format *) bpint: integer; (* holds the number of base pairs found in the fin entry sequence data *) entryname: idptype; (* holds name of a fout entry *) orgname: idptype; (* holds the name of the organism in which the sequence occurs *) topology: char); (* linear or circular topology *) (* simulates chromosome section of a delila entry *) begin writeln ( fout, 'chromosome' ); (* formerly dbbk gave the entry name as the name, with the line writeln ( fout, '* ', entryname ); writeln ( fout, '* ', entryname ); suppress that and use the organism: *) writeidptype( fout, true,true,orgname); writeln(fout, '* '); writeln ( fout, '* 1' ); writeln ( fout, '* ', bpint:1 ); piece ( fin, fout, piecenum, libtitle, bpint, entryname, topology ); end; (* end module dbbk.chromosome *) (* begin module dbbk.organism *) procedure organism ( var fin: text; (* holds embl and genb entries *) var fout: text; (* contains delila style entries converted from fin entries *) piecenum: integer; (* delila pieces are numbered *) libtitle: libsused; (* indicates whether fin entry is of embl or genbank format *) bpint: integer; (* holds number of base pairs found in the fin entry sequenc data *) entryname: idptype; (* holds the name of the fout delila style entry *) var firstorganism: boolean; (* true if this is the first organism to be written *) orgname: idptype; (* holds the name of the organism in which the sequence occurs *) var oldorgname: idptype; (* old organism name. kept track of to avoid writing new organism structure when it is not needed. *) topology: char); (* linear or circular topology *) (* simulates the organism section of a delila entry *) begin if idequal(orgname, oldorgname) then piece(fin, fout, piecenum, libtitle, bpint, entryname, topology) else begin if firstorganism then firstorganism := false (* avoid closing non-existant previous *) else begin (* close up previous organism *) writeln ( fout, 'chromosome' ); writeln ( fout, 'organism' ); end; write ( output, 'organism ' ); writeidptype( output, true,true,orgname); writeln ( fout, 'organism' ); writeidptype( fout, true,true,orgname); writeln(fout, '* '); writeln ( fout, '* bases' ); chromosome ( fin, fout, piecenum, libtitle, bpint, entryname, orgname, topology); idcopy(orgname, oldorgname) end; (* identify the entry under this organism *) writeidptype(output, false,false,entryname); writeln(output); end; (* end module dbbk.organism *) (* begin module dbbk.getentry *) procedure getentry ( var fin: text; (* holds embl and genb entries *) var fout: text; (* contains delila style entries converted from fin entries *) piecenum: integer ; (* delila pieces are numbered *) var firstorganism: boolean; (* true if this is the first organism to be written *) var oldorgname: idptype ); (* old organism name. kept track of to avoid writing new organism structure when it is not needed. *) (* goes to the beginning of a fin entry, and then runs through it grabbing values for dna, piece, etc. to use *) var done: boolean; (* terminates loops when certain conditions are met *) dumpword: idptype; (* grabs words that come between the fin file cursor and data that it is trying to get to *) finlcu: lcutype; (* grabs fin line codes *) finlcp: lcptype; (* packs fin line codes for string comparisons *) libtitle: libsused; (* indicates whether a fin entry is an embl or genbank format *) bpint: integer; (* holds the number of base pairs found in the fin entry sequence data *) entryname: idptype; (* holds the name of a fout entry *) locusname: idptype; (* dummy variable holds the name of a fout entry *) orgname: idptype; (* holds the name of the organism in which the genetic sequence occurs *) topology: char; (* linear or circular topology for this piece *) begin done := false; repeat readlcu ( fin, finlcu, finlcp ); if lcequal ( finlcp, lcid ) then begin libtitle := embl; getid ( fin, entryname ); (* entryname is filled with the id code of the entry *) repeat readlcu ( fin, finlcu, finlcp ); if lcequal ( finlcp, lcos ) (* the first 'os ' coded line generally starts with the name of the entry organism *) then begin getid ( fin, orgname ); done := true end; readln ( fin ) until done; done := false; repeat readlcu ( fin, finlcu, finlcp ); if ( lcequal ( finlcp, lcsq )) or ( eof ( fin )) then begin getid ( fin, dumpword ); (* dumps the word 'sequence' *) readln ( fin, bpint ); (* gets cursor to beginning of the genetic sequence data *) done := true end else readln ( fin ) until done; organism ( fin, fout, piecenum, libtitle, bpint, entryname, firstorganism, orgname, oldorgname, topology ); end else if lcequal ( finlcp, lcloc ) then begin libtitle := genb; get(fin); get(fin); (* skips the 'US' of 'locus' *) (* 1994 June 10: This can't be done anymore because the LOCUS is unstable and keeps changing. So we zap it by putting it into a dummy variable for now: zapped: getid (fin, entryname); (@ entryname assigned id code *) getid (fin, locusname); (* entryname assigned id code *) read(fin, bpint); (* get the length *) skipcolumn(fin); (* skip 'bp' *) skipcolumn(fin); (* skip 'DNA' *) skipblanks(fin); (* skip blanks *) if eoln(fin) then begin write(output,'Cannot get topology for '); writeidptype(output,false,true,locusname); halt; end; if (fin^ = 'l') or (fin^ = 'c') then topology := fin^ else begin write(output,'Topology for '); writeidptype(output,false,false,locusname); writeln(output,' is not l(inear) or c(ircular)'); while not eoln(fin) do begin write(output, fin^); get(fin); end; writeln(output); halt; end; { (* show the rest of the line - for testing *) while not eoln(fin) do begin write(output, fin^); get(fin); end; writeln(output); halt; } readln( fin); (* go to next line *) { (* now instead we pick up the ACCSSION NUMBER: *) while fin^ <> 'A' do readln(fin); (* find ACCESSION *) } (* now instead we pick up the VERSION NUMBER: *) while fin^ <> 'V' do begin readln(fin); (* find VERSION *) if eof(fin) then begin writeln(output,'could not find version number'); halt; end; end; while fin^ <> ' ' do get(fin); (* skip to blank *) while fin^ = ' ' do get(fin); (* skip to nonblank *) getid ( fin, entryname ); (* entryname assigned id code *) repeat readlcu ( fin, finlcu, finlcp ); if lcequal ( finlcp, lcbas ) then begin (* there was no ORGANISM since we hit the BASE COUNT line. This is probably an unanotated entry. Give the entry name. *) done := true; write(output,'WARNING: in entry '); writeidptype(output, false,false,entryname); write(output,'there was no ORGANISM. Using:'); writeidptype(output, false,false,entryname); orgname := entryname; end else begin if ( lcequal ( finlcp, lcdef )) or ( eof ( fin )) then begin getid ( fin, dumpword ); (* dump rest of word *) getspecies ( fin, orgname ); done := true end; readln ( fin ) end until done; repeat (* this loop gets to the beginning of sequence *) readlcu ( fin, finlcu, finlcp ); readln ( fin ) until lcequal ( finlcp, lcori ); organism(fin, fout, piecenum, libtitle, bpint, entryname, firstorganism, orgname, oldorgname, topology); end else begin libtitle := none; (* an entry of either type is yet to be found *) readln ( fin ) end; until ( libtitle <> none ) or ( eof ( fin )); (* '<> none' indicates that one embl or genb entry is processed *) end; (* end module dbbk.getentry *) (* begin module dbbk.datebook *) procedure datebook ( var fin: text; (* holds embl and genbank entries *) var fout: text ); (* contains delila style entries converted from fin entries *) (* adds computer system date and possibly fin date to first line of fout *) var index: integer; (* loop counter *) dumpchar: char; (* holds an unwanted character *) findated: boolean; (* 'true' indicates that fin has a dateline *) oldate: datetimearray; (* holds fin date *) adatetime: datetimearray; (* holds computer system date *) finlcu: lcutype; (* holds a fin line code *) finlcp: lcptype; (* finlcu(unpacked), finlcp(packed) *) begin (* datebook *) findated := false; readlcu ( fin, finlcu, finlcp ); if lcequal ( finlcp, lcdat ) then begin findated := true; read ( fin, dumpchar ); (* dumps 'e' of 'date' *) repeat get ( fin ) until ( fin^ <> ' ' ); readdatetime ( fin, oldate ); readln ( fin ) end else reset ( fin ); (* avoids skipping first id or locus code *) getdatetime ( adatetime ); write ( fout, '* ' ); if findated then begin writedatetime ( fout, adatetime ); write ( fout, ', ' ); writedatetime ( fout, oldate ); write ( fout, ', ' ); end else begin for index := 1 to 2 do begin writedatetime ( fout, adatetime ); write ( fout, ', ' ) end; end; end; (* datebook *) (* end module dbbk.datebook *) (* begin module dbbk.order *) procedure order ( var fin: text; (* holds embl and genbank entries *) var fout: text ); (* contains delila style entries converted from fin entries *) (* structures the whole program by calling other procedures *) var firstorganism: boolean; (* true only when the first organsism is about to be written. used to avoid closing the (non-existant) previous organism *) oldorgname: idptype; (* old organism name. kept track of to avoid writing new organism structure when it is not needed. *) piecenum: integer; (* each delila 'piece' posesses a unique number *) begin (* order *) (* initialize things *) reset ( fin ); rewrite ( fout ); if eof ( fin ) then begin writeln ( output, ' INPUT FILE FIN IS EMPTY' ); halt end; piecenum := 0; idclear(oldorgname); notwarned := true; firstorganism := true; (* start generating the book *) datebook ( fin, fout ); writeln ( fout, 'dbbk ', version:4:2 ); (* title is arbitrary *) repeat piecenum := succ ( piecenum ); getentry ( fin, fout, piecenum, firstorganism, oldorgname ) until eof ( fin ); (* close the organisms and chromosomes *) writeln ( fout, 'chromosome' ); writeln ( fout, 'organism' ); if not notwarned then begin write(output,'WARNING: some sequences have been altered,'); writeln(output,'see the changes file.'); end; end; (* order *) (* end module dbbk.order *) begin (* dbbk *) writeln ( output, ' dbbk ', version:4:2 ); order ( db, l1 ); 1: end. (* dbbk *)