program exon(exonp, db, lengths, dinst, ainst, einst, exonfeatures, output); (* exon: determine lengths of exons in GenBank entries Dr. Thomas D. Schneider National Institutes of Health National Cancer Institute Center for Cancer Research Nanobiology Program Molecular Information Theory Group Frederick, Maryland 21702-1201 toms@ncifcrf.gov permanent email: toms@alum.mit.edu (use only if first address fails) http://www.ccrnp.ncifcrf.gov/~toms/ *) label 1; (* end of program *) const (* begin module version *) version = 2.32; (* of exon.p 2007 Dec 10 2007 Dec 10 2.32: fix @ 2007 Dec 10 2.31: stop converting last exon to gene when mRNA (at least) 2007 Dec 10 2.30: allow m, R, r for mRNA. 2007 Dec 10 2.29: tell user exonp is being upgraded 2007 Dec 10 2.28: testing 2007 Dec 10 2.27: version upgrading 2007 Dec 10 2.26: allow mRNA analysis. 2007 Dec 05 2.25: put exon features into heap. see tech. notes 2007 Dec 05 2.24: increase exonmax was too big (100000) see tech. notes 2007 Nov 24 2.23: Segmentation Fault - exonmax was too big (100000) do fast skip of sequence 2007 Nov 23 2.22: increase exonmax 2006 Oct 25 2.21: allow for strain name in organism: ORGANISM Escherichia coli K12 gives E.coli-K12 2000 Nov 2: 2.20: clean up output 2000 Nov 2: 2.19: bug for U41218 2000 May 4: 2.14: cleaned up code so that naming really works 2000 May 3: 2.09: /note is used to name exons 2000 May 3: 2.06: sort exons into ascending order. 2000 May 3: 2.05: names are now put into einst, ainst, dinst 1999 December 1: 2.04: upgrade documentation 1999 June 1: 2.02: three options for exon names 1999 May 4: 1.98: einst file added 1999 April 26: 1.96: exon complements handled 1998 June 3: /gene name is now used for CDS. major use: 1994 November 12 origin 1994 April 10 *) updateversion = 1.26; (* defines lowest acceptable current parameter file *) (* end module version *) (* begin module describe.exon *) (* name exon: determine lengths of exons in GenBank entries synopsis exon(exonp: inout, db: in, dinst: out, ainst: out, einst: out, lengths: out, exonfeatures: out output: out) files exonp: parameters to control the program, one per line: 0: parameterversion: The version number of the program. This allows the user to be warned if an old parameter file is used. (Introduced 2007 Dec 10.) 1: If 'n' then the end exons are not included. These do not have reliable lengths. Even if end exons are included, the program will never add the Delila instructions for the very ends of the CDS, because these are not reliable. Often they are CAP or polyA sites. Specifically, the first coordinate of the CDS is likely to be a CAP and so should not be added to the acceptors in ainst, while the last coordinate of the CDS is likely to be a polyA site and so should not be added to the donors in dinst. 2: if 'd' then gobs of debugging output are printed to the output file. 3: Two constants, theDfromrange and theDtorange, that determine the from and to range to be written for Donor Delila instructions. 4: Two constants, theAfromrange and theAtorange, that determine the from and to range to be written for Acceptor Delila instructions. 5: If the first character is 'e' then exon features are also used. If the second character is 'i' then intron features are also used. 6: If the first character is 'a' (for alternative) then exon features that have one end point the same are included. If it is not 'a' then only exons that are completely different are included. 7: 4 characters that determine the harshness of which entries to keep. The categories are: single letter name string in GenBank: p putative "putative" n notexperimental "not_experimental" g geneprediction "gene prediction" u unpublished "Unpublished" s pseudo "pseudo" The letters 'pngus' are on the parameter line. If a letter is capitilzed, then any entry with that string in it ANYWHERE will be killed. This is harsh but effective at removing GenBank crap. 8: If the first character is 'n' (for "notes") then if there is no /gene or /number for a feature, the program will use the /note feature. WARNING: Despite 15 years of complaining to GenBank, names in notes are NOT PARSABLE and may cause ill health. 9: If the first character is 'r', 'R' or 'm' (for "mRNA") then the mRNA feature is used instead of CDS. Otherwise CDS is used. (Introduced 2007 Dec 10.) If exonp is old (before having a parameter version) exon will attempt to upgrade it. db: a set of GenBank entries lengths: A list of the exon lengths found in db. dinst: Delila instructions for donor sites. ainst: Delila instructions for acceptor sites. einst: Delila instructions for exons. The acceptor from (theAfromrange) and donor to (theDtorange) are used to extend beyond the exon edges. exonfeatures: Locations of exons in the format for the Lister program. output: messages to the user description The program searches for 'CDS'. If the next word is 'join' it parses out the parts of the CDS, determining the lengths of the exons. If 'complement' is found, the complementary exons are identified. To ensure a clean data set, the program eliminates: * single exons in a locus (unreliable data for lengths) * exons which have one end not defined (< or > mark) * exons at the beginning or end of the CDS (unreliable data) * exons that are references to other entries. * duplicate exons within a single locus * exons that have any coordinates the same as other exons in the same entry. This (arbitrarily) eliminates alternative splice cases. To remove further junk from the database, entries that contain any of these phrases are skipped: 'not_experimental' 'gene prediction' 'Unpublished' GenBank contains many mRNA sequences masquerading as DNA. They can be identified by zero length introns. They are ruthlessly eliminated. If a CDS has a no /gene name in the feature table, it will be named like this: U00096.CDS.190-255 no /gene, no /number If a CDS has a /gene name in the feature table, it would be nice to name it like this: U00096.thrA (this name can fail) Unfortunately that alone will fail because all exons end up being named the same! So if there is a /number the name will include it: M95740.IDUA.exon-3 /gene, /number If there is a /gene but no /number the range will be given: M95740.IDUA.427-512 /gene, no /number So there are three options for names. * The exons are placed into ascending order. * The Delila name command is used to name the pieces. examples documentation @article{Stephens.Schneider.Splice, author = "R. M. Stephens and T. D. Schneider", title = "Features of spliceosome evolution and function inferred from an analysis of the information at human splice sites", journal = "J. Mol. Biol.", volume = "228", pages = "1124-1136", year = "1992"} see also dbinst.p author Thomas Dana Schneider bugs technical notes The program deals with alternative splicing by removing any exon that has any coordinate the same as another exon. The program only can accept a single type of organism to be put into the instruction files. It's not clear that one would ever want to mix organisms for this analysis!! The zero coordinate for splice junctions follows the convention of Stephens.Schneider.Splice: it is the base on the intron side of the splice junction. 2007 Dec 05. The exon program would compile and run with exonmax set to 15000. Unfortunately this is not enough for H.sapiens chromosome 1 (NC_000001 247249719 bp). The program compiles (gpc) but gives a 'Segmentation Fault'. The reason (thanks to David Bryant) is that the stack size in Unix is restricted. The Unix command 'limit' gives 'stacksize 8192 kbytes'. There are at least 3 solutions. 1. The Unix stack size can be increased by the command: limit stacksize 65538 Doing so solved the problem. Although this works, it requires setting the operating system so it is not too portable. 2. The exonmax determines the number of exonrecords: fealist: array[1..exonmax] of exonrecord; The exonrecords use the standard 'string' for the gene namewhich has an array of characters whose size is determined by constant maxstring for which the default is 150. Setting maxstring to 20 solves the problem. Although this would work, perhaps it is best to allow long names. 2. Put the array into the program heap, which is unlimited, instead of the stack, which has the current limit. This requires a program change. I implemented it by making the fealist be a pointer to an array: 'fealist^.a' replaced 'fealist' through the code. This worked. Thanks to David Bryant for pointing out the situation and explaining the possible solution of putting the data on the heap. *) (* end module describe.exon *) (* Documentation for Feature table format definition: http://www.ncbi.nlm.nih.gov/collab/FT/index.html http://www.ncbi.nlm.nih.gov/collab/FT/index.html The DDBJ/EMBL/GenBank Feature Table Definition Version 2.0, Dec 15 1997. | 4.3 Data item positions | | The position of the data items within the feature descriptor line is as follows: | | column position data item | --------------- --------- | | 1-5 blank | 6-20 feature key | 21 blank | 22-80 location | | | Data on the qualifier and continuation lines begins in column position 22 | (the first 21 columns contain blanks). The EMBL format for all lines differs | from that described above in that it includes a line type abbreviation in | columns 1 and 2. The following situation appears in U00096: CDS complement(20233..20508) /gene="insA_1" /note="f91; 100 pct identical to ISA1_ECOLI SW: P03827" /codon_start=1 /label=b0022 /product="insertion element IS1 protein InsA" /db_xref="PID:g1786204" /transl_table=11 /translation="MASVSISCPSCSATDGVVRNGKSTAGHQRYLCSHCRKTWQLQFT YTASQPGTHQKIIDMAMNGVGCRATARIMGVGLNTILRHLKNSGRSR" gene complement(20815..21078) /gene="rpsT" CDS complement(20815..21078) /gene="rpsT" /note="f87; 100 pct identical RS20_ECOLI SW: P02378 but includes initiator met; TTG start" /codon_start=1 /label=b0023 /product="30S ribosomal protein S20" /db_xref="PID:g1786206" /transl_table=11 /translation="MANIKSAKKRAIQSEKARKHNASRRSMMRTFIKKVYAAIEAGDK AAAQKAFNEMQPIVDRQAAKGLIHKNKAARHKANLTAQINKLA" To avoid the /gene from being assigned to insA_1, only the first occurance is used. This is done with the ignoreslashgene boolean. 2000 May 4 REMOVE THIS MECHANISM: it is easier to reset the /gene finding mechanisms at each feature using the columnposition = 6 trigger. *) (* begin module interact.const *) (* begin module string.const *) maxstring = 2000; (* the maximum string *) (* end module string.const version = 4.86; (@ of prgmod.p 2004 Sep 8 *) (* end module interact.const version = 7.72; {of delmod.p 2007 Jul 23} *) (* begin module my.filler.const *) fillermax = 20; (* the size of the filler array for a string *) (* end module my.filler.const *) (* from filler.const version = 4.13; (@ of prgmod.p 1994 sep 5 *) type (* begin module interact.type *) (* begin module string.type *) stringptr = ^string; (* pointer to a string *) string = record (* a string of characters *) letters: array[1..maxstring] of char; (* the letters in the string *) length: integer; (* the number of characters in the string *) current: integer; (* the letter we are working on *) next: stringptr; (* the next string in a series *) end; (* end module string.type version = 4.86; (@ of prgmod.p 2004 Sep 8 *) (* end module interact.type version = 7.72; {of delmod.p 2007 Jul 23} *) (* begin module filler.type *) (* the following is an array used to fill a string. it is convenient to have it much shorter than the maxstring, so that it is easy to fill the string using procedure fillstring. the user must declare the value of constant fillermax. *) filler = packed array[1..fillermax] of char; (* end module filler.type version = 4.39; (@ of prgmod.p 1999 November 28 *) (* begin module trigger.type *) trigger = record (* an object to be searched for *) seek: string; (* the characters looked for *) state: integer; (* how close to triggering we are *) skip: boolean; (* trigger not found- skip the line *) found: boolean (* the trigger was found *) end; (* end module trigger.type version = 4.39; (@ of prgmod.p 1999 November 28 *) (* a type for the quicksort module *) position = 0..maxstring; var exonp, db, lengths, dinst, ainst, einst, exonfeatures: text; (* files used by this program *) (* zzz global for now, maybe forever! *) columnposition: integer; (* column position in entry *) (* begin module halt *) procedure halt; (* stop the program. the procedure performs a goto to the end of the program. you must have a label: label 1; declared, and also the end of the program must have this label: 1: end. examples are in the module libraries. this is the only goto in the delila system. *) begin writeln(output,' program halt.'); goto 1 end; (* end module halt version = 7.72; {of delmod.p 2007 Jul 23} *) (* begin module interact.clearstring *) (* begin module clearstring *) procedure clearstring(var ribbon: string); (* empty the string *) var index: integer; (* to the ribbon *) begin (* clearstring *) with ribbon do begin for index := 1 to maxstring do letters[index] := ' '; length := 0; current := 0; end end; (* clearstring *) procedure initializestring(var ribbon: string); (* start the string with a nil pointer. This routine should be called before doing linked list work. This allows the standard string routines to clear the string without killing the pointer. *) begin (* initializestring *) clearstring(ribbon); ribbon.next := nil; end; (* initializestring *) (* end module clearstring version = 4.86; (@ of prgmod.p 2004 Sep 8 *) (* end module interact.clearstring version = 7.72; {of delmod.p 2007 Jul 23} *) (* begin module equalstring *) function equalstring(a, b: string): boolean; (* Test for equality between two strings at current positions. NOTE: A compiler bug results if one directly tests this way: if thedefinition^.nametag = aname The reason is completely not clear! I showed that the parts of the strings were identical, but the whole was not by this test. For this reason it is *critical to test strings with equalstring. *) var index: integer; (* index to both strings *) equal: boolean; (* are letters in a and b the same? *) begin (* equalstring *) if a.length = b.length then begin index := 1; repeat equal := (a.letters[index] = b.letters[index]); index := succ(index) until (not equal) or (index > a.length); equalstring := equal end else equalstring := false end; (* equalstring *) (* end module equalstring version = 4.39; (@ of prgmod.p 1999 November 28 *) (* begin module interact.writestring *) (* begin module writestring *) procedure writestring(var tofile: text; var s: string); (* write the string s to file tofile, no writeln *) var i: integer; (* index to s *) begin (* writestring *) with s do for i := 1 to length do write(tofile, letters[i]) end; (* writestring *) (* end module writestring version = 4.39; (@ of prgmod.p 1999 November 28 *) (* end module interact.writestring version = 4.39; (@ of prgmod.p 1999 November 28 *) (* begin module filler.fillstring *) procedure fillstring(var s: string; a: filler); (* this procedure makes it reasonably easy to fill the string s with characters. one calls the procedure as: *) (* 1 2 3 4 5 *) (* 12345678901234567890123456789012345678901234567890 *) (* fillstring(s, 'this-is-the-string '); the two comments make it easy to line the characters up. also, for this example, it was assumed that the length of filler as defined by the constant fillermax was 50. *) var length: integer; (* of the string without trailing blanks *) index: integer; (* of s *) begin (* fillstring *) clearstring(s); length := fillermax; while (length > 1) and (a[length] = ' ') do length := pred(length); if (length = 1) and (a[length] = ' ') then begin writeln(output, 'fillstring: the string is empty'); halt end; for index := 1 to length do s.letters[index] := a[index]; s.length := length; s.current := 1 end; (* fillstring *) (* end module filler.fillstring version = 7.72; {of delmod.p 2007 Jul 23} *) (* begin module filler.filltrigger *) procedure filltrigger(var t: trigger; a: filler); (* fill the trigger t *) begin (* filltrigger *) fillstring(t.seek,a) end; (* fillstring *) (* end module filler.filltrigger version = 7.72; {of delmod.p 2007 Jul 23} *) (* begin module trigger.proc *) (* this module allows one to scan a series of characters, as from an array or a file, and to "trigger" or detect a simple string in the series. the advantage of the trigger is that several triggers can "observe" a stream of characters at once, each looking for a different thing. some other modules required: interact.const, interact.type *) procedure resettrigger(var t: trigger); (* reset the trigger to ground state *) begin (* resettrigger *) with t do begin state := 0; skip := false; found := false end end; (* resettrigger *) procedure testfortrigger(ch: char; var t: trigger); (* look at the character ch. if it is part of the trigger (at the current trigger state), then the trigger state goes higher. if it is not part of the trigger then the trigger state is reset, skip is true and one should skip onward to find the trigger. if the trigger is found, found is true. 1996 Sep 12: Bug found! In the case of a trigger "ab", the program used to miss it for situations like "aab". This was because at the first a it would step up. Then it would see the second a and recognize that was not part of ab. It would fail to realize that it could be the start of a new one. The code now accounts for that possibility. *) begin (* testfortrigger *) with t do begin state := succ(state); { writestring(list,seek); writeln(list,'testfortrigger seek.letters[',state:1,']:', seek.letters[state],' ch:',ch); } if seek.letters[state] = ch then begin skip := false; if state = seek.length then found := true else found := false end else begin (* it failed. But wait! It could be the beginning of a NEW trigger string! *) if seek.letters[1] = ch then begin state := 1; skip := false; found := false end else begin (* reset trigger *) state := 0; skip := true; found := false end end end end; (* testfortrigger *) (* end module trigger.proc version = 7.72; {of delmod.p 2007 Jul 23} *) (* begin module copyaline *) procedure copyaline(var fin, fout: text); (* copy a line from file fin to file fout *) begin (* copyaline *) while not eoln(fin) do begin fout^ := fin^; put(fout); get(fin) end; readln(fin); writeln(fout); end; (* copyaline *) (* end module copyaline version = 7.72; {of delmod.p 2007 Jul 23} *) (* begin module copyline *) procedure copyline(var fin, fout: text); (* copy a line from file fin to file fout but DO NOT CARRIAGE RETURN on the fout. Carriage return on the fin. *) begin (* copyline *) while not eoln(fin) do begin fout^ := fin^; put(fout); get(fin) end; readln(fin); end; (* copyline *) (* end module copyline version = 4.39; (@ of prgmod.p 1999 November 28 *) (* begin module copylines *) function copylines(var fin, fout: text; n: integer): integer; (* copy n lines of file fin to file fout. the actual number of lines copied is returned. *) var index: integer; (* the current line number *) begin (* copylines *) index := 0; while (not eof(fin)) and (index < n) do begin copyaline(fin, fout); index := succ(index) end; copylines := index end; (* copylines *) (* end module copylines version = 7.72; {of delmod.p 2007 Jul 23} *) (* ZAP zzz begin module skipblanks *) procedure skipblanks(var thefile: text); (* skip over blanks until a non-blank, or end of line, is found *) begin while (thefile^ = ' ') and not eoln(thefile) do begin (* modify skipblanks so that columns are tracked properly *) columnposition := succ(columnposition); get(thefile); end; end; procedure skipnonblanks(var thefile: text); (* skip over nonblanks until a blank, or end of line, is found *) begin while (thefile^ <> ' ') and not eoln(thefile) do get(thefile); end; { procedure skipcolumn(var thefile: text); (* skip over a data column *) begin skipblanks(thefile); skipnonblanks(thefile) end; } (* ZAP zzz end module skipblanks version = 4.16; (@ of prgmod.p 1996 August 12 *) (* begin module grabtoken *) procedure grabtoken(var thefile: text; var thestring: string); (* skip any blanks and then grab the next token from the file *) var c: char; (* a character in thefile *) done: boolean; (* done finding the name *) begin skipblanks(thefile); done := false; with thestring do begin length := 0; while not done do begin if eoln(thefile) then done := true else begin read(thefile,c); if c = ' ' then done := true else begin length := succ(length); letters[length] := c; end end end end end; (* end module grabtoken version = 4.39; (@ of prgmod.p 1999 November 28 *) (* begin module grabquote *) procedure grabquote(var thefile: text; var thestring: string); (* skip any blanks and then grab up to the next quote from the file *) var c: char; (* a character in thefile *) done: boolean; (* done finding the name *) begin skipblanks(thefile); done := false; with thestring do begin length := 0; while not done do begin if eoln(thefile) then done := true else begin read(thefile,c); if c = '"' then done := true else begin length := succ(length); letters[length] := c; end end end end end; (* end module grabquote version = 4.16; (@ of prgmod.p 1996 August 12 *) (* begin module exon.themain *) procedure themain(var exonp, db, lengths, dinst, ainst, einst, exonfeatures: text); (* the main procedure of the program *) const exonmax = 30000; (* largest number of exons the program can handle *) colwid = 7; (* width of columns, not including separater space *) type exonrecord = record (* a record of an exon *) thestart: integer; (* the start *) thestop: integer; (* the stop *) FEA: integer; (* which cds or mRNA of the locus it is in *) complement: boolean; (* whether it is on the complementary strand *) getthestart: boolean; (* whether to get the start *) getthestop: boolean; (* whether to get the stop *) printasexon: boolean; (* if true, the start or stop represent an exon. otherwise it must be an intron. *) genenamefound: boolean; (* if there is a /gene for the CDS, record it *) genenamestring: string; (* the gene name *) genenumberfound: boolean; (* if there is a /number for the CDS, record it *) genenumberstring: string; (* the gene number *) notestring: string; (* a string containing the last note *) notenumber: integer; (* count of notes since last exon *) featurenumber: integer; (* the number of the feature in the entry where this item was found. This allows notes to be attached only to the right feature *) end; (* exonrecordptr is a pointer to an exonset, which is a holder for an exonrecord that puts the record on the essentially infinite heap instead of the limited stack. *) exonset = record (* holder of the array of exonrecords *) a: array[1..exonmax] of exonrecord; end; exonrecordptr = ^exonset; (* pointer to the holder of exon records *) {qqq} var accession: trigger; (* trigger to find the ACCESSION pattern *) accname: string; (* the current accession name *) allowalternative: boolean; (* allow alternate splicing *) atcomplement: boolean; (* we are at a complement *) c: char; (* a character in db *) cprevious: char; (* the previous value of c. This is no longer used, but leave the mechanism for now. *) FEA: trigger; (* trigger to find the CDS or mRNA pattern *) FEAcount: integer; (* count of cds or mRNA in current entry *) closeparen: trigger; (* trigger to find the end of the join pattern *) complement: trigger; (* trigger to find the end of the complement pattern *) completeunit: boolean; (* true when a unit is complete *) comma: trigger; (* trigger to find the end of the comma pattern *) done: boolean; (* done fast skipping of sequence *) debug: boolean; (* true when debugging the program *) features: trigger; (* trigger to find the ACCESSION pattern *) getexons: boolean; (* true if we are to get exon features *) getintrons: boolean; (* true if we are to get intron features *) dotcount: integer; (* count number of dots in a unit of a CDS *) entryend: trigger; (* a mark for the end of an entry *) exon: trigger; (* trigger to find the exon feature *) intron: trigger; (* trigger to find the intron feature *) { (* Original definition before 2007 Dec 05: *) fealist: array[1..exonmax] of exonrecord; (* features in a locus *) } fealist: exonrecordptr; (* features in a locus *) {qqq} feastored: integer; (* current number of features recorded into fealist *) featarget: integer; (* earlier target for any gene names or numbers *) featurecount: integer; (* count of features in the entry so far *) firstnumber: boolean; (* true when the first number is about to be read *) genename: trigger; (* gene names for CDS of the form /gene *) genenumber: trigger; (* gene number for CDS of the form /number *) holdfile: text; (* internal file for rebuilding the exonp *) infeatures: boolean; (* true if inside feature secion of an entry *) geneprediction: trigger; (* a locus contains non-experimental junk *) good: boolean; (* the end point is good unless it is found to be outside the sequence *) hold: integer; (* for reversing start and stop when atcomplement is true *) join: trigger; (* trigger to find the join pattern *) killentry: boolean; (* kill entries that contain bad phrases *) killputative: boolean; (* kill putative entries *) killnotexperimental: boolean; (* kill not_experimental entries *) killgeneprediction: boolean; (* kill gene prediction entries *) killunpublished: boolean; (* kill Unpublished entries *) killpseudo: boolean; (* kill pseudo containing entries *) length: integer; (* length of an exon *) linenumber: integer; (* count of the lines in db *) locus: trigger; (* trigger to find the LOCUS pattern *) locuscount: integer; (* count of locus *) noends: boolean; (* true when printing ends of the CDS *) putative: trigger; (* a locus contains non-experimental junk *) notebegin: trigger; (* a mark for the beginning a note *) noteend: trigger; (* a mark for the end a note *) noteholder: string; (* holder for the last note *) notexperimental: trigger; (* a locus contains non-experimental junk *) numbers: set of '0'..'9'; organism: trigger; (* trigger to find organism key word *) origin: trigger; (* a mark for the ORIGIN: start of sequence *) parameterversion: real; (* version of the program *) {vvv} printon: boolean; (* if true, print the data passing by *) pseudo: trigger; (* a locus contains pseudo genes *) relorient: integer; (* relative orientation of site to piece *) start: integer; (* the start base of an exon *) stop: integer; (* the stop base of an exon *) theAfromrange, theAtorange: integer; (* from/to range: acceptor delila inst *) theDfromrange, theDtorange: integer; (* from/to range: donor delila inst *) thefeature: char; (* the kind of feature to process, CDS or mRNA *) totalfeastored: integer; (* count of exons in entire db *) totalentrycount: integer; (* count of entrys in entire db *) unitcount: integer; (* true when the end of a unit has been found *) unitend: boolean; (* true when the end of a unit has been found *) unknownstring: string; (* string for unknown genes *) unpublished: trigger; (* a locus contains junk *) usenotes: boolean; (* use notes to name exons *) procedure grabspecies(var db, f: text); (* get the species name from db into f. *) var c: char; (* a character read from db *) begin { writeln(output,'grabspecies'); } resettrigger(organism); reset(db); while (not eof(db)) and (not organism.found) do begin if eoln(db) then readln(db) else begin read(db, c); testfortrigger(c, organism); if organism.found then begin skipblanks(db); read(db,c); write(f,c,'.'); skipnonblanks(db); skipblanks(db); (* 2006 Oct 25: copyline fails for organisms with a strain name: ORGANISM Escherichia coli K12 because it retains the space and Delila can't handle it. copyline(db,f); *) (* replace spaces with dashes: *) while not eoln(db) do begin if db^ <> ' ' then f^ := db^ else f^ := '-'; put(f); get(db) end; end; end; end; if not organism.found then begin writeln(output,'ORGANISM line missing in db'); halt end; end; procedure startinst(var f, db: text; sitetype: char); (* Start Delila instructions in file f. If sitetype = 'a' then label it 'acceptor'. If sitetype = 'd' then label it 'donor'. Use the db to get the organism/chromosome names *) begin rewrite(f); write(f,'title "'); if sitetype = 'a' then write(f,'acceptor'); if sitetype = 'd' then write(f,'donor'); if sitetype = 'e' then write(f,'exon'); writeln(f,'";'); writeln(f,'(* exon ',version:4:2,' *)'); writeln(f,'default out-of-range reduce-range;'); writeln(f,'default numbering piece;'); write(f,'organism '); grabspecies(db,f); writeln(f,';'); write(f,'chromosome '); grabspecies(db,f); writeln(f,';'); writeln(f); end; procedure nextline; (* move to the next line *) begin if eoln(db) then begin readln(db); {writeln(output); } linenumber := succ(linenumber); if debug then if printon then writeln(output); columnposition := 0; end end; procedure nextcharacter(var f: text; var cprevious, c: char); (* read the next character c from file f, keep track of the previous character. *) begin cprevious := c; read(f,c); columnposition := succ(columnposition); if debug then if printon then write(output,c); { write(output,c); if eoln(db) then begin writeln(output); write(output,'| '); end; } end; procedure readaninteger(var f: text; first: char; var n: integer); (* read an integer from f, protecting ourselves from stupidity. first is the first character of the number *) begin {write(output,'readaninteger: ');} n := ord(first)- ord('0'); if not(first in numbers) then begin writeln(output,'readaninteger: bad number!'); writeln(output,'first is: ',first); halt end; while f^ in numbers do begin nextcharacter(f, cprevious, c); n := 10*n + ord(c)- ord('0'); end; {writeln(output,' becomes ',n:1);} end; procedure sign(var f: text; x: integer); (* write the sign of x *) begin if x < 0 then write(f,'-') else write(f,'+'); end; procedure thename(var f: text; e: integer); (* write the exon name e to file f. There are three formats, depending on whether the /gene and /number have been defined: U00096.CDS.11356-10643 - - M95740.IDUA.exon-3 /gene /number M95740.IDUA.427-512 /gene - The accession number or coordinate is important to keep so that features are identified properly in lister. Presumably gene names are unique in an entry, though this may not be true. It turns out that using the gene name alone will fail since exons will all have the same name! This is why the number is needed. *) var start, stop: integer; (* the range of the feature *) begin (* if it is an intron, then the start and stops are recorded the other way to make donor and acceptor identification easy, but with this new naming feature we have to invert it again... *) { (* original way of accessing the fealist before 2005 Dec 05 *) if fealist[e].printasexon then begin qqq } if fealist^.a[e].printasexon then begin start := fealist^.a[e].thestart; stop := fealist^.a[e].thestop end else begin start := fealist^.a[e].thestop; stop := fealist^.a[e].thestart end; writestring(f,accname); if fealist^.a[e].genenamefound then begin write(f, '.'); writestring(f, fealist^.a[e].genenamestring); end; if fealist^.a[e].genenumberfound then begin write(f, '.#'); writestring(f, fealist^.a[e].genenumberstring); end; if (not fealist^.a[e].genenamefound) and (not fealist^.a[e].genenumberfound) then begin { write(f, '.CDS:'); } if thefeature = 'r' then write(f, '.mRNA:') else write(f, '.CDS:'); write(f, start:1,'-',stop:1); end; if equalstring(fealist^.a[e].genenamestring, unknownstring) then begin write(f, '.',start:1,'-',stop:1); end; (* give the note if there was one *) if usenotes then begin if (fealist^.a[e].notenumber > 0) then begin write(f, '.'); writestring(f, fealist^.a[e].notestring); { write(f, 'e=',e:1); write(output, 'thename: e=',e:1,' '); writestring(output, fealist^.a[e].notestring); write(output,' notenumber = ',fealist^.a[e].notenumber:1); writeln(output); } {zzznote} end; end; end; (* thename *) procedure writeDelilainstructions(var inst: text; location, thefromrange, thetorange: integer; e: integer); (* write the e'th exon entry at the given location to the inst file *) begin write(inst,'piece '); writestring(inst,accname); write(inst,';'); write(inst,' name "'); thename(inst, e); write(inst,'";'); write(inst,' get from ',location:1); write(inst, ' '); sign(inst,thefromrange*relorient); if -relorient > 0 then write(inst,abs(thefromrange):1) else write(inst,abs(thefromrange):1); write(inst, ' to same '); sign(inst,thetorange*relorient); if relorient > 0 then write(inst,abs(thetorange):1) else write(inst,abs(thetorange):1); write(inst,' direction '); sign(inst,relorient); writeln(inst,';'); end; (******************************************************************************) (* Sorting routines *) function lessthan(a, b: position): boolean; (* see quicksort *) begin { This causes a segmentation error for unknown reasons: case fealist^.a[a].complement of true: case fealist^.a[b].complement of true: lessthan := fealist^.a[a].thestop < fealist^.a[b].thestart; false: lessthan := fealist^.a[a].thestop < fealist^.a[b].thestop; end; false: case fealist^.a[b].complement of true: lessthan := fealist^.a[a].thestart < fealist^.a[b].thestop; false: lessthan := fealist^.a[a].thestart < fealist^.a[b].thestart; end; end; } lessthan := fealist^.a[a].thestart < fealist^.a[b].thestart; end; procedure swap(a,b: position); (* see quicksort *) var hold: exonrecord; begin hold:=fealist^.a[a]; fealist^.a[a]:=fealist^.a[b]; fealist^.a[b]:=hold; (*;write(output,'a=',a:1,', b=',b:1); print (@ for testing *) end; (* begin module quicksort *) procedure quicksort(left, right: position); (* quick sort a list between positions left and right, into ascending order. a position is simply a scalar of the form 0..max. the array to be sorted is dimensioned 1..max. (the difference in the ranges is important to the correct operation of the sort...) two external routines are used: function lessthan(a, b: position): boolean is a generalized test for value-at-a < value-at-b. procedure swap(a, b: position) switches the items at positions a and b. since these routines are external, the procedure is general. this procedure taken from the book 'algorithms + data structures = programs' by niklaus wirth, prentice-hall, inc., englewood cliffs, n.j.(1976), pp. 76-82 *) var lower, upper: position; (* the positions looked at currently *) center: position; (* the rough center of the region being sorted *) begin lower := left; center := (left + right) div 2; upper := right; repeat while lessthan(lower, center) do lower := succ(lower); while lessthan(center, upper) do upper := pred(upper); if lower <= upper then begin (* keep track of the center through the map: *) if lower = center then center:=upper else if upper = center then center:=lower; swap(lower, upper); lower := succ(lower); upper := pred(upper) end until lower > upper; if left < upper then quicksort(left, upper); if lower < right then quicksort(lower, right) end; (* end module quicksort taken from progmod.p *) procedure sortexons; (* sort the exons in the fealist^.a by their first coordinate *) begin quicksort(1, feastored); end; (******************************************************************************) procedure printexons; (* Print the exons listed in the fealist^.a, then clear the list. *) var currentexons: integer; (* number of exons printed right now *) duplicate: boolean; (* check for duplicate site in previous feature *) current: integer; (* current base location for instruction *) previous: integer; (* previous base location for instruction *) e: integer; (* index for fealist^.a *) firstexon: integer; (* the first exon to print *) lastexon: integer; (* the last exon to print *) begin if debug then if eoln(db) then writeln(output); if debug then writeln(output); if debug then write(output,'****************************'); if debug then writeln(output, '---> ', feastored:4,' ',start:4,' ',stop:4,' ',killentry); firstexon := 1; lastexon := feastored; currentexons := lastexon - firstexon + 1; if (currentexons > 0) and (not killentry) then begin sortexons; (* only sort if there are exons to sort ... *) totalfeastored := totalfeastored + currentexons; totalentrycount := totalentrycount + 1; for e := firstexon to lastexon do begin { writeln(output); writeln(output,'PRINTEXONS e = ',e:1,'=========================================='); } start := fealist^.a[e].thestart; stop := fealist^.a[e].thestop; if fealist^.a[e].printasexon then begin if fealist^.a[e].complement then length := start - stop + 1 else length := stop - start + 1; if (length <= 0) then begin writeln(lengths,' ERROR: length < 0 for this entry:'); writeln(output,' ERROR: see lengths file'); end; (* write lengths file here *) write(lengths, ' ',length:colwid, ' ',start:colwid,' ',stop:colwid); if fealist^.a[e].complement then write(lengths, ' ','-':colwid) else write(lengths, ' ','+':colwid); write(lengths, ' ',e:colwid, ' ',fealist^.a[e].FEA:colwid, ' ',locuscount:colwid ); write(lengths,' '); writestring(lengths,accname); writeln(lengths); end; if fealist^.a[e].complement then relorient := -1 else relorient := +1; (* write Delila instructions here *) (* For intron ends, the location of the splice junction zero coordinate is that given in the feature table. For exons it is one off, depending on the relative orientation, relorient. *) with fealist^.a[e] do begin if getthestart then begin (* look for duplicates immediately preceeding *) { writeln(output,'========================================<<<<<<<<<<<<<<<<<<<<'); } if e > firstexon then begin previous := fealist^.a[e-1].thestart; if fealist^.a[e-1].printasexon then if fealist^.a[e].complement then previous := previous - 1 (* relorient *) else previous := previous + 1; if fealist^.a[e].printasexon then current := thestart - relorient else current := thestart; duplicate := (current = previous) and fealist^.a[e-1].getthestart; { writeln(output,'previous=',previous:1); writeln(output,'current=',current:1); } end else duplicate := false; { writeln(output,'fealist^.a[e].thestart = ', fealist^.a[e].thestart:1); writeln(output,'fealist^.a[e-1].thestart = ', fealist^.a[e-1].thestart:1); } { writeln(output,'========================================<<<<<<<<<<<<<<<<<<<<'); } if not duplicate then begin {zzzname} { writeln(output); writeln(output,'about to writeDelilainstructions: ainst (e=',e:1,')'); } {zzzggg} { write(ainst,'(* e=',e:1,' "'); thename(ainst,e); writeln(ainst,'" *)'); } if fealist^.a[e].printasexon then writeDelilainstructions(ainst,thestart - relorient, theAfromrange, theAtorange, e) else writeDelilainstructions(ainst,thestart, theAfromrange, theAtorange, e) end; end else begin if fealist^.a[e].printasexon then writeln(output,thestart:1, ' exon start was dropped from ainst:', ' end of CDS or sequence') else writeln(output,thestart:1, ' intron stop was dropped from ainst: end of sequence'); end; if getthestop then begin (* look for duplicates immediately preceeding *) { writeln(output,'========================================>>>>>>>>>>>>>>>>>>>>'); } if e > firstexon then begin previous := fealist^.a[e-1].thestop; if fealist^.a[e-1].printasexon then if fealist^.a[e].complement then previous := previous - 1 (* relorient *) else previous := previous + 1; if fealist^.a[e].printasexon then current := thestop - relorient else current := thestop; duplicate := (current = previous) and fealist^.a[e-1].getthestop; { writeln(output,'previous=',previous:1); writeln(output,'current=',current:1); } end else duplicate := false; { writeln(output,'fealist^.a[e].thestop = ', fealist^.a[e].thestop:1); writeln(output,'fealist^.a[e-1].thestop = ', fealist^.a[e-1].thestop:1); } { writeln(output,'========================================>>>>>>>>>>>>>>>>>>>>'); } if not duplicate then begin {zzzname} { writeln(output); writeln(output,'about to writeDelilainstructions: dinst (e=',e:1,')'); } {zzzggg} { write(dinst,'(* e=',e:1,' "'); thename(dinst,e); writeln(dinst,'" *)'); } if fealist^.a[e].printasexon then writeDelilainstructions(dinst,thestop + relorient, theDfromrange, theDtorange, e) else writeDelilainstructions(dinst,thestop, theDfromrange, theDtorange, e) end; end else begin if fealist^.a[e].printasexon then writeln(output,thestop:1, ' exon stop was dropped from dinst:', ' end of CDS or sequence') else writeln(output,thestop:1, ' intron start was dropped from dinst: end of sequence'); end; end; if fealist^.a[e].printasexon then begin (* write einst: exon instructions *) with fealist^.a[e] do begin write(einst,'piece '); writestring(einst,accname); write(einst,'; '); if relorient = +1 then begin {zzzname} write(einst,'name "'); thename(einst, e); write(einst,'"; '); write(einst,'get from '); write(einst,thestart:1); write(einst,' '); sign(einst,theAfromrange); write(einst,abs(theAfromrange):1); write(einst,' to '); write(einst,thestop:1); write(einst,' '); sign(einst,theDtorange); write(einst,abs(theDtorange):1); writeln(einst, ' direction +;') end else begin {zzzname} write(einst,'name "'); thename(einst, e); write(einst,'"; '); write(einst,'get from '); write(einst,thestop:1); write(einst,' '); sign(einst,-theDtorange); write(einst,abs(theDtorange):1); write(einst,' to '); write(einst,thestart:1); write(einst,' '); sign(einst,-theAfromrange); write(einst,abs(theAfromrange):1); writeln(einst, ' direction +;') end end; (* write exonfeatures file here with this format: define "J02833exon467-615" "-" "<]" "[>" 0 148 *) write(exonfeatures, 'define "'); thename(exonfeatures, e); write(exonfeatures, '" "-" "<]" "[>" 0 ',(length-1):1); writeln(exonfeatures); (* call it immediately; @ J02833 467 +1 "J02833exon467-615" "" *) {vvv} { write(exonfeatures, '@ BUBAvvv '); } write(exonfeatures, '@ '); writestring(exonfeatures,accname); write(exonfeatures, ' ',start:1,' ',relorient:1,' "'); thename(exonfeatures, e); write(exonfeatures, '" ""'); writeln(exonfeatures); end; end; end; feastored := 0; (* zap the entries whether or not they were printed *) end; (* printexons *) procedure addfeature(featureisexon: boolean; getstart, getstop: boolean); (* see if the feature already exists in the list for this locus. If not, add it to the list. Mark whether to get the start and stops. This allows ends of CDS to be removed. When featureisexon is false then the feature is not a cds or exon (ie it's an intron), and we should not mark it as an exon. Only instructions for the ends are given. *) var e: integer; (* index for fealist *) previous: integer; (* the original feature previously stored *) good: boolean; (* was the feature unique? *) begin {writeln(output,'addfeature');} if debug then writeln(output); if debug then writeln(output,'addfeature'); if debug then writeln(output,' good unit: ',start:1,'..',stop:1); if debug then writeln(output,' feastored: ',feastored:1); {if debug then writeln(output, 'checking for duplicate exons **********');} e := 1; good := true; if featureisexon then write(output,' CDS/exon ',start:1,'-',stop:1) else write(output,' intron ',start:1,'-',stop:1); {zzznote} while e <= feastored do begin if debug then write(output, ' exon ', e:4,' ', fealist^.a[e].thestart:1,'..',fealist^.a[e].thestop:1); (* if either the start OR the stop is the same as any previous one, toss it out. This not only eliminates duplicates but also drops alternative splicing. In 4167 exons only 8 were dropped for this reason. *) if (allowalternative and ((fealist^.a[e].thestart = start) and (fealist^.a[e].thestop = stop))) or ((not allowalternative) and ((fealist^.a[e].thestart = start) or (fealist^.a[e].thestop = stop))) then begin {write(output, 'allowalternative: ', allowalternative);} { write(output,'exon ',start:1,'-',stop:1, ' was dropped because it is '); } write(output,' was dropped because it is '); if ((fealist^.a[e].thestart = start) and (fealist^.a[e].thestop = stop)) then write(output,'duplicate') else write(output,'alternative'); { writeln(output); writeln(output,'fealist^.a[e].thestart = ', fealist^.a[e].thestart); writeln(output,' start = ', start); writeln(output,' fealist^.a[e].thestop = ', fealist^.a[e].thestop); writeln(output,' stop = ', stop); } good := false; previous := e; (* note where it was in the list *) e := feastored; (* kill the loop rapidly! *) end else if ((fealist^.a[e].thestart = start) or (fealist^.a[e].thestop = stop)) then write(output,' alternative'); (* mark alternatives *) e := succ(e) (* normal loop step *) end; if good then begin feastored := succ(feastored); if feastored > exonmax then begin writeln(output); writeln(output,'exonmax = ',exonmax:1, ' exceeded'); halt end; (* record the exon in the list *) fealist^.a[feastored].complement := atcomplement; fealist^.a[feastored].FEA := FEAcount; fealist^.a[feastored].printasexon := featureisexon; fealist^.a[feastored].genenamefound:= false; fealist^.a[feastored].genenumberfound:= false; fealist^.a[feastored].notenumber:= 0; clearstring(fealist^.a[feastored].notestring); fealist^.a[feastored].featurenumber:= featurecount; { write(output,' vvvCHECK'); } if featureisexon then begin fealist^.a[feastored].thestart := start; fealist^.a[feastored].thestop := stop; fealist^.a[feastored].getthestart := getstart; fealist^.a[feastored].getthestop := getstop; end else begin (* for an intron, reverse the order of the recording: *) fealist^.a[feastored].thestart := stop; fealist^.a[feastored].thestop := start; fealist^.a[feastored].getthestart := getstop; fealist^.a[feastored].getthestop := getstart; end; (* target gene names and numbers to this recent one *) featarget := feastored; end else begin (* Ok so the feature exists - see if we can put current genename and genenumber into it *) { writeln(output); writeln(output, 'rescue mission'); (* first matters first. Do we have anything to contribute? *) writeln(output, 'feastored = ',feastored:1); writeln(output, 'previous = ',previous:1); writeln(output, 'fealist^.a[previous].genenamefound = ', fealist^.a[previous].genenamefound:1); writeln(output, 'fealist^.a[previous].genenumberfound = ', fealist^.a[previous].genenumberfound:1); } (* target gene names and numbers to the previous feature *) featarget := previous; {zzzggg} end; writeln(output); end; (* addfeature *) procedure dojoin; (* do the join function to identify the parts of a CDS. *) var getstart, getstop: boolean; (* whether to get the start or stop of an exon. Don't ever get them if they are at the end of a CDS. *) begin if debug then writeln(output,'[JOIN FOUND]'); writeln(output); (* break the line at the 'join(' *) printon := true; resettrigger(join); {printon := false;} (* go through the join list *) resettrigger(closeparen); unitcount := 0; firstnumber := true; good := true; dotcount := 0; unitend := false; completeunit := false; repeat nextline; if db^=' ' then nextcharacter(db,cprevious,c) (* skip all blanks *) else begin nextline; nextcharacter(db,cprevious,c); if (c = ',') or (c = ')') then begin (* end of the unit *) unitend := true; unitcount := succ(unitcount); {write(output,' unitend, unitcount = ',unitcount:1);} end else begin (* inside of a unit *) if c in numbers then begin if firstnumber then begin readaninteger(db, c, start); if start = stop + 1 then begin killentry := true; writestring(output,accname); (* report the name to output *) writeln(output,' zero length intron containing entry!', ' DELETED'); end; completeunit := false; (* until stop has been read *) firstnumber := false; end else begin readaninteger(db, c, stop); completeunit := true; end end else if c = '.' then begin dotcount := succ(dotcount); end else begin (* any other junk: kill *) {writeln(output,'unit died because of ',c);} good := false; end end; (* end of unit read *) {writeln(output,'dotcount=', dotcount:1);} {writeln(output,'HERE good=', good, ' unitend=',unitend);} if unitend and completeunit and good then if dotcount <> 2 then good := false; if noends and unitend then begin (* kill the end ones *) if unitcount = 1 then good := false; if {unitend and} (c = ')') then good := false; end; { write (output,' unitend=',unitend); write (output,' completeunit=',completeunit); writeln(output,' good=',good); } if unitend and completeunit and good then begin (* record the unit *) if atcomplement then begin (* reverse start and stop *) hold := start; start := stop; stop := hold; end; (* Do another test. Two cases: 1) If we are at the low end of the CDS, and it is the exon start, this is probably not an acceptor and should be removed. 2) If we are at the high end of the CDS, and it is the exon stop, this is probably not a donor and should be removed. *) getstart := true; getstop := true; (* test for low end of the CDS *) if unitcount = 1 then getstart := false; (* test for high end of the CDS *) if c = ')' then getstop := false; addfeature(true, getstart, getstop); end; if unitend then begin (* reeset for next time *) {writeln(output,'UNITEND');} unitend := false; good := true; dotcount := 0; firstnumber := true; completeunit := false; end; end; testfortrigger(c,closeparen); until closeparen.found; (* join list finishes *) writeln(output,')'); {vvv dojoin} resettrigger(closeparen); atcomplement := false end; (* dojoin *) procedure dofeature(featureisexon: boolean); (* get the feature information. Cases of "<" or ">" are not included since they are off the end of the piece. exon 143..251 /gene="hMLH1" /number=14 When featureisexon is false then the feature is not a cds or exon (ie it's an intron), and we should not mark it as an exon. Only instructions for the ends are given. *) var done: boolean; (* done looking for the word 'complement' *) getstart, getstop: boolean; (* whether to get the start or stop of an exon. *) begin {writeln(output,'dofeature *************************');} start := 0; stop := 0; getstart := false; getstop := false; skipblanks(db); read(db,c); atcomplement := false; if c = 'c' then begin (* handle the complement! *) {zzzccc} (* make sure it is the word "complement" *) resettrigger(complement); (* first character already is past: *) complement.state := succ(complement.state); done := false; while not done do begin if eoln(db) then done := true else begin read(db,c); {write(output,c);} testfortrigger(c,complement); if complement.found then begin atcomplement := true; done := true; read(db,c); (* prepare for the next part *) end; end; end end; if c in numbers then begin (* ie, it's not "<" *) readaninteger(db, c, start); getstart := true; {writeln(output,' start =',start:1);} end else if c = '<' then begin (* get the number but keep getstart false *) read(db,c); if c in numbers then readaninteger(db, c, start) else start := 0; end;; (* skip the ".." *) read(db,c); if c = '.' then begin read(db,c); {writeln(output,' c1="',c,'"');} if c = '.' then begin {writeln(output,' c2="',c,'"');} read(db,c); {writeln(output,' c3="',c,'"');} if c = '>' then begin read(db,c); (* skip on *) if c in numbers then readaninteger(db, c, stop) else stop := maxint; end else if c in numbers then begin (* ie, it's not ">" *) readaninteger(db, c, stop); getstop := true; {writeln(output,' stop =',stop:1);} end; end; end; readln(db); {writeln(output,'start: ',getstart);} {writeln(output,' stop: ',getstop);} (* kill weird cases. This comes up in D13897 where there is: exon 1531..(1770.1773) Such things should just be dropped. *) if start = 0 then getstart := false; if stop = 0 then getstart := false; if atcomplement then begin {zzzccc} (* reverse start and stop *) hold := start; start := stop; stop := hold; end; if getstart or getstop then addfeature(featureisexon, getstart, getstop); {writeln(output,'***********************************');} end; procedure readparameters; (* read the parameters *) begin reset(exonp); (* insist on new parameter for version *) if exonp^ in ['0','1','2','3','4','5','6','7','8','9','0'] then begin readln(exonp, parameterversion); end else begin writeln(output,'No version parameter, upgrading exonp.'); rewrite(holdfile); writeln(holdfile, version:4:2, ' ', 'version of exon that this parameter file is designed for.'); {vvv} if copylines(exonp,holdfile,8) <> 8 then begin writeln(output, 'Cannot copy 8 lines of exonp:'); writeln(output, 'Build it yourself.'); halt; end; writeln(holdfile, 'CDNA c/C: CDNA, m/R/r: mRNA'); (* preserve notes: *) while not eof(exonp) do copyaline(exonp,holdfile); reset(holdfile); rewrite(exonp); while not eof(holdfile) do copyaline(holdfile,exonp); reset(exonp); readln(exonp, parameterversion); end; readln(exonp,c); if c = 'n' then noends := true else noends := false; readln(exonp,c); if c = 'd' then debug := true else debug := false; readln(exonp, theDfromrange, theDtorange); readln(exonp, theAfromrange, theAtorange); read(exonp,c); if c = 'e' then getexons := true else getexons := false; read(exonp,c); if c = 'i' then getintrons := true else getintrons := false; readln(exonp); readln(exonp,c); if c = 'a' then allowalternative := true else allowalternative := false; read(exonp,c); if c = 'P' then killputative := true else killputative := false; read(exonp,c); if c = 'N' then killnotexperimental := true else killnotexperimental := false; read(exonp,c); if c = 'G' then killgeneprediction := true else killgeneprediction := false; read(exonp,c); if c = 'U' then killunpublished := true else killunpublished := false; read(exonp,c); if c = 'S' then killpseudo := true else killpseudo := false; readln(exonp); if not eoln(exonp) and (not eof(exonp)) then begin read(exonp,c); if c = 'n' then usenotes := true else usenotes := false; end else usenotes := false; readln(exonp); readln(exonp,thefeature); (* allow the user to type mRNA, RNA or rna: *) if thefeature in ['m','r','R'] then thefeature := 'r'; {vvv} end; procedure writeparameters(var f: text); (* read the parameters to file f *) procedure no(body: boolean); begin write(f,'* '); if body then write(f,'no ') else write(f,' '); end; begin writeln(f ,'***********************************************'); writeln(f ,'**************** parameters *******************'); no(noends); writeln(f,'exons on the ends of CDS included.'); writeln(f, '* the DONOR from range: ', theDfromrange:3); writeln(f, '* the DONOR to range: ', theDtorange:3); writeln(f, '* the ACCEPTOR from range: ', theAfromrange:3); writeln(f, '* the ACCEPTOR to range: ', theAtorange:3); no(not getexons); writeln(f ,'exon features included.'); no(not getintrons); writeln(f ,'intron features included.'); no(not allowalternative); writeln(f ,'alternative exons included.'); no(not killputative); writeln(f ,'"putative" entries will be killed.'); no(not killnotexperimental); writeln(f ,'"not_experimental" entries will be killed.'); no(not killgeneprediction); writeln(f ,'"gene prediction" entries will be killed.'); no(not killunpublished); writeln(f ,'"Unpublished" entries will be killed.'); no(not killpseudo); writeln(f ,'"pseudo" entries will be killed.'); no(not usenotes); writeln(f ,'/notes included in names.'); write(f ,'* ', thefeature, ' feature: '); if thefeature = 'r' then writeln(f, 'mRNA') else writeln(f, 'CDNA'); writeln(f ,'***********************************************'); end; (* writeparameters *) procedure killtest( c: char; var t: trigger; n: string; killt: boolean; var killentry: boolean); (* Test whether the character c triggers trigger t. n is the name of the entry. killt is whether the entry should be killed if the trigger t was found. killentry is whether the entry has already been killed. *) begin testfortrigger(c,t); if t.found then begin if n.length > 0 then writestring(output,n) (* report the name to output *) else write(output,'The next entry'); write(output,' will be'); if killt then begin write(output, ' KILLED'); killentry := true; end else write(output, ' KEPT'); write(output, '. It contains: "'); writestring(output,t.seek); writeln(output, '".'); resettrigger(t); end; end; procedure doFEA; (* do CDS or mRNA features *) begin if (columnposition >= 6) and (columnposition <= 20) then testfortrigger(c,FEA); if FEA.found then if infeatures then begin {writeln(output,'CDS FOUND');} skipblanks(db); FEAcount := succ(FEAcount); if debug then writeln(output); if debug then writeln(output,'CDS ',FEAcount:1, ' ======================'); while not eoln(db) do begin nextcharacter(db, cprevious,c); {write(output,c);} write(output,c); testfortrigger(c,complement); if complement.found then atcomplement := true; if atcomplement then begin testfortrigger(c,closeparen); if closeparen.found then begin atcomplement := false end; end; begin (* unique case of a single exon, eg M35391 *) testfortrigger(c,join); if join.found then begin dojoin; end else begin (* If it's a number then just grab the two numbers *) if c in numbers then begin skipblanks(db); readaninteger(db, c, start); read(db, c); if c <> '.' then begin writeln(output,'error in reading CDS or mRNA"',c, '" found instead of "."'); writeln(output,'ord(c) is ',ord(c)); halt end; read(db, c); if c <> '.' then begin writeln(output,'error in reading CDS or mRNA"',c, '" found instead of "."'); halt end; read(db, c); if not (c in numbers) then begin if c = '>' then begin read(db,c); (* skip on *) if c in numbers then readaninteger(db, c, stop) else stop := maxint; end else begin writeln(output,'error in reading CDS or mRNA"',c ,'" is not a number'); halt end end else readaninteger(db, c, stop); { writeln(output,'start ',start:1); writeln(output,'stop ',stop:1); } if atcomplement then begin (* reverse start and stop *) hold := start; start := stop; stop := hold; end; addfeature(true, true, true); end end end end; resettrigger(FEA); end; (* pick up gene name and number into previous CDS *) (* 2007 Dec 10; Isn't that a bad idea? It wrecks the last exon! *) if infeatures and getexons then begin testfortrigger(c,genename); (* Pick up names and numbers only if they are in the same feature *) if genename.found then begin if (fealist^.a[featarget].genenamefound <> true) and (featurecount = fealist^.a[featarget].featurenumber) then begin { writeln(output,'featarget = ',featarget:1,' '); writeln(output,'genename.found -----------'); } { This mechanism was converting the last exon into the gene feature! fealist^.a[featarget].genenamefound:= true; } {vvv some mechanism here causes loss of the last exon!} if thefeature = 'r' then fealist^.a[featarget].genenamefound:= false else begin fealist^.a[featarget].genenamefound:= true; (* read the name into the feature list *) grabquote(db, fealist^.a[featarget].genenamestring); write(output,' gene: '); writestring(output, fealist^.a[featarget].genenamestring); writeln(output); end; { writeln(output,'genename.found ----------- done'); } end; resettrigger(genename); end; {zzzggg} testfortrigger(c,genenumber); if genenumber.found then begin if (fealist^.a[featarget].genenumberfound <> true) and (featurecount = fealist^.a[featarget].featurenumber) then begin { writeln(output,'featarget = ',featarget:1,' '); writeln(output,'genenumber.found -----------'); } fealist^.a[featarget].genenumberfound:= true; (* read the number into the feature list *) grabtoken(db, fealist^.a[featarget].genenumberstring); write(output,' '); writestring(output, fealist^.a[featarget].genenumberstring); writeln(output); { writeln(output,'genenumber.found ----------- done'); } end; resettrigger(genenumber); end end; end; begin (* themain *) writeln(output,'exon ',version:4:2); new(fealist); (* allocate the feature list memory on the heap see technical notes for the reason for doing this. *) readparameters; rewrite(lengths); writeln(lengths,'* exon ',version:4:2); writeparameters(lengths); writeparameters(output); {vvv} if debug then writeln(output,'DEBUGGING'); (* 1 2 3 4 5 *) (* 12345678901234567890123456789012345678901234567890 *) filltrigger( locus, 'LOCUS '); filltrigger( organism, 'ORGANISM '); filltrigger( accession, 'ACCESSION '); filltrigger( features, 'FEATURES '); if thefeature = 'r' then filltrigger( FEA, 'mRNA ') else filltrigger( FEA, 'CDS '); filltrigger( genename, '/gene=" '); filltrigger( genenumber, '/number= '); filltrigger( exon, 'exon '); filltrigger( intron, 'intron '); filltrigger( join, 'join( '); filltrigger( closeparen, ') '); filltrigger( comma, ', '); filltrigger( complement, 'complement( '); filltrigger( putative, 'putative '); filltrigger(notexperimental, 'not_experimental '); filltrigger(geneprediction, 'gene prediction '); filltrigger(unpublished, 'Unpublished '); filltrigger(pseudo, 'pseudo '); filltrigger(entryend, '// '); filltrigger(notebegin, '/note=" '); filltrigger(noteend, '" '); filltrigger(origin, 'ORIGIN '); fillstring(unknownstring, 'unknown '); resettrigger(FEA); resettrigger(exon); resettrigger(intron); resettrigger(locus); resettrigger(accession); resettrigger(features); resettrigger(genename); resettrigger(genenumber); resettrigger(join); resettrigger(closeparen); resettrigger(complement); resettrigger(comma); resettrigger(putative); resettrigger(notexperimental); resettrigger(geneprediction); resettrigger(unpublished); resettrigger(pseudo); resettrigger(entryend); resettrigger(notebegin); resettrigger(noteend); resettrigger(origin); locuscount := 0; FEAcount := 0; totalentrycount := 0; totalfeastored := 0; feastored := 0; infeatures := false; linenumber := 1; (* we are already at the first line *) numbers:=['0','1','2','3','4','5','6','7','8','9']; killentry := false; writeln(lengths,'* columns'); (* 12345678 *) writeln(lengths, '* length', ' start', ' stop', ' dir', ' exon#', ' FEA#', ' locus#', ' accession'); startinst(dinst,db,'d'); startinst(ainst,db,'a'); startinst(einst,db,'e'); rewrite(exonfeatures); writeln(exonfeatures,'* exon: exonfeatures ',version:4:2); reset(db); c := ' '; (* forces initial cprevious to be blank *) while not eof(db) do begin columnposition := 0; while not eoln(db) do begin cprevious := c; nextcharacter(db, cprevious,c); (* Count Features *) if infeatures and (columnposition = 6) and (c <> ' ') then begin (* We are (almost certainly, damn GenBank for not defining things precisely!) at the start of a feature *) featurecount := succ(featurecount); resettrigger(genename); resettrigger(genenumber); { writeln(output,'featurecount = ',featurecount:1); } end; (* LOCUS *) testfortrigger(c,locus); if locus.found then begin killentry := false; locuscount := succ(locuscount); FEAcount := 0; clearstring(accname); infeatures := false; atcomplement := false; stop := -maxint; (* be sure initial intron length is large *) start := -maxint; (* just to be clean *) featurecount := 0; if debug then writeln(output); if debug then writeln(output,'LOCUS ',locuscount:1,' ======================'); end; (* /note *) testfortrigger(c,notebegin); if notebegin.found then begin (* skip all notes. They can contain deadly triggers that mess up the program. *) if debug then write(output,'skipping note:', c); clearstring(noteholder); if feastored > 0 then begin (* if there is at least one exon so far, record notes *) if featurecount = fealist^.a[feastored].featurenumber then fealist^.a[feastored].notenumber := succ(fealist^.a[feastored].notenumber); end; while not noteend.found do begin nextcharacter(db, cprevious,c); (* ONLY record into previous fealist^.a item IF we are on the same object. Otherwise record into holding string *) if feastored > 0 then begin (* if there is an exon, record notes *) (* only record the first note *) (* only record if it is from the same feature *) if (fealist^.a[feastored].notestring.length < maxstring) and (c <> '"') and (featurecount = fealist^.a[feastored].featurenumber) then with fealist^.a[feastored].notestring do begin length := succ(length); letters[length] := c; { write(output,'+',e:1,'@'); write(output,c); } end else with noteholder do begin {zzzuuu} (* currently this is not used, but the code for now *) (* check for length solved the U41218 bug *) if length < maxstring then begin length := succ(length); letters[length] := c; { writeln(output,'noteholder.length = ',noteholder.length:1); write(output,'noteholder = '); writestring(output,noteholder); writeln(output,'"'); } end; end; end; if debug then write(output,c); if debug then if eoln(db) then writeln(output); testfortrigger(c,noteend); end; if debug then writeln(output); resettrigger(notebegin); resettrigger(noteend); end; (* END OF ENTRY *) testfortrigger(c,entryend); if entryend.found then begin printexons; (* print previously found exons *) end; {write(output,c);} (* START OF ENTRY *) testfortrigger(c,accession); if accession.found then begin (* capture the accession string *) grabtoken(db,accname); writestring(output,accname); (* report the name to output *) writeln(output); if accname.length <= 0 then begin writeln(output,'MISSING ACCESSION STRING'); end; resettrigger(accession); if debug then write(output,'ACCESSION '); if debug then writestring(output, accname); if debug then writeln(output,' ======================'); end; (* test for bad entry markers and kill them: *) if not killentry then begin killtest(c,putative, accname,killputative, killentry); killtest(c,notexperimental,accname,killnotexperimental,killentry); killtest(c,geneprediction ,accname,killgeneprediction ,killentry); killtest(c,unpublished ,accname,killunpublished ,killentry); killtest(c,pseudo ,accname,killpseudo ,killentry); end; testfortrigger(c,features); if features.found then begin {writeln(output,'FEATURE found'); } infeatures := true; end; {zzzname} { This is a possible solution to the CDS regions suppressing the note information. It is not very good because CDS is more "reliable" according to David Lipman. ALSO this routine looks for the /gene and /note! if not usenotes then } doFEA; if infeatures then if getexons then begin testfortrigger(c,exon); if exon.found then begin dofeature(true); resettrigger(exon); end; end; if infeatures then if getintrons then begin testfortrigger(c,intron); if intron.found then begin dofeature(false); resettrigger(intron); end; end; (* Sequence: skip it extremely quickly!! *) testfortrigger(c,origin); if origin.found then begin writeln(output,'Skipping sequence ...'); readln(db); done := false; while not done do begin read(db, c); if c = '/' then begin read(db, c); if c = '/' then done := true end; (* next line handles the readln(db); *) end; writeln(output,'... done skipping sequence'); end; end; nextline; end; printexons; (* finish off previous exon list *) writeln(output,totalentrycount:1,' entry(s) have features'); writeln(output,totalfeastored:1,' features processed'); writeln(output,featurecount:1,' features'); writeln(lengths,'* ',totalentrycount:1,' entry(s) have features'); writeln(lengths,'* ',totalfeastored:1,' features processed'); writeln(lengths,'* ',featurecount:1,' features'); end; (* end module exon.themain *) begin themain(exonp, db, lengths, dinst, ainst, einst, exonfeatures); 1: end.