Sequence logos
are a graphical technique for displaying a summary of
a set of aligned sequences. They were invented
by Tom Schneider and his first high school student Mike Stephens.
The original paper is available.
WebLogo
is a web-based server to create sequence logos,
written and supported by Steve Brenner's group.
Although WebLogo is highly useful for biologists to generate
logos, like any other tool it can be misused.
Below are recommendations for proper use of the logos so that
they provide useful data for further studies.
Links to the
Glossary are provided.
Please follow the links provided for more detail on each point.
-
Coordinates:
Choose a sensible
coordinate system. The
zero coordinate
is used in
sequence walkers, so it is important to have a zero somewhere in the
sequence. Usually we choose a well-conserved base.
See the glossary entry on
binding site symmetry for further information.
-
Bits:
-
Range:
Before producing a final figure, use a
range larger than the region you are interested in.
Example: -200 to +200 bases around a binding site.
This allows you to see whether you have cut off part
of a binding site.
It also makes the noisiness of your logo clear: outside the site,
the variation should be about the size of the error bars.
This will help you avoid overinterpreting the result.
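The range recommendation can be sketched in a few lines of Python. The function name `window_around` and the 0-based `zero_index` convention are assumptions made for this illustration; they are not part of WebLogo or of the Delila system. The sketch cuts a window around a chosen zero base and refuses to return a silently truncated range.

```python
def window_around(sequence, zero_index, start=-200, end=200):
    """Return (coordinates, bases) for the range start..end around the zero base.

    zero_index is the 0-based position in `sequence` that is assigned
    coordinate 0 (a hypothetical convention for this sketch).
    """
    lo = zero_index + start
    hi = zero_index + end
    # Fail loudly rather than report a shorter range than was requested.
    if lo < 0 or hi >= len(sequence):
        raise ValueError("requested range extends beyond the sequence")
    coordinates = list(range(start, end + 1))
    bases = sequence[lo:hi + 1]
    return coordinates, bases


# Toy example with a short sequence and a small range.
coords, bases = window_around("ACGTACGTAC", zero_index=4, start=-3, end=3)
```

Keeping the coordinates alongside the bases makes it easy to label the logo axis so that the zero base is visible.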
-
Alignment:
-
Report the
alignment you used so that others can reproduce your logo!
-
Give the exact source of each sequence
(GenBank accession number and version).
-
Give the exact coordinates you used. Do not make your reader
depend on the sequence to locate the sites. We have had cases
where the given sequences in E. coli were ambiguous.
This prevented us from extracting the sequences ourselves
to analyze ranges around the site larger than those initially provided.
-
Do not give partial sequences or variable-length sequences
(unless the sequence does not exist, as on the 5' end of an mRNA).
That is, don't embed your model of the sites into the reporting
of the alignment.
-
A simple but precise way to express aligned sequences is with
Delila instructions.
-
Number:
Publish the number of sequences used to create the logo,
preferably on the logo image itself.
Providing a logo without indicating how many sequences
are involved makes it impossible to judge how much to trust
the image.
-
Information:
Report the total information content of the logo.
For DNA, RNA (and perhaps protein) binding sites, this is an
important number called
Rsequence.
It is generally related to the
size of the genome and number of sites.
See
the paper on Ev and
run the Evj program
to see how this works.
The total information is also essential for computing the
efficiency.
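As a rough sketch of how the total information can be computed from an alignment, the Python below sums, over positions, the maximum uncertainty minus the observed uncertainty, with the approximate small-sample correction e(n) = (s - 1) / (2 ln(2) n). The function names are hypothetical, the alignment is assumed to contain only the four bases with no gaps, and the approximate correction is an assumption that is only reasonable for larger n; this is not the code used by WebLogo or the Delila programs. The second function gives the quantity log2(G / gamma) for a genome with G possible binding positions and gamma sites, which is the number Rsequence is usually compared against.

```python
import math
from collections import Counter


def rsequence(aligned, alphabet="ACGT"):
    """Return (per-position information, total information) in bits."""
    n = len(aligned)
    if n == 0 or len({len(seq) for seq in aligned}) != 1:
        raise ValueError("need a non-empty set of equal-length sequences")
    s = len(alphabet)
    # Approximate small-sample correction (assumed form, see lead-in).
    e_n = (s - 1) / (2 * math.log(2) * n)
    per_position = []
    for column in zip(*aligned):
        counts = Counter(column)
        # Observed uncertainty H at this position, in bits.
        h = -sum((c / n) * math.log2(c / n) for c in counts.values())
        per_position.append(math.log2(s) - (h + e_n))
    return per_position, sum(per_position)


def rfrequency(genome_positions, number_of_sites):
    """Bits needed to pick number_of_sites out of genome_positions."""
    return math.log2(genome_positions / number_of_sites)


# Toy alignment; real logos need many more sequences than this.
sites = ["ACGT", "ACGA", "ACTT", "AGGT"]
per_position, total = rsequence(sites)
```

With so few sequences the correction term is large, which is one reason to publish the number of sequences and the error bars alongside the total.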
-
Error bars:
Publish error bars.
Without these one cannot tell how good the logo is.
Publish the error on the total information too.
The total error is important for computing the
efficiency.
-
Symmetry:
If your binding site is symmetric,
publish a symmetrical sequence logo.
See the LexA example at the top of the page.
You can publish an asymmetric site for a dimeric protein
if you can show statistically that the asymmetric site has
more information or if it is correlated with a particular direction
such as the direction of transcription.
Arbitrary orientation in the alignment is not an acceptable practice.
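One way to avoid an arbitrary orientation for a symmetric site is to include every sequence together with its reverse complement, so that neither orientation is favored. The sketch below assumes each aligned sequence is a window centered on the axis of symmetry, so that its reverse complement stays in register; the helper names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def symmetrize(aligned):
    """Return each site followed by its reverse complement.

    Assumes every sequence is centered on the symmetry axis.
    """
    both = []
    for seq in aligned:
        both.append(seq)
        both.append(reverse_complement(seq))
    return both


both_orientations = symmetrize(["AACG", "TACG"])
```

A logo made from the resulting set is symmetric by construction: the column at coordinate +x is the complement of the column at -x.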
-
Sine wave:
For DNA (and even RNA!) put a sine wave on the logo and align it
with major and minor grooves. This makes interesting predictions!
See the papers:
-
oxyr - Reading of DNA Sequence Logos:
Prediction of Major Groove Binding by Information Theory. See also
How To Read Sequence Logos.
-
baseflip - Bases that do not match the sine wave can represent
abnormal structures or base flipping.
-
repan3 - An experiment that suggests DNA base flipping by the
bacteriophage P1 repA protein
-
flexrbs - Ribosome binding sites in E. coli
have a region 5' to the initiation codon, the
Shine-Dalgarno (SD), that base pairs to the 3' end of the 16S rRNA
forming a helix. The logo of the SD appears to
follow a sine wave, implying a helical structure.
-
flexprom - The sigma70 subunit of E. coli
RNA polymerase can be aligned at the -35 region to a co-crystal
structure. This allows determination of the face of the DNA where the
-10 contacts the polymerase and reveals a base that is probably
flipping out of the DNA at transcriptional initiation.
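To illustrate what putting a sine wave on the logo involves, the sketch below computes the height of a cosine wave at each coordinate so it can be drawn over the stacks. The wavelength of 10.6 bases (a commonly used helical repeat for B-form DNA), the 2-bit amplitude, and the function and parameter names are all assumptions for this illustration; choose the crest position from your own data, for example at a position you believe faces the major groove.

```python
import math


def wave_height(coordinate, crest=0.0, wavelength=10.6, amplitude=2.0):
    """Height in bits of a cosine wave that peaks at `crest`.

    The wave runs between 0 and `amplitude`; all defaults are
    assumptions for this sketch, not values taken from any program.
    """
    phase = 2 * math.pi * (coordinate - crest) / wavelength
    return (amplitude / 2) * (1 + math.cos(phase))


# Heights for a logo covering coordinates -10 to +10.
heights = [wave_height(x, crest=0.0) for x in range(-10, 11)]
```

Positions whose information is well above the wave where the wave is low are the ones worth a closer look.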
-
Avoid consensus sequences:
Despite the implication of the title of the original paper,
sequence logos are NOT consensus sequences!
Note that one can not only read the consensus sequence (most frequent
base at every position) from the top of the logo but
one can also read the anti-consensus sequence (least frequent
base at every position) from the bottom of the logo.
One can also read everything in between.
So logos, in themselves, do not represent a consensus.
See this paper:
Consensus Sequence Zen.
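The point can be made concrete with a small sketch that reads both the consensus and the anti-consensus from the same counts that a logo displays. The function name is hypothetical, and the tie rule (alphabet order) is an arbitrary choice for this illustration; a tie is itself a sign that a single letter cannot summarize the position.

```python
from collections import Counter


def consensus_and_anticonsensus(aligned, alphabet="ACGT"):
    """Return (most frequent, least frequent) base at every position."""
    consensus, anti = [], []
    for column in zip(*aligned):
        counts = Counter(column)
        # Stable sort: bases with equal counts keep their alphabet order.
        ranked = sorted(alphabet, key=lambda b: counts[b], reverse=True)
        consensus.append(ranked[0])
        anti.append(ranked[-1])
    return "".join(consensus), "".join(anti)


# One position with counts A=4, C=3, G=2, T=1.
column_data = ["A"] * 4 + ["C"] * 3 + ["G"] * 2 + ["T"]
top, bottom = consensus_and_anticonsensus(column_data)
```

Both strings throw away the frequencies in between, which is exactly the information the logo keeps.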
-
Publish the raw sequence data used to make the logo.
This allows others to reconstruct your sequence logo and to make
computations on it.
Give the sources of the data. This can be supplementary material
or made available on the web.
- Notes on using the
WebLogo 3 Server
from the
WebLogo 3: User's Manual
-
Avoid Relative Entropy.
If you use
relative entropy,
then your results
WILL NOT BE IN BITS, which is a serious mistake.
The simplest way to see this is to consider the states of a coin.
A coin has only two states - heads and tails.
(We ignore the possibility of balancing on the edge as this
will not be stable in noisy situations.)
A coin can only store 1 bit of information.
It cannot store more than 1 bit of information since
there are only two states.
For the four nucleotides, the maximum information is
log2 4 = 2 bits.
Yet the relative entropy measure can give values more than this.
It is clear that the information needed to describe
the sequence patterns never takes more than 2 bits,
so the relative entropy is not a measurement in bits.
If you use relative entropy then your results will not be comparable
to energy because that comparison depends on using actual bits;
see the papers
ccmm and
edmm.
Furthermore, the
isothermal efficiency
cannot be correctly computed.
Other workers will have to throw out your work and start over
from the raw sequences.
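A small numerical sketch shows how the relative entropy can exceed the 2-bit maximum for nucleotides. The background composition used here is a made-up, deliberately skewed example, and the function names are hypothetical.

```python
import math


def relative_entropy(freqs, background):
    """Sum of p * log2(p / q) over the bases with p > 0."""
    return sum(p * math.log2(p / background[b])
               for b, p in freqs.items() if p > 0)


def information_bits(freqs):
    """Maximum uncertainty minus observed uncertainty (no sample correction)."""
    h = -sum(p * math.log2(p) for p in freqs.values() if p > 0)
    return math.log2(len(freqs)) - h


# A perfectly conserved position and a hypothetical skewed background.
position = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
skewed_background = {"A": 0.1, "C": 0.4, "G": 0.4, "T": 0.1}

re_value = relative_entropy(position, skewed_background)
info_value = information_bits(position)
```

Here `info_value` is 2 bits, the most a choice among four bases can carry, while `re_value` is log2(10), about 3.32, which is more than four states can store.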
-
References:
Please cite the
original reference:
@article{Schneider.Stephens1990,
author = "T. D. Schneider
and R. M. Stephens",
title = "{Sequence logos: a new way to display consensus sequences}",
journal = "Nucleic Acids Res",
volume = "18",
pages = "6097--6100",
pmid = "2172928",
pmcid = "PMC332411",
year = "1990"}
so that your paper can be tracked
in the literature.