TOBFAC came into being as a database of tobacco transcription factors and was, at the time, possibly the largest collection of transcription factor sequences from a single plant species (over 2,500 genes). We have now expanded TOBFAC with the goal of making it the best database for tobacco genomic research. To do this, we have incorporated a large amount of new data that can be searched and assembled. For the first time, it is possible to search:
1) 1,159,022 gene-space sequence reads (GSRs) obtained by methylation filtering from the Tobacco Genome Initiative (TGI).
2) The DFCI Tobacco Gene Index (Release 4.0, July 5, 2008), which contains 163,524 tobacco EST sequences and 2,288 expressed transcripts (ETs).
3) The complete TOBFAC database of tobacco transcription factors.
It is also possible to search multiple libraries in a single search. We have incorporated tools for downloading all of the sequences from the BLAST results, together with a contig tool to assemble any or all of the resulting sequences.
These refinements to TOBFAC bring together at least 1,327,716 individual sequences from either tobacco genomic DNA or cDNA, and TOBFAC now represents the tobacco genomic database that the tobacco community requires, but that has been lacking.
We are also improving the TOBFAC sequences by extending the original contigs using a contig extension tool designed by Ryan Thompson. This has allowed us to refine the predicted genes. These will be updated on a gene family basis as the improved data become available.
Paul J. Rushton, Marta T. Bokowiec, Shengcheng Han, Hongbo Zhang, Jennifer F. Brannock, Xianfeng Chen, Thomas W. Laudeman, and Michael P. Timko
Tobacco Transcription Factors: Novel Insights into Transcriptional Regulation in the Solanaceae
Plant Physiol. Published on March 12, 2008. doi:10.1104/pp.107.114041
Rushton PJ, Bokowiec MT, Laudeman TW, Brannock JF, Chen X, Timko MP.
TOBFAC: the database of tobacco transcription factors
BMC Bioinformatics. 2008 Jan 25;9:53.
Regulation of gene expression at the level of transcription is a major
control point in many biological processes and plant genomes devote
approximately 7% of their coding sequence to transcription factors. Global
analysis of transcription factors has only been performed for three seed
plants: Arabidopsis (http://datf.cbi.pku.edu.cn/index.php), poplar
(http://dptf.cbi.pku.edu.cn/) and rice (http://drtf.cbi.pku.edu.cn/).
TOBFAC, the database of tobacco transcription factors, contains a detailed
analysis of at least 2,513 tobacco (Nicotiana tabacum L.) transcription factors
using a dataset of 1,159,022 gene-space sequence reads (GSRs) obtained by
methylation filtering from the Tobacco Genome Initiative (TGI). These GSRs
are estimated to represent at least 90% of tobacco open reading frames.
TOBFAC contains all of the transcription factor sequences from the TGI, together with EST data. These sequences can be queried by BLAST searches and downloaded for further analysis. TOBFAC also contains phylogenetic trees for some of the largest families of transcription factors and these are also downloadable. We aim to regularly update the information so that TOBFAC will continue to represent one of the most wide-ranging databases of transcription factors in any plant species and be a major resource for the study of gene expression in tobacco and the Solanaceae.
Tobacco (Nicotiana tabacum L.) has been one of the most studied plant
species, partly because of its economic importance and partly because it is
a convenient plant system for research. Tobacco is a model plant for the
Solanaceae and is an amphiploid species (2n=48) with a relatively large
genome of approximately 4.5 Gbp, which makes the goal of sequencing the
tobacco genome difficult. However, to alleviate some
of the difficulties created by the presence of large amounts of repetitive
DNA in large genomes, a number of techniques have been developed to isolate
the low-copy or hypomethylated regions of the genome for sequencing. One of
these techniques is methylation filtration (MF), which preferentially clones
the hypomethylated fraction of the genome, effectively reducing the size of
the genome to be sequenced. The Tobacco Genome Initiative (TGI)
(http://www.tobaccogenome.org/) has been established to sequence and
annotate more than 90% of the open reading frames in the genome of
cultivated tobacco using methylation filtration technology.
We used a dataset of 1,159,022 gene-space sequence reads (GSRs) obtained by methylation filtering from the Tobacco Genome Initiative (TGI) to obtain sequence information from at least 90% of tobacco transcription factors. A consensus amino acid sequence (normally the DNA-binding domain) from each of 64 currently known transcription factor families was used to isolate sequences that belong to each class of transcription factor. These were assembled into contigs and individually analysed by BLAST searches to verify the identity of the gene sequence. Tobacco contains a minimum of 2,513 transcription factors, a total that is higher than in both Arabidopsis and rice. We found that Arabidopsis, poplar and tobacco all contain this core set of 64 transcription factor families and that rice shares 63 of them. This suggests that the evolution of higher plants was not associated with the wholesale gain or loss of transcription factor families but rather with the lineage-specific expansion of transcription factor subfamilies. Highlights of our work include the discovery of a novel subfamily of NAC transcription factors that we have called TNACs. The TNAC genes make up about 25% of all NAC genes in tobacco but are completely absent from all currently sequenced plant genomes. TNACs are, however, present in tomato, pepper and potato, and this novel subfamily therefore appears to be restricted to the Solanaceae. In addition, we have subjected the tobacco ERF, WRKY, NAC, homeodomain, bZIP, bHLH, R2R3MYB and MADS box genes to detailed phylogenetic analysis that facilitates predictions of function based on phylogenetic position.
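The first step described above (using one consensus sequence per family to pull out matching reads) can be illustrated with a minimal Python sketch. It is not the pipeline used to build TOBFAC. It assumes the search results are available in the standard 12-column BLAST tabular format (query ID in column 1, subject ID in column 2, E-value in column 11). The function name, the read IDs, the example rows and the E-value cutoff of 1e-5 are all hypothetical.

```python
from collections import defaultdict


def group_hits_by_family(tabular_text, max_evalue=1e-5):
    """Map each query (a TF family consensus) to the set of read IDs it hit.

    Assumes 12-column BLAST tabular rows; max_evalue is a hypothetical
    cutoff, not a value taken from the TOBFAC analysis.
    """
    hits = defaultdict(set)
    for line in tabular_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        family, read_id, evalue = fields[0], fields[1], float(fields[10])
        if evalue <= max_evalue:
            hits[family].add(read_id)
    return dict(hits)


# Made-up example rows: query = family name, subject = read ID.
example = "\n".join([
    "WRKY\tread_0001\t78.0\t60\t13\t0\t1\t60\t10\t189\t1e-20\t95.0",
    "WRKY\tread_0002\t40.0\t30\t18\t0\t5\t34\t40\t129\t0.5\t22.0",
    "NAC\tread_0003\t65.0\t120\t42\t0\t1\t120\t3\t362\t3e-40\t150.0",
])
families = group_hits_by_family(example)
for family, reads in sorted(families.items()):
    print(family, sorted(reads))
```

In a real analysis, the reads collected for each family would then be assembled into contigs and each contig checked individually by BLAST, as described above.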
The table below lists over- and under-represented TF families compared to the three sequenced higher plant genomes.
| TF Family | Arabidopsis | Poplar | Rice (indica) | Rice (japonica) | Tobacco |
|---|---|---|---|---|---|
| ERF/AP2 | 146 | 212 | 174 | 182 | 274 |
| C2H2 | 134 | 81 | 94 | 113 | 161 |
| HD | 87 | 106 | 84 | 103 | 129 |
| TCP | 23 | 34 | 22 | 24 | 43 |
| ZF-HD | 16 | 25 | 14 | 15 | 38 |
| GRF | 9 | 9 | 12 | 18 | 23 |
| BES | 8 | 12 | 7 | 6 | 19 |
| SAP | 1 | 1 | 0 | 0 | 4 |
| | | | | | |
| PcG | 34 | 45 | 34 | 34 | 23 |
| ZIM | 18 | 22 | 19 | 29 | 13 |
| ARF | 23 | 37 | 24 | 41 | 12 |
| CCAAT HAP5 | 13 | 19 | 14 | 18 | 6 |
| CPP | 8 | 13 | 11 | 16 | 3 |
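
As a quick check on the table, the following Python sketch classifies each family by comparing the tobacco count with the highest and lowest counts among the other genomes. The counts are copied from the table. The classification rule and the ratio against the mean of the four reference columns are illustrative assumptions, not the criteria used in the original analysis.

```python
# Counts from the table, in the order:
# Arabidopsis, Poplar, Rice (indica), Rice (japonica), Tobacco
counts = {
    "ERF/AP2": (146, 212, 174, 182, 274),
    "C2H2": (134, 81, 94, 113, 161),
    "HD": (87, 106, 84, 103, 129),
    "TCP": (23, 34, 22, 24, 43),
    "ZF-HD": (16, 25, 14, 15, 38),
    "GRF": (9, 9, 12, 18, 23),
    "BES": (8, 12, 7, 6, 19),
    "SAP": (1, 1, 0, 0, 4),
    "PcG": (34, 45, 34, 34, 23),
    "ZIM": (18, 22, 19, 29, 13),
    "ARF": (23, 37, 24, 41, 12),
    "CCAAT HAP5": (13, 19, 14, 18, 6),
    "CPP": (8, 13, 11, 16, 3),
}


def tobacco_ratio(row):
    """Tobacco count divided by the mean of the four reference columns.

    Illustrative only: the two rice columns are counted separately,
    so rice carries twice the weight of Arabidopsis or poplar.
    """
    *others, tobacco = row
    return tobacco / (sum(others) / len(others))


# Assumed rule: over-represented if tobacco exceeds every other genome,
# under-represented if it falls below every other genome.
over = [f for f, row in counts.items() if row[-1] > max(row[:-1])]
under = [f for f, row in counts.items() if row[-1] < min(row[:-1])]

for family, row in counts.items():
    label = "over" if family in over else "under" if family in under else "-"
    print(f"{family:12s} {label:6s} ratio={tobacco_ratio(row):.2f}")
```

Under this rule, the eight families in the upper block of the table come out as over-represented in tobacco and the five in the lower block as under-represented.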
Paul J Rushton
Marta T. Bokowiec
Xianfeng (Jeff) Chen
Thomas (Tom) W Laudeman
Jennifer F. Brannock
Michael P. Timko