Paul J. Rushton, Marta T. Bokowiec, Shengcheng Han, Hongbo Zhang, Jennifer F. Brannock, Xianfeng Chen, Thomas W. Laudeman, and Michael P. Timko
Tobacco Transcription Factors: Novel Insights into Transcriptional Regulation in the Solanaceae
Plant Physiol. Published on March 12, 2008; doi:10.1104/pp.107.114041
Rushton PJ, Bokowiec MT, Laudeman TW, Brannock JF, Chen X, Timko MP.
TOBFAC: the database of tobacco transcription factors
BMC Bioinformatics. 2008 Jan 25;9:53.
Regulation of gene expression at the level of transcription is a major
control point in many biological processes, and plant genomes devote
approximately 7% of their coding sequence to transcription factors. Global
analysis of transcription factors has only been performed for three seed
plants: Arabidopsis (http://datf.cbi.pku.edu.cn/index.php), poplar
(http://dptf.cbi.pku.edu.cn/) and rice (http://drtf.cbi.pku.edu.cn/).
TOBFAC, the database of tobacco transcription factors, contains a detailed
analysis of at least 2,513 tobacco (Nicotiana tabacum L.) transcription factors
using a dataset of 1,159,022 gene-space sequence reads (GSRs) obtained by
methylation filtering from the Tobacco Genome Initiative (TGI). These GSRs
are estimated to represent at least 90% of tobacco open reading frames.
TOBFAC contains all of the transcription factor sequences from the TGI, together with EST data. These sequences can be queried by BLAST searches and downloaded for further analysis. TOBFAC also contains phylogenetic trees for some of the largest families of transcription factors and these are also downloadable. We aim to regularly update the information so that TOBFAC will continue to represent one of the most wide-ranging databases of transcription factors in any plant species and be a major resource for the study of gene expression in tobacco and the Solanaceae.
Tobacco (Nicotiana tabacum L.) has been one of the most studied plant
species, partly because of its economic importance and partly because it is
a convenient plant system for research. Tobacco is a model plant for the
Solanaceae and is an amphiploid species (2n=48) with a relatively large
genome of approximately 4.5 Gbp, which makes the goal of sequencing the
tobacco genome difficult. However, to alleviate some
of the difficulties created by the presence of large amounts of repetitive
DNA in large genomes, a number of techniques have been developed to isolate
the low-copy or hypomethylated regions of the genome for sequencing. One of
these techniques is methylation filtration (MF), which preferentially clones
the hypomethylated fraction of the genome, effectively reducing the size of
the genome to be sequenced. The Tobacco Genome Initiative (TGI)
(http://www.tobaccogenome.org/) has been established to sequence and
annotate more than 90% of the open reading frames in the genome of
cultivated tobacco using methylation filtration technology.
We used a dataset of 1,159,022 gene-space sequence reads (GSRs) obtained by methylation filtering from the Tobacco Genome Initiative (TGI) to obtain sequence information from at least 90% of tobacco transcription factors. A consensus amino acid sequence (normally the DNA-binding domain) from each of 64 currently known transcription factor families was used to isolate sequences that belong to each class of transcription factor. These were assembled into contigs and individually analysed by BLAST searches to verify the identity of the gene sequence. Tobacco contains a minimum of 2,513 transcription factors, a total that is higher than in either Arabidopsis or rice. Arabidopsis, poplar and tobacco all contain this core set of 64 transcription factor families, and rice shares 63 of them. This suggests that the evolution of higher plants was not associated with the wholesale gain or loss of transcription factor families but rather with the lineage-specific expansion of transcription factor subfamilies.
Highlights of our work include the discovery of a novel subfamily of NAC transcription factors that we have called TNACs. The TNAC genes make up about 25% of all NAC genes in tobacco but are completely absent from all currently sequenced plant genomes. TNACs are, however, present in tomato, pepper and potato, and this novel subfamily therefore appears to be restricted to the Solanaceae. In addition, we have subjected the tobacco ERF, WRKY, NAC, homeodomain, bZIP, bHLH, R2R3MYB and MADS box genes to detailed phylogenetic analysis that facilitates predictions of function based on phylogenetic position.
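The family-assignment step described above (screening reads against a consensus amino acid sequence for each family) can be illustrated with a small sketch. This is not the pipeline used for TOBFAC, which relied on consensus DNA-binding domains and BLAST searches. As a simplified stand-in, the code below translates each read in all six reading frames and looks for an exact match to a short conserved motif. The WRKYGQK heptapeptide is the conserved core of the WRKY domain. The function names, the `FAMILY_MOTIFS` table and the example reads are hypothetical and exist only for illustration.

```python
from itertools import product

# Standard genetic code, built from the compact TCAG-ordered amino acid string.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of an upper-case DNA string."""
    return dna.translate(COMPLEMENT)[::-1]


def translate(dna):
    """Translate one reading frame; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frame(dna):
    """Translate a read in three forward and three reverse-strand frames."""
    dna = dna.upper()
    strands = (dna, reverse_complement(dna))
    return [translate(s[offset:]) for s in strands for offset in range(3)]


# Assumption: one short conserved motif per family stands in for the full
# consensus DNA-binding domain. Only WRKY is listed here; other families
# would need their own consensus sequences.
FAMILY_MOTIFS = {"WRKY": "WRKYGQK"}


def assign_families(reads, motifs=FAMILY_MOTIFS):
    """Map each family name to the IDs of reads containing its motif."""
    hits = {family: [] for family in motifs}
    for read_id, dna in reads.items():
        frames = six_frame(dna)
        for family, motif in motifs.items():
            if any(motif in frame for frame in frames):
                hits[family].append(read_id)
    return hits


# Hypothetical reads: the first encodes WRKYGQK in forward frame 3,
# the second carries no motif.
example_reads = {
    "read_1": "GCTGGCGTAAATATGGTCAAAAGTT",
    "read_2": "ATGGCCGCCGCCTAA",
}
print(assign_families(example_reads))
```

Exact motif matching misses diverged family members that a BLAST search with a full domain consensus would recover, so a sketch like this is useful only for showing the shape of the screening step, not for reproducing the reported counts.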
The table below lists over- and under-represented TF families compared to the three sequenced higher plant genomes.
| TF Family | Arabidopsis | Poplar | Rice (indica) | Rice (japonica) | Tobacco |
|---|---|---|---|---|---|
| ERF/AP2 | 146 | 212 | 174 | 182 | 274 |
| C2H2 | 134 | 81 | 94 | 113 | 161 |
| HD | 87 | 106 | 84 | 103 | 129 |
| TCP | 23 | 34 | 22 | 24 | 43 |
| ZF-HD | 16 | 25 | 14 | 15 | 38 |
| GRF | 9 | 9 | 12 | 18 | 23 |
| BES | 8 | 12 | 7 | 6 | 19 |
| SAP | 1 | 1 | 0 | 0 | 4 |
| | | | | | |
| PcG | 34 | 45 | 34 | 34 | 23 |
| ZIM | 18 | 22 | 19 | 29 | 13 |
| ARF | 23 | 37 | 24 | 41 | 12 |
| CCAAT HAP5 | 13 | 19 | 14 | 18 | 6 |
| CPP | 8 | 13 | 11 | 16 | 3 |
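The comparison in the table above can be reproduced with a short sketch. The counts are transcribed from the table. The classification rule is an assumption made for illustration only: a family is labelled "over" when the tobacco count exceeds every other genome and "under" when it falls below every other genome. The source does not state the criterion the authors used, and the names `TF_COUNTS` and `classify` are hypothetical.

```python
# Counts transcribed from the table above, in the column order:
# (Arabidopsis, poplar, rice indica, rice japonica, tobacco)
TF_COUNTS = {
    "ERF/AP2": (146, 212, 174, 182, 274),
    "C2H2": (134, 81, 94, 113, 161),
    "HD": (87, 106, 84, 103, 129),
    "TCP": (23, 34, 22, 24, 43),
    "ZF-HD": (16, 25, 14, 15, 38),
    "GRF": (9, 9, 12, 18, 23),
    "BES": (8, 12, 7, 6, 19),
    "SAP": (1, 1, 0, 0, 4),
    "PcG": (34, 45, 34, 34, 23),
    "ZIM": (18, 22, 19, 29, 13),
    "ARF": (23, 37, 24, 41, 12),
    "CCAAT HAP5": (13, 19, 14, 18, 6),
    "CPP": (8, 13, 11, 16, 3),
}


def classify(counts):
    """Compare the tobacco count with the other genomes (illustrative rule)."""
    *others, tobacco = counts
    if tobacco > max(others):
        return "over"
    if tobacco < min(others):
        return "under"
    return "within range"


labels = {family: classify(counts) for family, counts in TF_COUNTS.items()}

for family, label in labels.items():
    print(f"{family:12s} {label}")
```

Under this rule the first eight families in the table come out as over-represented in tobacco and the last five as under-represented, which matches the two blocks of the table.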
Paul J. Rushton
Marta T. Bokowiec
Xianfeng (Jeff) Chen
Thomas (Tom) W. Laudeman
Jennifer F. Brannock
Michael P. Timko