In this month’s episode, we’re going to revisit a topic that most—possibly even all—readers were exposed to in some far distant undergraduate course, but possibly not in much depth or with its significance to human genetic diseases made very clear outside of a few special cases. Here’s a quick test: consider the question, “Are mutations in or directly adjacent to coding regions of genes the only ones likely to lead to disease states?” If your immediate reaction is to answer this with a “Yes,” this article is for you. (If you answered “No,” you might want to read on anyway and see if your logic is right!)
Let’s go back to some very basic molecular biology. The genome contains genes, which are regions of DNA that get transcribed to RNA. In some cases this RNA is itself directly functional (tRNAs or the 18S RNA component of the ribosome, for instance), but in most cases the RNA is an mRNA carrying a protein coding sequence. The ribosomal machinery translates that sequence into a covalently linked chain of amino acids (a protein) which, by nature of the varied side chain chemistries and their electrostatic, hydrogen bond, and hydrophobic interactions, folds into a thermodynamic energy minimum to create a functional enzyme or structural protein. Mutations (changes to the underlying DNA sequence) anywhere within this coding region are statistically likely to cause unwanted changes of function in the final protein product. “Likely” is a reminder that such mutations can also be silent (causing no change in the protein sequence), harmless (causing a change with no significant impact), or possibly even advantageous, yielding a biologically more fit product.
What that compressed summary of about two years’ worth of undergrad biology courses omits is that the genes in one’s DNA don’t just magically transcribe to RNA on their own. Bearing in mind that only a small fraction of the human genome carries actual genes as defined above, there exist other DNA sequence elements whose sole role is to mark where genes are and to control their level of expression (transcription to RNA). There are three particularly significant types of these control elements, called promoters, enhancers, and repressors. (In the literature the repressing DNA elements are more often called silencers, with “repressor” reserved for the proteins that bind them; we’ll use “repressor” for the element here.) As we’ll see below, mutations in any of these can have effects as serious as, or worse than, mutations in coding sequences.
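For readers who like to see such things spelled out, the difference between silent and non-silent coding mutations can be sketched in a few lines of Python. This is only an illustration: the short coding sequence is made up, the function names are hypothetical, and only the standard genetic code table is real.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(coding_dna):
    """Translate a coding-strand DNA sequence, ignoring any trailing partial codon."""
    usable = len(coding_dna) - len(coding_dna) % 3
    return "".join(CODON_TABLE[coding_dna[i:i + 3]] for i in range(0, usable, 3))


def classify_substitution(coding_dna, position, new_base):
    """Label a single-base change as 'silent', 'missense', or 'nonsense'."""
    mutated = coding_dna[:position] + new_base + coding_dna[position + 1:]
    codon_start = position - position % 3
    before = CODON_TABLE[coding_dna[codon_start:codon_start + 3]]
    after = CODON_TABLE[mutated[codon_start:codon_start + 3]]
    if before == after:
        return "silent"
    if after == "*":  # "*" marks a stop codon
        return "nonsense"
    return "missense"


# Hypothetical four-codon sequence, for illustration only.
wild_type = "ATGGAGGAGAAG"
print(translate(wild_type))
print(classify_substitution(wild_type, 5, "A"))  # GAG -> GAA
print(classify_substitution(wild_type, 4, "T"))  # GAG -> GTG
print(classify_substitution(wild_type, 3, "T"))  # GAG -> TAG
```

Whether a missense change is harmless, harmful, or advantageous cannot be read off the sequence this way; that depends on what the altered residue does in the folded protein.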
Promoters: the proximal gatekeeper for a gene
Promoters are relatively short sequences (roughly 100 to 1,000 base pairs in length) always found directly upstream (5’, with respect to the DNA coding strand) of the gene they control (“drive,” in usual parlance). These sequences contain elements which recruit the RNA polymerases responsible for transcribing the gene. Very simplistically, if a particular defined promoter sequence in a particular cell type and setting is maximally efficient at recruiting RNA polymerase (let’s call that 100 percent activity), then variations in the sequence can occur which reduce this activity, so that less RNA is made per unit time. Some changes are more disruptive than others, and it’s not hard to envision how combinations of variations from a “best” promoter sequence can produce a smooth range of basal expression rates, from below 1 percent to the full 100 percent. That’s a great thing from a cell’s standpoint, because it allows different genes to have their expression levels tailored to the steady state amount of gene product needed.
Adding (literally) a layer of complexity here is that promoters don’t directly bind RNA polymerase. Instead, they contain shorter sub-sequences which are recognized as binding sites by a class of proteins known as transcription factors (TFs). There are a great many of these, each with its own preferred DNA binding site (usually short, 10-20 base pairs or fewer) and its own level of ability to recruit RNA polymerase. Many also have, either directly or indirectly, allosteric (secondary) binding sites where ligands such as metabolites or hormones can bind and influence the transcription factor’s level of activity. In fact, the complex interaction of all these different transcription factors and their modulating ligands is at the core of how different cell types are defined. A hepatocyte behaves differently than a neuron despite both having the same DNA because the two are “receiving different signals,” which control their relative expression levels of various genes.
It is easy to grasp, then, how a mutation within a promoter that changes a TF binding site can lead to problems not through a change in the function of the mature gene product, but through a change in how much of the product is made. Unwanted up or down regulation of a gene can have serious consequences, and if it’s unfortunate enough to happen in a gene which in turn controls the expression or activity of other genes, a whole set of genes can have their levels altered by a single nucleotide change. Many such changes are tolerated, but when expression is shifted far enough in a gene that matters, the result is a disease state.
The reader will recall that we started this section by stating that a promoter is always directly upstream of a gene. The spacing between the promoter and the transcriptional start site (where the first RNA nucleotide will be laid down in a nascent transcript) is also important, so insertion or deletion mutations, even ones which don’t directly change any specific TF binding sites, can impact the gene expression level. More generally, changes to the DNA immediately around a promoter can shut a gene down without touching its coding sequence. An example familiar to many readers is Fragile X syndrome. Here, an unstable CGG repeat lies in the 5’ untranslated region of the FMR1 gene, immediately adjacent to the promoter. Normally the repeat is short and sufficient levels of FMR1 mRNA are transcribed; however, the unstable element can expand when passed from parent to child. Once the repeat grows beyond roughly 200 copies, the repeat and the adjacent promoter become heavily methylated, transcription is silenced, and disease pathology results. Intermediate expansions of roughly 55 to 200 copies do not cause the classic syndrome but are considered a “premutation” or “carrier” state, from which further expansion in a later generation can produce the full mutation. (Carrier in this sense is not strictly identical to the meaning in Mendelian genetics, thus the quotation marks.)
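A toy illustration of the point, as a minimal Python sketch: the pattern below is the commonly quoted TATA box consensus (TATA, then A or T, then A, then A or T), while the promoter fragment and the function names are invented for the example. Real transcription factor binding is graded rather than all-or-nothing and is usually modeled with position weight matrices, so a simple pattern match is a deliberate simplification.

```python
import re

# TATA box consensus written as a regular expression.
TATA_BOX = re.compile(r"TATA[AT]A[AT]")


def find_binding_sites(sequence, motif=TATA_BOX):
    """Return the start index of every non-overlapping motif match."""
    return [match.start() for match in motif.finditer(sequence)]


def substitute(sequence, position, new_base):
    """Return a copy of the sequence with one base replaced."""
    return sequence[:position] + new_base + sequence[position + 1:]


# Hypothetical promoter fragment, for illustration only.
wild_type = "GGCCGCTATAAAAGGCGCGATCG"
mutant = substitute(wild_type, 8, "G")  # one T -> G change inside the motif

print(find_binding_sites(wild_type))
print(find_binding_sites(mutant))
```

The coding sequence downstream would be untouched by this change, yet the site that helps bring in the transcription machinery is no longer recognizable.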
The bottom line is that for every gene, not only is the coding section sequence important for proper function, but there’s always an adjacent promoter region which is susceptible to mutations which can have serious clinical repercussions. A gene might have a perfect wild type coding sequence and yet not function as needed.
Enhancers and repressors
The good news about promoters is that we know where to find them. In fact, by sequencing and examining large numbers of them in various contexts, and by identifying the TFs that bind them and the ligands that modulate those TFs, we have learned to recognize promoters, understand them, and in the right context even manipulate them at will to do things such as create tissue specific gene expression.
Enhancers and repressors, however, are more challenging. These are DNA sequence elements which can also modulate gene expression levels (upwards for enhancers and downwards for repressors, as one might guess). Like promoters, they are short elements (50 to 1,000 base pairs) carrying binding sites, often as repeated copies, for proteins which can influence transcription rates at nearby genes. “Nearby” is an intentionally vague term though, as an element can lie up to 1 million base pairs away from the gene it influences, and it can be either upstream or downstream (that is, 5’ or 3’) of the gene. They are at least restricted to action in cis, in other words on the same contiguous chromosome as the gene, but identifying them in relationship to a particular gene can be challenging. Consider the case of a hypothetical enhancer sequence. Finding unexpectedly low expression levels of an otherwise intact gene with an apparently normal promoter sequence would be the first clue that enhancer sequences might be involved. If a number of such cases can be found and the genomic region flanking the impacted gene can be sequenced, any areas of change from wild type that the cases have in common would be the place to look for enhancer elements. Damage to these (sequence alteration or deletion) would be expected to reduce gene expression. The mirror image of this, in a sense, is a repressor, which shares the same characteristics but in its normal state reduces expression of the gene. Mutations at a repressor site then cause an undesirable upregulation in gene expression.
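The search strategy just described can be sketched in Python under some generous assumptions: the flanking sequences from each affected case are already aligned to the wild type reference, they differ only by substitutions, and the sequences, window size, and function names here are all invented for illustration. Real variant analysis has to handle insertions, deletions, and benign population variation, none of which this toy does.

```python
def variant_positions(reference, sample):
    """Positions where an aligned sample differs from the reference."""
    if len(reference) != len(sample):
        raise ValueError("sequences must be aligned to the same length")
    return {i for i, (ref, alt) in enumerate(zip(reference, sample)) if ref != alt}


def candidate_windows(reference, samples, window=10):
    """Fixed windows in which every affected sample carries at least one variant."""
    per_sample = [variant_positions(reference, sample) for sample in samples]
    hits = []
    for start in range(0, len(reference), window):
        end = min(start + window, len(reference))
        if per_sample and all(
            any(start <= position < end for position in variants)
            for variants in per_sample
        ):
            hits.append((start, end))
    return hits


def with_substitutions(sequence, changes):
    """Helper for building example samples: apply {position: new_base} changes."""
    bases = list(sequence)
    for position, new_base in changes.items():
        bases[position] = new_base
    return "".join(bases)


# Hypothetical flanking region and three affected cases, for illustration only.
reference = "ACGTACGTAC" * 3
samples = [
    with_substitutions(reference, {12: "T", 25: "A"}),
    with_substitutions(reference, {14: "G"}),
    with_substitutions(reference, {17: "C", 3: "A"}),
]
print(candidate_windows(reference, samples, window=10))
```

Windows are used rather than exact shared positions because different cases may well carry different mutations within the same element; a region hit in every case is only a candidate, to be confirmed experimentally.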
How do enhancers and repressors work across such large distances, and perhaps more interestingly, how is it that they’re specific? That is, an enhancer or repressor will usually act on a particular distal gene, yet other genes near the one influenced may not be affected. The answer is perhaps somewhat disappointing, as there’s nothing amazing about it: the enhancer or repressor is not, spatially, far away from the gene it regulates. In other words, enhancers and repressors are able to work on targets that are distant in sequence because of chromatin organization. When chromosomes are wrapped and compacted to fit inside a cell nucleus, distant sequence elements can be placed physically adjacent to one another, such that a protein bound at one sequence element is directly touching and influencing the other. The astute reader will note, however, that in order for this to work reliably, the chromosome packing and organization must occur reproducibly, such that the two chromosome sections can be relied upon to be in proximity. An even more astute reader might further guess that if chromosome organization and packing change in a reliable fashion during steps of the cell cycle, one might envision enhancers or repressors which can only exert influence at specific times.
Conclusion
The take home message from all of the above is that no, it’s not just the coding sequence of any given gene which can mutate and influence biological function of the gene. This has possible implications for the relative information carried by whole genome sequencing vs whole exome sequencing projects—but that’s a topic for another month.