Preprint/Evolution of protein domain repeats



Protein domain repeats are evolutionarily related units that occur in tandem within a protein. Many proteins are composed of multiple protein domains, functional units of common origin. One variety of domain combination that is particularly frequent among Metazoa is tandem domain repeats. These are stretches of domains from the same family, situated next to each other in a protein. Certain properties characterize these domains. First, they are often quite short, frequently fewer than 50 residues, and, second, they tend to be highly variable, with only a few residues that are crucial for the functionality of the domain. Structurally, repeat domains are diverse and may form modular structures on their own or form larger filaments where each repeat depends on other repeats for its functionality. Their sequences are malleable, both with regard to the repeating unit and in the number of repeats, and they therefore provide flexible binding to many partners.

Many repeat proteins expand through duplications of neighbouring domains, but for some repeat domains low similarity between adjacent domains has been observed. Proposed causes for this pattern include aggregation prevention, as adjacent highly similar domains may aggregate, and constraints imposed on the repeat protein by its binding partners. It is possible, however, that it is simply a propagating pattern initially derived through random drift. Further, while multi-domain proteins evolve by single domain insertions/deletions at the N- or C-termini, repeat domains tend to expand through internal tandem duplications, where several units at a time are duplicated. Additionally, repeat proteins often expand to more than twenty repeating units within one protein, possibly through homologous recombination.

History
Protein domains are structural, functional and evolutionary building blocks that, within one protein, can form various architectures composed of one or several domains. One subset of such domain architectures is domain repeats, i.e. strings of the same class of domains repeated one after another (tandem repeats), as for instance ankyrin repeats. These domains are often short, such as the ankyrin repeat of ~33 residues, and their sequences are highly varied, with typically only a motif retained. The latter property confounds the automated assignment of repeat domains, since their signatures are often only a small part of the sequence, as in the case of the common HEAT repeat, which has on average around 13% sequence identity. As more sequences have become available, and because protein domain repeats make up a substantial fraction of them, several methods to identify repeats have been devised. One commonly used approach is to run HMMER with relaxed E-values for repeating regions. However, de novo methods also exist, such as HHrepID, a method based on HMM-HMM comparisons.
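
The idea of relaxing the E-value threshold inside repeating regions can be illustrated with a small two-pass filter. This is only a sketch under stated assumptions, not the procedure of HMMER or of any published pipeline: the `Hit` record, the function `assign_repeats`, the thresholds `strict` and `relaxed`, and the distance cutoff `max_gap` are all hypothetical names and values chosen for illustration. Confident hits are accepted first, and weaker hits of the same family are then rescued only if they lie close to an already accepted hit.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    family: str    # domain family name, e.g. "Ank" (illustrative)
    start: int     # first residue of the match
    end: int       # last residue of the match
    evalue: float  # E-value reported by the HMM search


def assign_repeats(hits, strict=1e-3, relaxed=1.0, max_gap=50):
    """Two-pass assignment (hypothetical thresholds).

    Pass 1 keeps hits at or below the strict E-value.
    Pass 2 rescues hits up to the relaxed E-value, but only if a hit of
    the same family has already been accepted within max_gap residues.
    Overlap resolution between hits is deliberately left out.
    """
    accepted = [h for h in hits if h.evalue <= strict]
    candidates = sorted(
        (h for h in hits if strict < h.evalue <= relaxed),
        key=lambda h: h.evalue,
    )
    changed = True
    while changed:  # repeat so that rescued hits can rescue their neighbours
        changed = False
        for cand in list(candidates):
            near = any(
                a.family == cand.family
                and cand.start - a.end <= max_gap
                and a.start - cand.end <= max_gap
                for a in accepted
            )
            if near:
                accepted.append(cand)
                candidates.remove(cand)
                changed = True
    return sorted(accepted, key=lambda h: h.start)


# Made-up hits: one confident ankyrin match followed by weaker ones.
example_hits = [
    Hit("Ank", 10, 42, 1e-8),
    Hit("Ank", 45, 77, 0.2),
    Hit("Ank", 80, 112, 0.5),
    Hit("Ank", 400, 432, 0.3),  # weak and isolated, so it is not rescued
    Hit("LRR", 115, 140, 0.4),  # weak hit from another family
]
repeat_starts = [h.start for h in assign_repeats(example_hits)]
```

In this toy input the two weak ankyrin hits adjacent to the confident one are retained, while the isolated weak hit and the weak hit from a different family are discarded.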



The study of repeat proteins started with Margaret Dayhoff's observations of internal gene duplications in 1978 and the structurally confirmed observation of tandem domains in acid proteases. Many of these duplications take place within repeat proteins, generating mutations that are often associated with disease. Perhaps the best-known case is huntingtin, a protein that contains many HEAT domain repeats preceded by a trinucleotide repeat, the expansion of which causes Huntington's disease. However, while the disease-causing trinucleotide repeats of huntingtin are short, many proteins contain repeats encompassing entire protein domains (protein domain repeats). Such domain repeats were found to be important as structural components of virus proteins, such as the shaft repeat of the adenovirus fiber protein, and an accumulation of ankyrin repeats in poxviruses has been observed. Indeed, structural roles are common for repeat proteins, and some of the best-studied examples play important roles as cytoskeleton crosslinkers, e.g. spectrin.

The tandem repeats of several proteins have been linked to complex diseases, such as cancer, as for instance the polyglycine repeats of the androgen receptor. Further, leucine-rich repeats (LRRs) may play a role in Parkinson's disease. Clearly, many repeat domains are of medical importance, as exemplified by the immunoglobulin domain, the main structural component of antibodies. In recent years, it has become evident that other protein domain repeats may also be used as protein scaffolds capable of specific protein binding. Indeed, the LRR is the main component of the adaptive immune system of jawless vertebrates and enables plants to adjust to new pathogens. Using alternative repeat domains for specific protein binding may allow optimization of the biophysical properties of protein scaffolds.

The many cellular roles of repeat proteins
There are several different classes of protein domain repeats, ranging from those that fold independently to those that aggregate. Other repeats, forming elongated structures, can only fold in the context of similar repeats, sometimes forming superhelical structures that are stabilized by adjacent repeat domains. A further distinctive characteristic of repeat domains is that, unlike most other domains, they tend to have little direct interaction between sequentially distant residues.





Repeat proteins have diverse functions, but are particularly common in cell cycle regulation, transcriptional regulation, protein transport, protein folding and structural roles. Although fairly rare in prokaryotes, repeat proteins in these organisms tend to be located on the cell surface, often playing an important role in pathogenesis. In eukaryotes, many repeat proteins help shape cellular structure, as in the cases of filamin, spectrin and titin. Other domain repeats are important in complex formation through flexible binding surfaces. Ankyrin repeats, for example, mediate many different protein-protein interactions and form a scaffold that evolves continuously toward the binding of new ligands. Indeed, repeat proteins are common among the hubs in protein-protein interaction networks.

Repeat proteins tend to be longer than other proteins and, indeed, constitute some of the proteins that shape the cytoskeleton. One example is titin, one of the largest known proteins, with hundreds of immunoglobulin repeats in tandem. Many of the repeat proteins that are important for the function of the cytoskeleton have cellular roles that were most likely cemented at the dawn of the vertebrate lineage, such as filamin, a mechanotransducer important for signalling that exists in three closely related paralogs in all sequenced vertebrates.

Evolution
Proteins evolve both through mutations involving one or a few residues and by domain rearrangements. The latter are comparatively well tolerated since, in many cases, protein domains perform modular functions. Repeat proteins have high variability with regard to the number of repeats in the protein. They differ from other proteins in that they tend to expand through internal duplications rather than domain shuffling. A likely scenario is that repeat proteins expand rapidly until a physical/structural limit has been reached and subsequently diverge rapidly, since repeat domains tend to have only weak sequence similarity. One possible explanation for their propensity to expand is that their structures allow expansion and, additionally, may provide novel ligand binding.





Higher eukaryotes in particular are prone to rapid repeat expansion, as is immediately obvious from the abundance of repeats in eukaryotic genomes compared with prokaryotic genomes. The three muscle proteins titin, twitchin and nebulin provide a few extreme examples of repeat expansion. For nebulin, the vertebrate lineage has seen rapid expansion of four proteins, where the largest protein is composed of more than two hundred domains.

The immunoglobulin and filamin repeats, which share the β-sandwich fold, exhibit a pattern where roughly every other domain shows high sequence similarity. This pattern is probably the result of a divergence of adjacent domains and subsequent duplications of the resulting pairs. Although such patterns may evolve by chance as the duplication of the diverged domain pair propagates, functional explanations for these dissimilar adjacent domains have been suggested. For instance, a lack of similarity between adjacent domains may serve to prevent aggregation, as suggested by Wright and coworkers. Alternatively, functional constraints set by other proteins in the interaction where the repeat domain is involved may cause this pattern, as in the case of filamin.
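
How duplication of a diverged pair can propagate the "every other domain" pattern without any selective pressure can be shown with a toy model. The sketch below is an illustration under strong simplifying assumptions: domains are reduced to labels ("A" and "B" stand for two diverged sequence types), later point mutations are ignored, and the helper names `duplicate_cassette` and `same_type_fraction` are hypothetical.

```python
def duplicate_cassette(domains, start, size):
    """Return a new list in which domains[start:start + size] is
    duplicated in tandem, directly after the original cassette."""
    if size < 1 or start < 0 or start + size > len(domains):
        raise ValueError("cassette must lie within the repeat array")
    cassette = domains[start:start + size]
    return domains[:start + size] + cassette + domains[start + size:]


def same_type_fraction(domains, offset):
    """Fraction of domain pairs `offset` positions apart with the same label."""
    pairs = [(domains[i], domains[i + offset])
             for i in range(len(domains) - offset)]
    return sum(a == b for a, b in pairs) / len(pairs)


# Start from one pair of adjacent domains that have already diverged.
array = ["A", "B"]
for _ in range(3):
    # Duplicate a two-domain cassette located in the middle of the array.
    middle = (len(array) - 2) // 2
    array = duplicate_cassette(array, middle, 2)

adjacent = same_type_fraction(array, 1)     # neighbours
every_other = same_type_fraction(array, 2)  # domains two positions apart
```

After three rounds the array has eight domains in strict alternation, so neighbouring domains never share a type while domains two positions apart always do. In this model the alternation is preserved whichever two-domain cassette is duplicated, which is consistent with the suggestion that the pattern may propagate by chance.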

Although different domains evolve through internal duplications of different-sized cassettes, the most common cassette size is two and the most common location of duplication is in the middle. Intriguingly, the abundant EGF and immunoglobulin domains show an enrichment of exon junctions at domain boundaries, suggesting a role for exon shuffling in repeat expansion for these domains. Other domains, such as the eukaryote-specific C2H2 zinc finger domain, show a very different pattern where the numerous repeating domains are contained within one giant exon.

Clearly, internal repeat duplications most frequently involve more than one domain. In some cases, repeat proteins contain another level of repeats, namely super-repeats: units that, in an already repetitive sequence, show internal similarity in larger blocks. Expansions such as these are easily detected through inspection of domain similarity matrices, where parallel lines of very similar domains, all seven domains apart, are clearly visible. In nebulin, the duplications consistently take place in cassettes of seven units. A similar super-repeat is found in the skeletal muscle isoform of titin.
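
The parallel lines in a domain similarity matrix correspond to off-diagonals with unusually high similarity, so the period of a super-repeat can be read off by averaging the matrix along each off-diagonal. The following is a minimal sketch of that idea, assuming pre-aligned domains of equal length and plain sequence identity as the similarity measure. The domain strings are invented toy data, not nebulin sequences, and all function names are hypothetical.

```python
def pairwise_identity(a, b):
    """Fraction of identical positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def similarity_matrix(domains):
    """All-against-all identity matrix for a list of aligned domains."""
    return [[pairwise_identity(a, b) for b in domains] for a in domains]


def mean_similarity_by_offset(matrix, max_offset):
    """Mean similarity of domains k positions apart, for k = 1..max_offset.
    Each value is the average along one off-diagonal of the matrix."""
    n = len(matrix)
    return {
        k: sum(matrix[i][i + k] for i in range(n - k)) / (n - k)
        for k in range(1, max_offset + 1)
    }


def strongest_offset(matrix):
    """Offset with the highest mean similarity; ties go to the smallest offset.
    Offsets above half the array length are skipped (few pairs, noisy means)."""
    means = mean_similarity_by_offset(matrix, len(matrix) // 2)
    return min(means, key=lambda k: (-means[k], k))


# Toy array: seven distinct units sharing a short motif, repeated three times.
units = ["SLAKAG", "SLDKEG", "SLFKHG", "SLIKMG", "SLNKPG", "SLQKRG", "SLTKVG"]
toy_domains = units * 3

matrix = similarity_matrix(toy_domains)
period = strongest_offset(matrix)
```

For this toy array the strongest off-diagonal lies seven domains from the main diagonal, mirroring the seven-unit cassette described above. Real data would need a proper alignment and a more robust similarity score than raw identity.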

Although it is not the only mechanism behind protein domain repeat expansions, homologous recombination is likely to be important, since any region of homology between two sequences may cause homologous recombination and subsequent tandem duplication. Indeed, larger repeating regions tend to be duplicated. Further, duplications in more malleable regions, such as those that contain domains with a repeat-forming characteristic, are less likely to have a detrimental effect on protein function. In fact, an increase in the number of repeated domains might not, to any great extent, alter the protein structure and can promote protein stability.

Repeat proteins and complexity
Higher complexity is associated with larger genomes and an abundance of sequence repeats. Indeed, protein domain repeats are more common in complex organisms. For coding repeats there are likely to be additional constraints affecting their abundance aside from constraints on genome size. In particular, protein domain repeats are generally uncommon in prokaryotes. This may be explained by the lack in prokaryotes of the sophisticated protein synthesis machinery, including the endoplasmic reticulum and Golgi apparatus, that allows eukaryotes to handle the multi-domain, non-globular folds that characterize repeat proteins.

While nebulin and titin are extreme examples of protein domain repeats, forming enormous proteins that are essential for cellular structure, most vertebrates contain a large number of repeat proteins that have between three and twenty consecutive domains. An estimated 17% of the human proteome consists of protein domain repeats, while the corresponding number is around 5% for prokaryotes, and indeed for unicellular organisms in general. Although the reason for the predominance of protein domain repeats in Metazoa is not clear, it is possible that repeats provide another source of variability in compensation for low generation rates. Further, studies show that protein domain repeats are a comparatively recent development in genome evolution, since ancient proteins tend to have few repeats. It should also be noted that certain Metazoa, such as Drosophila melanogaster, are comparable to unicellular organisms with regard to repeat proteins.

Acknowledgements
This work was supported by grants to AE from the Swedish Research Council and SSF (the Foundation for Strategic Research). SL was financed by Bioinformatics Infrastructure for Life Sciences. The authors gratefully acknowledge Dr. Åsa K. Björklund for assistance with figure preparation and the Journal of Molecular Biology for permission to reuse previously published images.