Talk:Preprint/Evolution of protein domain repeats

Wikification
Thanks for this nice draft!

To move forward in terms of wikification, please have another look at the guidelines, especially the Hyperlinks section. Furthermore, please add some suitable references for the lead section, and provide the images as SVG rather than PNG.

--Daniel Mietchen 20:24, 21 September 2012 (PDT)

Review from BK Lee
1.	The word “domain”

Coming from a protein structure background, the word “domain” has a specific meaning to me. Although there is no consensus on a precise definition of the term “domain” (see, for example, a recent review by Veretnik, Gu & Wodak (2009), “Identifying structural domains in proteins”, in Structural Bioinformatics, Second Edition, pp. 485-513, Wiley), a domain is usually more than 40 residues long and is physically separated from the rest of the protein and/or has a distinctly different architecture from the surrounding parts of the protein. In this article, however, the word “domain” has a broader meaning and includes the repeating units in ankyrin repeats and leucine-rich repeats (the proteins shown in Fig. 1). These units are quite short, as noted near the beginning of the article, and are not sufficiently separated from neighboring units to be called “domains” from the protein structure perspective. Unlike structural domains, which are indeed “functional units of common origin” as stated at the top of the article, these short repeating units do not have an assignable function and are therefore difficult to consider “functional units”.

It may be that the term “domain” has this broader meaning in the protein sequence community. However, it is obviously highly desirable for a word to carry the same meaning in the protein structure and protein sequence communities, particularly in Wiki pages whose readers are likely to come from diverse backgrounds. I strongly urge the authors to refrain from using the word “domain” when it includes units that do not qualify as structural domains. In most cases where the word “domain” occurs in this article, words like “unit”, “repeating unit” or “repeat” seem to be sufficient. The title of the article, “Evolution of protein domain repeats”, should be changed to something like “Evolution of proteins with repeating units”.

2.	In the abstract part, “… they tend to be highly variable with only a few residues …”

Repeating units tend to be variable in sequence, but their structure tends to be highly conserved. The sentence should make clear that the variability refers to sequence.

3.	In the History part, “The perhaps best known case is huntingtin, a protein that contains many HEAT domain repeats[11] preceded by a trinucleotide repeat …”

The trinucleotide repeat occurs in the huntingtin gene, not the protein.

4.	Fig. 2 legend

“… illustration of five most …”: Four shown, not five.

“… one nebulin domain per exon is red, two green and three blue.”: I assume each box represents one nebulin domain, not two or three. Do the authors mean that a nebulin domain made of one exon is red, one made of two exons is green, and one made of three exons is blue?

5.	In the cellular roles section, third sentence: “A further distinctive characteristic of repeat domains is their tendency to, unlike most other domains, have little direct interaction between sequentially distant residues.”

This sentence is correct only if the term “repeat domains” refers to entire proteins. But in the first sentence of this section, the term “protein domain repeats” seems to mean the repeating units. This is confusing. It would be clearer if the authors used more explicit and clearly distinct terms for the whole protein and for the individual repeating units.

6.	Fig. 4

Explain the horizontal dotted line and the stars (*) in the figure.

7.	Table

One of the most common enzyme domains, the TIM barrel, which is made of eight β−α units, is missing. Is this because (1) TIM barrels do not occur more often than the last member in this list, the PASTA domains, (2) the β−α units did not arise from duplication of one unit, or (3) there is perhaps a detection problem because the sequence similarity among the repeats is low?

In fact, it would be nice if the article contained some discussion of the relation between the level of sequence similarity among the repeats and the age of the duplication events. It is noted at the end of the “Repeat proteins and complexity” section that, according to ref. 26, ancient proteins tend to have few repeats. But the level of sequence similarity among the repeats is also expected to erode with time. Is it possible that this erosion makes it increasingly difficult to detect repeat-containing proteins, and that this is partly the reason why few repeat-containing proteins are found among ancient proteins?

8.	In the Evolution section, “… where parallel lines of very similar domains, all seven domains apart, see figure 5 and figure 4, …”

It is difficult to tell where the number seven in this sentence comes from. The authors probably did not mean to introduce the number seven until after the next sentence.

9.	It would be a good idea to run a spell checker over the whole article.

Reviewer 2
This paper seems to be a review. As such, it should be scrupulous in its definitions, so that readers who are not working in this area can turn to it as a resource. The paper fails in this goal. Even the title is obscure. What are ‘protein domain repeats’? It sounds like tandem, repeated domains, which is not what the authors mean. The authors lump together anything that mentions repeats, from ‘repeat domains’ such as LRR and HEAT to trinucleotide repeats as seen in Huntington’s disease. The authors also repeatedly mix ‘structural’ and ‘functional’ in their ideas and descriptions. They include very hazy descriptions of how repeats evolve, but offer no evidence for quite dogmatic statements such as “Ankyrin repeat(s)… evolve continuously towards the binding of new ligands”. Where is this bold statement coming from? The authors speculate that repeat proteins expand rapidly until ‘a physical structural limit’ has been reached. What would determine such a ‘physical structural limit’? What are the data to support this statement? The figures are not informative. Figure 4 is missing axis legends. The data in the table contain several mistakes. Midway through the paper the authors introduce the word ‘cassette’: is this a domain, a module, or a repeat? The authors do not tell the reader. Overall, this ‘review’ obfuscates rather than illuminates the topic.