G-Protein Coupled Receptor Gene Duplication Project/Progress

Questions to Answer

 * 1) How can I find the sequences Fredriksson and Schioth (2005) used to construct their hidden Markov models (HMMs)? Are they in an online database?
 * 2) Do the sequences in supplemental document S1 of Fredriksson and Schioth (2005) have their N and C termini removed?
 * 3) Has a chromosome ever been observed to carry more genes than expected because of gene duplication, or is this only a theoretical prediction?

Daily Log

 * December 4, 2008: Updated the SpaceRemover program description and added a header bar to my page.


 * August 31, 2009: SpaceRemover is not an effective program for reformatting files. David B. helped me figure out how to use the program HMMER. The sequence files should first be aligned in MUSCLE, and the alignment file should be changed to read "Type: P" (instead of "Type: A"). Then HMMER works:
 * $ ./hmmbuild test.hmm ADRA1A.msf
 * $ ./hmmcalibrate test.hmm
 * $ ./hmmsearch test.hmm Artemia.fa
 * The next step is to construct the same hidden Markov model(s) that Fredriksson and Schioth used and test them on the same genomes they analyzed, to determine whether my results match theirs.
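
The three HMMER commands above can be scripted so that the same build, calibrate, and search steps are repeated for each alignment. Below is a minimal Python sketch, assuming the HMMER 2 executables sit in the current directory as in the log. The function names `hmmer2_commands` and `run_pipeline` are illustrative and are not part of HMMER. Note that `hmmcalibrate` belongs to HMMER 2 and is not included in HMMER 3.

```python
import subprocess


def hmmer2_commands(alignment, hmm_file, database, hmmer_dir="."):
    """Return the build, calibrate, and search commands as argument lists."""
    def tool(name):
        # Assumption: the executables live directly inside hmmer_dir.
        return f"{hmmer_dir}/{name}"

    return [
        [tool("hmmbuild"), hmm_file, alignment],
        [tool("hmmcalibrate"), hmm_file],
        [tool("hmmsearch"), hmm_file, database],
    ]


def run_pipeline(alignment, hmm_file, database, hmmer_dir="."):
    """Run the three steps in order and return the hmmsearch report text."""
    output = ""
    for command in hmmer2_commands(alignment, hmm_file, database, hmmer_dir):
        # check=True stops the pipeline as soon as one step fails.
        result = subprocess.run(command, check=True, capture_output=True, text=True)
        output = result.stdout  # after the loop this holds the hmmsearch output
    return output


if __name__ == "__main__":
    # Print the commands only; nothing is executed here.
    for command in hmmer2_commands("ADRA1A.msf", "test.hmm", "Artemia.fa"):
        print(" ".join(command))
```

Calling `run_pipeline("ADRA1A.msf", "test.hmm", "Artemia.fa")` would execute the same three steps listed in the log, provided the files and executables are present.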


 * September 1, 2009: Fredriksson and Schioth used a large number of sequences to construct their hidden Markov models, and these sequences are in a supplementary FASTA file to their paper. The file runs to several hundred pages in Microsoft Word and is so large that it cannot be copied and pasted in one piece. I was able to copy and paste into Wikiversity the accession numbers of the rhodopsin sequences they used to construct the HMMs. According to the paper, Fredriksson and Schioth removed the N and C termini, as identified by RPS-BLAST searches, before aligning the receptor sequences (in my project, the alignment is done with MUSCLE). I don't yet know whether the termini have already been removed from the sequences in their FASTA file.
 * Next:
 * Use RPS-BLAST to determine whether Fredriksson and Schioth removed the N and C termini from the FASTA sequences in their supplementary file.
 * The problem: so far, I can't find the genes Fredriksson and Schioth used in the online databases. Genes and proteins with their accession numbers do not seem to exist in Entrez, although I was able to find genes homologous to the ones they used. The 2005 paper states that the rhodopsin genes used to construct the HMM came from the human genome, as identified by Fredriksson et al. This raises a question: did they deposit the data on their rhodopsin genes in a public database?
 * Activities:
 * What notations are used to store and tag gene and protein sequence data?
 * Read Fredriksson et al. 2003?
 * Where can I find the rhodopsin genes in the public databases?


 * September 9, 2009: Supplementary file S1 of Fredriksson and Schioth (2005) appears to be the list of genes they found with their hidden Markov model search, not the list of genes from which they constructed the model.
 * Regardless of the sequences used by Fredriksson and Schioth, I would like to be able to do a reasonable job of searching a genome for G-protein coupled receptors. The next step is to download an entire genome and experiment with various searches on it for GPCRs.
 * A well-studied genome such as the human genome should provide a reasonable indicator of whether I am on the right track.
 * A little-studied mammal or invertebrate genome may be a place to look for new GPCRs. What is known about the GPCRs in Monodelphis domestica? What is known about the GPCRs in Nematostella vectensis?
 * Next step: Download a reasonable copy of the human genome. I already have a couple of GPCRs with which to perform my first HMM analysis.


 * September 11, 2009: Where human genomes can be found.
 * The side buttons on the main page of the NCBI site lead to 3 or 4 different human genome sequences to choose from. I (User:Graeme E. Smith) believe that these sequences come from a couple of different groups of humans representing different ethnic somatotypes, but I could be wrong. I believe that once you know the name of the file, which you can find via the database, you can retrieve it from ftp.ncbi.nlm.nih.gov
 * The beginning of the September 1 issue of the journal Genome Research has analyses of complete human genomes. Where do these human genomes come from?


 * August 27, 2010: Downloaded and installed HMMER on C2B2 Titan, a new computer cluster I have access to.


 * September 13, 2010: My goal is to download what I believe are the rhodopsin sequences Fredriksson and Schioth used and to collect them in a single file in FASTA format. These sequences are available in the NCBI Protein database; I could find them by typing the accession numbers into its search box. Some of the sequences have been updated since Fredriksson and Schioth's work, and I used the updated versions. I created a file "rhodopsin.fasta", containing these sequences, in the folder "Wikiversity" on a computer I have access to. The name of each sequence is in its FASTA header line.
 * I pasted the sequences in the order in which they are listed on this site. So far I have finished up to NP_000729.1. More later, but at this point it is clear what to do.
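
Looking up each accession number by hand can also be done programmatically through NCBI's E-utilities `efetch` service. The following is a hedged Python sketch, not the procedure used in this log: the helper names `efetch_fasta_url`, `download_fasta`, and `fasta_headers` are made up for illustration, and the short FASTA text in the demo is a placeholder rather than real sequence data.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def efetch_fasta_url(accessions):
    """Build an efetch URL requesting protein records as FASTA text."""
    query = urlencode({
        "db": "protein",
        "id": ",".join(accessions),
        "rettype": "fasta",
        "retmode": "text",
    })
    return f"{EFETCH_URL}?{query}"


def download_fasta(accessions):
    """Fetch the records over the network (defined but not called here)."""
    with urlopen(efetch_fasta_url(accessions)) as response:
        return response.read().decode("utf-8")


def fasta_headers(fasta_text):
    """Return the header lines of a FASTA text, without the leading '>'."""
    return [line[1:].strip() for line in fasta_text.splitlines()
            if line.startswith(">")]


if __name__ == "__main__":
    print(efetch_fasta_url(["NP_000729.1", "NP_067637.2"]))
    # Placeholder records, only to show how the headers are read back.
    demo = ">seqA placeholder\nMKT\n>seqB placeholder\nGGA\n"
    print(fasta_headers(demo))
```

Saving the text returned by `download_fasta` to "rhodopsin.fasta" and then running `fasta_headers` on it would give a quick check that every expected sequence name is present.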


 * September 20, 2010: I'm up to HRH4 NP_067630.1 RHOD AMIN. I haven't been able to find this sequence yet.


 * October 15, 2010: I could not find this HRH4 receptor, so I'm using histamine H4 receptor isoform 1, NP_067637.2, instead. An isoform is one of several variant forms of a protein, often produced by alternative splicing of the same gene.
 * I am up to RHOD LGR LGR8 NP_570718.1, and have added all the sequences before this.


 * November 14, 2010: Pasted all the sequences into the "rhodopsin.fasta" file. The next step is to take a few of the sequences and run HMMER on them as a test. Essentially, I will start by getting some HMMER results and will attempt to duplicate Fredriksson and Schioth (2005) later. In their paper, they said they searched the rhodopsin GPCRs of the other species against a file similar to mine. They used an E-value cutoff of 1e-9 and manually inspected the five top hits for each receptor to determine which human GPCR was the closest. For a receptor to be placed in a given group, at least four of its five best hits had to belong to that group.
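
The grouping rule described above (E-value cutoff of 1e-9, five best hits, at least four in the same group) can be expressed as a small function. This is a sketch of my reading of the rule, not Fredriksson and Schioth's code, and they inspected the hits manually. Assumptions: hits are given as (group, E-value) pairs, a hit passes when its E-value is at or below the cutoff, and a receptor with fewer than five passing hits can still be assigned if four of them agree. The label "AMIN" appears in the sequence names above; "PEP" is a hypothetical second group used only for the example.

```python
from collections import Counter

E_VALUE_CUTOFF = 1e-9   # cutoff quoted from the 2005 paper
TOP_N = 5               # number of best hits inspected
MIN_AGREEING = 4        # hits that must share one group


def assign_group(hits, cutoff=E_VALUE_CUTOFF, top_n=TOP_N,
                 min_agreeing=MIN_AGREEING):
    """Return the group shared by enough of the best hits, or None."""
    # Keep only hits at or below the E-value cutoff (assumed inclusive).
    passing = [hit for hit in hits if hit[1] <= cutoff]
    # Lower E-values are better, so sort ascending and keep the best few.
    best = sorted(passing, key=lambda hit: hit[1])[:top_n]
    if not best:
        return None
    group, count = Counter(g for g, _ in best).most_common(1)[0]
    return group if count >= min_agreeing else None


if __name__ == "__main__":
    example_hits = [
        ("AMIN", 1e-40), ("AMIN", 1e-35), ("AMIN", 1e-30),
        ("PEP", 1e-28), ("AMIN", 1e-20), ("PEP", 1e-15),
    ]
    print(assign_group(example_hits))
```

Returning `None` for receptors that do not meet the four-of-five threshold keeps the ambiguous cases separate, so they can be inspected by hand.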