Draft:Original research/Genes

A gene is a distinct sequence of nucleotides forming part of a chromosome, the order of which determines the order of monomers in a polypeptide or nucleic acid molecule which a cell (or virus) may synthesize.

Theoretical genes
Def. a "theoretical unit of heredity of living organisms; [that] in principle predetermines a precise trait of an organism's form (phenotype), such as hair color" or a "segment of DNA or RNA from a cell's or an organism's genome, that may take several forms and thus parameterizes a phenomenon, in general the structure of a protein; locus" is called a gene.

Def. any discrete locus of heritable, genomic sequence that affects an organism's traits by being expressed as a functional product or by regulating expression is called a gene.

Here's a theoretical definition:

Def. a specific nucleotide sequence within a gene locus with its own transcription start site(s), introns, exons, and UTRs, that transcribes a specific RNA product is called an isoform, or gene isoform.

Def. any "of several different forms of the same protein, arising from either single nucleotide polymorphisms, differential splicing of mRNA, or post-translational modifications (e.g. sulfation, glycosylation, etc.)" is called an isoform.

Def. a "region of a transcribed gene present in the final functional RNA molecule" is called an exon.

Def. a "portion of a split gene that is included in pre-RNA transcripts but is removed during RNA processing and rapidly degraded" is called an intron.
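The exon and intron definitions above can be illustrated with a minimal Python sketch. The toy sequence, the exon coordinates, and the function name splice are hypothetical and chosen only for illustration; real splicing depends on splice-site signals and the spliceosome, neither of which is modeled here.

```python
def splice(pre_rna, exons):
    """Join the exon segments of a pre-RNA and drop everything between them.

    exons is a list of (start, end) pairs, 0-based with end exclusive,
    listed in 5'->3' order.
    """
    return "".join(pre_rna[start:end] for start, end in exons)


# Hypothetical toy transcript: uppercase = exon, lowercase = intron.
pre_rna = "AUGGCC" + "guaaguuucag" + "UUCUAA"
exons = [(0, 6), (17, 23)]

mature_rna = splice(pre_rna, exons)
print(mature_rna)  # the intron (lowercase stretch) is absent
```

The coordinates are supplied by hand in this sketch; it does not try to predict where the introns are.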

Gene structures
The promoter is recognized and bound by transcription factors that recruit and help RNA polymerase bind to the region to initiate transcription. A gene can have more than one promoter, resulting in messenger RNAs (mRNA) that differ in how far they extend in the 5' end. Highly transcribed genes have "strong" promoter sequences that form strong associations with transcription factors, thereby initiating transcription at a high rate, while other genes have "weak" promoters that form weak associations with transcription factors and initiate transcription less frequently. Eukaryotic promoter regions are much more complex and difficult to identify than prokaryotic promoters.

Additionally, genes can have regulatory regions many kilobases upstream or downstream of the open reading frame that alter expression. These regions act by binding transcription factors, which then cause the DNA to loop so that the regulatory sequence (and bound transcription factor) comes close to the RNA polymerase binding site. For example, enhancers increase transcription by binding an activator protein, which then helps to recruit RNA polymerase to the promoter; conversely, silencers bind repressor proteins and make the DNA less available to RNA polymerase.

The transcribed pre-mRNA contains untranslated regions at both ends, which carry a ribosome binding site and terminator and flank the start and stop codons of the open reading frame. In addition, most eukaryotic open reading frames contain untranslated introns, which are removed before the exons are translated; the sequences at the ends of the introns dictate the splice sites used to generate the final mature mRNA, which encodes the protein or RNA product.
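How the start and stop codons separate the open reading frame from the untranslated regions can be shown with a short Python sketch. The toy mRNA and the function name find_orf are hypothetical; the sketch simply takes the first AUG that is followed by an in-frame stop codon, which is a simplification of how translation start sites are actually chosen.

```python
STOP_CODONS = {"UAA", "UAG", "UGA"}


def find_orf(mrna):
    """Return (start, end) of the first AUG-initiated open reading frame.

    Indices are 0-based, end is exclusive and includes the stop codon.
    Returns None when no AUG is followed by an in-frame stop codon.
    """
    start = mrna.find("AUG")
    while start != -1:
        # Walk codon by codon from the codon after the AUG.
        for i in range(start + 3, len(mrna) - 2, 3):
            if mrna[i:i + 3] in STOP_CODONS:
                return start, i + 3
        start = mrna.find("AUG", start + 1)
    return None


# Hypothetical toy mRNA: 5' UTR + ORF + 3' UTR.
mrna = "GGCACC" + "AUGGCCUUCUAA" + "CUUAGG"
orf_start, orf_end = find_orf(mrna)

five_prime_utr = mrna[:orf_start]
coding_region = mrna[orf_start:orf_end]
three_prime_utr = mrna[orf_end:]
print(five_prime_utr, coding_region, three_prime_utr)
```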

Many prokaryotic genes are organized into operons, with multiple protein-coding sequences that are transcribed as a unit. The genes in an operon are transcribed as a continuous messenger RNA, referred to as a polycistronic mRNA; the term cistron in this context is equivalent to gene. Transcription of an operon's mRNA is often controlled by a repressor that can occur in an active or inactive state depending on the presence of specific metabolites. When active, the repressor binds to a DNA sequence at the beginning of the operon, called the operator region, and represses transcription of the operon; when the repressor is inactive, transcription of the operon can occur (see e.g. Lac operon). The products of operon genes typically have related functions and are involved in the same regulatory network.
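The repressor logic described above can be written as a small Boolean sketch in Python. It assumes an inducible, lac-like operon in which the metabolite (the inducer) inactivates the repressor; other operons work the opposite way, and additional controls such as catabolite repression are not modeled. The function names are hypothetical.

```python
# Structural genes of the lac operon, transcribed together as one mRNA.
LAC_OPERON_GENES = ["lacZ", "lacY", "lacA"]


def repressor_is_active(inducer_present):
    """Assumption: inducible system, so the inducer inactivates the repressor."""
    return not inducer_present


def transcribe_operon(genes, inducer_present):
    """Return the cistrons carried on the polycistronic mRNA.

    An active repressor sits on the operator and blocks transcription,
    so an empty list is returned in that case.
    """
    if repressor_is_active(inducer_present):
        return []
    return list(genes)


for inducer in (False, True):
    print(inducer, transcribe_operon(LAC_OPERON_GENES, inducer))
```

All genes of the operon are returned together or not at all, which mirrors their transcription as a single unit.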

Gene clusters
The description of GeneID: 348 (APOE, apolipoprotein E) contains this: "This gene maps to chromosome 19 in a cluster with the related apolipoprotein C1 and C2 genes."

Gene expressions
Gene expressions are suites of genes, and their isoforms, that appear to be biochemically involved in the appearance of a trait.

Although it is harder to regulate the transcription of genes with multiple transcription start sites, "variations in the expression of a constitutive gene would be minimized by the use of multiple start sites."

Earlier "studies led to the design of a super core promoter (SCP) that contains a TATA, Inr, MTE, and DPE in a single promoter (Juven-Gershon et al., 2006b). The SCP is the strongest core promoter observed in vitro and in cultured cells and yields high levels of transcription in conjunction with transcriptional enhancers. These findings indicate that gene expression levels can be modulated via the core promoter."

On the right is a phylogenetic tree of eutherian A1BG, which includes opossum DM43 and DM46, and A1BG-like sequences in marsupials.

"A peptide identified in the late and early milk proteomes showed homology to eutherian alpha 1B glycoprotein (A1BG), a plasma protein with unknown function46, as well as venom inhibitors characterised in the Southern opossum Didelphis marsupialis (DM43 and DM4647,48,49), all members of the immunoglobulin superfamily. To characterise the relationship between the peptide sequence identified in koala, A1BG, DM43 and DM46, a phylogenetic tree was constructed [diagram on the right] including all marsupial and monotreme homologs (identified by BLAST), three phylogenetically representative eutherian sequences, with human IGSF1 and TARM1, related members of the immunoglobulin super family, used as outgroups. This phylogeny indicates that A1BG-like proteins in marsupials and the Didelphis antitoxic proteins are homologs of eutherian A1BG, with excellent bootstrap support (98%). The marsupial A1BG-like sequences and the Didelphis antitoxic proteins formed a single clade with strong bootstrap support (97%)."

"Human TARM1 and IGSF1, related members of the immunoglobulin superfamily are used as outgroups. The tree was constructed using the maximum likelihood approach and the JTT model with bootstrap support values from 500 bootstrap tests. Bootstrap values less than 50% are not displayed. Accession numbers: Tasmanian devil (Sarcophilus harrisii; XP_012402143), Wallaby (Macropus eugenii; FY619507), Possum (Trichosurus vulpecula; DY596639) Virginia opossum (Didelphis virginiana; AAA30970, AAN06914), Southern opossum (Didelphis marsupialis; AAL82794, P82957, AAN64698), Human (Homo sapiens; P04217, B6A8C7, Q8N6C5), Platypus (Ornithorhychus anatinus; ENSOANP00000000762), Cow (Bos taurus; Q2KJF1), Alpaca (Vicugna pacos; XP_015107031)."

Gene regulations
Each gene, or each of its isoforms, is likely to have transcription factors that upregulate it and others that downregulate it. As each gene is investigated, these enhancers and inhibitors are noted as they are discovered.

For example, submitting "gene regulation" APOE human to the NCBI gene database returns 28 genes and 21 mouse analogs. The first on the list is GeneID: 2099 ESR1 estrogen receptor 1. According to its page (http://www.ncbi.nlm.nih.gov/gene/2099): "This gene encodes an estrogen receptor, a ligand-activated transcription factor composed of several domains important for hormone binding, DNA binding, and activation of transcription. [...] Estrogen and its receptors are essential for sexual development and reproductive function, but also play a role in other tissues such as bone. Estrogen receptors are also involved in pathological processes including breast cancer, endometrial cancer, and osteoporosis." The database also maintains the DNA sequence upstream of, downstream of, and through the entire gene locus, so that analysis of the transcript variants can be attempted: "Alternative promoter usage and alternative splicing result in dozens of transcript variants, but the full-length nature of many of these variants has not been determined. [provided by RefSeq, Mar 2014]" The site lists gene interactions and six variants for three isoforms (1, 2, and 3) and ten experimental transcriptions.

Gene similarities
There are genes on other chromosomes that are similar to each gene being considered. For example, GeneID: 338, Apolipoprotein B, is on chromosome 2.

Eukaryote genes
Def. any "of the single-celled or multicellular organisms, of the taxonomic domain Eukaryota, whose cells contain at least one distinct nucleus" is called a eukaryote.

Those specific genes that cause cells to contain at least one distinct nucleus are eukaryote genes.

Genetics
There are "more than 4 million sites where proteins bind to DNA to regulate genetic function, sort of like a switch."

"Humans belong to the biological group known as Primates, and are classified with the great apes, one of the major groups of the primate evolutionary tree. Besides similarities in anatomy and behavior, our close biological kinship with other primate species is indicated by DNA evidence. It confirms that our closest living biological relatives are chimpanzees and bonobos, with whom we share many traits. But we did not evolve directly from any primates living today."

"DNA also shows that our species and chimpanzees diverged from a common ancestor species that lived between 8 and 6 million years ago. The last common ancestor of monkeys and apes lived about 25 million years ago."

Human DNA
"[H]uman DNA has millions of on-off switches and complex networks that control the genes' activities. ... [A]t least 80% of the human genome is active, which opposed the previously held idea that most of the DNA are useless."

"DNA contains genes, which hold the instructions for [life. But, these] take up only about 2 percent of the genome ... The human genome is made up of about 3 billion “letters” along strands that make up the familiar double helix structure of DNA. Particular sequences of these letters form genes, which tell cells how to make proteins. People have about 20,000 genes, but the vast majority of DNA lies outside of genes. ... [A]t least three-quarters of the genome is involved in making RNA [...] it appears to help regulate gene activity."

Human genes
"Nine elements were tested, representing a sampling of elements present in the two gene deserts and DACH introns, spread over a 1530-kb region surrounding the human DACH's TATA box."

Gene ID: 1602 is the human gene DACH1 dachshund homolog 1 also known as DACH. DACH1 has three isoforms: a, b, and c.

"[T]he human ... prostaglandin-endoperoxide-synthase-2 [gene contains] a canonical TATA box (nucleotide residues at positions -31 to -25 for the human gene)." This is Gene ID: 5743.

The Drosophila hsp70 has a TATA box-containing promoter. This suggests that GeneID: 3308 HSPA4 heat shock 70kDa protein 4 [Homo sapiens], also known as hsp70, has a TATA box in its core promoter.
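Positions such as "-31 to -25" are counted upstream from the transcription start site (TSS), with the first transcribed base at +1 and no position 0. A minimal Python sketch of scanning for a TATA box and reporting it in those coordinates follows. The motif TATA[AT]A[AT] is a commonly used simplification of the TATA box consensus and is an assumption here, as are the synthetic sequence, the 50-base window, and the function name find_tata_boxes; none of this is taken from the cited genes.

```python
import re

# Assumed simplified TATA box consensus (7 bases).
TATA_PATTERN = re.compile(r"TATA[AT]A[AT]")


def find_tata_boxes(sequence, tss_index, window=50):
    """Find TATA-like motifs in the window upstream of the TSS.

    tss_index is the 0-based index of the +1 base. Each hit is returned
    as (first, last) positions relative to the TSS, e.g. (-31, -25).
    """
    start = max(0, tss_index - window)
    upstream = sequence[start:tss_index]
    hits = []
    for match in TATA_PATTERN.finditer(upstream):
        first = start + match.start() - tss_index
        last = start + match.end() - 1 - tss_index
        hits.append((first, last))
    return hits


# Synthetic sequence built so that the motif spans -31 to -25.
upstream_dna = "GC" * 10 + "TATAAAA" + "GC" * 12
sequence = upstream_dna + "ACCGGT"
tss_index = len(upstream_dna)

print(find_tata_boxes(sequence, tss_index))
```

A motif match alone does not show that a promoter is functional; the sketch only illustrates the coordinate convention.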

Genotypes
The genetic information in a genome is held within genes, and the complete set of this information in an organism is called its genotype. A gene is a unit of heredity and is a region of DNA that influences a particular characteristic in an organism. Genes contain an open reading frame that can be transcribed, as well as regulatory sequences such as promoters and enhancers, which control the transcription of the open reading frame.

Only about 1.5% of the human genome consists of protein-coding exons.

Pseudogenes
"An abundant form of noncoding DNA in humans are pseudogenes, which are copies of genes that have been disabled by mutation. These sequences are usually just molecular fossils, although they can occasionally serve as raw genetic material for the creation of new genes through the process of gene duplication and divergence.

About 2700 formerly active genes are now pseudogenes.

Deaminations
The deficiency of CpG dinucleotides in vertebrate genomes is attributed to the increased vulnerability of methylcytosines to spontaneous deamination to thymine in genomes with CpG cytosine methylation.
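The effect of this deamination on CpG counts can be shown with a toy Python sketch. The sequence, the choice of which cytosine is methylated, and the function name are hypothetical; the sketch only demonstrates that converting a methylated C to T turns a CpG into a TpG and so removes one CpG from the sequence.

```python
def deaminate_methylated_cpg(seq, methylated_positions):
    """Replace C with T at each given 0-based position that starts a CG pair.

    Positions that do not point at the C of a CG dinucleotide are ignored.
    """
    bases = list(seq)
    for pos in methylated_positions:
        if seq[pos:pos + 2] == "CG":
            bases[pos] = "T"
    return "".join(bases)


before = "AACGTTCGGA"
after = deaminate_methylated_cpg(before, [2])

print(before, before.count("CG"))
print(after, after.count("CG"))
```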

Methylations
Cytosines in CpG dinucleotides can be methylated to form 5-methylcytosine. In mammals, methylating the cytosine within a gene can turn the gene off, a mechanism that is part of a larger field of science studying gene regulation that is called epigenetics. Enzymes that add a methyl group are called DNA methyltransferases.

In mammals, 70% to 80% of CpG cytosines are methylated.

CpG dinucleotides have long been observed to occur with a much lower frequency in the sequence of vertebrate genomes than would be expected due to random chance. For example, in the human genome, which has a 42% GC content (so that cytosine and guanine each make up about 21% of bases), a pair of nucleotides consisting of cytosine followed by guanine would be expected to occur 0.21 × 0.21 = 0.0441, or 4.41%, of the time. The frequency of CpG dinucleotides in human genomes is 1%, less than one-quarter of the expected frequency.
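The arithmetic above can be reproduced with a short Python sketch. It assumes, as the text does, that cytosine and guanine each account for half of the GC content and that neighbouring bases are independent; the function names are hypothetical.

```python
def cpg_expected_frequency(gc_content):
    """Expected CpG frequency if C and G each make up half of the GC content."""
    c_fraction = g_fraction = gc_content / 2
    return c_fraction * g_fraction


def cpg_observed_frequency(seq):
    """Fraction of adjacent base pairs in seq that are C followed by G."""
    seq = seq.upper()
    pairs = len(seq) - 1
    if pairs < 1:
        return 0.0
    count = sum(1 for i in range(pairs) if seq[i:i + 2] == "CG")
    return count / pairs


expected = cpg_expected_frequency(0.42)   # human GC content from the text
observed = 0.01                           # 1% CpG frequency from the text
ratio = observed / expected

print(f"expected = {expected:.4f}")
print(f"observed/expected = {ratio:.2f}")
```

The observed/expected ratio comes out below 0.25, which is the "less than one-quarter" stated in the text. The helper cpg_observed_frequency can be applied to any sequence string to obtain the observed value directly.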

Unmethylated CpG sites can be detected by Toll-Like Receptor 9 (TLR 9) on plasmacytoid dendritic cells and B cells in humans. This is used to detect intracellular viral, fungal, and bacterial pathogen DNA.

Methylation is central to imprinting, along with histone modifications. Most of the methylation occurs a short distance from the CpG islands (at "CpG island shores") rather than in the islands themselves.

Methylation of CpG sites within the promoters of genes can lead to their silencing, a feature found in a number of human cancers (for example the silencing of tumor suppressor genes). In contrast, the hypomethylation of CpG sites has been associated with the over-expression of oncogenes within cancer cells.

Mutations
Alu elements are a common source of mutation in humans, but such mutations are often confined to non-coding regions where they have little discernible impact on the bearer.

The mutagenic effect of Alu and retrotransposons in general has played a major role in the recent evolution of the human genome.

The first report of Alu-mediated recombination causing a prevalent inherited predisposition to cancer was a 1995 report about hereditary nonpolyposis colorectal cancer.

"The human diseases caused by Alu insertions include":
 * Breast cancer
 * Ewing's sarcoma
 * Familial hypercholesterolemia
 * Hemophilia
 * Neurofibromatosis
 * Diabetes mellitus type II.

The following diseases have been associated with single-nucleotide DNA variations in Alu elements impacting transcription levels:
 * Alzheimer's disease
 * Lung cancer
 * Gastric cancer.

"The ACE gene, encoding angiotensin-converting enzyme, has 2 common variants, one with an Alu insertion (ACE-I) and one with the Alu deleted (ACE-D). This variation has been linked to changes in sporting ability: the presence of the Alu element is associated with better performance in endurance-oriented events (e.g. triathlons), whereas its absence is associated with strength- and power-oriented performance

The opsin gene duplication that resulted in the regaining of trichromacy in Old World primates (including humans) is flanked by an Alu element, implicating Alu in the evolution of three-colour vision.

Hypotheses

 * 1) Each gene may be expressed by one or more isoforms, usually subject to cell type.