Draft:Original research/Genomes

The genome is the entirety of an organism's hereditary information. In humans, it is encoded in DNA. The genome includes both the genes and the non-coding sequences of the DNA.

Genetic information is encoded as a sequence of nucleobases: adenine (A), cytosine (C), guanine (G), and thymine (T).
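As a minimal sketch of what "a sequence of nucleobases" means in practice, a DNA strand can be held as a plain string over the four letters. The helper names and the six-base example fragment below are hypothetical and chosen only for illustration.

```python
from collections import Counter

# Watson-Crick pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def base_counts(seq: str) -> Counter:
    """Count how often each nucleobase letter occurs in the sequence."""
    return Counter(seq.upper())


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read in its own 5' to 3' direction."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


example = "ATGCGT"  # hypothetical fragment, not taken from any real genome
print(base_counts(example))
print(reverse_complement(example))
```

Because each base on one strand fixes its partner on the other, storing one strand is enough to recover both.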

There are "3.2 billion base pairs in the human genome."
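To give a feel for the size of that figure, the following rough sketch estimates how much storage one copy of the sequence would need. It assumes a naive encoding of two bits per base (four letters need log2(4) = 2 bits) and ignores compression, quality data, and the second copy present in a diploid cell.

```python
import math

BASE_PAIRS = 3_200_000_000  # figure quoted in the text
ALPHABET = "ACGT"           # four possible letters per position

# Bits needed to distinguish four letters.
bits_per_base = math.log2(len(ALPHABET))

# One strand determines the other, so one letter per base pair suffices.
total_bytes = BASE_PAIRS * bits_per_base / 8

print(f"{total_bytes / 1e6:.0f} MB")  # prints "800 MB"
```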

Associated with genomes are epigenomes.

Theoretical genomes
Def. the "complete genetic information ... of an organism" is called a genome.

Genomics
Genomics is a branch of molecular biology concerned with the structure, function, evolution, and mapping of genomes.

The H (heavy, outer circle) and L (light, inner circle) strands are given with their corresponding genes. There are 22 transfer RNA (TRN) genes for the following amino acids: F, V, L1 (codon UUA/G), I, Q, M, W, A, N, C, Y, S1 (UCN), D, K, G, R, H, S2 (AGC/U), L2 (CUN), E, T and P (white boxes). There are 2 ribosomal RNA (RRN) genes: S (small subunit, or 12S) and L (large subunit, or 16S) (blue boxes). There are 13 protein-coding genes: 7 for NADH dehydrogenase subunits (ND, yellow boxes), 3 for cytochrome c oxidase subunits (COX, orange boxes), 2 for ATPase subunits (ATP, red boxes), and one for cytochrome b (CYTB, coral box). Two gene overlaps are indicated (ATP8-ATP6, and ND4L-ND4, black boxes).

The control region (CR) is the longest non-coding sequence (grey box). Its three hyper-variable regions are indicated (HV, green boxes).
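The gene inventory described above can be tallied as a quick consistency check. This is only a bookkeeping sketch: the dictionary layout and variable names are hypothetical, and the labels are copied from the description above.

```python
# One-letter amino acid codes of the tRNA genes, as listed above.
trna_genes = [
    "F", "V", "L1", "I", "Q", "M", "W", "A", "N", "C", "Y",
    "S1", "D", "K", "G", "R", "H", "S2", "L2", "E", "T", "P",
]

# Ribosomal RNA genes: small (12S) and large (16S) subunit.
rrna_genes = ["12S", "16S"]

# Protein-coding genes grouped by complex, with the counts given above.
protein_coding = {"ND": 7, "COX": 3, "ATP": 2, "CYTB": 1}

total = len(trna_genes) + len(rrna_genes) + sum(protein_coding.values())
print(len(trna_genes), len(rrna_genes), sum(protein_coding.values()), total)
# prints "22 2 13 37"
```

The total of 37 agrees with the gene count given for animal mitochondrial DNA later in this draft.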

Epigenomes
Inside each eukaryote nucleus is genetic material (DNA) surrounded by protective and regulatory proteins. These protective and regulatory proteins and the dynamic changes to them that occur during the course of a eukaryote's existence are the epigenome.

An epigenome consists of a record of the chemical changes to the DNA and histone proteins of an organism that can be passed down to an organism's offspring via transgenerational epigenetic inheritance, where changes to the epigenome can result in changes to the structure of chromatin and changes to the function of the genome.

Unlike the underlying genome, which is largely static within an individual, the epigenome can be dynamically altered by environmental conditions.

Deoxyribonucleic acids
Deoxyribonucleic acid (DNA) is composed of nucleobases (the sequence of which is the genome), deoxyribose (a sugar), and phosphate groups. Each nucleobase is attached to one deoxyribose molecule and one phosphate (PO4) group to form a chain of nucleotides (nucleobase + deoxyribose + phosphate) for a haploid genome. A linking of nucleobases may occur without the phosphate or the deoxyribose. The phosphate and the sugar are part of the epigenome.

Biochemistry
A haploid genome contains non-repetitive DNA and repetitive DNA. Non-repetitive DNA consists mainly of coding DNA. In eukaryotes, the coding DNA consists of genes having exon-intron organization. The major part of mammalian genomes is repetitive DNA.

In eukaryotes such as plants, protozoa and animals, "genome" typically refers only to the information carried on chromosomal DNA. Although these organisms contain chloroplasts and/or mitochondria that have their own DNA, the genetic information in these organelles is not considered part of the genome. Instead, mitochondria are sometimes said to have their own genome, often referred to as the "mitochondrial genome". The DNA found within the chloroplast may be referred to as the "plastome".

Human DNA
"[H]uman DNA has millions of on-off switches and complex networks that control the genes' activities. ... [A]t least 80% of the human genome is active, which opposed the previously held idea that most of the DNA are useless."

"DNA contains genes, which hold the instructions for [life. But, these] take up only about 2 percent of the genome ... The human genome is made up of about 3 billion “letters” along strands that make up the familiar double helix structure of DNA. Particular sequences of these letters form genes, which tell cells how to make proteins. People have about 20,000 genes, but the vast majority of DNA lies outside of genes. ... [A]t least three-quarters of the genome is involved in making RNA [...] it appears to help regulate gene activity."

Non-repetitive DNA
Each gene has exons interspersed with introns, usually alternating along the DNA template strand. By convention, the 5' untranslated region precedes these and the 3' untranslated region follows them. Between genes, usually upstream of each gene in the direction of transcription, are the nucleotide sequences that contain the gene promoters.

Mitochondrial DNA
Mitochondrial DNA (mtDNA) is a circular, double-stranded molecule with a length of 15-20 kilobases in animals. In most species it carries the same 37 genes, which code for 13 proteins, 2 ribosomal RNAs and 22 transfer RNAs.

The human mtDNA was the first mitochondrial genome sequenced. This first complete sequence is called the Cambridge Reference Sequence (CRS).

This mtDNA is composed of 16,569 base pairs (bp), with its genes distributed between the H (heavy) strand and the L (light) strand. With the exception of the control region (D-loop), which has regulatory functions, and a 9 bp region called the V region, the rest of the genome consists of coding DNA.

In the last 30 years, the mtDNA has been widely used in human evolution studies as a consequence of its particular characteristics that make it an ideal and useful tool.

Because of its non-coding nature, the control region exhibits the highest mutation rate in the mitochondrial genome. Several studies have demonstrated the strictly maternal inheritance of mtDNA, which is an enormous advantage because it allows related matrilineages to be traced through time without the complications inherent to nuclear DNA, such as recombination and biparental inheritance.

Viral genomes
Viral genomes can be composed of either RNA or DNA. The genomes of RNA viruses can be either single-stranded or double-stranded RNA, and may contain one or more separate RNA molecules. DNA viruses can have either single-stranded or double-stranded genomes. Most DNA virus genomes are composed of a single, linear molecule of DNA, but some are made up of a circular DNA molecule.

Archaea genomes
Archaea have a single circular chromosome.

Prokaryotic genomes
Most bacteria also have a single circular chromosome; however, some bacterial species have linear chromosomes or multiple chromosomes. If the DNA is replicated faster than the bacterial cells divide, multiple copies of the chromosome can be present in a single cell. Most prokaryotes have very little repetitive DNA in their genomes. However, some symbiotic bacteria (e.g. Serratia symbiotica) have reduced genomes and a high fraction of pseudogenes: only ~40% of their DNA encodes proteins.

Some bacteria have auxiliary genetic material, which is carried in plasmids.

Eukaryotic genomes
Eukaryotic genomes are composed of one or more linear DNA chromosomes. The number of chromosomes varies widely, from Jack jumper ants and Diploscapter pachys, an asexual nematode, which each have only one pair, to an Ophioglossum fern species that has 720 pairs. A typical human cell has two copies of each of 22 autosomes, one inherited from each parent, plus two sex chromosomes, making it diploid. Gametes, such as ova, sperm, spores, and pollen, are haploid, meaning they carry only one copy of each chromosome.

In addition to the chromosomes in the nucleus, organelles such as the chloroplasts and mitochondria have their own DNA. Mitochondria are sometimes said to have their own genome often referred to as the "mitochondrial genome". The DNA found within the chloroplast may be referred to as the "plastome". Like the bacteria they originated from, mitochondria and chloroplasts have a circular chromosome.

Unlike prokaryotes, eukaryotes have exon-intron organization of protein coding genes and variable amounts of repetitive DNA. In mammals and plants, the majority of the genome is composed of repetitive DNA.

A larger genome does not necessarily contain more genes, and the proportion of non-repetitive DNA decreases along with increasing genome size in complex eukaryotes.

Simple eukaryotes such as Caenorhabditis elegans and Drosophila melanogaster (fruit fly), have more non-repetitive DNA than repetitive DNA, while the genomes of more complex eukaryotes tend to be composed largely of repetitive DNA. In some plants and amphibians, the proportion of repetitive DNA is more than 80%. Similarly, only 2% of the human genome codes for proteins.

Non-coding sequences
Noncoding sequences include introns, sequences for non-coding RNAs, regulatory regions, and repetitive DNA. Noncoding sequences make up 98% of the human genome. There are two categories of repetitive DNA in the genome: tandem repeats and interspersed repeats.

Tandem repeats
Short, non-coding sequences that are repeated head-to-tail are called tandem repeats. Microsatellites consist of 2-5 base pair repeats, while minisatellite repeats are 30-35 bp. Tandem repeats make up about 4% of the human genome and 9% of the fruit fly genome. Tandem repeats can be functional. For example, telomeres are composed of the tandem repeat TTAGGG in mammals, and they play an important role in protecting the ends of the chromosome.
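As an illustration of what "repeated head-to-tail" looks like in sequence data, the sketch below scans a string for short units that occur several times in a row. The function name, the default unit lengths (2 to 6 bases, so that the 6-base telomere unit is included) and the minimum of three copies are assumptions made for this example, not thresholds from the literature.

```python
import re


def find_tandem_repeats(seq, min_unit=2, max_unit=6, min_copies=3):
    """Return (start, unit, copies) for each perfect tandem repeat found.

    The lazy quantifier makes the regex try the shortest candidate unit
    first; the backreference then requires further identical copies
    immediately after it.
    """
    pattern = re.compile(
        r"([ACGT]{%d,%d}?)\1{%d,}" % (min_unit, max_unit, min_copies - 1)
    )
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits


# Hypothetical sequence: three telomere-like units with arbitrary flanks.
print(find_tandem_repeats("GATC" + "TTAGGG" * 3 + "ACGA"))
# prints [(4, 'TTAGGG', 3)]
```

This only detects perfect repeats; real repeat finders also tolerate mismatches and small insertions or deletions.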

In other cases, expansions in the number of tandem repeats in exons or introns can cause disease. For example, the human gene huntingtin typically contains 6-29 tandem repeats of the nucleotides CAG (encoding a polyglutamine tract). An expansion to over 36 repeats results in Huntington's disease, a neurodegenerative disease. Twenty human disorders are known to result from similar tandem repeat expansions in various genes. The mechanism by which proteins with expanded polyglutamine tracts cause death of neurons is not fully understood. One possibility is that the proteins fail to fold properly and avoid degradation, instead accumulating in aggregates that also sequester important transcription factors, thereby altering gene expression.
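The repeat counts quoted above can be turned into a small counting sketch. The function names are hypothetical, the cut-offs are simply the two figures given in this section (6-29 and over 36), and counts the text does not cover are reported as such. This illustrates the arithmetic only and is not a diagnostic procedure.

```python
import re


def longest_cag_run(seq: str) -> int:
    """Return the number of CAG units in the longest uninterrupted CAG run."""
    runs = re.findall(r"(?:CAG)+", seq.upper())
    return max((len(run) // 3 for run in runs), default=0)


def describe_repeat_count(n: int) -> str:
    """Label a repeat count using only the ranges stated in the text."""
    if 6 <= n <= 29:
        return "typical range"
    if n > 36:
        return "expanded range"
    return "outside the ranges given in the text"


# Hypothetical fragment with 20 CAG units between arbitrary flanks.
fragment = "ATG" + "CAG" * 20 + "CCG"
count = longest_cag_run(fragment)
print(count, describe_repeat_count(count))  # prints "20 typical range"
```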

Tandem repeats are usually caused by slippage during replication, unequal crossing-over and gene conversion.

Transposable elements
Transposable elements (TEs) are sequences of DNA with a defined structure that are able to change their location in the genome. TEs are categorized as either class I TEs, which replicate by a copy-and-paste mechanism, or class II TEs, which can be excised from the genome and inserted at a new location.

The movement of TEs is a driving force of genome evolution in eukaryotes because their insertion can disrupt gene functions, homologous recombination between TEs can produce duplications, and TEs can shuffle exons and regulatory sequences to new locations.

Retrotransposons
Retrotransposons are transcribed into RNA, which is then copied back into DNA and inserted at another site in the genome. Retrotransposons can be divided into long terminal repeat (LTR) and non-long terminal repeat (Non-LTR) elements.

Long terminal repeats (LTRs) are derived from ancient retroviral infections, so they encode proteins related to retroviral proteins including gag (structural proteins of the virus), pol (reverse transcriptase and integrase), pro (protease), and in some cases env (envelope) genes. These genes are flanked by long repeats at both the 5' and 3' ends. LTRs have been reported to make up the largest fraction of most plant genomes and might account for the huge variation in genome size.

Non-long terminal repeats (Non-LTRs) are classified as long interspersed elements (LINEs), short interspersed elements (SINEs), and Penelope-like elements. In Dictyostelium discoideum, DIRS-like elements also belong to the Non-LTRs. Non-LTRs are widespread in eukaryotic genomes.

Long interspersed elements (LINEs) encode genes for reverse transcriptase and endonuclease, making them autonomous transposable elements. The human genome has around 500,000 LINEs, which make up around 17% of the genome.

Short interspersed elements (SINEs) are usually less than 500 base pairs and are non-autonomous, so they rely on the proteins encoded by LINEs for transposition. The Alu element is the most common SINE found in primates. It is about 350 base pairs and occupies about 11% of the human genome with around 1,500,000 copies.
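The copy numbers and genome fractions quoted for LINEs and Alu elements imply an average footprint per copy, which the following back-of-the-envelope sketch works out. It assumes the 3.2 billion base pair genome size quoted earlier in this draft. The function name is hypothetical, and the results are only as precise as the rounded inputs.

```python
GENOME_BP = 3_200_000_000  # human genome size quoted earlier in this draft


def mean_copy_length(percent_of_genome: float, copies: int) -> float:
    """Average number of base pairs per copy implied by the two figures."""
    return GENOME_BP * percent_of_genome / 100 / copies


line_mean = mean_copy_length(17, 500_000)    # LINE figures from the text
alu_mean = mean_copy_length(11, 1_500_000)   # Alu figures from the text

print(round(line_mean), round(alu_mean))  # prints "1088 235"
```

The implied Alu average of roughly 235 bp is below the stated element length of about 350 bp, which suggests that the quoted figures are rounded, that many copies are incomplete, or both.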

DNA transposons
DNA transposons encode a transposase enzyme between inverted terminal repeats. When expressed, the transposase recognizes the terminal inverted repeats that flank the transposon and catalyzes its excision and reinsertion at a new site. This cut-and-paste mechanism typically reinserts transposons near their original location (within 100 kb). DNA transposons are found in bacteria and make up 3% of the human genome and 12% of the genome of the roundworm C. elegans.

Hypotheses

 * 1) The human genome may be less than 10 % human.