New AI tool can decode DNA sequences

Findings on a new tool called “GROVER”, which can extract important information from DNA sequences, were recently published in the journal Nature Machine Intelligence.

About DNA

  • DNA, or deoxyribonucleic acid, is the central information storage system of all known living organisms, including animals and plants, and of many viruses.
    • Structurally, DNA is wound around histone proteins to form nucleosomes, which are in turn packaged into chromosomes.
  • Structure: The name comes from its structure, a backbone of deoxyribose sugar and phosphate groups with bases sticking out from it.
    • It is a polymer built from four bases: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T).
  • Double Helix model: In 1953 James Watson and Francis Crick, based on the X-ray diffraction data produced by Maurice Wilkins and Rosalind Franklin, proposed a very simple but famous Double Helix model for the structure of DNA. 
    • A DNA molecule consists of two strands wound around each other, with the two strands held together by hydrogen bonds between the bases. Adenine pairs with thymine, and cytosine pairs with guanine.
    • Gene: The sequence of bases in a portion of a DNA molecule, called a gene, carries the instructions needed to assemble a protein.
  • Hallmark: Base pairing between the two polynucleotide strands.
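
The base-pairing rule above (A with T, C with G) means that one strand fully determines the other. A minimal Python sketch of that rule follows; the names `COMPLEMENT`, `complement` and `reverse_complement` are illustrative and not taken from any particular library. It assumes the input contains only the four letters A, C, G and T, and it relies on the fact that the two strands run in opposite directions, which is why the partner strand is usually written reversed.

```python
# Illustrative sketch of Watson-Crick base pairing (names are hypothetical).
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base that pairs with each base, position by position."""
    return "".join(COMPLEMENT[base] for base in strand)


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own direction (reversed)."""
    return complement(strand)[::-1]


if __name__ == "__main__":
    print(complement("ATGC"))          # TACG
    print(reverse_complement("ATGC"))  # GCAT
```

Any letter other than A, C, G or T raises a `KeyError` in this sketch; real sequence data often contains ambiguity codes such as N, which would need separate handling.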

About GROVER

  • GROVER is a new large language model trained on human DNA that can extract important information from DNA sequences, such as identifying gene promoters or protein binding sites.
  • Significance: The researchers believe tools like GROVER could help transform genomics and personalized medicine. 
  • To train GROVER, the team at the Biotechnology Center (BIOTEC) of Dresden University of Technology in Germany first created a ‘DNA dictionary’.
  • The DNA Dictionary: DNA resembles language. It has four letters that build sequences, and the sequences carry a meaning.
    • Unlike a human language, however, DNA has no ready-made words: it consists of four letters (A, T, G, and C) and genes, but there are no predefined sequences of different lengths that combine to build genes or other meaningful sequences.
    • Information hidden in the DNA is multilayered. Only 1-2 % of the genome consists of genes, the sequences that code for proteins.
  • GROVER Role: GROVER learns the grammar of DNA.
    • In terms of the DNA code, this means learning the rules of the sequences, i.e. the order of the nucleotides and their meaning.
    • For example: Similar to how GPT models learn human languages, GROVER has essentially learned to “speak” DNA.
  • GROVER Functioning: GROVER can not only predict DNA sequences, but also derive biologically relevant information from the context, such as the start of genes or protein binding sites on the DNA.
    • GROVER also learns processes that are considered “epigenetic”.
      • Epigenetics: It is the study of how cells control gene activity without changing the DNA sequence. 

GROVER Training

  • DNA dictionary using byte pair encoding (BPE): To train GROVER, the team first created a DNA dictionary using byte pair encoding (BPE), a tokenization strategy that began as a text-compression algorithm and is widely used in transformer models such as GPT-3. With it, they examined the entire genome for the most common letter combinations.
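
The idea of building a dictionary from the most common letter combinations can be illustrated with a toy version of byte pair encoding. The sketch below is not the GROVER team's code: the function names (`most_frequent_pair`, `merge_pair`, `build_dna_vocabulary`), the example sequence and the number of merges are all assumptions chosen for illustration. It starts from single letters and repeatedly merges the most frequent pair of neighbouring tokens into a new, longer token.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common pair of adjacent tokens, or None if there is none."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace each left-to-right occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def build_dna_vocabulary(sequence, num_merges):
    """Toy BPE: start from single letters and add one merged token per round."""
    tokens = list(sequence)
    vocabulary = sorted(set(tokens))
    for _ in range(num_merges):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        vocabulary.append(pair[0] + pair[1])
    return vocabulary, tokens


if __name__ == "__main__":
    vocab, tokens = build_dna_vocabulary("ATATATGC", num_merges=2)
    print(vocab)   # ['A', 'C', 'G', 'T', 'AT', 'ATAT']
    print(tokens)  # ['ATAT', 'AT', 'G', 'C']
```

In this toy run, "AT" is the most frequent pair and becomes the first new dictionary entry; the second round then merges two neighbouring "AT" tokens into "ATAT". A real tokenizer would run over an entire genome with far more merge rounds, and details such as how ties between equally frequent pairs are broken may differ.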