Cracking the code of life: New AI model deciphers the hidden language of DNA

07-Aug-2024
Magdalena Gonciarz, generated with DALL-E 3

Artistic representation of the Large Language Model trained on DNA sequences.

Scientists at TU Dresden have trained a large language model on human DNA. Researchers can use it to try to decode the complex information hidden in our genome. The AI treats human DNA like a language, learning its rules and relationships to derive functional information about DNA sequences. This new tool, published in Nature Machine Intelligence, has the potential to revolutionize genomics and advance personalized medicine.

DNA contains the basic information for life. Understanding how this information is stored and organized has been one of the greatest scientific challenges of the last century. With GROVER, a new large language model trained on human DNA, researchers can now try to decode the complex information hidden in our genome.

Developed by a team at the Biotechnology Center (BIOTEC) of the Technische Universität Dresden, GROVER treats human DNA like language and learns its rules and relationships to derive functional information about the DNA sequences. This new tool, published in "Nature Machine Intelligence", has the potential to revolutionize genomics and advance personalized medicine.

Since the discovery of the double helix, researchers have been trying to decipher the knowledge encoded in DNA. Seventy years later, it is clear that the information hidden in DNA is complex. Only 1-2 percent of the genome consists of genes, the sequences that code for proteins.

"DNA has many functions that go beyond protein coding. Some sequences regulate genes, others serve structural purposes, and most sequences fulfill several functions simultaneously. Currently, we do not understand the significance of most of the DNA. For the areas outside of genes, we seem to have only scratched the surface. This is where AI and large language models can help," says Dr. Anna Poetsch, research group leader at BIOTEC.

DNA as a language

Large language models such as GPT have changed our understanding of language. Trained exclusively on text, these models developed the ability to use language in many contexts.

"DNA is the code of life. Why not treat it like a language?" asks Dr. Poetsch. The Poetsch team trained a large language model on a reference human genome. The resulting tool, called GROVER, or "Genome Rules Obtained via Extracted Representations", can be used to extract biological meaning from DNA.

"GROVER has learned the rules of DNA. In terms of language, we talk about grammar, syntax and semantics. For DNA, this means learning the rules of sequences, the order of nucleotides and sequences and their meaning. Similar to how GPT models learn human languages, GROVER has basically learned to 'speak DNA'," explains Dr. Melissa Sanabria, the researcher behind the project.

The team showed that GROVER can not only accurately predict the DNA sequence that follows, but can also be used to extract biologically meaningful information from context. For example, it can identify the start of genes or protein binding sites on DNA. GROVER also learns processes that are generally considered "epigenetic", i.e. processes that take place on the DNA and were not previously considered to be "coded" in it.

"It is fascinating that by training GROVER with the DNA sequence alone, without any additional functional data, we can actually extract information about biological function. For us, this shows that function, including some epigenetic information, is also encoded in the sequence," says Dr. Sanabria.

The DNA dictionary

"DNA is similar to language. It consists of four letters that form sequences, and the sequences carry a meaning. However, unlike a language, there is no concept of words," says Dr. Poetsch. DNA is written in four letters (A, T, G and C), but it has no predefined units of varying length, comparable to words, that combine to form genes or other meaningful sequences.

To train GROVER, the team first had to create a DNA dictionary. To do so, they borrowed a trick from compression algorithms. "This step is crucial and distinguishes our DNA language model from previous attempts," says Dr. Poetsch.

"We analyzed the entire genome and looked for letter combinations that occur most frequently. We started with two letters and searched the DNA again and again to build it up to the most common multi-letter combinations. In this way, in about 600 cycles, we fragmented the DNA into 'words' that allow GROVER to best predict the next sequence," explains Dr. Sanabria.

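The procedure Dr. Sanabria describes resembles the pair-merging scheme used in byte-pair encoding. The following minimal Python sketch illustrates the idea on a toy sequence. It is not the team's code: the function names, the toy input and the fixed cycle count are assumptions made for illustration, and the sketch does not reproduce how the team settled on about 600 cycles.

```python
from collections import Counter


def most_frequent_pair(tokens):
    """Return the most common adjacent token pair, or None if there are no pairs."""
    pairs = Counter(zip(tokens, tokens[1:]))
    if not pairs:
        return None
    return pairs.most_common(1)[0][0]


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def build_vocabulary(sequence, cycles):
    """Start from single letters and merge the most frequent pair once per cycle.

    `cycles` is a hypothetical parameter; the article reports about 600 cycles
    for the human genome, chosen by prediction performance (not modeled here).
    """
    tokens = list(sequence)
    vocabulary = sorted(set(tokens))
    for _ in range(cycles):
        pair = most_frequent_pair(tokens)
        if pair is None:
            break
        tokens = merge_pair(tokens, pair)
        word = pair[0] + pair[1]
        if word not in vocabulary:
            vocabulary.append(word)
    return vocabulary, tokens


# Toy input, invented for illustration only.
vocabulary, tokens = build_vocabulary("ATATATGC", cycles=2)
print(vocabulary)
print(tokens)
```

Each cycle adds one new multi-letter "word" to the dictionary and rewrites the sequence with it, so frequent combinations grow into longer tokens while rare ones stay split into shorter pieces.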
The promise of AI in genomics

GROVER promises to unlock the different levels of the genetic code. DNA contains important information about what makes us human, our susceptibilities to disease and our responses to treatments.

"We believe that understanding the rules of DNA through a language model will help us uncover the depths of biological meaning hidden in DNA. This should advance both genomics and personalized medicine," says Dr. Poetsch.
