From Sequence to Structure: Tymor Hamamsy’s Protein Annotation Research in Nature Biotechnology

NYU Center for Data Science
Sep 27, 2023

Proteins are the building blocks of life. They are composed of sequences of amino acids that determine their 3D structure and, in turn, their function. Protein sequences are readily discoverable, but inferring structure and function from sequence efficiently and reliably remains difficult. The problem has long been a subject of intense study, and in recent years it has been given a boost by machine learning. Earlier this month, Tymor Hamamsy, a CDS PhD student, published a paper in Nature Biotechnology that pushes this search forward. The paper, titled “Protein remote homology detection and structural alignment using deep learning,” introduces deep learning methods for finding structurally similar proteins directly from their sequences, with potential applications across bioinformatics.

Ascribing structure and function to protein sequences is called protein annotation, and it bridges the gap between genomics and biology. A significant challenge in protein annotation is that fewer than 1% of proteins have experimentally verified functions. Historically, sequence search, which involves comparing biological sequences against large databases of known sequences, has been instrumental in protein annotation. The leading tool in this domain is BLAST (Basic Local Alignment Search Tool), which has been used in hundreds of thousands of applications since its introduction in 1990.

While sequence search is central to many bioinformatics tasks, it has its limitations. Traditional sequence alignment, for instance, cannot determine whether two proteins are structurally similar. This is a significant issue when two proteins have similar structures but different sequences, because a sequence search will miss the relationship. The new research published by Hamamsy (along with CDS Associate Professor of Computer Science and Data Science Kyunghyun Cho and others) addresses this problem by focusing on structure-aware sequence search. Because structure is more closely tied to function than sequence is, searching by structure brings annotation closer to what proteins actually do.
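To make the limitation concrete, here is a minimal sketch of the kind of signal a sequence comparison relies on: the percentage of aligned positions at which two sequences share the same amino acid. The sequences are hypothetical toy strings, not real proteins, and the function assumes they are already aligned to the same length without gaps. Real tools such as BLAST do far more than this.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two pre-aligned sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to the same length")
    if not a:
        raise ValueError("sequences must be non-empty")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return 100.0 * matches / len(a)


# Hypothetical toy sequences, invented for illustration only.
seq_a = "MKTAYIAKQR"
seq_b = "MKTAYLAKQR"  # differs from seq_a at one position
seq_c = "GSHWPELVND"  # shares no positions with seq_a

print(percent_identity(seq_a, seq_b))
print(percent_identity(seq_a, seq_c))
```

A score like this says nothing about 3D shape. Two proteins could fold into similar structures and still score near the bottom of this scale, which is the gap structure-aware search is meant to close.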

AlphaFold2, the groundbreaking structure prediction method introduced by DeepMind in 2021, is the best-known AI-based method for going from protein sequences to predicted protein structures. However, AlphaFold2 (like the original AlphaFold) stops at predicting a structure for each sequence. Hamamsy and his co-authors’ research takes a different approach. Rather than predicting structures, their method builds structure-aware embeddings of proteins: numerical representations computed from sequence that reflect structural similarity. These embeddings can then be computed across vast databases of protein sequences, allowing researchers to find proteins based on structural similarity rather than sequence similarity.
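The idea of searching by embedding can be sketched in a few lines. The example below is an illustration under stated assumptions, not the authors' implementation: the three-dimensional vectors and protein names are made up, embeddings are assumed to be compared by cosine similarity, and a simple linear scan stands in for the indexed search a real system would use. In practice, embeddings would come from a trained model and have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nearest(query, database, k=2):
    """Return the k database entries most similar to the query embedding."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in database.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings, invented for illustration only.
database = {
    "protein_A": [0.9, 0.1, 0.0],
    "protein_B": [0.1, 0.9, 0.2],
    "protein_C": [0.8, 0.2, 0.1],
}
query = [1.0, 0.0, 0.0]

hits = nearest(query, database, k=2)
for name, score in hits:
    print(name, round(score, 3))
```

The query never needs a predicted 3D structure. Once every protein is a vector, finding structural neighbors becomes a nearest-neighbor lookup.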

One of the most revolutionary aspects of Hamamsy’s team’s research is the application of language models to protein sequences. Just as language models can be trained on sequences of words, they can also be trained on sequences of amino acids. These models have proven to be incredibly effective at learning representations of proteins, capturing the statistical patterns imparted by billions of years of evolution.
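The analogy with text starts at the input. Before a language model can read a protein, each amino acid is mapped to an integer token, much as words or word pieces are for text. The sketch below shows one way this could look. The numbering scheme and the choice to reserve 0 for unknown residues are assumptions made for illustration, not the vocabulary of any particular protein language model.

```python
# One-letter codes for the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary: IDs 1..20 for standard residues, 0 for anything else.
UNKNOWN_ID = 0
TOKEN_TO_ID = {aa: i + 1 for i, aa in enumerate(AMINO_ACIDS)}


def encode(sequence):
    """Map a protein sequence to a list of integer token IDs."""
    return [TOKEN_TO_ID.get(residue, UNKNOWN_ID) for residue in sequence.upper()]


tokens = encode("MKTX")  # "X" is not a standard residue, so it maps to UNKNOWN_ID
print(tokens)
```

From here, the model treats the list of IDs the way a text model treats a sentence, learning which residues tend to appear together and in what contexts.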

Hamamsy emphasized that the “protein remote homology detection” in the paper’s title points to an important concept on which their research has advanced the state of the art. Protein remote homology detection refers to finding proteins that are evolutionarily related and structurally similar but whose sequences have diverged. “We demonstrated in a few benchmarks, that these language models are able to do so much better than competing tools at remote homology detection,” said Hamamsy. “And that potentially opens up a wide range of annotation possibilities.”

The stakes are high. Hamamsy and his co-authors’ tools could revolutionize biology research, leading to the discovery of proteins with novel functions, the annotation of previously unannotated genomes, and more. One of the case studies in their paper demonstrated the potential of TM-Vec, one of the introduced methods, in identifying and classifying bacteriocins, nature’s antibiotics. Another promising application is in the realm of drug discovery, especially in the early phases.

The work of Hamamsy and his co-authors is a testament to the potential of combining computational methods with biology. By leveraging technology built for the internet, such as vector databases, their research can scale to the vastness of the metagenomic protein universe. As the field of bioinformatics continues to evolve, it’s clear that the integration of computational tools and biological research will pave the way for groundbreaking discoveries.

For a deeper dive into Hamamsy and his co-authors’ findings, you can access the full paper in Nature Biotechnology.

By Stephen Thomas
