Unsupervised Language Modeling at the Scale of Evolution
Growth in the number of protein sequences in public databases has followed an exponential trend over decades, creating a deep view into the breadth and diversity of proteins across life. Modeling sequences at the scale of evolution is a logical step toward predictive and generative artificial intelligence for biology. Our goal is to develop general-purpose models that can distill biological design principles directly from sequences with unsupervised learning. In contrast to the standard practice of fitting models to families of related sequences, we fit a single high-capacity model to millions of diverse sequences spanning evolution. I'll discuss our work to understand what large transformer language models learn about protein structure and function from sequences, how their internal representations can be used to produce features for a variety of tasks, and how the models can be used generatively. I'll also introduce a new language model that learns to extract structure using attention over sets of aligned sequences rather than individual sequences. Protein language modeling at scale produces state-of-the-art features for prediction tasks and surpasses state-of-the-art unsupervised protein structure learning methods.