
A couple of months ago, Ginkgo launched models.ginkgobioworks.ai, an API offering biological developers access to our growing suite of AI models. Since then, I’ve had lots of fun talking to academics and entrepreneurs about how they use AI tools.
For me personally, the most exciting thing about Ginkgo AI is the way it brings new people into our community. Because the model API is ultra-low-cost (including a free tier) and easy to use, it’s a super low-friction way to get started building with biology.
One challenge in onboarding all these new Ginkgo AI users is equipping them with effective documentation for the many use cases that interest them. We anticipated this problem, and it is one the software world has been dealing with for a long time: when you offer a flexible tool to lots of smart people, how can you economically support them in all their projects?
The challenge of accessibility is particularly acute in the case of AI for biology, where many users are new to biology, new to AI, or both. I’ve had conversations with users who knew they had a potential biological application of AI, but weren’t sure how to proceed. Which model to use? Which data to train it with? When can I be confident that a particular model is able to predict a desired property? And maybe most important – when is the right time to begin generating actual lab data for AI-generated constructs?
This tutorial is a step toward making our AI models easier to use. I wrote a Google Colab notebook demonstrating one particular way to use our models. In this case, we use the mask fill transform to make some functional improvements to an enzyme sequence. I recorded a live demo video where we go through the process of finding an interesting sequence, sending it to the Ginkgo API and visualizing the results.
My hope is that cookbooks like this one can be a flexible way to teach expert users, or rather, to give them what they need to teach themselves. Instead of trying to offer explicit instructions covering every possible application, we build a library of examples and trust that our users are smart enough to mix and match to build what they need. Cookbooks are also super fun to write, because I get to experiment with mini-projects for using AI in fun ways. In this case, I set up a little competition to see which of the 3 protein sequence models we currently host does the best job at replicating the work of a human protein engineer.
This project demonstrates the use of AminoAcid-0 (AA-0) and other protein LLMs accessible at models.ginkgobioworks.ai by using them to modify the sequence of a plastic-degrading enzyme.
The enzyme PETase is an esterase capable of breaking down the plastic polyethylene terephthalate (PET). In 2016, Yoshida et al. discovered PETase in a bacterium living close to a plastic bottle recycling site. In 2017, Austin et al. determined a high-resolution crystal structure for the enzyme (PDB 6EQE) and identified 2 amino acid changes that improved its ability to digest PET.
In the PETase Fill Mask Challenge, we compare the ability of 3 different protein LLMs to re-discover the changes that enhanced PETase activity via masked language modeling.
ESM-2 was developed in 2022 by Meta Research and quickly became popular for a range of sequence and structure prediction tasks.
ESM-2-3B expands the model size of ESM-2 from 650M to 3B parameters.
AA-0 is a protein LLM similar in structure to ESM-2, trained using 2 billion additional protein sequences from Ginkgo’s in-house sequence libraries.
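To make the fill-mask idea concrete, here is a minimal sketch of the bookkeeping behind the challenge: hide one residue behind a mask token, then check how highly a model ranks the amino acid a human engineer chose for that position. Everything in it is an assumption made for illustration. The `<mask>` token follows the ESM-style convention, the helper names `mask_position` and `rank_of_residue` are hypothetical, the sequence is a toy string rather than PETase, and the probabilities are made up instead of coming from the Ginkgo API. The Colab notebook shows the real calls.

```python
from typing import Dict, Optional

# Assumption: ESM-style mask token; check the notebook for what each hosted model expects.
MASK_TOKEN = "<mask>"


def mask_position(sequence: str, position: int, expected_wt: Optional[str] = None) -> str:
    """Return the sequence with the residue at a 1-based position replaced by the mask token.

    If expected_wt is given, verify the wild-type residue first to catch off-by-one errors.
    """
    if not 1 <= position <= len(sequence):
        raise ValueError(f"position {position} is outside the sequence")
    idx = position - 1  # mutation names are 1-based, Python strings are 0-based
    if expected_wt is not None and sequence[idx] != expected_wt:
        raise ValueError(f"expected {expected_wt} at {position}, found {sequence[idx]}")
    return sequence[:idx] + MASK_TOKEN + sequence[idx + 1:]


def rank_of_residue(probabilities: Dict[str, float], residue: str) -> int:
    """Return the 1-based rank of a residue among a model's predictions for a masked site."""
    ranked = sorted(probabilities, key=probabilities.get, reverse=True)
    return ranked.index(residue) + 1


# Toy example: mask position 4 (an alanine) in a made-up 10-residue sequence.
toy_sequence = "MKTAYIAKQR"
masked = mask_position(toy_sequence, 4, expected_wt="A")

# Made-up probabilities standing in for a model's response at the masked site.
toy_probabilities = {"A": 0.5, "F": 0.3, "G": 0.2}
target_rank = rank_of_residue(toy_probabilities, "F")
```

A model "re-discovers" an engineered change when the target residue ranks at or near the top for the masked position. Checking the wild-type residue before masking is a cheap guard against numbering mismatches between a paper's mutation names and the sequence you actually send.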
Obligatory disclaimer: this is a very small test on a single protein – it isn’t representative of any model’s general performance. More thorough benchmarking data is presented in our AA-0 technical review. To access Ginkgo’s API and try these models for yourself, head to models.ginkgobioworks.ai.
YouTube: How to Engineer Protein Sequences with AA-0 and the Ginkgo AI API
Posted by Jake Wintermute