TechBio Bytes I: FAbCon, a generative foundation model for de novo antibody sequence generation
In this series, I will be reviewing recent papers at the intersection of AI and biomedicine/health. Posts will include practical code examples whenever the models or code are openly available.
Paper: https://www.biorxiv.org/content/10.1101/2024.05.22.594943v1.full.pdf
Model: https://huggingface.co/alchemab/fabcon-small
Broader context:
Antibodies (Abs) have the potential to recognize + bind to a vast array of antigens. The human body can theoretically produce over 10^15 unique Abs to cover all possible antigens. Given this magnitude, discovering and optimizing Ab therapeutics is challenging. It requires screening billions of Ab variants to identify those with high binding affinity and specificity to the target antigen, while also possessing good biophysical properties for developability.
Recently, researchers have been leveraging large language models (LLMs) to learn patterns from large Ab repertoires. FAbCon is a state-of-the-art antibody LLM that builds on this foundation: it enables de novo design of Abs with desirable properties and can provide a rich pool of high-quality starting leads to fuel discovery pipelines.
What is FAbCon?
FAbCon is a large language model (2.4 billion parameters in its largest version) specifically designed and pre-trained on a massive corpus of Ab sequences to learn representations useful for Ab sequence understanding and generation tasks.
Technical details:
Based on the transformer decoder architecture from the general Falcon LLM
Pre-trained on a massive corpus of 823.7 million unpaired and 2.5 million paired Ab sequences using Causal Language Modeling (CLM)
CLM involves training the model to predict the next amino acid from the preceding residues (a minimal sketch follows this list)
Three variants spanning 144M to 2.4B parameters: FAbCon-small, -medium, -large
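To make the CLM objective concrete, here is a minimal sketch of how an Ab sequence is framed as a series of next-residue predictions. The sequence fragment below is arbitrary and purely illustrative.
# Minimal illustration of causal language modeling on an antibody sequence.
# The fragment is arbitrary; FAbCon trains this objective over ~826M sequences.
sequence = "EVQLLESGGGLVQPGGSLRLSCAAS"
for i in range(1, len(sequence)):
    context, target = sequence[:i], sequence[i]
    # The model is trained to maximize P(target | context) across the corpus.
    print(f"{context:>30} -> {target}")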
After pre-training on the diverse Ab data, FAbCon learns rich representations capturing patterns in Ab sequences. These representations can then be transferred to downstream tasks:
Generating new Ab sequences optimized for developability properties (example shown below)
Predicting antigen binding by fine-tuning on labeled Ab-antigen datasets
Key Capabilities:
SOTA performance in predicting antigen binding for Abs against targets like HER2, SARS-CoV-2, IL-6
Can generate Ab sequences with computational developability profiles mimicking human repertoires
Scaling up model size consistently improves antigen binding prediction
Requires less training data than comparable models trained without pre-training
Limitations and Future Work
As with any LLM, FAbCon may encode biases present in the pre-training data, which could impact the reliability of the generated Ab sequences.
The FAbCon models are released as gated models, which restricts direct out-of-the-box use for certain tasks, such as full sequence generation
The publicly available versions of FAbCon are pre-trained only, and further fine-tuning on labeled datasets would require obtaining a commercial license from the authors
Code + Use Cases:
The FAbCon models are available on the Hugging Face Hub. We can load and use them with the 🤗 Transformers library.
Since FAbCon is a gated model, there are some limitations on what can be done out-of-the-box with the publicly released versions. However, here are some potential use cases and examples:
Generate Novel Ab Sequences:
Here we will use pre-trained FAbCon to generate a novel Ab sequence with properties it learned during pre-training on the large Ab corpus.
from transformers import PreTrainedTokenizerFast, FalconForCausalLM
tokenizer = PreTrainedTokenizerFast.from_pretrained("alchemab/fabcon-large")
model = FalconForCausalLM.from_pretrained("alchemab/fabcon-large")
input_ids = tokenizer("H", return_tensors='pt')['input_ids']
output_ids = model.generate(input_ids, max_new_tokens=256, do_sample=True, top_k=50, temperature=0.95)
decoded_seq = tokenizer.batch_decode(output_ids)[0]
print(f"Generated sequence: {decoded_seq}")
We provided a starting prompt ("H") to the model, which marks the beginning of the Ab sequence, and then called model.generate() with sampling enabled (do_sample=True, top_k=50, temperature=0.95) to produce a new Ab sequence of up to 256 tokens. The generated token IDs (output_ids) are decoded back into an amino-acid sequence (decoded_seq) and printed.
Sequence Representation & Embedding Extraction:
We can also extract the embeddings or representations learned by FAbCon for downstream analysis like clustering, visualization, KNN searching, or transfer learning to Ab-specific tasks.
import torch
from transformers import PreTrainedTokenizerFast, FalconModel
tokenizer = PreTrainedTokenizerFast.from_pretrained("alchemab/fabcon-large")
model = FalconModel.from_pretrained("alchemab/fabcon-large")
inputs = tokenizer("HEVQLLESGGGLVQPGGSLRLSCATSGYTFT", return_tensors='pt')
with torch.no_grad():
    outputs = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'])
sequence_embedding = outputs.last_hidden_state.mean(dim=1)   # mean-pool over residue positions
We load the pre-trained FAbCon-large backbone and its tokenizer, encode a sample Ab sequence, and run a forward pass (no_grad avoids unnecessary gradient computation). Because FAbCon is a decoder-only model, mean-pooling the hidden states over residue positions is a simple way to obtain a single fixed-size vector; sequence_embedding now holds that representation of the input antibody sequence.
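As a small illustration of downstream use, we can compare two Ab sequences by the cosine similarity of their FAbCon embeddings. This reuses the model and tokenizer loaded above; both sequences are arbitrary fragments, not examples from the paper.
import torch.nn.functional as F

def embed(seq):
    enc = tokenizer(seq, return_tensors='pt')
    with torch.no_grad():
        out = model(input_ids=enc['input_ids'], attention_mask=enc['attention_mask'])
    return out.last_hidden_state.mean(dim=1)   # mean-pool over residue positions

emb_a = embed("HEVQLLESGGGLVQPGGSLRLSCATSGYTFT")
emb_b = embed("HQVQLVQSGAEVKKPGASVKVSCKASGYTFT")
print(f"Cosine similarity: {F.cosine_similarity(emb_a, emb_b).item():.3f}")
The same approach scales to building an embedding matrix over a whole repertoire for clustering, visualization, or KNN search.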
Zero-Shot Humanness Prediction:
The paper shows FAbCon's perplexity scores correlate strongly with the OASis humanness metric. So FAbCon could potentially be used zero-shot to predict the "humanness" of an Ab sequence.
import torch
from transformers import PreTrainedTokenizerFast, FalconForCausalLM
tokenizer = PreTrainedTokenizerFast.from_pretrained("alchemab/fabcon-large")
model = FalconForCausalLM.from_pretrained("alchemab/fabcon-large")
inputs = tokenizer("HEVQLLESGGGLVQPGGSLRLSCATSGYTFT", return_tensors='pt')
with torch.no_grad():
    output = model(input_ids=inputs['input_ids'],
                   attention_mask=inputs['attention_mask'],
                   labels=inputs['input_ids'])
perplexity = torch.exp(output.loss).item()
print(f"Perplexity: {perplexity:.2f}")
Passing the input IDs as labels makes the causal LM return its next-residue cross-entropy loss for the sequence, and exponentiating that loss gives the perplexity. Lower perplexity means the sequence looks more like the repertoires seen during pre-training, so it can serve as a zero-shot proxy for humanness.
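As a follow-up sketch, the same perplexity computation can be wrapped in a helper to rank a set of candidate sequences, for example ones produced by the generation example above. The candidates below are arbitrary fragments.
def perplexity(seq):
    enc = tokenizer(seq, return_tensors='pt')
    with torch.no_grad():
        out = model(input_ids=enc['input_ids'],
                    attention_mask=enc['attention_mask'],
                    labels=enc['input_ids'])
    return torch.exp(out.loss).item()   # lower = more "human-like" under the model

candidates = ["HEVQLLESGGGLVQPGGSLRLSCATSGYTFT", "HQVQLVQSGAEVKKPGASVKVSCKASGYTFT"]
for ppl, seq in sorted((perplexity(s), s) for s in candidates):
    print(f"{ppl:8.2f}  {seq}")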
For more advanced tasks like antigen binding prediction, FAbCon would need to be further fine-tuned on labeled data. That will require a commercial license according to the model card.
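For readers who do obtain access, here is a purely illustrative sketch of what such fine-tuning could look like; the sequence, label, and hyperparameters are hypothetical, and a real setup would use a full labeled dataset with batching and multiple epochs.
import torch
from transformers import PreTrainedTokenizerFast, FalconForSequenceClassification

tokenizer = PreTrainedTokenizerFast.from_pretrained("alchemab/fabcon-small")
# num_labels=2 attaches a fresh binder / non-binder classification head to the backbone.
model = FalconForSequenceClassification.from_pretrained("alchemab/fabcon-small", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

# One hypothetical labeled example: 1 = binds the target antigen, 0 = does not.
inputs = tokenizer("HEVQLLESGGGLVQPGGSLRLSCATSGYTFT", return_tensors='pt')
label = torch.tensor([1])

model.train()
outputs = model(input_ids=inputs['input_ids'], attention_mask=inputs['attention_mask'], labels=label)
outputs.loss.backward()   # cross-entropy loss from the classification head
optimizer.step()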
Thanks for reading!


