Installation | Checkpoints and data | Quick start | Evaluation | Citation
Official implementation of LineageFlow: Flow Matching for High-Fidelity Family-Aware Protein Sequence Generation, accepted to ICML 2026.
LineageFlow is a family-aware protein sequence generator. It starts from phylogeny-informed ancestral sequence reconstruction (ASR) priors and transports them toward extant sequences with a shared flow-matching denoiser. A single intermediate-time rerouting step performs mutate-select-amplify guidance for objective-aware generation without per-step predictor guidance.
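To give a feel for what a mutate-select-amplify step can look like, here is a toy sketch on plain strings. It is not the repository's rerouting code: the function names, the hydrophobic-fraction fitness, and the softmax resampling rule are all assumptions made for illustration, and the argument names only loosely mirror the rounds, population_size, mutate_frac, and beta entries of the default config.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_fitness(seq):
    # Hypothetical stand-in scorer: fraction of hydrophobic residues.
    return sum(c in "AILMFVW" for c in seq) / len(seq)

def mutate(seq, mutate_frac, rng):
    # Resample a random subset of positions uniformly over the alphabet.
    n_mut = max(1, round(mutate_frac * len(seq)))
    chars = list(seq)
    for i in rng.sample(range(len(seq)), n_mut):
        chars[i] = rng.choice(AMINO_ACIDS)
    return "".join(chars)

def reroute(seq, fitness, rounds=3, population_size=8,
            mutate_frac=0.25, beta=4.0, seed=0):
    rng = random.Random(seed)
    population = [seq] * population_size
    for _ in range(rounds):
        # Mutate: perturb every member, keeping the parents as candidates.
        candidates = population + [mutate(s, mutate_frac, rng) for s in population]
        # Select: weights proportional to exp(beta * fitness), shifted for stability.
        scores = [fitness(s) for s in candidates]
        top = max(scores)
        weights = [math.exp(beta * (f - top)) for f in scores]
        # Amplify: resample the population in proportion to the weights.
        population = rng.choices(candidates, weights=weights, k=population_size)
    return max(population, key=fitness)
```

Calling `reroute("GSGSGSGSGSGS", toy_fitness)` returns a sequence of the same length; a larger beta concentrates the resampling on the highest-scoring candidates.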
Table 1 from the paper compares LineageFlow with released-weight and Pfam-trained baselines under the same evaluation protocol. The released configuration in this repository uses the rp55 checkpoint and the default inference parameters in config/generation.json.
This repository contains the code needed to run the released model and reproduce the evaluation pipeline:
- Model definition for the LineageFlow denoiser.
- Batch and single-family sequence generation.
- Rerouting fitness scorers based on ESM2 masked scoring, prior likelihood, and lightweight heuristics.
- Evaluation scripts for family validity, foldability, self-consistency, novelty, diversity, and matched natural baselines.
- Download utilities and path conventions for the released checkpoint and preprocessed Pfam assets.
Clone the repository and create a conda environment:
git clone https://github.com/Jinx-byebye/LineageFlow.git
cd LineageFlow
conda create -n lineageflow python=3.10 -y
conda activate lineageflow
pip install -r requirements.txt
The base requirements cover LineageFlow inference and the Python evaluation utilities. Full structural evaluation also uses external protein-modeling tools:
- HMMER for Pfam profile-HMM family validity.
- MMseqs2 for nearest-neighbor novelty and diversity.
- OmegaFold for predicted fold confidence.
- ESM-IF and PyG dependencies for inverse-folding self-consistency scoring.
For CUDA environments, scripts/setup_fold_eval_env.sh provides a reference setup for the foldability and self-consistency dependencies.
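Before launching the full evaluation, it can help to check that the external command-line tools are on the PATH. The sketch below assumes the usual executable names (`hmmscan` for HMMER, `mmseqs` for MMseqs2, `omegafold` for OmegaFold); which executables the evaluation scripts actually call is not stated here, so adjust the mapping as needed.

```python
import shutil

# Assumed executable names; edit to match your installation.
TOOLS = {
    "HMMER": "hmmscan",
    "MMseqs2": "mmseqs",
    "OmegaFold": "omegafold",
}

def missing_tools(tools=TOOLS):
    """Return the names of tools whose executable is not found on PATH."""
    return [name for name, exe in tools.items() if shutil.which(exe) is None]
```

An empty list from `missing_tools()` means every listed executable was found.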
The released checkpoint and preprocessed Pfam assets are hosted on Hugging Face Hub.
Install the Hugging Face CLI if needed:
pip install -U "huggingface_hub[cli]"
Download the checkpoint:
hf download jinxbye/LineageFlow \
    lineageflow-rp55.ckpt \
    --local-dir checkpoints
Download the preprocessed Pfam assets:
hf download jinxbye/LineageFlow-assets \
    --repo-type dataset \
    --local-dir dataset
Expected layout after download:
checkpoints/lineageflow-rp55.ckpt
dataset/pfam_priors_asr_mad/*.prior.json
dataset/pfam_gap_rates/*.gap.json
dataset/pfam_fastas_clean/*.fasta
dataset/pfam_pi_smooth_tau0.5_gap060_gt80_020.csv
dataset/pfam_priors_keep_ids_gap060_gt80_020.txt
All paths can also be overridden from the command line.
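A quick way to confirm the downloads landed where the defaults expect them is to glob for each entry of the layout above. This is a minimal sketch; the helper name `missing_assets` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Patterns copied from the expected layout, relative to the repository root.
EXPECTED = [
    "checkpoints/lineageflow-rp55.ckpt",
    "dataset/pfam_priors_asr_mad/*.prior.json",
    "dataset/pfam_gap_rates/*.gap.json",
    "dataset/pfam_fastas_clean/*.fasta",
    "dataset/pfam_pi_smooth_tau0.5_gap060_gt80_020.csv",
    "dataset/pfam_priors_keep_ids_gap060_gt80_020.txt",
]

def missing_assets(root="."):
    """Return the expected patterns that match no file under root."""
    root = Path(root)
    return [pattern for pattern in EXPECTED if not any(root.glob(pattern))]
```

Run `missing_assets()` from the repository root; any pattern it returns points to an asset that still needs to be downloaded or whose path must be overridden.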
The default config in config/generation.json uses the released inference settings:
- t_max=6.0, t_int=3.0
- steps_base=100, steps_final=50
- rounds=3, population_size=8, mutate_frac=0.25
- beta=4.0, lam=20.0
- max_families=64, runs_per_family_batch=16
- ESM2-150M rerouting scorer: facebook/esm2_t30_150M_UR50D
- decode=argmax
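When experimenting with modified configs, it is easy to lose track of how far a run has drifted from the released settings. The sketch below compares a loaded config against the values listed above. It assumes the JSON file is a flat object whose keys match the names in the list, which may not hold for the actual file; the scorer entry is left out because its key name is not given here.

```python
import json

# Values copied from the list above; the flat key layout is an assumption.
DEFAULTS = {
    "t_max": 6.0, "t_int": 3.0,
    "steps_base": 100, "steps_final": 50,
    "rounds": 3, "population_size": 8, "mutate_frac": 0.25,
    "beta": 4.0, "lam": 20.0,
    "max_families": 64, "runs_per_family_batch": 16,
    "decode": "argmax",
}

def load_config(path):
    """Read a JSON config file into a dict."""
    with open(path) as fh:
        return json.load(fh)

def diff_from_defaults(cfg):
    """Return {key: (released_value, current_value)} for keys that differ."""
    return {k: (v, cfg.get(k)) for k, v in DEFAULTS.items() if cfg.get(k) != v}
```

For example, `diff_from_defaults(load_config("config/generation.json"))` should be empty for an unmodified file if the flat-key assumption holds.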
Generate a batch of sequences:
python inference/batch_generate.py \
    --config config/generation.json \
    --ckpt checkpoints/lineageflow-rp55.ckpt \
    --num-samples 512 \
    --gpus all \
    --out outputs/lineageflow_samples.fasta
Each FASTA header contains the intended Pfam family, which is required by the family-validity and novelty evaluators.
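To inspect the family mixture of a generated file, the headers can be tallied by Pfam accession. The exact header layout is not documented here, so this sketch only assumes that an accession of the form PF followed by five digits appears somewhere in each header line; `family_counts` is a hypothetical helper.

```python
import re
from collections import Counter

# Pfam accessions look like PF00001; where they sit in the header is assumed.
PFAM_RE = re.compile(r"PF\d{5}")

def family_counts(fasta_lines):
    """Count sequences per Pfam accession found in FASTA header lines."""
    counts = Counter()
    for line in fasta_lines:
        if line.startswith(">"):
            match = PFAM_RE.search(line)
            counts[match.group(0) if match else "unknown"] += 1
    return counts
```

Pass an open file handle, for example `family_counts(open("outputs/lineageflow_samples.fasta"))`; headers without a recognizable accession are counted under "unknown".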
Generate sequences for a single family (PF00000 stands in for the Pfam accession of interest):
python inference/generate.py \
    --config config/generation.json \
    --ckpt checkpoints/lineageflow-rp55.ckpt \
    --prior dataset/pfam_priors_asr_mad/PF00000.prior.json \
    --gap dataset/pfam_gap_rates/PF00000.gap.json \
    --out outputs/PF00000_samples.fasta
Run the full evaluation pipeline:
python evaluation/evaluate_all.py \
    --fasta outputs/lineageflow_samples.fasta \
    --outdir results/eval/lineageflow_samples \
    --hmmdb databases/pfam35/Pfam-A.hmm \
    --target-db results/mmseqs/pfam_train_gap060_gt80_020 \
    --metrics family_validity foldability self_consistency novelty \
    --fold-gpus all \
    --sc-gpus all
Core metrics:
- Family validity: profile-HMM top-1 family accuracy and intended-family hit rate.
- Foldability: OmegaFold mean pLDDT.
- Self-consistency: ESM-IF sequence perplexity conditioned on the predicted backbone.
- Novelty: nearest-neighbor identity to the training corpus via MMseqs2.
- Diversity: MMseqs2 cluster count on foldable generated sequences.
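For readers unfamiliar with the self-consistency number, perplexity is the exponential of the mean negative log-likelihood per residue. The sketch below assumes per-residue natural-log probabilities are already available from an inverse-folding model; how the evaluation scripts aggregate across sequences is not covered here and may differ.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-residue natural-log probabilities of one sequence."""
    if not log_probs:
        raise ValueError("log_probs must be non-empty")
    return math.exp(-sum(log_probs) / len(log_probs))
```

As a sanity check, a model that assigns probability 0.25 to every residue has a perplexity of 4, and lower values mean the sequence is better explained by the predicted backbone.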
To compare against natural sequences with the same family mixture:
python evaluation/sample_training_matched.py \
    --generated-fasta outputs/lineageflow_samples.fasta \
    --pfam-fastas-dir dataset/pfam_fastas_clean \
    --out outputs/training_matched.fasta
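The idea behind a family-matched baseline can be sketched in a few lines: count how many generated sequences target each family, then draw the same number of natural sequences from that family. This is an illustrative sketch, not the logic of sample_training_matched.py; the function name, the in-memory inputs, and the fallback to sampling with replacement are assumptions.

```python
import random
from collections import Counter

def sample_matched(generated_families, natural_by_family, seed=0):
    """Draw natural sequences so each family appears as often as in the generated set."""
    rng = random.Random(seed)
    matched = []
    for family, n in sorted(Counter(generated_families).items()):
        pool = natural_by_family.get(family, [])
        if len(pool) >= n:
            picks = rng.sample(pool, n)        # without replacement
        elif pool:
            picks = rng.choices(pool, k=n)     # small pool: with replacement
        else:
            picks = []                         # family has no natural sequences
        matched.extend((family, seq) for seq in picks)
    return matched
```

Here `generated_families` is a list with one Pfam accession per generated sequence and `natural_by_family` maps each accession to its list of natural sequences.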
Repository layout:
config/       Default inference configuration and config reference
core/         Dirichlet path utilities and vector-field math
dataset/      Pfam family table and prior asset loaders
fitness/      Rerouting fitness functions
inference/    Single-family, batch, and trajectory generation
models/       LineageFlow denoiser architecture
evaluation/   Family validity, foldability, self-consistency, novelty
scripts/      Public setup utilities
assets/       README figures
checkpoints/  Placeholder for released model weights
outputs/      Local generation outputs
If you use LineageFlow, please cite:
@inproceedings{liang2026lineageflow,
  title     = {LineageFlow: Flow Matching for High-Fidelity Family-Aware Protein Sequence Generation},
  author    = {Liang, Langzhang and Yang, Ming and Feng, Yi and Li, Junfan and Pan, Shirui and Xu, Yinghui and Ying, Tianlei and Zheng, Yizhen and Xu, Zenglin},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
  year      = {2026}
}
The same entry is available in citation.bib.
LineageFlow builds on public protein modeling tools and resources including ESM, HMMER, MMseqs2, OmegaFold, and Pfam.
LineageFlow is released under the MIT License.