Jinx-byebye/LineageFlow: Official repository for the ICML 2026 paper LineageFlow: Flow Matching for High-Fidelity Family-Aware Protein Sequence Generation.

LineageFlow

Installation | Checkpoints and data | Quick start | Evaluation | Citation

Official implementation of LineageFlow: Flow Matching for High-Fidelity Family-Aware Protein Sequence Generation, accepted to ICML 2026.

LineageFlow is a family-aware protein sequence generator. It starts from phylogeny-informed ancestral sequence reconstruction (ASR) priors and transports them toward extant sequences with a shared flow-matching denoiser. A single intermediate-time rerouting step performs mutate-select-amplify guidance for objective-aware generation without per-step predictor guidance.
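The mutate-select-amplify idea can be illustrated with a small toy loop. This is a minimal sketch, not the repository's implementation: it operates on plain sequence strings rather than on the flow state at the intermediate time, the mutation operator is hypothetical, and the roles given to `rounds`, `population_size`, `mutate_frac`, and `beta` (names borrowed from the default config) are assumptions.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def mutate(seq, mutate_frac, rng):
    """Resample a random fraction of positions (hypothetical mutation operator)."""
    chars = list(seq)
    n_mut = min(len(chars), max(1, round(mutate_frac * len(chars))))
    for pos in rng.sample(range(len(chars)), n_mut):
        chars[pos] = rng.choice(AMINO_ACIDS)
    return "".join(chars)


def reroute(seq, fitness, rounds=3, population_size=8,
            mutate_frac=0.25, beta=4.0, seed=0):
    """Toy mutate-select-amplify loop over sequence strings."""
    rng = random.Random(seed)
    population = [seq] * population_size
    for _ in range(rounds):
        # Mutate: perturb every member and keep the parents as candidates.
        candidates = population + [mutate(s, mutate_frac, rng) for s in population]
        scores = [fitness(s) for s in candidates]
        # Select: softmax weights, with beta as an inverse temperature (assumed).
        top = max(scores)
        weights = [math.exp(beta * (score - top)) for score in scores]
        # Amplify: resample the population in proportion to those weights.
        population = rng.choices(candidates, weights=weights, k=population_size)
    return max(population, key=fitness)
```

Any callable that maps a sequence to a float can stand in for `fitness`; in the released setting the scorer is an ESM2 model, which this sketch does not load.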

Main Results

Table 1 from the paper compares LineageFlow with released-weight and Pfam-trained baselines under the same evaluation protocol. The released configuration in this repository uses the rp55 checkpoint and the default inference parameters in config/generation.json.

What Is Included

This repository contains the code needed to run the released model and reproduce the evaluation pipeline:

  • Model definition for the LineageFlow denoiser.
  • Batch and single-family sequence generation.
  • Rerouting fitness scorers based on ESM2 masked scoring, prior likelihood, and lightweight heuristics.
  • Evaluation scripts for family validity, foldability, self-consistency, novelty, diversity, and matched natural baselines.
  • Download utilities and path conventions for the released checkpoint and preprocessed Pfam assets.

Installation

    git clone https://github.com/Jinx-byebye/LineageFlow.git
    cd LineageFlow
    conda create -n lineageflow python=3.10 -y
    conda activate lineageflow
    pip install -r requirements.txt

The base requirements cover LineageFlow inference and the Python evaluation utilities. Full structural evaluation also uses external protein-modeling tools:

  • HMMER for Pfam profile-HMM family validity.
  • MMseqs2 for nearest-neighbor novelty and diversity.
  • OmegaFold for predicted fold confidence.
  • ESM-IF and PyG dependencies for inverse-folding self-consistency scoring.

For CUDA environments, scripts/setup_fold_eval_env.sh provides a reference setup for the foldability and self-consistency dependencies.

Checkpoints And Data

The released checkpoint and preprocessed Pfam assets are hosted on Hugging Face Hub.

Install the Hugging Face CLI if needed:

    pip install -U "huggingface_hub[cli]"

Download the checkpoint:

    hf download jinxbye/LineageFlow \
      lineageflow-rp55.ckpt \
      --local-dir checkpoints

Download the preprocessed Pfam assets:

    hf download jinxbye/LineageFlow-assets \
      --repo-type dataset \
      --local-dir dataset

Expected layout after download:

    checkpoints/lineageflow-rp55.ckpt
    dataset/pfam_priors_asr_mad/*.prior.json
    dataset/pfam_gap_rates/*.gap.json
    dataset/pfam_fastas_clean/*.fasta
    dataset/pfam_pi_smooth_tau0.5_gap060_gt80_020.csv
    dataset/pfam_priors_keep_ids_gap060_gt80_020.txt
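A quick way to confirm the download landed where expected is to glob for each entry of the layout. The helper below is a small sketch that is not part of the repository; `missing_assets` is a hypothetical name, and the patterns are copied from the expected layout.

```python
from pathlib import Path

# Paths and glob patterns taken from the expected layout.
EXPECTED_PATTERNS = [
    "checkpoints/lineageflow-rp55.ckpt",
    "dataset/pfam_priors_asr_mad/*.prior.json",
    "dataset/pfam_gap_rates/*.gap.json",
    "dataset/pfam_fastas_clean/*.fasta",
    "dataset/pfam_pi_smooth_tau0.5_gap060_gt80_020.csv",
    "dataset/pfam_priors_keep_ids_gap060_gt80_020.txt",
]


def missing_assets(root="."):
    """Return the patterns that match no file under root."""
    root = Path(root)
    return [pattern for pattern in EXPECTED_PATTERNS if not any(root.glob(pattern))]


if __name__ == "__main__":
    for pattern in missing_assets():
        print(f"missing: {pattern}")
```

An empty result means every expected file or file group has at least one match; it does not verify file contents or completeness of the per-family assets.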

All paths can also be overridden from the command line.

Quick Start

Batch Generation

The default config in config/generation.json uses the released inference setting:

  • t_max=6.0, t_int=3.0
  • steps_base=100, steps_final=50
  • rounds=3, population_size=8, mutate_frac=0.25
  • beta=4.0, lam=20.0
  • max_families=64, runs_per_family_batch=16
  • ESM2-150M rerouting scorer: facebook/esm2_t30_150M_UR50D
  • decode=argmax
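For experiments that vary these settings, it can help to hold the documented defaults in one place and sanity-check overrides. The sketch below mirrors the values listed above; it does not read config/generation.json, whose actual schema may differ. The field name `scorer` and the two validation rules (that `t_int` lies strictly between 0 and `t_max`, and that `mutate_frac` is a fraction) are assumptions.

```python
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class GenerationDefaults:
    """Documented default inference settings (key names may differ in the JSON)."""
    t_max: float = 6.0
    t_int: float = 3.0
    steps_base: int = 100
    steps_final: int = 50
    rounds: int = 3
    population_size: int = 8
    mutate_frac: float = 0.25
    beta: float = 4.0
    lam: float = 20.0
    max_families: int = 64
    runs_per_family_batch: int = 16
    scorer: str = "facebook/esm2_t30_150M_UR50D"  # hypothetical field name
    decode: str = "argmax"


def with_overrides(**overrides):
    """Return the defaults as a dict with overrides applied and checked."""
    # replace() raises TypeError for names that are not fields.
    cfg = replace(GenerationDefaults(), **overrides)
    if not 0.0 < cfg.t_int < cfg.t_max:
        raise ValueError("expected 0 < t_int < t_max (assumed constraint)")
    if not 0.0 < cfg.mutate_frac <= 1.0:
        raise ValueError("expected mutate_frac in (0, 1] (assumed constraint)")
    return asdict(cfg)
```

For example, `with_overrides(beta=2.0)` returns the full settings dict with only `beta` changed.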

Generate a batch of sequences:

    python inference/batch_generate.py \
      --config config/generation.json \
      --ckpt checkpoints/lineageflow-rp55.ckpt \
      --num-samples 512 \
      --gpus all \
      --out outputs/lineageflow_samples.fasta

Each FASTA header contains the intended Pfam family, which is required by the family-validity and novelty evaluators.
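To inspect the family mixture of a generated file, the headers can be scanned for Pfam accessions. The exact header layout written by the generator is not documented here, so this sketch only assumes that an accession of the standard form `PF` followed by five digits appears somewhere in each header; `family_counts` is a hypothetical helper.

```python
import re
from collections import Counter

# Pfam accessions have the form PF + five digits, e.g. PF00000.
PFAM_ACCESSION = re.compile(r"PF\d{5}")


def family_counts(fasta_text):
    """Count sequences per Pfam accession found in FASTA header lines."""
    counts = Counter()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            match = PFAM_ACCESSION.search(line)
            counts[match.group(0) if match else "unknown"] += 1
    return counts
```

Headers without a recognizable accession are tallied under `"unknown"`, which makes malformed entries easy to spot before running the evaluators.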

Single-Family Generation

    python inference/generate.py \
      --config config/generation.json \
      --ckpt checkpoints/lineageflow-rp55.ckpt \
      --prior dataset/pfam_priors_asr_mad/PF00000.prior.json \
      --gap dataset/pfam_gap_rates/PF00000.gap.json \
      --out outputs/PF00000_samples.fasta
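When generating for several families in turn, the per-family paths can be derived from the accession. The helper below is a hypothetical convenience wrapper, not part of the repository; it only assembles the flags and path conventions shown in this README.

```python
def build_generate_command(family,
                           dataset_dir="dataset",
                           ckpt="checkpoints/lineageflow-rp55.ckpt",
                           config="config/generation.json",
                           out_dir="outputs"):
    """Assemble the single-family generation command for one Pfam accession."""
    return [
        "python", "inference/generate.py",
        "--config", config,
        "--ckpt", ckpt,
        "--prior", f"{dataset_dir}/pfam_priors_asr_mad/{family}.prior.json",
        "--gap", f"{dataset_dir}/pfam_gap_rates/{family}.gap.json",
        "--out", f"{out_dir}/{family}_samples.fasta",
    ]
```

The returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.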

Evaluation

Run the full evaluation pipeline:

    python evaluation/evaluate_all.py \
      --fasta outputs/lineageflow_samples.fasta \
      --outdir results/eval/lineageflow_samples \
      --hmmdb databases/pfam35/Pfam-A.hmm \
      --target-db results/mmseqs/pfam_train_gap060_gt80_020 \
      --metrics family_validity foldability self_consistency novelty \
      --fold-gpus all \
      --sc-gpus all

Core metrics:

  • Family validity: profile-HMM top-1 family accuracy and intended-family hit rate.
  • Foldability: OmegaFold mean pLDDT.
  • Self-consistency: ESM-IF sequence perplexity conditioned on the predicted backbone.
  • Novelty: nearest-neighbor identity to the training corpus via MMseqs2.
  • Diversity: MMseqs2 cluster count on foldable generated sequences.
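Two of these metrics reduce to short calculations once the underlying tool output is available. The sketch below is illustrative and independent of the repository's evaluators: `perplexity` uses the standard definition (exponential of the mean negative log-likelihood, natural log), and `nearest_neighbor_identity` assumes tab-separated search hits whose first three columns are query, target, and sequence identity, as in the default MMseqs2 tabular output.

```python
import math


def perplexity(log_probs):
    """Perplexity from per-residue natural-log probabilities."""
    return math.exp(-sum(log_probs) / len(log_probs))


def nearest_neighbor_identity(hit_lines):
    """Highest identity per query from tab-separated hits.

    Assumes columns: query, target, identity, then any further fields.
    """
    best = {}
    for line in hit_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        query, identity = fields[0], float(fields[2])
        best[query] = max(best.get(query, 0.0), identity)
    return best
```

Lower perplexity indicates that the sequence is more probable under the inverse-folding model given the predicted backbone; a lower nearest-neighbor identity indicates a sequence further from the training corpus.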

To compare against natural sequences with the same family mixture:

    python evaluation/sample_training_matched.py \
      --generated-fasta outputs/lineageflow_samples.fasta \
      --pfam-fastas-dir dataset/pfam_fastas_clean \
      --out outputs/training_matched.fasta
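The idea of a family-matched baseline can be sketched as drawing, for each family, as many natural sequences as there are generated ones. This is an illustration of the concept under stated assumptions, not the logic of `sample_training_matched.py`: the function name, the fallback to sampling with replacement for small families, and the in-memory inputs are all hypothetical.

```python
import random
from collections import Counter


def sample_matched(generated_families, natural_by_family, seed=0):
    """Draw natural sequences with the same per-family counts as the generated set.

    generated_families: one family label per generated sequence.
    natural_by_family: dict mapping family label to a list of natural sequences.
    """
    rng = random.Random(seed)
    sampled = []
    for family, n in sorted(Counter(generated_families).items()):
        pool = natural_by_family.get(family, [])
        if len(pool) >= n:
            picks = rng.sample(pool, n)     # without replacement
        elif pool:
            picks = rng.choices(pool, k=n)  # small family: with replacement
        else:
            picks = []                      # no natural sequences available
        sampled.extend((family, seq) for seq in picks)
    return sampled
```

Matching the mixture this way keeps metric differences between generated and natural sets from being driven by which families happen to be represented.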

Repository Layout

    config/        Default inference configuration and config reference
    core/          Dirichlet path utilities and vector-field math
    dataset/       Pfam family table and prior asset loaders
    fitness/       Rerouting fitness functions
    inference/     Single-family, batch, and trajectory generation
    models/        LineageFlow denoiser architecture
    evaluation/    Family validity, foldability, self-consistency, novelty
    scripts/       Public setup utilities
    assets/        README figures
    checkpoints/   Placeholder for released model weights
    outputs/       Local generation outputs

Citation

If you use LineageFlow, please cite:

    @inproceedings{liang2026lineageflow,
      title     = {LineageFlow: Flow Matching for High-Fidelity Family-Aware Protein Sequence Generation},
      author    = {Liang, Langzhang and Yang, Ming and Feng, Yi and Li, Junfan and Pan, Shirui and Xu, Yinghui and Ying, Tianlei and Zheng, Yizhen and Xu, Zenglin},
      booktitle = {Proceedings of the 43rd International Conference on Machine Learning},
      year      = {2026}
    }

The same entry is available in citation.bib.

Acknowledgements

LineageFlow builds on public protein modeling tools and resources including ESM, HMMER, MMseqs2, OmegaFold, and Pfam.

License

LineageFlow is released under the MIT License.
