DoolyProf extracts all operations present in an LLM inference forward pass and efficiently profiles only those absent from its latency database. The resulting latency models can be used in downstream LLM inference tasks such as simulation (e.g., DoolySim, Vidur) or in latency-prediction-based schedulers (e.g., llm-d).
📝 Paper: Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
📋 Update Logs: Link to Update Logs
Figure: Dooly system architecture overview and key ideas.
Selecting the optimal LLM inference configuration requires evaluation across hardware, serving engines, attention backends, and model architectures, since no single choice performs best across all workloads. Profile-based simulators are the standard tool for this exploration, yet they hardcode their operation set to a specific configuration and re-profile every operation from scratch, making configuration sweeps prohibitively expensive. Existing approaches treat each (model, engine, backend, hardware) combination as a fresh profiling target, ignoring the substantial overlap across configurations.
Dooly takes a different approach: it exploits structural redundancy in LLM inference to achieve configuration-agnostic, redundancy-aware profiling, profiling each unique operation only once and reusing it across all configurations that share it.
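The "profile once, reuse everywhere" idea can be illustrated with a small lookup-before-profile cache. This is a minimal sketch under stated assumptions, not Dooly's actual implementation: `LatencyDB`, `op_key`, `fake_profile`, and the dimension names and values are all hypothetical and chosen only for illustration. An operation is keyed by its name plus the input dimensions that are fixed by the model configuration, and it is profiled only when that key is absent from the database.

```python
from typing import Callable, Dict, Tuple

# Hypothetical key: operation name plus its configuration-fixed dimensions.
OpKey = Tuple[str, Tuple[Tuple[str, int], ...]]


def op_key(name: str, fixed_dims: Dict[str, int]) -> OpKey:
    """Build an order-independent, hashable key for an operation."""
    return (name, tuple(sorted(fixed_dims.items())))


class LatencyDB:
    """Toy latency database that profiles an operation only on a miss."""

    def __init__(self) -> None:
        self._store: Dict[OpKey, float] = {}
        self.profile_calls = 0  # counts how often real profiling was needed

    def get_or_profile(self, key: OpKey, profile_fn: Callable[[OpKey], float]) -> float:
        if key not in self._store:
            self._store[key] = profile_fn(key)
            self.profile_calls += 1
        return self._store[key]


def fake_profile(key: OpKey) -> float:
    """Stand-in for a real measurement; returns a dummy latency."""
    return 1.0


db = LatencyDB()

# Two hypothetical model configurations that share the same attention shape
# but differ in the MLP intermediate size.
model_a = [
    op_key("attention", {"head_size": 128, "num_heads": 32}),
    op_key("mlp", {"hidden": 4096, "intermediate": 11008}),
]
model_b = [
    op_key("attention", {"num_heads": 32, "head_size": 128}),
    op_key("mlp", {"hidden": 4096, "intermediate": 14336}),
]

for key in model_a + model_b:
    db.get_or_profile(key, fake_profile)

# Four lookups, but the shared attention operation is profiled only once.
print(db.profile_calls)
```

In this toy setup the second configuration reuses the attention entry recorded for the first, so only its differing MLP operation triggers a new measurement.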
Our core insight is that every input dimension of every operation in an LLM forward pass is either fixed by the model configuration or determined by the incoming request, and that model-configuration values (head size, layer count, etc.) recur heavily across model families. Dooly exploits this through three key ideas:
Refer to the individual repositories for installation instructions.
See individual repositories for license information.