EntityBench — Entity-Consistent Long-Range Multi-Shot Video Generation
EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation
1ByteDance ·
2ByteDance Seed ·
3Rice University
Multi-shot video generation extends single-shot generation to coherent visual narratives, yet maintaining consistent characters, objects, and locations across shots remains a challenge over long sequences. Existing evaluations typically use independently generated prompt sets with limited entity coverage and simple consistency metrics, making standardized comparison difficult. We introduce EntityBench, a benchmark of 140 episodes (2,491 shots) derived from real narrative media, with explicit per-shot entity schedules tracking characters, objects, and locations simultaneously across easy / medium / hard tiers of up to 50 shots, 13 cross-shot characters, 8 cross-shot locations, 22 cross-shot objects, and recurrence gaps spanning up to 48 shots. It is paired with a three-pillar evaluation suite that disentangles intra-shot quality, prompt-following alignment, and cross-shot consistency, with a fidelity gate that admits only accurate entity appearances into cross-shot scoring.
As a baseline, we propose EntityMem, a memory-augmented generation system that stores verified per-entity visual references in a persistent memory bank before generation begins. Experiments show that cross-shot entity consistency degrades sharply with recurrence distance in existing methods, and that explicit per-entity memory yields the highest character fidelity (Cohen's d = +2.33) and presence among methods evaluated.
The Benchmark
A Long-Range Cross-Shot Entity Memory Test
EntityBench scripts are derived from real narrative media, then enriched and validated by LLMs into generation-ready prompts. Each shot ships with an explicit entity_schedule naming the characters, objects, and locations expected to appear, along with cut and continuation transition flags. The three difficulty tiers separate long-range memory load from intra-shot complexity: hard-tier episodes hold per-shot composition roughly constant while pushing recurrence gaps past 30 shots and entity-slot re-appearance rates above 80%.
Recurrence-gap CCDF. The hard tier carries a heavy tail past 30 intervening shots — a long-range stress test absent from prior benchmarks.
Cut / continuation structure. Chains of consecutive non-cut shots extend up to 36, providing transition-fidelity signal at scale.
Per-episode entity counts. An average episode declares 7 characters, 5 locations, and 15 objects.
Per-entity persistence. Most entities appear in 2 shots; anchor entities sustain runs of up to 9 consecutive shots.
Characters are tested most aggressively: 80.3% of all character slots in the benchmark must be rendered from memory, not from a prompt-level description.
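To make the schedule structure concrete, the sketch below shows a hypothetical per-shot record and how a recurrence gap (the number of intervening shots since an entity's previous appearance) can be computed from a sequence of such records. Only `entity_schedule` and the cut/continuation flag are named in the text; the other field names are illustrative assumptions, not the benchmark's exact schema.

```python
# Hypothetical per-shot record; only "entity_schedule" and the transition
# flag are named in the benchmark description -- other fields are assumed.
shot = {
    "shot_id": 12,
    "transition": "cut",  # cut vs. continuation flag
    "entity_schedule": {
        "characters": ["Chloe", "Marcus"],
        "objects": ["the striped t-shirt"],
        "locations": ["The School Classroom"],
    },
    "prompt": "Marcus and Chloe talk near the classroom bookshelves.",
}

def recurrence_gaps(shots):
    """Map each entity to its list of recurrence gaps: the number of
    intervening shots between consecutive appearances."""
    last_seen, gaps = {}, {}
    for i, s in enumerate(shots):
        for kind in ("characters", "objects", "locations"):
            for name in s["entity_schedule"][kind]:
                if name in last_seen:
                    gaps.setdefault(name, []).append(i - last_seen[name] - 1)
                last_seen[name] = i
    return gaps
```

The hard tier's heavy-tailed CCDF corresponds to many gap values above 30 under this definition.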
Evaluation
Three Pillars · 51 Metrics
The evaluation suite asks three progressive questions: Is each shot well-formed in isolation? Does each shot match its prompt? Do shots agree with one another? The pillars build on each other: Pillar 2's per-shot fidelity scores filter the cross-shot pool used in Pillar 3, so cross-shot consistency is measured only on appearances the model rendered correctly.
Figure 1. The full evaluation suite — 3 pillars, 51 metrics, hierarchical and per-entity-type. Pillar 2's per-entity fidelity scores gate admission into Pillar 3.
Pillar 1 · 6 metrics
Intra-shot quality
VBench-style dimensions: subject consistency, temporal flickering, motion smoothness, dynamic degree, aesthetic and imaging quality. Is each shot well-formed in isolation?
Pillar 2 · 24 metrics
Prompt-following alignment
Presence, per-entity fidelity (face / hair / clothing / build / shape / layout / …) and action correctness, scored shot-by-shot.
Pillar 3 · 21 metrics
Cross-shot consistency
DINOv2 centroid similarity for characters and objects, plus LLM pairwise identity judgment on type-specific criteria.
The fidelity gate. A naive cross-shot metric rewards methods that produce nearly static but incorrect renderings: the renderings look similar to each other, so they are scored as "consistent." The fidelity gate admits into the Pillar 3 pool only those (shot, entity) pairs that cleared the Pillar 2 fidelity threshold, so consistency is measured only on appearances in which the entity was rendered correctly in the first place.
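The gate plus the centroid-similarity metric can be sketched as follows. This is a minimal illustration, not the paper's implementation: the threshold value is assumed, and plain vectors stand in for DINOv2 features.

```python
import numpy as np

def gated_centroid_consistency(appearances, threshold=0.5):
    """Cross-shot consistency for one entity under the fidelity gate.

    appearances: list of (shot_id, fidelity_score, embedding) tuples, where
    fidelity_score is the Pillar 2 per-entity score and embedding stands in
    for a DINOv2 feature. The 0.5 threshold is illustrative only.
    """
    # Fidelity gate: only correctly rendered appearances enter the pool.
    pool = [emb for _, fid, emb in appearances if fid >= threshold]
    if len(pool) < 2:
        return None  # consistency is undefined with fewer than two valid appearances
    unit = np.stack([v / np.linalg.norm(v) for v in pool])
    centroid = unit.mean(axis=0)
    centroid /= np.linalg.norm(centroid)
    # Mean cosine similarity of each gated appearance to the entity centroid.
    return float((unit @ centroid).mean())
```

Note how a method that renders the entity wrongly in most shots is not rewarded for those wrong renderings agreeing with each other: they never enter the pool.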
EntityMem
Per-Entity Memory, Established Before Generation
EntityMem stores per-entity visual and textual references in a persistent memory bank before any video generation begins, so each entity's identity is established once and reused consistently throughout the sequence. At generation time, each shot retrieves its entity references independently of the scene in which they previously appeared — disentangling identity from context, and avoiding the autoregressive failure mode where distortions in early shots compound into the reference pool.
01
Entity references
Per-entity portraits and panoramic backgrounds are generated on a chroma-key background, segmented out, and verified by an LLM agent before entering the bank.
02
Keyframe composition
A Layout Agent plans each shot: character positions, camera angle, and how many keyframes are needed to capture the progression of the action.
03
Memory-augmented generation
Labeled portraits and keyframe composites are passed to the video backbone alongside the text prompt, with stored descriptions auto-injected for recurring entities.
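The retrieval pattern behind the three steps above can be sketched as a name-keyed persistent store. The class and field names below are hypothetical illustrations of the design, not EntityMem's actual code.

```python
from dataclasses import dataclass
from typing import Any, Dict, List

@dataclass
class EntityRecord:
    name: str
    description: str       # stored text, auto-injected for recurring entities
    reference: Any = None  # verified, chroma-keyed visual reference (step 01)

class MemoryBank:
    """Persistent per-entity store, populated once before generation begins."""

    def __init__(self) -> None:
        self._records: Dict[str, EntityRecord] = {}

    def register(self, record: EntityRecord) -> None:
        # Entries are added only after LLM verification (step 01).
        self._records[record.name] = record

    def retrieve(self, entity_schedule: List[str]) -> List[EntityRecord]:
        # Lookup is by entity name alone: references never depend on the
        # scene in which an entity previously appeared, so early-shot
        # distortions cannot compound into the reference pool.
        return [self._records[n] for n in entity_schedule if n in self._records]
```

The key design choice this sketch captures is that `retrieve` takes only the shot's entity schedule, never the previously generated frames, which is what disentangles identity from context.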
Experiments
Results on EntityBench
Numbers below are fidelity-gate-corrected means: per-episode scores are weighted by the number of gate-passing instances they contribute, so methods that fail the gate on harder cases are penalised accordingly. Bold values mark the best method in each row.
| Metric | Ours | StoryMem | HoloCine | CineTrans |
| --- | --- | --- | --- | --- |
| subject_consistency | 0.881 | 0.759 | 0.860 | 0.968 |
| temporal_flickering | 0.976 | 0.838 | 0.957 | 0.979 |
| motion_smoothness | 0.988 | 0.849 | 0.964 | 0.990 |
| dynamic_degree | 0.657 | 0.562 | 0.721 | 0.688 |
| aesthetic_quality | 0.593 | 0.475 | 0.518 | 0.596 |
| imaging_quality [0,100] | 66.00 | 56.41 | 49.97 | 68.57 |
| intra_character_presence | 0.967 | 0.849 | 0.882 | 0.796 |
| intra_object_presence | 0.888 | 0.893 | 0.723 | 0.776 |
| intra_location_presence | 0.687 | 0.681 | 0.624 | 0.651 |
| intra_face_fidelity | 0.740 | 0.452 | 0.349 | 0.327 |
| intra_face_face | 0.607 | 0.424 | 0.369 | 0.366 |
| intra_face_hair | 0.684 | 0.485 | 0.482 | 0.413 |
| intra_face_clothing | 0.802 | 0.504 | 0.339 | 0.378 |
| intra_face_build | 0.726 | 0.539 | 0.449 | 0.521 |
| intra_object_fidelity | 0.601 | 0.618 | 0.267 | 0.384 |
| intra_object_shape | 0.712 | 0.701 | 0.373 | 0.508 |
| intra_object_color_texture | 0.691 | 0.709 | 0.331 | 0.480 |
| intra_object_proportions | 0.728 | 0.715 | 0.383 | 0.539 |
| intra_object_details | 0.573 | 0.598 | 0.256 | 0.371 |
| intra_location_fidelity | 0.555 | 0.504 | 0.306 | 0.428 |
| intra_location_layout | 0.603 | 0.529 | 0.354 | 0.474 |
| intra_location_color_mood | 0.706 | 0.627 | 0.474 | 0.588 |
| intra_location_landmarks | 0.562 | 0.522 | 0.305 | 0.429 |
| intra_location_perspective | 0.557 | 0.520 | 0.346 | 0.488 |
| intra_action_overall | 0.618 | 0.547 | 0.569 | 0.273 |
| intra_action_depicted | 0.519 | 0.446 | 0.458 | 0.124 |
| intra_action_subject_identity | 0.706 | 0.595 | 0.606 | 0.478 |
| intra_action_subject_action | 0.697 | 0.626 | 0.695 | 0.323 |
| intra_action_object_interaction | 0.781 | 0.712 | 0.616 | 0.346 |
| intra_action_motion_quality | 0.716 | 0.723 | 0.772 | 0.528 |
| cs_face | 0.737 | 0.792 | 0.751 | 0.772 |
| cs_object | 0.798 | 0.839 | 0.803 | 0.794 |
| cs_transition_boundary | 0.738 | 0.663 | 0.498 | 0.508 |
| llm_face_accuracy | 0.406 | 0.226 | 0.228 | 0.091 |
| llm_face_mean_score | 0.426 | 0.234 | 0.242 | 0.145 |
| llm_face_face | 0.381 | 0.216 | 0.223 | 0.145 |
| llm_face_hair | 0.447 | 0.248 | 0.282 | 0.175 |
| llm_face_clothing | 0.464 | 0.241 | 0.242 | 0.143 |
| llm_face_build | 0.489 | 0.260 | 0.285 | 0.217 |
| llm_object_accuracy | 0.164 | 0.203 | 0.088 | 0.092 |
| llm_object_mean_score | 0.202 | 0.222 | 0.094 | 0.145 |
| llm_object_shape | 0.232 | 0.239 | 0.104 | 0.180 |
| llm_object_color_texture | 0.235 | 0.243 | 0.104 | 0.190 |
| llm_object_proportions | 0.238 | 0.244 | 0.105 | 0.195 |
| llm_object_details | 0.184 | 0.209 | 0.087 | 0.124 |
| llm_scene_accuracy | 0.309 | 0.398 | 0.304 | 0.119 |
| llm_scene_mean_score | 0.659 | 0.671 | 0.616 | 0.432 |
| llm_scene_layout | 0.697 | 0.684 | 0.641 | 0.449 |
| llm_scene_color_mood | 0.716 | 0.724 | 0.669 | 0.619 |
| llm_scene_landmarks | 0.603 | 0.637 | 0.563 | 0.346 |
| llm_scene_perspective | 0.727 | 0.696 | 0.713 | 0.467 |
Bold = best method per row. All values are fidelity-gate-corrected means (imaging_quality on [0,100]; all others on [0,1]).
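The gate-corrected aggregation described above amounts to an instance-weighted mean over episodes. A minimal sketch, assuming each episode contributes a mean score together with its count of gate-passing instances:

```python
def gate_corrected_mean(episodes):
    """episodes: list of (episode_mean_score, n_gate_passing_instances).

    Each episode's score is weighted by how many (shot, entity) instances
    passed the fidelity gate there, so an episode where a method cleared
    the gate on few instances contributes correspondingly little mass.
    """
    total = sum(n for _, n in episodes)
    if total == 0:
        return 0.0  # no gate-passing instances at all
    return sum(score * n for score, n in episodes) / total
```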
Qualitative examples
Qualitative examples from the strongest persistent-memory baseline (StoryMem) and our per-entity memory bank (EntityMem).
Example 1
Shot 1
Shot 3
Shot 4
Shot 7
Shot 8
StoryMem
Ours (EntityMem)
Shot 1.
<registry>
Marcus: Dark-skinned man with a goatee, bare-chested, wearing an open white lab coat, a white cap with a rainbow emblem, and green-rimmed glasses.
Leo: Man with blonde hair wearing an open white lab coat.
Akira: Shirtless young man with spiky red and black hair, wearing a necklace.
Chloe: A young girl with long blonde hair and green eyes.
The School Classroom: Classroom with light green walls, a wooden floor, and bookshelves.
the striped t-shirt: Short-sleeved crew-neck t-shirt with a pattern of horizontal blue and white stripes.
the classroom bookshelves: Bookshelves with framed pictures on the shelves.
<prompts>
Marcus, wearing the white lab coat, and Leo, wearing the striped t-shirt under his white lab coat, are talking with Chloe near the classroom bookshelves. They all turn as Akira walks into the frame from the left to join them.
Shot 3.
<registry>
The Geometric Space: Abstract geometric space with a brown geometric surface and large, static red and yellow graphic shapes and structures.
the white fox pokémon: Small, fox-like Pokémon (Alolan Vulpix) with fluffy white fur that has some reddish-brown coloration, large pointed ears, a curled tuft of fur on its head, and multiple curled tails.
<prompts>
Medium shot in The Geometric Space. Chloe walks over to the red and yellow perch where the white fox pokémon sits. Rohan stands beside her, watching intently as Chloe gently pets the white fox pokémon.
Shot 4.
<registry>
no new entities — all recurring from earlier shots
<prompts>
High-angle medium shot of the white fox pokémon standing on a brown geometric surface in The Geometric Space, its tails twitching slightly. Nearby, Chloe takes a small step toward the white fox pokémon, her eyes widening in awe and her mouth falling slightly open. The large, red and yellow structure of the red and yellow perch is in the background.
Shot 7.
<registry>
the pink flower hair clip: A bright pink flower hair ornament composed of five simple, rounded petals and a small, circular yellow center, all with a smooth appearance and a thin black outline.
the green chalkboard: Large, rectangular chalkboard with a clean, dark green, matte surface.
the yellow electric pokémon: Small, yellow, rodent-like Pokémon with long black-tipped ears, circular red cheeks, and a lightning-bolt-shaped tail.
<prompts>
Chloe, wearing the pink flower hair clip, and Marcus, standing in front of the green chalkboard with the yellow electric pokémon, turn and smile at Akira, who smiles back.
Shot 8.
<registry>
The Open Field: An open outdoor space under a blue sky.
<prompts>
Close-up shot of Leo smiling in The Open Field while wearing the striped t-shirt. The yellow electric pokémon suddenly jumps onto Leo's shoulder.
Example 2
Shot 1
Shot 2
Shot 3
Shot 4
Shot 5
StoryMem
Ours (EntityMem)
Shot 1.
<registry>
Tingo: A large, rounded character in a plush, full-body purple suit. It has a silver rectangular screen on its abdomen and a single antenna on its head shaped like an inverted triangle.
The Playroom: A colorful indoor playroom featuring a bright red floor, a red door, and a large blue slide.
The Hillside Dome: A surreal, dome-shaped house covered in green grass, built into the side of a vibrant green rolling hill dotted with colorful flowers.
<prompts>
Medium shot of Tingo happily dancing and singing in The Playroom.
Shot 2.
<registry>
Sunny: A stylized sun with the smiling face of a baby.
<prompts>
Eye-level medium shot in The Playroom. Chloe smiles in the foreground, sitting amidst the balls in the colourful ball pit. In the background, Tingo, Poppy, Zippy and Gigi jump energetically with their arms raised in the air, and Sunny bobs energetically in the air.
Shot 3.
<registry>
The Sunny Sky: Bright blue sky with fluffy white clouds and a stylized, glowing sun with radiating light beams.
<prompts>
Extreme close-up of Sunny in The Sunny Sky. Sunny smiles and giggles as bright light beams radiate outwards against the blue sky and fluffy clouds.
Shot 4.
<registry>
The Sun Meadow: A surreal outdoor landscape of rolling, bright green hills under a sky that features a stylized sun with a smiling baby's face.
<prompts>
In the vibrant, green landscape of The Sun Meadow, the doors of The Hillside Dome open. Tingo emerges, runs forward, and begins to dance under the sky with its smiling baby-faced sun.
Shot 5.
<registry>
the intricate blue door: Large, round, arched blue door with an intricate, repeating raised pattern of concentric circles and geometric shapes across its surface.
<prompts>
Wide shot of The Hillside Dome set into a rolling green hill in The Sun Meadow. Tingo, Zippy, Poppy and Chloe run towards the house and enter through the intricate blue door. The door closes, then reopens, and the four characters run back out into the sunny landscape one by one.
Example 3
Shot 1
Shot 2
Shot 3
Shot 4
Shot 5
StoryMem
Ours (EntityMem)
Shot 1.
<registry>
Young-ho: Older Korean official with a grey beard and a wrinkled face, in a traditional red and black robe and a wide-brimmed black hat (Gat) with beaded strings.
The Grand Throne Hall: Interior of a traditional Korean palace throne room with a neutral color palette.
Hwan: Older Korean King with a beard, a large, wide-brimmed black hat (Gat) with beaded strings and a gold ornament, and a traditional red King's Robe layered with a dark vest.
the ceremonial blue and pink robe: Traditional Korean official's robe made of a fine, silky fabric with a vibrant light blue or teal body, wide flowing sleeves, and contrasting bright pink fabric on the collar, cuffs, and front panels.
the official's wing-tipped hat: Formal black Joseon-era official's hat (Samo) with a rounded crown and two thin, upright, wing-like projections extending from the back.
<prompts>
Young-ho stands in The Grand Throne Hall before the wooden lattice screen, addressing the unseen Hwan. As he speaks, his expression slowly shifts into a sinister grin.
Shot 2.
<registry>
Do-yun: A man in a traditional turquoise and purple hanbok and a tall, wide-brimmed black hat (gat).
<prompts>
A close-up of Do-yun inside The Grand Throne Hall. He wears the traditional turquoise and purple hanbok and a tall, wide-brimmed black hat (gat). Do-yun's eyes drift downwards, a pensive and somber expression settling on his face.
Shot 3.
<registry>
Jun-seo: Middle-aged Korean man wearing a traditional purple King's Robe and a black hat (Gat).
<prompts>
In The Royal Quarters, surrounded by the dark wood furniture, Mi-kyung, wearing the green and maroon hanbok, sits on the polished wood floor, breathing shakily as she cradles her arm, her gaze fixed downward. Jun-seo takes a slow step closer, looking down at her with a somber and concerned expression.
Shot 4.
<registry>
no new entities — all recurring from earlier shots
<prompts>
Extreme close-up in The Infirmary Wing on the face of Jin-woo, who wears the black headband of the black assassin's garb. He lies on the wooden floor near Hwan with his eyes shut as a single tear rolls down his cheek.
Shot 5.
<registry>
The Snowy Courtyard: A spacious outdoor courtyard of a traditional Korean palace, with the grounds entirely covered in white snow.
<prompts>
In The Snowy Courtyard, Mi-kyung kneels and bows her head deeply in the snow in front of a palace building. Standing together in the foreground, Hwan, Sang-hoon, Do-yun, Jun-seo and Young-ho watch her, their expressions growing more solemn.
Citation
BibTeX
@article{he2026entitybench,
title = {EntityBench: Towards Entity-Consistent Long-Range Multi-Shot Video Generation},
author = {He, Ruozhen and Meng, Wei and Yang, Ziyan and Ordonez, Vicente},
journal = {Preprint},
year = {2026},
}
EntityBench · Rice University & ByteDance
Project page built with vanilla HTML & CSS.