As shown in Figure 2, the ADD pipeline uses agents and specialized tools to decompose in-the-wild images automatically. Background and foreground layers are curated separately and then recombined by the Layered Image Curator, where consistency checks ensure the quality of the final layered compositions.
Learning in-the-wild image layering requires supervision that is rarely available in real photographs. Unlike PSD-style assets with explicitly authored layers, natural images entangle foreground appearance, occlusion, cast shadows, reflections, and illumination changes into a flattened RGB observation. To address this bottleneck, LiWi introduces a multi-agent system that automatically synthesizes high-quality layered samples from in-the-wild images.
Figure 3 summarizes the composition of LiWi-100k across diverse natural scenes and structural layouts. This distribution supports training and evaluation for layer decomposition beyond graphic design templates.
As illustrated in Figure 4, real-world photographs contain cast shadows, illumination variations, and contact darkening that are difficult to assign to ordinary foreground or background layers. LiWi introduces an explicit shadow layer to represent these foreground-induced photometric footprints, which helps the model separate illumination effects from semantic layers and improves color consistency in decomposition.
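To make the role of a separate shadow layer concrete, here is a minimal sketch of how three layers could be recombined into a flattened image. It assumes the shadow layer is a per-pixel darkening strength in [0, 1] that multiplicatively attenuates the background before the foreground is alpha-composited on top. This blending rule and the function name `composite_with_shadow` are illustrative assumptions, not the formulation used by LiWi.

```python
import numpy as np

def composite_with_shadow(background, shadow, fg_rgb, fg_alpha):
    """Recombine background, shadow, and foreground layers (hypothetical rule).

    background: (H, W, 3) RGB in [0, 1]
    shadow:     (H, W, 1) darkening strength in [0, 1] (assumed representation)
    fg_rgb:     (H, W, 3) foreground RGB in [0, 1]
    fg_alpha:   (H, W, 1) foreground opacity in [0, 1]
    """
    # Assumption: the shadow attenuates the background multiplicatively.
    shaded_background = background * (1.0 - shadow)
    # Standard "over" compositing of the foreground on the shaded background.
    return fg_alpha * fg_rgb + (1.0 - fg_alpha) * shaded_background
```

Under this assumed rule, setting `shadow` to zero everywhere reduces the function to ordinary two-layer alpha compositing, which is why a model without a shadow layer has to absorb the darkening into either the foreground or the background.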
Figure 5 shows the degraded boundary refinement process. Natural image layers often suffer from boundary erosion, dilation, or blur around thin structures and object edges. LiWi addresses this with a degradation-restoration objective: the model is trained to recover clean foregrounds from deliberately degraded boundary observations, providing extra supervision for sharper alpha boundaries and more accurate foreground reconstruction.
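The three boundary degradations named above (erosion, dilation, blur) can be illustrated with a small NumPy sketch that corrupts a clean alpha matte to produce a training input. The square window, the `radius` parameter, and the function names are assumptions made for illustration; the source does not specify how LiWi generates its degraded observations.

```python
import numpy as np

def _window_stack(alpha, radius):
    """Stack every (2r+1) x (2r+1) shifted copy of a 2D alpha matte."""
    h, w = alpha.shape
    size = 2 * radius + 1
    padded = np.pad(alpha, radius, mode="edge")
    return np.stack([
        padded[dy:dy + h, dx:dx + w]
        for dy in range(size)
        for dx in range(size)
    ])

def degrade_alpha(alpha, mode, radius=1):
    """Return a degraded copy of a clean (H, W) alpha matte in [0, 1].

    mode: "erode" shrinks the matte, "dilate" grows it,
          "blur" softens its boundary with a box filter.
    """
    windows = _window_stack(alpha, radius)
    if mode == "erode":
        return windows.min(axis=0)
    if mode == "dilate":
        return windows.max(axis=0)
    if mode == "blur":
        return windows.mean(axis=0)
    raise ValueError(f"unknown degradation mode: {mode}")
```

In a degradation-restoration setup of this kind, the degraded matte would be the model input and the clean matte the target, so each training pair supervises exactly the boundary region that the degradation disturbed.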
Table 1 reports the quantitative results on LiWi-100k. The original Qwen-Image-Layered model shows a clear domain gap when transferred from graphic designs to in-the-wild images. Fine-tuning on LiWi data already improves both RGB reconstruction and alpha estimation, while the full LiWi framework further reduces RGB L1 and improves Alpha soft IoU across all edit settings.
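For readers unfamiliar with the two metrics, the sketch below shows one common way to compute them. RGB L1 is taken as the mean absolute error between predicted and reference pixels (lower is better), and soft IoU as the ratio of summed element-wise minima to summed element-wise maxima of the two alpha mattes (higher is better). The exact definitions, masking, and averaging used in the paper may differ, so treat these functions as assumptions rather than the evaluation code.

```python
import numpy as np

def rgb_l1(pred_rgb, target_rgb):
    """Mean absolute error between two RGB arrays of identical shape."""
    return float(np.mean(np.abs(pred_rgb - target_rgb)))

def alpha_soft_iou(pred_alpha, target_alpha, eps=1e-8):
    """Soft IoU for continuous alpha mattes in [0, 1] (assumed min/max form)."""
    intersection = np.minimum(pred_alpha, target_alpha).sum()
    union = np.maximum(pred_alpha, target_alpha).sum()
    # eps keeps the ratio defined when both mattes are entirely empty.
    return float((intersection + eps) / (union + eps))
```

Because the min/max form works directly on fractional opacities, it penalizes soft boundary errors that a thresholded, binary IoU would ignore.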
Table 2 shows that LiWi also outperforms prior methods on the Crello benchmark. Although Crello contains raster graphic designs rather than natural photographs, LiWi retains strong performance, lowering RGB L1 and raising Alpha soft IoU, which shows that the method generalizes beyond the in-the-wild setting.
Table 3 highlights that LiWi's alpha quality transfers to an external benchmark. Even without dedicated segmentation training on DIS-5K, the model remains competitive on fine boundary metrics, consistent with the role of the degradation-restoration objective.
Figure 6 summarizes the visual results of the LiWi framework on the test set. LiWi consistently produces cleaner decompositions for in-the-wild scenes, better preserving foreground completeness while removing residual shadows and other photometric artifacts from the background.
Figure 7 highlights layered samples from the LiWi dataset with two- and three-layer compositions. It illustrates the diversity of foreground-background arrangements supported by the proposed data construction pipeline.
Figure 8 presents a broader view of LiWi-100k across multiple scene types and object layouts. It reflects the coverage of the proposed dataset in terms of natural appearance, structural complexity, and layered composition diversity.
If you find our work interesting, please consider citing our paper:
@misc{he2026liwi,
  title         = {LiWi: Layering in the Wild},
  author        = {He, Yu and Li, Fang and Tong, Haoyang and Ma, Lichen and Shan, Xinyuan and Fu, Jingling and Chen, Dong and Liu, Luohang and Huang, Junshi and Li, Yan},
  year          = {2026},
  eprint        = {2605.14552},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CV},
  doi           = {10.48550/arXiv.2605.14552},
  url           = {https://arxiv.org/abs/2605.14552}
}