PixelCraft: Revolutionizing Visual Reasoning with Structured Images
PixelCraft is a multi-agent system for visual reasoning over structured images such as charts. It combines large multimodal models with traditional computer vision techniques to achieve high-fidelity visual understanding.
PixelCraft operates through a dynamic three-stage workflow. It begins with tool selection, where the system chooses the most appropriate visual tool for the task at hand. This is followed by agent discussion, where the agents collaborate to reason about the image and its contents. Finally, the system engages in self-criticism, allowing it to revisit earlier steps and explore alternative solutions. This process is facilitated by an image memory, which stores earlier visual steps, enabling the planner to explore different reasoning branches.
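The loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PixelCraft's actual implementation: the names `ImageMemory`, `run_workflow`, `select_tool`, `discuss`, and `critique` are hypothetical, images are stood in for by strings, and the three agent roles are replaced by simple stub functions so the control flow is visible.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class MemoryEntry:
    """One stored visual step (hypothetical structure)."""
    image: str              # stand-in for image data, e.g. a file path
    tool: str               # name of the tool that produced this image
    parent: Optional[int]   # index of the step this one branched from


@dataclass
class ImageMemory:
    """Keeps every intermediate image so earlier steps can be revisited."""
    entries: List[MemoryEntry] = field(default_factory=list)

    def add(self, image: str, tool: str, parent: Optional[int]) -> int:
        self.entries.append(MemoryEntry(image, tool, parent))
        return len(self.entries) - 1


def run_workflow(
    image: str,
    tools: Dict[str, Callable[[str], str]],
    select_tool: Callable[[ImageMemory, int], str],
    discuss: Callable[[ImageMemory, int], str],
    critique: Callable[[ImageMemory, int, str], Optional[int]],
    max_rounds: int = 3,
) -> Tuple[Optional[str], ImageMemory]:
    memory = ImageMemory()
    current = memory.add(image, tool="input", parent=None)
    answer = None
    for _ in range(max_rounds):
        # Stage 1: tool selection, applied to the step we are standing on.
        tool_name = select_tool(memory, current)
        new_image = tools[tool_name](memory.entries[current].image)
        step = memory.add(new_image, tool=tool_name, parent=current)
        # Stage 2: agent discussion over the newly produced image.
        answer = discuss(memory, step)
        # Stage 3: self-criticism. None means "accept"; otherwise the
        # critic returns the index of an earlier step to branch from.
        branch_from = critique(memory, step, answer)
        if branch_from is None:
            break
        current = branch_from
    return answer, memory


# Stub agents (assumptions for illustration only).
TOOLS = {"crop": lambda img: img + "|crop", "zoom": lambda img: img + "|zoom"}


def pick_untried_tool(memory: ImageMemory, current: int) -> str:
    tried = {e.tool for e in memory.entries if e.parent == current}
    return "crop" if "crop" not in tried else "zoom"


def describe(memory: ImageMemory, step: int) -> str:
    return "answer via " + memory.entries[step].tool


def reject_crop(memory: ImageMemory, step: int, answer: str) -> Optional[int]:
    entry = memory.entries[step]
    return None if entry.tool == "zoom" else entry.parent
```

With these stubs, calling `run_workflow("chart.png", TOOLS, pick_untried_tool, describe, reject_crop)` first tries `crop`, has that result rejected by the critic, returns to the original image held in memory, and then tries `zoom` as an alternative branch. Storing a parent index per entry is one simple way to let a planner revisit earlier steps; the real system may organize its memory differently.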
At the core of PixelCraft is a grounding model: the researchers construct a high-quality corpus and fine-tune a multimodal large language model on it to produce precise pixel-level localizations. In experiments, this approach yields substantial accuracy gains over standard chain-of-thought prompting on benchmarks such as CharXiv and ChartQAPro, which the researchers present as a new standard for structured image analysis.
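To make the role of pixel-level localization concrete, the sketch below shows how a bounding box from a grounding model might be handed to a conventional image operation such as cropping. It is a minimal illustration under assumptions: `BBox`, `clamp_box`, and `crop` are hypothetical names, the image is a plain list of pixel rows, and the box is written by hand where a grounding model's prediction would normally go.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BBox:
    """Pixel box, half-open: columns [x0, x1) and rows [y0, y1)."""
    x0: int
    y0: int
    x1: int
    y1: int


def clamp_box(box: BBox, width: int, height: int) -> BBox:
    """Clip a predicted box to the image bounds so slicing stays valid."""
    x0 = max(0, min(box.x0, width))
    y0 = max(0, min(box.y0, height))
    x1 = max(x0, min(box.x1, width))
    y1 = max(y0, min(box.y1, height))
    return BBox(x0, y0, x1, y1)


def crop(image: List[List[int]], box: BBox) -> List[List[int]]:
    """Return the sub-image covered by the box (rows of pixel values)."""
    height = len(image)
    width = len(image[0]) if image else 0
    b = clamp_box(box, width, height)
    return [row[b.x0:b.x1] for row in image[b.y0:b.y1]]


# A 4x5 toy "image" whose pixel value encodes its row and column.
image = [[r * 10 + c for c in range(5)] for r in range(4)]
subplot = crop(image, BBox(1, 1, 3, 3))
```

Clamping matters here because a predicted box can extend slightly past the image edge; without it, a downstream tool could silently return an empty or misaligned region.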
In summary, by combining large multimodal models with traditional computer vision techniques in a dynamic three-stage workflow, PixelCraft achieves high-fidelity visual understanding of structured images. Future research directions include improving the automation and verification of tool generation, reducing the reliance on a strong backbone multimodal large language model, and enhancing generalization to diverse chart structures and visual styles.