Stealing Creator's Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation

Jong Inn Park

Maanas Taneja

Qianwen Wang

Dongyeop Kang

Minnesota NLP Lab, University of Minnesota


Our framework's source code will be publicly available soon!


Agentic Framework

Figure 1. Conceptual overview of the multi-agent video generation pipeline.

Figure 2. Detailed workflow on how generation agents contribute to scene composition.

Multi-Agent Workflow: SciTalk mimics a human content creator's pipeline by orchestrating a set of specialized LLM-based agents across four stages—Preprocessing, Planning, Editing, and Feedback & Evaluation. Each agent is responsible for a distinct subtask, such as summarization (Flashtalk Generator), scene design (Sceneplan Generator), text and layout (Text Assistant and Layout Allocator), and background selection.
Grounding in Source Material: During the Preprocessing and Planning stages, SciTalk scrapes scientific papers (e.g., from arXiv) to extract text, images, and figures. These are reused throughout the pipeline to ensure factual accuracy and visual grounding, rather than relying on generative visuals.
Iterative Feedback Loop: After the initial video composition, multimodal Feedback Agents (e.g., Flashtalk Feedback Agent, Sceneplan Feedback Agent) evaluate specific components using structured metrics. Reflection Agents then revise the original generation prompts based on this feedback, allowing each agent to improve its output in subsequent iterations.
Final Evaluation & Refinement: A separate Evaluation Agent provides an end-to-end quality check using human-aligned metrics, ensuring improvements across content accuracy, clarity, visual synchronization, and engagement. This setup mirrors how creators iteratively refine videos, resulting in progressively enhanced outputs.
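The generate, feedback, and reflect cycle described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not SciTalk's actual code: the `Agent` dataclass, `run_feedback_loop`, and the toy callables are hypothetical names, and each callable stands in for what would be an LLM call in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Agent:
    """Hypothetical container pairing a generation agent with its feedback and reflection steps."""
    name: str
    prompt: str
    generate: Callable[[str, str], str]   # (prompt, source_text) -> output
    give_feedback: Callable[[str], str]   # output -> feedback text
    reflect: Callable[[str, str], str]    # (prompt, feedback) -> revised prompt
    history: List[str] = field(default_factory=list)  # prompt used in each iteration


def run_feedback_loop(agents: List[Agent], source_text: str, iterations: int) -> Dict[str, str]:
    """Run every agent, collect feedback on its output, and revise its prompt for the next round."""
    outputs: Dict[str, str] = {}
    for _ in range(iterations):
        # Generation pass: every agent produces its component of the draft.
        for agent in agents:
            agent.history.append(agent.prompt)
            outputs[agent.name] = agent.generate(agent.prompt, source_text)
        # Feedback and reflection pass: each prompt is revised using only
        # the feedback on that agent's own component.
        for agent in agents:
            feedback = agent.give_feedback(outputs[agent.name])
            agent.prompt = agent.reflect(agent.prompt, feedback)
    return outputs


# Toy stand-ins (assumptions for illustration only; no LLM is called).
def toy_generate(prompt: str, source_text: str) -> str:
    return f"{prompt} :: {source_text}"


def toy_feedback(output: str) -> str:
    return "be shorter"


def toy_reflect(prompt: str, feedback: str) -> str:
    return f"{prompt} [{feedback}]"
```

In this sketch the outputs returned are those of the last generation pass, so the final prompt revision is stored on each agent but not yet used.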

Iterative Feedback Loop

Example reflections with the paper: Speed Is All You Need: On-Device Acceleration of Large Diffusion Models via GPU-Aware Optimizations

Panels (each showing feedback and reflection): Flashtalk Generator, Sceneplan Generator, Text Assistant. Legend: inserted text, deleted text, matched feedback sentence.

Prompt-Level Feedback Integration: After each video draft is evaluated, SciTalk revises each agent's generation prompt using only the feedback relevant to that agent. This keeps refinement targeted: for example, the Flashtalk Generator reflects only on comments about engagement, while the Sceneplan Generator adjusts visual structure.
Frame-by-Frame Diff Mapping: We visualize how feedback influences prompt changes by computing token-level diffs between prompt versions and aligning each diff frame with the feedback sentences it addresses.
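A token-level diff between two prompt versions can be sketched with Python's standard `difflib` module. This is an illustrative assumption about how such a diff might be computed, not the tooling used for the visualizations on this page. The word-overlap matching in `match_feedback` is a simple placeholder heuristic, and both function names are hypothetical.

```python
import difflib
import re
from typing import List, Tuple


def token_diff(old_prompt: str, new_prompt: str) -> List[Tuple[str, str]]:
    """Return (tag, text) pairs, where tag is 'equal', 'delete', or 'insert'."""
    old, new = old_prompt.split(), new_prompt.split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    changes: List[Tuple[str, str]] = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            changes.append(("equal", " ".join(old[i1:i2])))
            continue
        # A 'replace' opcode is reported as a deletion followed by an insertion.
        if tag in ("delete", "replace"):
            changes.append(("delete", " ".join(old[i1:i2])))
        if tag in ("insert", "replace"):
            changes.append(("insert", " ".join(new[j1:j2])))
    return changes


def match_feedback(changed_text: str, feedback_sentences: List[str]) -> str:
    """Pick the feedback sentence that shares the most words with the changed text."""
    words = set(re.findall(r"\w+", changed_text.lower()))
    return max(
        feedback_sentences,
        key=lambda s: len(words & set(re.findall(r"\w+", s.lower()))),
    )
```

For example, diffing "Write a summary of the paper" against "Write a short engaging summary of the paper" yields a single insertion, "short engaging", which the heuristic would pair with whichever feedback sentence mentions those words most often.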

Video Examples

BibTeX

@misc{park2025stealingcreatorsworkflowcreatorinspired,
        title={Stealing Creator's Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation}, 
        author={Jong Inn Park and Maanas Taneja and Qianwen Wang and Dongyeop Kang},
        year={2025},
        eprint={2504.18805},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2504.18805}, 
}