Stealing Creator's Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation


Minnesota NLP Lab, University of Minnesota

Agentic Framework

  • Multi-Agent Workflow: SciTalk mimics a human content creator's pipeline by orchestrating a set of specialized LLM-based agents across four stages—Preprocessing, Planning, Editing, and Feedback & Evaluation. Each agent is responsible for a distinct subtask, such as summarization (Flashtalk Generator), scene design (Sceneplan Generator), text and layout (Text Assistant and Layout Allocator), and background selection.
  • Grounding in Source Material: During the Preprocessing and Planning stages, SciTalk scrapes scientific papers (e.g., from arXiv) to extract text, images, and figures. The extracted assets are reused throughout the pipeline to keep the video factually accurate and visually grounded, rather than relying on generated visuals.
  • Iterative Feedback Loop: After the initial video composition, multimodal Feedback Agents (e.g., Flashtalk Feedback Agent, Sceneplan Feedback Agent) evaluate specific components using structured metrics. Reflection Agents then revise the original generation prompts based on this feedback, allowing each agent to improve its output in subsequent iterations.
  • Final Evaluation & Refinement: A separate Evaluation Agent performs an end-to-end quality check using human-aligned metrics that cover content accuracy, clarity, visual synchronization, and engagement. This setup mirrors how creators iteratively refine videos, so outputs improve progressively across iterations.
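
The generate, feedback, and reflect cycle described above can be sketched roughly as follows. This is a minimal illustration, not SciTalk's actual implementation: the `AgentLoop` class, the `run_iterations` helper, and the toy lambda stubs are all hypothetical names, and the stubs stand in for what would be LLM calls in a real system.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class AgentLoop:
    """One generator agent plus its feedback and reflection helpers (all hypothetical)."""
    name: str
    prompt: str
    generate: Callable[[str], str]       # prompt -> draft output
    feedback: Callable[[str], str]       # draft output -> feedback text
    reflect: Callable[[str, str], str]   # (prompt, feedback) -> revised prompt
    prompt_history: List[str] = field(default_factory=list)

    def step(self) -> str:
        """Generate a draft, collect feedback on it, then revise the prompt."""
        draft = self.generate(self.prompt)
        notes = self.feedback(draft)
        self.prompt_history.append(self.prompt)
        self.prompt = self.reflect(self.prompt, notes)
        return draft


def run_iterations(agents: List[AgentLoop], n_iterations: int) -> Dict[str, str]:
    """Run every agent for n_iterations and return each agent's latest draft."""
    latest: Dict[str, str] = {}
    for _ in range(n_iterations):
        for agent in agents:
            latest[agent.name] = agent.step()
    return latest


# Toy stubs standing in for LLM calls (assumptions, for illustration only).
flashtalk = AgentLoop(
    name="flashtalk",
    prompt="Summarize the paper.",
    generate=lambda prompt: f"script from [{prompt}]",
    feedback=lambda draft: "add a hook",
    reflect=lambda prompt, notes: f"{prompt} | {notes}",
)
drafts = run_iterations([flashtalk], n_iterations=2)
```

Keeping the prompt history on each agent makes it possible to compare prompt versions later, which is what the diff visualization in the next section relies on.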

Iterative Feedback Loop

Flashtalk Generator Feedback Reflection

Sceneplan Generator Feedback Reflection

Text Assistant Feedback Reflection

  • Prompt-Level Feedback Integration: After each video draft is evaluated, SciTalk revises the generation prompts using only the feedback relevant to each agent. This keeps refinement targeted: for example, the Flashtalk Generator reflects only on comments about engagement, while the Sceneplan Generator adjusts visual structure.
  • Frame-by-Frame Diff Mapping: We visualize how feedback influences prompt changes by computing token-level diffs between prompt versions and aligning each diff frame with the feedback sentences it addresses.
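
As a rough illustration of the diff mapping, the sketch below computes a token-level diff between two prompt versions with Python's standard `difflib`, then pairs each change with a feedback sentence. The alignment step is an assumption made for this example: it uses a crude word-overlap heuristic, and the page does not say how SciTalk performs the alignment. The function names `token_diff` and `align_to_feedback` and the sample prompts are hypothetical.

```python
import difflib
import re
from typing import List, Optional, Set, Tuple

Change = Tuple[str, List[str], List[str]]  # (operation, removed tokens, added tokens)


def token_diff(old_prompt: str, new_prompt: str) -> List[Change]:
    """Return every non-equal span between two whitespace-tokenized prompts."""
    old, new = old_prompt.split(), new_prompt.split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (tag, old[i1:i2], new[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


def _words(tokens: List[str]) -> Set[str]:
    """Lowercase tokens and strip surrounding punctuation for overlap matching."""
    return {t.lower().strip(".,!?") for t in tokens}


def align_to_feedback(
    changes: List[Change], feedback: str
) -> List[Tuple[Change, Optional[str]]]:
    """Pair each change with the feedback sentence sharing the most words with it.

    Assumption: word overlap is only a stand-in heuristic for the real alignment.
    """
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", feedback.strip()) if s]
    pairs = []
    for change in changes:
        _, removed, added = change
        changed_words = _words(removed + added)
        best = max(
            sentences,
            key=lambda s: len(changed_words & _words(s.split())),
            default=None,
        )
        pairs.append((change, best))
    return pairs


old = "Write a short summary of the paper"
new = "Write a short engaging summary of the paper with a hook"
feedback = "The script is not engaging enough. Open with a hook to grab attention."

changes = token_diff(old, new)
aligned = align_to_feedback(changes, feedback)
```

Here `changes` holds two insertions (`engaging` and `with a hook`), and each one is paired with the feedback sentence that mentions the same words.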

Video Examples

BibTeX

@misc{park2025stealingcreatorsworkflowcreatorinspired,
        title={Stealing Creator's Workflow: A Creator-Inspired Agentic Framework with Iterative Feedback Loop for Improved Scientific Short-form Generation}, 
        author={Jong Inn Park and Maanas Taneja and Qianwen Wang and Dongyeop Kang},
        year={2025},
        eprint={2504.18805},
        archivePrefix={arXiv},
        primaryClass={cs.CL},
        url={https://arxiv.org/abs/2504.18805}, 
}