SERUM extracts finite-state behavioral models from unstructured egocentric video using multi-pass VLM annotation.
This live runner is meant to help you understand the paper by watching SERUM work on a video of your choice. Pick a pre-loaded video or paste a YouTube URL, then watch the labels stream in pass by pass as the model produces them. To keep latency reasonable, the post-hoc analysis stages (normalization, Markov modeling, chart generation) are skipped here; they are available in the dashboard below.
Some restricted YouTube videos may fail to fetch because of anti-scraping protections.
Inference takes several minutes per video. For a 10-minute video, Pass 1 results appear at about 90 seconds and full convergence takes about 18 minutes. Leave the tab open to watch.
Yellow bar = activity pass committed. Green bar = intent pass also committed (frame fully labeled in the current pair). Both clear when the next pair of passes begins. Click a bar (or use ← →) to jump the video to that frame.
The dashboard lets you explore the complete pipeline output (including the post-hoc analyses skipped by the live runner above) across 400+ videos: per-pass refinement chains, FSM graphs synchronized to the video, normalized Markov transition matrices, and confidence / perplexity diagnostics.
If the first link is offline, try the mirror. Some ISPs may block ngrok; visit the HTTP version once to accept the certificate before the HTTPS link will load reliably.
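For readers unfamiliar with the normalized Markov transition matrices shown in the dashboard, the sketch below illustrates the general idea: count how often one per-frame label follows another, then divide each row by its total so the entries read as transition probabilities. This is a minimal illustration under stated assumptions, not SERUM's actual implementation: the function name `transition_matrix`, the first-order (one-step) model, the row-wise normalization, and the example labels are all hypothetical.

```python
from collections import Counter


def transition_matrix(labels):
    """Estimate first-order transition probabilities from a label sequence.

    Hypothetical helper for illustration only; SERUM's own normalization
    and modeling choices may differ.
    """
    states = sorted(set(labels))
    # Count each consecutive (source, destination) pair in the sequence.
    counts = Counter(zip(labels, labels[1:]))
    matrix = {}
    for src in states:
        total = sum(counts[(src, dst)] for dst in states)
        # A state seen only on the final frame has no outgoing transitions,
        # so its row is left as all zeros instead of dividing by zero.
        matrix[src] = {
            dst: (counts[(src, dst)] / total if total else 0.0)
            for dst in states
        }
    return states, matrix


# Hypothetical per-frame activity labels, made up for this example.
frames = ["walk", "walk", "cook", "cook", "cook", "walk"]
states, P = transition_matrix(frames)
print(P["walk"])  # {'cook': 0.5, 'walk': 0.5}
```

Each non-empty row sums to 1, so `P["cook"]["walk"]` can be read as the estimated probability that a frame labeled `cook` is followed by one labeled `walk`.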
@inproceedings{serum2026,
title = {SERUM: State Extraction and Refinement for User Modeling},
author = {Phu, Andy J. and Mooney, James and de Langis, Karin and Le, Khanh Chi and Kang, Dongyeop},
booktitle = {Conference on Language Modeling (CoLM)},
year = {2026},
note = {Under review}
}