🎉 ACL 2026
Scholarly writing is a cognitively demanding, non-linear, and multi-intentional process that involves frequent switching between planning, drafting, and revising — contrasting fundamentally with the token-by-token text generation of large language models (LLMs). We introduce ScholaWrite, the first dataset capturing the end-to-end scholarly writing process: nearly 62K LaTeX-based keystroke edits collected from 5 computer science preprints over 4 months, authored by 10 researchers and annotated with 15 fine-grained writing intention categories. We make three contributions:
Below is a single keystroke entry from ScholaWrite. The full dataset is available on its Hugging Face dataset card.
We fine-tuned Llama-3.1-8B-Instruct on ScholaWrite for two tasks: predicting the next writing intention, and generating text edits aligned with that intention. Given a seed document (a LaTeX-formatted abstract from an award-winning NLP paper), the model predicts what a human writer would do next, applies the edit, and repeats — producing a full writing trajectory over 100 iterations. Here is how it unfolds:
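The predict, edit, repeat loop described above can be sketched in a few lines of Python. This is only an illustrative outline, not the project's actual inference code: `run_trajectory`, `Step`, and the two toy functions are hypothetical names, the toy functions stand in for the fine-tuned models, and the labels `add_text` and `trim_text` are placeholders rather than categories from the ScholaWrite taxonomy.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Step:
    """One iteration of the simulated writing trajectory."""
    iteration: int
    intention: str
    text: str


def run_trajectory(
    seed: str,
    predict_intention: Callable[[str], str],
    apply_edit: Callable[[str, str], str],
    iterations: int = 100,
) -> List[Step]:
    """Alternate intention prediction and intention-conditioned editing."""
    text = seed
    steps: List[Step] = []
    for i in range(1, iterations + 1):
        # Task 1: predict what the writer would do next.
        intention = predict_intention(text)
        # Task 2: produce an edit aligned with that intention.
        text = apply_edit(text, intention)
        steps.append(Step(iteration=i, intention=intention, text=text))
    return steps


# Toy stand-ins (assumptions): real use would call the fine-tuned models here.
def toy_predict_intention(text: str) -> str:
    return "add_text" if len(text) < 20 else "trim_text"


def toy_apply_edit(text: str, intention: str) -> str:
    if intention == "add_text":
        return text + " word"
    return text.strip()


if __name__ == "__main__":
    for step in run_trajectory("Seed abstract.", toy_predict_intention,
                               toy_apply_edit, iterations=3):
        print(step.iteration, step.intention, repr(step.text))
```

Passing the two models in as callables keeps the loop itself independent of any particular model or prompt format, so the same driver could be reused with a different predictor or editor.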
ScholaWrite 2.0 is launching! Here's what we aim to do:
Capture the complete research journey — from first idea to finished paper — across multiple academic fields and researchers at different career stages.
Uncover patterns in how researchers search, read, write, and collaborate, including how they use AI tools throughout the process.
Develop an AI assistant that understands researchers' real workflows and can provide support without disrupting their thinking process.
Participants install a lightweight app that records work sessions and connects to everyday tools (Google Docs, Slack, GitHub, etc.). If you'd like to contribute your research activities to this dataset or want to join us in building ScholaWrite 2.0, please fill out this form and we'll reach out shortly!
Our lab has several ongoing projects collecting research workflows — see the full list at minnesotanlp.github.io/ai4sci-data.
@inproceedings{le2026scholawrite,
title={ScholaWrite: A Dataset of End-to-End Scholarly Writing Process},
author={Khanh Chi Le and Linghe Wang and Minhwa Lee and Ross Volkov and Luan Tuyen Chau and Dongyeop Kang},
year={2026},
booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)},
note={To appear},
url={https://arxiv.org/abs/2502.02904},
}