Abstract

Scholarly writing is a cognitively demanding, non-linear, and multi-intentional process that involves frequent switching between planning, drafting, and revising — contrasting fundamentally with the token-by-token text generation of large language models (LLMs). We introduce ScholaWrite, the first dataset capturing the end-to-end scholarly writing process: nearly 62K LaTeX-based keystroke edits collected from 5 computer science preprints over 4 months, authored by 10 researchers and annotated with 15 fine-grained writing intention categories. We make three contributions:

  • Dataset: We present ScholaWrite, a curated dataset of nearly 62K LaTeX-based keystrokes captured from documents that were turned into publications in the computer science domain, annotated by experts in linguistics and computer science.
  • Taxonomy & Tools: We develop a taxonomy of scholarly writing intentions, along with collection and annotation tools to support future research.
  • Analysis: Scholarly writing is highly non-linear and multi-intentional — yet current LLMs struggle to align with it. Fine-tuning on ScholaWrite substantially improves model alignment. Our goal is for models to understand the cognitive processes involved in writing and apply them to text generation.

Recorded Writing Process Data

Key Findings

Finding 1: Writing is Non-Linear and Multi-Intentional

Contrary to how LLMs generate text, scholarly writing is not a linear progression. Writers constantly shift between planning, drafting, and revising, often within a single session.

57% of writing sessions involve 3+ distinct intentions.

[Figure: Duration distribution across time]

  • Scholarly writing is inherently multi-intentional and cognitively demanding, with writers constantly juggling multiple goals within a single sitting.
  • While Text Production dominates in frequency (40.3%), less frequent intentions like Idea Generation and Idea Organization involve longer, more cognitively demanding sessions — suggesting that the most important thinking happens in bursts that are easy to overlook.
  • As writing progresses, sessions shift from planning-heavy to revision-heavy, with revisions becoming more frequent but shorter.
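
As a rough illustration of how a statistic like the 57% figure could be computed, the sketch below groups keystroke records into sessions and counts how many sessions contain three or more distinct intention labels. The session rule (a new session after 10 minutes of inactivity), the helper names, and the toy records are all assumptions made for this example; the source does not state how sessions were delimited.

```python
from collections import defaultdict

# Assumption: a new session starts after this much inactivity (not from the source).
SESSION_GAP_MS = 10 * 60 * 1000


def split_sessions(records, gap_ms=SESSION_GAP_MS):
    """Group keystroke records into sessions per (project, author) pair."""
    by_writer = defaultdict(list)
    for rec in records:
        by_writer[(rec["project"], rec["author"])].append(rec)

    sessions = []
    for recs in by_writer.values():
        recs.sort(key=lambda r: r["timestamp"])
        current = [recs[0]]
        for prev, rec in zip(recs, recs[1:]):
            # A long pause closes the current session and opens a new one.
            if rec["timestamp"] - prev["timestamp"] > gap_ms:
                sessions.append(current)
                current = []
            current.append(rec)
        sessions.append(current)
    return sessions


def share_multi_intention(sessions, min_labels=3):
    """Fraction of sessions containing at least `min_labels` distinct labels."""
    if not sessions:
        return 0.0
    hits = sum(1 for s in sessions if len({r["label"] for r in s}) >= min_labels)
    return hits / len(sessions)


# Toy records invented for illustration; they are not taken from the dataset.
toy = [
    {"project": 0, "author": 0, "timestamp": 0, "label": "Text Production"},
    {"project": 0, "author": 0, "timestamp": 60_000, "label": "Idea Generation"},
    {"project": 0, "author": 0, "timestamp": 120_000, "label": "Linguistic Style"},
    {"project": 0, "author": 0, "timestamp": 3_600_000, "label": "Text Production"},
]
sessions = split_sessions(toy)
print(len(sessions), share_multi_intention(sessions))
```

The result depends heavily on the chosen inactivity gap, so any comparison with the reported figure would need the session definition used in the paper.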

ScholaWrite Dataset

Below is a single keystroke entry from ScholaWrite. The full dataset is available on the Hugging Face data card.

    {
      project: 0,
      timestamp: 1700506686410,
      author: 0,
      "before text": "% Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
      % For authors from different institutions:
      % author{anonymous}\\ Address line \\ ... \\ Address line
      % And ... And
      % Author n \\ Address line \\ ... \\ Address line}
      % To start a seperate `row'' of authors use AND, as in
      % author{anonymous}\\ Address line \\ ... \\ Address line
      % AND
      % Author 2 \\ Address line \\ ... \\ Address line And
      % Author 3 \\ Address line \\ ... \\ Address line}

      author{anonymous}\\
      Affiliation / Address line 1 \\
      Affiliation / Address line 2 \\
      Affiliation / Address line 3 \\
      \texttt{email@domain} \\And
      Second Author \\
      Affiliation / Address line 1 \\
      Affiliation / Address line 2 \\
      Affiliation / Address line 3 \\
      \texttt{email@domain} \\}

      \begin{document}
      maketitle
      \begin{abstract}
      Style is an important component of text that expresses a diverse set of information, including interpersonal dynamics (e.g. formality) and the author's emotions or attitudes (e.g. disgust). Writers constantly incorporate style -- and oftentimes, multiple styles -- into their writing. In order for generative language models to be useful in a wide variety of situations, these models should also be able to control and weave together styles when generating text. Previous work investigates reinforcement learning (RL) approaches for controlled generation of a single style, or else controlled generation for multiple attributes. In this paper, we investigate expanding this into controlling for \textbf{multiple} styles simultaneously. Our baseline is a plug-and-play approach. Our results indicate that plug-and-play does not satisfactorily solve the multi-style controlled generation problem, and that a straightforward RL approach can achieve strong results. We also explore the trade-off between training time and accuracy between plug-and-play and fune-tuning approaches for SoTA models.
      end{abstract}

      section{Introduction}
      Writers can apply styles to text to convey a variety of information citep{hovy1987generating,silverstein2003indexical,block2015social,kang2021style}. Styles can convey both information about the writer (e.g. their attitudes or demographic traits) and the writer's interpersonal relationship or goals with respect to the reader (e.g. respectful or threatening language). Following previous work, we consider each individual aspect of these stylistic goals – i.e. each unique attitude, demographic attribute, interpersonal relationship goal – to be an individual style.

      Stylistic information is a common and crucial component of communication: in fact, a text's style can convey a variety of information not included in the text's raw semantic content citep{hovy1995multifunctionality}.
      Consequently, it is vital that large language models are well-equipped to understand and apply styles themselves.
      Progress has been made in the domain of controlled generation, in which the goal is for a generative language model to generate text of a specified style.",
      "after text": "% Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
      % For authors from different institutions:
      % author{anonymous}\\ Address line \\ ... \\ Address line
      % And ... And
      % Author n \\ Address line \\ ... \\ Address line}
      % To start a seperate ``row'' of authors use AND, as in
      % author{anonymous}\\ Address line \\ ... \\ Address line
      % AND
      % Author 2 \\ Address line \\ ... \\ Address line And
      % Author 3 \\ Address line \\ ... \\ Address line}

      author{anonymous}\\
      Affiliation / Address line 1 \\
      Affiliation / Address line 2 \\
      Affiliation / Address line 3 \\
      \texttt{email@domain} \\And
      Second Author \\
      Affiliation / Address line 1 \\
      Affiliation / Address line 2 \\
      Affiliation / Address line 3 \\
      \texttt{email@domain} \\}

      \begin{document}
      maketitle
      \begin{abstract}
      Style is an in component of text that expresses a diverse set of information, including interpersonal dynamics (e.g. formality) and the author's emotions or attitudes (e.g. disgust). Writers constantly incorporate style -- and oftentimes, multiple styles -- into their writing. In order for generative language models to be useful in a wide variety of situations, these models should also be able to control and weave together styles when generating text. Previous work investigates reinforcement learning (RL) approaches for controlled generation of a single style, or else controlled generation for multiple attributes. In this paper, we investigate expanding this into controlling for \textbf{multiple} styles simultaneously. Our baseline is a plug-and-play approach. Our results indicate that plug-and-play does not satisfactorily solve the multi-style controlled generation problem, and that a straightforward RL approach can achieve strong results. We also explore the trade-off between training time and accuracy between plug-and-play and fune-tuning approaches for SoTA models.
      end{abstract}

      section{Introduction}
      Writers can apply styles to text to convey a variety of information citep{hovy1987generating,silverstein2003indexical,block2015social,kang2021style}. Styles can convey both information about the writer (e.g. their attitudes or demographic traits) and the writer's interpersonal relationship or goals with respect to the reader (e.g. respectful or threatening language). Following previous work, we consider each individual aspect of these stylistic goals – i.e. each unique attitude, demographic attribute, interpersonal relationship goal – to be an individual style.

      Stylistic information is a common and crucial component of communication: in fact, a text's style can convey a variety of information not included in the text's raw semantic content citep{hovy1995multifunctionality}.
      Consequently, it is vital that large language models are well-equipped to understand and apply styles themselves.
      Progress has been made in the domain of controlled generation, in which the goal is for a generative language model to generate text of a specified style. ",
      label: "Linguistic Style",
      high_level: "REVISION",
    }
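
Because each entry stores full before and after snapshots rather than the keystroke itself, the actual edit has to be recovered by comparing the two strings. The sketch below is one possible way to do that with Python's standard `difflib`, diffing at the token level; the function name `changed_spans` and the shortened example strings are assumptions made for illustration, not part of the dataset tooling.

```python
import difflib
import re


def changed_spans(before, after):
    """Return (op, removed, inserted) for every non-equal span between two snapshots."""
    # Tokenize into runs of whitespace and runs of non-whitespace so that
    # joining the tokens reproduces the original string exactly.
    a = re.findall(r"\s+|\S+", before)
    b = re.findall(r"\s+|\S+", after)
    matcher = difflib.SequenceMatcher(None, a, b, autojunk=False)
    return [
        (op, "".join(a[i1:i2]), "".join(b[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]


# Shortened stand-ins for the "before text" and "after text" fields of the entry above.
before = "Style is an important component of text"
after = "Style is an in component of text"
print(changed_spans(before, after))
# → [('replace', 'important', 'in')]
```

Token-level diffing keeps the result readable for word-sized edits; a character-level diff of the same strings would split the change into several small deletions.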

Iterative Self-Writing

We fine-tuned Llama-3.1-8B-Instruct on ScholaWrite for two tasks: predicting the next writing intention, and generating text edits aligned with that intention. Given a seed document (a LaTeX-formatted abstract from an award-winning NLP paper), the model predicts what a human writer would do next, applies the edit, and repeats — producing a full writing trajectory over 100 iterations. Here is how it unfolds:
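
As a rough sketch of that loop, the code below alternates between predicting an intention and applying an edit for a fixed number of iterations. The two callables are hypothetical stand-ins for the fine-tuned model (the real prompts, decoding settings, and interfaces are not given here), and the toy implementations exist only so the loop can run.

```python
from typing import Callable, List, Tuple

# Hypothetical interfaces: in the setup described above both roles are played by
# a fine-tuned Llama-3.1-8B-Instruct; here they are plain Python callables.
PredictIntention = Callable[[str], str]
ApplyEdit = Callable[[str, str], str]


def self_write(
    seed: str,
    predict_intention: PredictIntention,
    apply_edit: ApplyEdit,
    iterations: int = 100,
) -> List[Tuple[str, str]]:
    """Run the predict-then-edit loop and return (intention, text) per step."""
    trajectory = []
    text = seed
    for _ in range(iterations):
        intention = predict_intention(text)  # what would a writer do next?
        text = apply_edit(text, intention)   # carry out that intention
        trajectory.append((intention, text))
    return trajectory


# Toy stand-ins (assumptions for illustration only).
def toy_predict(text: str) -> str:
    return "Text Production" if len(text) < 40 else "Linguistic Style"


def toy_edit(text: str, intention: str) -> str:
    if intention == "Text Production":
        return text + " More text."
    return text.replace("More", "Further", 1)


steps = self_write("Seed abstract.", toy_predict, toy_edit, iterations=4)
for intention, text in steps:
    print(f"{intention}: {text}")
```

Keeping the intention predictor and the editor as separate callables mirrors the two fine-tuning tasks and makes it easy to swap either one out when comparing models.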

Sample Inference Trajectories

Data Collection / Annotation Tools

Chrome Extension Tutorial


We designed and implemented a Chrome extension that collects keystroke trajectories in real time on the Overleaf platform without disturbing participants' writing process. You can browse the extension code through this link.


Call for Participation

ScholaWrite 2.0 is launching! Here's what we aim to do:

  • Capture the complete research journey — from first idea to finished paper — across multiple academic fields and researchers at different career stages.

  • Uncover patterns in how researchers search, read, write, and collaborate, including how they use AI tools throughout the process.

  • Develop an AI assistant that understands researchers' real workflows and can provide support without disrupting their thinking process.

Participants install a lightweight app that records work sessions and connects to everyday tools (Google Docs, Slack, GitHub, etc.). If you'd like to contribute your research activities to this dataset or want to join us in building ScholaWrite 2.0, please fill out this form and we'll reach out shortly!

Our lab has several ongoing projects collecting research workflows — see the full list at minnesotanlp.github.io/ai4sci-data.

BibTeX

@inproceedings{le2026scholawrite,
  title={ScholaWrite: A Dataset of End-to-End Scholarly Writing Process},
  author={Khanh Chi Le and Linghe Wang and Minhwa Lee and Ross Volkov and Luan Tuyen Chau and Dongyeop Kang},
  year={2026},
  booktitle={Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL 2026)},
  note={To appear},
  url={https://arxiv.org/abs/2502.02904},
}