We are recruiting for ScholaWrite 2.0.
We're studying how researchers work, from first idea to finished paper.

Abstract

Scholarly writing is a cognitively demanding, non-linear, and multi-intentional process that involves frequent switching between planning, drafting, and revising — contrasting fundamentally with the token-by-token text generation of large language models (LLMs). We introduce ScholaWrite, the first dataset capturing the end-to-end scholarly writing process: nearly 62K LaTeX-based keystroke edits collected from 5 computer science preprints over 4 months, authored by 10 researchers and annotated with 15 fine-grained writing intention categories. We make three contributions:

Dataset: We present ScholaWrite, a curated dataset of nearly 62K LaTeX-based keystrokes captured from documents that were turned into publications in the computer science domain, annotated by experts in linguistics and computer science.
Taxonomy & Tools: We develop a taxonomy of scholarly writing intentions, along with collection and annotation tools to support future research.
Analysis: Scholarly writing is highly non-linear and multi-intentional — yet current LLMs struggle to align with it. Fine-tuning on ScholaWrite substantially improves model alignment. Our goal is for models to understand the cognitive processes involved in writing and apply them to text generation.

Recorded Writing Process Data

Project 1, Project 2, Project 3, Project 4, Project 5

Key Findings

Finding 1: Writing is Non-Linear and Multi-Intentional

Contrary to how LLMs generate text, scholarly writing is not a linear progression. Writers constantly shift between planning, drafting, and revising, often within a single session.

57% of writing sessions involve 3+ distinct intentions.

Scholarly writing is inherently multi-intentional and cognitively demanding, with writers constantly juggling multiple goals within a single sitting.
While Text Production dominates in frequency (40.3%), less frequent intentions like Idea Generation and Idea Organization involve longer, more cognitively demanding sessions — suggesting that the most important thinking happens in bursts that are easy to overlook.
As writing progresses, sessions shift from planning-heavy to revision-heavy, with revisions becoming more frequent but shorter.
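The share of multi-intention sessions can be computed with a few lines of Python. This is a minimal sketch, not the authors' analysis code: it assumes each edit has already been assigned to a session, and the `(session_id, intention)` pair format, the function name `multi_intention_share`, and the toy data are all hypothetical.

```python
from collections import defaultdict


def multi_intention_share(labelled_edits, min_distinct=3):
    """Fraction of sessions containing at least `min_distinct` distinct intentions.

    `labelled_edits` is an iterable of (session_id, intention) pairs.
    How edits are grouped into sessions is an assumption left to the caller.
    """
    intentions_by_session = defaultdict(set)
    for session_id, intention in labelled_edits:
        intentions_by_session[session_id].add(intention)
    if not intentions_by_session:
        return 0.0
    multi = sum(
        1 for labels in intentions_by_session.values() if len(labels) >= min_distinct
    )
    return multi / len(intentions_by_session)


# Toy data for illustration only (not taken from ScholaWrite).
toy_edits = [
    ("s1", "Text Production"), ("s1", "Clarity"), ("s1", "Coherence"),
    ("s2", "Text Production"), ("s2", "Text Production"),
    ("s3", "Idea Generation"), ("s3", "Section Planning"),
    ("s3", "Text Production"), ("s3", "Fluency"),
    ("s4", "Clarity"), ("s4", "Fluency"),
]
share = multi_intention_share(toy_edits)
print(f"{share:.0%} of toy sessions involve 3+ distinct intentions")
```

Using a set per session keeps the count insensitive to how often an intention repeats, which matches the "distinct intentions" wording of the statistic.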

Finding 2: LLMs Are Misaligned with Human Writing

We benchmark LLMs on two tasks: next-intention prediction (can the model predict what a writer will do next?) and writing alignment (does the model's output resemble how a human actually edits?). Current LLMs perform poorly on both.

LLMs struggle with intention prediction: Even GPT-4o (F1 = 0.41) falls well short of what fine-tuning achieves (F1 = 0.64).

Alignment degrades with complexity: BERTScore-F1 decreases as sessions become longer or more cognitively complex, and drops further when multiple intentions co-occur.
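For readers who want to score their own next-intention predictions, the sketch below computes an F1 score over intention labels in plain Python. The page does not say which F1 averaging was used, so the macro average here is an assumption, and the function name `macro_f1` and the toy labels are illustrative only.

```python
def macro_f1(gold, predicted):
    """Unweighted mean of per-label F1 scores (macro averaging is an assumption)."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    labels = sorted(set(gold) | set(predicted))
    scores = []
    for label in labels:
        pairs = list(zip(gold, predicted))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); a label with no support and no predictions scores 0.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores) if scores else 0.0


# Toy labels for illustration only.
gold = ["Text Production", "Clarity", "Text Production", "Coherence"]
pred = ["Text Production", "Text Production", "Text Production", "Coherence"]
print(round(macro_f1(gold, pred), 2))
```

Macro averaging gives rare intentions such as Idea Generation the same weight as frequent ones such as Text Production; a weighted average would instead favor the frequent labels.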

Finding 3: Fine-Tuning on ScholaWrite Helps

Fine-tuning on ScholaWrite produces dramatic improvements in both keystroke-level writing alignment and intention prediction. The models learn to generate edits that more closely resemble the step-by-step revisions human scholars make.

Next Intention Prediction

Input: writing context (main.text)

\section{Introduction}
Large language models have shown remarkable abilities in many writing tasks.

Just performed: Text Production

Output: predicted next intention (suggested next intention)

Clarity: 0.21
Coherence: 0.17
Idea Generation: 0.11

Output Alignment

Task: Insert citation [Smith, 2023] into the paragraph

Before fine-tuning: Recent work explores LLM prompting [Smith, 2023] techniques. (✗ inserted mid-phrase)

After fine-tuning: Recent work explores LLM prompting techniques [Smith, 2023]. (✓ placed at sentence end)

ScholaWrite Dataset

Below is a single keystroke entry from ScholaWrite. The full dataset is available on the Hugging Face dataset card.

{
project: 0,
timestamp: 1700506686410,
author: 0,
"before text": "% Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
% For authors from different institutions:
% author{anonymous}\\ Address line \\ ... \\ Address line
% And ... And
% Author n \\ Address line \\ ... \\ Address line}
% To start a seperate `row'' of authors use AND, as in
% author{anonymous}\\ Address line \\ ... \\ Address line
% AND
% Author 2 \\ Address line \\ ... \\ Address line And
% Author 3 \\ Address line \\ ... \\ Address line}

author{anonymous}\\
Affiliation / Address line 1 \\
Affiliation / Address line 2 \\
Affiliation / Address line 3 \\
\texttt{email@domain} \\And
Second Author \\
Affiliation / Address line 1 \\
Affiliation / Address line 2 \\
Affiliation / Address line 3 \\
\texttt{email@domain} \\}

\begin{document}
maketitle
\begin{abstract}
Style is an important component of text that expresses a diverse set of information, including interpersonal dynamics (e.g. formality) and the author's emotions or attitudes (e.g. disgust). Writers constantly incorporate style -- and oftentimes, multiple styles -- into their writing. In order for generative language models to be useful in a wide variety of situations, these models should also be able to control and weave together styles when generating text. Previous work investigates reinforcement learning (RL) approaches for controlled generation of a single style, or else controlled generation for multiple attributes. In this paper, we investigate expanding this into controlling for \textbf{multiple} styles simultaneously. Our baseline is a plug-and-play approach. Our results indicate that plug-and-play does not satisfactorily solve the multi-style controlled generation problem, and that a straightforward RL approach can achieve strong results. We also explore the trade-off between training time and accuracy between plug-and-play and fune-tuning approaches for SoTA models.
end{abstract}

section{Introduction}
Writers can apply styles to text to convey a variety of information citep{hovy1987generating,silverstein2003indexical,block2015social,kang2021style}. Styles can convey both information about the writer (e.g. their attitudes or demographic traits) and the writer's interpersonal relationship or goals with respect to the reader (e.g. respectful or threatening language). Following previous work, we consider each individual aspect of these stylistic goals – i.e. each unique attitude, demographic attribute, interpersonal relationship goal – to be an individual style.

Stylistic information is a common and crucial component of communication: in fact, a text's style can convey a variety of information not included in the text's raw semantic content citep{hovy1995multifunctionality}.
Consequently, it is vital that large language models are well-equipped to understand and apply styles themselves.
Progress has been made in the domain of controlled generation, in which the goal is for a generative language model to generate text of a specified style.",
"after text": "% Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
% For authors from different institutions:
% author{anonymous}\\ Address line \\ ... \\ Address line
% And ... And
% Author n \\ Address line \\ ... \\ Address line}
% To start a seperate ``row'' of authors use AND, as in
% author{anonymous}\\ Address line \\ ... \\ Address line
% AND
% Author 2 \\ Address line \\ ... \\ Address line And
% Author 3 \\ Address line \\ ... \\ Address line}

author{anonymous}\\
Affiliation / Address line 1 \\
Affiliation / Address line 2 \\
Affiliation / Address line 3 \\
\texttt{email@domain} \\And
Second Author \\
Affiliation / Address line 1 \\
Affiliation / Address line 2 \\
Affiliation / Address line 3 \\
\texttt{email@domain} \\}

\begin{document}
maketitle
\begin{abstract}
Style is an in component of text that expresses a diverse set of information, including interpersonal dynamics (e.g. formality) and the author's emotions or attitudes (e.g. disgust). Writers constantly incorporate style -- and oftentimes, multiple styles -- into their writing. In order for generative language models to be useful in a wide variety of situations, these models should also be able to control and weave together styles when generating text. Previous work investigates reinforcement learning (RL) approaches for controlled generation of a single style, or else controlled generation for multiple attributes. In this paper, we investigate expanding this into controlling for \textbf{multiple} styles simultaneously. Our baseline is a plug-and-play approach. Our results indicate that plug-and-play does not satisfactorily solve the multi-style controlled generation problem, and that a straightforward RL approach can achieve strong results. We also explore the trade-off between training time and accuracy between plug-and-play and fune-tuning approaches for SoTA models.
end{abstract}

section{Introduction}
Writers can apply styles to text to convey a variety of information citep{hovy1987generating,silverstein2003indexical,block2015social,kang2021style}. Styles can convey both information about the writer (e.g. their attitudes or demographic traits) and the writer's interpersonal relationship or goals with respect to the reader (e.g. respectful or threatening language). Following previous work, we consider each individual aspect of these stylistic goals – i.e. each unique attitude, demographic attribute, interpersonal relationship goal – to be an individual style.

Stylistic information is a common and crucial component of communication: in fact, a text's style can convey a variety of information not included in the text's raw semantic content citep{hovy1995multifunctionality}.
Consequently, it is vital that large language models are well-equipped to understand and apply styles themselves.
Progress has been made in the domain of controlled generation, in which the goal is for a generative language model to generate text of a specified style. ",
label: "Linguistic Style",
high_level: "REVISION",
}
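Because each entry stores the full document before and after a keystroke, the edit itself has to be recovered by diffing the two fields. The following is a minimal sketch using Python's standard `difflib`; the field names `"before text"` and `"after text"` come from the entry above, while the helper name `changed_spans` and the shortened strings are illustrative assumptions.

```python
import difflib


def changed_spans(before, after):
    """List (operation, removed_text, inserted_text) for every changed region."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    return [
        (tag, before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Shortened excerpt of the entry above; the real fields hold the whole LaTeX file.
entry = {
    "before text": "Style is an important component of text",
    "after text": "Style is an in component of text",
    "label": "Linguistic Style",
}

for tag, removed, inserted in changed_spans(entry["before text"], entry["after text"]):
    print(tag, repr(removed), repr(inserted))
```

A character-level diff may split one human edit into several small operations, so grouping adjacent spans is left as a design choice for whoever consumes the output.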

Iterative Self-Writing

We fine-tuned Llama-3.1-8B-Instruct on ScholaWrite for two tasks: predicting the next writing intention, and generating text edits aligned with that intention. Given a seed document (a LaTeX-formatted abstract from an award-winning NLP paper), the model predicts what a human writer would do next, applies the edit, and repeats — producing a full writing trajectory over 100 iterations. Here is how it unfolds:

llama-8b-SW in seed 1

llama-8b-instruct in seed 1

GPT-4o in seed 1

llama-8b-SW in seed 2

llama-8b-instruct in seed 2

GPT-4o in seed 2

llama-8b-SW in seed 3

llama-8b-instruct in seed 3

GPT-4o in seed 3
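The predict, edit, repeat loop described above can be sketched as follows. This is a schematic under stated assumptions, not the authors' pipeline: `predict_intention` and `apply_edit` are hypothetical callables standing in for the two fine-tuned models, and the toy stubs exist only so the loop runs without any model.

```python
from typing import Callable, List, Tuple


def iterative_self_writing(
    seed: str,
    predict_intention: Callable[[str], str],
    apply_edit: Callable[[str, str], str],
    iterations: int = 100,
) -> List[Tuple[str, str]]:
    """Run the loop and return the trajectory as (intention, text) pairs."""
    trajectory = []
    text = seed
    for _ in range(iterations):
        intention = predict_intention(text)  # step 1: what would a writer do next?
        text = apply_edit(text, intention)   # step 2: edit the text with that intention
        trajectory.append((intention, text))
    return trajectory


# Hypothetical stand-ins for the two models (assumptions for illustration).
def toy_predict(text: str) -> str:
    return "Text Production" if len(text) < 40 else "Clarity"


def toy_edit(text: str, intention: str) -> str:
    if intention == "Text Production":
        return text + " More text."
    return text.replace("  ", " ")


trajectory = iterative_self_writing("Seed abstract.", toy_predict, toy_edit, iterations=5)
for step, (intention, _) in enumerate(trajectory, start=1):
    print(step, intention)
```

Passing the models in as callables keeps the loop identical across systems, so a fine-tuned model and a baseline can be compared on the same seed by swapping the two arguments.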

Sample Inference Trajectories

Idea Generation

Section Planning

Text Production

Object Insertion

Macro Insertion

Clarity

Coherence

Cross-reference

Fluency

Linguistic Style

Scientific Accuracy

Structural Revision

Visual Formatting

Citation Integration

Data Collection / Annotation Tools

Chrome Extension Tutorial

We designed and implemented a Chrome extension, which enables the real-time collection of keystroke trajectories in the Overleaf platform without disturbing participants' writing process. You can browse the extension code through this link.


Annotation UI Tutorial

To make the labeling process easy and smooth, we developed a novel web app to replay collected keystroke data. The interface offers various modes for visualizing keystroke collections within each Overleaf project: by time, LaTeX file, and author. You can browse the code through this link.


Visualization Tool

We visualized the iterative self-writing output of Llama-8B-SW and Llama-8B-Instruct on all four seeds. The page is a slightly modified version of the one used for human evaluation in the paper, with model names and a backward navigation button added. Red strikethrough text marks deletions, and green text marks additions. You can access our demo page through this link.
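A similar deletion and addition view can be approximated with the standard library. The sketch below is an assumption about one way to render such a view, not the code behind the demo page; the function name `render_edit_html` and the inline styles are illustrative.

```python
import difflib
from html import escape


def render_edit_html(before, after):
    """Return HTML with deletions in <del> (red) and additions in <ins> (green)."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    parts = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            parts.append(escape(before[i1:i2]))
        if tag in ("delete", "replace"):
            # Browsers render <del> with a strikethrough by default.
            parts.append(f'<del style="color:red">{escape(before[i1:i2])}</del>')
        if tag in ("insert", "replace"):
            parts.append(f'<ins style="color:green">{escape(after[j1:j2])}</ins>')
    return "".join(parts)


print(render_edit_html("abc", "abXc"))
```

Escaping every span matters for LaTeX sources, which often contain characters such as `<`, `>`, and `&` that would otherwise be interpreted as markup.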


BibTeX

@inproceedings{le-etal-2026-scholawrite,
    title = "{S}chola{W}rite: A Dataset of End-to-End Scholarly Writing",
    author = "Le, Khanh Chi  and
      Wang, Linghe  and
      Lee, Minhwa  and
      Volkov, Ross  and
      Chau, Luan Tuyen  and
      Kang, Dongyeop",
    editor = "Liakata, Maria  and
      Moreira, Viviane P.  and
      Zhang, Jiajun  and
      Jurgens, David",
    booktitle = "Proceedings of the 64th Annual Meeting of the {A}ssociation for {C}omputational {L}inguistics (Volume 1: Long Papers)",
    month = jul,
    year = "2026",
    address = "San Diego, California, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2026.acl-long.1606/",
    doi = "10.18653/v1/2026.acl-long.1606",
    pages = "34755--34788",
    ISBN = "979-8-89176-390-6",
    abstract = "Writing is a cognitively demanding activity that requires constant decision-making, heavy reliance on working memory, and frequent shifts between tasks of different goals. To build writing assistants that truly align with writers' cognition, it is necessary to capture and analyze the complete thought process behind how writers transform ideas into final texts. We present SCHOLAWRITE, the first dataset of end-to-end scholarly writing, tracing the multi-month journey from initial drafts to final manuscripts. The dataset traces nearly 62K LaTeX-based edits from five computer science preprints over four months and is enriched with fine-grained annotations of cognitive writing intentions. We demonstrate the value of ScholaWrite through three complementary contributions: (1) analysis of real-world writing behavior reveals that scholarly writing is highly non-linear and multi-intentional, blending rapid drafting bursts with cognitively sustained writing sessions; (2) evaluations of current large language models show that they struggle to provide meaningful support throughout the human writing process; and (3) models finetuned on SCHOLAWRITE demonstrate improved alignment with human writing workflows. SCHOLAWRITE underscores the value of capturing scientists' cognitive writing process and provides actionable insights and resources for the development of future writing assistants."
}

ScholaWrite: A Dataset of End-to-End Scholarly Writing Process