We are recuiting for ScholaWrite 2.0,
focusing on human-AI collaborative scientific writing!

A Summary of Contributions

Scholarly writing is a non-linear and cognitively complex task that involves frequent switching between multiple activities, coordinating various pieces of multiform information, and revising previously written text. This approach contrasts with the token-by-token text generation of large language models (LLMs). Our goal is for the model to understand the cognitive processes involved in writing and apply them to text generation.
We present ScholaWrite, a curated dataset of nearly 62K LATEX-based keystrokes that were turned into publications in the computer science domain, annotated by experts in linguistics and computer science.
We develop a taxonomy of scholarly writing intentions, providing an overall understanding of how scholars tend to produce their ideas.
We found that a finetuned Llama3.2 model on our dataset outperformed GPT-4o in predicting the next writing intention and better mimicked the human-like iterative revision process than GPT-4o.

Recorded Writing Process Data

Project 1

Project 2

Project 3

Project 4

Project 5

Chrome Extension Tutorial

We designed and implemented a Chrome extension, which enables the real-time collection of keystroke trajectories in the Overleaf platform without disturb participants' writing process. You can browse the extension code through this link.

To install and run this extention please do the following:

Start the backend by following procedure in Run ScholaWrite System section, and setup URL in Chrome extension by completing step 1 in Setup the Extention section
The extension only works on Chrome. Open your browser & Search "chrome://extensions/"
On the top right corner of the page, toggle on 'Developer Mode'.
Click on 'Load unpacked' on the top left corner, and you should see a pop-up of the file directory.
Click on the folder named 'extension' in the folder where you download/clone the repository. Then, click on the toggle on the right side, enabling the extension!
At the top right corner, click on the puzzle icon followed by the pin icon next to the extension. The “S” logo will be shown!
Click on the 'S' icon to register your account. Register your username with your email address and password you would like.
After successful registration, you will see a welcome message. Toggle on the "Record writer actions" button. Now The Chrome extension setup is complete. Enjoy your writing!

Note: The operating logic of our Chrome extension is through listening frontend elements. Due to update of HTML elements in Overleaf, our Chrome extension is no longer able to recording writer actions

Annotation UI Tutorial

To make labeling process easy and smooth, we developed a novel web app to the replay collected keystroke data. The interface offers various modes for visualizing keystroke collections within each Overleaf project: by time, LaTeX file, and author. You can browse the code through this link.

Step-by-step Annotation Procedures:

In the login page, enter an existing username or create a new one. Then click the blue login button.
Choose the title of the project that you would like to annotate, and click the button 'Switch'
Click 'By Time' button on the interface and start reading keystrokes.
Click through the first few actions and try to identify which high-level label is occurring (Planning, Implementation, or Revision).
Once identified the current high-level label, attempt to identify where it ends.
Once you know the start and end indices for the current high-level label,
1. Decide the low-level label within the high-level label (e.g., Idea Generation, Text Production, Coherence, etc.)
2. If a keystroke looks like an artifact that does not deliver any insight, then label it as an 'artifact'.
3. If you think multiple labels can be applied to the identified span, please select multiple labels under the dropdown menu list.

ScholaWrite Dataset

Below is a data entry of one keystroke data in scholawrite. You can find the full dataset in the Huggingface data card

{
project: 0,
timestamp: 1700506686410,
author: 0,
"before text": "% Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
% For authors from different institutions:
% author{anonymous}\\ Address line \\ ... \\ Address line
% And ... And
% Author n \\ Address line \\ ... \\ Address line}
% To start a seperate `row'' of authors use AND, as in
% author{anonymous}\\ Address line \\ ... \\ Address line
% AND
% Author 2 \\ Address line \\ ... \\ Address line And
% Author 3 \\ Address line \\ ... \\ Address line}

author{anonymous}\\
Affiliation / Address line 1 \\
Affiliation / Address line 2 \\
Affiliation / Address line 3 \\
\texttt{email@domain} \\And
Second Author \\
Affiliation / Address line 1 \\
Affiliation / Address line 2 \\
Affiliation / Address line 3 \\
\texttt{email@domain} \\}

\begin{document}
maketitle
\begin{abstract}
Style is an important component of text that expresses a diverse set of information, including interpersonal dynamics (e.g. formality) and the author’s emotions or attitudes (e.g. disgust). Writers constantly incorporate style -- and oftentimes, multiple styles -- into their writing. In order for generative language models to be useful in a wide variety of situations, these models should also be able to control and weave together styles when generating text. Previous work investigates reinforcement learning (RL) approaches for controlled generation of a single style, or else controlled generation for multiple attributes. In this paper, we investigate expanding this into controlling for \textbf{multiple} styles simultaneously. Our baseline is a plug-and-play approach. Our results indicate that plug-and-play does not satisfactorily solve the multi-style controlled generation problem, and that a straightforward RL approach can achieve strong results. We also explore the trade-off between training time and accuracy between plug-and-play and fune-tuning approaches for SoTA models.
end{abstract}

section{Introduction}
Writers can apply styles to text to convey a variety of information citep{hovy1987generating,silverstein2003indexical,block2015social,kang2021style}. Styles can convey both information about the writer (e.g. their attitudes or demographic traits) and the writer’s interpersonal relationship or goals with respect to the reader (e.g. respectful or threatening language). Following previous work, we consider each individual aspect of these stylistic goals – i.e. each unique attitude, demographic attribute, interpersonal relationship goal – to be an individual style.

Stylistic information is a common and crucial component of communication: in fact, a text’s style can convey a variety of information not included in the text's raw semantic content citep{hovy1995multifunctionality}.
Consequently, it is vital that large language models are well-equipped to understand and apply styles themselves.
Progress has been made in the domain of controlled generation, in which the goal is for a generative language model to generate text of a specified style.",
"after text": "% Author 1 \\ {\bf Author 2} \\ ... \\ {\bf Author n} \\
% For authors from different institutions:
% author{anonymous}\\ Address line \\ ... \\ Address line
% And ... And
% Author n \\ Address line \\ ... \\ Address line}
% To start a seperate ``row'' of authors use AND, as in
% author{anonymous}\\ Address line \\ ... \\ Address line
% AND
% Author 2 \\ Address line \\ ... \\ Address line And
% Author 3 \\ Address line \\ ... \\ Address line}

author{anonymous}\\
Affiliation / Address line 1 \\
Affiliation / Address line 2 \\
Affiliation / Address line 3 \\
\texttt{email@domain} \\And
Second Author \\
Affiliation / Address line 1 \\
Affiliation / Address line 2 \\
Affiliation / Address line 3 \\
\texttt{email@domain} \\}

\begin{document}
maketitle
\begin{abstract}
Style is an in component of text that expresses a diverse set of information, including interpersonal dynamics (e.g. formality) and the author’s emotions or attitudes (e.g. disgust). Writers constantly incorporate style -- and oftentimes, multiple styles -- into their writing. In order for generative language models to be useful in a wide variety of situations, these models should also be able to control and weave together styles when generating text. Previous work investigates reinforcement learning (RL) approaches for controlled generation of a single style, or else controlled generation for multiple attributes. In this paper, we investigate expanding this into controlling for \textbf{multiple} styles simultaneously. Our baseline is a plug-and-play approach. Our results indicate that plug-and-play does not satisfactorily solve the multi-style controlled generation problem, and that a straightforward RL approach can achieve strong results. We also explore the trade-off between training time and accuracy between plug-and-play and fune-tuning approaches for SoTA models.
end{abstract}

section{Introduction}
Writers can apply styles to text to convey a variety of information citep{hovy1987generating,silverstein2003indexical,block2015social,kang2021style}. Styles can convey both information about the writer (e.g. their attitudes or demographic traits) and the writer’s interpersonal relationship or goals with respect to the reader (e.g. respectful or threatening language). Following previous work, we consider each individual aspect of these stylistic goals – i.e. each unique attitude, demographic attribute, interpersonal relationship goal – to be an individual style.

Stylistic information is a common and crucial component of communication: in fact, a text’s style can convey a variety of information not included in the text's raw semantic content citep{hovy1995multifunctionality}.
Consequently, it is vital that large language models are well-equipped to understand and apply styles themselves.
Progress has been made in the domain of controlled generation, in which the goal is for a generative language model to generate text of a specified style. ",
label: "Linguistic Style",
high_level: "REVISION",
}

Iterative Self-Writing

Figure a

Figure b

Figure c

Finetuning Setup
To examine the generalizability of the models trained on the SCHOLAWRITE, we test end-to-end writing tasks. Our dataset is organaized with tuples of before text, after text and the scholarly writing intention (Figure [a])

We use this dataset to train two models: one for classification of writing intentions, and another for implementing writing assistance for that intention. These models are shown in figure [b].

Iterative Writing
The finetuned models were compared with GPT-4o and Llama3.1 8B on our novel iterative-writing setup shown in figure [c].

The goal of this setup is to understand the aptitude of our finetuned writing assistants on a self-writing application. We started our iterative writing with one of seed documents derived from four award-winning NLP papers on diverse topics. Each seed document contains LaTex-formatted title and abstract. In each iteration, the model will revise the seed document based on its generated intention. The revised document will be the seed document of the next iteration. We ran the iterative writing for 100 iterations.

Evaluation
We evaluated the results of the finetuned model and GPT-40 iterative writign with human evaluation and automatic evaluation metrics.
1. Human Evaluation Inspired by Chang et al. (2023), we conduct human evaluation to mea sure the quality of writing outputs. We asked three native English speakers with Overleaf writing experience to evaluate our finetuned model and GPT-4o on the following metrics:
  1. Accuracy: Out of 100, the number of generated outputs that align with the provided intention.
  2. Alignment: Which model's whole writing process throughout the entire 100 iterations looks more like a human?
  3. Fluency: Which model's final writing sounds more grammatically correct?
  4. coherence: Which model's final writing sounds more logical?
  5. Relevancy: Does the final writing from each model contain related contents to the original seed document?
2. Automatic Evaluation:
  The following quantitative analysis metrics were also used to measure the quality of writing outputs:
  1. Lexical diversity: Assess the unique tokens model generated in the final iteration of writing, measured by number of unique tokens divided by total tokens generated.
  2. Topic consistency: Cosine similarity between the seed document and output from the final iteration of writing
  3. Intention coverage: Assess the diversity on model's writing intention, measured by number of unique label predicted through entire 100 iteration divided by all 15 intention labels available

Iterative Writing Replays

llama-8b-SW in seed 1

llama-8b-instruct in seed 1

GPT-4o in seed 1

llama-8b-SW in seed 2

llama-8b-instruct in seed 2

GPT-4o in seed 2

llama-8b-SW in seed 3

llama-8b-instruct in seed 3

GPT-4o in seed 3

Visualization Tool

We vsialized the self-iterative writing output of Llama-8B-SW and Llama-8B-Instruct on all four seeds. It is the slight modification of web page used for human evaluation in paper, which includes model names and backword navigation button. The red and strikethrough represent deletion, while green represent addition. You can access our demo page through this link.

Here is the tutorial:

Picture above is the UI of demo page. First, click the drop down menu to select the seed document would like to read. After selection, click 'Switch' button on the right.
At the bottom of the page, from left to right, are progress bar, backword button, play button, and forward button. Progress bar shows the relative position of current text you are reading. Clicking forward button will display the text of previous iteration. Clicking play button will showing text of next iteration every one second. Clicking forward button display the text of next iteration. You can also use arrow keys on the keyboard to navigate to previous/next iteration.
You can also have more granular control to read text from different iterations by muanlly input the number on the top right corner of the UI. The number you entered represents the number of iterations. When you're done typing, press enter. Then the page will display the writing from your specified iteration
On the top of each model's output, there are writing intention and model's name.
After finishing viewing the outputs, click 'Go Back' button on the upper right corner to get back to this project page.

Sample Inference Trajectories

Idea Generation

Section Planning

Text Production

Object Insertion

Macro Insertion

Clarity

Coherence

Cross-reference

Fluency

Linguistic Style

Scientific Accuracy

Structural Revision

Visual Formatting

Citation Integration

BibTeX

@misc{wang2025scholawritedatasetendtoendscholarly,
          title={ScholaWrite: A Dataset of End-to-End Scholarly Writing Process},
          author={Linghe Wang and Minhwa Lee and Ross Volkov and Luan Tuyen Chau and Dongyeop Kang},
          year={2025},
          eprint={2502.02904},
          archivePrefix={arXiv},
          primaryClass={cs.CL},
          url={https://arxiv.org/abs/2502.02904},
    }

ScholaWrite: A Dataset of End-to-End Scholarly Writing Process