Decoding the End-to-end Writing Trajectory in Scholarly Manuscripts
Ryan Koo*, Anna Martin*, Linghe Wang, Dongyeop Kang (*equal contribution) Visual replay user's writing based on recorded writer actions University of Minnesota-Twin Cities
Abstract
Scholarly writing presents a complex space that generally follows
a methodical procedure to plan and produce both rationally sound
and creative compositions. Recent works involving large language
models (LLM) demonstrate considerable success in text generation
and revision tasks; however, LLMs still struggle to provide struc-
tural and creative feedback on the document level that is crucial
to academic writing. In this paper, we introduce a novel taxonomy
that categorizes scholarly writing behaviors according to intention,
writer actions, and the information types of the written data. We
also provide ManuScript, an original dataset annotated with a
simplified version of our taxonomy to show writer actions and the
intentions behind them. Motivated by cognitive writing theory,
our taxonomy for scientific papers includes three levels of catego-
rization in order to trace the general writing flow and identify the
distinct writer activities embedded within each higher-level process.
ManuScript intends to provide a complete picture of the schol-
arly writing process by capturing the linearity and non-linearity of
writing trajectory, such that writing assistants can provide stronger
feedback and suggestions on an end-to-end level.
Repaly collected writing trajectories
Annotated Sample
An example manuscript with annotations on writing intentions (left) and writing actions (right).
Each horizontal line denotes a single annotation.
Introduction / Background / Motivation
Writing is a cognitively fatiguing task involving continuous decision-
making, heavy use of working memory, and frequent switching be-
tween multiple activities. Scholarly writing is particularly complex
as it requires the author to coordinate many pieces of multiform
information while also meeting the high standards of academic
communication. Flower and Hayes’ [ 6] cognitive process theory of
writing organizes these tasks into three processes: planning, during
which the writer generates and organizes ideas and sets writing goals;
translation, during which the writer implements their plan,
keeping in mind the organization of the text as well as word choice
and phrasing; and reviewing, during which the writer evaluates and
revises their text. Flower and Hayes emphasize that these distinct
phases are non-linear and highly embedded, meaning that any pro-
cess or sub-process can be embedded within any other process and
move back and forth between each process. In order to provide
relevant feedback at each step of the academic writing process, it is
critical for writing assistants to have a strong understanding of the
planning, translation, and revision stages throughout their entirety.
Recent corpora for the study of writing processes exist for each
of these sub-processes. Berdanier [2] demystifies the academic writ-
ing process in a study showing the “linguistic scheme” involving a
distinct planning and crafting procedure typically followed within
technical writing. Furthermore, much work has been done to study
text revision using keystroke data [1 , 3 , 13 ], and revision history
[ 4, 5, 7 , 11 , 12 ]. More recently, Sardo et al. [ 9] have developed a
corpus and a metric for edit-complexity that draws a complex topo-
logical structure of the writer’s efforts throughout the history of
the essay to study the planning and translation processes. Despite
recent advancements in large language models, particularly text
generation, LLMs still exhibit subpar performance for reasoning
capabilities and particularly planning [10 ] to have any significant
impact in aiding the writing process [ 9]. Our work builds upon
these previous studies to provide a dataset with annotations en-
compassing the writing process spanning across all three stages, as
described by Flower and Hayes.
In this paper, we propose a novel dataset ManuScript composed
of writing actions as discrete points annotated according to our
taxonomy to capture the end-to-end writing process. The dataset
consists of several annotated data points following an author's writing trajectory
from the beginning to the end of their draft of an academic manuscript. Since writing
is often non-linear in that it does not occur in a specific order, the writer may jump
back and forth between each phase. Hence, the taxonomy follows a hierarchical structure
encompassing the three phases as (higher-level) coarse-grained labels and finer-grained
annotations embedded within each higher-level process. We plan to extend this work by
scaling the data collection process over a longer period of time to develop a more nuanced
taxonomy of writer actions and intentions.
MANUSCRIPT: A DATASET OF THE
END-TO-END WRITING PROCESS
Analyzing a final manuscript alone is intractable for capturing an
author’s original intentions. Flower and Hayes’s cognitive process
model [6] cannot be applied directly to keystroke data from scholarly writing.
Hence, we have developed an initial taxonomy that can
characterize the finer-grained actions an author takes into distinct
categories but is also general enough to fully capture the author’s
trajectory throughout the entire writing process.
Data Collection
We developed a chrome extension that reverse
engineers Overleaf’s editing history utilizing user keystrokes to
track writing actions in real-time (See details in Appendix A). From
this, we can generate a playback that shows the chronological progression
for each completed writing session. Our initial study involved four
participants in a pilot study where they were prompted
to describe their current or future research plans by responding to
the available prompts or in free form over a thirty-minute writing
session. In total, we collected four writing trajectories, including
46 discontinuous edits with 3290 recorded actions. The detailed
statistics are in Appendix C.
Annotation Schema and result
Inspired by Flower and Hayes’ cognitive
process model, our annotation schema (Figure 2) contains two levels of granularity
in order to accurately reflect the hierarchical and recursive
aspects of the writing process. The higher level includes
Planning, Implementation, and Revision. These labels are used
to denote the general process that the writer is working in. The
lower level categorizations include {idea generation, concept
organization}, {lexical_chaining}, and {syntactic, lexical,
structural} for each of the three processes respectively.
Three of the authors annotated the samples that were gathered. One author annotated sample 1 in the
course of developing the annotation guidelines. Figure 3 illustrates
the writing trajectory of sample 1. Each of the other three
samples was annotated by two different authors such that each
author annotated two samples, and no two samples had the same
pair of annotators.
bibtex citation
@inproceedings{
koo2023decoding,
title={Decoding the End-to-end Writing Process in Scholarly Manuscripts via Writer-action Taxonomy},
author={Ryan Hyunkyo Koo and Anna Martin and Linghe Wang and Dongyeop Kang},
booktitle={The Second Workshop on Intelligent and Interactive Writing Assistants},
year={2023},
url={https://arxiv.org/abs/2304.00121}
}