Abstract
People frequently use speech-to-text systems to compose short texts by voice. However, current voice-based interfaces struggle to support composing more detailed, contextually complex texts, especially in scenarios where users are on the move and cannot visually track progress. Longer-form communication, such as composing structured emails or thoughtful responses, requires persistent context tracking, structured guidance, and adaptability to evolving user intentions—capabilities that conventional dictation tools and voice assistants do not support. We introduce StepWrite, a large language model-driven, voice-based interaction system that augments human writing ability by enabling structured, hands-free and eyes-free composition of longer-form texts while on the move. StepWrite decomposes the writing process into manageable subtasks and sequentially guides users with contextually aware, non-visual audio prompts. StepWrite reduces cognitive load by offloading context tracking and adaptive planning to the models. Unlike baseline methods such as standard dictation features (e.g., Microsoft Word) and conversational voice assistants (e.g., ChatGPT Advanced Voice Mode), StepWrite dynamically adapts its prompts based on the evolving context and user intent, and provides coherent guidance without compromising user autonomy. An empirical evaluation with 25 participants engaging in mobile or stationary hands-occupied activities demonstrated that StepWrite significantly reduces cognitive load and improves usability and user satisfaction compared to baseline methods. Technical evaluations further confirmed StepWrite's capability in dynamic contextual prompt generation, accurate tone alignment, and effective fact checking. This work highlights the potential of structured, context-aware voice interactions for enhancing hands-free and eyes-free communication in everyday multitasking scenarios.
Augmenting your ability to write—anytime, anywhere.
StepWrite is an AI-powered voice system that helps you create structured, high-quality text through adaptive Q&A prompts, enabling hands-free and eyes-free writing while you’re on the go or occupied.

Problem Space
Voice-based interfaces are widely adopted for short, transactional tasks, yet they remain ill-suited for composing long-form, contextually complex text in hands-busy or eyes-free settings. Long-form composition requires persistent context tracking, structured planning, and iterative refinement—capabilities not provided by linear dictation or command–response assistants. As a result, users encounter elevated cognitive load, reduced structural coherence, and substantial post-editing, often with a diminished sense of authorship.
Where current tools fall short.
Dictation systems are fundamentally linear: they transcribe but do not plan, organize, or maintain document state across turns, leading to fragmented drafts and heavy revision. General voice assistants support turn-by-turn commands but lack persistence, memory of writing goals, and mechanisms to manage evolving document structure. Screen-first AI writing tools provide completions and rewrites, yet presume continuous visual attention and offer limited scaffolding for non-visual, hands-free workflows. Collectively, these limitations constrain users’ ability to sustain intent, control tone, and preserve coherence while multitasking.
Why now.
Recent advances in large language models make adaptive, context-aware scaffolding feasible in real time: systems can decompose tasks, ask targeted follow-up questions, align tone, and maintain state throughout a dialogue. In parallel, mature speech technologies (robust VAD, reliable transcription, responsive TTS) and ubiquitous mobile/wearable hardware enable practical hands-free, eyes-free interaction outside the desktop environment. Together, these developments create the conditions for augmentative writing support that lowers cognitive load while preserving user agency.
Motivation.
There is a clear need for voice-first systems that (i) maintain state across turns to preserve goals, audience, and constraints; (ii) scaffold planning adaptively through high-utility prompts; (iii) support hands-free, eyes-free interaction with unambiguous audio feedback and simple controls; and (iv) align tone and intent without compromising authorship. This work responds to that need by focusing on augmenting human writing ability during mobile and multitasking scenarios.
StepWrite
StepWrite is a voice-first writing system that augments human writing ability by transforming long-form composition into an adaptive, spoken dialogue. It provides context-aware scaffolding for hands-free, eyes-free authoring: the system incrementally elicits key details, infers tone, and synthesizes a coherent draft entirely through speech. Implemented as a responsive web application, StepWrite offloads computation to cloud APIs and operates across heterogeneous devices, supporting real-world multitasking scenarios such as composing emails or replies while walking or cooking. It is available as open-source software.
At the interaction level, StepWrite reframes dictation as guided Q&A. Rather than asking users to plan and dictate in a single pass, it breaks writing into manageable subtasks and asks targeted, context-aware questions one at a time. Prompts are generated adaptively from prior responses, user intent, and inferred genre, helping users elaborate goals, audience, constraints, and timing without sustained visual attention. This design externalizes planning to reduce cognitive load, yet preserves agency: users can skip questions, revise earlier answers, and switch to keyboard or hybrid input as needed. In effect, StepWrite scaffolds the writing process while keeping authorship with the user.
A robust speech layer underpins hands-free use. Audio is filtered for noise, segmented by voice-activity detection with a tunable “thinking window,” and screened client-side for macro commands (e.g., “skip question,” “go back”). To minimize false triggers, built-in commands are intentionally phrased as two-word macros and matched with token-level fuzzy recognition (cosine threshold 0.85). Non-command utterances are then transcribed (Whisper) and streamed with visual and audio confirmation for real-time feedback.
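The macro-matching step can be pictured with a short sketch. It is illustrative only, not StepWrite's production matcher: it assumes character-bigram token vectors and applies the 0.85 cosine threshold to each token of a two-word command.

```python
# Illustrative macro matching: compare each token of a transcribed utterance
# against a two-word command using cosine similarity over character bigrams.
from collections import Counter
from math import sqrt

MACROS = {("skip", "question"): "SKIP", ("go", "back"): "BACK"}
THRESHOLD = 0.85

def bigram_vector(token: str) -> Counter:
    token = f"#{token.lower()}#"          # pad so short tokens still form bigrams
    return Counter(token[i:i + 2] for i in range(len(token) - 1))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[k] * b[k] for k in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def match_macro(utterance: str) -> str | None:
    tokens = utterance.lower().split()
    if len(tokens) != 2:                  # built-in commands are two-word macros
        return None
    for (w1, w2), action in MACROS.items():
        sims = (cosine(bigram_vector(tokens[0]), bigram_vector(w1)),
                cosine(bigram_vector(tokens[1]), bigram_vector(w2)))
        if min(sims) >= THRESHOLD:        # both tokens must clear the threshold
            return action
    return None

print(match_macro("skip question"))       # -> SKIP
print(match_macro("please skip this"))    # -> None (forwarded to transcription)
```

Fuzzy matching of this kind tolerates minor transcription noise that an exact string comparison would reject, while the two-word constraint keeps false triggers rare.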
The modular prompt pipeline maintains a conversation graph of prompt–response pairs. After each turn, an LLM evaluates the accumulated context and either produces the next high-utility question or signals sufficiency via a followup_needed flag; once sufficient, the system proceeds to drafting. Tone is classified from the Q&A history and passed to the generator, which produces a draft constrained to user-supplied facts (with optional memory for personalization). A dedicated fact-checking loop then verifies the draft against the Q&A, returning structured issues (missing, inconsistent, inaccurate, unsupported) and invoking selective rewrites until alignment is achieved or a pass limit is reached—preserving tone and structure while enforcing information integrity.
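The control flow around the followup_needed flag can be sketched as follows. The ask_llm callable and the JSON schema are assumptions for illustration; only the loop structure mirrors the description above.

```python
# Minimal sketch, not the production pipeline: after each answer, the model is
# asked either to emit the next high-utility question or to signal sufficiency
# via a followup_needed flag. `ask_llm` is a hypothetical stand-in.
import json
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class ConversationGraph:
    goal: str
    turns: list[tuple[str, str]] = field(default_factory=list)  # (question, answer)

def next_prompt(graph: ConversationGraph, ask_llm: Callable[[str], str]) -> str | None:
    """Return the next question, or None when no further follow-up is needed."""
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in graph.turns)
    raw = ask_llm(
        "Given the writing goal and the Q&A so far, reply with JSON "
        '{"followup_needed": bool, "question": str}.\n'
        f"Goal: {graph.goal}\n{context}"
    )
    decision = json.loads(raw)
    return decision["question"] if decision["followup_needed"] else None

# Usage with a canned model response, just to show the control flow:
graph = ConversationGraph(goal="Reply to a meeting-reschedule email")
fake_llm = lambda _: '{"followup_needed": true, "question": "What new time works for you?"}'
print(next_prompt(graph, fake_llm))   # -> "What new time works for you?"
```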
Session control supports both linear progression and cyclical revision. Users can move between dialogue and editor views, replay or modify prior answers, and issue corrective commands via voice. The system persists session state (UUID) for continuity and provides layered feedback—TTS prompts, waveform visualization, and auditory cues for navigation—to maintain clarity in eyes-free contexts.
What StepWrite enables.
By offloading planning, context tracking, and verification to the system—and by providing adaptive, tone-aware prompts—users can compose structured, intent-aligned texts while occupied, without sacrificing authorship. For example, a user unpacking groceries can initiate a reply, answer a brief series of targeted questions, and issue a simple “finish” command to receive a fact-checked draft ready to send—all hands-free and eyes-free.
System Overview
End-to-end workflow.
StepWrite orchestrates a voice-first authoring loop that (i) elicits task-relevant context through an adaptive Q&A dialogue, (ii) classifies tone from the accumulated interaction history, (iii) generates a draft constrained to user-provided facts, and (iv) verifies and selectively revises the draft through a fact-checking loop before presenting the final output. Users can signal completion at any point; otherwise, the system detects sufficiency and proceeds automatically. Session state persists across turns to support non-linear navigation and recovery.

Speech-to-text pipeline.
Incoming audio is cleaned via noise analysis, segmented by voice-activity detection with pause-timing logic, and screened client-side for macro commands (e.g., “skip question,” “go back”). Only non-command utterances are forwarded to the STT backend (Whisper), and the resulting text is appended to the Q&A context for downstream prompting and control.
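A minimal sketch of the pause-timing logic, assuming a per-frame VAD label and a tunable silence threshold (the "thinking window"); the real segmenter's parameters and defaults are not specified here.

```python
# Illustrative end-of-utterance logic: speech frames accumulate into a segment,
# and the segment is finalized only after silence exceeds the thinking window,
# so brief pauses do not cut the speaker off mid-thought.
from dataclasses import dataclass, field

@dataclass
class Segmenter:
    thinking_window_s: float = 2.0          # tunable pause tolerance (assumed default)
    frame_s: float = 0.03                   # duration of each audio frame
    _silence_s: float = 0.0
    _frames: list[bytes] = field(default_factory=list)

    def push(self, frame: bytes, is_speech: bool) -> list[bytes] | None:
        """Feed one VAD-labeled frame; return the finished segment, if any."""
        if is_speech:
            self._frames.append(frame)
            self._silence_s = 0.0
            return None
        if not self._frames:                # silence before any speech: ignore
            return None
        self._silence_s += self.frame_s
        if self._silence_s >= self.thinking_window_s:
            segment, self._frames = self._frames, []
            self._silence_s = 0.0
            return segment                  # ready for macro screening / Whisper
        return None
```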

Text-to-speech pipeline.
System prompts and readbacks are synthesized to audio with caching for reuse. During playback, the system continuously monitors for user interruptions (immediate responses or voice activity within a short window); on detection, audio is halted and the pipeline returns to listening, preserving a natural, real-time conversational cadence.
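The caching and barge-in behavior might look roughly like this; synthesize, play_chunk, and voice_activity_detected are hypothetical stand-ins for the actual TTS and audio APIs.

```python
# Rough sketch of the playback side: synthesized prompts are cached for reuse,
# and playback halts as soon as voice activity is detected so the system can
# return to listening.
import hashlib
from typing import Callable, Iterator

_tts_cache: dict[str, bytes] = {}

def cached_tts(text: str, synthesize: Callable[[str], bytes]) -> bytes:
    key = hashlib.sha256(text.encode()).hexdigest()
    if key not in _tts_cache:
        _tts_cache[key] = synthesize(text)
    return _tts_cache[key]

def play_with_barge_in(chunks: Iterator[bytes],
                       play_chunk: Callable[[bytes], None],
                       voice_activity_detected: Callable[[], bool]) -> bool:
    """Play audio chunk by chunk; stop early and return False if the user speaks."""
    for chunk in chunks:
        if voice_activity_detected():
            return False            # halt playback, hand control back to the listener
        play_chunk(chunk)
    return True
```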

Runtime modules.
- Adaptive Q&A planner: Generates the next high-utility question from the conversation graph; stops when additional follow-ups are unnecessary.
- Tone classifier → drafting: Infers tone from Q&A history; the generator composes a draft aligned with stated facts and preferences.
- Fact-checking loop: Detects missing/inconsistent/unsupported content and triggers targeted rewrites until alignment criteria are met or a pass limit is reached (see the sketch after this list).
- Session & control: Persistent state, voice macros, and bidirectional navigation between dialogue and editor views enable flexible, hands-free revision.
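A sketch of the fact-checking loop referenced in the list above, with hypothetical check_draft and rewrite_draft model calls standing in for the real prompts:

```python
# Verify-and-rewrite loop: the checker returns structured issues (missing /
# inconsistent / inaccurate / unsupported), and the loop triggers targeted
# rewrites until no issues remain or a pass limit is reached.
from typing import Callable, TypedDict

class Issue(TypedDict):
    kind: str        # "missing" | "inconsistent" | "inaccurate" | "unsupported"
    detail: str

def fact_check_loop(draft: str,
                    qa_pairs: list[tuple[str, str]],
                    check_draft: Callable[[str, list[tuple[str, str]]], list[Issue]],
                    rewrite_draft: Callable[[str, list[Issue]], str],
                    max_passes: int = 3) -> str:
    for _ in range(max_passes):
        issues = check_draft(draft, qa_pairs)
        if not issues:
            return draft                      # aligned with user-supplied facts
        draft = rewrite_draft(draft, issues)  # selective rewrite, preserving tone
    return draft                              # pass limit reached; return best effort
```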
Study Design
We conducted a within-subjects evaluation in which each participant completed two writing tasks—a write task and a reply task—using three hands-free tools: StepWrite (structured, adaptive guidance), ChatGPT Advanced Voice Mode (conversational generative AI), and Microsoft Word Dictation (speech-to-text). All six tool–task combinations were counterbalanced (Latin square). To approximate real-world multitasking, participants worked under two activity contexts: stationary (e.g., origami, light snacking) and movement (walking within a defined area). Each participant performed one task in each context; assignments were randomized.
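For illustration, a cyclic Latin square over the six tool–task combinations can be generated as below. This is not the study's actual assignment script, and a fully balanced square would additionally control first-order carryover; the sketch only shows how each combination can appear in each ordinal position equally often.

```python
# Cyclic Latin square over the six tool-task combinations (illustrative only).
from itertools import product

conditions = [f"{tool}/{task}" for tool, task in
              product(["StepWrite", "ChatGPT AVM", "Dictation"], ["write", "reply"])]

def latin_square(items):
    n = len(items)
    return [[items[(row + col) % n] for col in range(n)] for row in range(n)]

for participant_row in latin_square(conditions):
    print(" -> ".join(participant_row))
```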
Apparatus.
All tools ran on a MacBook Pro (M3 Pro, 36 GB RAM) connected to a small external display ~2 m from participants; the display was intermittently disabled to encourage audio-first, eyes-free interaction. Participants wore a Poly Voyager 4320 headset with noise-canceling microphone. A wireless keyboard and mouse were provided only for the revision phase. For study parity, StepWrite’s memory/personalization features were disabled; ChatGPT AVM was accessed via the official UI; Dictation used Word’s built-in STT.
Procedure.
After consent and a brief demonstration of each tool (plus a command reference for StepWrite and Dictation), participants completed each task in two phases:
- Drafting (voice-only): Interactive Q&A with StepWrite; conversational voice editing with ChatGPT AVM; direct dictation in Word.
- Revision (keyboard-only): Participants edited the draft into something they would personally send, refining content, tone, and clarity.
Timing and flow.
Drafting time was measured from the start of voice interaction to initial draft display (including any in-flow voice edits). Revision time covered only keyboard/mouse edits. If participants returned to Q&A after viewing the draft, revision timing paused and drafting timing resumed. Conditions (tool order, activity context) were randomized/counterbalanced; post-task questionnaires were completed after each tool; sessions concluded with a debrief. All procedures were IRB-approved.
Results
Revision Effort. StepWrite required significantly fewer edits (M=1.18) than ChatGPT AVM (M=2.12) and Dictation (M=8.46; p<.001), indicating that its structured scaffolding substantially reduced revision burden by guiding users toward cleaner first drafts.

Readability. StepWrite and ChatGPT AVM produced moderately readable initial drafts (FRE ≈ 45, grade level ≈ 10) that required minimal post-editing. Dictation drafts started out very difficult to read (FRE = 5.66, grade level = 28.65) and became readable only after extensive revision.
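For context, the reported readability scores follow the standard Flesch formulas; syllable-counting rules vary by tool, so exact values depend on the implementation used.

```python
# Standard Flesch Reading Ease and Flesch-Kincaid Grade Level formulas,
# given raw word, sentence, and syllable counts for a text.
def flesch_reading_ease(words: int, sentences: int, syllables: int) -> float:
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

def flesch_kincaid_grade(words: int, sentences: int, syllables: int) -> float:
    return 0.39 * (words / sentences) + 11.8 * (syllables / words) - 15.59
```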


Sentence Structure.
Dictation generated overly long, fragmented sentences (62.83 words avg.), necessitating substantial editing (19.28 words revised). StepWrite (12.79 words) and ChatGPT AVM (10.81 words) delivered concise, well-structured sentences initially, requiring minimal changes.

Lexical & Semantic Diversity.
ChatGPT AVM produced the highest lexical diversity (TTR=0.82). StepWrite (0.75) was moderate, while Dictation was slightly lower (0.73). Semantic diversity (meaning-level edits) was lowest with StepWrite (M=0.011), indicating initial drafts closely matched user intent.
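Type–token ratio, as reported here, is in its simplest form the number of distinct word types divided by the total token count; the exact tokenization used in the study is an assumption in this sketch.

```python
# Simple type-token ratio over lowercased word tokens (illustrative tokenizer).
import re

def type_token_ratio(text: str) -> float:
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

print(round(type_token_ratio("Thanks again, thanks so much for the update"), 2))  # 0.88
```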


Final Draft Length. StepWrite produced longer drafts (Write=86.6 words, Reply=104.3 words) than ChatGPT AVM and Dictation, showing that guided questions encouraged richer elaboration.

Temporal Efficiency. ChatGPT AVM was fastest overall (~150–168s). StepWrite took longer (Write=248s, Reply=200s) due to structured prompts but minimized revision effort. Dictation drafted quickly (~60s) but required extensive revision.


Necessity of StepWrite Questions.
77.9% of StepWrite’s questions were essential to final outputs (EQF=0.779), with higher necessity (81.8%) in reply tasks.
Task | Necessary | Skipped | Unnecessary |
---|---|---|---|
Write | 74.4% | 12.1% | 13.6% |
Reply | 81.8% | 11.4% | 6.8% |
Tone Classification
StepWrite’s tone classifier achieved 91.7% accuracy on a balanced evaluation set of 350 messages. Eleven of the fourteen tone categories scored F1 ≥ 0.86, including perfect scores for apologetic, encouraging, surprised, and cooperative. Categories like assertive and informal scored lower due to class imbalance and nuanced boundaries, but overall results demonstrate strong performance for real-world tone detection.
Tone | Precision | Recall | F1 | Support |
---|---|---|---|---|
Apologetic | 1.00 | 1.00 | 1.00 | 22 |
Encouraging | 1.00 | 1.00 | 1.00 | 21 |
Surprised | 0.95 | 1.00 | 0.98 | 21 |
Cooperative | 0.95 | 1.00 | 0.98 | 21 |
Optimistic | 1.00 | 0.95 | 0.98 | 22 |
Empathetic | 1.00 | 0.95 | 0.97 | 20 |
Concerned | 1.00 | 0.94 | 0.97 | 16 |
Friendly | 0.93 | 1.00 | 0.96 | 25 |
Diplomatic | 0.94 | 0.94 | 0.94 | 16 |
Formal | 0.85 | 0.96 | 0.90 | 83 |
Curious | 1.00 | 0.75 | 0.86 | 24 |
Urgent | 0.74 | 0.95 | 0.83 | 21 |
Informal | 0.83 | 0.79 | 0.81 | 19 |
Assertive | 1.00 | 0.42 | 0.59 | 19 |
Overall Accuracy: 0.917 • Macro F1: 0.912 • Weighted F1: 0.912
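A per-class report like the table above can be produced with scikit-learn; the toy labels below are illustrative, and this is not necessarily the authors' evaluation script.

```python
# Per-class precision/recall/F1 and macro averages from gold vs. predicted tones.
from sklearn.metrics import classification_report, accuracy_score

y_true = ["formal", "urgent", "formal", "assertive", "friendly", "assertive"]  # gold labels
y_pred = ["formal", "urgent", "formal", "assertive", "friendly", "formal"]     # classifier output

print(accuracy_score(y_true, y_pred))                   # overall accuracy
print(classification_report(y_true, y_pred, digits=2))  # per-class P/R/F1 + macro avg
```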
User Experience
NASA TLX. StepWrite had the lowest workload (M=16.8), followed by ChatGPT (M=22.5), with Dictation highest (M=49.2).

System Usability Scale (SUS). StepWrite (SUS=80.0) and ChatGPT AVM (83.2) were rated highly usable, while Dictation scored below average (60.0).
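For readers unfamiliar with SUS, the 0–100 score is derived from ten 5-point items using the standard scoring rule; the responses below are illustrative only.

```python
# Standard SUS scoring: odd-numbered items contribute (response - 1), even-numbered
# items contribute (5 - response), and the sum is scaled by 2.5 to a 0-100 range.
def sus_score(responses: list[int]) -> float:
    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)
    total = sum((r - 1) if i % 2 == 0 else (5 - r)   # index 0, 2, ... are items 1, 3, ...
                for i, r in enumerate(responses))
    return total * 2.5

print(sus_score([5, 2, 4, 1, 5, 2, 4, 2, 5, 1]))   # -> 87.5
```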

Emotional Experience Questionnaire (EEQ). StepWrite (EEQ=5.76/7) received the most positive emotional ratings, particularly for engagement and motivation.

Hands-Free Writing Tools Assessment (HFWTA). StepWrite led across all seven dimensions, especially Guided Writing Process (6.17/7).

Summary of Findings
Across both writing and reply tasks, StepWrite’s structured Q&A approach produced the cleanest initial drafts, requiring minimal editing and achieving stable readability, complexity, and sentence structure. StepWrite outputs aligned closely with users’ original intent, resulting in significantly fewer semantic revisions and the lowest overall revision effort. Although StepWrite required more drafting time due to its incremental scaffolding process, it substantially reduced total revision time, achieved the lowest perceived workload (NASA TLX), and earned top ratings for usability (SUS) and emotional engagement (EEQ).
ChatGPT AVM provided quick generative drafts with moderate editing demands, displaying high lexical diversity and balancing speed with output quality. It performed well in multitasking scenarios and excelled at reducing stress, though its drafts sometimes diverged more noticeably from users’ intended meanings.
In contrast, Dictation required minimal upfront drafting time but imposed heavy revision demands. Its unstructured outputs featured low readability, fragmented sentence structure, high semantic diversity (substantial meaning-level edits), and the greatest overall workload, frustration, and cognitive load. Dictation also received the lowest usability and emotional engagement ratings.
These results show that incremental, context-aware questioning (StepWrite’s adaptive planning and scaffolding) is most effective for clean, intent-aligned, hands-free composition. Purely generative approaches (ChatGPT AVM) and linear speech-to-text methods (Dictation) present trade-offs among drafting speed, flexibility, cognitive effort, and user satisfaction.