Web Supplement to
“Free vs. Transcribed Text for Keystroke-Dynamics Evaluations”
This webpage supplements “Free vs. Transcribed Text for
Keystroke-Dynamics Evaluations” by Kevin Killourhy and Roy
Maxion, published in LASER 2012.
Kevin S. Killourhy and Roy A. Maxion. Free vs. transcribed text for
keystroke-dynamics evaluations. In Learning from Authoritative
Security Experiment Results (LASER-2012), July 18–19,
2012, Arlington, VA, 2012. ACM Press.
The abstract of the paper is as follows:
Background. One revolutionary application of keystroke dynamics
is continuous reauthentication: confirming a typist's identity during
normal computer usage without interrupting the user.
Aim. In laboratory evaluations, subjects are typically given
transcription tasks rather than free composition (e.g., copying
rather than composing text), because transcription is easier for
subjects. This work establishes whether free and transcribed text
produce equivalent evaluation results.
Method. Twenty subjects completed comparable transcription
and free-composition tasks; two keystroke-dynamics classifiers were
implemented; each classifier was evaluated using both the
free-composition and transcription samples.
Results. Transcription hold and keydown-keydown times are
2–3 milliseconds slower than free-text
features; t-tests showed these effects to be significant.
However, these effects did not significantly change evaluation results.
Conclusions. The additional difficulty of collecting freely
composed text from subjects seems unnecessary; researchers are
encouraged to continue using transcription tasks.
This webpage provides the typing data, classifier implementations, and
evaluation scripts used in the research. In conjunction with the
paper itself, these supplemental research materials and instructions
are intended to enable researchers to reproduce the scientific
results, tables, and figures from this work. We hope that the
materials provide a useful basis for further scientific research.
The following archive contains all of the research materials (for easy
When downloaded and unzipped, the archive contains the following files:
data/TimingFeatures-Hold.txt: The key hold-time
features recorded in a fixed-width text-file table with headers.
Columns specify the subject ID, session index, screen/exercise index,
key index, key name, and hold time in seconds.
data/TimingFeatures-DD.txt: The digraph
keydown-keydown-time features recorded in a fixed-width text-file
table with headers. Columns specify the subject ID, session index,
screen/exercise index, digraph index, digraph key names, and
keydown-keydown time in seconds.
data/SessionMap.txt: Fixed-width text-file table
listing the correspondence between the Subject IDs and session
indices, and the free-composition or transcription task that was
performed in that session. In each session, the subject either freely
composes or transcribes text concerning one of four pictures. The
pictures are denoted by one of four codes: Sea, Runaway, Girl, and
Doll. The task names pair a picture with whether the writing was free
or transcribed (e.g., “Runaway - Trans” or “Sea -
r/run.R: R script implementing the entire
evaluation procedure. Functions in the script (1) read and preprocess
the data, (2) implement the classification algorithms, (3) run the
evaluation, (4) analyze the results, and (5) produce tables and figures.
figs/time-density-repro.eps: Density plots comparing free
and transcribed timing features (Figure 2 from the paper). This plot
is a product of the evaluation. It is shared so that researchers can
assess whether they are able to successfully reproduce it.
figs/error-plot-repro.eps: Evaluation results for both
classifiers (Figure 3 from the paper). This plot is another product
of the evaluation. It is also shared so that researchers can assess
whether they are able to successfully reproduce it.
Organized into the above directory structure (as they are in the
zip-file archive linked above), these materials should be sufficient
to reproduce our research results.
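As a rough illustration of working with these tables, the R sketch below reads a small made-up table laid out like the description of data/TimingFeatures-Hold.txt and joins it with a made-up session map. The column names (subject, session, screen, keyIndex, key, hold, type), the sample rows, and the assumption that columns are separated by whitespace with no embedded spaces are all hypothetical; check the real file headers before adapting it.

```r
# Hypothetical sample mimicking the described hold-time table.
# Column names and values are invented for illustration only.
sample_lines <- c(
  "subject session screen keyIndex key   hold",
  "s002    1       3      1        t     0.0950",
  "s002    1       3      2        h     0.0870",
  "s002    2       3      1        t     0.0910",
  "s002    2       3      2        h     0.0840"
)
path <- tempfile(fileext = ".txt")
writeLines(sample_lines, path)

# read.table splits on runs of whitespace, which is enough for this sample;
# a real fixed-width file with embedded spaces would need read.fwf instead.
hold <- read.table(path, header = TRUE, stringsAsFactors = FALSE)

# Hypothetical session map: the kind of task performed in each session.
session_map <- data.frame(
  subject = c("s002", "s002"),
  session = c(1L, 2L),
  type    = c("free", "trans"),
  stringsAsFactors = FALSE
)

# Attach the task type to every keystroke, then convert seconds to ms.
merged <- merge(hold, session_map, by = c("subject", "session"))
merged$hold_ms <- merged$hold * 1000

# Mean hold time per task type.
means <- aggregate(hold_ms ~ type, data = merged, FUN = mean)
print(means)
```

Converting to milliseconds is only a convenience here: the files store times in seconds, while the paper reports the free-versus-transcribed differences in milliseconds.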
Our data analysis was performed using the R statistical programming
environment. The following instructions guide a researcher through
the details of setting up an environment similar to ours, running the
evaluation scripts, and checking the results.
Download and extract research materials. Begin by
downloading the zip-file archive from the link above and extracting
its contents to a directory
DSL-Free-vs-Transcribed. Confirm that the files
listed above exist and are organized in the appropriate subdirectories.
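One way to confirm the layout is a short R check like the sketch below. The file paths are the ones listed on this page; the helper name check_materials is hypothetical and is not part of the shared scripts, and the sketch assumes that R is started in the directory that contains DSL-Free-vs-Transcribed.

```r
# Paths as listed on this page, relative to the extracted project directory.
expected_files <- c(
  "data/TimingFeatures-Hold.txt",
  "data/TimingFeatures-DD.txt",
  "data/SessionMap.txt",
  "r/run.R",
  "figs/time-density-repro.eps",
  "figs/error-plot-repro.eps"
)

# Return the subset of expected files that cannot be found under project_dir.
# (check_materials is a hypothetical helper, not part of the shared scripts.)
check_materials <- function(project_dir) {
  present <- file.exists(file.path(project_dir, expected_files))
  expected_files[!present]
}

missing_files <- check_materials("DSL-Free-vs-Transcribed")
if (length(missing_files) == 0) {
  message("All listed files were found.")
} else {
  message("Missing: ", paste(missing_files, collapse = ", "))
}
```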
Install R and add-on packages. The R statistical
programming environment can be obtained from the R Project for Statistical Computing.
It is available for most modern operating systems, and it is free and
open-source. We developed and tested our evaluation script with R
version 2.13.1, but we expect that it will work with other versions of
R. In addition to the basic R installation, our work uses several
add-on packages. The R Project maintainers have organized a large
collection of packages, making it easy to download and install them
directly from within R. The packages on which this work depends
are reshape, plyr, and lattice. They can be installed using the command:
install.packages( c('reshape','plyr','lattice') );
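If some of these packages may already be present, a check along the following lines avoids reinstalling them. This is only a sketch: missing_packages is a hypothetical helper, and the install call is left commented out so that nothing is downloaded unless you choose to.

```r
needed <- c("reshape", "plyr", "lattice")

# Hypothetical helper: which of the needed packages are not yet installed?
missing_packages <- function(needed,
                             installed = rownames(installed.packages())) {
  setdiff(needed, installed)
}

to_install <- missing_packages(needed)
if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
} else {
  message("All required packages are already installed.")
}
```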
Run the evaluation. Launch R, and change the working
directory to DSL-Free-vs-Transcribed. The working
directory can be changed using either the setwd function
or via a correspondingly named menu option (within a GUI environment).
From that directory, the project-specific R functions can be loaded
with the following command:
Having loaded all the functions, the complete evaluation procedure can be
invoked with a single command:
When run, this function will load the data, split it into appropriate
subgroups (e.g., free and transcribed), evaluate both of the
classifiers, and analyze the results. While running, the function
produces messages and progress bars; it also prints various
intermediary results and summary statistics. Refer to the declaration
of the run function in the script for more detail.
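Put together, the session described in this step might look roughly like the sketch below. It assumes that the script lives at r/run.R (as listed among the files), that it defines the run function mentioned above, and that run can be called without arguments; the wrapper run_evaluation is a hypothetical convenience and not part of the shared materials.

```r
# Hypothetical wrapper; assumes r/run.R defines a function named run,
# as the text above indicates, and that it can be called without arguments.
run_evaluation <- function(project_dir = "DSL-Free-vs-Transcribed") {
  script <- file.path(project_dir, "r", "run.R")
  if (!file.exists(script)) {
    message("Cannot find ", script, "; check the extraction step.")
    return(invisible(FALSE))
  }
  old_dir <- setwd(project_dir)    # work from the project directory
  on.exit(setwd(old_dir))          # restore the previous working directory
  source(file.path("r", "run.R"))  # load the project-specific functions
  run()                            # assumed entry point of the evaluation
  invisible(TRUE)
}
```

Calling run_evaluation() from the directory that contains DSL-Free-vs-Transcribed would then load the script and start the evaluation; if the script is not found, the function reports that and returns FALSE instead of failing midway.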
Confirm the results. If the run function
executes successfully, it will print the results of two statistical
analyses. The percentages should match the following to within a
Significant timing p-values:
Significant evaluation p-values:
Additionally, the function produces two figures in the
figs directory. Compare these figures to time-density-repro.eps and
error-plot-repro.eps, also in the figs
directory, to ensure that the figures used in the paper have been
reproduced.
Questions and Answers
In response to some actual and anticipated questions, we offer some
explanations and additional details about the data and the work.
In the real world, people with different jobs (e.g., writers,
programmers, sysadmins) type different things, on different
keyboards, in different environments (etc.). Why weren't these
factors taken into account?
These factors were taken into
account by controlling them. Prior work has established that some
of these factors absolutely do affect how a person types. Failing
to account for their effect would introduce a confounding factor.
By instructing subjects to complete the same typing tasks, on the
same keyboard, in the same environment (etc.), we prevented these
factors from confounding the results. We agree that there should be
more work to measure the effects of those other factors, but the
focus of this study was on one factor: free composition versus transcription.
If you have 20 subjects with no drop-outs, why aren't their
subject IDs consecutive?
The data collected for this study is part of a much larger keystroke-data
collection project. Within the larger project, not every subject
completes every task. For instance, only 20 subjects completed the
free-composition and transcription tasks studied in this work.
Subjects are assigned IDs that are unique and consistent across all
tasks, and so the sequence of subject IDs for any particular task can
be expected to have gaps. As an example, Subject
completed a variety of typing tasks but not the free-composition and
transcription tasks under study in this work.
s021 did complete the
free-composition and transcription tasks, and so the data set appears
to have a gap between
No subjects recruited for the free-composition and transcription tasks
dropped out of that data-collection effort.
In the data, why do the screen indices run from 3–10,
when the exercises range from 1–8?
The software that
presents the subjects with typing tasks first asks the subjects to
identify themselves (Screen 1) and presents them with a set of
instructions (Screen 2). The exercises are presented to them on the
screens that follow (Screens 3–10). To maintain our subjects'
privacy, we have stripped the data collected on the first two screens
from the data set before sharing it publicly. The remaining 8 screen
indices correspond to data collected during each of the 8 exercises,
with screen index [K] mapping to exercise [K-2].
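The mapping from screen index to exercise number is a fixed offset of 2, which can be written as a small R helper. The function name screen_to_exercise is hypothetical; only the offset and the 3–10 range come from the explanation above.

```r
# Map a screen index (3..10) to its exercise number (1..8).
# (screen_to_exercise is a hypothetical helper name.)
screen_to_exercise <- function(screen) {
  stopifnot(all(screen >= 3 & screen <= 10))  # only exercise screens were shared
  screen - 2
}

screens <- 3:10
mapping <- data.frame(screen = screens, exercise = screen_to_exercise(screens))
print(mapping)
```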
The statistical analysis based on multiple testing is very
traditional. Why didn't you use ANOVA / generalized linear
mixed-effects models (GLMMs) / Bayesian hierarchical models / some
other more powerful test procedure?
At present, the
keystroke-dynamics community has not arrived at a consensus about
methodology and analytical best practices. Some prior work has used
the multiple-testing procedure, and so we chose to use it because of
its familiarity and acceptability. However, in other studies that
we have conducted, we have drawn inferences using linear
mixed-effects models (LMMs) and non-parametric hypothesis testing.
One hope in sharing the data is that other researchers might conduct
their own statistical analysis, and we can begin a larger discussion
about the best statistical methods for keystroke-dynamics research.
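To make the idea of a multiple-testing analysis concrete, here is a minimal R sketch on synthetic data: one Welch t-test per simulated subject comparing free and transcribed hold times, followed by the percentage of tests that are significant at the 0.05 level. The simulated timings, the 3-millisecond shift, the choice of per-subject tests, and the 0.05 threshold are illustrative assumptions, and this is not necessarily the procedure implemented in r/run.R.

```r
set.seed(1)  # make the synthetic data repeatable

n_subjects <- 20    # matches the number of subjects in the study
n_keys     <- 200   # keystrokes per condition per subject (invented)

# One p-value per subject: does mean hold time differ between conditions?
p_values <- sapply(seq_len(n_subjects), function(i) {
  base  <- rnorm(1, mean = 0.100, sd = 0.010)              # typical hold time (s)
  free  <- rnorm(n_keys, mean = base,         sd = 0.020)  # free composition
  trans <- rnorm(n_keys, mean = base + 0.003, sd = 0.020)  # 3 ms slower (assumed)
  t.test(trans, free)$p.value                              # Welch two-sample t-test
})

# Share of per-subject tests that reach significance at the 0.05 level.
pct_significant <- 100 * mean(p_values < 0.05)
cat(sprintf("Significant p-values: %.1f%%\n", pct_significant))
```

Researchers who prefer a mixed-effects or Bayesian analysis could replace the per-subject tests with a single model fitted to the pooled data; the sketch is only meant to show the shape of the traditional approach.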
In conjunction with the paper itself, these supplemental research
materials, instructions, and explanations are intended to enable
researchers to reproduce the scientific results, tables, and figures
from this work. By making our research materials publicly available,
we also hope to enable other researchers to extend our work: testing
additional feature-extraction procedures or classifier
implementations, running alternative evaluations and analyses, or
conducting largely new and different experiments with our data or
classifier implementations. We hope this resource is useful. Please
let us know if you use our data, or if you have any comments or questions.
This material is based upon work supported by the National Science
Foundation under grant number CNS-0716677. Any opinions, findings,
conclusions, or recommendations expressed in this material are those
of the authors, and do not necessarily reflect the views of the
National Science Foundation.