Web Supplement to
“Free vs. Transcribed Text for Keystroke-Dynamics Evaluations”
(LASER-2012)


by
Kevin Killourhy and Roy Maxion


  1. Introduction
  2. Research Materials
  3. Procedure
  4. Questions and Answers
  5. Summary

Introduction

This webpage supplements “Free vs. Transcribed Text for Keystroke-Dynamics Evaluations” by Kevin Killourhy and Roy Maxion, published in LASER 2012.
Kevin S. Killourhy and Roy A. Maxion. Free vs. transcribed text for keystroke-dynamics evaluations. In Learning from Authoritative Security Experiment Results (LASER-2012), July 18–19, 2012, Arlington, VA. ACM Press.
The abstract of the paper is as follows:
Background. One revolutionary application of keystroke dynamics is continuous reauthentication: confirming a typist's identity during normal computer usage without interrupting the user.
Aim. In laboratory evaluations, subjects are typically given transcription tasks rather than free composition (e.g., copying rather than composing text), because transcription is easier for subjects. This work establishes whether free and transcribed text produce equivalent evaluation results.
Method. Twenty subjects completed comparable transcription and free-composition tasks; two keystroke-dynamics classifiers were implemented; each classifier was evaluated using both the free-composition and transcription samples.
Results. Transcription hold and keydown-keydown times are 2–3 milliseconds slower than free-text features; t-tests showed these effects to be significant. However, these effects did not significantly change evaluation results.
Conclusions. The additional difficulty of collecting freely composed text from subjects seems unnecessary; researchers are encouraged to continue using transcription tasks.
This webpage provides the typing data, classifier implementations, and evaluation scripts used in the research. In conjunction with the paper itself, these supplemental research materials and instructions are intended to enable researchers to reproduce the scientific results, tables, and figures from this work. We hope that the materials provide a useful basis for further scientific research.

Research Materials

All of the research materials are packaged in a single zip-file archive for easy download. When downloaded and unzipped, the archive contains the typing data, the classifier implementations, and the evaluation scripts, organized into three subdirectories: data/, r/, and figs/. These materials should be sufficient to reproduce our research results.

Procedure

Our data analysis was performed using the R statistical programming environment. The following instructions guide a researcher through the details of setting up an environment similar to ours, running the evaluation scripts, and checking the results.
  1. Download and extract research materials. Begin by downloading the zip-file archive from the link above and extracting its contents to a directory named DSL-Free-vs-Transcribed. Confirm that the extracted files are organized into the appropriate subdirectories (i.e., data/, r/, and figs/).
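    Once R is installed (see the next step), the layout can be quickly confirmed from within R, with the working directory set to the top of the extracted archive:
    dir();   # should list (at least) the data, r, and figs subdirectories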

  2. Install R and add-on packages. The R statistical programming environment can be obtained from the R Project for Statistical Computing (http://www.r-project.org). It is available for most modern operating systems, and it is free and open-source. We developed and tested our evaluation script with R version 2.13.1, but we expect that it will work with other versions of R. In addition to the basic R installation, our work uses several add-on packages. The R Project maintainers have organized a large collection of packages, making it easy to download and install them directly from within R. The packages on which this work depends are reshape, plyr, and lattice. They can be installed using the command:
    install.packages( c('reshape','plyr','lattice') );
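    Once the installation completes, loading the packages confirms that it succeeded (library signals an error if a package is missing):
    library(reshape); library(plyr); library(lattice);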

  3. Run the evaluation. Launch R, and change the working directory to DSL-Free-vs-Transcribed. The working directory can be changed either with the setwd command or via a correspondingly named menu option (within a GUI environment). From that directory, the project-specific R functions can be loaded with the following command:
    source('r/run.R');
    Having loaded all the functions, the complete evaluation procedure can be invoked with a single command:
    run();
    When run, this function will load the data, split it into appropriate subgroups (e.g., free and transcribed), evaluate both of the classifiers, and analyze the results. While running, the function produces messages and progress bars; it also prints various intermediate results and summary statistics. Refer to the declaration of the run function in the script for more detail.
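    Put together, a complete session might look like the following sketch (the path passed to setwd is hypothetical; substitute wherever the archive was extracted):
    setwd('~/DSL-Free-vs-Transcribed');   # hypothetical extraction path
    source('r/run.R');
    run();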

  4. Confirm the results. If the run function executes successfully, it will print the results of two statistical analyses. The percentages should match the following to within a percentage point:
    Significant timing p-values: 28.9%
    Significant evaluation p-values: 3.6%
    Additionally, the function produces two figures named time-density.eps and error-plot.eps in the figs directory. Compare these figures with time-density-repro.eps and error-plot-repro.eps, also in the figs directory, to ensure that the figures used in the paper have been accurately reproduced.
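    As a quick check that both figures were written, the following command should print TRUE TRUE (assuming the working directory is still the top of the archive):
    file.exists(file.path('figs', c('time-density.eps', 'error-plot.eps')));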

Questions and Answers

In response to actual and anticipated questions, we offer explanations and additional details about the data and the work.
  1. In the real world, people with different jobs (e.g., writers, programmers, sysadmins) type different things, on different keyboards, in different environments (etc.). Why weren't these factors taken into account?
    These factors were taken into account by controlling them. Prior work has established that some of these factors absolutely do affect how a person types. Failing to account for their effect would introduce a confounding factor. By instructing subjects to complete the same typing tasks, on the same keyboard, in the same environment (etc.), we prevented these factors from confounding the results. We agree that there should be more work to measure the effects of those other factors, but the focus of this study was on one factor: free composition vs. transcription.

  2. If you have 20 subjects with no drop-outs, why aren't their Subject IDs s001–s020?
    The data collected for this study is part of a much larger keystroke-data collection project. Within the larger project, not every subject completes every task. For instance, only 20 subjects completed the free-composition and transcription tasks studied in this work. Subjects are assigned IDs that are unique and consistent across all tasks, and so the sequence of subject IDs for any particular task can be expected to have gaps. As an example, Subject s020 completed a variety of typing tasks but not the free-composition and transcription tasks under study in this work. Subjects s019 and s021 did complete the free-composition and transcription tasks, and so the data set appears to have a gap between s019 and s021. No subjects recruited for the free-composition and transcription tasks dropped out of that data-collection effort.
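    For example, once the typing data have been loaded into a data frame, the gaps can be inspected directly; the data-frame and column names below are hypothetical, not necessarily those used in our scripts:
    # Hypothetical: `typing` is a data frame of keystrokes with a
    # `subject` column holding IDs such as 's019' and 's021'.
    sort(unique(typing$subject));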

  3. In the data, why do the screen indices run from 3 to 10 when the exercises range from 1 to 8?
    The software that presents the subjects with typing tasks first asks the subjects to identify themselves (Screen 1) and presents them with a set of instructions (Screen 2). The exercises are presented to them on the screens that follow (Screens 3–10). To maintain our subjects' privacy, we have stripped the data collected on the first two screens from the data set before sharing it publicly. The remaining 8 screen indices correspond to data collected during each of the 8 exercises, with screen index K mapping to exercise K-2.
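    In R, the remapping is a simple offset, as in the following sketch (the data-frame and column names are hypothetical):
    # Hypothetical: `typing` has a `screen` column with indices 3-10;
    # screen 3 is exercise 1, screen 4 is exercise 2, and so on.
    typing$exercise <- typing$screen - 2;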

  4. The statistical analysis based on multiple testing is very traditional. Why didn't you use ANOVA / generalized linear mixed-effects models (GLMMs) / Bayesian hierarchical models / some other more powerful test procedure?
    At present, the keystroke-dynamics community has not arrived at a consensus about methodology and analytical best practices. Some prior work has used the multiple-testing procedure, and so we chose it for its familiarity and acceptability. However, in other studies that we have conducted, we have drawn inferences using linear mixed-effects models (LMMs) and non-parametric hypothesis testing. One hope in sharing the data is that other researchers might conduct their own statistical analyses, so that we can begin a larger discussion about the best statistical methods for keystroke-dynamics research.
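    As an illustration of the mixed-effects alternative mentioned above, a timing feature could be modeled with a fixed task effect and a per-subject random intercept using the lme4 add-on package; the data-frame and column names are hypothetical, and this sketch is not the analysis reported in the paper:
    # install.packages('lme4');   # lme4 provides lmer for fitting LMMs
    library(lme4);
    # Hypothetical: `times` has columns time (a timing feature, in
    # milliseconds), task ('free' or 'transcribed'), and subject (ID).
    model <- lmer(time ~ task + (1 | subject), data = times);
    summary(model);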

Summary

In conjunction with the paper itself, these supplemental research materials, instructions, and explanations are intended to enable researchers to reproduce the scientific results, tables, and figures from this work. By making our research materials publicly available, we also hope to enable other researchers to extend our work: testing additional feature-extraction procedures or classifier implementations, running alternative evaluations and analyses, or conducting largely new and different experiments with our data or classifier implementations. We hope this resource is useful. Please let us know if you use our data, or if you have any comments or suggestions.

This material is based upon work supported by the National Science Foundation under grant number CNS-0716677. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors, and do not necessarily reflect the views of the National Science Foundation.