Keystroke Dynamics - Benchmark Data Set

2. The Data

The data consist of keystroke-timing information from 51 subjects (typists), each typing a password (.tie5Roanl) 400 times.

- DSL-StrongPasswordData.txt (fixed-width format): MD5 hash = e5b72954c2e093a0a4ec7ca1485f9d05
- DSL-StrongPasswordData.csv (comma-separated-value format): MD5 hash = 470235f96568f28f9ea0da62234ec857
- DSL-StrongPasswordData.xls (Excel format): MD5 hash = e1a69b03315664d5dcaefd52583d6ad9
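
The MD5 hashes listed above can be used to confirm that a download is intact. The following is a minimal sketch in R using md5sum from the bundled tools package; the helper name verify_md5 is our own, and the final call assumes the three files sit in the current working directory.

```r
# Published hashes, keyed by file name (copied from the list above).
expected_md5 <- c(
  "DSL-StrongPasswordData.txt" = "e5b72954c2e093a0a4ec7ca1485f9d05",
  "DSL-StrongPasswordData.csv" = "470235f96568f28f9ea0da62234ec857",
  "DSL-StrongPasswordData.xls" = "e1a69b03315664d5dcaefd52583d6ad9")

# Hypothetical helper: TRUE where a file's MD5 hash matches the expected one.
verify_md5 <- function(paths, expected) {
  actual <- unname(tools::md5sum(paths))
  # md5sum returns NA for files it cannot read; treat that as a mismatch.
  !is.na(actual) & actual == tolower(expected)
}

# Run from the directory that holds the downloaded files.
verify_md5(names(expected_md5), expected_md5)
```

A FALSE entry suggests a missing, truncated, or altered file, in which case downloading it again is the simplest remedy.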

Common questions:

Q2-1: How were the data collected?

For complete details of our data collection methodology, we refer readers to our original paper [1]. A brief summary of our methodology follows.

We built a keystroke data-collection apparatus consisting of: (1) a laptop running Windows XP; (2) a software application for presenting stimuli to the subjects, and for recording their keystrokes; and (3) an external reference timer for timestamping those keystrokes. The software presents the subject with the password to be typed. As the subject types the password, it is checked for correctness. If the subject makes a typographical error, the application prompts the subject to retype the password. In this manner, we record timestamps for 50 correctly typed passwords in each session.

Whenever the subject presses or releases a key, the software application records the event (i.e., keydown or keyup), the name of the key involved, and a timestamp for the moment at which the keystroke event occurred. An external reference clock was used to generate highly accurate timestamps. The reference clock was demonstrated to be accurate to within ±200 microseconds (by using a function generator to simulate key presses at fixed intervals).

We recruited 51 subjects (typists) from within a university community; all subjects fully completed the study—we did not drop any subjects. All subjects typed the same password, and each subject typed the password 400 times over 8 sessions (50 repetitions per session). They waited at least one day between sessions, to capture some of the day-to-day variation of each subject's typing. The password (.tie5Roanl) was chosen to be representative of a strong 10-character password.

The raw records of all the subjects' keystrokes and timestamps were analyzed to create a password-timing table. The password-timing table encodes the timing features for each of the 400 passwords that each subject typed.

Q2-2: How do I read the data into R / Matlab / Weka / Excel / ...?

The data are provided in three different formats to make it easier for researchers visiting this page to view and manipulate the data. In all its forms, the data are organized into a table, but different applications are better suited to different formats.

(R): In the fixed-width format, the columns of the table are separated by one or more spaces so that the information in each column is aligned vertically. This format is easy to read in a standard web browser or document editor with a fixed-width font. It can also be read by the standard data-input mechanisms of the statistical-programming environment R. Specifically, the read.table command can be used to read the data into a structure called a data.frame:

X <- read.table( 'DSL-StrongPasswordData.txt', header=TRUE )

(Matlab and Weka): In the comma-separated-value format, the columns of the table are separated by commas. Most data-analysis packages (e.g., Matlab) can read this format. The Weka data-mining software collects many machine-learning algorithms that might be brought to bear on the keystroke-dynamics data. While Weka encourages the use of its own ARFF data-input format, a researcher could convert the CSV file into an ARFF-formatted file by prepending the appropriate header information.
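
The CSV-to-ARFF conversion mentioned above might be sketched in R as follows. The function name csv_to_arff, the relation name, and the choice to declare non-numeric columns (such as subject) as nominal attributes are our assumptions, not part of the published materials.

```r
# Hypothetical converter: writes an ARFF header, then the CSV rows as data.
csv_to_arff <- function(csv_path, arff_path, relation = "keystroke") {
  X <- read.csv(csv_path, stringsAsFactors = FALSE)
  # One @attribute line per column: numeric columns as "numeric",
  # anything else as a nominal attribute listing its distinct values.
  attr_lines <- vapply(names(X), function(nm) {
    col <- X[[nm]]
    if (is.numeric(col)) {
      sprintf("@attribute %s numeric", nm)
    } else {
      sprintf("@attribute %s {%s}", nm, paste(sort(unique(col)), collapse = ","))
    }
  }, character(1))
  header <- c(paste("@relation", relation), "", attr_lines, "", "@data")
  writeLines(header, arff_path)
  # Append the rows without a header line, quotes, or row names.
  write.table(X, arff_path, append = TRUE, sep = ",", quote = FALSE,
              row.names = FALSE, col.names = FALSE)
  invisible(arff_path)
}

# Example call (assumes the CSV file is in the working directory):
# csv_to_arff("DSL-StrongPasswordData.csv", "DSL-StrongPasswordData.arff")
```

Treating sessionIndex and rep as numeric is a simplification; a particular Weka analysis may be better served by declaring them nominal or removing them.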

(Excel): In the Microsoft Excel binary-file format, the columns of the table are encoded as a standard Excel spreadsheet. This format can be used by researchers wishing to bring Excel's data analysis and graphing capabilities to bear on the data.

By making the data available in these three formats, we hope to make it easier for other researchers to use their preferred data-analysis tools. In our own research, we use the fixed-width format and the R statistical-programming environment.

Q2-3: How are the data structured? What do the column names mean? (And why aren't the subject IDs consecutive?)

The data are arranged as a table with 34 columns. Each row of data corresponds to the timing information for a single repetition of the password by a single subject. The first column, subject, is a unique identifier for each subject (e.g., s002 or s057). Even though the data set contains 51 subjects, the identifiers do not range from s001 to s051; subjects have been assigned unique IDs across a range of keystroke experiments, and not every subject participated in every experiment. For instance, Subject 1 did not perform the password typing task and so s001 does not appear in the data set. The second column, sessionIndex, is the session in which the password was typed (ranging from 1 to 8). The third column, rep, is the repetition of the password within the session (ranging from 1 to 50).

The remaining 31 columns present the timing information for the password. The name of the column encodes the type of timing information. Column names of the form H.key designate a hold time for the named key (i.e., the time from when key was pressed to when it was released). Column names of the form DD.key1.key2 designate a keydown-keydown time for the named digraph (i.e., the time from when key1 was pressed to when key2 was pressed). Column names of the form UD.key1.key2 designate a keyup-keydown time for the named digraph (i.e., the time from when key1 was released to when key2 was pressed). Note that UD times can be negative (when key2 is pressed before key1 is released), and that each hold time plus the corresponding keyup-keydown time equals the keydown-keydown time: H.key1 + UD.key1.key2 = DD.key1.key2.
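
To make the three definitions concrete, the following R sketch derives hold, keydown-keydown, and keyup-keydown times from press and release timestamps. The key names (k1, k2, k3), the timestamps, and the function name timing_features are hypothetical and chosen only for illustration; they are not taken from the data set. The output groups the features by type, which may differ from the column order of the published table.

```r
# Hypothetical helper: timing features from per-key press/release times (seconds).
timing_features <- function(keys, press, release) {
  first <- seq_len(length(keys) - 1)   # index of the first key in each digraph
  second <- first + 1                  # index of the second key in each digraph
  hold <- release - press                  # H.key
  dd <- press[second] - press[first]       # DD.key1.key2
  ud <- press[second] - release[first]     # UD.key1.key2 (may be negative)
  names(hold) <- paste0("H.", keys)
  names(dd) <- paste0("DD.", keys[first], ".", keys[second])
  names(ud) <- paste0("UD.", keys[first], ".", keys[second])
  c(hold, dd, ud)
}

# Made-up timestamps in which k3 is pressed before k2 is released.
f <- timing_features(keys = c("k1", "k2", "k3"),
                     press = c(0.00, 0.40, 0.55),
                     release = c(0.15, 0.60, 0.70))
f["UD.k2.k3"]                   # negative: the two keys overlap
f["H.k1"] + f["UD.k1.k2"]       # matches f["DD.k1.k2"]
```

Because UD is defined as DD minus the hold time of the first key, the identity holds by construction; overlapping keystrokes are what make a UD time negative.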

Consider the following one-line example of what you will see in the data:
```
  subject  sessionIndex  rep      H.period   DD.period.t   UD.period.t     ...
     s002             1    1        0.1491        0.3979        0.2488     ...
```
The example presents typing data for subject 2, session 1, repetition 1. The period key was held down for 0.1491 seconds (149.1 milliseconds); the time between pressing the period key and pressing the t key (keydown-keydown time) was 0.3979 seconds; the time between releasing the period key and pressing the t key (keyup-keydown time) was 0.2488 seconds; and so on.
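
After loading the table, a quick sanity check of the layout described in this answer (34 columns, 51 subjects, 8 sessions of 50 repetitions) can catch a truncated or misread file. The helper name check_structure is our own; this is a sketch, not part of the published script.

```r
# Hypothetical helper: one logical flag per structural property of the table.
check_structure <- function(X) {
  per_subject <- table(X$subject)   # number of rows for each subject ID
  c(columns_34 = ncol(X) == 34,
    subjects_51 = length(per_subject) == 51,
    rows_400_each = all(per_subject == 400),
    sessions_1_to_8 = setequal(X$sessionIndex, 1:8),
    reps_1_to_50 = setequal(X$rep, 1:50))
}

# Example use (assumes the fixed-width file is in the working directory):
# X <- read.table("DSL-StrongPasswordData.txt", header = TRUE)
# check_structure(X)
```

Any FALSE entry points to the property that does not match the description above.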

3. Evaluation Scripts

The following procedure—written in the R language for statistical computing (www.r-project.org)—demonstrates how to use the data to evaluate three anomaly detectors (called Euclidean, Manhattan, and Mahalanobis).

evaluation-script.R

Note that this script depends on the R package ROCR for generating ROC curves [2].

Common questions:

Q3-1: What does the script really do? Can you explain the steps of the evaluation?

For complete details of our evaluation methodology, and a clear explanation of our design decisions, we refer readers to our original paper [1]. A brief summary of our evaluation methodology follows.

The following four steps are used to evaluate a single anomaly detector on the task of discriminating a single subject (designated as the genuine user) from the other 50 subjects (designated as the impostors). After evaluating the detector for a single subject, these four steps will be repeated for each subject in the data set, so that each subject, in turn, will have been "attacked" by each of the other 50 subjects in a balanced experimental design.

Step 1 (training): Retrieve the first 200 passwords typed by the genuine user from the password-timing table. Use the anomaly detector's training function with these password-typing times to build a detection model for the user's typing.

Step 2 (genuine-user testing): Retrieve the last 200 passwords typed by the genuine user from the password-timing table. Use the anomaly detector's scoring function and the detection model (from Step 1) to generate anomaly scores for these password-typing times. Record these anomaly scores as user scores.

Step 3 (impostor testing): Retrieve the first 5 passwords typed by each of the 50 impostors (i.e., all subjects other than the genuine user) from the password-timing table. Use the anomaly detector's scoring function and the detection model (from Step 1) to generate anomaly scores for these password-typing times. Record these anomaly scores as impostor scores.

Step 4 (assessing performance): Employ the user scores and impostor scores to generate an ROC curve for the genuine user. Calculate, from the ROC curve, an equal-error rate, that is, the error rate corresponding to the point on the curve where the false-alarm (false-positive) rate and the miss (false-negative) rate are equal.

Repeat the above four steps, designating each of the subjects as the genuine user in turn, and calculating the equal-error rate for the genuine user. Calculate the mean of all 51 subjects' equal-error rates as a measure of the detector's performance, and calculate the standard deviation as a measure of how much that performance varies across subjects.
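
The four steps and the final averaging can be sketched in base R as follows. This is not the code in evaluation-script.R: the function names are ours, the equal-error rate is approximated by scanning thresholds rather than read from an ROCR curve, and the data are simulated stand-ins (three made-up subjects with two made-up features) so that the sketch runs without the real file. It also assumes that the first 200 passwords are sessions 1 to 4 and the last 200 are sessions 5 to 8.

```r
# Illustrative Euclidean detector: the model is the mean timing vector and
# the anomaly score is the Euclidean distance from it.
euclidean_train <- function(train_x) list(centre = colMeans(train_x))
euclidean_score <- function(model, test_x) {
  sqrt(rowSums(sweep(test_x, 2, model$centre)^2))
}

# Approximate equal-error rate: pick the threshold where the false-alarm
# and miss rates are closest, and average the two (no interpolation).
equal_error_rate <- function(user_scores, impostor_scores) {
  thresholds <- sort(unique(c(user_scores, impostor_scores)))
  false_alarm <- vapply(thresholds, function(t) mean(user_scores > t), numeric(1))
  miss <- vapply(thresholds, function(t) mean(impostor_scores <= t), numeric(1))
  i <- which.min(abs(false_alarm - miss))
  (false_alarm[i] + miss[i]) / 2
}

# Steps 1 to 4 for a single genuine user.
evaluate_subject <- function(X, genuine, train_fn, score_fn) {
  features <- setdiff(names(X), c("subject", "sessionIndex", "rep"))
  is_user <- X$subject == genuine
  # Step 1: first 200 passwords of the genuine user (sessions 1 to 4).
  train_x <- as.matrix(X[is_user & X$sessionIndex <= 4, features])
  # Step 2: last 200 passwords of the genuine user (sessions 5 to 8).
  user_x <- as.matrix(X[is_user & X$sessionIndex >= 5, features])
  # Step 3: first 5 passwords of every other subject.
  impostor_x <- as.matrix(X[!is_user & X$sessionIndex == 1 & X$rep <= 5, features])
  model <- train_fn(train_x)
  # Step 4: equal-error rate from the user and impostor scores.
  equal_error_rate(score_fn(model, user_x), score_fn(model, impostor_x))
}

# Simulated stand-in data, NOT the real data set.
set.seed(1)
simulate_subject <- function(id, centre) {
  grid <- expand.grid(rep = 1:50, sessionIndex = 1:8)
  data.frame(subject = id, sessionIndex = grid$sessionIndex, rep = grid$rep,
             H.k1 = rnorm(400, centre, 0.02),
             DD.k1.k2 = rnorm(400, 2 * centre, 0.05),
             stringsAsFactors = FALSE)
}
X <- rbind(simulate_subject("sim1", 0.10),
           simulate_subject("sim2", 0.15),
           simulate_subject("sim3", 0.20))

# Repeat for every subject, then summarize across subjects.
eers <- vapply(unique(X$subject), function(s) {
  evaluate_subject(X, s, euclidean_train, euclidean_score)
}, numeric(1))
c(eer.mean = mean(eers), eer.sd = sd(eers))
```

Numbers produced from the simulated data say nothing about real typists. To reproduce the published figures, use the evaluation script itself with the real password-timing table.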
Q3-2: How do I download R / install packages / run the script?

You can download R from the webpage for the R Project for Statistical Computing (http://www.r-project.org). The R statistical-programming environment is a general programming language with many functions and packages for conducting a range of statistical analyses and data visualizations. It is available for most modern operating systems, and it is free and open-source. We developed and tested our evaluation script with R version 2.6.2, but we expect that it will work with similar versions of R.

If you are not familiar with R, there are many tutorials and references available online. The following is a collection of some that we have used, or that have been recommended to us:

- Introduction to the Statistical Language R (by Myron Hylinka)
- A Skimpy Intro to R/S/S-Plus (by Thomas Fletcher)
- A Brief History of S (by Richard Becker)
- R Tutorial (by Kelly Black)
- An Introduction to R (by the R Development Core Team)
- R Language Definition (by the R Development Core Team)

Once you have installed R and have become familiar with how to use it, the next step is to install an additional package (called ROCR) that is necessary for running our evaluation. The R project maintainers have organized a large collection of packages, and have made it easy to download and install these packages. The necessary package can be installed with a single R command:

install.packages( 'ROCR' )

The ROCR package [2] provides a set of functions that help with the analysis of the performance of anomaly-detection algorithms. Specifically, we use the package to generate ROC curves based on the output of each anomaly detector's scoring function, and then use the ROC curves to calculate the detector's equal-error rates.

The final step is to download and install the evaluation script from this webpage, and to place it in the same directory as the data in fixed-width format. To run the script, use the R command source:

source('evaluation-script.R')

The script first loads libraries that contain functionality used in the evaluation. Next, it defines training and scoring functions for the three anomaly detectors (i.e., the Euclidean, Manhattan, and Mahalanobis detectors). Then, the script defines a function for calculating the equal-error rate of a detector based on its output, and a test function that evaluates how well a given detector can discriminate a given subject from the rest. Finally, the script loads the typing data, and uses the previously defined functions to calculate the average equal-error rate of each detector across all subjects.

If you have installed R correctly, installed the appropriate packages, and run the evaluation script successfully, it should print information with which you can monitor the progress of the evaluation. Eventually, it should tally and print the following results for the three anomaly detectors:
```
            eer.mean eer.sd
Euclidean      0.171  0.095
Manhattan      0.153  0.092
Mahalanobis    0.110  0.065
```
Note that these results are fractional rates between 0.0 and 1.0 (not percentages between 0% and 100%). They match the average equal-error rates and standard deviations for the detectors from Table 2 of our original paper (and reproduced in the table of results, below). By running this script successfully, you will have replicated our evaluation methodology and reproduced our results for these three detectors.

Q3-3: Why does the script only have code for three anomaly detectors?

The purpose of this webpage is to share the data and the evaluation methodology that were the original contributions of our paper, not to provide and support code for all 14 anomaly detectors. In our original paper, we describe each of the 14 detectors, and we provide references to the original sources. We encourage researchers who are interested in replicating those detectors to use that material.

We implemented these three anomaly detectors because they are good examples with which to demonstrate our evaluation methodology. They are relatively easy to understand, since they are based on classical measures of distance from the statistical machine-learning and pattern-recognition literature. They are easy to implement, since they do not depend on packages or algorithms not found in a typical R installation. Finally, they are easy to run, since they do not require complex optimizations in order to run efficiently.

Note that, in the interest of scientific progress, we see a benefit in maintaining reference implementations of the top-performing detectors. Such reference implementations could be used, evaluated, and improved by the whole community. We are investigating the feasibility of sharing and supporting such reference implementations in the future (and wholeheartedly encourage others to do so as well), but are not able to do so at the present time.

Q3-4: What other kinds of anomaly detectors can be evaluated using these scripts?

Each of the anomaly detectors in our comparison consists of two functions, a training function and a scoring function. The training function takes a matrix of password-timing information as input, and it outputs a detection model. Each row of the input matrix encodes password-timing information from one repetition of the genuine user typing the password. The function uses this set of timing information to build a model of that user's typing. The details of the model are detector-specific, and they need not take a particular form for our evaluation.

The scoring function takes the detection model produced by the training function and another matrix of password-timing information as input. It outputs a set of anomaly scores. The scoring function compares the timing information from each password in the matrix to the genuine user's typing model. For each password, it calculates an anomaly score, indicating the degree to which that new sample is dissimilar from the typing model. A higher anomaly score means greater dissimilarity according to that anomaly detector's conception of similarity.

The commonality across all the anomaly detectors is that each one can be implemented as a training function and a scoring function. Any other anomaly detector that can be implemented as such a pair of functions, with the same types of input and output, can be evaluated using our methodology. If a new detector were implemented in R as the functions newTrain and newScore, it could be evaluated simply by adding these two functions to the detectorSet list of detectors:
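```
  # Register the new detector under a descriptive name. To keep evaluating
  # the existing detectors, retain their entries in this list as well.
  detectorSet <- list(
    NewDetector = list( train = newTrain,
                        score = newScore ) )
```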
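
As an illustration of such a pair of functions, the sketch below implements a detector loosely modeled on the scaled Manhattan detector named in the table of results. The details (mean vector plus mean absolute deviation per feature) and the argument order (training matrix in, model out; model and test matrix in, scores out) are our assumptions and should be checked against the original paper and the evaluation script before use.

```r
# Training: store the mean and the mean absolute deviation of each feature.
newTrain <- function(train_x) {
  centre <- colMeans(train_x)
  spread <- colMeans(abs(sweep(train_x, 2, centre)))
  list(centre = centre, spread = spread)
}

# Scoring: sum of absolute deviations from the mean, each scaled by the
# feature's spread. Higher scores mean greater dissimilarity.
# Note: a feature with zero spread in training would cause division by zero.
newScore <- function(model, test_x) {
  deviations <- abs(sweep(test_x, 2, model$centre))
  rowSums(sweep(deviations, 2, model$spread, FUN = "/"))
}

# Toy check with made-up numbers: feature means are 2 and 20,
# and the mean absolute deviations are 1 and 10.
toy_train <- matrix(c(1, 3, 10, 30), nrow = 2)
model <- newTrain(toy_train)
scores <- newScore(model, rbind(c(2, 20), c(4, 40)))
scores   # 0 for the row equal to the training mean, 4 for the other row
```

Scaling by the per-feature spread keeps features with naturally large variation from dominating the score.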
Our intent in sharing the data is for the password-timing tables to be used to evaluate a range of anomaly detectors so that the results of the evaluations can be soundly compared, using the same data and the same evaluation procedure. Consequently, we encourage other researchers to use our evaluation script to evaluate new and better anomaly-detection strategies for keystroke dynamics.

Q3-5: What if I want to do a different evaluation using the data?

The data, with or without the evaluation methodology, are intended to be a shared resource for public use. Consequently, researchers who would like to do different evaluations using the data are welcome to do so. For instance, while our study has focused on anomaly detectors, other researchers have considered binary and multi-class classifiers for keystroke dynamics. A binary classifier might be trained to discriminate between two typists, or between one typist and a pool of typing data drawn from many other typists; a multi-class classifier might be trained to identify which of several typists entered a particular typing sample. The data shared on this website could be used to evaluate any of these alternative families of learning algorithms.

Caution: we have one request for researchers who use these data with a different evaluation methodology. Please make it very clear that your methodology differs from the one in our paper, and clearly describe your alternative methodology. We have observed that algorithms evaluated under different conditions are often compared, even though the differing evaluation environments represent a serious potential confound. By clearly explaining your evaluation methodology, and how it differs from others, you reduce the risk that a reader will conflate the different methodologies and make unsound comparisons.

4. Table of Results

The following table ranks 14 anomaly detectors based on their average equal-error rates. The evaluation procedure described in the script above was used to obtain the equal-error rates for each anomaly detector. For example, the average equal-error rate for the scaled Manhattan detector (across all subjects) was 0.0962 (i.e., 9.62%), and the standard deviation was 0.0694.

Detector	Average Equal-Error Rate (stddev)
Manhattan (scaled)	0.0962 (0.0694)
Nearest Neighbor (Mahalanobis)	0.0996 (0.0642)
Outlier Count (z-score)	0.1022 (0.0767)
SVM (one-class)	0.1025 (0.0650)
Mahalanobis	0.1101 (0.0645)
Mahalanobis (normed)	0.1101 (0.0645)
Manhattan (filter)	0.1360 (0.0828)
Manhattan	0.1529 (0.0925)
Neural Network (auto-assoc)	0.1614 (0.0797)
Euclidean	0.1706 (0.0952)
Euclidean (normed)	0.2153 (0.1187)
Fuzzy Logic	0.2213 (0.1051)
k Means	0.3722 (0.1391)
Neural Network (standard)	0.8283 (0.1483)

Note that these results are fractional rates between 0.0 and 1.0 (not percentages between 0% and 100%).

Common questions:

Q4-1: How do I interpret this table of results?

The first column indicates the name of the detector. All of the detectors in this list bear the names they were given in our original paper. The names are meant to describe the mathematical or statistical technique that underlies the anomaly-detection strategy. The reader interested in how each detector works will find additional detail in the prose and references of the original paper.

The second column provides the average equal-error rate of each detector, as estimated by our evaluation methodology. The standard deviation appears in parentheses. The detectors are sorted from lowest to highest error.

Note that these results and rankings are only the observed results of a single evaluation on a single data set. We would discourage a reader from inferring that the top-ranked detectors will always outperform the other detectors. We believe it is likely that many factors (who the subjects are, what they type, and specifically how the data are collected and analyzed) affect the error rates of anomaly detectors used for keystroke dynamics. Variations in these factors might change a detector's equal-error rate, and might cause a different set of detectors to be among the top performers. A high rank in this table suggests that a detector is promising, but more data and more evaluations will be needed to determine how various factors affect keystroke-dynamics error rates. This topic is a subject of our current and ongoing research.

Q4-2: Why do you use the average equal-error rate as the sole measure of performance?

Summarizing the performance of an anomaly detector as a single number is tricky. There is no right way to do it that does not make some concessions, or have some drawbacks. As such, researchers in the field have used a variety of measures. In the original paper, we reported both the equal-error rate and the zero-miss false-alarm rate, since they are both used in the literature. On this webpage, we tabulate the equal-error rate of each detector because it is a common measure of performance for many biometric systems. If other, demonstrably better measures of performance emerge, we will consider the feasibility of calculating them on our evaluation data, and updating this page with these better measures.

Q4-3: Do you plan to update the table with new results?

It has been suggested that we maintain a "scoreboard" of the latest and best results. Insofar as we are informed of the results obtained by other investigators, we may do so. We intend to assemble and maintain a list of research projects that use and extend our results. Researchers might use this reference to compare and build upon each other's work.

The data and evaluation procedure are freely available for use. We do ask, as a courtesy, that you let us know if you publish results based on our data.