Interactive Task Training of a Mobile Robot through Human Gesture Recognition
Introduction

Gesture-based programming, or programming by demonstration, is a powerful tool for imparting abstract knowledge about a task to a robotic system in a very short amount of time. In this method of training, a task expert, such as a human (or another robot), either performs the actions necessary to complete a task or gestures in such a way as to impart symbolic knowledge about the task. The primary benefit of this method is that the trainer does not have to provide the robot with an exact model of all of the actions necessary to accomplish its goal. All the trainer needs to do is present the parameters of these actions to the robot. Such parameters may include what kinds of objects the action should affect, where the robot should be oriented while executing the action, and so forth.

Hidden Markov Models

Hidden Markov models (HMMs) are used to model the underlying processes of a system whose inner workings cannot be completely observed. This makes them a useful method for determining the underlying processes behind the individual components of a gesture. A fundamental assumption that can be made about human gestures is that simply observing a gesture is not sufficient to extract the underlying structure behind it. The same gesture may appear different to a naive system when repeated by different people under different circumstances. An HMM attempts to classify the underlying structure of the gesture and correlate it with the actual observed input. Gestures or speech (or any other kind of continuous signal) can be discretized and represented as single-dimensional strings of observation symbols, O=(o1,o2,...,on). The Forward-Backward algorithm is used to determine the likelihood that a given HMM, λ, produced a string of observed symbols. To adjust the parameters of an HMM to recognize a particular class of observation symbols, the Baum-Welch algorithm is used. Each gesture is assigned its own HMM to classify it.
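The likelihood computation described above can be sketched in a few lines. The following is a minimal illustration, not the system's actual C++ implementation: it assumes a discrete HMM stored as plain Python lists, and the names `forward_log_likelihood`, `most_likely_model`, `start`, `trans`, and `emit` are hypothetical. Only the forward half of the Forward-Backward algorithm is needed to obtain P(O|λ); the backward pass matters for training. The pass is rescaled at every step so that long observation strings do not underflow, and the result is returned as a log-likelihood.

```python
import math


def forward_log_likelihood(obs, start, trans, emit):
    """Return log P(O | lambda) for a discrete HMM (scaled forward pass).

    obs   : list of observation symbol indices o_1..o_n
    start : start[i]    = P(first state is i)
    trans : trans[i][j] = P(next state is j | current state is i)
    emit  : emit[i][k]  = P(symbol k is observed | state i)
    """
    if not obs:
        return 0.0  # the empty string has probability 1
    n_states = len(start)
    # Initialisation: alpha_1(i) = start_i * b_i(o_1)
    alpha = [start[i] * emit[i][obs[0]] for i in range(n_states)]
    log_like = 0.0
    for t in range(len(obs)):
        if t > 0:
            # Induction: alpha_t(j) = (sum_i alpha_{t-1}(i) * a_ij) * b_j(o_t)
            alpha = [
                sum(alpha[i] * trans[i][j] for i in range(n_states))
                * emit[j][obs[t]]
                for j in range(n_states)
            ]
        scale = sum(alpha)
        if scale == 0.0:
            return float("-inf")  # this model cannot produce the string
        # Rescale to avoid underflow; the logs of the scale factors
        # add up to log P(O | lambda).
        alpha = [a / scale for a in alpha]
        log_like += math.log(scale)
    return log_like


def most_likely_model(obs, models):
    """Return the name of the model with the highest log-likelihood.

    models : dict mapping a name to a (start, trans, emit) tuple.
    """
    return max(models, key=lambda name: forward_log_likelihood(obs, *models[name]))
```

With one `(start, trans, emit)` tuple per gesture, `most_likely_model` picks the gesture whose HMM best explains the observed string. This is likelihood-only classification, without any rejection of ambiguous inputs.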
Demonstration-Based Programming

The demonstration-based programming system consists of three sections, as shown below. The first part is a signal pre-processor, which filters the raw sensor data into a form that is usable by the rest of the system. The second part is the gesture classifier, which uses an HMM representation of gestures to recognize those made by the teaching human. The third part is the robotic skill system, which contains all of the sensory-motor skills necessary for the robot to interact with its environment.
This system is implemented on an RWI Pioneer 1 mobile robot outfitted with a Newton Labs Fast-Track Color Vision System (FTVS). All software is written in C++ using the Saphira 6.1f API running under Linux. The vision system performs color segmentation on the image, given user-defined parameters. The FTVS has three separate data channels which it can use to track different colors. Regions in the image which correspond to these colors are analyzed, and statistics about the largest single blob in the image are computed at 60Hz. One channel is defined explicitly for the teacher color (red). The other two channels are defined as "data" channels, which store the colors of objects (blue and green) that the robot can manipulate. The FTVS computes statistics of the objects that it tracks, including the center of mass, the area, and the perimeter of a bounding box surrounding the blob of color. These statistics are passed into data modules which discretize the sensor data at 10Hz. The ΔX and ΔY of the center of mass for each channel are computed and passed into the HMM classifier.

Gesture Classifier

When a new gesture is learned by the robot, a new HMM representation must be created and trained on sample gesture data. The human teacher provides a data set of sample gestures that is used by the Baum-Welch algorithm to train the new HMM. Once trained, this HMM is loaded into a database and is ready for use. When the human performs a gesture for the robot, the strings of symbols, O=(o1,o2,...,on), generated by both object tracking modules are fed into the HMM classifier. To classify this gesture, the value of P(O|λi) must be computed for each HMM in the database. The Forward-Backward algorithm is applied to calculate the likelihood for each HMM.
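The text does not say how the ΔX and ΔY values are turned into observation symbols, so the following is only one plausible sketch of that discretization step. It is an assumption, not the authors' codebook: each displacement is mapped either to a "no motion" symbol or to one of eight direction sectors. The function names, the `dead_zone` threshold, and the number of sectors are all hypothetical.

```python
import math


def displacement_to_symbol(dx, dy, dead_zone=2.0, n_directions=8):
    """Map one (dx, dy) center-of-mass displacement to a symbol index.

    Symbol 0 means "no significant motion". Symbols 1..n_directions label
    equal angular sectors, with sector 1 centered on the +x axis and the
    sector index increasing with the angle returned by atan2(dy, dx).
    """
    if math.hypot(dx, dy) < dead_zone:
        return 0
    angle = math.atan2(dy, dx) % (2 * math.pi)  # fold into [0, 2*pi)
    sector_width = 2 * math.pi / n_directions
    # Shift by half a sector so that each sector is centered on its direction.
    sector = int((angle + sector_width / 2) // sector_width) % n_directions
    return sector + 1


def centroids_to_symbols(centroids, **kwargs):
    """Turn a list of (x, y) centroids sampled at a fixed rate into symbols."""
    return [
        displacement_to_symbol(x1 - x0, y1 - y0, **kwargs)
        for (x0, y0), (x1, y1) in zip(centroids, centroids[1:])
    ]
```

A track of n centroid samples yields a string of n-1 symbols, which is the form of input, O=(o1,o2,...,on), that the HMM classifier expects.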
The gestures that the robot is programmed to recognize are the following:
Robot Skill System

The final part of the demonstration-based programming system is the database of skills that the robot is programmed with. A skill is a sensor-motor primitive that allows the robot to interact with its environment. Without this basic level of competence, the robot is unable to do any useful work. When the robot recognizes a gesture, it determines the skill that corresponds to that gesture and records in its plan execution sequencer the index of that skill, together with the Cartesian coordinates of where it was when it saw the gesture. When the robot has learned the task (i.e. the human has stopped demonstrating), the robot executes each action stored in its plan execution sequencer in the order that it saw them. Every gesture in the HMM database has a corresponding skill associated with it. However, not all skills have a corresponding gesture. The skills that do not have an associated gesture are generally used as part of the training process or for assisting the robot as it moves about the environment on its own. The complete list of robot skills is:
Experiments

Each HMM in the gesture classification database is trained with 25 sample gestures of a particular type. To test the classification system, 100 additional test samples of each kind of gesture are obtained. Each sample is fed into the HMM classifier, and the values for P(O|λi) and Ci are computed for each. In the second column, a value of less than 100% in the P(O|λi) column indicates that the system could potentially mis-classify gestures if the classification was accomplished using likelihood calculations alone. The values in the third column represent the percentage of times the system was confident of its classification. A low value here would mean that the system finds that particular gesture too ambiguous and would elect not to classify it at all instead of risking a misclassification. No mis-classifications occurred for this initial experiment.
To illustrate the amount of variation between gestures of the same type, the figure below shows the likelihood values from testing 100 different instances of a Grab Object gesture. The four connected line graphs represent the four HMMs stored in the system. The top-most set of points (denoted by '+' symbols) represents the likelihood returned by the HMM trained to recognize the Grab Object gesture. The other three sets of points represent the likelihoods returned by the HMMs for the other gestures. According to the above table, every value for P(O|λi) correctly classifies the data, even though the log of the probability values returned from the Forward-Backward algorithm fluctuates between -50 and -100.
![]()
The calculation of the confidence values over the same set of gestures and HMMs is shown in the figure below. As in the previous figure, the data returned from the HMM that was trained on the Grab Object gesture (once again highlighted by '+' symbols) has a much higher value than the data from the three other HMMs. However, there are gestures in this sequence about which the system is not very confident, namely those for which Cj ≤ Σ(k≠j) Ck, and thus the percentage of correct classifications for the confidence factor is only 94%.
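The rejection rule implied by this inequality can be sketched as follows. This section does not state how the confidence values Ci are computed, so the sketch takes them as given inputs. It also assumes that j denotes the model with the highest confidence. The function name and the gesture labels in the usage note are hypothetical.

```python
def classify_with_rejection(confidences):
    """Return the winning gesture name, or None if the result is ambiguous.

    confidences : dict mapping a gesture name to its confidence value C_i
                  (how C_i is obtained is outside the scope of this sketch).

    The best candidate j is accepted only when C_j exceeds the sum of the
    confidences of all other models; when C_j <= sum_{k != j} C_k the
    gesture is left unclassified.
    """
    if not confidences:
        return None
    best = max(confidences, key=confidences.get)
    others = sum(c for name, c in confidences.items() if name != best)
    return best if confidences[best] > others else None
```

For example, `classify_with_rejection({"grab_object": 0.7, "gesture_b": 0.1, "gesture_c": 0.1})` accepts `"grab_object"`. If the three values were 0.4, 0.3, and 0.3, the call would return `None` rather than risk a misclassification.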
![]()
Summary and Conclusions

A demonstration-based programming system was developed which allows a human to train a robot on a task by performing a series of actions or gestures. By demonstrating the actions for the robot, the human can let the robot extract relevant parameters for the task (such as the Cartesian position where the action should take place). The robot follows the human around the environment and tries to be as unobtrusive as possible so as to let the human complete the task. The robot provides feedback to the human when it fails to recognize a gesture so that the human knows to re-demonstrate the task. A set of simple gestures and corresponding actions was defined and implemented on a mobile robot. The gesture-recognition system was tested and found to be reasonably robust in its classification of gestures. The whole system was put through a preliminary test, and the results and outlook for the system are very encouraging.

An interesting departure from strictly human-to-robot gesture recognition is robot-to-robot gesture/action recognition. If a single robot is programmed with a particular task and executes it, another robot could be programmed with that task simply by watching the first one. In teams of robots where there are many parallel tasks that must be done, two specific classes of robots could be used: specialists and floaters. The specialists would be programmed ahead of time to do a particular task, while the floaters would move about and assist the specialists as needed. The floaters would observe the specialists doing their tasks and then be able to assist them appropriately. Once the floaters were no longer needed, they would move off to find another specialist to assist. Future extensions of this work will take this scenario and others like it into account.

Acknowledgments

We would like to acknowledge the support of NSF under grant NSF/DUE-9351513.
Additional support was provided by the Air Force Research Laboratory under contract number F30602-96-2-0240.
Relevant Publications

P. E. Rybski, R. M. Voyles, "Interactive Task Training of a Mobile Robot through Human Gesture Recognition and Imitation," Proceedings of the 1999 IEEE International Conference on Robotics and Automation, pp. 664-669, Detroit, MI, May 1999.