Vision-based gesture recognition and HCI
Human–computer interaction (HCI) as a discipline has received increased interest over the last two decades. Research has focused on developing novel ways of interaction, and computer vision–based interfaces have been an active area of this research. Such interfaces offer natural, manageable methods for interacting with a system.
Research in computer vision–based HCI has primarily focussed on the computer vision part. Consequently, there has been increased interest in sub-disciplines such as gesture recognition. This has led not only to more accurate gesture recognition systems but also to an overall improvement in the hardware (e.g. image acquisition setup) and software (e.g. classification algorithms) of computer vision and image processing. Similarly, improvements in other areas of imaging, computer vision and pattern recognition (e.g. better tracking algorithms) have benefitted gesture recognition.
In the last five years, some studies have pointed out this lack of focus on the interaction side (e.g. ). Their use of standard HCI techniques for the usability evaluation of recognition systems has been a step forward from previous studies, which focussed only on the accuracy of computer vision techniques and failed to address key issues of HCI. This, therefore, is the focus of our research: the development of a vision–based gesture recognition system that addresses human–computer interaction.
2. Overview of Research
The research will span two disciplines of computing and comprise two components: a) the development of a gesture recognition module and b) improving user interaction. The aim of the gesture recognition module is to develop a novel and robust tracking system that can track multiple ‘active’ users and recognise dynamic gestures. We will focus on two aspects of HCI. Firstly, proper usability studies will be conducted at all key stages of system development and, secondly, the cognitive workload experienced during interaction will be assessed.
Figure 1: Overview of Gesture Recognition System
2.1. Gesture recognition module
A block diagram of a typical gesture recognition system is shown in Figure 1. The focus of research will be on feature extraction and gesture classification – that is, the Gesture Recognition Module.
2.1.1 Feature Extraction
Feature extraction involves identifying candidate regions (hands and face) by background segmentation and tracking these candidate regions. The gesture recognition module, as shown in Figure 1, continuously takes video frames as input. Using skin locus-based segmentation, potential skin regions are identified and the rest of the image data is discarded as background. At this stage, image data is reduced considerably, making subsequent processing much more efficient. Skin colour information is to be combined with optical flow information to create joint criteria for discarding false positive regions. Once the candidate regions are identified, they are to be tracked using a particle filtering (PF) framework. The most probable candidate region is extracted and used as a template for gesture classification. Selecting a Bayesian framework will enable us to reliably track several candidate regions.
2.1.2 Gesture Classification
Standard deformable model fitting will suffice for classifying an isolated or stand-alone gesture. For classifying continuous gestures, a probabilistic framework such as a Variable Length Markov Model (VLMM) is to be used. A VLMM will make gesture classification flexible, allowing for a variable number of valid gestures made at one time.
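To illustrate why a variable-length context suits continuous gestures, the following is a minimal sketch of a variable-order Markov predictor over gesture symbols. It is a simplified stand-in rather than a full VLMM: it uses raw counts, and the `min_count` cut-off, the class name and the gesture labels are assumptions made purely for illustration.

```python
from collections import defaultdict


class VariableLengthMarkovModel:
    """Toy variable-order Markov model over gesture symbols (illustrative only)."""

    def __init__(self, max_order=3, min_count=2):
        self.max_order = max_order  # longest history (context) that is remembered
        self.min_count = min_count  # a context is trusted only if seen this often
        # counts[context][symbol] = how often `symbol` followed `context`
        self.counts = defaultdict(lambda: defaultdict(int))

    def train(self, sequence):
        """Count followers for every context of length 0..max_order."""
        for i, symbol in enumerate(sequence):
            for order in range(self.max_order + 1):
                if i - order < 0:
                    break
                context = tuple(sequence[i - order:i])
                self.counts[context][symbol] += 1

    def predict(self, history):
        """Return (most likely next symbol, context length actually used).

        The longest suffix of `history` observed at least `min_count` times
        is used; shorter suffixes act as a fallback.
        """
        for order in range(min(self.max_order, len(history)), -1, -1):
            context = tuple(history[len(history) - order:])
            followers = self.counts.get(context)
            if followers and sum(followers.values()) >= self.min_count:
                return max(followers, key=followers.get), order
        return None, 0
```

When a long history has not been observed often enough, the predictor falls back to a shorter one, which is the property that lets the amount of memory vary with the gesture sequence.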
2.2. Improving System Interaction
If vision–based interaction is to replace traditional methods, then we need to evaluate gesture–mediated computer interaction for application areas that inherently have a high cognitive workload. The term ‘cognitive workload’ can simply be defined as the mental effort required or mental workload experienced while accomplishing a task. Any key development in the gesture recognition system will be subjected to usability evaluation, using standard HCI principles.
The measurement of cognitive workload and usability testing will form a framework for evaluating our gesture recognition system. For the purpose of evaluation, mock-up application scenarios will be created. Areas of application that may be challenging are surveillance setups that require looking at more than one screen and taking appropriate action(s) and a traffic monitoring control room where a great deal of information is available at a particular time.
3. Objectives and Challenges
The major research objectives are listed below.
3.1 Gesture Recognition
3.1.1 Main Objectives
a. Developing a novel PF tracking mechanism that is based on skin-colour locus and optical flow. Skin segmentation experiments have enabled us to identify a credible range of skin colour locus in normalised r, g space.
b. The possible target application areas, mentioned above, will involve more than one user interacting with the system. The gesture classification mechanism is to be truly multi-user, allowing more than one active user. Skin colour and optical flow measurements will form the metric for resolving conflict and assigning priority.
3.1.2 Other Expected Contributions
c. One of the secondary aims of the project is to give a detailed comparison of various classification algorithms/techniques in order to analyse their relative strengths/weaknesses if they are to be used in real-time applications. The comparison is to be made separately for static and continuous or dynamic gestures. The motivation comes from research done in haptic interfaces for handheld devices. Techniques developed in a controlled environment were tested in real life conditions.
d. Creating a specialised dataset for evaluating skin detection techniques specifically for gesture recognition. During the skin detection experiments, we found it difficult to compare our results with those of other studies. Gesture recognition studies do not report results on skin detection, which is an intermediate step, while skin detection studies usually use publicly available web image datasets. We hope to create a dataset similar to the KTH dataset, which has become a standard dataset for evaluating action recognition techniques.
3.2 System Interaction
a. One of our initial aims is to develop a new, or adopt an existing, cognitive workload model for evaluating vision–based gestural interfaces. We are interested in developing a model for specific gestural interfaces similar to existing subjective cognitive workload models, e.g. the NASA-TLX workload model (see  for an overview of the model).
b. The cognitive workload model will be combined with standard usability testing to form an elaborate evaluation framework for gestural interfaces. Traditionally, evaluation of gestural interfaces involves measuring accuracy/error rate of the computer vision technique. This can be combined with standard usability techniques and cognitive workload measurements to create a more well-rounded evaluation framework.
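As a point of reference for the kind of subjective model mentioned above, the sketch below computes the commonly described NASA-TLX scores: six subscales rated from 0 to 100, with weights obtained from 15 pairwise comparisons. It only illustrates the existing model, not the specialised model proposed here; the function names and the example ratings are assumptions.

```python
SUBSCALES = ("mental", "physical", "temporal", "performance", "effort", "frustration")


def raw_tlx(ratings):
    """Unweighted ('raw') TLX: the plain mean of the six subscale ratings."""
    return sum(ratings[name] for name in SUBSCALES) / len(SUBSCALES)


def weighted_tlx(ratings, weights):
    """Weighted TLX: each rating is multiplied by the number of pairwise
    comparisons its subscale won (weights sum to 15), then divided by 15."""
    if set(ratings) != set(SUBSCALES) or set(weights) != set(SUBSCALES):
        raise ValueError("ratings and weights must cover exactly the six subscales")
    if sum(weights.values()) != 15:
        raise ValueError("pairwise-comparison weights must sum to 15")
    return sum(ratings[name] * weights[name] for name in SUBSCALES) / 15.0


# Hypothetical responses from one participant after a gesture-driven task.
example_ratings = {"mental": 80, "physical": 20, "temporal": 60,
                   "performance": 40, "effort": 70, "frustration": 30}
example_weights = {"mental": 5, "physical": 0, "temporal": 3,
                   "performance": 2, "effort": 4, "frustration": 1}
```

A model tailored to gestural interfaces could keep this structure while changing the subscales, which is one reason a weighted sum is a convenient starting point.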
4. Completed Work
Skin colour has an important characteristic in that it occupies a certain part of chromaticity space, i.e. the skin locus. The skin locus has been identified using skin samples from 150 images taken under different indoor lighting conditions, using two cameras. Traditionally, studies identify the skin locus by using camera parameters and details regarding the light source. In these studies, cameras are calibrated for each condition of illumination. We noticed a significant overlap of skin colour distributions across different indoor conditions. We identified general thresholds for red and green chromaticity and evaluated skin segmentation for various datasets that were prepared under apparently similar indoor conditions. Three participants, one each of Caucasian, South Asian and East Asian origin, participated in elaborate experimentation. The desired range for red chromaticity values lies between 0.40 and 0.65 inclusive, while for green it lies between 0.25 and 0.35 inclusive. In our experiments, only the lower red threshold needs to be varied to adjust to changes in lighting conditions.
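The thresholds above can be applied per pixel as in the following minimal sketch. It assumes the usual definition of normalised chromaticity, r = R/(R+G+B) and g = G/(R+G+B), and represents an image as nested lists of (R, G, B) tuples; the function names are illustrative and not taken from our implementation.

```python
# Thresholds reported in the text for normalised red and green chromaticity.
R_RANGE = (0.40, 0.65)
G_RANGE = (0.25, 0.35)


def chromaticity(pixel):
    """Return normalised (r, g) for an (R, G, B) pixel; black maps to (0, 0)."""
    red, green, blue = pixel
    total = red + green + blue
    if total == 0:
        return 0.0, 0.0
    return red / total, green / total


def is_skin(pixel, r_range=R_RANGE, g_range=G_RANGE):
    """True if the pixel's chromaticity falls inside both inclusive ranges."""
    r, g = chromaticity(pixel)
    return r_range[0] <= r <= r_range[1] and g_range[0] <= g <= g_range[1]


def skin_mask(image, r_range=R_RANGE, g_range=G_RANGE):
    """Boolean mask with the same shape as `image` (rows of RGB tuples)."""
    return [[is_skin(pixel, r_range, g_range) for pixel in row] for row in image]
```

Because intensity is divided out, a pixel and a uniformly darker copy of it map to the same (r, g) point, which is why the locus tolerates moderate brightness changes.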
Our proposed technique and defined thresholds were evaluated using four sign language datasets and two web images datasets. These defined thresholds give high skin detection rates without any knowledge of camera parameters and illumination details. The detection rates for skin detection on sign language datasets are in excess of 86%. Our study has, for the first time, reported detailed results on web images. Results of our study have been submitted as a ‘work in progress’ paper to BMVC Postgraduate Workshop 2010.
A simple yet efficient two-step procedure can be used for updating thresholds while the system is running in order to accommodate drastically changing lighting conditions. The first step is image or frame differencing followed by simple thresholding to obtain a skin segmented image. The segmented skin region is then used to update thresholds by computing mean red and green chromaticity values.
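The two-step update can be sketched as follows. The text does not specify how the new thresholds are derived from the mean chromaticities, so this sketch assumes that each range is re-centred on the mean while keeping the half-widths of the original ranges (0.125 for red, 0.05 for green). That rule, the difference threshold of 30 and all function names are assumptions.

```python
def moving_pixels(previous, current, diff_threshold=30):
    """Step 1: frame differencing followed by simple thresholding.

    Frames are rows of (R, G, B) tuples; a pixel counts as moving when its
    mean channel intensity changed by more than `diff_threshold`.
    """
    moving = []
    for prev_row, cur_row in zip(previous, current):
        for prev_px, cur_px in zip(prev_row, cur_row):
            change = abs(sum(cur_px) - sum(prev_px)) / 3.0
            if change > diff_threshold:
                moving.append(cur_px)
    return moving


def mean_chromaticity(pixels):
    """Mean normalised (r, g) over non-black pixels, or None if there are none."""
    reds, greens = [], []
    for red, green, blue in pixels:
        total = red + green + blue
        if total > 0:
            reds.append(red / total)
            greens.append(green / total)
    if not reds:
        return None
    return sum(reds) / len(reds), sum(greens) / len(greens)


def updated_thresholds(previous, current, r_half_width=0.125,
                       g_half_width=0.05, diff_threshold=30):
    """Step 2: re-centre both ranges on the mean chromaticity of the moving region.

    Returns ((r_low, r_high), (g_low, g_high)), or None when nothing moved.
    """
    mean = mean_chromaticity(moving_pixels(previous, current, diff_threshold))
    if mean is None:
        return None
    mean_r, mean_g = mean
    return ((mean_r - r_half_width, mean_r + r_half_width),
            (mean_g - g_half_width, mean_g + g_half_width))
```

Returning None for a static scene lets the caller keep the previous thresholds instead of updating from noise.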
There is a varying degree of false positives in skin detection, which can lead to erroneous classification. Combining a skin locus-based technique with a statistical technique such as a histogram model has been investigated to remove false positives. The combination of the two, although computationally efficient, did not improve the results. This is mainly due to the overlap between skin and non-skin colour distributions. Therefore, skin detection is combined with optical flow information to remove false positives. This technique significantly reduces the size of false positive regions, even if not all false positives are removed; the remaining regions can then be easily removed. We use the Lucas–Kanade optical flow algorithm.
The initial experimentation was done using MATLAB 7.0. For the real-time webcam video stream, the segmentation was good but there was a lag of about 1.5 s, caused by the overhead involved in computing optical flow. With OpenCV, however, the segmentation based on skin colour and optical flow is very efficient and can be used in real time. This is achieved by using the built-in, standard Lucas–Kanade algorithm.
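A minimal sketch of such a joint criterion is given below: a pixel is kept only if it is skin-coloured and is also moving. The flow vectors are taken as given (in practice they would come from an optical flow algorithm), and the magnitude threshold of 1.0 pixel per frame is an assumed value, not one established by our experiments.

```python
import math


def joint_mask(skin, flow, min_magnitude=1.0):
    """Combine a boolean skin mask with per-pixel flow vectors.

    `skin` is a grid of booleans and `flow` a grid of (dx, dy) displacements
    with the same shape. Static skin-coloured background is rejected because
    its flow magnitude stays below `min_magnitude`.
    """
    result = []
    for skin_row, flow_row in zip(skin, flow):
        row = []
        for is_skin_pixel, (dx, dy) in zip(skin_row, flow_row):
            moving = math.hypot(dx, dy) >= min_magnitude
            row.append(is_skin_pixel and moving)
        result.append(row)
    return result
```

Skin-coloured furniture or walls pass the colour test but fail the motion test, so the surviving false positive regions shrink to pixels that are both skin-like and moving.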
5. Current Work: Tracking Mechanism
Efficient and accurate tracking mechanisms will form the basis of this system. The skin regions of more than one person have been successfully segmented in real-time. Based on the evidence of our skin segmentation experiments, real-time particle filter tracking based on skin colour and optical flow is worth investigating. The focus of current work is:
- The accurate association of segmented skin regions (hands and face) of a user. In other words, grouping together segmented skin regions belonging to one person. This is challenging as there is a high chance of overlap and we want to limit constraints on the user in terms of his or her position and movement in the application area. Association of skin regions is not strictly required by our tracking and classification mechanism for hand gestures. It is pursued because it is general practice in the field and would prove useful if the face is to be used as an additional visual cue.
- Developing our particle filter-based tracking mechanism. Our tracker will treat each segmented region as a particle or sample. For the proposed system, the idea is to use a Bayesian framework for tracking candidate regions for gesture recognition. To improve tracking performance, the most probable candidate regions are identified beforehand (the background subtraction step is not part of the iterations of the particle filter-based tracker). With an initial vetting process for candidate regions, it will be possible to keep the number of samples reasonable for our particle filter. At every step, the weights of samples will be updated based on skin similarity measures and optical flow magnitudes. At this stage, the focus is to obtain accurate tracking results on recorded sequences. This will allow us to improve or modify the mathematical model (observation and measurement equations) for the tracker before testing it with a real-time video stream.
Weights of samples (or candidate regions) will determine the most important and relevant candidate regions. The Bayesian framework will help in reliably tracking multiple candidate regions. As our aim is to develop a multiuser gesture recognition system, the joint measure based on optical flow magnitude and skin-similarity will also be used to assign priority to users.
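The weight update described above can be sketched as follows. The tracker's observation model is still under development, so the likelihood used here (skin similarity multiplied by a flow magnitude capped at `flow_scale`) is only an assumed placeholder, and the class and function names are hypothetical.

```python
class CandidateRegion:
    """One sample of the tracker: a segmented skin region and its weight."""

    def __init__(self, label, weight, skin_similarity, flow_magnitude):
        self.label = label
        self.weight = weight                    # prior weight from the last step
        self.skin_similarity = skin_similarity  # in [0, 1], higher is more skin-like
        self.flow_magnitude = flow_magnitude    # mean displacement, pixels per frame


def likelihood(region, flow_scale=10.0):
    """Assumed joint measure: skin similarity times normalised, capped motion."""
    motion = min(region.flow_magnitude / flow_scale, 1.0)
    return region.skin_similarity * motion


def update_weights(regions, flow_scale=10.0):
    """Bayesian-style update: new weight is proportional to prior * likelihood."""
    unnormalised = [r.weight * likelihood(r, flow_scale) for r in regions]
    total = sum(unnormalised)
    for region, value in zip(regions, unnormalised):
        # Fall back to uniform weights when no region has any support.
        region.weight = value / total if total > 0 else 1.0 / len(regions)
    return regions


def most_probable(regions):
    """The highest-weight region, used as the region of interest."""
    return max(regions, key=lambda region: region.weight)
```

The same normalised weights could also serve as the priority measure between users, since a user with strong skin evidence and clear motion receives the largest share.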
6. Future Work
The future work is divided into three major phases.
6.1 Feature Extraction
Once we have a reliable tracker, the next step is to extract features from the tracked skin regions. At each step, the most probable candidate region is selected as region of interest or ROI. This is essentially the candidate skin region with the highest weight. The ROI is then used to extract the features which are to be used for gesture classification. For our work we will extract the hand contour shape of the ROI (the gesture making hand). Extracting proper hand contour shape will be possible as we have obtained good segmentation.
6.2 Static Gesture Classification
The first stage in gesture classification is static gesture recognition. Based on the extracted features, simple standalone gestures will be classified. The extracted hand contour will be compared against a gesture ‘template’ created from training examples. Due to the variability in hand posture, standard template matching will not work. Various deformable model fitting techniques that involve fitting training examples to a test example have shown good results for gesture classification [7, 8]. These techniques will probably require optimisation or improvement before they can be used in real time. For initial classification experiments we will use the three gestures that were used in our skin detection experiments, i.e. thumbs-up, pointing and halt/stop (showing the palm of the hand). The major aim of this stage is to compare, in detail, various classification algorithms, e.g. Support Vector Machine (SVM) and K-Nearest Neighbour classification. This is important in order to determine a suitable classification technique for a real-time application.
At this stage we will introduce usability testing. Two types of usability evaluation will be conducted: expert evaluation and user evaluation. Expert evaluation will be done using cognitive walkthrough, in which an expert in the technique evaluates a system by setting out a series of tasks. For each task, the expert answers a set of questions; answering these questions can help reveal design issues that may make it difficult to achieve a certain level of functionality. The second type of testing is user evaluation. Users are asked to perform one or more tasks and their performance is monitored and measured. A simple technique like ‘think-aloud’ is very useful in identifying flaws early in the system development process.
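As an example of one of the candidate classifiers, the sketch below shows K-Nearest Neighbour classification by majority vote. The two-element feature vectors are hypothetical placeholders; in the proposed system the features would be derived from the extracted hand contour.

```python
import math
from collections import Counter


def knn_classify(sample, training, k=3):
    """Label `sample` by majority vote among its k nearest training examples.

    `training` is a list of (feature_vector, label) pairs; distance is Euclidean.
    """
    neighbours = sorted(training, key=lambda item: math.dist(sample, item[0]))[:k]
    votes = Counter(label for _, label in neighbours)
    return votes.most_common(1)[0][0]


# Hypothetical shape descriptors for the three gestures named in the text.
TRAINING = [
    ((0.90, 0.10), "thumbs_up"), ((0.80, 0.20), "thumbs_up"), ((0.85, 0.15), "thumbs_up"),
    ((0.10, 0.90), "pointing"), ((0.20, 0.80), "pointing"), ((0.15, 0.85), "pointing"),
    ((0.50, 0.50), "halt"), ((0.55, 0.45), "halt"), ((0.45, 0.55), "halt"),
]
```

K-Nearest Neighbour needs no training phase, but every classification scans the whole training set, which is exactly the kind of real-time trade-off the planned comparison is meant to quantify.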
6.3 Dynamic Gesture Recognition
The next step is the classification of dynamic gestures. For recognition of these gestures, we will use a variant of Markov Models. We plan to use Variable Length Markov Models (VLMMs) in order to accommodate a flexible number of valid gestures made at one time. Multiple gesture recognition will also be a good test for our tracking mechanism.
At this stage, we expect to have a functioning prototype of the system; therefore, in addition to the usability evaluation techniques discussed above, we will measure the cognitive workload. Users will be asked to perform certain tasks in a mock-up application scenario. The aim is not only to evaluate the performance and user experience of our system but also to develop an evaluation framework for gestural interfaces. A specialised model for measuring cognitive workload in computer vision–based interfaces is to be developed. By combining this model with standard user evaluation techniques such as Fitts's law, we hope to create an evaluation framework for this discipline.
The completed work has shown very encouraging results for skin locus-based segmentation. Our technique, based on skin samples, removes the need for tedious camera calibration and specialised imaging setups. The focus of current work is developing a robust particle filter tracking mechanism based on skin similarity and optical flow. The joint criteria will not only remove false positives but will also be used to assign priority and resolve conflicts in our multi-user system. The major aims of the research are the development of a multi-user gesture recognition system and the development of models for usability evaluation and mental workload measurement. This is essentially a new direction in computer vision–based gesture recognition research, and it is important if gesture recognition systems are to be reliably deployed for applications beyond household electronics and educational aids.
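For reference, Fitts's law in its widely used Shannon formulation predicts the movement time to a target as MT = a + b log2(D/W + 1), where D is the distance to the target and W its width. The sketch below computes this quantity; the constants `a` and `b` are normally fitted by regression on user trials, so the values in the example are placeholders rather than measurements.

```python
import math


def index_of_difficulty(distance, width):
    """Shannon formulation of the index of difficulty, in bits."""
    if distance < 0 or width <= 0:
        raise ValueError("distance must be non-negative and width positive")
    return math.log2(distance / width + 1)


def movement_time(distance, width, a, b):
    """Predicted time to reach a target: MT = a + b * ID.

    `a` (seconds) and `b` (seconds per bit) are empirical constants that
    would have to be fitted for each input technique.
    """
    return a + b * index_of_difficulty(distance, width)


# Placeholder example: a 100-pixel target 700 pixels away has ID = log2(8) = 3 bits.
example_time = movement_time(700, 100, a=0.2, b=0.1)
```

Fitting `a` and `b` separately for gestural and conventional input in the same mock-up scenario would give a device-independent basis for comparing the two.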
[1] C. Hardenberg and F. Bérard. Bare-Hand Human-Computer Interaction. In Proceedings of the ACM Workshop on Perceptive User Interfaces, 2001, pp. 1-8.
[2] G. Iannizzotto, F. Rosa, C. Costanzo and P. Lanzafame. A Multimodal Perceptual User Interface for Video-surveillance Environments. In Proceedings of International Conference of Multimodal Interfaces (ICMI’05), 2005, pp. 45-52.
[3] N. Stefanov, A. Galata and R. Hubbold. Real-time Hand Tracking With Variable-Length Markov Models of Behaviour. In Proceedings of IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2005.
[4] M. Storring, H.J. Andersen and E. Granum. Skin colour detection under changing lighting conditions. In Proceedings of Seventh Symposium Intelligent Robotics Systems, 1999, pp. 187-195.
[5] V. Vezhnevets, V. Sazonov and A. Andreeva. A survey on pixel-based skin colour detection techniques. GRAPHICON03, 2003, pp. 85-92.
[6] NASA TLX Homepage: http://humansystems.arc.nasa.gov/groups/TLX/
[7] Y. Yuan and K. Barner. An Active Shape Model Based Tactile Hand Shape Recognition with Support Vector Machines. In Proc. of the 40th Annual Conference on Information Sciences and Systems, 2006, pp. 1611-1616.
[8] C. Sun and M. Xie. An Enhanced Active Shape Model for Facial Features Extraction. 11th IEEE International Conference on Communication Technology, 2008, pp. 661-664.