
Pattern Recognition

Volume 64, April 2017, Pages 399-406

A dynamic framework based on local Zernike moment and motion history image for facial expression recognition

https://doi.org/10.1016/j.patcog.2016.12.002

Highlights

  • Proposes a facial expression recognition framework that uses dynamic information.

  • Introduces QLZM_MCF to capture dynamic information in the temporal domain.

  • Introduces enMHI_OF to utilise motion speed and spatial information.

  • Proposes a weighting strategy on a grid for a high recognition rate.

Abstract

A dynamic descriptor facilitates robust recognition of facial expressions in video sequences. The two main current approaches to such recognition are basic emotion recognition and recognition based on facial action coding system (FACS) action units. In this paper we focus on basic emotion recognition and propose a spatio-temporal feature that combines the local Zernike moment in the spatial domain with motion change frequency. We also design a dynamic feature comprising the motion history image and entropy. To recognise a facial expression, a weighting strategy based on the latter feature and a sub-division of the image frame is applied to the former to enhance the dynamic information of facial expressions, followed by the application of the classical support vector machine. Experiments on the CK+ and MMI datasets using the leave-one-out cross-validation scheme demonstrate that the integrated framework achieves better performance than using either descriptor separately. Compared with six state-of-the-art methods, the proposed framework demonstrates superior performance.

Introduction

In recent years facial expression recognition has become a popular research topic [1], [2], [3]. With the recent advances in robotics, and as robots interact more and more with humans and become part of human living and work spaces, there is an increasing requirement for robots to understand human emotions via a facial expression recognition system [4]. Facial expression recognition also plays a significant role in Human-Computer Interaction (HCI) [5], where it has helped to create meaningful and responsive interfaces. It has also been widely used in behavioural studies, video games, animation, safety mechanisms in automobiles, etc. [6].

Discriminative and robust features that represent facial expressions are important for their effective recognition, and how to obtain them remains a challenging problem. Recent methods that address this problem can be categorised into global-based and local-based methods. It has been shown that local-based methods (e.g., based on Gabor wavelets using grid points) achieve better performance than global-based ones (e.g., based on eigenfaces, Fisher's discriminant analysis, etc.) [7]. The Gabor wavelet yields good performance due to its locality and orientation selectivity. However, its high computational complexity, and hence long computation time, makes it unsuitable for real-time applications. The Local Binary Pattern (LBP) descriptor, which is based on the histogram of local patterns, also achieves promising performance [8].

Shape as a geometric-based representation is crucial for interpreting facial expressions. However, current state-of-the-art methods only focus on a small subset of possible shape representations, e.g., point-based methods that represent a face using the locations of several discrete points. Noting that image moments can describe simple properties of a shape, e.g., its area (or total intensity), its centre and its orientation, Zernike moments (ZMs) have been used to represent a face and facial expressions in [9], [10]. Zernike moments are rotation-invariant features, which can be used to address in-plane head pose variation. In the field of facial expression recognition, rotation-invariant LBP and uniform LBP [11] have also been used to overcome the rotation problem. In [12], the Quantised Local Zernike Moment (QLZM) is used to describe the neighbourhood of a face sub-region. Local Zernike moments have more discriminative power than other image features, e.g., the local phase-magnitude histogram (H-LZM), the cascaded LZM transformation (H-LZM2) and the local binary pattern (LBP) [13].
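For concreteness, the following minimal sketch computes classical Zernike moments of a grayscale patch directly from the standard definition. It illustrates the underlying representation only, not the authors' quantised QLZM variant; the function names are ours.

```python
# Minimal sketch: classical Zernike moments of a square grayscale patch.
# Standard definition only; NOT the authors' quantised QLZM implementation.
import numpy as np
from math import factorial

def radial_poly(rho, n, m):
    """Zernike radial polynomial R_nm(rho); requires n - |m| even, |m| <= n."""
    m = abs(m)
    R = np.zeros_like(rho)
    for s in range((n - m) // 2 + 1):
        c = ((-1) ** s * factorial(n - s)
             / (factorial(s)
                * factorial((n + m) // 2 - s)
                * factorial((n - m) // 2 - s)))
        R += c * rho ** (n - 2 * s)
    return R

def zernike_moment(patch, n, m):
    """Complex Zernike moment A_nm of `patch` mapped onto the unit disc
    (up to a constant pixel-area factor)."""
    h, w = patch.shape
    ys, xs = np.mgrid[0:h, 0:w]
    x = (2.0 * xs - (w - 1)) / (w - 1)   # map columns into [-1, 1]
    y = (2.0 * ys - (h - 1)) / (h - 1)   # map rows into [-1, 1]
    rho = np.hypot(x, y)
    theta = np.arctan2(y, x)
    mask = rho <= 1.0                    # keep pixels inside the unit disc
    basis = radial_poly(rho, n, m) * np.exp(-1j * m * theta)
    return (n + 1) / np.pi * np.sum(patch[mask] * basis[mask])
```

The rotation invariance noted above comes from taking the magnitude: rotating the patch only multiplies A_nm by a unit-modulus phase factor, so |A_nm| is unchanged.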

Since a facial expression involves a dynamic process, and the dynamics contain information that represents the expression more effectively, it is important to capture such dynamic information so as to recognise facial expressions over the entire video sequence. Recently, there has been more effort on modelling the dynamics of a facial expression sequence, but such modelling is still a challenging problem. Thus, in this paper, we focus on analysing the dynamics of facial expression sequences. First, we extend the spatial-domain QLZM descriptor into the spatio-temporal domain, i.e., Motion Change Frequency based QLZM (QLZM_MCF), which enables the representation of the temporal variation of expressions. Second, we apply optical flow to the Motion History Image (MHI) [14], i.e., optical-flow-based MHI (MHI_OF), to represent spatio-temporal dynamic information (i.e., velocity).
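The precise MCF construction is given in the full text; as a loosely hedged illustration only, the sketch below implements one plausible reading in which MCF measures, per pixel, how often a quantised code changes between consecutive frames. The array layout and normalisation are our assumptions.

```python
# Hedged sketch of ONE plausible reading of motion change frequency (MCF):
# the fraction of frame-to-frame transitions at which each pixel's quantised
# code (e.g., a QLZM label) changes. Not the paper's exact definition.
import numpy as np

def motion_change_frequency(codes):
    """codes: (T, H, W) array of per-frame quantised labels.
    Returns an (H, W) map of change frequencies in [0, 1]."""
    changes = codes[1:] != codes[:-1]   # (T-1, H, W): code changed?
    return changes.mean(axis=0)         # average over frame transitions
```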

We utilise two types of features: a spatio-temporal shape representation, QLZM_MCF, to enhance the local spatial and dynamic information, and a dynamic appearance representation, MHI_OF. We also introduce an entropy-based method to provide the spatial relationship between different parts of a face by computing the entropy value of different sub-regions of the face. The main contributions of this paper are: (a) QLZM_MCF; (b) MHI_OF; (c) an entropy-based method for MHI_OF to capture the motion information; and (d) a strategy integrating QLZM_MCF and entropy to enhance spatial information.
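To make the entropy-based weighting concrete, the following sketch divides a motion map (e.g., an MHI) into a grid of sub-regions and uses the Shannon entropy of each cell's intensity histogram as its weight; the grid size, histogram binning and normalisation are our assumptions, not the paper's exact procedure.

```python
# Hedged sketch: entropy of grid cells of a motion map as per-region weights.
# Assumes motion_map values are normalised to [0, 1].
import numpy as np

def grid_entropy_weights(motion_map, grid=(4, 4), bins=32):
    H, W = motion_map.shape
    gh, gw = H // grid[0], W // grid[1]
    weights = np.zeros(grid)
    for i in range(grid[0]):
        for j in range(grid[1]):
            cell = motion_map[i * gh:(i + 1) * gh, j * gw:(j + 1) * gw]
            hist, _ = np.histogram(cell, bins=bins, range=(0.0, 1.0))
            p = hist / max(hist.sum(), 1)            # empirical distribution
            p = p[p > 0]
            weights[i, j] = -(p * np.log2(p)).sum()  # Shannon entropy (bits)
    return weights / max(weights.sum(), 1e-12)       # normalise to sum to 1
```

Cells with richer motion content (broader intensity distributions) thus receive larger weights than near-static cells.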

The rest of the paper is organised as follows. Previous related work is presented in Section 2. Section 3 presents QLZM_MCF, the method using MHI_OF and entropy, and the integration of the two dynamic features. The framework and the experimental results are presented in Sections 4 and 5, respectively. Finally, Section 6 concludes the paper.

Section snippets

Related work

The two main focuses of current research on facial expression are basic emotion recognition and recognition based on facial action coding system (FACS) action units (AUs). The most widely used facial expression descriptors for recognition and analysis are the six prototypical expressions of Anger, Disgust, Fear, Happiness, Sadness and Surprise [15]. The most widely used facial muscle action descriptors are AUs [1]. With regard to basic emotion recognition, geometric-based features and …

Motion history image

MHI can be considered as a two-component temporal template, a vector-valued image where each component of each pixel is some function of the motion at that pixel location. The MHI $H_\tau(x,y,t)$ is computed from an update function $\Psi(x,y,t)$, i.e.,

$$
H_\tau(x,y,t) =
\begin{cases}
\tau, & \Psi(x,y,t) = 1 \\
\max\left(0,\, H_\tau(x,y,t-1) - \delta\right), & \text{otherwise}
\end{cases}
$$

where $(x,y,t)$ denotes the spatial coordinates $(x,y)$ of an image pixel at time $t$ (in terms of image frame number), the duration $\tau$ determines the temporal extent of the movement in terms of frames, and $\delta$ is the decay parameter.
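The update above takes only a few lines to implement; in the sketch below the update function Ψ is derived from dense optical-flow magnitude, in the spirit of MHI_OF, using OpenCV's Farneback flow. The flow parameters and motion threshold are illustrative choices, not values from the paper.

```python
# Hedged sketch of the MHI update with an optical-flow-based update function,
# a stand-in for the paper's MHI_OF construction.
import cv2
import numpy as np

def update_mhi(mhi, prev_gray, gray, tau=255.0, delta=1.0, motion_thresh=1.0):
    """prev_gray, gray: consecutive uint8 grayscale frames; mhi: float32 map."""
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)   # per-pixel motion speed
    psi = mag > motion_thresh            # update function Psi(x, y, t)
    # H = tau where motion occurred, otherwise decay by delta (floored at 0).
    return np.where(psi, tau, np.maximum(0.0, mhi - delta)).astype(np.float32)

# Usage over a list of grayscale frames:
# mhi = np.zeros(frames[0].shape, np.float32)
# for prev, cur in zip(frames, frames[1:]):
#     mhi = update_mhi(mhi, prev, cur)
```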

Facial expression recognition framework

Fig. 5 outlines the proposed framework, which comprises pre-processing, feature extraction and classification. The pre-processing includes facial landmark detection and face alignment, where face alignment is applied to reduce the effects of variation in head pose and scene illumination. We use local evidence aggregated regression [38] to detect facial landmarks in each frame, where the locations of the detected eyes and nose are used for face alignment, including scaling and cropping. …
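As an illustration of the alignment step, the sketch below rotates, scales and crops a face image given detected eye centres. The landmark detector itself (the paper uses local evidence aggregated regression [38]) is abstracted away, and the canonical eye placement ratios are our assumptions.

```python
# Hedged sketch of eye-based face alignment (rotation, scaling, cropping).
# Eye centres are assumed to come from a landmark detector; the placement
# ratios (0.4, 0.35) are illustrative, not taken from the paper.
import cv2
import numpy as np

def align_face(img, left_eye, right_eye, out_size=128, eye_dist_ratio=0.4):
    (lx, ly), (rx, ry) = left_eye, right_eye
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))        # level the eyes
    scale = (eye_dist_ratio * out_size) / np.hypot(rx - lx, ry - ly)
    centre = ((lx + rx) / 2.0, (ly + ry) / 2.0)             # eye midpoint
    M = cv2.getRotationMatrix2D(centre, angle, scale)
    # Shift so the eye midpoint lands at a canonical crop position.
    M[0, 2] += out_size / 2.0 - centre[0]
    M[1, 2] += 0.35 * out_size - centre[1]
    return cv2.warpAffine(img, M, (out_size, out_size))
```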

Facial expression datasets

We use the Extended CK dataset (CK+) as it is widely used for evaluating the performance of facial expression recognition methods and thus facilitates comparison of performances. The dataset includes 327 image sequences of six basic expressions (namely Anger, Disgust, Fear, Happiness, Sadness and Surprise) and a non-basic emotion expression (namely Contempt), performed by 118 subjects. Each image sequence in this dataset has a varying number of frames, and starts with the neutral state and ends with the peak (apex) of the expression. …

Conclusion

This paper presents a facial expression recognition framework using enMHI_OF and QLZM_MCF. The framework, which comprises pre-processing and feature extraction followed by 2D PCA and SVM classification, achieves better performance than most state-of-the-art methods on the CK+ and MMI datasets. Our main contributions are threefold. First, we proposed a spatio-temporal feature based on QLZM. Second, we applied optical flow in MHI to obtain the MHI_OF feature, which incorporates velocity information. Third, we introduced an entropy-based weighting strategy to enhance the spatial information of the dynamic features. …

Acknowledgements

The authors would like to thank China Scholarship Council / Warwick Joint Scholarship (Grant no. 201206710046) for providing the funds for this research.


References (44)

  • B. Fasel et al., Automatic facial expression analysis: a survey, Pattern Recognit. (2003)
  • T. Ahonen, A. Hadid, M. Pietikainen, Face recognition with local binary patterns, in: Proceedings of the European...
  • A. Ono, Face recognition with Zernike moments, Syst. Comput. Jpn. (2003)
  • C. Singh et al., Face recognition using Zernike and complex Zernike moment features, Pattern Recognit. Image Anal. (2011)
  • E. Sariyanidi, H. Gunes, M. Gokmen, A. Cavallaro, Local Zernike Moments representations for facial affect recognition,...
  • E. Sariyanidi, V. Dal, S.C. Tek, B. Tunc, M. Gökmen, Local Zernike Moments: A new representation for face recognition....
  • A.F. Bobick et al., The recognition of human movement using temporal templates, IEEE Trans. Pattern Anal. Mach. Intell. (2001)
  • P. Ekman et al., Constants across cultures in the face and emotion, J. Pers. Soc. Psychol. (1971)
  • M. Pantic et al., Facial action recognition for facial expression analysis from static face images, IEEE Trans. Syst. Man Cybern. (2004)
  • M. Pantic et al., Dynamics of facial expressions - recognition of facial actions and their temporal segments from face profile image sequences, IEEE Trans. Syst. Man Cybern. (2006)
  • S. Gokturk, J. Bouguet, C. Tomasi, B. Girod, Model-based face tracking for view-independent facial expression...
  • I. Cohen, N. Sebe, F. Cozman, M. Cirelo, T. Huang, Learning Bayesian network classifiers for facial expression...
Xijian Fan received B.Sc. in Information and Communication Technology from Nanjing University of Posts and Telecommunications, China, and M.Sc. in Computer Information and Science from Hohai University, China, in 2008 and 2012, respectively. He is currently pursuing a PhD in Engineering at the University of Warwick, U.K. His research interests include image processing and facial expression recognition.

Tardi Tjahjadi received B.Sc. in Mechanical Engineering from University College London in 1980, and M.Sc. in Management Sciences in 1981 and Ph.D. in Total Technology in 1984 from UMIST, U.K. He has been an associate professor at Warwick University since 2000 and a reader since 2014. His research interests include image processing and computer vision.