Abstract:
This paper proposes an audio-visual emotion recognition system that combines
rule-based and machine-learning techniques to improve recognition accuracy
in both the audio and the visual paths. The visual path is
designed using the Bi-directional Principal Component Analysis (BDPCA)
and Least-Square Linear Discriminant Analysis (LSLDA) for dimensionality
reduction and discrimination. The extracted visual features are passed
into a newly designed Optimized Kernel-Laplacian Radial Basis Function
(OKL-RBF) neural classifier. The audio path is designed using a
combination of prosodic features (pitch, log-energy, zero-crossing
rate, and the Teager energy operator) and spectral features (Mel-frequency
cepstral coefficients, MFCCs). The extracted audio features are
passed into an audio feature level fusion module that uses a set of
rules to determine the most likely emotion contained in the audio
signal. An audio-visual fusion module then fuses the outputs of both
paths. The performance of the proposed audio path, visual path, and
complete system is evaluated on standard databases. Experimental results
and comparisons confirm the effectiveness of the proposed system.
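As a minimal illustration of the per-frame prosodic features named above (pitch, log-energy, zero-crossing rate, and the Teager energy operator), the sketch below computes them with NumPy. It is not the paper's implementation: the frame length, hop size, pitch search range, and the autocorrelation-based pitch estimator are assumptions chosen for the sketch.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (assumed 25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop:i * hop + frame_len] for i in range(n)])

def prosodic_features(frame, sr=16000):
    """Return [pitch_hz, log_energy, zcr, mean_teager] for one frame."""
    # log-energy of the frame (small offset avoids log(0))
    log_e = np.log(np.sum(frame ** 2) + 1e-12)
    # zero-crossing rate: fraction of sample pairs with a sign change
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2
    # Teager energy operator: psi[n] = x[n]^2 - x[n-1] * x[n+1], averaged
    teo = np.mean(frame[1:-1] ** 2 - frame[:-2] * frame[2:])
    # crude pitch estimate: autocorrelation peak, search limited to 50-400 Hz
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = sr // 400, sr // 50
    pitch = sr / (lo + np.argmax(ac[lo:hi])) if hi <= len(ac) else 0.0
    return np.array([pitch, log_e, zcr, teo])
```

For example, applied to one 400-sample frame of a 200 Hz sinusoid at 16 kHz, the pitch estimate lands near 200 Hz. The spectral (MFCC) path would typically be computed with a standard library routine rather than by hand.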