Tuesday, December 1, 2009

Human Action Recognition and Localization

Semi-supervised Human Action Recognition and Localization using Spatially and Temporally Integrated Local Features

Joint work by Tuan Hue THI, Jian Zhang (NICTA-Sydney), Li Cheng (TTI-Chicago), Li Wang (SEU-China), and Shin'ichi Satoh (NII-Tokyo)


This paper presents a novel framework for recognizing and localizing human actions in video sequences using a weakly supervised approach. Local space-time features are detected from video shots and represented as histogram vectors of oriented gradients and optical flow. A Sparse Bayesian Kernel classification model is built to capture the compact characteristics of the supervised data while adapting to unknown data; its purpose is to label each local feature with the relevant action class. Group constraints among local features and Markov Chain Monte Carlo sampling are incorporated into the model via data association to improve both accuracy and processing time. The label assignments are first passed to a non-linear Support Vector Machine to decide the action class of the whole video shot. The same label values are then fed into a Conditional Random Field that propagates label information among neighboring regions and thereby accurately locates the event areas. Testing this weakly trained model on the classical KTH dataset, the realistic Hollywood Human Action dataset, and the challenging TRECVID event detection dataset yields results comparable to most state-of-the-art fully supervised techniques.

Local Feature Description as Video Representation

The Space Time Interest Point detection approach developed by Laptev [2] is used to detect and describe the local features of a video shot, and hence to represent it, as seen in the following snapshots from the TRECVID dataset.
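To give a flavor of the detection step, here is a minimal single-scale sketch of a Laptev-style space-time interest point detector on a grayscale video volume; the smoothing scales, the constant k, and the thresholding strategy are illustrative placeholders, and the actual system uses the multi-scale detector of [2] together with HOG/HOF descriptors around each point.

import numpy as np
from scipy import ndimage

def stip_detect(V, sigma=2.0, tau=1.5, k=0.005, top_n=200):
    """V: grayscale video volume of shape (frames, height, width)."""
    Vs = ndimage.gaussian_filter(V.astype(np.float64), (tau, sigma, sigma))
    grads = np.gradient(Vs)                      # (Lt, Ly, Lx)
    # Space-time second-moment matrix, integrated over a Gaussian window.
    win = (2 * tau, 2 * sigma, 2 * sigma)
    mu = np.empty(Vs.shape + (3, 3))
    for i, gi in enumerate(grads):
        for j, gj in enumerate(grads):
            mu[..., i, j] = ndimage.gaussian_filter(gi * gj, win)
    # Extended Harris criterion for 3x3 matrices: det(mu) - k * trace(mu)^3.
    H = np.linalg.det(mu) - k * np.trace(mu, axis1=-2, axis2=-1) ** 3
    # Keep the strongest positive local maxima as interest points.
    peaks = (H == ndimage.maximum_filter(H, size=3)) & (H > 0)
    pts = np.argwhere(peaks)                     # (t, y, x) centers
    order = np.argsort(H[peaks])[::-1]
    return pts[order[:top_n]]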


Sparse Bayesian Kernel Machine for the Labeling Task

Video action labeling is a tedious and time-consuming task. The vast amount of growing video has also created the need for a technique that can learn from a small amount of supervised data and still catch similar motion patterns in completely unknown environments. Among the popular classification techniques, the Bayesian learning approach fits our semi-supervised learning task best, since it is more flexible in representing the divergence between training and testing data sources, and it explicitly links each hypothesis with its computed score. The core idea of the Bayesian approach is to approximate the posterior distribution based on multiple trained hypotheses. We extend the Bayesian approach to object recognition in images from Carbonetto et al. [1] to human action recognition in video, with additional constraints on the structure among interest points in both space and time. Each action class has one classifier trained from its small supervised set; the negative samples are randomly drawn from the pool of all other classes.
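As a concrete illustration of this per-class training-set construction, here is a minimal sketch; features_by_class is a hypothetical mapping from action name to its (n, d) descriptor array, not an interface from the paper.

import numpy as np

def build_training_set(features_by_class, target, n_neg=None, seed=0):
    """One-vs-rest set: positives from the small supervised set of the
    target class, negatives sampled at random from all other classes."""
    rng = np.random.default_rng(seed)
    pos = features_by_class[target]
    pool = np.vstack([f for c, f in features_by_class.items() if c != target])
    idx = rng.choice(len(pool), size=n_neg or len(pos), replace=False)
    X = np.vstack([pos, pool[idx]])
    y = np.concatenate([np.ones(len(pos)), -np.ones(len(idx))])
    return X, y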

The following directed graph visualizes the semi-supervised sparse Bayesian Kernel classification framework with the Group Statistics constraint augmentation.


In this design, x is the observation (the interest point descriptor) and y is the binary class label (-1 or 1) indicating whether the interest point belongs to the action class; shaded nodes are supervised, blank nodes are unknown, and squared nodes are fixed hyperparameters. z denotes the low-dimensional latent variables introduced to simplify the model computation. Gamma and Beta parameterize the probit link function model introduced by Tham et al. [3]: Gamma is the vector of feature selection indicators, each following a Bernoulli distribution with success rate Tau, which is itself drawn from a Beta distribution with parameters a and b; Beta is the regression coefficient vector, regularized by the term Delta squared, which is in turn assigned an inverse Gamma prior with the two hyperparameters Mu and Nu.
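To make the prior structure concrete, here is a minimal sketch of drawing one configuration from it; the symbol names follow the text above, while the hyperparameter values are placeholders rather than the paper's settings.

import numpy as np

rng = np.random.default_rng(0)
m = 50                                      # number of kernel basis functions
a, b = 1.0, 1.0                             # Beta hyperparameters for Tau
mu_h, nu_h = 2.0, 1.0                       # inverse-Gamma hyperparameters Mu, Nu

tau = rng.beta(a, b)                        # success rate of feature selection
gamma = rng.binomial(1, tau, size=m)        # Bernoulli selection indicators
delta2 = 1.0 / rng.gamma(mu_h, 1.0 / nu_h)  # Delta^2 ~ inverse-Gamma(Mu, Nu)
beta = gamma * rng.normal(0.0, np.sqrt(delta2), size=m)  # sparse coefficients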

Computation is done using Markov Chain Monte Carlo sampling together with a blocked Gibbs sampler. The outputs of the interest point labeling task are shown in the following figure, with green marking interest points that belong to the action class PersonRuns and yellow marking those from the noisy background.
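The full sampler alternates blocks over the selection indicators, the regression coefficients, and the latent variables; as a flavor of the computation, here is a minimal Albert-Chib-style Gibbs sweep for just the probit regression block, with the feature-selection indicators and group constraints omitted, a fixed prior variance delta2, and the kernel matrix K assumed precomputed.

import numpy as np
from scipy.stats import truncnorm

def probit_gibbs(K, y, delta2=10.0, n_iter=500, seed=0):
    """K: (n, m) kernel matrix; y: labels in {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, m = K.shape
    beta = np.zeros(m)
    V = np.linalg.inv(K.T @ K + np.eye(m) / delta2)  # posterior covariance
    L = np.linalg.cholesky(V)
    samples = []
    for _ in range(n_iter):
        mu = K @ beta
        # Latent z_i ~ N(mu_i, 1), truncated to agree with the sign of y_i.
        lo = np.where(y > 0, -mu, -np.inf)
        hi = np.where(y > 0, np.inf, -mu)
        z = mu + truncnorm.rvs(lo, hi, size=n, random_state=rng)
        # beta | z is Gaussian with mean V K^T z and covariance V.
        beta = V @ (K.T @ z) + L @ rng.standard_normal(m)
        samples.append(beta.copy())
    return np.asarray(samples)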


Action Classification using Support Vector Machine

The positively labeled points are then passed into a non-linear Support Vector Machine for the classification task, weighted by their probabilistic outputs to find the most probable action class; results are shown for the action class ObjectPut in the following figure.
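A minimal sketch of this shot-level step, assuming each shot is summarized by a fixed-length vector built from its positively labeled interest points (e.g. a histogram over a descriptor codebook); the random stand-in data, RBF kernel, and parameter values are illustrative choices, not the paper's.

import numpy as np
from sklearn.svm import SVC

# Stand-in data: one vector per shot and one of 8 action class ids.
rng = np.random.default_rng(0)
X_train = rng.random((40, 32))
y_train = rng.integers(0, 8, size=40)
X_test = rng.random((5, 32))

clf = SVC(kernel="rbf", C=10.0, gamma="scale", probability=True)
clf.fit(X_train, y_train)
probs = clf.predict_proba(X_test)              # per-class probabilities
pred = clf.classes_[np.argmax(probs, axis=1)]  # most probable action class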


Action Localization using Conditional Random Fields Propagation

In the video processing research domain, the concept of a human activity or event is rather abstract and loosely defined, especially in videos obtained from the web or from real-world surveillance scenarios; the automatic retrieval of event regions is therefore essential and helpful to the activity analysis community. We tackle this challenging task by extending the work of Carbonetto et al. [1] from the image domain to video segmentation, and we introduce a new way to define an event region in a video shot, which can later be used for many other analysis purposes such as 3D modeling of the objects involved.

We combine the space and time correlations of all interest point regions into a Conditional Random Field model by extracting a cube of proportional size around each interest point center. In this way we represent not only the relationships between points within a frame, but also between points lying in adjacent frames. Two kinds of CRF potentials are defined: one for each interest point region, and one for the relative position of each pair of regions.
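The paper's exact potential functions are not reproduced here; the following is one plausible instantiation of the two kinds of potentials just described, with pts being the (n, 3) array of (t, y, x) cube centers, scores the per-point label scores, and the distance scale lam an illustrative choice.

import numpy as np

def unary_potentials(scores):
    # One potential per interest-point region, derived from its label
    # score via a log-sigmoid (numerically stable form).
    return -np.logaddexp(0.0, -scores)

def pairwise_potentials(pts, lam=10.0):
    # One potential per pair of regions, from their relative space-time
    # positions: cubes that are close in space and time (same or adjacent
    # frames) interact strongly, distant ones barely at all.
    diff = pts[:, None, :].astype(float) - pts[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    return np.exp(-dist / lam)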

The following figure shows the positively labeled interest points with the CRF scores of each region.


We use an Event Block definition to represent each event region in the video shot; it maintains the absolute and relative relationships of all interest points around what we call the integral volume of all highly weighted regions.
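One possible reading of this construction, sketched below: take the tight space-time bounding volume of the highly weighted regions as the integral volume, and keep each point's coordinates relative to it; the weight threshold and the returned structure are assumptions for illustration.

import numpy as np

def event_block(pts, weights, thresh=0.5):
    """pts: (n, 3) space-time centers (t, y, x); weights in [0, 1]."""
    keep = pts[weights > thresh]
    lo, hi = keep.min(axis=0), keep.max(axis=0)  # integral volume corners
    return {"volume": (lo, hi),                  # absolute extent
            "relative": (keep - lo) / np.maximum(hi - lo, 1)}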

Experimental Results

Testing with 5% supervision on the three datasets yields overall rates of 82.17% on KTH, 25.63% on HOHA, and 12.88% on TRECVID, which are highly comparable with fully supervised systems.

The overall semi-supervised result on KTH is compared with other state-of-the-art fully supervised systems in the following table.

All steps of the recognition and localization of the PersonRuns action are demonstrated below.


Our system framework for semi-supervised human action recognition and localization using spatially and temporally integrated local features.




Testing demonstrations for our system of semi-supervised human action recognition and localization using spatially and temporally integrated local features. Eight action models from the TRECVID dataset are trained and tested on each unknown video sequence; the actions considered are: CellToEar, Embrace, ObjectPut, OpposingFlow, PersonMeet, PersonSplitUp, PersonRuns, and Pointing.




Action CellToEar on CAM1




Action CellToEar on CAM2




Action CellToEar on CAM3




Action CellToEar on CAM5




References

[1] P. Carbonetto, G. Dorko, C. Schmid, H. Kuck, and N. de Freitas. Learning to recognize objects with little supervision. Intl. Journal of Computer Vision, 77(1-3):219–237, May 2008.

[2] I. Laptev. On space-time interest points. Intl. Journal of Computer Vision, 64(2-3):107–123, September 2005.

[3] S.-S. Tham, A. Doucet, and K. Ramamohanarao. Sparse Bayesian learning for regression and classification using Markov chain Monte Carlo. In Proc. Intl. Conf. Machine Learning, pages 634–641, San Francisco, CA, USA, 2002.