TY - GEN
T1 - Learning temporal structures for human activity recognition
AU - Xu, Tiantian
AU - Wong, Edward K.
PY - 2017/1/1
Y1 - 2017/1/1
AB - We propose a hierarchical method for learning temporal structures for the recognition of complex human activities or actions in videos. Low-level features (HOG, HOF, MBHx, and MBHy) are first computed from video snippets to form concatenated feature vectors. A novel segmentation algorithm based on K-means clustering is then used to divide the video into segments, with each segment corresponding to a sub-action with uniform motion characteristics. Using the low-level features as inputs, a many-to-one encoder is trained to extract generalized features for the snippets in each segment. A second many-to-one encoder is then used to compute higher-level features from the generalized features. The higher-level features from the individual segments are concatenated and used to train a third many-to-one encoder that extracts a high-level feature representation for the entire video. The final descriptor is the concatenation of the higher-level features from the individual segments and the high-level feature for the entire video. Using the proposed descriptor and a multi-class linear support vector machine (SVM), we achieved state-of-the-art results on the Olympic Sports and UCF50 datasets, and beat the previous state-of-the-art result on the challenging HMDB51 dataset by a wide margin of 17%.
UR - http://www.scopus.com/inward/record.url?scp=85072392008&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85072392008&partnerID=8YFLogxK
M3 - Conference contribution
T3 - British Machine Vision Conference 2017, BMVC 2017
BT - British Machine Vision Conference 2017, BMVC 2017
PB - BMVA Press
T2 - 28th British Machine Vision Conference, BMVC 2017
Y2 - 4 September 2017 through 7 September 2017
ER -