[1] Wang Fei, Hu Ronglin, Jin Ying. Human Action Recognition Based on 3D-CBAM Attention Mechanism[J]. Journal of Nanjing Normal University (Engineering and Technology), 2021, 21(01): 49-56. [doi:10.3969/j.issn.1672-1292.2021.01.008]

Human Action Recognition Based on 3D-CBAM Attention Mechanism

Journal of Nanjing Normal University (Engineering and Technology) [ISSN:1006-6977/CN:61-1281/TN]

Volume:
21
Issue:
2021, No. 01
Pages:
49-56
Column:
Computer Science and Technology
Publication date:
2021-03-15

Article Info

Title:
Human Action Recognition Based on 3D-CBAM Attention Mechanism
Article ID:
1672-1292(2021)01-0049-08
Author(s):
Wang Fei, Hu Ronglin, Jin Ying
School of Computer and Software Engineering, Huaiyin Institute of Technology, Huai'an, Jiangsu 223003, China
Keywords:
machine vision; human action recognition; 3D convolutional neural network; attention mechanism
CLC number:
TP391.4
DOI:
10.3969/j.issn.1672-1292.2021.01.008
Document code:
A
Abstract:
To address the insufficient feature extraction and low recognition rates of existing action recognition methods, this paper proposes a fusion model that combines the strengths of the two-stream network, the 3D convolutional neural network, and the convolutional LSTM (ConvLSTM) network. To better extract human action features, the model uses the SSD object detection method to segment the human target from each video as a local feature, trains on it jointly with the global feature of the original video, and classifies by late fusion. A 3D convolutional block attention module (3D-CBAM) is integrated into the 3D convolutional neural network through a shortcut structure to strengthen the network's extraction of channel and spatial features. In addition, some of the 3D convolutional layers are replaced with ConvLSTM layers to better capture the temporal relations in the video. Experiments on the public KTH dataset show that the proposed model achieves high recognition accuracy for human actions.
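
The late-fusion step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each stream (the global stream on the original video and the local stream on the SSD-cropped human region) outputs one raw score per class, and that fusion is a weighted average of the two softmax distributions. The abstract does not state the paper's actual fusion rule, so the function names, the `global_weight` parameter, and the example scores below are hypothetical. The six labels are the action classes of the KTH dataset.

```python
import math

# The six action classes of the KTH dataset.
KTH_LABELS = ["walking", "jogging", "running",
              "boxing", "hand waving", "hand clapping"]


def softmax(scores):
    """Turn raw class scores into probabilities (numerically stable)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def late_fusion(global_scores, local_scores, global_weight=0.5):
    """Weighted average of the two streams' class probabilities.

    Assumption: the fusion rule is a weighted average; the source
    abstract only says that late fusion is used for classification.
    """
    if len(global_scores) != len(local_scores):
        raise ValueError("both streams must score the same classes")
    if not 0.0 <= global_weight <= 1.0:
        raise ValueError("global_weight must lie in [0, 1]")
    p_global = softmax(global_scores)
    p_local = softmax(local_scores)
    return [global_weight * g + (1.0 - global_weight) * l
            for g, l in zip(p_global, p_local)]


def predict_label(global_scores, local_scores, global_weight=0.5):
    """Return the KTH label with the highest fused probability."""
    fused = late_fusion(global_scores, local_scores, global_weight)
    best = max(range(len(fused)), key=fused.__getitem__)
    return KTH_LABELS[best]


if __name__ == "__main__":
    # Hypothetical per-class scores from the two streams for one clip.
    global_scores = [2.0, 1.0, 0.5, 0.1, 0.0, -1.0]
    local_scores = [0.5, 3.0, 0.2, 0.0, 0.1, -0.5]
    print(predict_label(global_scores, local_scores))
```

Averaging probabilities rather than raw scores keeps the two streams on a comparable scale; setting `global_weight` to 1.0 or 0.0 reduces the sketch to a single stream, which is a simple way to check each stream's contribution.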

Memo

Received: 2020-08-08.
Corresponding author: Hu Ronglin, Ph.D., associate professor; research interest: human-computer interaction technology. E-mail: huronglin@hyit.edu.cn
Last Update: 2021-03-15