JI Honglei, DING Han, ZHAO Chaoyang, et al. The efficient multi-person action detection on mobile devices [J]. Journal of Jiangxi Normal University (Natural Science Edition), 2023(03): 317-324. [doi:10.16357/j.cnki.issn1000-5862.2023.03.12]

The Efficient Multi-Person Action Detection on Mobile Devices

Journal of Jiangxi Normal University (Natural Science Edition) [ISSN: 1000-5862]

Issue:
2023, No. 03
Pages:
317-324
Publication Date:
2023-05-25

Article Info

Title:
The Efficient Multi-Person Action Detection on Mobile Devices
Article ID:
1000-5862(2023)03-0317-08
Author(s):
JI Honglei1, DING Han2,3, ZHAO Chaoyang2, TANG Ming2, WANG Jinqiao2
(1. CRRC Academy, Beijing 100071, China; 2. Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China; 3. School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing 100049, China)
Keywords:
multi-person action detection; lightweight; mobile device
CLC Number:
TP 391.41
DOI:
10.16357/j.cnki.issn1000-5862.2023.03.12
Document Code:
A
Abstract:
Video action detection is a promising yet challenging task, but most existing methods rely on large numbers of parameters and heavy computation. This paper presents an efficient action detection method based on a cache of consecutive frames. For multi-person scenarios, the network takes a single frame as input and, combined with a person detector, outputs an action class and score for every person in the frame. A temporal shift module caches the features of previous frames, endowing the network with the ability to process temporal information. Experiments show that the method is effectively lightweight: paired with an additional object detection network, it achieves real-time multi-person action detection with advantages in both speed and accuracy.


Memo

Received: 2022-11-15
Foundation item: National Natural Science Foundation of China (61976210, 62176254).
Biography: JI Honglei (1995— ), female, born in Hengshui, Hebei, assistant engineer, mainly engaged in research on intelligent product application technology for rail transit. E-mail: jhl@crrc.tech. The first two authors contributed equally to this paper.
Last Update: 2023-05-25