References:
[1] KRIZHEVSKY A, SUTSKEVER I, HINTON G E. ImageNet classification with deep convolutional neural networks [J]. Communications of the ACM, 2017, 60(6): 84-90.
[2] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [EB/OL]. [2022-06-12]. https://arxiv.org/abs/1706.03762v3.
[3] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks [EB/OL]. [2022-06-12]. https://ieeexplore.ieee.org/document/7410867.
[4] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? A new model and the Kinetics dataset [EB/OL]. [2022-06-12]. https://arxiv.org/pdf/1705.07750.pdf.
[5] TRAN D, WANG Heng, TORRESANI L, et al. A closer look at spatiotemporal convolutions for action recognition [EB/OL]. [2022-06-15]. https://ieeexplore.ieee.org/document/8578773.
[6] MEHTA S, RASTEGARI M. MobileViT: light-weight, general-purpose, and mobile-friendly vision transformer [EB/OL]. [2022-06-15]. https://arxiv.org/abs/2110.02178.
[7] PAN Junting, BULAT A, TAN Fuwen, et al. EdgeViTs: competing light-weight CNNs on mobile devices with vision transformers [EB/OL]. [2022-06-17]. https://arxiv.org/abs/2205.03436.
[8] XIA Xin, LI Jiashi, WU Jie, et al. TRT-ViT: TensorRT-oriented vision transformer [EB/OL]. [2022-06-15]. https://arxiv.org/pdf/2205.09579.pdf.
[9] RYOO M S, PIERGIOVANNI A J, ARNAB A, et al. TokenLearner: what can 8 learned tokens do for images and videos? [EB/OL]. [2022-06-16]. https://arxiv.org/abs/2106.11297.
[10] BOLYA D, FU Chengyang, DAI Xiaoliang, et al. Token merging: your ViT but faster [EB/OL]. [2022-06-16]. https://arxiv.org/abs/2210.09461.
[11] SANDLER M, HOWARD A, ZHU Menglong, et al. MobileNetV2: inverted residuals and linear bottlenecks [EB/OL]. [2022-06-16]. https://ieeexplore.ieee.org/document/8578572.
[12] KAY W, CARREIRA J, SIMONYAN K, et al. The Kinetics human action video dataset [EB/OL]. [2022-06-19]. https://arxiv.org/pdf/1705.06950.pdf.
[13] SOOMRO K, ZAMIR A R, SHAH M. UCF101: a dataset of 101 human actions classes from videos in the wild [EB/OL]. [2022-06-16]. https://arxiv.org/abs/1212.0402.
[14] KÖPÜKLÜ O, WEI Xiangyu, RIGOLL G. You only watch once: a unified CNN architecture for real-time spatiotemporal action localization [EB/OL]. [2022-06-19]. https://arxiv.org/abs/1911.06644v3.
[15] CHEN Shoufa, SUN Peize, XIE Enze, et al. Watch only once: an end-to-end video action detection framework [EB/OL]. [2022-06-25]. https://ieeexplore.ieee.org/document/9710781.
[16] GU Chunhui, SUN Chen, ROSS D A, et al. AVA: a video dataset of spatio-temporally localized atomic visual actions [EB/OL]. [2022-06-25]. https://arxiv.org/pdf/1705.08421.pdf.
[17] REDMON J, FARHADI A. YOLO9000: better, faster, stronger [EB/OL]. [2022-06-25]. https://arxiv.org/abs/1612.08242.
[18] SUN Peize, ZHANG Rufeng, JIANG Yi, et al. Sparse R-CNN: end-to-end object detection with learnable proposals [EB/OL]. [2022-06-25]. https://ieeexplore.ieee.org/document/9577670.
[19] TONG Zhan, SONG Yibing, WANG Jue, et al. VideoMAE: masked autoencoders are data-efficient learners for self-supervised video pre-training [EB/OL]. [2022-07-02]. https://arxiv.org/abs/2203.12602v3.
[20] FAN Haoqi, XIONG Bo, MANGALAM K, et al. Multiscale vision transformers [EB/OL]. [2022-07-02]. https://ieeexplore.ieee.org/document/9710800.
[21] LIN Ji, GAN Chuang, HAN Song. TSM: temporal shift module for efficient video understanding [EB/OL]. [2022-07-02]. https://ieeexplore.ieee.org/document/9008827.
[22] FEICHTENHOFER C, FAN Haoqi, MALIK J, et al. SlowFast networks for video recognition [EB/OL]. [2022-06-19]. https://ieeexplore.ieee.org/document/9008780.
[23] LIU Ze, LIN Yutong, CAO Yue, et al. Swin Transformer: hierarchical vision transformer using shifted windows [EB/OL]. [2022-07-09]. https://ieeexplore.ieee.org/document/9710580.
[24] NI Bolin, PENG Houwen, CHEN Minghao, et al. Expanding language-image pretrained models for general video recognition [EB/OL]. [2022-07-09]. https://arxiv.org/abs/2208.02816.
[25] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision [EB/OL]. [2022-07-09]. https://arxiv.org/abs/2103.00020.
[26] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16x16 words: transformers for image recognition at scale [EB/OL]. [2022-07-09]. https://arxiv.org/abs/2010.11929v1.
[27] FAN Haoqi, XIONG Bo, MANGALAM K, et al. Multiscale vision transformers [EB/OL]. [2022-07-19]. https://arxiv.org/abs/2104.11227v1.
[28] WU Chaoyuan, LI Yanghao, MANGALAM K, et al. MeMViT: memory-augmented multiscale vision transformer for efficient long-term video recognition [EB/OL]. [2022-07-19]. https://arxiv.org/abs/2201.08383v2.
[29] ZHANG Hao, HAO Yanbin, NGO C W. Token shift transformer for video classification [EB/OL]. [2022-07-19]. https://arxiv.org/abs/2108.02432v1.
[30] WANG C Y, BOCHKOVSKIY A, LIAO H Y M. YOLOv7: trainable bag-of-freebies sets new state-of-the-art for real-time object detectors [EB/OL]. [2022-07-17]. https://arxiv.org/abs/2207.02696.
[31] HE Kaiming, ZHANG Xiangyu, REN Shaoqing, et al. Deep residual learning for image recognition [EB/OL]. [2022-07-17]. https://ieeexplore.ieee.org/document/7780459.