Abstract: Constructing an accurate spatiotemporal feature learning and classification network is an essential problem in human action recognition. To address the single-scale spatiotemporal features and complex network structures of existing methods, this paper proposes a multiscale channel separation spatiotemporal convolution network (MCST-Net) combined with an attention mechanism. First, building on spatiotemporal convolution, an MCST module with a residual-like structure splits the feature channels and fuses them again. This reduces the number of network parameters and yields spatiotemporal receptive fields at multiple scales, so that the network extracts richer spatiotemporal features. Second, an improved non-local attention (INLA) module is introduced to model global dependencies among features at low computational cost, so that the model extracts key feature information more efficiently. The proposed network is evaluated in extensive experiments on the classic action recognition datasets UCF101 and HMDB51. Experimental results show that MCST-Net achieves higher recognition accuracy than current mainstream human action recognition algorithms. MCST-Net effectively extracts multiscale spatiotemporal features and has the advantages of a simple structure, fewer parameters, and stronger generalization ability.
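The claim that channel separation both reduces parameters and widens the range of receptive fields can be illustrated with a small arithmetic sketch. The abstract does not give the exact layout of the MCST module, so the code below assumes a Res2Net-style hierarchy: the channels are split into equal groups, the first group passes through unchanged, and each later group is convolved after the previous group's output is added to it. The function names, the 3x3x3 kernel, and the example sizes are hypothetical, and biases are ignored.

```python
def full_conv3d_params(channels, kernel=3):
    """Weights of one dense 3D convolution mapping `channels` to `channels`."""
    return channels * channels * kernel ** 3


def split_conv3d_params(channels, scales, kernel=3):
    """Weights when channels are split into `scales` equal groups.

    Assumption (not stated in the abstract): the first group is an identity
    path and each of the remaining `scales - 1` groups has its own small
    convolution of width channels / scales.
    """
    if channels % scales != 0:
        raise ValueError("channels must be divisible by scales")
    width = channels // scales
    return (scales - 1) * width * width * kernel ** 3


def receptive_fields(scales, kernel=3):
    """Receptive field per group along one axis under the assumed hierarchy.

    Group 0 is an identity path (field 1); every later group stacks one more
    convolution on top of the previous group's output.
    """
    return [1 + (kernel - 1) * i for i in range(scales)]


if __name__ == "__main__":
    channels, scales = 64, 4  # illustrative sizes only
    print(full_conv3d_params(channels))           # 110592
    print(split_conv3d_params(channels, scales))  # 20736
    print(receptive_fields(scales))               # [1, 3, 5, 7]
```

Under these assumptions, splitting 64 channels into 4 groups cuts the convolution weights by a factor of about 5.3, while the groups cover receptive fields from 1 to 7 along each axis instead of a single size of 3.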