Moving from image to video analysis with CNNs adds a temporal dimension, which makes the task more complex. This dimension can be processed with 3D convolutions, with additional multi-frame optical flow images, or with RNNs. Architectures can be divided into models that are sensitive to local motion and models that are sensitive to global motion. Local methods capture a short time period and are suited to detecting, for example, gestures, while global methods span a longer time interval and can capture a sequence of actions.
A variety of promising concepts have been published. We can generally distinguish between architectures that model local motion and those that model global motion. A local approach covers only short periods of time and draws its inference from them. If the task is, for example, to distinguish between different arm gestures, the important information is probably encoded in the local details. However, if we want to capture information about the plot of a movie, a longer time window and therefore a more global approach is needed. Some architectures fuse both approaches.
Another major distinction is how the temporal dimension is incorporated into the network. Besides 3D convolution, the use of optical flow, RNNs, and fusion via fully connected layers has been proposed.
Figure 2 illustrates several ways to fuse information over the temporal dimension within the network. The single-frame model serves as the baseline and performs standard 2D convolutions on an individual frame. The late fusion model places two separate single-frame networks with shared parameters 15 frames apart and then merges the two streams in a fully connected layer. The early fusion model combines information across an entire time window by extending the filters by one dimension and condensing the temporal information in one step. Lastly, the slow fusion model is a balanced mix of the two former approaches that fuses temporal information gradually throughout the network. This is implemented via 3D convolutions with some temporal extent and stride.
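The difference between early and slow fusion can be made concrete by tracking how many time steps remain after each layer. The following is a minimal sketch, not the implementation from (3): the helper names are hypothetical, and the clip length, temporal extents, and strides are illustrative values chosen so that the slow fusion stack covers the whole clip only at its third layer.

```python
def valid_conv_length(n_in, extent, stride):
    """Number of outputs along the time axis for a convolution without padding."""
    if n_in < extent:
        raise ValueError("temporal extent exceeds input length")
    return (n_in - extent) // stride + 1


def temporal_lengths(n_frames, layers):
    """Track the temporal length after each (extent, stride) layer."""
    lengths = [n_frames]
    for extent, stride in layers:
        lengths.append(valid_conv_length(lengths[-1], extent, stride))
    return lengths


# Illustrative settings (assumed, not read off Figure 2).
clip = 10
early_fusion = [(10, 1)]                # collapse the whole window in one step
slow_fusion = [(4, 2), (2, 2), (2, 2)]  # merge temporal information gradually

print(temporal_lengths(clip, early_fusion))  # [10, 1]
print(temporal_lengths(clip, slow_fusion))   # [10, 4, 2, 1]
```

With these values, early fusion removes the temporal axis in the first layer, whereas slow fusion keeps a shrinking temporal axis alive over three layers.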
A video sequence can be used to generate multi-frame optical flow images, which highlight how individual pixels move between frames. Applying 2D convolutions to these images helps to extract the temporal information. In RNNs, the outputs of the hidden layers are functions of the current input and of their own previous values, which introduces time dependency into the system.
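The recurrence can be illustrated with a single hidden unit. This is a minimal sketch of a plain (Elman-style) recurrent update, assuming scalar inputs and hypothetical weight values; it is not taken from any of the cited architectures.

```python
import math


def rnn_step(x_t, h_prev, w_x, w_h, b):
    """One recurrent update: the new state depends on the input and the old state."""
    return math.tanh(w_x * x_t + w_h * h_prev + b)


def run_rnn(inputs, w_x=1.0, w_h=0.5, b=0.0):
    """Feed a sequence through the unit, starting from a zero state."""
    h = 0.0
    states = []
    for x_t in inputs:
        h = rnn_step(x_t, h, w_x, w_h, b)
        states.append(h)
    return states


# Both sequences end with the same input value, but their histories differ.
states_a = run_rnn([1.0, 0.0, 1.0])
states_b = run_rnn([0.0, 0.0, 1.0])
print(states_a[-1], states_b[-1])
```

The two final states differ although the last input is identical, which is the time dependency described above.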
Locally modeled temporal motion
Early experiments with 3D convolution were performed by Baccouche et al. (1) and Ji et al. (2). Karpathy et al. (3) used a multi-resolution architecture to speed up training and were thereby able to train their network on a larger data set. Comparing the different models of figure 2, they showed that spatio-temporal convolutions work best when applied step-wise, such that temporal and spatial information is progressively merged, as illustrated by the slow fusion model. Tran et al. (7) constructed a very clean network with all convolutions of size 3x3x3 and (almost) all pooling layers of size 2x2x2. The architecture shown in figure 3 is also used for the C3D project (11) mentioned in the introduction.
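To see what such a layout does to a clip, the feature-map size can be followed through the five pooling layers. This is a rough sketch under stated assumptions: the input size of 16 frames at 112x112 pixels is an assumed example, the 3x3x3 convolutions are assumed to use padding 1 so that they preserve the shape, pooling is assumed to be non-overlapping without padding, and the first pooling layer is taken as 1x2x2 as noted in the caption of figure 3. The function names are hypothetical.

```python
def pool(shape, kernel):
    """Non-overlapping pooling (stride equals kernel), no padding, floor division."""
    return tuple(s // k for s, k in zip(shape, kernel))


def c3d_feature_shapes(clip_shape):
    """Follow a (frames, height, width) clip through pool1 to pool5.

    The 3x3x3 convolutions are assumed to use stride 1 and padding 1,
    so only the pooling layers change the shape.
    """
    kernels = [(1, 2, 2)] + [(2, 2, 2)] * 4  # pool1 keeps the temporal axis
    shapes = [clip_shape]
    for kernel in kernels:
        shapes.append(pool(shapes[-1], kernel))
    return shapes


names = ["input", "pool1", "pool2", "pool3", "pool4", "pool5"]
for name, shape in zip(names, c3d_feature_shapes((16, 112, 112))):
    print(name, shape)
```

Under these assumptions the temporal axis shrinks from 16 to 1 and the spatial axes from 112 to 3. An implementation that pads the last pooling layer would end with a different spatial size.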
A very interesting 2D convolution approach was introduced by Simonyan & Zisserman (4), which separates the spatial and temporal components by training one ConvNet on a single frame and another one on multi-frame optical flow (figure 4). The results of both networks are then fused at the end. The architecture was motivated by the two-stream hypothesis, according to which the human visual cortex contains two pathways: the ventral stream (which performs object recognition) and the dorsal stream (which recognizes motion).
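The final fusion step can be sketched as follows. The example assumes that each stream outputs class scores and that the two softmax distributions are simply averaged. Averaging is one possible fusion rule chosen here for illustration, and the scores are made-up values.

```python
import math


def softmax(logits):
    """Convert raw class scores into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_by_averaging(spatial_logits, temporal_logits):
    """Average the class probabilities of the spatial and temporal streams."""
    p_spatial = softmax(spatial_logits)
    p_temporal = softmax(temporal_logits)
    return [(a + b) / 2.0 for a, b in zip(p_spatial, p_temporal)]


# Hypothetical scores for three action classes.
spatial = [2.0, 1.0, 0.1]    # single-frame stream favors class 0
temporal = [0.5, 2.5, 0.2]   # optical-flow stream favors class 1
fused = fuse_by_averaging(spatial, temporal)
predicted = max(range(len(fused)), key=fused.__getitem__)
print(fused, predicted)
```

In this example the temporal stream is more confident than the spatial one, so the fused prediction follows it and selects class 1.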
Figure 3: C3D architecture. The C3D net has 8 convolution layers, 5 max-pooling layers, and 2 fully connected layers, followed by a softmax output layer. All 3D convolution kernels are 3x3x3 with stride 1 in both spatial and temporal dimensions. The number of filters is denoted in each box. The 3D pooling layers are denoted pool1 to pool5. All pooling kernels are 2x2x2, except for pool1, which is 1x2x2. Each fully connected layer has 4096 output units. Source: (7)
Figure 4: Two-stream architecture for video classification. Source: (4)
Globally modeled temporal motion
To process the temporal dimension on a more global scale, RNNs are introduced. Common choices are the Long Short-Term Memory (LSTM) network and the Gated Recurrent Unit (GRU), which are very similar in their architecture.
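As an illustration of how such a gated unit works, the following sketch implements one GRU update in plain Python. It follows one common formulation of the GRU, with biases omitted for brevity. The parameter names and all weight values are hypothetical.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def gru_step(x, h, p):
    """One GRU update; p holds input weights W_* and recurrent weights U_*."""
    z = [sigmoid(a + b) for a, b in zip(matvec(p["W_z"], x), matvec(p["U_z"], h))]
    r = [sigmoid(a + b) for a, b in zip(matvec(p["W_r"], x), matvec(p["U_r"], h))]
    gated_h = [ri * hi for ri, hi in zip(r, h)]  # reset gate scales the old state
    cand = [math.tanh(a + b)
            for a, b in zip(matvec(p["W_h"], x), matvec(p["U_h"], gated_h))]
    # The update gate interpolates between the old state and the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


# Hypothetical 2x2 weights for a 2-dimensional input and state.
params = {
    "W_z": [[0.5, -0.2], [0.1, 0.3]], "U_z": [[0.2, 0.0], [0.0, 0.2]],
    "W_r": [[0.3, 0.1], [-0.1, 0.4]], "U_r": [[0.1, 0.0], [0.0, 0.1]],
    "W_h": [[0.8, -0.5], [0.2, 0.6]], "U_h": [[0.3, 0.1], [0.1, 0.3]],
}
h = [0.0, 0.0]
for x in [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]:
    h = gru_step(x, h, params)
print(h)
```

An LSTM follows the same idea but keeps a separate cell state and uses three gates instead of two.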
An architecture combining spatio-temporal convolution with a subsequent recurrent stage was already introduced by Baccouche et al. (1) in 2011. However, it received little attention at that time. In 2015, Donahue et al. (9) demonstrated an architecture (figure 5) combining convolutional layers and LSTMs, which could thereby go deeper in temporal space. As the illustration shows, such an architecture is suitable for problems with sequential inputs and fixed outputs, such as activity recognition (deep in time), with fixed inputs and sequential outputs, such as image description (deep in space), and with sequential inputs and outputs, such as video description (deep in space and time).
In a more recent approach, Ballas et al. (10) merged a feed-forward convolutional network and a recurrent network into a single recurrent convolutional network (RCN), which provides a very elegant structure. The GRU-RCN method thereby only requires existing 2D convolution routines.
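The reason why 2D convolution routines suffice can be sketched with the GRU equations. In the sketch below, the matrix products of a standard GRU are replaced by 2D convolutions, written as $*$, so that the input $x_t$ and the state $h_t$ remain feature maps. The symbols are assumptions made for illustration: $W$ and $U$ denote convolution kernels, $\sigma$ the logistic sigmoid, and $\odot$ element-wise multiplication. Biases are omitted.

```latex
\begin{aligned}
z_t &= \sigma\left(W_z * x_t + U_z * h_{t-1}\right) \\
r_t &= \sigma\left(W_r * x_t + U_r * h_{t-1}\right) \\
\tilde{h}_t &= \tanh\left(W * x_t + U * (r_t \odot h_{t-1})\right) \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t
\end{aligned}
```

Because every operation is either a 2D convolution or element-wise, the recurrent state keeps its spatial layout across time steps.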
1) Baccouche, M., Mamalet, F., Wolf, C., Garcia, C., & Baskurt, A. (2011, November). Sequential Deep Learning for Human Action Recognition. In International Workshop on Human Behavior Understanding (pp. 29-39). Springer Berlin Heidelberg.
2) Ji, S., Xu, W., Yang, M., & Yu, K. (2013). 3D Convolutional Neural Networks for Human Action Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(1), 221-231.
3) Karpathy, A., Toderici, G., Shetty, S., Leung, T., Sukthankar, R., & Fei-Fei, L. (2014). Large-scale Video Classification with Convolutional Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1725-1732).
4) Simonyan, K., & Zisserman, A. (2014). Two-stream Convolutional Networks for Action Recognition in Videos. In Advances in Neural Information Processing Systems (pp. 568-576).
5) Taylor, G. W., Fergus, R., LeCun, Y., & Bregler, C. (2010, September). Convolutional Learning of Spatio-temporal Features. In European Conference on Computer Vision (pp. 140-153). Springer Berlin Heidelberg.
6) Le, Q. V., Zou, W. Y., Yeung, S. Y., & Ng, A. Y. (2011, June). Learning Hierarchical Invariant Spatio-temporal Features for Action Recognition with Independent Subspace Analysis. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on (pp. 3361-3368). IEEE.
7) Tran, D., Bourdev, L., Fergus, R., Torresani, L., & Paluri, M. (2015). Learning Spatiotemporal Features with 3D Convolutional Networks. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4489-4497).
8) Yue-Hei Ng, J., Hausknecht, M., Vijayanarasimhan, S., Vinyals, O., Monga, R., & Toderici, G. (2015). Beyond Short Snippets: Deep Networks for Video Classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 4694-4702).
9) Donahue, J., Anne Hendricks, L., Guadarrama, S., Rohrbach, M., Venugopalan, S., Saenko, K., & Darrell, T. (2015). Long-term Recurrent Convolutional Networks for Visual Recognition and Description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2625-2634).
10) Ballas, N., Yao, L., Pal, C., & Courville, A. (2015). Delving Deeper into Convolutional Networks for Learning Video Representations. arXiv preprint