This is a summary post on the paper "Temporal Cycle-Consistency Learning", accepted at the Computer Vision and Pattern Recognition (CVPR) conference in 2019. You can check out the code on GitHub or Google Colab. I try to provide as much background information as possible to make the paper easier to comprehend. However, please note that I assume the majority of readers are familiar with the broad ideas of fields such as Artificial Intelligence, Computer Vision, Machine Learning and Deep Learning, and with the basics of convolutional neural networks.

Table of Contents

Bringing you up to speed...

Fundamentals: Artificial Intelligence, Computer Vision and Machine Learning

Artificial intelligence (AI) is intelligence exhibited by modern-day machines. The field of AI is defined as the "study of agents that receive percepts from the environment and perform actions" (Artificial Intelligence: A Modern Approach, Stuart Russell and Peter Norvig). This is different from human intelligence, which comes in several types (Figure 1). Analogous to human intelligence, artificial intelligence also comes in different types, as shown in Figure 2.

Human Intelligence

Figure 1: Types of Human Intelligence (Source)


Artificial Intelligence

Figure 2: Types of Artificial Intelligence (Source)


Computer vision is a scientific field which deals with how computers gain an abstract understanding of digital images and/or videos. The core of all types of human intelligence is the human brain. So, an intuitive question would be – what is the artificial-intelligence analogue of the human brain? One popular method of obtaining artificial intelligence is Machine Learning (ML) – a research field which aims to teach computers to "learn" from experience. ML algorithms (techniques) are used to solve a variety of tasks, as shown in Figure 2, and are designed to improve as they process more data. The tasks differ in the amount of feedback they provide to the system (computer) for "learning". In supervised learning, a significant amount of labelled training data is available as input. In unsupervised learning, no labelled training data is available. And in semi-supervised learning, a small portion of the training data is labelled while the majority is unlabelled.

But, what is self-supervised learning? And how is it different from supervised, unsupervised and semi-supervised learning?

What is self-supervised learning?

Given an unsupervised task (unlabelled training data), self-supervised learning involves generating some sort of supervisory signal which aids in solving the problem. In other words, self-supervised learning is a special case of supervised learning which reformulates the given unsupervised task into a supervised one. This is done by setting up a pseudo-supervised task (pretext task) which exploits the unlabelled data to learn representations. This technique has many applications, one of which is overcoming the manual-annotation bottleneck of supervised learning techniques.

Figure 3: Supervised Learning Workflow (Source)

Figure 4: Objective of self-supervised learning (Source)


The general pipeline of a self-supervised learning technique is shown in Figure 5. During the training phase, a general-purpose pretext task is designed for a convolutional neural network to solve. The network solves the pretext task by automatically generating pseudo labels based on attributes of the input data and, in the process, learns useful representations. Once the self-supervised training is finished, the learned representations can be transferred to a supervised downstream task (having a small amount of labelled input data) as pre-trained models to gain significant performance boosts.

Example: I use an unlabeled dataset of images of cats, dogs, and birds. I rotate these images randomly by different angles – 30°, 60°, 90°, etc. – and define the objective of the pretext task as: predict the angle by which each image is rotated. Then, I train my convolutional neural network to solve this pretext task. The model autonomously identifies key attributes of the input dataset and generates pseudo labels which enable it to solve the task. Upon completion of training, the model has learned representations which capture an abstract understanding of the input dataset. Now, I use these learned representations for a downstream task of image classification on a small labeled dataset (assuming the image classes remain cats, dogs and birds). So, at the end, my model is able to perform with high accuracy even when I have only a few hundred labeled images.
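To make the pseudo-labelling idea concrete, here is a minimal sketch of such a rotation pretext task (my own illustration, not part of the paper). It restricts itself to multiples of 90° so rotation needs no interpolation; all names and shapes are hypothetical.

```python
import numpy as np

# Rotation classes; np.rot90 handles exact multiples of 90 degrees, so this sketch
# uses 0/90/180/270 instead of the 30/60/90 degrees mentioned above.
ANGLES = (0, 90, 180, 270)

def make_rotation_batch(images, rng):
    """Rotate each image by a random multiple of 90 degrees; the rotation class is the pseudo label."""
    labels = rng.integers(len(ANGLES), size=len(images))
    rotated = np.stack([np.rot90(img, k=int(lab)) for img, lab in zip(images, labels)])
    return rotated, labels

# Stand-in for an unlabeled dataset of cat/dog/bird images, shape (N, H, W, C).
images = np.random.rand(8, 32, 32, 3)
x, y = make_rotation_batch(images, np.random.default_rng(0))
print(x.shape, y)  # a classifier is then trained to predict y from x -- no human labels needed
```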

Figure 5: Self-supervised learning pipeline (Source)


Hope you and I are on the same page now. Let us dive into the paper's summary...

Temporal Cycle-Consistency Learning: Paper Summary

Introduction

The authors present a new self-supervised learning technique to obtain temporal alignment between two similar videos (the pretext task) using a convolutional neural network. The problem can be visualized in Figure 6, where four videos of humans performing the clean and jerk movement are given as input to the network. As you can see, the input videos are not aligned with respect to each other in time. After the network is trained using the proposed self-supervised method, the input videos are aligned with respect to each other in time. This task of video alignment is unsupervised, as the majority of the available videos are unlabelled.


Figure 6: The problem, a.k.a. the pretext task (Source)


But why did the authors choose this particular pretext task?

The majority of real-world processes are sequential in nature. Given a process, there exist temporal correspondences – a finite set of events which occur at unique points in time and are invariant to factors such as viewpoint, the speed at which the process is performed, etc. For example, the process of pouring water into a glass consists of a finite set of events: a) lift the water container, b) pour the water into a glass, c) stop pouring when the glass is full, d) place the container back on the table. Videos are commonly used to capture these processes. So, the authors thought – why not learn representations from videos by aligning them with respect to each other in time? These learned representations would ideally capture a fine-grained temporal understanding of complex real-world processes and would enable the development of AI systems which have a better understanding of real-world processes and the causal nature of their events.


Why did they choose to develop a self-supervised learning technique?

During their literature survey, the authors found that learning from videos primarily relies on purely supervised learning methods. These methods need each frame of a video to be annotated. For a mere 10-second video with a frame rate of 15 frames per second, 150 frames would have to be annotated. This number scales up rather quickly, and manual annotation is not feasible. Thus, the authors decided to go the self-supervised way.

Next, the authors narrowed their literature search to self-supervised learning techniques. They came across several works (e.g., FlowWeb [1]) built on the principle of cycle-consistency – a method which cycles across two or more samples, validates them as good or bad matches, and performs a specific alignment operation. They found that this method had been applied mainly to spatial correspondence (learning from images) problems and decided to use cycle-consistency to address a temporal correspondence problem.

Finally, the authors directed their search towards existing works on the video alignment task. They came across Time Contrastive Networks (TCN) [2] – an existing self-supervised technique which takes multi-view videos of the same action (the same action recorded from different angles) as input and aligns them in time. One of the major drawbacks of this method is that such synchronized data collection is difficult, which prevents computers from exploiting the vast number of raw videos available for learning.


Let us understand cycle-consistency a bit better and see how the authors use it to obtain temporal alignment in videos.

Methodology

Cycle-consistency is central to the authors' proposed self-supervised learning technique. In Figure 7 we see how the authors utilize cycle-consistency. They pass two similar videos through an encoder network and obtain an embedding (a low-dimensional continuous vector representation of a discrete variable) for each frame of the videos. Each colored circle (blue or green) represents one frame of a video. One frame (black circle) is chosen, its nearest neighbor in the other video (green circle) is computed, and the same procedure is repeated with this computed neighbor. If the chosen frame (black circle) cycles back to itself through nearest-neighbor computation, the frame is termed cycle-consistent. Another frame (red circle) might not cycle back to itself and is termed not cycle-consistent. Such points give rise to a cycle-consistency error (loss), which the authors try to reduce in order to achieve the required time alignment across the given input videos.


Figure 7: Cycle-consistent representation learning (Source)
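To make Figure 7 concrete, here is a toy sketch (not the authors' code) of the hard cycle-consistency check: hop from a frame of one video to its nearest neighbor in the other video and back, and test whether we return to the frame we started from. The embeddings below are random stand-ins for encoder outputs.

```python
import numpy as np

def nearest_neighbor(query, candidates):
    """Index of the candidate embedding closest to `query` in Euclidean distance."""
    return int(np.argmin(np.linalg.norm(candidates - query, axis=1)))

def is_cycle_consistent(i, emb_u, emb_v):
    """Frame i of video U is cycle-consistent if the hop U -> V -> U returns to index i."""
    j = nearest_neighbor(emb_u[i], emb_v)   # nearest frame in the other video
    k = nearest_neighbor(emb_v[j], emb_u)   # nearest frame back in the original video
    return k == i

# Stand-ins for per-frame embeddings of two videos, shape (num_frames, embedding_dim).
emb_u = np.random.randn(40, 128)
emb_v = np.random.randn(35, 128)
print([is_cycle_consistent(i, emb_u, emb_v) for i in range(5)])
```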



However, this cycle-consistency computation is not differentiable, so gradients cannot be computed. This means that we cannot use the cycle-consistency loss for learning, as back-propagation cannot be performed. The authors overcame this hurdle by formulating two differentiable versions of the cycle-consistency error, which together are termed Temporal Cycle-Consistency (TCC).

Cycle-back classification

This formulation is a form of classification problem (Figure 8) and is described below:

  1. For two similar videos S and T – obtain per-frame embedding sequences U and V respectively.

  2. A point u_i \in U is cycle-consistent only when its nearest neighbor v_j = argmin_{v \in V} \|u_i - v \| and the nearest neighbor of v_j in U, i.e., u_k = argmin_{u \in U} \|v_j - u \|, result in the point u_i cycling back to itself (i = k). By assuming that each frame of U belongs to a different class, there are N classes in total. So the task of cycle-consistency reduces to correctly classifying the nearest neighbor.
  3. Choose a frame u_i and compute its soft-nearest neighbor \tilde{v} in V. Then calculate the nearest neighbor of \tilde{v} back in U.

  4. Next, a similarity distribution representing the proximity between u_i and each v_j \in V is given by \alpha and computed as –

    \alpha_j = \dfrac{e^{-\|u_i - v_j\|^2}}{\Sigma^M_k e^{-\|u_i - v_k\|^2}}
  5. Now, compute the soft-nearest neighbor \tilde{v} \in V of u_i as the \alpha-weighted average –

    \tilde{v} = \Sigma^M_j \alpha_j v_j
  6. Solve an N-class classification problem where the logits (non-normalised predictions) are given by x_k = - \| \tilde{v} - u_k \|^2 and the predicted labels (normalized predictions) are given by \hat{y} = softmax(x).
  7. Finally, optimize over the cross-entropy loss given by –

    L_{cbc} = - \Sigma^N_j y_j log(\hat{y}_j),

    where y is the ground-truth label – a one-hot vector containing all zeros except for a one at the i^{th} index (the index of the chosen frame u_i) – and y_j is its j^{th} component.



Figure 8: Cycle-back classification (Source)
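Below is a minimal NumPy sketch of steps 3–7 above (my own rendering for clarity; the authors' released code is the reference implementation). U and V stand in for the per-frame embedding sequences, and the loss is the cross-entropy of cycling frame u_i through V and back.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract max for numerical stability
    return e / e.sum()

def cycle_back_classification_loss(i, U, V):
    """Cross-entropy loss for cycling frame U[i] through V and back to U."""
    alpha = softmax(-np.sum((V - U[i]) ** 2, axis=1))    # similarity distribution over V
    v_soft = alpha @ V                                    # soft-nearest neighbor of u_i in V
    logits = -np.sum((U - v_soft) ** 2, axis=1)           # x_k = -||v_tilde - u_k||^2
    y_hat = softmax(logits)                               # predicted distribution over U
    return -np.log(y_hat[i] + 1e-12)                      # one-hot ground truth at index i

U = np.random.randn(40, 128)   # stand-in per-frame embeddings of video S
V = np.random.randn(35, 128)   # stand-in per-frame embeddings of video T
print(cycle_back_classification_loss(0, U, V))
```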

Though the authors now have a differentiable cycle-consistency loss function, it does not account for temporal proximity, i.e., this formulation does not capture how far in time the frame that \tilde{v} cycles back to in U lies from the original frame u_i. For this reason, the authors propose a second formulation.

Cycle-back regression

This formulation can be visualized in Figure 9 and is described as below:

  1. Repeat steps 1-5 of the cycle-back classification formulation described previously to obtain the soft-nearest neighbor \tilde{v} of u_i.
  2. Next, compute a similarity distribution representing the proximity between \tilde{v} and each u_k \in U given by \beta as –

    \beta_k = \dfrac{e^{-\|\tilde{v} - u_k\|^2}}{\Sigma^N_j e^{-\|\tilde{v} - u_j\|^2}}
  3. To incorporate temporal proximity, consider \beta to be a discrete distribution of similarities over time; it should exhibit a peak in the neighborhood of the i^{th} index in time (corresponding to frame u_i \in U).
  4. Impose a Gaussian prior on \beta and minimize the final loss function given by –

    L_{cbr} = \dfrac{|i-\mu|^2}{\sigma^2} + \lambda log(\sigma),

    where \mu = \Sigma^N_k \beta_k * k and \sigma^2 = \Sigma^N_k \beta_k * (k - \mu)^2 are the mean and variance of the prior, and \lambda is the regularization parameter.


Figure 9: Cycle-back regression (Source)

As a result, the cycle-back regression loss function accounts for temporal proximity by penalizing the model more if it cycles back to a frame farther away in time, and less if it cycles back to a closer frame.
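Here is a corresponding NumPy sketch of the cycle-back regression loss, again my own illustration built directly from the formulas above (lam stands in for \lambda, and its value here is purely illustrative).

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cycle_back_regression_loss(i, U, V, lam=0.001):
    """TCC regression loss: penalize cycling back far in time from frame index i."""
    alpha = softmax(-np.sum((V - U[i]) ** 2, axis=1))    # soft-nearest-neighbor weights in V
    v_soft = alpha @ V
    beta = softmax(-np.sum((U - v_soft) ** 2, axis=1))   # similarity distribution back over U
    idx = np.arange(len(U))
    mu = np.sum(beta * idx)                               # mean frame index the cycle lands on
    var = np.sum(beta * (idx - mu) ** 2) + 1e-12          # variance (epsilon avoids division by zero)
    return (i - mu) ** 2 / var + lam * np.log(np.sqrt(var))

U = np.random.randn(40, 128)
V = np.random.randn(35, 128)
print(cycle_back_regression_loss(0, U, V))
```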


Now that we have two differentiable Temporal Cycle-Consistency (TCC) losses, let us look at the experimental setup used by the authors.

Experimental Setup

Which datasets are used to validate the usefulness of learned representations?

The authors use two unlabeled video datasets – Pouring [2] and Penn Action [3]. Both contain videos of humans performing actions. The former focuses more on the objects being interacted with, while the latter focuses on the sports/exercise action being performed. For evaluation purposes, the authors further annotate the frames of these video datasets with key events and phases (Figure 10). Key events are represented by blue boxes, and a phase is the period between any two key events. This implies that multiple frames of a video between two key events will have the same phase label. The train/validation split for the datasets is the same as in [2] and [3].

(a) Unlabeled videos – Top: Penn Action & Bottom: Pouring

(b) Labeled input videos

Figure 10: Datasets used for validation (Source)


What are the evaluation metrics?

The authors define the following three evaluation metrics which are computed on the validation set (Table 1):

  • Phase classification accuracy – is the accuracy with which the learned representations are able to correctly predict the phase label of a given video frame.
  • Phase progression – is a measure of how well the learned representations capture the progress of a given process or action.
  • Kendall's Tau – is an estimate of how well-aligned two video sequences are in time.

Unlike phase classification accuracy, phase progression and Kendall's Tau metrics do not need labeled data for quantification. Furthermore, they measure the effectiveness of learned representations at a more fine-grained level than phase classification accuracy.

Evaluation Metric | Range | Interpretation
Phase classification accuracy | 0 – 100% | The higher the value, the better the quality of the learned representations
Phase progression | [0, 1] | The higher the value, the better the quality of the learned representations
Kendall's Tau | [-1, 1] | A value closer to 1 indicates better time alignment of the two video sequences; a value closer to -1 indicates better time alignment in reverse

Table 1: Evaluation metrics
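As a rough illustration of the last metric, the sketch below estimates Kendall's Tau from embeddings by mapping every frame of one video to its nearest frame in the other and measuring how monotonic that mapping is. This mirrors the spirit of the metric; the authors' exact evaluation code may differ.

```python
import numpy as np
from scipy.stats import kendalltau

def alignment_kendall_tau(emb_u, emb_v):
    """Kendall's Tau between the frame order in U and the order of its nearest neighbors in V."""
    nn_idx = [int(np.argmin(np.linalg.norm(emb_v - u, axis=1))) for u in emb_u]
    tau, _ = kendalltau(np.arange(len(emb_u)), nn_idx)
    return tau   # close to +1: well aligned; close to -1: aligned in reverse

# Stand-ins for per-frame embeddings of two videos.
print(alignment_kendall_tau(np.random.randn(30, 128), np.random.randn(25, 128)))
```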


What are the comparison baselines?

Once the metrics are computed using the TCC loss function, we need some sort of baseline to compare the TCC-learned representations against representations learned using other self-supervised techniques applied to videos. For this reason, the authors choose the following two existing self-supervised learning techniques –


Shuffle and Learn (SaL) [4]

This method splits a video into frames and indexes the frames according to their order in time. These indexed frames are then shuffled, and a convolutional neural network is trained to predict whether a given set of frames is in the correct order or shuffled. This way, the model learns representations which capture information about the order in which an action/process must be performed.

Figure 11: Shuffle and Learn (Source)
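A toy sketch (mine, not code from [4]) of how such order-verification pseudo labels can be generated: sample three frame indices, keep or break their temporal order, and label the tuple accordingly.

```python
import numpy as np

def sample_order_tuple(num_frames, rng):
    """Return (frame indices, pseudo label): 1 if indices are in temporal order, 0 if shuffled."""
    idx = np.sort(rng.choice(num_frames, size=3, replace=False))
    in_order = bool(rng.random() < 0.5)
    if not in_order:
        idx = idx[[1, 0, 2]]   # swap the first two frames to break temporal order
    return idx, int(in_order)

rng = np.random.default_rng(0)
print([sample_order_tuple(150, rng) for _ in range(3)])
```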

Time Contrastive Networks (TCN) [2]

This method samples an anchor frame from the given video. Additionally, it samples a positive frame within the positive range and a negative frame outside the margin range. A neural network then tries to answer questions like – what is similar between the anchor and positive frames? How is the negative frame different from these two? This way, the model learns representations which capture semantically useful features invariant to viewpoint, the speed of an action, etc.

Figure 12: Time Contrastive Networks (Source)
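A similarly simplified sketch of TCN-style triplet sampling from a single video; the window sizes are illustrative, and the multi-view version in [2] pairs frames across synchronized cameras instead.

```python
import numpy as np

def sample_tcn_triplet(num_frames, rng, pos_window=5, margin=20):
    """Return (anchor, positive, negative) frame indices for a triplet loss."""
    anchor = int(rng.integers(margin + 1, num_frames - margin))
    positive = anchor + int(rng.integers(-pos_window, pos_window + 1))   # temporally close to anchor
    if rng.random() < 0.5:                                               # negative lies outside the margin
        negative = int(rng.integers(0, anchor - margin))
    else:
        negative = int(rng.integers(anchor + margin, num_frames))
    return anchor, positive, negative

rng = np.random.default_rng(0)
print(sample_tcn_triplet(150, rng))
```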


Cool. We have baselines for performance comparison. But what architecture do the authors use to get the per-frame embedding?

The authors use an embedding network architecture as shown in Figure 13. The network receives features from a ResNet-50 [5] pre-trained on the ImageNet dataset (when fine-tuning) or from a VGG-M [6] (when learning from scratch). It stacks k (=20) context frames along with the features of the given frame and outputs a 128-dimensional embedding for each frame of the video.


Figure 13: Embedding network architecture (Source)
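For intuition, here is a heavily simplified PyTorch sketch in the spirit of Figure 13 (layer sizes, feature shapes and the choice of max pooling are my assumptions, not the paper's exact configuration): base-network features of a frame and its k context frames are stacked along time, passed through 3D convolutions, pooled, and projected to a 128-dimensional embedding.

```python
import torch
import torch.nn as nn

class FrameEmbedder(nn.Module):
    """Simplified embedder: 3D convs over stacked context-frame features -> 128-d embedding."""
    def __init__(self, feat_channels=1024, emb_dim=128):
        super().__init__()
        self.temporal = nn.Sequential(
            nn.Conv3d(feat_channels, 256, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv3d(256, 256, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.pool = nn.AdaptiveMaxPool3d(1)                 # collapse time and space
        self.head = nn.Sequential(nn.Linear(256, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, feats):
        # feats: (batch, feat_channels, k_context, H, W) -- base-network features of a
        # frame stacked with its k context frames (shapes here are illustrative).
        x = self.pool(self.temporal(feats)).flatten(1)
        return self.head(x)

emb = FrameEmbedder()(torch.randn(2, 1024, 20, 14, 14))    # k = 20 context frames
print(emb.shape)                                            # torch.Size([2, 128])
```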


And what is the exact temporal cycle-consistency loss? Is it cycle-back classification or cycle-back regression?

The authors conducted an experiment on the Pouring dataset and compared the two differentiable versions of the cycle-consistency loss against each other. They found that cycle-back regression outperformed cycle-back classification (Figure 14). Thus, they adopted the cycle-back regression version as the temporal cycle-consistency (TCC) loss function.


Figure 14: Ablation of different cycle-consistency losses (Source)


Moving on to the exciting bits - results! And their implications.

Results and Conclusions

Phase classification accuracy

Figure 15 shows results when learning from scratch using VGG-M. On both datasets, TCC-learned representations outperform those of SaL, TCN and the supervised learning technique (first row in each dataset section) when there is a limited quantity of labeled data available for learning. SaL and TCN learn representations by operating on frames from a single video, whereas TCC learns representations by operating on frames of multiple videos. The authors feel that this could be one of the reasons for TCC's dominance.

Figure 16 shows results when learning using a ResNet-50 pre-trained on the ImageNet dataset. Here, too, we observe that standalone TCC outperforms SaL and TCN on the Pouring dataset, but TCC combined with TCN gives the best performance on Penn Action. The authors explain this by stating that multiple loss functions reduce over-fitting.


Figure 15: VGG-M – Phase classification accuracy (Source)

Figure 16: ResNet-50 – Phase classification accuracy (Source)


Phase progression and Kendall's Tau

In Figure 17, we observe that when learning from scratch, TCC representations perform better on both phase progression and Kendall's Tau for both datasets. Additionally, when learning from a pre-trained ResNet-50, we observe that Kendall's Tau is significantly higher when we learn representations using the combined losses of TCC and TCN. This combined version significantly outperforms both the supervised learning and the other self-supervised learning techniques on both datasets.


Figure 17: Phase progression and Kendall's Tau (SL – supervised learning)

(Source)


Conclusions

Based on the results above, the authors conclude the following:

  • Temporal cycle-consistency (TCC) is a new general purpose self-supervised temporal alignment technique which learns representations useful for temporally fine-grained tasks.
  • TCC learned representations provide significant performance boost in the low-labeled data regime.
  • TCC learned representations can benefit any task – anomaly detection, fine-grained retrieval in videos, etc. – which relies on the alignment of videos.

With this, I conclude the summary of the paper and move on to my own thoughts on it.

Personal Thoughts

Overall, I personally enjoyed reading the paper. The things which the authors did well are – a crisp and clear explanation of the methodology, interesting and unique examples of downstream tasks which rely on the alignment of videos, and making the source code available for public use. However, there are a couple of things which could have been done better, like – providing some brisk background information to the reader prior to the introduction, being more precise in certain areas such as the train/validation splits of the datasets, and doing a bit more comparison across the baselines and the TCC-learned representations.

Also, the authors leave some questions unanswered, like – what if the encoder network output a per-frame embedding with more or fewer than 128 dimensions? Can this method be used in the medical field? Possible future work could be to use TCC to automatically recognize the surgical phase from a workflow [7], and an application of TCC could be to detect anomalies in time-based cardiac sequences and predict the onset of a disease.


References

[1] Tinghui Zhou, Y. J. Lee, S. X. Yu and A. A. Efros, "FlowWeb: Joint image set alignment by weaving consistent, pixel-wise correspondences," 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, 2015, pp. 1191-1200. URL: https://people.eecs.berkeley.edu/~tinghuiz/papers/cvpr15_flow.pdf

[2] P. Sermanet, C. Lynch, J. Hsu and S. Levine, "Time-Contrastive Networks: Self-Supervised Learning from Multi-view Observation," 2017 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Honolulu, HI, 2017, pp. 486-487. URL: https://arxiv.org/abs/1704.06888

[3] W. Zhang, M. Zhu and K. G. Derpanis, "From Actemes to Action: A Strongly-Supervised Representation for Detailed Action Understanding," 2013 IEEE International Conference on Computer Vision, Sydney, NSW, 2013, pp. 2248-2255. URL: https://www.cv-foundation.org/openaccess/content_iccv_2013/papers/Zhang_From_Actemes_to_2013_ICCV_paper.pdf

[4] I. Misra, C. L. Zitnick and M. Hebert, "Shuffle and Learn: Unsupervised Learning Using Temporal Order Verification," 2016 European Conference on Computer Vision (ECCV), 2016. URL: https://arxiv.org/abs/1603.08561

[5] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778. URL: https://arxiv.org/abs/1512.03385

[6] K. Chatfield, K. Simonyan, A. Vedaldi and A. Zisserman, "Return of the Devil in the Details: Delving Deep into Convolutional Nets," Proceedings of the British Machine Vision Conference (BMVC), 2014. URL: https://arxiv.org/abs/1405.3531

[7] T. Czempiel, M. Paschali, M. Keicher, W. Simson, H. Feussner, S. T. Kim and N. Navab, "TeCNO: Surgical Phase Recognition with Multi-Stage Temporal Convolutional Networks," 2020. URL: https://arxiv.org/abs/2003.10751