TL;DR

During the past two weeks, I have been following up on the hypothesis mentioned at the end of my last blog post.

- Experimented on small datasets to validate that the score $f_\theta(x, y)$ learned with the infoNCE objective used in training CLIP converges to $\log \frac{p(x, y)}{p(x)\,p(y)} + c(x)$ rather than $\log p(x, y) + c$, where $c(x)$ and $c$ are some constants. This suggests that using CLIP directly for classification, i.e. approximating $p(y \mid x)$ as $\mathrm{softmax}_y\, f_\theta(x, y)$, is indeed not principled. This is the main focus of this blog post.

- Reviewed specific literature (Poole et al. 2019, Belghazi et al. 2018, Gutmann et al. 2010) to look for an explanation/proof.

- Practiced writing proofs in statistics. This is an ongoing exercise to warm up for constructing a sound, step-by-step proof of the above. So far, I have been working on rejection sampling and a few chapters from the book “All of Statistics: A Concise Course in Statistical Inference”.
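For concreteness, here is the kind of construction involved: a minimal rejection-sampling sketch (the Beta(2, 2) target and uniform proposal are my own arbitrary choices, not taken from the book):

```python
import numpy as np

def rejection_sample(target_pdf, proposal_sample, proposal_pdf, M, n, rng):
    """Draw n samples from target_pdf via a proposal with envelope constant M."""
    out = []
    while len(out) < n:
        x = proposal_sample(rng)
        u = rng.uniform()
        # Accept x with probability target(x) / (M * proposal(x)).
        if u < target_pdf(x) / (M * proposal_pdf(x)):
            out.append(x)
    return np.array(out)

# Example: sample Beta(2, 2) using a Uniform(0, 1) proposal.
# The Beta(2, 2) pdf is 6x(1 - x), maximized at x = 0.5 with value 1.5, so M = 1.5.
rng = np.random.default_rng(0)
samples = rejection_sample(
    target_pdf=lambda x: 6 * x * (1 - x),
    proposal_sample=lambda r: r.uniform(),
    proposal_pdf=lambda x: 1.0,
    M=1.5, n=10_000, rng=rng,
)
print(samples.mean())  # close to the Beta(2, 2) mean of 0.5
```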

Experiments

I ran experiments on both synthetic and natural datasets to observe what the learned score table converges to under three different objectives, including infoNCE.

To start, let’s use a synthetic example to illustrate the problem. We contrive a dataset in which the joint distribution is simply the product of the marginals; that is, $p(x, y) = p(x)\,p(y)$ for every pair $(x, y)$ drawn from the joint distribution between random variables $X$ and $Y$. Here are three visualizations made from the data:

Joint distribution

Product of marginals

Joint distribution div. by product of marginals
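Such a dataset can be contrived by taking the outer product of two arbitrary marginals (a sketch; the supports and the Dirichlet draws are my own choices). By construction, the joint-divided-by-product table is then identically 1:

```python
import numpy as np

rng = np.random.default_rng(0)
px = rng.dirichlet(np.ones(8))    # marginal p(x) over 8 outcomes
py = rng.dirichlet(np.ones(10))   # marginal p(y) over 10 outcomes
pxy = np.outer(px, py)            # joint = product of marginals, by construction

ratio = pxy / np.outer(px, py)    # the third visualization above
print(np.allclose(ratio, 1.0))    # True: the ratio table is completely flat
```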

We construct a $|X|$-by-$|Y|$ parameter matrix $\theta$, where each entry is denoted $\theta_{xy}$. We define the score as $f_\theta(x, y) = \theta_{xy}$ and use SGD to train the parameters according to each of the following objectives. At convergence (batch size = 500, between 1k and 2k iterations), we visualize the $e^{\theta_{xy}}$ table for each of the objectives:

Negative log likelihood

Negative pseudo log-likelihood

infoNCE used in CLIP

In this contrived dataset, training with either the negative log-likelihood objective or the negative pseudo log-likelihood converges the $e^{\theta_{xy}}$ table to the joint-distribution visualization (which here coincides with the product-of-marginals visualization) rather than the ratio visualization. However, training with the infoNCE objective results in a noisy patch with no clear signal.
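The setup can be made concrete with a self-contained sketch of the three objectives. Two simplifications here are mine: I use full-batch (population) gradients rather than SGD minibatches, and I use the large-batch limit of infoNCE in which negatives are drawn from the marginal $p(y)$:

```python
import numpy as np

def softmax(z, axis=None):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
px = rng.dirichlet(np.ones(8))            # p(x)
py = rng.dirichlet(np.ones(10))           # p(y)
p = np.outer(px, py)                      # independent joint, as in this example

def train(grad_fn, steps=2000, lr=5.0):
    theta = np.zeros_like(p)              # the parameter table theta_xy
    for _ in range(steps):
        theta += lr * grad_fn(theta)      # gradient ascent on the log-objective
    return theta

# Log likelihood: model the joint as a softmax over the whole table.
def grad_nll(theta):
    return p - softmax(theta)

# Pseudo log-likelihood: fit both conditionals p(y|x) and p(x|y).
def grad_pseudo(theta):
    row = p - px[:, None] * softmax(theta, axis=1)
    col = p - py[None, :] * softmax(theta, axis=0)
    return row + col

# Population infoNCE, large-batch limit with negatives y' ~ p(y):
# gradient of E_p[ theta_xy - log E_{y'~p(y)} e^{theta_xy'} ].
def grad_infonce(theta):
    return p - px[:, None] * softmax(theta + np.log(py)[None, :], axis=1)

q_nll = softmax(train(grad_nll))
q_pl = softmax(train(grad_pseudo))
print(np.abs(q_nll - p).max())            # ~0: log likelihood recovers the joint
print(np.abs(q_pl - p).max())             # ~0: pseudo-likelihood does too

# For infoNCE, the population gradient already vanishes at the flat
# initialization, because p(x,y)/(p(x)p(y)) = 1 everywhere in this dataset:
print(np.abs(grad_infonce(np.zeros_like(p))).max())   # ~0: no learning signal
```

The vanishing infoNCE gradient is one way to see why its panel is pure noise here: any per-row-constant table is already optimal when the ratio is flat.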

To quantitatively measure how well the model approximates $p(x, y)$, we measure the KL divergence between the normalized $e^{\theta_{xy}}$ table and the groundtruth $p(x, y)$ estimated from the data.

| Objective | KL divergence |
| --- | --- |
| Minimize Negative Log Likelihood | 5.8801e-06 |
| Minimize Negative Log Pseudo Likelihood | 3.9604e-06 |
| Minimize Info NCE Loss | 0.0043 |
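For reference, this metric can be computed as follows (a sketch; I am writing the divergence as $D_{\mathrm{KL}}(p \,\|\, \hat{q})$ with $\hat{q}$ the normalized $e^{\theta_{xy}}$ table, which is one of the two possible directions):

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) in nats, for two same-shape nonnegative tables."""
    p = p / p.sum()                       # normalize both tables
    q = q / q.sum()
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))

p = np.array([[0.1, 0.2], [0.3, 0.4]])
q = np.full((2, 2), 0.25)
print(kl_divergence(p, p))                # 0.0 for identical tables
print(kl_divergence(p, q) > 0)            # True for mismatched tables
```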

Now, let’s modify the distribution such that $p(x, y) \neq p(x)\,p(y)$.

From the data:

Visualization of tables at convergence:

The visual observation tells us that, when training with the infoNCE objective, the $e^{\theta_{xy}}$ table actually converges to a pattern much closer to $\frac{p(x, y)}{p(x)\,p(y)}$ than to $p(x, y)$.

| Objective | KL divergence |
| --- | --- |
| Minimize Negative Log Likelihood | 9.8529e-06 |
| Minimize Negative Log Pseudo Likelihood | 6.7194e-06 |
| Minimize Info NCE Loss | 0.0037 |
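This matches the known characterization of the infoNCE optimum (see e.g. Poole et al. 2019): in the large-batch limit, the optimal score satisfies $e^{f(x, y)} \propto \frac{p(x, y)}{p(x)\,p(y)}$ up to a per-$x$ constant. A standalone numerical check on a small dependent joint (the joint itself is an arbitrary choice of mine):

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(1)
p = rng.dirichlet(np.ones(12)).reshape(3, 4)   # a dependent joint p(x, y)
px, py = p.sum(axis=1), p.sum(axis=0)

# Gradient ascent on the population infoNCE objective
# E_p[ theta_xy - log E_{y'~p(y)} e^{theta_xy'} ]  (large-batch limit).
theta = np.zeros_like(p)
for _ in range(5000):
    theta += 2.0 * (p - px[:, None] * softmax(theta + np.log(py)[None, :], axis=1))

# Row-normalized e^theta should match the row-normalized ratio table.
ratio = p / np.outer(px, py)
lhs = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)
rhs = ratio / ratio.sum(axis=1, keepdims=True)
print(np.abs(lhs - rhs).max())                 # small: e^theta tracks the ratio per row
```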

Next, let’s look at an example that is more familiar to people who are interested in probabilities. Consider the following procedure: we roll a 6-sided die twice to collect two i.i.d. results, then record the pair (sum of the results, max of the results). We repeat this procedure many times to fill up a count table indexed by the supports of the two random variables (the x-axis spans from 1 to 6, the y-axis spans from 2 to 12). We look at the same set of visualizations and metrics again.
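The count table can be filled in with a quick simulation (a sketch; the trial count and the indexing convention, rows = max and columns = sum, are my own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 100_000

rolls = rng.integers(1, 7, size=(n_trials, 2))   # two i.i.d. d6 rolls per trial
sums = rolls.sum(axis=1)                          # support: 2..12
maxes = rolls.max(axis=1)                         # support: 1..6

# Count table indexed by (max, sum): shape 6 x 11.
counts = np.zeros((6, 11), dtype=int)
np.add.at(counts, (maxes - 1, sums - 2), 1)

joint = counts / counts.sum()                     # empirical joint over (max, sum)
# Sanity check: max == 1 forces sum == 2, so that row has a single nonzero cell.
print(np.count_nonzero(counts[0]))                # 1
```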

From the data:

Visualization of tables at convergence:

Measure the difference between the true distribution and the normalized $e^{\theta_{xy}}$ table:

| Objective | KL divergence |
| --- | --- |
| Minimize Negative Log Likelihood | 5.5292e-07 |
| Minimize Negative Log Pseudo Likelihood | 0.0025 |
| Minimize Info NCE Loss | 0.0761 |

Finally, let’s look at a dataset from the R datasets repository.

birthwt (188 datapoints; x-axis: infant’s birth weight in units of 100 g, y-axis: mother’s age in years)

From the data:

Visualization of tables at convergence:

Measure the difference between the true distribution and the normalized $e^{\theta_{xy}}$ table:

| Objective | KL divergence |
| --- | --- |
| Minimize Negative Log Likelihood | 5.6581e-07 |
| Minimize Negative Log Pseudo Likelihood | 7.3230e-06 |
| Minimize Info NCE Loss | 0.0010 |

These are just three of the small-scale experiment results I have; I am skipping the others in this blog post because their conclusions are all the same. Moving forward, I will be running experiments on datasets and architectures that are progressively more similar to CLIP’s. The end goal of these experiments, along with the effort to write the proof, is to demonstrate that there is a principled method to improve CLIP’s classification performance, which is explained at the end of my last blog post.