This paper was accepted at the Workshop on Distribution Shifts at NeurIPS 2023.

Large-scale model training has become increasingly expensive. In an ever-changing world where petabytes of new data are generated every day, we want to be able to train models continually. In this paper, we create a benchmark for continual large-scale training of CLIP models in which the data distribution shifts only with time. In contrast to the traditional continual learning literature, there is no hard separation into tasks: we assume an effectively infinite stream of data in a canonical format that exhibits natural distribution shifts as time passes. We construct multiple such benchmarks for CLIP training from standard datasets such as DataComp and YFCC15M. We propose various evaluations and demonstrate that models trained on data only up to a certain year lose performance on categories of rapidly changing data. We propose simple learning rate schedules and training with replay buffers to reduce the gap in forward transfer. We demonstrate that a simple baseline that continues training from the last checkpoint while replaying old data can be competitive with an oracle that receives all data seen so far in one pass and trains with a large compute budget.
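To make the replay baseline concrete, below is a minimal Python sketch of the idea: at each timestep, resume from the latest checkpoint and train on the new data mixed with a buffer sampled from older data. This is not the paper's implementation; the training step, the `replay_fraction` parameter, and the data layout are all hypothetical placeholders chosen for illustration.

```python
import random

def train_one_step(model, batch):
    """Hypothetical placeholder for one CLIP training update
    (e.g., a contrastive loss step under some LR schedule)."""
    return model

def continual_train_with_replay(data_by_year, replay_fraction=0.5, model=None):
    """Cumulative-replay baseline sketch: warm-start from the previous
    timestep's model and mix new data with replayed old data."""
    seen = []  # references to all previously seen batches
    for year, new_data in sorted(data_by_year.items()):
        # Sample a replay buffer from earlier data, sized relative to
        # the amount of new data (replay_fraction is an assumption).
        k = min(len(seen), int(replay_fraction * len(new_data)))
        replay = random.sample(seen, k)
        mixed = new_data + replay
        random.shuffle(mixed)
        for batch in mixed:  # one pass over the mixed data
            model = train_one_step(model, batch)
        seen.extend(new_data)  # grow the pool of replayable data
    return model
```

The oracle comparison in the abstract corresponds to instead retraining from scratch on all of `seen + new_data` at every timestep with a large compute budget; the baseline above trades a small accuracy gap for a far lower training cost.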
