At this year's NeurIPS, we present Model Zoo Datasets and Generative Hyper-Representations.

Introduction

Learning on populations of Neural Networks (NNs) has emerged as a research topic in recent years. The high dimensionality, non-linearity and non-convexity of NN training open up exciting research questions about populations: i) do individual models in populations have something in common? ii) do models form meaningful structures in weight space? iii) can representations of such structures be learned? iv) can such structures be exploited to generate new models?
At last year’s NeurIPS, we took first steps towards answering these questions with our paper presenting hyper-representations. There, we showed that populations of models, called model zoos, are indeed structured. With hyper-representations, we proposed a self-supervised learning method to learn representations of the weights of model zoos. Further, we showed that these representations are meaningful in the sense that they are predictive of model properties, such as accuracy, epoch or hyperparameters.
At this year’s NeurIPS, two more contributions in this research direction were accepted: 1) the Model Zoo Dataset to facilitate research in that domain; and 2) Generative Hyper-Representations to sample new NN weights.

Model Zoo Dataset

Link: https://modelzoos.cc
Paper: https://arxiv.org/abs/2209.14764
Talk: https://neurips.cc/virtual/2022/poster/55727

Background

Research on populations of Neural Networks requires access to such populations. While collections of models exist (e.g., model hubs like Hugging Face), they are usually unstructured, often small, or contain only a small degree of diversity. For our own research and to facilitate research on model populations for the community at large, we therefore generated large, structured and diverse populations. The model zoos are an open-source blueprint, so that they can be replicated, changed or extended to fit the community's needs.

Schematic overview of the Model Zoo Generation.

Dataset Generation

All in all, there are (as of now) 27 model zoos, with 50,360 unique Neural Network models and over 3.8 million model states. The zoos are trained on 8 computer vision datasets (MNIST, F-MNIST, SVHN, USPS, STL, CIFAR10, CIFAR100, Tiny ImageNet), using 3 architectures (small CNN, medium CNN and ResNet-18). To include different types of diversity, we vary different hyperparameters for different zoos. In some, we vary only the random seed; in others, we also vary the initialization method, activation function, optimizer, learning rate, weight decay and dropout.
As sparsification has not yet been studied on a population level, we also include sparsified model zoo twins, for which all models in a zoo at their last epoch are iteratively sparsified with Variational Dropout (VD).
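The way a zoo's diversity follows from the hyperparameters that are varied can be sketched as a simple grid. This is only an illustration: the axes mirror the ones named above, but every value in `SEARCH_SPACE` and the helper `build_zoo_configs` are hypothetical placeholders, not the grids actually used for the published zoos.

```python
from itertools import product

# Hypothetical search space. The axes follow the text above;
# the values are illustrative placeholders, not the dataset's real grid.
SEARCH_SPACE = {
    "seed": [1, 2, 3],
    "init": ["uniform", "normal", "kaiming_uniform"],
    "activation": ["tanh", "relu"],
    "optimizer": ["sgd", "adam"],
    "learning_rate": [1e-3, 1e-2],
    "weight_decay": [0.0, 1e-4],
    "dropout": [0.0, 0.5],
}


def build_zoo_configs(space, fixed=None):
    """Return one config dict per combination of the varied hyperparameters.

    `space` maps a hyperparameter name to the list of values to vary.
    `fixed` holds hyperparameters that are shared by every model in the zoo.
    """
    fixed = dict(fixed or {})
    keys = list(space)
    configs = []
    for values in product(*(space[key] for key in keys)):
        config = dict(zip(keys, values))
        config.update(fixed)
        configs.append(config)
    return configs


# A zoo in which only the random seed varies.
seed_only_zoo = build_zoo_configs(
    {"seed": list(range(5))},
    fixed={"init": "uniform", "activation": "relu", "optimizer": "adam"},
)

# A zoo in which all listed hyperparameters vary.
diverse_zoo = build_zoo_configs(SEARCH_SPACE)

print(len(seed_only_zoo), len(diverse_zoo))
```

Each resulting dictionary would correspond to one model to train, with its weights stored at every epoch to obtain the model states.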

Potential Use-Cases

Potential use-cases for model zoos.

As the domain of model populations is still fairly new, we also consider potential use-cases for model populations. They include model analysis, i.e., the prediction of model properties based on populations, where some work has already been done. Similarly, populations could be used to investigate the learning dynamics of models further, and to extend methods like Population Based Training. Model zoos can further be used as datasets for representation learning, as in our hyper-representations. Lastly, such populations may allow a systematic study of how to generate weights, for example as good initializations or for transfer learning.

Hyper-Representations as Generative Models: Sampling Unseen Neural Network Weights

Paper: https://arxiv.org/abs/2209.14733
Talk: https://neurips.cc/virtual/2022/poster/53429
Code: https://github.com/HSG-AIML/NeurIPS_2022-Generative_Hyper_Representations

Background

Schematic overview of Generative Hyper-Representations.

In previous work, we showed that hyper-representations embed populations of models in a meaningful way, i.e., they disentangle latent properties such as accuracy or training progress. For this project, we therefore investigated whether such hyper-representations can be leveraged to generate new models with targeted properties.

Approach

The approach is split into two parts: i) training hyper-representations and ii) sampling hyper-representations. In our experiments, we noticed that weight distributions often vary across layers. As a result, the loss contribution, and thus the reconstruction quality, is unequally distributed over the layers, which leads to poor performance of the reconstructed models. To fix that, we introduce a layer-wise loss normalization, which improves the reconstruction of the weights of all layers and drastically improves model accuracy, see the figure below.

Top: baseline hyper-representation. Weight distributions of original and reconstructed weights do not match; consequently, the reconstructed models have very low accuracy. Bottom: hyper-representation trained with layer-wise loss normalization. Reconstructed weight distributions match the original, and the reconstructed models have drastically improved accuracy.
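The effect of a layer-wise loss normalization can be illustrated with a small NumPy sketch. It assumes that each layer's squared reconstruction error is divided by the variance of that layer's original weights; the exact normalization in the paper may differ, and the function names, the `layer_slices` argument and the toy data are hypothetical.

```python
import numpy as np


def layerwise_contributions(original, reconstructed, layer_slices, eps=1e-8):
    """Per-layer squared error, divided by the variance of that layer's weights.

    original, reconstructed: arrays of shape (n_models, n_weights)
    layer_slices: one slice per layer, indexing the flattened weight axis
    """
    contributions = []
    for layer in layer_slices:
        target = original[:, layer]
        error = reconstructed[:, layer] - target
        variance = target.var() + eps  # layer statistic, computed over the batch
        contributions.append(float(np.sum(error ** 2) / variance))
    return contributions


def layerwise_normalized_mse(original, reconstructed, layer_slices, eps=1e-8):
    """Mean of the normalized squared errors over all weights of all models."""
    total = sum(layerwise_contributions(original, reconstructed, layer_slices, eps))
    return total / original.size


# Toy zoo: 64 models with two layers whose weight scales differ by a factor of 100.
rng = np.random.default_rng(0)
layer_slices = [slice(0, 50), slice(50, 100)]
original = np.concatenate(
    [rng.normal(0.0, 1.0, size=(64, 50)), rng.normal(0.0, 0.01, size=(64, 50))],
    axis=1,
)
# Reconstruction with the same relative error (10%) in both layers.
reconstructed = 1.1 * original

plain_per_layer = [
    float(np.mean((reconstructed[:, layer] - original[:, layer]) ** 2))
    for layer in layer_slices
]
normalized_per_layer = layerwise_contributions(original, reconstructed, layer_slices)

print("plain MSE per layer:     ", plain_per_layer)
print("normalized error per layer:", normalized_per_layer)
```

With the plain error, the layer with large weights dominates the loss, so the layer with small weights is barely optimized. After normalization, the same relative error contributes about equally in both layers.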

From such hyper-representations, we can now draw samples and decode them to weights. Previous work has shown that hyper-representations disentangle model accuracy, so that sampling can target high-accuracy models. Unfortunately, the representation space is still relatively high-dimensional, so that simply sampling the joint distribution is infeasible. To deal with that, we propose three sampling methods: i) by assuming conditional independence, sampling from the distribution per latent dimension; ii) based on the neighborhood, which is mapped approximately to a lower dimension; and iii) by using a GAN. The sampled hyper-representations are then decoded to weights and loaded into models.
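The first sampling method, which treats the latent dimensions as independent, can be sketched as follows. This is a minimal illustration rather than the released implementation: it assumes a Gaussian kernel density estimate per dimension with a normal-reference rule-of-thumb bandwidth, and the function name, the `top_fraction` parameter and the random stand-in data are hypothetical.

```python
import numpy as np


def sample_per_dimension_kde(embeddings, accuracies, n_samples,
                             top_fraction=0.3, seed=0):
    """Draw latent samples from independent 1-D Gaussian KDEs, one per dimension.

    embeddings: array (n_models, latent_dim) of hyper-representations
    accuracies: array (n_models,) used to keep only the best models as anchors
    top_fraction: hypothetical knob selecting the share of best models to keep
    """
    rng = np.random.default_rng(seed)

    # Keep only the most accurate models, so that sampling targets high accuracy.
    n_keep = max(2, int(round(top_fraction * len(embeddings))))
    best = np.argsort(accuracies)[-n_keep:]
    anchors = embeddings[best]
    n_anchors, latent_dim = anchors.shape

    # Normal-reference rule of thumb for the bandwidth, one value per dimension.
    bandwidth = 1.06 * anchors.std(axis=0, ddof=1) * n_anchors ** (-1 / 5)

    # Sampling from a Gaussian KDE: pick a data point, then add kernel noise.
    # The anchor index is drawn independently for every dimension, which is
    # what the independence assumption amounts to.
    index = rng.integers(0, n_anchors, size=(n_samples, latent_dim))
    centers = anchors[index, np.arange(latent_dim)]
    noise = rng.normal(size=(n_samples, latent_dim)) * bandwidth
    return centers + noise


# Stand-in data: 100 embedded models with 8 latent dimensions.
data_rng = np.random.default_rng(1)
embeddings = data_rng.normal(size=(100, 8))
accuracies = data_rng.uniform(0.1, 0.9, size=100)

samples = sample_per_dimension_kde(embeddings, accuracies, n_samples=20)
print(samples.shape)
```

In the full pipeline, each row of `samples` would be passed through the decoder of the hyper-representation to obtain a set of weights.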

Results

We evaluate the generated weights by comparing their accuracy directly, in fine-tuning and transfer-learning experiments, and in ensembles.

First, we find that the sampling methods are specific: the choice of sampling method determines the accuracy bracket in which the sampled models end up.


Left: Accuracy comparison of different sampling methods. By choosing different methods, accuracy brackets can be targeted. Right: Accuracy of populations during fine-tuning. The sampled population $S_{KDE30}$ after a single epoch often outperforms the baselines trained for considerably more epochs.

Further, we find that sampled models outperform or at least match the baselines in fine-tuning. In other words: samples drawn from hyper-representations often achieve higher performance than the models in the zoo on which the hyper-representation was trained.

We were also curious whether hyper-representations generalize beyond their task and architecture. Surprisingly, they do. In a transfer-learning setup, sampled weights outperform or match strong baselines. What is more, the generated weights generalize even to architecture changes and outperform training from scratch by a large margin.


Left: Accuracy of populations in a transfer-learning setup. The sampled populations outperform both training from scratch and transfer learning from pre-trained models. Right: Generalization to variations in the architecture. Sampled weights considerably outperform training from scratch.

Conclusion

The two contributions provide large and diverse datasets for research on model populations as well as new methods to generate neural network models. They lay the groundwork for many more research projects, which we invite the community to join.