By some metrics, research on Generative Adversarial Networks (GANs) has progressed substantially in the past 2 years.
Practical improvements to image synthesis models are being made almost too quickly to keep up with:
However, by other metrics, less has happened. For instance, there is still widespread disagreement about how GANs should be evaluated.
Given that current image synthesis benchmarks seem somewhat saturated, we think now is a good time to reflect on research goals for this sub-field.
Lists of open problems have helped other fields with this.
This article suggests open research problems that we’d be excited for other researchers to work on.
We also believe that writing this article has clarified our thinking about
GANs, and we would encourage other researchers to write similar articles about their own sub-fields.
We assume a fair amount of background (or willingness to look things up)
because we reference too many results to explain all those results in detail.
What are the Trade-Offs Between GANs and other Generative Models?
In addition to GANs, two other types of generative model are currently popular: Flow Models and Autoregressive Models
This statement shouldn’t be taken too literally.
Those are useful terms for describing fuzzy clusters in ‘model-space’, but there are models
that aren’t easy to describe as belonging to just one of those clusters.
I’ve also left out VAEs entirely;
they’re arguably no longer considered state-of-the-art at any tasks of record.
Roughly speaking, Flow Models apply a
stack of invertible transformations to a sample from a prior
so that exact log-likelihoods of observations can be computed.
On the other hand, Autoregressive Models factorize the
distribution over observations into conditional distributions
and process one component of the observation at a time (for images, they may process one pixel
at a time.)
Recent research suggests that these models have different
performance characteristics and trade-offs.
We think that accurately characterizing these trade-offs and deciding whether they are intrinsic
to the model families is an interesting open question.
For concreteness, let’s temporarily focus on the difference in computational cost between GANs and Flow Models.
At first glance, Flow Models seem like they might make GANs unnecessary.
Flow Models allow for exact log-likelihood computation and exact inference,
so if training Flow Models and GANs had the same computational cost, GANs might not be useful.
A lot of effort is spent on training GANs,
so it seems like we should care about whether Flow Models make GANs obsolete
Even in this case, there might still be other reasons to use adversarial training in contexts like
It also might still make sense to combine adversarial training with maximum-likelihood training.
However, there seems to be a substantial gap between the computational cost of training GANs and Flow Models.
To estimate the magnitude of this gap, we can consider two models trained on datasets of human faces.
The GLOW model is trained to generate 256×256 celebrity faces using
40 GPUs for 2 weeks and about 200 million parameters.
In contrast, progressive GANs are trained on a similar face dataset
with 8 GPUs for 4 days, using about 46 million parameters, to generate 1024×1024 images.
Roughly speaking, the Flow Model took 17 times more GPU days and 4 times more parameters
to generate images with 16 times fewer pixels.
This comparison isn’t perfect,
For instance, it’s possible that the progressive growing
technique could be applied to Flow Models as well.
but it gives you a sense of things.
Why are the Flow Models less efficient?
We see two possible reasons:
First, maximum likelihood training might be computationally harder to do than adversarial training.
In particular, if any element of your training set is assigned zero probability by your generative model,
you will be penalized infinitely harshly!
A GAN generator, on the other hand, is only penalized indirectly for assigning zero probability to training set elements,
and this penalty is less harsh.
Second, normalizing flows might be an inefficient way to represent certain functions.
Section 6.1 of does some small experiments on expressivity, but at
present we’re not aware of any in-depth analysis of this question.
We’ve discussed the trade-off between GANs and Flow Models, but what about Autoregressive Models?
It turns out that Autoregressive Models can be expressed as Flow Models
(because they are both reversible) that are not parallelizable.
Parallelizable is somewhat imprecise in this context.
We mean that sampling from Flow Models must in general be done sequentially, one observation at a time.
There may be ways around this limitation though.
It also turns out that Autoregressive Models are more time and parameter efficient than Flow Models.
Thus, GANs are parallel and efficient but not reversible,
Flow Models are reversible and parallel but not efficient, and
Autoregressive models are reversible and efficient, but not parallel.
This brings us to our first open problem:
What are the fundamental trade-offs between GANs and other generative models?
In particular, can we make some sort of CAP Theorem type statement about reversibility, parallelism, and parameter/time efficiency?
One way to approach this problem could be to study more models that are a hybrid of multiple model families.
This has been considered for hybrid GAN/Flow Models, but we think that
this approach is still underexplored.
We’re also not sure about whether maximum likelihood training is necessarily harder than GAN training.
It’s true that placing zero mass on a training data point is not explicitly prohibited under the
GAN training loss, but it’s also true that a sufficiently powerful discriminator will be able
to do better than chance if the generator does this.
It does seem like GANs are learning distributions of low support in practice though.
Ultimately, we suspect that Flow Models are fundamentally less expressive per-parameter than
arbitrary decoder functions, and we suspect that this is provable under certain assumptions.
What Sorts of Distributions Can GANs Model?
Most GAN research focuses on image synthesis.
In particular, people train GANs on a handful of standard (in the Deep Learning community) image datasets:
There is some folklore about which of these datasets is ‘easiest’ to model.
In particular, MNIST and CelebA are considered easier than Imagenet, CIFAR-10, or STL-10 due to being
Others have noted that ‘a high number of classes is what makes
ImageNet synthesis difficult for GANs’.
These observations are supported by the empirical fact that the state-of-the-art image
synthesis model on CelebA generates images that seem
substantially more convincing than the state-of-the-art image synthesis model on
However, we’ve had to come to these conclusions through the laborious and noisy process of
trying to train GANs on ever larger and more complicated datasets.
In particular, we’ve mostly studied how GANs perform on the datasets that
happened to be laying around for object recognition.
As with any science, we would like to have a simple theory that explains our experimental observations.
Ideally, we could look at a dataset, perform some computations without ever actually
training a generative model, and then say something like ‘this dataset will be
easy for a GAN to model, but not a VAE’.
There has been some progress on this topic,
but we feel that more can be done.
We can now state the problem:
Given a distribution, what can we say about how hard it will be for a GAN to model that distribution?
We might ask the following related questions as well:
What do we mean by ‘model the distribution’? Are we satisfied with a low-support representation, or do we want a true density model?
Are there distributions that a GAN can never learn to model?
Are there distributions that are learnable for a GAN in principle, but are not
efficiently learnable, for some reasonable model of resource-consumption?
Are the answers to these questions actually any different for GANs than they are for other
We propose two strategies for answering these questions:
Synthetic Datasets – We can study synthetic datasets to probe what traits affect learnability.
For example, in the authors create a dataset of synthetic triangles.
We feel that this
angle is under-explored.
Synthetic datasets can even be parameterized by quantities of interest, such as connectedness or smoothness,
allowing for systematic study.
Such a dataset could also be useful for studying other types of generative models.
Modify Existing Theoretical Results – We can take existing theoretical results and try to
modify the assumptions to account
for different properties of the dataset.
For instance, we could take results about GANs that apply given unimodal data distributions and see
what happens to them when the data distribution becomes multi-modal.
How Can we Scale GANs Beyond Image Synthesis?
Aside from applications like image-to-image
most GAN successes have been in image synthesis.
Attempts to use GANs beyond images have focused on three domains:
The discrete nature of text makes it difficult to apply GANs.
This is because GANs rely on backpropagating a signal from the discriminator through the generated content into the generator.
There are two approaches to addressing this difficulty.
The first is to have the GAN act only on continuous
representations of the discrete data, as in.
The second is use an actual discrete model and attempt to train the GAN using
gradient estimation as in.
Other, more sophisticated treatments exist,
but as far as we can tell, none of them produce results that are competitive (in terms of perplexity)
with likelihood-based language models.
Structured Data –
What about other non-euclidean structured data, like graphs?
The study of this type of data is called geometric deep
GANs have had limited success here, but so have other deep learning techniques,
so it’s hard to tell how much the GAN aspect matters.
We’re aware of one attempt to use GANs in this space,
which has the generator produce (and the discriminator ‘critique’) random walks
that are meant to resemble those sampled from a source graph.
Audio is the domain in which GANs are closest to achieving the success
they’ve enjoyed with images.
The first serious attempt at applying GANs to unsupervised audio synthesis
is, in which the authors
make a variety of special allowances for the fact that they are operating on audio.
More recent work suggests GANs can even
outperform autoregressive models on some perceptual metrics.
Despite these attempts, images are clearly the easiest domain for GANs.
This leads us to the statement of the problem:
How can GANs be made to perform well on non-image data?
Does scaling GANs to other domains require new training techniques,
or does it simply require better implicit priors for each domain?
We expect GANs to eventually achieve image-synthesis-level success on other continuous data,
but that it will require better implicit priors.
Finding these priors will require thinking hard about what makes sense and is computationally feasible
in a given domain.
For structured data or data that is not continuous, we’re less sure.
One approach might be to make both the generator and discriminator
be agents trained with reinforcement learning. Making this approach work could require
large-scale computational resources.
Finally, this problem may just require fundamental research progress.
What can we Say About the Global Convergence of GAN Training?
Training GANs is different from training other neural networks because we simultaneously optimize
the generator and discriminator for opposing objectives.
Under certain assumptions
These assumptions are very strict.
The referenced paper assumes (roughly speaking) that
the equilbrium we are looking for exists and that
we are already very close to it.
this simultaneous optimization
is locally asymptotically stable.
Unfortunately, it’s hard to prove interesting things about the fully general case.
This is because the discriminator/generator’s loss is a non-convex function of its parameters.
But all neural networks have this problem!
We’d like some way to focus on just the problems created by simultaneous optimization.
This brings us to our question:
When can we prove that GANs are globally convergent?
Which neural network convergence results can be applied to GANs?
There has been nontrivial progress on this question.
Broadly speaking, there are 3 existing techniques, all of which have generated
promising results but none of which have been studied to completion:
Simplifying Assumptions –
The first strategy is to make simplifying assumptions about the generator and discriminator.
For example, the simplifed LGQ GAN — linear generator, gaussian data, and quadratic discriminator — can be shown to be globally convergent, if optimized with a special technique
and some additional assumptions.
Among other things, it’s assumed that we can first learn the means of the Gaussian and then learn the variances.
It seems promising to gradually relax those assumptions to see what happens.
For example, we could move away from unimodal distributions.
This is a natural relaxation to study, because ‘mode collapse’ is a standard GAN pathology.
Use Techniques from Normal Neural Networks –
The second strategy is to apply techniques for analyzing normal neural networks (which are also non-convex)
to answer questions about convergence of GANs.
For instance, it’s argued in that the non-convexity
of deep neural networks isn’t a problem,
A fact that practitioners already kind of suspected.
because low-quality local minima of the loss function
become exponentially rare as the network gets larger.
Can this analysis can be ‘lifted into GAN space’?
In fact, it seems like a generally useful heuristic to take analyses of deep neural networks used as classifiers and see if they apply to GANs.
Game Theory –
The final strategy is to model GAN training using notions from game theory.
These techniques yield training procedures that provably converge to some kind of approximate nash equilibrium,
but do so using unreasonably large resource constraints.
The ‘obvious’ next step in this case is to try and reduce those resource constraints.
How Should we Evaluate GANs and When Should we Use Them?
When it comes to evaluating GANs, there are many proposals but little consensus.
Inception Score and FID –
Both these scores
use a pre-trained image classifier and both have
known issues .
A common criticism is that these scores measure
‘sample quality’ and don’t really capture ‘sample diversity’.
propose using MS-SSIM to
separately evaluate diversity, but this technique has some issues and hasn’t really caught on.
propose putting a Gaussian observation model on the outputs
of a GAN and using annealed importance sampling to estimate
the log likelihood under this model, but show that
estimates computed this way are inaccurate in the case where the GAN generator is also a flow model
The generator being a flow model allows for computation of exact log-likelihoods in this case.
Geometry Score –
suggest computing geometric properties of the generated data manifold
and comparing those properties to the real data.
Precision and Recall –
attempt to measure both the ‘precision’ and ‘recall’ of GANs.
Skill Rating –
have shown that trained GAN discriminators can contain useful information
with which evaluation can be performed.
Those are just a small fraction of the proposed GAN evaluation schemes.
Although the Inception Score and FID are relatively popular, GAN evaluation is clearly not a settled issue.
Ultimately, we think that confusion about how to evaluate GANs stems from confusion about
when to use GANs.
Thus, we have bundled those two questions into one:
When should we use GANs instead of other generative models?
How should we evaluate performance in those contexts?
What should we use GANs for?
If you want an actual density model, GANs probably aren’t the best choice.
There is now good experimental evidence that GANs learn a ‘low support’ representation of the target dataset
, which means there may be substantial parts of the test
set to which a GAN (implicitly) assigns zero likelihood.
Rather than worrying too much about this,
Though trying to fix this issue is a valid research agenda as well.
we think it makes sense to focus GAN research on tasks where this is fine or even helpful.
GANs are likely to be well-suited to tasks with a perceptual flavor.
Graphics applications like image synthesis, image translation, image infilling, and attribute manipulation
all fall under this umbrella.
How should we evaluate GANs on these perceptual tasks?
Ideally we would just use a human judge, but this is expensive.
A cheap proxy is to see if a classifier can distinguish between real and fake examples.
This is called a classifier two-sample test (C2STs)
The main issue with C2STs is that if the Generator has even a minor defect that’s systematic across samples
(e.g., ) this will dominate the evaluation.
Ideally, we’d have a holistic evaluation that isn’t dominated by a single factor.
One approach might be to make a critic that is blind to the dominant defect.
But once we do this, some other defect may dominate, requiring a new critic, and so on.
If we do this iteratively, we could get a kind of ‘Gram-Schmidt procedure for critics’,
creating an ordered list of the most important defects and
critics that ignore them.
Perhaps this can be done by performing PCA on the critic activations and progressively throwing out
more and more of the higher variance components.
Finally, we could evaluate on humans despite the expense.
This would allow us to measure the thing that we actually care about.
This kind of approach can be made less expensive by predicting human answers and only interacting with a real human when
the prediction is uncertain.
How does GAN Training Scale with Batch Size?
Large minibatches have helped to scale up image classification — can they also help us scale up GANs?
Large minibatches may be especially important for effectively using highly parallel hardware accelerators.
At first glance, it seems like the answer should be yes — after all, the discriminator in most GANs is just an image classifier.
Larger batches can accelerate training if it is bottlenecked on gradient noise.
However, GANs have a separate bottleneck that classifiers don’t: the training procedure can diverge.
Thus, we can state our problem:
How does GAN training scale with batch size?
How big a role does gradient noise play in GAN training?
Can GAN training be modified so that it scales better with batch size?
There’s some evidence that increasing minibatch size improves quantitative results and reduces training time.
If this phenomenon is robust, it would suggest that gradient noise is a dominating factor.
However, this hasn’t been systematically studied, so we believe this question remains open.
Can alternate training procedures make better use of large batches?
Optimal Transport GANs theoretically have better convergence properties than normal GANs,
but need a large batch size because they try to align batches of samples and training data.
As a result they seem like a promising candidate for scaling to very large batch sizes.
<!– Finally, large batch sizes are relevant for synchronous SGD, but that isn’t the only way to paralellize SGD. –>
Finally, asynchronous SGD could be a good alternative for making use of new hardware.
In this setting, the limiting factor tends to be that gradient updates are computed on ‘stale’ copies of the parameters.
But GANs seem to actually benefit from training on past parameter snapshots, so we might ask if
asynchronous SGD interacts in a special way with GAN training.
What is the Relationship Between GANs and Adversarial Examples?
It’s well known that image classifiers suffer from adversarial examples:
human-imperceptible perturbations that cause classifiers to give the wrong output when added to images.
It’s also now known that there are classification problems which can normally be efficiently learned,
but are exponentially hard to learn robustly.
Since the GAN discriminator is an image classifier, one might worry about it suffering from adversarial examples.
Despite the large bodies of literature on GANs and adversarial examples,
there doesn’t seem to be much work on how they relate.
There is work on using GANs to generate adversarial examples, but this is not quite the same thing.
Thus, we can ask the question:
How does the adversarial robustness of the discriminator affect GAN training?
How can we begin to think about this problem?
Consider a fixed discriminator D.
An adversarial example for D would exist if
there were a generator sample G(z) correctly classified as fake and
a small perturbation p such that G(z) + p is classified as real.
With a GAN, the concern would be that the gradient update for the generator would yield
a new generator G’ where G’(z) = G(z) + p.
Is this concern realistic?
shows that deliberate attacks on generative models can work,
but we are more worried about something you might call an ‘accidental attack’.
There are reasons to believe that these accidental attacks are less likely.
First, the generator is only allowed to make one gradient update before
the discriminator is updated again.
In contrast, current adversarial attacks are typically run for tens of iterations.
Second, the generator is optimized given a batch of samples from the prior, and this batch is different
for every gradient step.
Finally, the optimization takes place in the space of parameters of the generator rather than in pixel space.
However, none of these arguments decisively rules out the generator creating adversarial examples.
We think this is a fruitful topic for further exploration.