What's the reason for the mismatch between the definitions of the GAN loss function in two papers?

I was trying to understand the loss function of GANs, but I found a little mismatch between different papers.

This is taken from the original GAN paper:

The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator's distribution $p_g$ over data $x$, we define a prior on input noise variables $p_z(z)$, then represent a mapping to data space as $G(z; \theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_g$. We also define a second multilayer perceptron $D(x; \theta_d)$ that outputs a single scalar. $D(x)$ represents the probability that $x$ came from the data rather than $p_g$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$. We simultaneously train $G$ to minimize $\log(1 - D(G(z)))$:


In other words, $D$ and $G$ play the following two-player minimax game with value function $V(G,D)$:

$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]$$
And this is Equation (1) in this version of the pix2pix paper:

The objective of a conditional GAN can be expressed as

$$\mathcal{L}_{cGAN}(G,D) = \mathbb{E}_{x,y}[\log D(x,y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x,z)))],$$

where $G$ tries to minimize this objective against an adversarial $D$ that tries to maximize it, i.e. $G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G,D)$.


To test the importance of conditioning the discriminator, we also compare to an unconditional variant in which the discriminator does not observe $x$:

$$\mathcal{L}_{GAN}(G,D) = \mathbb{E}_y[\log D(y)] + \mathbb{E}_{x,z}[\log(1 - D(G(x,z)))].$$

Putting aside the fact that pix2pix uses a conditional GAN, which introduces a second variable $y$, the two formulas are quite similar, except that in the pix2pix paper they define $\mathcal{L}_{cGAN}(G,D)$ to be $\mathbb{E}_{x,y}[\ldots] + \mathbb{E}_{x,z}[\ldots]$ and then take its min-max, whereas in the original paper they write $\min_G \max_D V(D,G) = \mathbb{E}[\ldots] + \mathbb{E}[\ldots]$ directly.


I don't come from a strong math background, so I am quite confused. I'm not sure where the mistake is, but assuming that $\mathbb{E}$ is an expectation (correct me if I'm wrong), the version in pix2pix makes more sense to me, although I think it's quite unlikely that Goodfellow would make such a mistake in his amazing paper. Maybe there's no mistake at all and it's me who doesn't understand them correctly.

Answered by Amanda Hawes

The question is about a mismatch between the GAN loss function in two papers. The first paper is Generative Adversarial Nets, Ian J. Goodfellow et al., 2014, and the excerpt quoted in the question is this:


The adversarial modeling framework is most straightforward to apply when the models are both multilayer perceptrons. To learn the generator's distribution $p_g$ over data $x$, we define a prior on input noise variables $p_z(z)$, then represent a mapping to data space as $G(z; \theta_g)$, where $G$ is a differentiable function represented by a multilayer perceptron with parameters $\theta_g$. We also define a second multilayer perceptron $D(x; \theta_d)$ that outputs a single scalar. $D(x)$ represents the probability that $x$ came from the data rather than $p_g$. We train $D$ to maximize the probability of assigning the correct label to both training examples and samples from $G$. We simultaneously train $G$ to minimize $\log(1 - D(G(z)))$:

In other words, $D$ and $G$ play the following two-player minimax game with value function $V(G,D)$:

$$\min_G \max_D V(D,G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \tag{1}$$
The second paper is Image-to-Image Translation with Conditional Adversarial Networks, Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros, 2018, and the excerpt quoted in the question is this:
The objective of a conditional GAN can be expressed as
$$\mathcal{L}_{cGAN}(G,D) = \mathbb{E}_{x,y}[\log D(x,y)] + \mathbb{E}_{x,z}[\log(1 - D(x, G(x,z)))], \tag{1}$$
where $G$ tries to minimize this objective against an adversarial $D$ that tries to maximize it, i.e.

$$G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G,D).$$
To test the importance of conditioning the discriminator, we also compare to an unconditional variant in which the discriminator does not observe $x$:
$$\mathcal{L}_{GAN}(G,D) = \mathbb{E}_y[\log D(y)] + \mathbb{E}_{x,z}[\log(1 - D(G(x,z)))]. \tag{2}$$

In the above, $G$ refers to the generator network, $D$ refers to the discriminator network, and $G^*$ refers to the minimizer over $G$ of the maximum over $D$ of the objective. As the question author tentatively put forward, $\mathbb{E}$ is the expectation with respect to the distributions indicated by its subscripts. The discrepancy raised in the question is that the right-hand sides do not match between the first paper's equation (1) and the second paper's equation (2), the unconditional variant in which the discriminator is not conditioned on $x$.
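Reading the first paper's equation (1) as a definition of the value function followed by a statement of the game, rather than as a single equality to be solved, both papers can be put in the same form. The rewriting below is mine, not a quotation from either paper:

$$V(D,G) = \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))], \qquad G^* = \arg\min_G \max_D V(D,G),$$

which follows the same pattern as the second paper: first define the objective $\mathcal{L}_{cGAN}(G,D)$, then take $G^* = \arg\min_G \max_D \mathcal{L}_{cGAN}(G,D)$.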

First paper:

$$\mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]. \tag{1}$$

Second paper:

$$\mathbb{E}_y[\log D(y)] + \mathbb{E}_{x,z}[\log(1 - D(G(x,z)))]. \tag{2}$$
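To make the expectations concrete, here is a minimal Python sketch that estimates the first paper's right-hand side from a batch of data samples and a batch of noise samples. The discriminator and generator below are hypothetical toy stand-ins, not either paper's networks:

```python
import numpy as np

rng = np.random.default_rng(0)

def D(x):
    # Toy stand-in discriminator: maps a sample to a probability in (0, 1)
    # that it came from the data rather than from the generator.
    return 1.0 / (1.0 + np.exp(-x.sum(axis=-1)))

def G(z):
    # Toy stand-in generator: maps noise z into "data space".
    return 2.0 * z + 1.0

# x ~ p_data(x): a batch of real samples; z ~ p_z(z): a batch of noise samples.
x_real = rng.normal(loc=1.0, scale=0.5, size=(128, 2))
z_noise = rng.normal(size=(128, 2))

# Monte Carlo estimate of
#   V(D, G) = E_{x ~ p_data(x)}[log D(x)] + E_{z ~ p_z(z)}[log(1 - D(G(z)))]
value = np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(G(z_noise))))
print(value)
```

$D$ is trained to make this quantity large and $G$ to make it small, which is what $\min_G \max_D$ expresses.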
The second, later paper further states this:

GANs are generative models that learn a mapping from random noise vector $z$ to output image $y$, $G : z \to y$. In contrast, conditional GANs learn a mapping from observed image $x$ and random noise vector $z$ to $y$, $G : \{x, z\} \to y$.

Notice that there is no $y$ in the first paper, and the removal of the condition in the second paper corresponds to the removal of $x$ as the first parameter of $D$. This is one cause of confusion when comparing the right-hand sides; the others are the choice of variable names and the degree of explicitness of the expectation notation.
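The difference in what the discriminator observes is easy to see in code. The sketch below uses hypothetical toy functions (not the papers' convolutional networks) to contrast the conditional discriminator $D(x,y)$ of the second paper's equation (1) with the unconditional discriminator $D(y)$ of its equation (2):

```python
import numpy as np

def conditional_D(x, y):
    # Second paper, eq. (1): the discriminator observes the condition x
    # together with the image y.
    score = np.concatenate([x, y], axis=-1).sum(axis=-1)
    return 1.0 / (1.0 + np.exp(-score))

def unconditional_D(y):
    # Second paper, eq. (2): the discriminator observes only the image y.
    return 1.0 / (1.0 + np.exp(-y.sum(axis=-1)))

def G(x, z):
    # Conditional generator G : {x, z} -> y; the first paper's generator
    # would drop x and use G(z) alone.
    return x + z

x = np.ones((4, 2))                               # observed (conditioning) images
z = np.random.default_rng(1).normal(size=(4, 2))  # noise samples

print(conditional_D(x, G(x, z)))   # fake term of eq. (1): D(x, G(x, z))
print(unconditional_D(G(x, z)))    # fake term of eq. (2): D(G(x, z))
```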

The tilde means "drawn according to". The right-hand side in the first paper indicates that the expectation involving $x$ is taken with $x$ drawn according to the data distribution $p_{\text{data}}(x)$, and the expectation involving $z$ is taken with $z$ drawn according to the noise prior $p_z(z)$.
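Assuming the second paper's subscripts are the usual abbreviation of this sampling notation (a reading the text supports but does not spell out), its equation (2) expands to

$$\mathcal{L}_{GAN}(G,D) = \mathbb{E}_{y \sim p_{\text{data}}(y)}[\log D(y)] + \mathbb{E}_{x \sim p_{\text{data}}(x),\, z \sim p_z(z)}[\log(1 - D(G(x,z)))].$$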

Removing the observation of $x$ from the second right-hand term of the second paper's equation (2), where it appears as the first parameter of $G$, renaming that equation's $y$ variable to the now freed-up $x$, and accepting the abbreviation of the tilde notation used in the first paper then brings both papers into exact agreement, as worked through below.
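As a worked sketch of those steps (the relabelling of $y$ to $x$ is chosen only to match the first paper's variable names):

$$\begin{aligned}
&\mathbb{E}_{y \sim p_{\text{data}}(y)}[\log D(y)] + \mathbb{E}_{x,\, z \sim p_z(z)}[\log(1 - D(G(x,z)))] \\
\longrightarrow\;& \mathbb{E}_{y \sim p_{\text{data}}(y)}[\log D(y)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad \text{(drop the condition } x \text{ from } G\text{)} \\
\longrightarrow\;& \mathbb{E}_{x \sim p_{\text{data}}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))] \quad \text{(rename } y \text{ to } x\text{)},
\end{aligned}$$

which is exactly the right-hand side of the first paper's equation (1).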


