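I've only ever worked on supervised image-to-image translation, so I felt I didn't really understand GANs in the true image-generation sense, and I've been studying them one by one. I went over how an image is generated from a random vector z and how style is applied. Here I'm posting one of the papers I had already summarized in Notion!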
Introduction
- existing models are both inefficient and ineffective in multi-domain image translation tasks
- Their inefficiency results from the fact that in order to learn all mappings among k domains, k(k-1) generators have to be trained
- StarGAN learns the mapping among multiple domains using only a single generator and a discriminator
- Successfully learns multi-domain image translation between multiple datasets by utilizing a mask vector method that enables StarGAN to control all available domain labels
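To make the scaling argument concrete, the short sketch below counts generators for a few values of k. The function names are made up for illustration.

```python
def pairwise_generators(k: int) -> int:
    """Generators needed when every ordered pair of domains gets its own model."""
    return k * (k - 1)


def stargan_generators(k: int) -> int:
    """StarGAN uses one generator regardless of the number of domains."""
    return 1


for k in (2, 4, 8):
    print(k, pairwise_generators(k), stargan_generators(k))
```

With k = 4 domains this is already 12 generators versus 1.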
Related Works
Generative Adversarial Networks
- consists of two modules: a discriminator and a generator
Conditional GANs
- Prior studies have provided both the discriminator and generator with class information in order to generate samples conditioned on the class
- Other recent approaches focused on generating particular images highly relevant to a given text description
- conditional image generation has also been applied to domain transfer, super-resolution imaging, and photo editing
Image-to-Image Translation
- pix2pix : supervised cGAN (adversarial loss + L1 loss)
- UNIT : VAEs with CoGAN
- two generators share weights to learn the joint distribution of images in cross domains
- CycleGAN, DiscoGAN
- preserve key attributes between the input and the translated image by utilizing a cycle-consistency loss
⇒ All these frameworks are only capable of learning the relations between two different domains at a time
Materials and Method
Multi-Domain Image-to-Image Translation
- G translates an input image x into an output image y conditioned on the target domain label c, $G(x, c) \rightarrow y$
- D produces probability distributions over both sources and domain labels, $D : x \rightarrow \{D_{src}(x), D_{cls}(x)\}$
- auxiliary classifier
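One common way to condition G on c is to replicate the label spatially and concatenate it with the image along the channel axis. The sketch below shows the idea with plain nested lists. The channel-first layout and the helper name `concat_label` are assumptions for illustration, not details taken from these notes.

```python
from typing import List

Image = List[List[List[float]]]  # channels x height x width


def concat_label(x: Image, c: List[float]) -> Image:
    """Append one constant H x W plane per label entry to the image channels."""
    height, width = len(x[0]), len(x[0][0])
    label_planes = [[[float(v)] * width for _ in range(height)] for v in c]
    return x + label_planes


# A 3-channel 2x2 image and a 5-dimensional target label (toy sizes).
x = [[[0.5, 0.5], [0.5, 0.5]] for _ in range(3)]
c = [1, 0, 0, 1, 0]
g_input = concat_label(x, c)
print(len(g_input), len(g_input[0]), len(g_input[0][0]))  # 8 2 2
```

The generator then sees the label at every spatial position, so a single network can serve every target domain.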
Adversarial Loss
\[L_{adv} = E_x[\log D_{src}(x)] + E_{x,c}[\log(1-D_{src}(G(x,c)))]\]
Domain Classification Loss
c: target domain, c': original domain
DCL of real images, used to optimize D
\[L^r_{cls} = E_{x,c'}[-\log D_{cls}(c'|x)]\]
DCL of fake images, used to optimize G
\[L^f_{cls} = E_{x,c}[-\log D_{cls}(c|G(x,c))]\]
Reconstruction Loss
\[L_{rec} = E_{x,c,c'}[||x-G(G(x,c),c')||_1]\]
Full objective:
\[L_D = -L_{adv} + \lambda_{cls} L^r_{cls}\]
\[L_G = L_{adv} + \lambda_{cls} L^f_{cls} + \lambda_{rec} L_{rec}\]
- $\lambda_{cls}=1, \lambda_{rec}=10$
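As a rough sketch of how these terms combine, the snippet below computes each loss from toy numbers that stand in for network outputs. The function names and inputs are hypothetical. The combination follows the objective in the StarGAN paper: D minimizes the negated adversarial term plus the real-image classification loss, and G minimizes the adversarial term plus the fake-image classification loss and the weighted reconstruction loss.

```python
import math
from typing import Sequence


def adversarial_loss(d_real: Sequence[float], d_fake: Sequence[float]) -> float:
    """E[log D_src(x)] + E[log(1 - D_src(G(x, c)))] over small batches."""
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return real_term + fake_term


def classification_loss(probs: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """E[-log D_cls(label | image)] for categorical domain labels."""
    return sum(-math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)


def reconstruction_loss(x: Sequence[float], x_rec: Sequence[float]) -> float:
    """L1 distance between a flattened image and its reconstruction."""
    return sum(abs(a - b) for a, b in zip(x, x_rec))


def discriminator_objective(l_adv: float, l_cls_real: float,
                            lambda_cls: float = 1.0) -> float:
    """L_D = -L_adv + lambda_cls * L_cls^r (minimized by D)."""
    return -l_adv + lambda_cls * l_cls_real


def generator_objective(l_adv: float, l_cls_fake: float, l_rec: float,
                        lambda_cls: float = 1.0, lambda_rec: float = 10.0) -> float:
    """L_G = L_adv + lambda_cls * L_cls^f + lambda_rec * L_rec (minimized by G)."""
    return l_adv + lambda_cls * l_cls_fake + lambda_rec * l_rec


# Toy numbers standing in for network outputs (purely illustrative).
l_adv = adversarial_loss(d_real=[0.9, 0.8], d_fake=[0.2, 0.1])
l_cls_real = classification_loss([[0.7, 0.2, 0.1]], [0])  # original label of a real image
l_cls_fake = classification_loss([[0.1, 0.6, 0.3]], [1])  # target label of a fake image
l_rec = reconstruction_loss([0.0, 1.0, 0.5], [0.1, 0.9, 0.5])
print(discriminator_objective(l_adv, l_cls_real))
print(generator_objective(l_adv, l_cls_fake, l_rec))
```

The reconstruction term sums absolute differences for a single image; an implementation that averages over pixels instead would only change the effective scale of the reconstruction weight.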
Training with Multiple Datasets
- StarGAN can control all the labels at the test phase
- issue: label information is only partially known to each dataset
- this is problematic because complete information on the label vector c' is required when reconstructing the input image x from the translated image G(x,c)
Mask vector
- allows StarGAN to ignore unspecified labels and focus on the explicitly known label provided by a particular dataset
- the mask vector m is an n-dimensional one-hot vector (n: number of datasets), and the unified label is $\tilde{c} = [c_1, ..., c_n, m]$
- $c_i$, the label vector of the i-th dataset, can be a binary vector for binary attributes or a one-hot vector for categorical attributes
- this mask vector is used when training on multiple datasets
- the generator learns to ignore the unspecified labels, which are zero vectors, and focus on the explicitly given label
- the auxiliary classifier of the discriminator is extended to generate probability distributions over labels for all datasets
- when training on CelebA images, the discriminator minimizes only classification errors for labels related to CelebA attributes, and not facial expressions related to RaFD
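The label construction described above can be sketched in a few lines. The label sizes used here (5 binary attributes for one dataset, 8 expression classes for the other) and the helper names are assumptions chosen for illustration.

```python
from typing import List, Optional, Sequence


def unified_label(labels: Sequence[Optional[List[float]]],
                  dims: Sequence[int]) -> List[float]:
    """Concatenate per-dataset labels and a one-hot mask; unknown labels become zeros."""
    known = [i for i, c in enumerate(labels) if c is not None]
    if len(known) != 1:
        raise ValueError("exactly one dataset's label must be specified")
    out: List[float] = []
    for c, dim in zip(labels, dims):
        if c is not None and len(c) != dim:
            raise ValueError("label length does not match its dataset's dimension")
        out.extend(c if c is not None else [0.0] * dim)
    mask = [1.0 if i == known[0] else 0.0 for i in range(len(labels))]
    return out + mask


def known_slice(dataset_index: int, dims: Sequence[int]) -> slice:
    """Part of the classifier output that belongs to the known dataset."""
    start = sum(dims[:dataset_index])
    return slice(start, start + dims[dataset_index])


dims = (5, 8)  # assumed sizes: 5 binary attributes, 8 expression classes
celeba_label = unified_label([[1, 0, 0, 1, 1], None], dims)
rafd_label = unified_label([None, [0, 1, 0, 0, 0, 0, 0, 0]], dims)
print(celeba_label)
print(rafd_label)
print(known_slice(0, dims), known_slice(1, dims))
```

During training on one dataset, the classification loss would be computed only on the slice returned by `known_slice`, so the other dataset's labels contribute nothing.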
Implementation
Wasserstein GAN
- replaced the adversarial loss with the Wasserstein GAN objective with gradient penalty
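For reference, the gradient-penalty form of the objective as given in the StarGAN paper replaces the log terms with raw critic scores and adds a penalty on the gradient norm:

\[L_{adv} = E_x[D_{src}(x)] - E_{x,c}[D_{src}(G(x,c))] - \lambda_{gp} E_{\hat{x}}[(||\nabla_{\hat{x}} D_{src}(\hat{x})||_2 - 1)^2]\]

Here $\hat{x}$ is sampled uniformly along a straight line between a real image and a generated image, and the paper sets $\lambda_{gp}=10$.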
Network Architecture
- Two convolutional layers with the stride size of two for downsampling
- six residual blocks
- two transposed convolution layers with the stride size of two for upsampling
- instance normalization for generator, no normalization for discriminator
- PatchGANs for the discriminator network, which classifies whether local image patches are real or fake
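To see how the stride-2 layers change resolution, the sketch below traces the spatial size through the generator. The 128-pixel input, the 4x4 kernels, and the padding of 1 are assumptions for illustration; these notes only fix the stride and the number of layers.

```python
def conv_out(n: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a convolution along one spatial dimension."""
    return (n + 2 * padding - kernel) // stride + 1


def deconv_out(n: int, kernel: int, stride: int, padding: int) -> int:
    """Output size of a transposed convolution along one spatial dimension."""
    return (n - 1) * stride - 2 * padding + kernel


size = 128  # assumed input resolution
sizes = [size]
for _ in range(2):  # two stride-2 convolutions (downsampling)
    size = conv_out(size, kernel=4, stride=2, padding=1)
    sizes.append(size)
# The six residual blocks keep the resolution unchanged.
for _ in range(2):  # two stride-2 transposed convolutions (upsampling)
    size = deconv_out(size, kernel=4, stride=2, padding=1)
    sizes.append(size)
print(sizes)  # [128, 64, 32, 64, 128]
```

Under these assumptions the output has the same resolution as the input, which an image-to-image model needs.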
Baseline models
- DIAT
- adversarial loss to learn the mapping from $x \in X$ to $y \in Y$
- regularization term \( ||x-F(G(x))||_1 \), where F is a feature extractor pretrained on a face recognition task
- CycleGAN
- regularization via cycle consistency loss
- IcGAN
- can perform attribute transfer using a cGAN
- combines an encoder with a cGAN
- encoders learn the inverse mappings of the cGAN, $E_z: x \rightarrow z$, $E_c: x \rightarrow c$
- allows IcGAN to synthesize images by changing only the conditional vector and preserving the latent vector
Result
Experimental Results on CelebA
- regularization effect of StarGAN through a multi-task learning framework
- compared to IcGAN, StarGAN demonstrates an advantage in preserving the facial identity feature of an input
- performs best in the Amazon Mechanical Turk (AMT) user study
Experimental Results on RaFD
- StarGAN clearly generates the most natural-looking expressions while properly maintaining the personal identity and facial features of the input
- quantitative evaluation by the classification error of facial expressions on synthesized images, where StarGAN achieves the lowest error among the compared models
Experimental Results on CelebA+RaFD
StarGAN-SNG (single dataset), StarGAN-JNT (joint training)
- StarGAN-JNT exhibits emotional expressions with high visual quality, while StarGAN-SNG generates reasonable but blurry images with gray backgrounds
- Learned role of mask vector
- gave a one-hot vector c by setting the dimension of a particular facial expression to one
- when the wrong mask vector was used, StarGAN-JNT fails to synthesize facial expressions and instead manipulates the age of the input image
- this is because the mask vector makes the model ignore the facial expression label as unknown and treat the facial attribute label as valid
Conclusion
- proposed StarGAN, a scalable image-to-image translation model among multiple domains using a single generator and a discriminator