[AI] StarGAN : Unified Generative Adversarial Networks for Multi-Domain Image-to-Image Translation

I had only ever worked on supervised image-to-image translation, so I felt I didn't really understand GANs as true image-generation models, and I've been studying them one by one. I covered how an image is generated from a random vector z and how style is injected on top. This post is one of the papers I had already summarized in Notion!

Introduction

  • existing models are both inefficient and ineffective in multi-domain image translation tasks
    • Their inefficiency results from the fact that in order to learn all mappings among k domains, k(k-1) generators have to be trained
  • StarGAN learns the mapping among multiple domains using only a single generator and a discriminator
  • Successfully learns multi-domain image translation between multiple datasets by utilizing a mask vector method that enables StarGAN to control all available domain labels

Related Works

Generative Adversarial Networks

  • consists of two modules: a discriminator and a generator

Conditional GANs

  • Prior studies have provided both the discriminator and generator with class information in order to generate samples conditioned on the class
  • Other recent approaches focused on generating particular images highly relevant to a given text description
  • domain transfer, super resolution imaging, photo editing

Image-to-Image Translation

  • pix2pix : supervised cGAN (adversarial + L1)
  • UNIT : combines VAEs with CoGAN
    • two generators share weights to learn the joint distribution of images in different domains
  • CycleGAN, DiscoGAN
    • preserve key attributes between the input and the translated image by utilizing a cycle-consistency loss

⇒ All these frameworks are only capable of learning the relations between two different domains at a time

Materials and Method

Multi-Domain Image-to-Image Translation

  • G translates an input image x into an output image y conditioned on the target domain label c: $G(x, c) \rightarrow y$
  • D produces probability distributions over both sources and domain labels: $D : x \rightarrow \{D_{src}(x), D_{cls}(x)\}$
    • $D_{cls}$ is an auxiliary classifier built on top of the discriminator
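
A minimal PyTorch sketch of this two-headed design: one shared trunk with a real/fake head ($D_{src}$) and a domain-classification head ($D_{cls}$). The two-layer trunk and the five domains are toy assumptions; the paper's discriminator is much deeper.

```python
import torch
import torch.nn as nn

class TwoHeadDiscriminator(nn.Module):
    """Sketch of D: one shared trunk, two output heads (D_src, D_cls)."""
    def __init__(self, img_size=128, num_domains=5):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.LeakyReLU(0.01),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.LeakyReLU(0.01),
        )
        k = img_size // 4  # spatial size after two stride-2 convs
        self.src = nn.Conv2d(128, 1, 3, padding=1)  # real/fake patch scores
        self.cls = nn.Conv2d(128, num_domains, k)   # auxiliary domain classifier

    def forward(self, x):
        h = self.trunk(x)
        return self.src(h), self.cls(h).view(x.size(0), -1)

D = TwoHeadDiscriminator()
out_src, out_cls = D(torch.randn(2, 3, 128, 128))
print(out_src.shape, out_cls.shape)  # (2, 1, 32, 32) and (2, 5)
```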

Adversarial Loss

\[L_{adv} = E_x[\log D_{src}(x)] + E_{x,c}[\log(1-D_{src}(G(x,c)))]\]
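
A hedged sketch of computing this loss from raw $D_{src}$ logits via the numerically stable BCE-with-logits form; the non-saturating generator loss below is a common swap-in, not something the paper spells out.

```python
import torch
import torch.nn.functional as F

def d_adv_loss(src_real, src_fake):
    # D maximizes E[log D_src(x)] + E[log(1 - D_src(G(x,c)))];
    # equivalently it minimizes the two BCE terms below
    return (F.binary_cross_entropy_with_logits(src_real, torch.ones_like(src_real))
            + F.binary_cross_entropy_with_logits(src_fake, torch.zeros_like(src_fake)))

def g_adv_loss(src_fake):
    # non-saturating form: G maximizes log D_src(G(x,c))
    return F.binary_cross_entropy_with_logits(src_fake, torch.ones_like(src_fake))

fake = torch.randn(2, 1, 32, 32)  # patch logits from D_src
print(d_adv_loss(torch.randn_like(fake), fake).item())
```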

Domain Classification Loss

c: target domain label, c': original domain label of the input image x

Domain classification loss of real images, used to optimize D:

\[L^r_{cls} = E_{x,c'}[-\log D_{cls}(c'|x)]\]

Domain classification loss of fake images, used to optimize G:

\[L^f_{cls} = E_{x,c}[-\log D_{cls}(c|G(x,c))]\]
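
A sketch of both terms, assuming $D_{cls}$ outputs categorical logits (as for RaFD's mutually exclusive expressions); CelebA's binary attributes would use binary cross-entropy instead. The 8-class dimensions in the usage line are assumptions.

```python
import torch
import torch.nn.functional as F

def cls_loss_real(cls_logits_real, c_orig):
    # L^r_cls: teaches D to recognize the original domain c' of a real image
    return F.cross_entropy(cls_logits_real, c_orig)

def cls_loss_fake(cls_logits_fake, c_target):
    # L^f_cls: pushes G to make images that D classifies as the target domain c
    return F.cross_entropy(cls_logits_fake, c_target)

logits = torch.randn(4, 8)            # 8 expression classes (assumed)
labels = torch.randint(0, 8, (4,))
print(cls_loss_real(logits, labels).item())
```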

Reconstruction Loss

\[L_{rec} = E_{x,c,c'}[||x-G(G(x,c),c')||_1]\]
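
The cycle term in code, a minimal sketch assuming G takes an (image, label) pair:

```python
import torch

def rec_loss(G, x, c_target, c_orig):
    x_fake = G(x, c_target)   # translate to the target domain: G(x, c)
    x_rec = G(x_fake, c_orig) # translate back: G(G(x, c), c')
    return torch.mean(torch.abs(x - x_rec))  # L1 distance

# toy usage with an identity "generator" that ignores the label
print(rec_loss(lambda img, c: img, torch.randn(2, 3, 8, 8), 0, 1).item())  # 0.0
```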

Full objective:

  • $\lambda_{cls}=1, \lambda_{rec}=10$
\[L_D = -L_{adv} + \lambda_{cls}L^r_{cls}\] \[L_G = L_{adv} + \lambda_{cls}L^f_{cls} + \lambda_{rec}L_{rec}\]
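
Wiring the terms together with stand-in scalar losses, to make explicit that D uses the real-image classification term and G the fake-image one:

```python
import torch

lambda_cls, lambda_rec = 1.0, 10.0
l_adv, l_cls_r, l_cls_f, l_rec = (torch.rand(()) for _ in range(4))  # stand-ins

loss_D = -l_adv + lambda_cls * l_cls_r                      # minimized by D
loss_G = l_adv + lambda_cls * l_cls_f + lambda_rec * l_rec  # minimized by G
print(loss_D.item(), loss_G.item())
```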

Training with Multiple Datasets

  • StarGAN can control all the labels at the test phase
  • issue
    • label information is only partially known to each dataset
    • problematic because the complete information on the label vector c' is required when reconstructing the input image x from the translated image G(x,c)

Mask vector

  • allows StarGAN to ignore unspecified labels and focus on the explicitly known label provided by a particular dataset
  • the mask m is an n-dimensional one-hot vector (n: # of datasets)
\[\tilde{c}=[c_1, c_2, ..., c_n, m]\]
  • $c_i$ can be a binary vector for binary attributes or a one-hot vector for categorical attributes
  • this mask vector is used when training on multiple domains (see the sketch after this list)
    • the generator learns to ignore the unspecified labels, which are zero vectors, and to focus on the explicitly given label
    • the auxiliary classifier of the discriminator generates probability distributions over labels for all datasets
      • e.g., when training on CelebA images, the discriminator minimizes only the classification errors for labels related to CelebA attributes, not the facial expressions related to RaFD
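
A sketch of assembling $\tilde{c}$ for a batch when training jointly on CelebA and RaFD. The dimensions (5 CelebA attributes, 8 RaFD expressions) match the paper's setup, but the helper itself is illustrative, not the official code.

```python
import torch

def unified_label(c_celeba=None, c_rafd=None, dims=(5, 8)):
    """Build c~ = [c_celeba, c_rafd, m]; unknown slots become zero vectors."""
    batch = (c_celeba if c_celeba is not None else c_rafd).size(0)
    c1 = c_celeba if c_celeba is not None else torch.zeros(batch, dims[0])
    c2 = c_rafd if c_rafd is not None else torch.zeros(batch, dims[1])
    m = torch.zeros(batch, 2)                  # one-hot over the two datasets
    m[:, 0 if c_celeba is not None else 1] = 1.0
    return torch.cat([c1, c2, m], dim=1)

c_attr = torch.tensor([[1., 0., 1., 0., 0.]])  # a CelebA binary attribute vector
print(unified_label(c_celeba=c_attr))          # RaFD slot stays a zero vector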

Implementation

Wasserstein GAN

  • the adversarial loss is replaced with the Wasserstein GAN objective with a gradient penalty to stabilize training
\[\lambda_{gp}=10\] \[L_{adv} = E_x[D_{src}(x)]-E_{x,c}[D_{src}(G(x,c))] - \lambda_{gp}E_{\hat x}[(||\nabla_{\hat x}D_{src}(\hat x)||_2-1)^2]\]
  • $\hat x$ is sampled uniformly along straight lines between pairs of real and generated images
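
A sketch of the gradient penalty term; `d_src` stands in for the source head of the discriminator, and the tiny conv below is only a placeholder so the snippet runs.

```python
import torch

def gradient_penalty(d_src, x_real, x_fake):
    # x_hat: uniform samples on straight lines between real and fake images
    alpha = torch.rand(x_real.size(0), 1, 1, 1)
    x_hat = (alpha * x_real + (1 - alpha) * x_fake).requires_grad_(True)
    out = d_src(x_hat)
    grad, = torch.autograd.grad(out, x_hat, grad_outputs=torch.ones_like(out),
                                create_graph=True)
    grad_norm = grad.view(grad.size(0), -1).norm(2, dim=1)
    return ((grad_norm - 1) ** 2).mean()  # E[(||grad||_2 - 1)^2]

d = torch.nn.Conv2d(3, 1, 4, stride=2, padding=1)  # stand-in for D_src
print(gradient_penalty(d, torch.randn(2, 3, 32, 32),
                       torch.randn(2, 3, 32, 32)).item())
```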

Network Architecture

  • Two convolutional layers with the stride size of two for downsampling
  • six residual blocks
  • two transposed convolution layers with the stride size of two for upsampling
  • instance normalization for the generator, no normalization for the discriminator
  • PatchGANs for the discriminator network, which classifies whether local image patches are real or fake
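
A sketch of the generator layout described above. The channel widths and the spatially replicated label input are assumptions in this snippet; treat it as illustrative rather than the official implementation.

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    def __init__(self, ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch, affine=True), nn.ReLU(True),
            nn.Conv2d(ch, ch, 3, padding=1), nn.InstanceNorm2d(ch, affine=True),
        )
    def forward(self, x):
        return x + self.body(x)

def make_generator(img_ch=3, label_dim=5):
    # the target label c is replicated spatially and concatenated to the image,
    # hence img_ch + label_dim input channels
    return nn.Sequential(
        nn.Conv2d(img_ch + label_dim, 64, 7, padding=3), nn.InstanceNorm2d(64, affine=True), nn.ReLU(True),
        nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.InstanceNorm2d(128, affine=True), nn.ReLU(True),   # down 1
        nn.Conv2d(128, 256, 4, stride=2, padding=1), nn.InstanceNorm2d(256, affine=True), nn.ReLU(True),  # down 2
        *[ResBlock(256) for _ in range(6)],                                                               # 6 residual blocks
        nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.InstanceNorm2d(128, affine=True), nn.ReLU(True),  # up 1
        nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.InstanceNorm2d(64, affine=True), nn.ReLU(True),    # up 2
        nn.Conv2d(64, img_ch, 7, padding=3), nn.Tanh(),
    )

G = make_generator()
print(G(torch.randn(1, 8, 128, 128)).shape)  # torch.Size([1, 3, 128, 128])
```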

Baseline models

  • DIAT
    • adversarial loss to learn the mapping from $x \in X$ to $y \in Y$
    • regularization term $\|x-F(G(x))\|_1$, where F is a feature extractor pretrained on a face recognition task
  • CycleGAN
    • regularization via a cycle consistency loss
  • IcGAN
    • can perform attribute transfer using a cGAN
    • combines an encoder with a cGAN
    • encoder learns the inverse mappings of the cGAN, $E_z:x→z$, $E_c:x→c$
      • allows IcGAN to synthesize images by changing only the conditional vector while preserving the latent vector

Results

Experimental Results on CelebA

  • regularization effect of StarGAN through a multi-task learning framework
  • compared to IcGAN, our model demonstrates an advantage in preserving the facial identity of the input
  • performs best in the AMT user-study evaluation

Experimental Results on RaFD

  • StarGAN clearly generates the most natural-looking expressions while properly maintaining the personal identity and facial features of the input
  • achieves the lowest classification error of a facial expression classifier on the synthesized images

Experimental Results on CelebA+RaFD

Comparison of StarGAN-SNG (trained on a single dataset) and StarGAN-JNT (trained jointly on CelebA and RaFD):

  • StarGAN-JNT exhibits emotional expressions with high visual quality, while StarGAN-SNG generates reasonable but blurry images with gray backgrounds

  • Learned role of the mask vector
    • gave a one-hot vector c by setting the dimension of a particular facial expression to one
  • when the wrong mask vector was used, StarGAN-JNT fails to synthesize facial expressions and instead manipulates the age of the input image
  • because of the mask vector, the model ignores the facial expression label as unknown and treats the facial attribute label as valid

Conclusion

  • proposed StarGAN, a scalable image-to-image translation model among multiple domains using a single generator and a single discriminator