M0kI’s bloG

[DeepLearning] Local Reparametrization Trick

2026-03-27T15:00:00+00:00

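I first learned the reparametrization trick at university and came back to it while reading about the local reparametrization trick. To estimate the gradient of an expectation by sampling, the gradient itself has to be an expectation under the same distribution; when it is not, the reparametrization trick rewrites it so that it is. (Details below.) In practice we have always been computing expectations with Monte Carlo estimation instead of the actual integral! But MC estimation only works when the quantity is an expectation under the sampling distribution, and after differentiating that is not always the case.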

Reparametrization trick

Suppose we want the expectation of a function $f_\theta(x)$ under a distribution $p(x)$. Written as an integral, this expectation is:

\[L(\theta) = \mathbb{E}_{p(x)} [f_\theta(x)] = \int f_\theta(x) p(x)dx\]

We usually want to maximize this expectation, so we need to differentiate it with respect to $\theta$. Since $p(x)$ has no $\theta$ term, the derivative only acts on $f_\theta(x)$, which means the derivative of $L(\theta)$ can again be written as an expectation.

\[\nabla_\theta L(\theta) = \nabla_\theta \int f_\theta(x) p(x)dx = \mathbb{E}_{p(x)}[\nabla_\theta f_\theta(x)]\]

When the distribution $p(x)$ we integrate over is independent of $\theta$ like this, we can approximate the gradient by Monte Carlo sampling. But when $p(x)$ depends on $\theta$, the gradient is no longer a plain expectation and cannot be approximated by sampling in this way. This is where the reparametrization trick comes in. In that case the objective is

\[L(\theta) = \mathbb{E}_{p_\theta(x)} [f_\theta(x)] = \int f_\theta(x) p_\theta(x)dx\]

Differentiating now gives one term that is an expectation and one term that cannot be written that way.

\[\nabla_\theta L(\theta) = \nabla_\theta \int f_\theta(x) p_\theta(x)dx = \int \nabla_\theta f_\theta(x) p_\theta(x)dx + \int f_\theta(x) \nabla_\theta p_\theta(x)dx\]

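That is,

\[\nabla_\theta L(\theta) = \mathbb{E}_{p_\theta(x)}[\nabla_\theta f_\theta(x)] + \int f_\theta(x) \nabla_\theta p_\theta(x)dx\]

The second term is not an expectation under $p_\theta(x)$ and generally has no analytic form, so we cannot estimate it directly by sampling. Reparameterizing $p_\theta(x)$ solves this problem.

Define $x=g(\epsilon ; \theta)$ with $\epsilon \sim p(\epsilon)$, where $p(\epsilon)$ is a fixed distribution that does not depend on $\theta$ (a Gaussian here). Then the expectation under $p_\theta(x)$ can be written as follows.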

\[\mathbb{E}_{p_\theta(x)} [f_\theta (x)] = \mathbb{E}_{p(\epsilon)} [f_\theta (g(\epsilon ; \theta))]\]

The gradient of this term can also be written as an expectation, so it can be approximated by Monte Carlo sampling with $\epsilon_i \sim p(\epsilon)$.

\[\mathbb{E}_{p(\epsilon)} [f_\theta (g(\epsilon ; \theta))] \approx \frac{1}{N} \sum_{i=1}^N f_\theta (g(\epsilon_i ; \theta))\]
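As a small numerical check of the idea above, here is a minimal sketch under assumptions that are mine and not from the original: $p_\theta(x) = \mathcal{N}(\theta, 1)$, $g(\epsilon;\theta) = \theta + \epsilon$, and $f(x) = x^2$ (with no explicit $\theta$ dependence). For this toy case $L(\theta) = \theta^2 + 1$, so the MC estimate should land near $2\theta$. The function name is hypothetical.

```python
import numpy as np

def reparam_grad_estimate(theta, n_samples=100_000, seed=0):
    """MC estimate of d/dtheta E_{x ~ N(theta, 1)}[x**2] using x = theta + eps."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n_samples)   # eps ~ p(eps), independent of theta
    x = theta + eps                        # x = g(eps; theta)
    # Chain rule: d/dtheta f(g(eps; theta)) = f'(x) * dg/dtheta = 2 * x * 1
    return float(np.mean(2.0 * x))

theta = 1.5
print(reparam_grad_estimate(theta), "vs analytic", 2.0 * theta)
```

The sampling noise now lives in `eps`, so the derivative passes through the sample to `theta`.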

Local reparametrization trick

The local reparametrization trick was proposed in the paper Variational Dropout and the Local Reparameterization Trick. Reparameterizing the weight $W$ requires drawing too many random samples, so instead of $W$ we reparametrize the output $y$ directly.

Reparametrizing the $W$ of a linear layer can be set up as follows.

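\[y_i^T = x_i^T W, \quad W\in \mathbb{R}^{1000\times 1000}, \quad x_i \in \mathbb{R}^{1000}\]

Assume $q(W) = \mathcal{N}(\mu, \operatorname{diag}(\sigma^2))$, i.e. each weight $w_{k,j}$ is an independent Gaussian with mean $\mu_{k,j}$ and variance $\sigma_{k,j}^2$.

Computed this way, with batch size $B$ we need $B\times 1000\times 1000$ random samples in total (a separate weight sample per example). If we reparametrize $y$ instead, we only need $B\times 1000$ random samples. And since a sum of independent Gaussians is again Gaussian, $y$ can be written as follows.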

\[y_{i,j} \sim \mathcal{N} \left(\sum_{k=1}^{1000} x_{i,k} \mu_{k,j},\; \sum_{k=1}^{1000} x_{i,k}^2 \sigma_{k,j}^2\right)\]

So the reparametrized $y$ can be written as follows, with $\epsilon_{i,j} \sim \mathcal{N}(0, 1)$.

\[y_{i,j} = \sum_{k=1}^{1000} x_{i,k} \mu_{k,j} + \epsilon_{i,j} \cdot\sqrt{\sum_{k=1}^{1000} x_{i,k}^2 \sigma_{k,j}^2}\]

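In this case we sample noise for $y$ rather than for the large matrix $W$, so training is much more efficient.

In short, reparametrizing a large matrix is too inefficient, so the local reparametrization trick reparametrizes a smaller matrix instead. It uses the linearity of Gaussians (sums and scalar multiples of independent Gaussians are still Gaussian) so that the random noise is sampled from a Gaussian on the smaller matrix! It seems like a good trick to know alongside the reparametrization trick.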
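The two sampling schemes can be compared numerically. Below is a minimal NumPy sketch with small, made-up dimensions (5 inputs, 3 outputs instead of 1000 by 1000). The function names and test values are hypothetical. It repeats the same input row many times to check that both schemes produce outputs with matching mean and standard deviation.

```python
import numpy as np

def sample_y_weight_noise(x, mu, sigma, rng):
    """Sample a separate W per example, then compute y_i = x_i^T W_i."""
    eps = rng.standard_normal((x.shape[0],) + mu.shape)  # (batch, d_in, d_out) noise
    w = mu + sigma * eps
    return np.einsum("bk,bkj->bj", x, w)

def sample_y_local(x, mu, sigma, rng):
    """Sample y directly from its Gaussian: only (batch, d_out) noise values."""
    mean = x @ mu
    std = np.sqrt((x ** 2) @ (sigma ** 2))
    eps = rng.standard_normal(mean.shape)
    return mean + std * eps

rng = np.random.default_rng(0)
x = np.tile(np.linspace(-1.0, 1.0, 5), (20_000, 1))  # same input row repeated
mu = np.arange(15.0).reshape(5, 3) / 10.0
sigma = np.full((5, 3), 0.3)

y_w = sample_y_weight_noise(x, mu, sigma, rng)
y_l = sample_y_local(x, mu, sigma, rng)
print("mean:", y_w.mean(axis=0), y_l.mean(axis=0))
print("std: ", y_w.std(axis=0), y_l.std(axis=0))
```

The first function draws `batch * d_in * d_out` noise values and the second only `batch * d_out`, which is the saving described above.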

[Computer Network] Analyzing Packets with Wireshark

2023-10-23T15:00:00+00:00

ping, nslookup, nmap (port analysis)

TCP 3-way handshake, DNS handshake, TLS handshake (https://babbab2.tistory.com/7)

[AI] DDPM: Denoising Diffusion Probabilistic Model

2023-09-30T15:00:00+00:00

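After I started studying diffusion I was drowning in a party of equations, so I pulled myself together to record the derivation step by step. It is really complicated and long; I think I spent several days on the derivation alone. Anyway, this post organizes the math behind the DDPM loss derivation, and may differ from my earlier paper reviews. (It is closer to a summary of the equations than of the paper.) I'm posting it for my own future reference, so it may be idiosyncratic.

To actually understand the model rather than skip the equations and just run code, I'm leaving links to the papers/posts I read. Equation party 1, if you only want the loss derivation: Understanding Diffusion Models: A Unified Perspective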

q, p

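DDPM tends to set up its math much like a VAE (it is a likelihood-based model, after all). In a VAE, the $p$ we cannot fit (described as intractable) is approximated by $q$, and $q$ is optimized toward $p$ through the ELBO. In DDPM, the forward $q$ is defined up front as a known Gaussian, and the reverse $p$ is inferred (learned) through the ELBO using $q$. Two assumptions are made: the process is a Markov chain, and the transition at each step follows a Gaussian distribution. (Those two assumptions are what make the whole derivation possible.) I hear DDIM does not use a Markov chain; I should read it once I'm done with DDPM.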

forward process

We define q at each step as follows. Each step follows a Gaussian. That is, if I know step t-1, I can sample the image at step t (because q is known).

\[q(x_t|x_{t-1})=\mathbf{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)\] \[q(x_{1:T}|x_0)=\prod_{t=1}^T q(x_t|x_{t-1})\]

backward process

In a similar way we can define the p that has to be inferred. To know p, what we have to fit (learn) is the mean and variance of each step (since it is Gaussian).

\[p_\theta(x_{0:T}):=p(x_T)\cdot \prod_{t=1}^T p_\theta(x_{t-1}|x_t)\] \[p_\theta(x_{t-1}|x_t):=\mathbf{N}(x_{t-1}; \mu_\theta(x_t,t),\Sigma_\theta(x_t,t))\]

This resembles the VAE ELBO derivation, which I did in another post (link), so I'll lean on that. Several ELBOs can be derived for DDPM; here I only explain the most tractable one, i.e. the ELBO that DDPM uses. (I'll add the intractable ones if I have the energy.)

minimize negative log likelihood (ELBO)

\[\mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}[-\log {p_\theta(x_0)}]\] \[= \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_0, x_1, ..., x_T) \over p_\theta(x_1, x_2, ..., x_T|x_0)}\right]\]

Multiply the argument of the log by ${q(x_{1:T}\mid x_0)\over q(x_{1:T}\mid x_0)}$ (which is just 1).

\[= \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_0, x_1, ..., x_T) \over p_\theta(x_1, x_2, ..., x_T|x_0)}\cdot{q(x_{1:T}\mid x_0)\over q(x_{1:T}\mid x_0)}\right]\]

Pair $p_\theta(x_0, x_1, ..., x_T)$ with $q(x_{1:T}\mid x_0)$, and pair $p_\theta(x_1, x_2, ..., x_T\mid x_0)$ with $q(x_{1:T}\mid x_0)$. Since it is a log, split it into a difference; the second term then has the form of a KL divergence.

\[= \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_0, x_1, ..., x_T) \over q(x_{1:T}\mid x_0)}- \log{q(x_{1:T}\mid x_0)\over p_\theta(x_1, x_2, ..., x_T|x_0)}\right]\]

The KL divergence looks roughly like this. Just rewrite the integral in the expectation notation used here (future me will remember this much, right?).

\[D_{KL}(P||Q)=\int p(x)\log{p(x) \over q(x)}dx\] \[= \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_0, x_1, ..., x_T) \over q(x_{1:T}\mid x_0)}\right]-D_{KL}(q(x_{1:T}\mid x_0)\mid\mid p_\theta(x_{1:T}\mid x_0))\]

The KL divergence is always at least 0, so dropping it leaves an inequality with only the first term.

\[\le \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_0, x_1, ..., x_T) \over q(x_{1:T}\mid x_0)}\right]\] \[= \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_{0:T}) \over q(x_{1:T}\mid x_0)}\right]\]

By the Markov chain,

\[= \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_T)\cdot \prod_{t=1}^T p_\theta(x_{t-1}\mid x_t) \over \prod_{t=1}^T q(x_t\mid x_{t-1}) }\right]\]

It is a log, so split the product into a sum again:

\[= \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_T)} - \sum_{t=1}^T \log{p_\theta(x_{t-1}\mid x_t) \over q(x_t\mid x_{t-1})} \right]\]

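That gets us roughly a quarter of the way through the loss derivation. Now I want to flip $q(x_t\mid x_{t-1})$ so that it points the same way as the p above. We know $q(x_t\mid x_{t-1})$, so if we can derive $q(x_{t-1}\mid x_{t})$ from it we are unstoppable. Let's derive it.

Because of the Markov property, additionally conditioning on $x_0$ changes nothing (given $x_{t-1}$, $x_t$ does not depend on $x_0$).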

\[q(x_t|x_{t-1})=q(x_t\mid x_{t-1}, x_0)\]

By the definition of conditional probability, the following holds.

\[q(x_t\mid x_{t-1}, x_0)={q(x_t, x_{t-1}, x_0)\over q(x_{t-1}, x_0)}\]

Plug this in and multiply by ${q(x_t, x_0)\over q(x_t, x_0)}$.

\[q(x_t|x_{t-1})= {q(x_t, x_{t-1}, x_0)\over q(x_{t-1}, x_0)} \cdot {q(x_t, x_0)\over q(x_t, x_0)}\]

Group $q(x_t, x_{t-1}, x_0)$ with $q(x_t, x_0)$, and group the two remaining factors together:

\[q(x_t|x_{t-1})= {q(x_t, x_{t-1}, x_0)\over q(x_t, x_0)} \cdot {q(x_t, x_0)\over q(x_{t-1}, x_0)}\]

This gives the relation between the two.

\[q(x_t|x_{t-1})= {q(x_{t-1}\mid x_t, x_0)} \cdot {q(x_t, x_0)\over q(x_{t-1}, x_0)}\]

Back to the loss derivation. Plug the q we just derived into the terms with $t \ge 2$, keeping the $t=1$ term as it is:

\[= \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_T)} - \sum_{t=1}^T \log{p_\theta(x_{t-1}\mid x_t) \over q(x_t\mid x_{t-1})} \right]\] \[= \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_T)} - \sum_{t=2}^T \log{p_\theta(x_{t-1}\mid x_t) \over {q(x_{t-1}\mid x_t, x_0)} \cdot {q(x_t, x_0)\over q(x_{t-1}, x_0)}} - \log {p_\theta (x_0\mid x_1) \over q(x_1\mid x_0)}\right]\]

Split it up carefully.

\[= \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_T)} - \sum_{t=2}^T \log{p_\theta(x_{t-1}\mid x_t)\cdot q(x_{t-1}, x_0) \over {q(x_{t-1}\mid x_t, x_0)} \cdot {q(x_t, x_0)} } - \log {p_\theta (x_0\mid x_1) \over q(x_1\mid x_0)}\right]\] \[= \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_T)} - \sum_{t=2}^T \log{p_\theta(x_{t-1}\mid x_t) \over {q(x_{t-1}\mid x_t, x_0)} } - \sum_{t=2}^T \log {q(x_{t-1}, x_0) \over {q(x_t, x_0)}} - \log {p_\theta (x_0\mid x_1) \over q(x_1\mid x_0)}\right]\]

The term $\sum_{t=2}^T \log {q(x_{t-1}, x_0) \over {q(x_t, x_0)}}$ telescopes. After cancellation only the numerator of the first summand (t=2) and the denominator of the last summand (t=T) remain. So

\[= \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_T)} - \sum_{t=2}^T \log{p_\theta(x_{t-1}\mid x_t) \over {q(x_{t-1}\mid x_t, x_0)} } - \log {q(x_1, x_0) \over q(x_T, x_0)} - \log {p_\theta (x_0\mid x_1) \over q(x_1\mid x_0)}\right]\]

Splitting the logs again, using ${q(x_1, x_0) \over q(x_T, x_0)} = {q(x_1\mid x_0) \over q(x_T\mid x_0)}$, and cancelling the shared $q(x_1\mid x_0)$ gives the final loss!

\[L_{VLB}= \mathbf{E}_{x_{1:T} \sim q(x_{1:T}|x_0)}\left[-\log {p_\theta(x_T) \over q(x_T\mid x_0)} - \sum_{t=2}^T \log{p_\theta(x_{t-1}\mid x_t) \over {q(x_{t-1}\mid x_t, x_0)} } - \log {p_\theta (x_0\mid x_1)}\right]\]

reparameterization

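Here the first term pulls p and q together at the final step, i.e. it is a KL divergence to be minimized. The third term is more or less an initial condition: a reconstruction term at the first step. The one we need to care about is the second term. This is where the parameterization technique comes in.

Earlier we defined the mean and variance of the forward q.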

\[q(x_t|x_{t-1})=\mathbf{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)\]

But our loss asks for the mean and variance ($\tilde \mu(x_t,x_0)$ and $\tilde \beta_t$) of the reverse q, ${q(x_{t-1}\mid x_t, x_0)}$. This is also a Gaussian. So we have to find its mean and variance from the relation with $q(x_t\mid x_{t-1})$ that we worked so hard to derive above.

\[q(x_{t-1}|x_{t}, x_0)= \mathbf{N}(x_{t-1}; \tilde \mu(x_t,x_0), \tilde \beta_t I)\] \[q(x_t|x_{t-1})= {q(x_{t-1}\mid x_t, x_0)} \cdot {q(x_t, x_0)\over q(x_{t-1}, x_0)}\]

For a moment, let $\alpha_t=1-\beta_t$ and rewrite everything in terms of alpha. (I get too confused otherwise.)

\[q(x_t|x_{t-1})=\mathbf{N}(x_t; \sqrt{\alpha_t}x_{t-1}, (1-\alpha_t) I)\]

Because everything is Gaussian and the chain is Markov, we can jump directly from step 0 to step t. Repeatedly unrolling the recurrence $x_t=\sqrt{\alpha_t}\cdot x_{t-1}+\sqrt{1-\alpha_t}\epsilon$ gives the form below, where $\bar\alpha_t=\prod_{s=1}^t \alpha_s$. (I'll write it out in detail later if I get the chance, but I trust you can manage.)

\[q(x_t|x_{0})=\mathbf{N}(x_t; \sqrt{\bar\alpha_t}x_{0}, (1-\bar\alpha_t) I)\]
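The jump from step 0 to step t can be checked numerically by unrolling the recurrence and tracking the coefficient on $x_0$ and the accumulated noise variance. This is a minimal sketch; the linear $\beta$ schedule and $T=1000$ are assumptions for illustration, and `q_sample` is a hypothetical helper name.

```python
import numpy as np

T = 1000
betas = np.linspace(1e-4, 0.02, T)   # assumed linear schedule, for illustration only
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)       # bar(alpha)_t = prod_{s<=t} alpha_s

# Unroll x_t = sqrt(alpha_t) x_{t-1} + sqrt(1 - alpha_t) eps, tracking the
# coefficient multiplying x_0 and the variance of the accumulated noise.
coef, var = 1.0, 0.0
coefs, variances = [], []
for a in alphas:
    coef = np.sqrt(a) * coef
    var = a * var + (1.0 - a)
    coefs.append(coef)
    variances.append(var)
coefs = np.array(coefs)
variances = np.array(variances)

def q_sample(x0, t, eps):
    """Draw x_t from q(x_t | x_0) in one step (t is a 0-based index)."""
    return np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps

print(np.allclose(coefs, np.sqrt(alpha_bar)), np.allclose(variances, 1.0 - alpha_bar))
```

The unrolled coefficient and variance match $\sqrt{\bar\alpha_t}$ and $1-\bar\alpha_t$, which is why a single Gaussian draw is enough to produce $x_t$.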

Rearranging the relation above gives $q(x_{t-1}\mid x_t, x_0) = q(x_t\mid x_{t-1}) \cdot {q(x_{t-1}, x_0)\over q(x_t, x_0)}$, and since $q(x_0)$ cancels in the ratio, all three factors on the right can now be written as Gaussians. Just plug in the two expressions right above.

\[q(x_{t-1}|x_t, x_0)= \mathbf{N}(x_t; \sqrt{\alpha_t}x_{t-1}, (1-\alpha_t) I) \cdot {\mathbf{N}(x_{t-1}; \sqrt{\bar\alpha_{t-1}}x_{0}, (1-\bar\alpha_{t-1}) I)\over \mathbf{N}(x_t; \sqrt{\bar\alpha_t}x_{0}, (1-\bar\alpha_t) I)}\] \[q(x_{t-1}|x_t, x_0)= \mathbf{N}(x_t; \sqrt{\alpha_t}x_{t-1}, \beta_t I) \cdot {\mathbf{N}(x_{t-1}; \sqrt{\bar\alpha_{t-1}}x_{0}, (1-\bar\alpha_{t-1}) I)\over \mathbf{N}(x_t; \sqrt{\bar\alpha_t}x_{0}, (1-\bar\alpha_t) I)}\]

Let's pause and look at the actual Gaussian density, because we need it. Every factor is now in a form where its mean and variance are visible, so the expression above can be written out with the density below. $p(x)={1\over \sigma \sqrt{2\pi}} \exp (-{(x-\mu)^2 \over 2\sigma^2})$

Drop the leading constants (${1\over \sigma \sqrt{2\pi}}$) and write $\propto$ instead of an equals sign.

\[q(x_{t-1}|x_t, x_0) \propto \exp (-{1\over 2}({(x_t-\sqrt{\alpha_t}x_{t-1})^2\over \beta_t} + {(x_{t-1}-\sqrt{\bar\alpha_{t-1}}x_0)^2\over (1-\bar\alpha_{t-1})} - {(x_t-\sqrt{\bar\alpha_t}x_0)^2\over (1-\bar\alpha_t)}))\]

Let's expand everything (in order to put it into a new Gaussian form). The part written as const does not depend on $x_{t-1}$ and is not needed for the Gaussian form, so I did not derive it.

\[= \exp (-{1\over 2}({x_t^2-2\sqrt{\alpha_t}x_t x_{t-1}+\alpha_t x_{t-1}^2\over \beta_t}+{x_{t-1}^2-2\sqrt{\bar\alpha_{t-1}}x_{t-1}x_0+\bar\alpha_{t-1}x_0^2\over 1-\bar\alpha_{t-1}}-{x_t^2-2\sqrt{\bar\alpha_t}x_tx_0+\bar\alpha_t x_0^2\over 1-\bar\alpha_t}))\] \[= \exp (-{1\over 2}(({\alpha_t\over \beta_t}+{1\over 1-\bar\alpha_{t-1}})x_{t-1}^2-({2\sqrt{\bar\alpha_{t-1}}\cdot x_0 \over 1- \bar\alpha_{t-1}}+{2\sqrt{\alpha_t}x_t\over \beta_t})x_{t-1}+const))\]

The expression is too messy, so substitute $A$ for $({\alpha_t\over \beta_t}+{1\over 1-\bar\alpha_{t-1}})$ and $B$ for $({2\sqrt{\bar\alpha_{t-1}}\cdot x_0 \over 1- \bar\alpha_{t-1}}+{2\sqrt{\alpha_t}x_t\over \beta_t})$, and complete the square to get a Gaussian form.

\[= \exp (-{1\over 2}(Ax_{t-1}^2-Bx_{t-1}+const))\] \[= \exp (-{1\over 2}(A(x_{t-1}^2-{B\over A}x_{t-1}+\left({B\over 2A}\right)^2)+const'))\] \[= \exp (-{1\over 2}(A(x_{t-1}-{B\over 2A})^2+const'))\]

So this Gaussian has mean $B\over 2A$ and variance $1\over A$. That is, these are the mean ($\tilde\mu_t$) and variance ($\tilde\beta_t$) of $q(x_{t-1}\mid x_t, x_0)$. Substituting A and B back in:

\[q(x_{t-1}|x_{t}, x_0)= \mathbf{N}(x_{t-1}; \tilde \mu(x_t,x_0), \tilde \beta_t I)\] \[\tilde \mu(x_t,x_0)={B \over 2A}={\frac{\sqrt{\alpha_t}}{\beta_t} x_t + {\sqrt{\bar\alpha_{t-1}}\over 1-\bar\alpha_{t-1}} x_0 \over {\alpha_t \over \beta_t}+{1\over 1-\bar\alpha_{t-1}}}\]

The denominator simplifies nicely. Putting it over a common denominator and using $\alpha_t \bar\alpha_{t-1} = \bar\alpha_t$ and $\beta_t = 1-\alpha_t$:

\[{\alpha_t \over \beta_t}+{1\over 1-\bar\alpha_{t-1}}={\alpha_t - \alpha_t \bar\alpha_{t-1} + \beta_t\over \beta_t (1-\bar\alpha_{t-1})}={\alpha_t - \bar\alpha_{t} + 1-\alpha_t \over \beta_t (1-\bar\alpha_{t-1})}={1-\bar\alpha_t \over \beta_t (1-\bar\alpha_{t-1})}\]

So $\tilde\mu_t$ is

\[\tilde \mu(x_t,x_0)=({\frac{\sqrt{\alpha_t}}{\beta_t} x_t + {\sqrt{\bar\alpha_{t-1}}\over 1-\bar\alpha_{t-1}} x_0})\cdot {(1-\bar\alpha_{t-1})\beta_t \over 1-\bar\alpha_t}\] \[\tilde \mu(x_t,x_0)={\frac{\sqrt{\alpha_t}(1-\bar\alpha_{t-1})}{1-\bar\alpha_t} x_t + {\sqrt{\bar\alpha_{t-1}}\beta_t\over 1-\bar\alpha_t} x_0}\]

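Let's simplify $\tilde\mu_t$ one more time. Earlier, from the recurrence $x_t=\sqrt{\alpha_t}\cdot x_{t-1}+\sqrt{1-\alpha_t}\epsilon$, we derived the expression that jumps from step 0 to step t. Bring it back.

\[x_t=\sqrt{\bar\alpha_t} x_0+\sqrt{1-\bar\alpha_t} \epsilon_t\]

Solving this for $x_0$ gives $x_0={x_t-\sqrt{1-\bar\alpha_t}\epsilon_t \over \sqrt{\bar\alpha_t}}$.

Substituting this into $\tilde\mu_t$, the mean of the reverse q, and simplifying:

\[\tilde \mu(x_t,x_0)={1\over\sqrt{\alpha_t}}(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_t)\]

The variance of the reverse q, $\tilde\beta_t$, is $1/A$ from above (derivation omitted; typing the above was exhausting, and it follows the same way):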

\[\tilde \beta_t = {1-\bar\alpha_{t-1}\over 1-\bar\alpha_t}\cdot \beta_t\]
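Since this derivation is easy to get wrong by a factor, here is a minimal numerical check that the three expressions for the posterior agree: $B/(2A)$ and $1/A$ from completing the square, the closed form in $x_t$ and $x_0$, and the $\epsilon$ form of the mean. The linear $\beta$ schedule and all function names are assumptions for illustration.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def posterior_from_AB(x_t, x0, t):
    """Mean B/(2A) and variance 1/A from completing the square (0-based t >= 1)."""
    A = alphas[t] / betas[t] + 1.0 / (1.0 - alpha_bar[t - 1])
    B = (2.0 * np.sqrt(alphas[t]) * x_t / betas[t]
         + 2.0 * np.sqrt(alpha_bar[t - 1]) * x0 / (1.0 - alpha_bar[t - 1]))
    return B / (2.0 * A), 1.0 / A

def posterior_closed_form(x_t, x0, t):
    """Simplified mean and variance of q(x_{t-1} | x_t, x_0)."""
    mean = (np.sqrt(alphas[t]) * (1.0 - alpha_bar[t - 1]) * x_t
            + np.sqrt(alpha_bar[t - 1]) * betas[t] * x0) / (1.0 - alpha_bar[t])
    var = (1.0 - alpha_bar[t - 1]) / (1.0 - alpha_bar[t]) * betas[t]
    return mean, var

def posterior_mean_eps_form(x_t, eps, t):
    """Mean written in terms of the noise eps that produced x_t from x_0."""
    return (x_t - betas[t] / np.sqrt(1.0 - alpha_bar[t]) * eps) / np.sqrt(alphas[t])

x0, eps, t = 0.7, -1.3, 500
x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
print(posterior_from_AB(x_t, x0, t))
print(posterior_closed_form(x_t, x0, t))
print(posterior_mean_eps_form(x_t, eps, t))
```

All three means should print the same value, and the two variances should match as well.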

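To summarize so far: deriving the loss through the ELBO gave three terms, and from the mean and variance of the forward q that we defined ourselves ($\sqrt{\alpha_t}x_{t-1}$ and $1-\alpha_t$) we derived the mean and variance of the reverse q ($\tilde \mu(x_t,x_0), \tilde \beta_t$).

simplification

Now we can finally simplify the second term of the loss mentioned above. Every expression has been derived. All that is left is to plug them in.

Let's recall the loss. DDPM fixes the variance (set by $\beta$) instead of learning it, so the first loss term has no trainable parameters and is a constant. The third term is the reconstruction from an image with a very small amount of noise back to the original image, so it can be ignored here. In the end only the second term needs careful handling.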

Because we have stubbornly assumed that every distribution is Gaussian, the second loss term, which minimizes a KL divergence, can be written much more simply. (This is the Wikipedia formula the paper above refers to: KL divergence between two Gaussian distributions.)

\[D_{KL}(\mathbf{N}(x;\mu_x,\Sigma_x)\mid\mid \mathbf{N}(y;\mu_y,\Sigma_y))={1\over 2}[\log \frac{|\Sigma_y|}{|\Sigma_x|}-d+tr(\Sigma_y^{-1}\Sigma_x)+(\mu_y-\mu_x)^T\Sigma_y^{-1}(\mu_y-\mu_x)]\]

DDPM also assumes the two variances are identical (the variance is fixed!), which makes the problem even easier:

\[D_{KL}(q(x_{t-1}|x_t, x_0)\mid\mid p_\theta(x_{t-1}|x_t))={1\over 2}[\log \frac{|\Sigma_q(t)|}{|\Sigma_q(t)|}-d+tr(\Sigma_q(t)^{-1}\Sigma_q(t))+(\mu_\theta-\mu_q)^T\Sigma_q(t)^{-1}(\mu_\theta-\mu_q)]\] \[={1\over 2}[\log 1-d+d+(\mu_\theta-\mu_q)^T\Sigma_q(t)^{-1}(\mu_\theta-\mu_q)]\] \[={1\over 2}[(\mu_\theta-\mu_q)^T(\sigma_q^2(t)I)^{-1}(\mu_\theta-\mu_q)]\] \[={1\over2\sigma_q^2(t)}[||\mu_\theta-\mu_q||^2_2]\]
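The reduction to a scaled squared distance can be confirmed numerically. This is a small sketch that evaluates the general Gaussian KL formula above with both covariances set to $\sigma^2 I$; the function name and the test vectors are made up for illustration.

```python
import numpy as np

def kl_gaussians(mu_x, cov_x, mu_y, cov_y):
    """KL( N(mu_x, cov_x) || N(mu_y, cov_y) ) using the closed form above."""
    d = mu_x.shape[0]
    cov_y_inv = np.linalg.inv(cov_y)
    diff = mu_y - mu_x
    _, logdet_x = np.linalg.slogdet(cov_x)
    _, logdet_y = np.linalg.slogdet(cov_y)
    return 0.5 * (logdet_y - logdet_x - d
                  + np.trace(cov_y_inv @ cov_x)
                  + diff @ cov_y_inv @ diff)

sigma2 = 0.04                              # shared, fixed variance
mu_q = np.array([0.1, -0.2, 0.3])          # stands in for the reverse-q mean
mu_theta = np.array([0.5, 0.0, -0.1])      # stands in for the model mean
cov = sigma2 * np.eye(3)

kl = kl_gaussians(mu_q, cov, mu_theta, cov)
mse_form = np.sum((mu_theta - mu_q) ** 2) / (2.0 * sigma2)
print(kl, mse_form)
```

With equal covariances the log-determinant and trace terms cancel against $d$, leaving only the squared distance between the means.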

So it ends up being the same as an MSE between $\mu_\theta$ and $\tilde\mu_t$. In other words, $\mu_\theta$ is trained to get close to $\tilde\mu_t$. Recall the $\tilde\mu_t$ we worked so hard to derive above:

\[\tilde \mu(x_t,x_0)={1\over\sqrt{\alpha_t}}(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_t)\]

Since $\tilde\mu_t$ and $\mu_\theta$ have to get close anyway, it pays to give $\mu_\theta$ exactly the same form as $\tilde\mu_t$ except for the epsilon. So $\mu_\theta$ is

\[\mu_\theta(x_t,t)={1\over\sqrt{\alpha_t}}(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t, t))\]

This means we are optimizing epsilon. The model becomes one that, through this reparameterization, predicts the noise ($\epsilon_\theta(x_t, t)$). ($L_t$ is the second loss term.) To summarize:

\[L_t={1\over2\sigma_q^2(t)}[||\mu_\theta-\mu_q||^2_2]\] \[L_t={1\over2\sigma_q^2(t)}\left[\left\|{1\over \sqrt{\alpha_t}}\left(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_\theta(x_t, t)\right)-{1\over \sqrt{\alpha_t}}\left(x_t-\frac{1-\alpha_t}{\sqrt{1-\bar\alpha_t}}\epsilon_t\right)\right\|^2_2\right]\] \[L_t={(1-\alpha_t)^2\over2\alpha_t(1-\bar\alpha_t)\sigma_q^2(t)}[||\epsilon_t-\epsilon_\theta(x_t, t)||^2_2]\]

We also have the expression for x_t in terms of x_0, so substituting it gives

\[x_t=\sqrt{\bar\alpha_t} x_0+\sqrt{1-\bar\alpha_t} \epsilon_t\] \[L_t={(1-\alpha_t)^2\over2\alpha_t(1-\bar\alpha_t)\sigma_q^2(t)}[||\epsilon_t-\epsilon_\theta(\sqrt{\bar\alpha_t} x_0+\sqrt{1-\bar\alpha_t} \epsilon_t, t)||^2_2]\]

Ignoring the first and third terms and using only $L_t$ (DDPM additionally drops the weighting coefficient in front of the norm), we get what is called $L_{simple}$. In the end this is the loss.
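The resulting training objective can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the linear $\beta$ schedule, the batch shapes, and `eps_model` (a stand-in callable for the noise-prediction network) are all assumptions, and the weighting coefficient is dropped as in $L_{simple}$.

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
alpha_bar = np.cumprod(1.0 - betas)

def simple_loss(x0, t, eps, eps_model):
    """Unweighted noise-prediction loss: mean of (eps - eps_model(x_t, t))**2."""
    ab = alpha_bar[t][:, None]                           # one bar(alpha)_t per example
    x_t = np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps     # jump straight to step t
    return float(np.mean((eps - eps_model(x_t, t)) ** 2))

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 16))        # toy "images", flattened
t = rng.integers(0, 1000, size=8)        # a random timestep per example
eps = rng.standard_normal(x0.shape)

def zero_model(x_t, t):
    """Hypothetical placeholder for a trained network: always predicts zero noise."""
    return np.zeros_like(x_t)

print(simple_loss(x0, t, eps, zero_model))
```

In real training `eps_model` would be a neural network and this scalar would be minimized with respect to its parameters.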

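Summary

In the end, diffusion defines the easy-to-imagine process of adding a little Gaussian noise over t steps as the forward q, and poses the problem of inferring the reverse p. The reverse q can be derived from the forward q, and the ELBO loss comes out in a form that pulls the reverse p and the reverse q together (a KL divergence), which is what makes training possible. Because the reverse p and reverse q (and the forward process too) are all Gaussian, the problem becomes one of matching two variables, the mean and the variance, of the two distributions. DDPM even fixes the variance, so it reduces to matching the means, and since the mean is expressed through the random noise, it simplifies to matching the noise of p and q!

The derivation was really exhausting. I started regretting it around the time I was writing out the Gaussian density, but I got there somehow. I think I've used up a lifetime's worth of LaTeX. I should come back to this whenever I forget.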

[MAR] Unsupervised CT Metal Artifact Learning Attention-guided β-CycleGAN

2023-04-07T00:00:00+00:00

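This is my second MAR paper. By the nature of MAR research, once metal has been implanted in a patient you cannot obtain a CT of that patient without the metal, so a lack of data is unavoidable. I tackled this with data synthesis, but I was curious how it is solved in an unsupervised way, so I read this paper!

CycleGAN, the usual baseline for unsupervised i2i, has the problem that generation in the input domain works well while generation in the target domain does not. I wanted to know whether that problem showed up here too, and if so whether they tried to solve it. (Having read it, that seems to be why it is a β-CycleGAN.)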

Introduction

FDK(Feldkamp, Davis and Kress) is the most widely used algorithm
MAR has become important as metal implants are increasingly placed near the ROI
- X-ray photons cannot penetrate the metallic object consistently due to the object’s high attenuation
- this causes severe streaking and shading artifacts that deteriorate the image quality in reconstructed images

Conventional Methods

modify the sinogram and reconstruct objects by removing the corrupted sinogram and interpolating it from adjacent data

Most conventional methods operate in the sinogram domain, and the dominant approach is to mask the metal artifact region and interpolate the mask from neighboring pixel values.

LI-MAR
- replaces the metallic parts in the original sinogram with linear interpolated values from the boundaries
- this usually causes new artifacts due to inaccurate values interpolated in the metallic parts in the sinogram
NMAR
- interpolates using normalized sinogram values
- still have a limitation for general applications due to the difficulty of optimal parameter selection

⇒ difficulty of optimal parameter selection (in other words, they will use deep learning to find the optimal parameters instead of a rule-based method.)

iterative reconstruction methods (iMAR)
- there has been work showing that iterative CT reconstruction removes artifacts, but in my experience it does not remove them completely; it is more of an "oh, it has some effect" result.
- and, unsurprisingly, extremely high computational complexity

DL-based Methods (supervised)

CNN based, pix2pix
sinogram network and the image network by learning two CNNs (dudonet)

⇒ all supervised methods

⇒ to utilize pairs of unmatched images, unsupervised learning approaches should be used.

A GAN can match the distribution of the input domain to the distribution of the target domain well

However, a GAN on its own suffers from mode collapse, hence the cycle-consistent adversarial network

Recently, the mathematical origin of cycleGAN was revealed using optimal transport theory as an unsupervised distribution matching between two probability spaces.

Unsupervised MAR Methods

ADN
- artifact disentanglement network
- ADN separates an artifact-affected image by encoding its artifact and content components and sending each to its own space
- if the disentanglement works, the content component should hold all of the content information and none of the artifact information
- drawbacks
  - highly complicated network architecture due to the explicit disentanglement steps.
  - artificial features appear for artifacts that were not seen in training
- it does not seem all that effective; the contribution seems to be the viewpoint of disentangling the artifact. The downside is that the model structure becomes very complicated in order to do the disentanglement.
method proposed by Ranzini et al.
- uses paired MRI and CT

β-VAE for feature space disentanglement

The paper takes its idea from β-VAE, so let's go over the basics of β-VAE.

maximize ELBO!
- $p_\theta(x|z)$ : decoder (z→x)
- $q_\phi(z|x)$ : encoder (x→z), a user-chosen posterior distribution model parameterized by $\phi$

objective function: $\log p_{\theta}(x) = \log(\int p_{\theta}(x|z)p(z)dz)= \log(\int p_{\theta}(x|z){p(z) \over q_\phi(z|x)}q_\phi(z|x)dz)$

By Jensen’s inequality,

\[\log(\int p_{\theta}(x|z){p(z) \over q_\phi(z|x)}\;q_\phi(z|x)dz) \ge \int \log({p_\theta(x|z)p(z) \over q_\phi(z|x)})\;q_\phi(z|x)dz\]

This can be split into two parts:

\[\int \log({p_\theta(x|z)p(z) \over q_\phi(z|x)})\;q_\phi(z|x)dz = \int \log(p_\theta(x|z))q_\phi(z|x)dz + \int \log({p(z)\over q_\phi(z|x)})q_\phi(z|x)dz\]

To bring in the KL divergence, look at its definition.

Kullback–Leibler divergence

\[D_{KL}(P||Q)=\int p(x)\log{p(x) \over q(x)}dx\]

Applying this,

\[\int \log({p(z)\over q_\phi(z|x)})q_\phi(z|x)dz = -\int \log({q_\phi(z|x)\over p(z)})q_\phi(z|x)dz = -D_{KL}(q_\phi(z|x)||p(z))\]

So

\[\log p_{\theta}(x) \ge \int \log\, p_\theta(x|z) q_\phi(z|x)dz - D_{KL}(q_\phi(z|x)||p(z))\] \[- \log p_{\theta}(x) \le - \int \log\, p_\theta(x|z) q_\phi(z|x)dz + D_{KL}(q_\phi(z|x)||p(z))\]

Since the KL divergence is bounded below by 0, this upper bound can be minimized.

Putting a weight $\beta$ on the KL-divergence term of this VAE objective gives what is called $\beta$-VAE. The reason for adding beta is:

As a high $\beta$ imposes more constraint on the latent space, it turns out that the latent space is more interpretable and controllable, which is known as the disentanglement.
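- The smaller $\beta$ is, the weaker the constraint on z and the greater the expressiveness; the larger it is, the more z is pushed toward a single Gaussian and the lower the expressiveness (since a VAE fits the latent distribution to a Gaussian)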
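The $\beta$-weighted objective above can be written as a short function. This is a sketch under assumptions not stated in the original: a diagonal-Gaussian encoder with a standard normal prior (so the KL term has a closed form), and a squared-error reconstruction term standing in for $-\log p_\theta(x|z)$ up to constants. The function name and default $\beta$ are hypothetical.

```python
import numpy as np

def beta_vae_loss(x, x_recon, mu, logvar, beta=4.0):
    """Negative ELBO with the KL term weighted by beta.

    Assumes q(z|x) = N(mu, diag(exp(logvar))) and p(z) = N(0, I); the squared
    error is a stand-in for -log p(x|z) under a Gaussian decoder.
    """
    recon = np.sum((x - x_recon) ** 2, axis=1)
    # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions
    kl = 0.5 * np.sum(mu ** 2 + np.exp(logvar) - 1.0 - logvar, axis=1)
    return float(np.mean(recon + beta * kl))

x = np.ones((2, 4))
mu = np.array([[1.0], [1.0]])      # one latent dimension, mean 1
logvar = np.zeros((2, 1))          # unit variance
print(beta_vae_loss(x, x, mu, logvar, beta=1.0))
print(beta_vae_loss(x, x, mu, logvar, beta=4.0))
```

With a perfect reconstruction only the KL term remains, so raising `beta` from 1 to 4 scales the penalty for moving the latent away from the prior by the same factor.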

level of feature disentanglement

inspired by β-VAE (variational autoencoder), we control the level of the importance in terms of the statistical distances in the original and target domains using a weighting parameter β.
The point of this paper is to control training in cycleGAN with a weighting parameter β, just as $\beta$-VAE does!

Theory

the transportation from a measure space (Y,ν) to another measure space (X, $\mu$)

$G_\theta : Y -> X$
the transportation from (X, $\mu$) to (Y, v) : $F_\phi : X->Y$
In unsupervised learning, the optimal transport maps can be obtained by minimizing dist($\mu_\theta, \mu$) and dist($v_\phi, v$)

\[G_\theta: W_1(\mu, \mu_\theta) = \inf_\pi \int ||x-G_\theta(y)||d\pi(x, y)\] \[F_\phi: W_1(v, v_\phi) = \inf_\pi \int ||F_\phi(x) - y||d\pi(x, y)\]

Instead of minimizing each one separately, minimize them together over the joint distribution $\pi$

\[\inf_\pi \int ||x-G(y)|| + ||F(x)-y|| d\pi(x,y)\]

The paper “Optimal transport, cycleGAN, and penalized LS for unsupervised learning in inverse problems” shows that the expression above can be written in a dual formulation.

\[\min_{\theta, \phi} \max_{\psi, \varphi} \; l_{cycleGAN}(\theta, \phi, \psi, \varphi) := \lambda l_{cycle}(\theta, \phi) + l_{Disc}(\theta, \phi; \psi, \varphi)\] \[l_{cycle}(\theta, \phi) = \int_X ||x-G_\theta(F_\phi(x))||d\mu(x) + \int_Y||y-F_\phi(G_\theta(y))||dv(y)\] \[l_{Disc}(\theta, \phi; \psi, \varphi) = \max_\psi \int_{X}D_\psi (x) d\mu(x)-\int_Y D_\psi(G_\theta(y))dv(y) + \max_\varphi \int_{Y}D_\varphi (y) dv(y)-\int_X D_\varphi(F_\phi(x))d\mu(x)\]

So this form is identical to the original cycleGAN except that it uses 1-Lipschitz discriminators.
They show that the LS-GAN approach is closely related to imposing the finite Lipschitz condition
- an LS-GAN variation is used as the discriminator term

$\beta$-CycleGAN for metal artifact disentanglement

The point is to give G and F different weights, in the spirit of $\beta$-VAE. (In actual cycleGAN experiments you do run into the problem that G works well while F does not.)

\[l_{\beta-cycleGAN}(\theta, \phi; \psi, \varphi) = \lambda l_{\beta-cycle}(\theta, \phi) + l_{Disc}(\theta, \phi; \psi, \varphi)\] \[l_{\beta-cycle}(\theta, \phi) = \int_X ||x-G_\theta(F_\phi(x))||d\mu(x) + {1 \over \beta}\int_Y||y-F_\phi(G_\theta(y))||dv(y)\]

The degree of metal artifact differs from image to image, and the set can even include clean images
- so for clean images (CT without metal artifact) the output should be identical to the input, and an identity loss is used

\[l_{identity}(\theta, \phi) = \int_X ||x-G_\theta(x)||d\mu(x) + \int_Y||y-F_\phi(y)||dv(y)\]

Final objective:

\[l_{MAR}(\theta, \phi; \psi, \varphi) := l_{\beta-cycle}(\theta, \phi) + l_{Disc}(\theta, \phi; \psi, \varphi)+\gamma l_{identity}(\theta, \phi)\]
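The generator-side terms of this objective can be sketched as plain functions. This is only an illustration under my own assumptions: the integrals are replaced by batch means of absolute differences, `G` and `F` are arbitrary callables standing in for the two networks, the discriminator term is left out, and the cycle term is weighted by `lam` following the $\lambda l_{\beta-cycle}$ definition given earlier. All names are hypothetical.

```python
import numpy as np

def l1(a, b):
    """Batch mean of absolute differences, standing in for the integral of a norm."""
    return float(np.mean(np.abs(a - b)))

def beta_cycle_loss(x, y, G, F, beta):
    """Cycle loss with the Y-side cycle down-weighted by 1/beta."""
    return l1(x, G(F(x))) + l1(y, F(G(y))) / beta

def identity_loss(x, y, G, F):
    """G should leave x in X unchanged; F should leave y in Y unchanged."""
    return l1(x, G(x)) + l1(y, F(y))

def generator_objective(x, y, G, F, beta, lam, gamma):
    """lam * beta-cycle + gamma * identity (discriminator term omitted here)."""
    return lam * beta_cycle_loss(x, y, G, F, beta) + gamma * identity_loss(x, y, G, F)

x = np.ones((2, 4))            # toy batch from X
y = np.ones((2, 4))            # toy batch from Y
G = lambda a: 2.0 * a          # toy stand-in for the network G: Y -> X
F = lambda a: a                # toy stand-in for the network F: X -> Y
print(beta_cycle_loss(x, y, G, F, beta=1.0), beta_cycle_loss(x, y, G, F, beta=10.0))
```

Raising `beta` shrinks the contribution of the $Y \to X \to Y$ cycle, so the optimization puts relatively more weight on the $X \to Y \to X$ cycle.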

Geometry of Attention

Introduces a module that lets the network focus on the regions where artifacts can occur
To address these issues, a method mimicking the human visual system can be a good option because humans exploit a sequence of partial glimpses and selectively focus on salient parts in order to capture the visual structure much better.
The convolutional block attention module(CBAM) is one of the simplest yet effective one.
GAN-based deep convolutional networks are poor at capturing geometric/structural patterns
- due to small receptive field from convolution operator
- a larger kernel size helps, but computational and statistical efficiency drops
attention mechanisms that capture global dependencies
- self attention
  - SAGAN efficiently learns to find global and long-range dependencies within internal representations of images
  - however, calculating the key and query from entire images is often computationally expensive and can cause memory problems as the spatial sizes of input get bigger
- CBAM
  - simple yet effective
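Here they use a CBAM module instead of self-attention, and I agree with that choice. Self-attention is too heavy and does not perform as well as you would hope, whereas CBAM is much lighter and still performs well. (I also considered using CBAM at first.)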

A: spatial attention map, T: channel attention map, Z: feature map

By deep convolutional framelets theory, this is equivalent to a 1x1 conv followed by global pooling
CBAM is a good attention model that keeps both spatial and channel attention while keeping computational complexity low

Channel attention module
- Channel attention focuses on ‘what’ are important channels
- layers
  - average pooling : aggregate spatial information (squeeze)
  - maxpooling : gathering another important clue about distinctive object features
  - two multi-layer perceptron (squeeze)
Spatial attention module
- focuses on ‘where’ the informative part is
- average pooling & maxpooling used
- 7x7 convolution operator

There have been many methods that combine channel-wise attention and spatial attention. CBAM is light because it applies channel attention and then spatial attention sequentially, rather than running them separately and merging the results. (Personally I think that may also make it more stable..?)

Materials and Methods

Dataset

Real Metal Artifact Data
- From the equal-spaced conebeam projection data, we reconstructed the CT images by FDK.
- 504x504x400
- 3 patients for training, 1 patient for validation, 1 patient for test
- 800 with metal artifacts, 400 without metal artifacts
Synthetic Metal Artifact Data
- 10997 artifact-free images from LiTS (Liver Tumor Segmentation Challenge)
- synthesized with the same method as CNN-MAR
- simulates the beam hardening effect and Poisson noise
- 5860 artifact, 4115 clean for train
- 373 artifact, 335 clean for test
- 256x256 (in the comparison experiments ADN was too large to run at 512x512)

Network Architecture

Both $G_\theta$ and $F_\phi$ are a U-net with attention in the skip/concatenation path. From my own experience with MAR, putting the attention module on the skip connection is very important: artifacts that were not removed in the encoder features all come back when they are concatenated with the decoder, so attaching attention to the encoder features and stripping the artifacts off before the concat is the key. The authors seem to have known this and designed the network accordingly.

$D_\psi, D_\varphi$ are PatchGAN-based: 4 conv + fc with batchnorm

Training Detail

input : 504 x 504
$\lambda = 10, \beta = 1, \gamma=1$
the larger $\beta$ is, the more the model tends to focus on artifact-free image generation
- useful in real data case
even images without metal artifacts can contain beam hardening artifacts, so an identity loss is applied
- lessening the value of the hyper-parameter that is involving property that do not need to be changed
50 epoch (early stopping)
xavier initializer
lr : $2\times10^{-3}$
on synthetic data
- 256x256
- $\lambda = 10, \beta = 1, \gamma=5$
- the artifact generation procedure for synthetic data is simple, so the two statistical distances get the same weight
- no artifacts in the images with no metal artifacts, hence the higher identity loss weight

Real data

(a) input (b) proposed (c) LI (d) NMAR (e) ADN with downsampled input (f) proposed with downsampled input

Homogeneous region (tissue)

One disappointment is that there is no comparison with supervised methods. In metal artifact reduction, where synthetic data can be fairly realistic, I suspect it was simply too hard for self-supervised learning to beat supervised learning.

Synthetic data

Ablation Study

CycleGAN w/o CBAM
- artifacts in the background had become fainter compared to those of the input
- real metallic objects are incorrectly removed
Dependency on the disentanglement parameter
- results are better with $\beta$=10

Discussion

→ novel $\beta$-cyclegan with an attention module for MAR (a $\beta$-cyclegan model with an attention module, applied to the MAR task.)

local position with globally radiating artifact ⇒ CBAM to focus on important features in both spatial and channel domain (CBAM applies attention in both the spatial and channel dimensions.)

disentanglement parameter $\beta$ that imposes relative importance on the reconstructed artifact-free images compared to the artifact generation process (the disentanglement parameter $\beta$ puts more weight on the network that removes artifacts (F) than on the network that generates them (G).)

Summary

I read this because I was curious how self-supervised learning is used in MAR, and it does look hard for self-supervised to beat supervised in MAR. It is a bit disappointing that the performance does not keep up, but the paper follows almost exactly how I would have approached the problem with self-supervised learning, so this one paper helped me organize my own thinking. Where and why they use attention is also similar to what I did, which was both surprising and a reminder that everyone is thinking along the same lines. In a task like MAR where synthesized data is so strong, I wonder when and how self-supervised learning will eventually surpass supervised learning.

[AI] PhysFormer: Facial Video-based Physiological Measurement with Temporal Difference Transformer

2022-10-28T00:00:00+00:00

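During my company internship I had to do heart rate estimation, could not find anything usable, and was thoroughly stressed until I found this model. It is Transformer-based and was SOTA at the time of writing. 2D-CNN-based models give results that are barely usable, and most models only predict rPPG, so when you actually compute heart rate from the rPPG the results are dismal. This model, however, extracts heart rate from the rPPG through the PSD and uses it in the loss term, so for my task, where only heart rate is needed, the heart rate comes out at a usable level. Truly a ray of light.

Paper link

Background (what is PPG?)

Link

Thin parts of the body (fingertips, cheeks, ears) let some light through → blood flow can be observed
The subtle change produced when the heart beats to pump blood : pulse wave, Plethysmogram
- measured by tracking the tiny changes in blood volume that follow the pulse wave.

When a PPG sensor shines light onto the skin, the amount of light absorbed depends on the blood volume, which rises and falls with the heart's contraction and relaxation cycle (⇒ the cycle can be recovered ⇒ heart rate). Measuring how much light is absorbed therefore detects changes in blood volume.

Heart rate (bpm) = $60 / PPI$, with the pulse-to-pulse interval PPI in seconds
ECG, by contrast, is the change in electrical potential associated with the heartbeat (strength and spacing of the electrical signal)

rPPG is remote PPG: it predicts the PPG from changes in facial blood flow seen in a facial video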

Introduction

Electrocardiography (ECG) and Photoplethysmography (PPG) are the two main ways of measuring heart activity. Both are inconvenient because they must be attached to the body, which is why remote Photoplethysmography (rPPG) is gaining attention.

3 stage module

Typical rPPG work has three stages. (paper)

a pre-processing stage, to minimize nuisance variation (face ROI extraction and preprocessing)
- detection + skin segmentation and tracking/ landmark
- how well the face image is extracted largely determines rPPG performance, so early work advanced mostly on this side
ppg signal extraction stage
- ex. CNN
a heart rate estimation stage from the estimated PPG signal (the stage that converts ppg to bpm (heart rate))
- peak detection
- FFT (PSD)

The most basic approach

analyze subtle color changes on facial regions of interest with classical signal processing approaches

Next, feature map transformation through color processing (color subspace transformation methods, which utilize all skin pixels for rPPG measurement)

Building on this, learning-based models appeared that extract rPPG from ROI-preprocessed feature maps (non end-to-end)

ROI based preprocessed signal representations(e.g., time-frequency map and spatio-temporal map) are generated first, and then learnable models could capture rPPG features from these maps.
However, they require very strict preprocessing

End-to-end models then appeared that take the video itself as input rather than a preprocessed feature map

treat facial video frames as input and predict rPPG and other physiological signals directly
However,
- with the preprocessing stage gone, they tend to break down in complex scenarios (e.g., head movement)
- rPPG-unrelated features may dominate training → large performance decrease in realistic dataset
- the refinement and amplification that preprocessing used to provide is gone, so PPG extraction becomes a much harder task

ppg → Heart rate

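For many models the output stops at PPG extraction

→ converting to heart rate may need further signal processing such as noise removal
traditional rather than DL-based methods are used
- fft(PSD), peak detection
I think a similar waveform shape (mse) and a similar period (bpm = heart rate) are different things!
- unless heart rate is used in the loss term, there are cases where the heart rate is much further from the truth than the PPG waveform is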
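The PSD route from a PPG signal to heart rate can be sketched as follows. This is a minimal illustration, not PhysFormer's code: the function name is hypothetical, the 0.7 to 3.0 Hz search band (42 to 180 bpm) is an assumption of mine, and the input is a synthetic 1.2 Hz sine (72 bpm) with a slow 0.2 Hz drift rather than a real rPPG trace.

```python
import numpy as np

def heart_rate_from_rppg(signal, fps, low_hz=0.7, high_hz=3.0):
    """Estimate heart rate (bpm) as the PSD peak inside an assumed heart-rate band."""
    signal = np.asarray(signal, dtype=float)
    signal = signal - signal.mean()                      # remove the DC component
    freqs = np.fft.rfftfreq(signal.size, d=1.0 / fps)    # frequency of each FFT bin
    psd = np.abs(np.fft.rfft(signal)) ** 2               # periodogram (unnormalized)
    band = (freqs >= low_hz) & (freqs <= high_hz)
    peak_hz = freqs[band][np.argmax(psd[band])]
    return 60.0 * peak_hz

fps = 30
t = np.arange(600) / fps                                 # 20 seconds of video
ppg = np.sin(2 * np.pi * 1.2 * t) + 0.1 * np.sin(2 * np.pi * 0.2 * t)
print(heart_rate_from_rppg(ppg, fps))
```

Only the dominant frequency matters here, which is why two signals with a similar waveform (low MSE) can still give different heart rates.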

Use of Transformers

Many CNN-based models have appeared in rPPG as well
Transformers have worked well in NLP and image / video analysis, so they should be applicable to rPPG too
- unlike those tasks, rPPG relies on subtle pixel changes, so global spatio-temporal perception is challenging
- long-range spatio-temporal attention is needed
rPPG measurement from facial videos can be treated as a video sequence to signal sequence problem
- LONG RANGE contextual clues should be exploited for semantic modeling

Contribution

Proposes the PhysFormer model, built on a powerful video temporal difference transformer backbone
- the first rPPG model to consider long-range spatio-temporal relationships
Proposes supervision strategies for PhysFormer
- label distribution learning
- curriculum learning guided dynamic loss in frequency domain
Achieves SOTA without pretraining

Remote physiological measurement

Plenty of traditional hand-crafted approaches

Stage that extracts feature maps from the face input:

Selective merging information from different color channels / different ROIs

Stage that extracts rPPG from the feature maps:

To improve the signal-to-noise ratio of the recovered rPPG signals, several signal decomposition methods such as independent component analysis (ICA) and matrix completion were proposed

The rise of learning-based approaches

deep learning based approaches dominate the field of rPPG measurement due to the strong spatio-temporal representation capabilities.
not end-to-end (facial ROI preprocessing stage + rPPG extraction stage)
- stage that generates facial ROI based spatial-temporal signal maps:
  - alleviate the interference from non-skin regions
  - ROI preprocessing accounts for a large share of the work
  - nothing outside the ROI can be taken into account (e.g. changes in background brightness)
- rPPG stage that picks up features from the maps (2D-CNN)
  - good for real-time use
- as a result, spatial-temporal information is hard to take into account
end-to-end
- only adjacent frames are considered; the periodic information in long-range relationships is not

  (previous methods only consider the spatio-temporal rPPG features from adjacent frames and neglect the long-range relationship among quasi-periodic rPPG features)

Transformer for vision tasks

Most of these works are incompatible for long-video-sequence (over 150 frames)

rppg in vision transformer

TransRPPG
- extracts rPPG features from the preprocessed signal maps via ViT for face 3D mask presentation attack detection
EfficientPhys (temporal shift networks)
- Based on the temporal shift networks, EfficientPhys adds several swin transformer layers for global spatial attention

The two models above are non-end-to-end (the stage that extracts rPPG from feature maps)

Materials and Methods

PhysFormer Architecture.

Shallow stem

Instead of splitting the video directly into patches and feeding them to multi-head self-attention, the video first passes through a shallow stem, and the stem output is split into patches and fed to MHSA
- helps extract coarse local spatio-temporal features
- benefits the fast convergence and clearer subsequent global self-attention
Consists of three convolution blocks (kernel sizes: 1x5x5, 3x3x3, 3x3x3)
Each convolution is followed by Batch Normalization, ReLU, and MaxPool, in that order
Each pooling layer halves the spatial dimensions, so
- when the facial video input size is 3xTxHxW, the shallow stem output size is D x T x H/8 x W/8

Tube tokenization

The shallow stem output is split into tube tokens, the tube tokens are passed through N temporal transformer blocks, and the result is the global-local refined rPPG feature X_trans (same dimension as X_tube).
Calling the stem output X_stem, X_stem is partitioned into non-overlapping tube tokens.
Because the stem is there, no separate position embedding step is applied.
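
As a quick sanity check on the shapes, the number of tube tokens can be worked out from the stem output size. The sketch below is only arithmetic: the helper name `tube_token_count` is made up for illustration, and it plugs in the 160x128x128 clip size and the 4x4x4 tube size listed later under Implementation Details, assuming the stem reduces H and W by 8 and leaves T unchanged as stated above.

```python
def tube_token_count(t, h, w, tube=(4, 4, 4), stem_stride=8):
    """Number of non-overlapping tube tokens for a T x H x W input clip."""
    ts, hs, ws = tube
    # The stem halves H and W three times (one MaxPool per block) and keeps T.
    h_stem, w_stem = h // stem_stride, w // stem_stride
    return (t // ts) * (h_stem // hs) * (w_stem // ws)

tokens = tube_token_count(160, 128, 128)  # 40 * 4 * 4 temporal x height x width tubes
```

Under these assumptions each clip yields a few hundred tokens, which is why attention over the full sequence is affordable here while splitting the raw video into patches would not be.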

Temporal difference multi-head self-attention

The idea of temporal difference convolution (TDC) comes from the AutoHR paper
In ordinary self-attention, the Query, Key, and Value projections are plain matmuls (point-wise linear, i.e. 1x1 conv)
Instead of the matmul (point-wise linear projection), temporal difference convolution is used
- subtracting adjacent frames from each other lets values such as the overall video brightness cancel out
TDC with learnable w

p0 : current spatio-temporal location, R : sampled local (3x3x3) neighborhood, R’ : sampled adjacent neighborhood

Temporal difference convolution is used for the Q (query) / K (key) projections.

⇒ can capture fine-grained local temporal difference features for subtle color change description

For the V (value) projection, the ordinary point-wise linear projection is used instead of the temporal difference convolution projection

Self-attention of the i-th head:

Using τ = √D_h as in the standard self-attention formula did not work well for rPPG, so a smaller τ is used for sharper attention activation. (τ is passed in as a parameter.)
A residual connection and layer normalization are applied at the end.
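
The effect of τ can be seen on a toy example. The sketch below is only the softmax part of attention, with made-up similarity scores; the name `attention_weights` is hypothetical. It compares the standard scaling τ = √D_h (with D_h = 96 / 4 = 24, from the D and h values listed under Implementation Details) against the smaller τ = 2.0 used there.

```python
import math

def attention_weights(scores, tau):
    """Softmax of scores / tau; a smaller tau puts more weight on the top score."""
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

scores = [2.0, 1.0, 0.5, 0.0]  # hypothetical query-key similarities for one query
d_h = 24                       # per-head dimension, D / h = 96 / 4
soft = attention_weights(scores, math.sqrt(d_h))  # standard scaling
sharp = attention_weights(scores, 2.0)            # smaller tau, sharper attention
```

With the smaller τ the largest weight grows and the smallest shrinks, which is what "sharper attention activation" refers to.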

Spatio-temporal feed-forward

The TD-MHSA output is the input to the feed-forward network: FFN(x) = max(0, xW1 + b1) W2 + b2 (two linear layers)
vanilla feed-forward network
- consists of two linear transformation layers
- the hidden dimension D’ between two layers is expanded to learn a richer feature representation
- hard to take temporal information into account
A depthwise 3D convolution (with BN and nonlinear activation) is inserted between these two layers
- the authors argue that ST-FF can refine local inconsistency and parts of noisy features
- the richer locality gives TD-MHSA sufficient relative position cues

Label Distribution Learning

Borrows the observation from facial age estimation that the smaller the age gap, the more alike two faces look

⇒ Facial rPPG signals with close HR values usually have similar periodicity.

Rather than producing a single label (HR) for one facial video, label distribution is used so that the output is expressed as a distribution over multiple HR value labels.
- the label distribution covers a specific HR range (42 to 180), and the output expresses how likely each label is.
- Through this way, one facial video can contribute to both targeted HR value and its adjacent HRs.

💡 The rPPG-based HR estimation problem is defined as a specific L-class multi-label classification problem. (L=139, 42 to 180)

How was the GT label turned into a label distribution?

Each label's entry p is a real value in [0,1], and the p values over all labels sum to 1. (a probability)
A Gaussian distribution function is used
The Gaussian distribution actually performed better than using the real distribution!
The loss function can then be written as L_LD = KL(p, Softmax(p̂)), where KL is the Kullback-Leibler divergence and p̂ is the power spectral density (PSD) of the predicted rPPG signal. ⇒ this also tells us how HR is extracted from rPPG
- Kullback-Leibler divergence: a function that measures how different two probability distributions are (the relative entropy of one with respect to the other)
- Power spectral density (PSD): how the power of a signal is distributed over frequency
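
The Gaussian target and the KL loss can be sketched as follows. This is a minimal pure-Python illustration under assumptions, not the paper's implementation: the function names are made up, labels are taken to be the integer HR values 42 to 180, and σ=1.0 is the value listed under Implementation Details. Here `p_hat` stands in for the PSD of a predicted rPPG signal sampled at those 139 HR values.

```python
import math

HR_MIN, HR_MAX = 42, 180  # L = 139 labels

def gaussian_label_distribution(hr, sigma=1.0):
    """Gaussian centred on the ground-truth HR, normalised to sum to 1."""
    raw = [math.exp(-((k - hr) ** 2) / (2 * sigma ** 2))
           for k in range(HR_MIN, HR_MAX + 1)]
    total = sum(raw)
    return [r / total for r in raw]

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q); terms where p underflows to 0 contribute nothing."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

p = gaussian_label_distribution(72)  # target distribution for a 72 bpm video
p_hat = [0.0] * 139                  # hypothetical flat PSD (an uninformative prediction)
loss = kl_divergence(p, softmax(p_hat))
```

Because the target spreads a little mass onto 71 and 73 bpm as well, a prediction that peaks one label off is penalised less than one that peaks far away.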

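Label distribution had been used before (A robust rppg approach with distribution learning), but it differs from this work as follows

The earlier work used it to smooth temporal HR outliers caused by frame-to-frame facial movement, whereas here it is used so that the model learns efficient features across adjacent labels.
In the earlier work it is a post-processing step for HR estimation, applied after hand-crafted rPPG signal extraction, whereas here it is applied inside the deep learning stage.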

Curriculum Learning Guided Dynamic Loss

Curriculum Learning: learn the easy tasks first

temporal domain loss

mean square error loss, negative Pearson loss
- negative Pearson correlation : maximize the trend similarity and minimize peak location errors (the loss introduced in Remote Photoplethysmograph Signal Measurement from Facial Videos Using Spatio-Temporal Networks)
These impose signal-trend-level constraints, so training converges quickly → overfits easily

frequency domain loss

cross-entropy loss, signal to noise ratio loss
Periodic features have to be learned within the target frequency bands
Strong frequency-domain constraints, made harder by realistic rPPG-irrelevant noise

This is the important part. Using a frequency domain loss means heart rate enters the loss term, so the model can work well not only for rPPG prediction but also for heart rate estimation.

The paper also proposes dynamic supervision to gradually enlarge the frequency constraints. Of the linear and exponential increment strategies, the exponential one is applied (alpha = 0.1, beta_0 = 1.0, n = 5.0)
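
These notes do not record the exact schedule, so the sketch below is only one plausible reading of the "exponential increment strategy": it assumes the frequency-loss weight starts at beta_0 and grows as a power of n over training, which is consistent with the beta range of [1,5] listed under Implementation Details. The function name `beta_schedule` and the precise exponent are assumptions, not taken from the paper.

```python
def beta_schedule(epoch, total_epochs, beta0=1.0, n=5.0):
    """Assumed exponential schedule: beta0 * n ** ((epoch - 1) / total_epochs)."""
    return beta0 * n ** ((epoch - 1) / total_epochs)

# Weight of the frequency-domain loss for each of 25 epochs (1-indexed).
betas = [beta_schedule(e, 25) for e in range(1, 26)]

def overall_loss(l_time, l_freq, epoch, total_epochs=25, alpha=0.1):
    """Assumed combination: fixed weight on the temporal loss, growing weight on the frequency loss."""
    return alpha * l_time + beta_schedule(epoch, total_epochs) * l_freq
```

The curriculum idea is visible in the numbers: early epochs are dominated by the easier temporal constraints, and the harder frequency constraints are weighted more heavily as training goes on.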

With pretraining, it might even be possible to train on HR alone!! I have no rPPG data on hand right now, so this is a ray of hope.

Results

Evaluated on three physiological signals:

heart rate(HR), heart rate variability(HRV), respiration frequency(RF)

Datasets used

VIPL-HR

large-scale dataset for remote physiological measurement under less-constrained scenarios
- includes head movement and illumination changes
2378 RGB videos of 107 subjects recorded with different head movements, lighting conditions and acquisition devices

MAHNOB-HCI

one of the most widely used benchmark for remote HR measurement evaluations
527 facial videos with 61 fps from 27 subjects

MMSE-HR

102 RGB videos from 40 subjects

OBF

high-quality dataset for remote physiological signal measurement
200 five-minute-long RGB videos with 60 fps recorded from 100 healthy adults

Mean HR estimation is evaluated on the four datasets above
HRV and RF estimation are evaluated on OBF only
- follow existing methods, and report low frequency(LF), high frequency(HF), and LF/HF ratio for HRV and RF estimation
Metrics: SD, MAE, RMSE, PCC (Pearson's correlation coefficient)
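
For reference, the four metrics can be computed as in the sketch below. The function name `hr_metrics` and the sample numbers are made up for illustration, and SD is taken here to be the standard deviation of the per-video error, which is an assumption about the convention rather than something stated in these notes.

```python
import math
import statistics

def hr_metrics(pred, gt):
    """SD of the error, MAE, RMSE, and Pearson's r between predicted and true HR."""
    errors = [p - g for p, g in zip(pred, gt)]
    sd = statistics.pstdev(errors)
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    mp, mg = statistics.fmean(pred), statistics.fmean(gt)
    cov = sum((p - mp) * (g - mg) for p, g in zip(pred, gt))
    var_p = sum((p - mp) ** 2 for p in pred)
    var_g = sum((g - mg) ** 2 for g in gt)
    return {"SD": sd, "MAE": mae, "RMSE": rmse, "r": cov / math.sqrt(var_p * var_g)}

# Hypothetical per-video HR predictions and ground truth, in bpm.
m = hr_metrics([70, 82, 95, 61], [72, 80, 95, 60])
```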

Implementation Details

The MTCNN face detector is used to crop the enlarged face area in the first frame, and the same region is fixed for the following frames
MAHNOB-HCI and OBF are downsampled to 30 fps
PhysFormer settings:
- N=12, h=4, D=96, D’=144
- TD-MHSA : θ=0.7 (for TDC), τ=2.0
- tube size T_s x H_s x W_s = 4x4x4
During training, randomly sample RGB face clips with size 160x128x128 (TxHxW)
Random horizontal flipping and temporally up/down-sampling for data augmentation
Adam optimizer
lr = 1e-4, weight decay = 5e-5
batch size=4
Trained for 25 epochs with
- alpha=0.1 for temporal loss,
- exponentially increased parameter beta in [1,5] for frequency loss.
σ=1.0 for label distribution learning (Gaussian distribution)
testing : separate 30-second videos into three short clips with 10 seconds, and video-level HR is calculated via averaging HR from three short clips

Intra-dataset Testing

HR estimation on VIPL-HR

traditional methods(Tulyakov2016, POS, CHROM) perform poorly due to the complex scenarios
end-to-end learning based methods (PhysNet, DeepPhys, AutoHR) predict unreliable HR values with large RMSE compared with non-end-to-end learning approaches (RhythmNet, ST-Attention, NAS-HR, CVD, Dual-GAN)
all five non-end-to-end methods first extract fine-grained signal maps from multiple facial ROIs, and then more dedicated rPPG clues would be extracted via the cascaded models.
PhysFormer gets good results even without pretraining!

HR estimation on MAHNOB-HCI

finetune the VIPL-HR pretrained model on MAHNOB-HCI for a further 15 epochs

HR, HRV and RF estimation on OBF

Physformer also gives more accurate estimation in terms of HR, RF, and LF/HF compared with the preprocessed signal map based non-end-to-end learning method CVD.

Cross-dataset Testing

Train on VIPL-HR, then test directly on MMSE-HR without finetuning
Achieves SOTA
1. The predicted HRs are highly correlated with the ground truth HRs
2. The model learns domain invariant intrinsic rPPG-aware features.

Ablation Study

Impact of tube tokenization: four tokenization configurations, with and without the stem

Results were better with the stem
Results got worse as the input resolution was lowered

Impact of TD-MHSA and ST-FF

Without spatio-temporal attention, performance degrades sharply
τ has a large effect on MHSA ⇒ using τ the way ViT does causes a performance drop
Performance improved when τ was set smaller than in ViT (sharper).

Impact of label distribution learning / Impact of dynamic supervision

Of the temporal loss and the frequency cross-entropy loss, label distribution belongs on the frequency loss side
Label distribution combined with the exponential strategy gave the best results
Without label distribution at all, results were poor even with the exponential strategy
- It is interesting to find from the last two rows that using real PSD distribution from ground truth PPG signals as p, the performance is inferior due to the lack of an obvious peak and partial noise. (the Gaussian distribution did better than the real PSD distribution)
The fixed vs. dynamic supervision comparison was run on Fold-1 of VIPL-HR

Impact of theta and layer/head numbers

Physformer could achieve smaller RMSE when theta=0.4 and 0.7, indicating the importance of the normalized local temporal difference features for global spatio-temporal attention.

With deeper temporal transformer blocks, the RMSE are reduced progressively despite heavier computational cost.

In terms of the impact of head numbers, it is clear to find that PhysFormer with four heads perform the best while fewer heads lead to sharp performance drops.

Conclusion

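Built an end-to-end video transformer architecture for rPPG
Improved performance by using the temporal difference transformer
Remaining work: a more efficient architecture
- the current parameter count is hard to use on mobile
Remaining work: more accurate yet efficient spatio-temporal self-attention mechanism for long sequence rPPG monitoring

rPPG means drawing a PPG waveform from changes in facial skin tone that the human eye can barely detect, so some people ask whether this is really possible at all, and I partly agree. That is why many rPPG extraction models include a module that amplifies this subtle color change, and here that module was TDC.

But when I actually run experiments with PhysFormer, it works well at 30 fps yet stops working well once the fps is changed with frame interpolation. That may be because the model was trained mostly at 30 fps, but my suspicion is that rather than looking at the difference between frames (temporal difference), the model looks at how red the skin is at that moment, so it may not really be using temporal information. Instead of how fast the skin tone changes periodically, it tends to see a red face in some frame, decide the person is exercising, and output a high heart rate. That also brings in a lot of bias across ethnicities. Anyway, it is a really hard problem.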

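[Algorithms/코테 합격자 되기] Time Complexity / Syntax

2022-10-23T15:00:00+00:00

Notes from week 1 of the study group for 코딩 테스트 합격자 되기 - 파이썬 편 (the Python edition)!

Time complexity

Time complexity and space complexity are the usual measures of an algorithm's performance. Of the two, time complexity seems to be the one that needs attention more often.

Time complexity is tied to the number of operations, so I count it as growing by 1 for every operation (assigning a value to a variable, or a calculation). If my code performs at most n operations, its time complexity is written O(n). (Big-O notation)

We usually use Big-O notation, which gives an upper bound and is typically applied to the worst case, but Big-Theta (a tight bound) and Big-Omega (a lower bound) also exist. They are often loosely associated with the average case and the best case. The way you work them out is much the same; you just analyze the case you care about.

Big-O notation

Big-O only cares about how the cost grows as n gets large, so the expressions tend to get simpler. If the time complexity is O(n^2+2n-1), it is the same as O(n^2). As n grows without bound the highest-order term dominates, so we keep only that term and drop the lower-order terms.

This follows from the definition of Big-O below.

f(x) is O(g(x)) if $f(x) \le C \cdot g(x)$ holds for every x past some point,

where C is a constant.

So, because $n^2+2n-1 \le C \cdot n^2$ holds from some n onward (with C = 2 it already holds from n=1, since it reduces to $(n-1)^2 \ge 0$), the time complexity can be written simply as O(n^2).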

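Essential syntax for coding tests

Built-in data types

There are integers (int), floating-point numbers (float), and strings. On numbers, arithmetic, comparison, and logical operations all work, and bitwise operations work on ints. Even abs works out of the box, which in C needs a header (stdlib.h for abs, math.h for fabs).

With floats, watch out for epsilon. Representing a decimal fraction in binary introduces a rounding error, and that error can be the reason a solution fails, so take care.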
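
A small example of the epsilon issue, using only the standard library. The tolerance `1e-9` in the manual comparison is an arbitrary choice for illustration; the right value depends on the problem.

```python
import math

a = 0.1 + 0.2
exact_match = (a == 0.3)             # False: the sum is stored as 0.30000000000000004
close_enough = math.isclose(a, 0.3)  # True with the default relative tolerance
within_eps = abs(a - 0.3) < 1e-9     # the same idea written as a manual epsilon check
```

When a problem compares floats, comparing with a tolerance like this (or avoiding floats by working in integers) is safer than `==`.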

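Collection data types

There are list, tuple, dictionary, set, and string. In particular, when I need to remove duplicates from a list, I don't loop and delete; I just convert it to a set once and back to a list! (Note that this does not preserve the original order.)

The mutable objects are list, dictionary, and set (tuples are immutable). If you bind a mutable object to another variable and modify it, both names see the change. To avoid that, make a copy first with something like .copy(). Immutable objects cannot be modified in place, so this never comes up for them!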
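
Both points above can be checked in a few lines. The variable names are made up for the example; `dict.fromkeys` is shown as one way to remove duplicates while keeping first-seen order.

```python
nums = [3, 1, 3, 2, 1]
unique = list(set(nums))                    # duplicates removed; order is not guaranteed
ordered_unique = list(dict.fromkeys(nums))  # duplicates removed, first-seen order kept

alias = nums          # a second name for the same list object
copied = nums.copy()  # a shallow copy: a separate list
nums.append(9)        # visible through alias, not through copied
```

Note that `.copy()` is shallow: for a list of lists, the inner lists are still shared, and `copy.deepcopy` is needed to separate them.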

I roughly know list slicing and dictionary usage already, so I won't write them up in detail. Note that tuples do not support insertion or deletion!!

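Time complexity of list operations

list.append() : just adds at the end, so O(1)
list.insert(i, num) : every element from index i onward has to shift back one slot, so O(n)
list.pop() : just removes the last element, so O(1)
list.remove(num) : O(n) to find num in the list, plus O(n) to shift the elements after it => O(n)
list.extend(list2) : appends the elements of list2 one by one at the end, so O(1) k times => O(k)
list[k] : access itself is constant time, O(1) (a linked list would be O(n), but a Python list is a dynamic array, not a linked list)
list1+list2 : list.extend appends to the existing list and so was O(k), but this builds a new list and copies every element of list1 and list2 into it, so O(n+m)
list(set(list)) : O(n)
k in list : searches for k and returns whether it exists, so O(n) for the search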

Time complexity of dictionary operations

Unlike a list, a dictionary looks a key up in constant time (on average, since it is a hash table).

So dic.get(key), dic[key], and dic.pop(key) are all O(1).

Time complexity of set operations

For a set, add, remove, and discard all run in O(1) constant time. The rest:

set1.union(set2): O(len(set1)+len(set2)) (it walks every element of both sets)
set1.intersection(set2): O(min(len(set1), len(set2)))
set1.difference(set2) : O(len(set1))
list(set(set1)) : O(len(set1)) (appends every element to a list)
k in set1 : lookup is O(1)
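
For reference, the dictionary and set operations listed above look like this in use. The data is made up for the example.

```python
ages = {"kim": 31, "lee": 27}
hit = ages.get("kim")        # value for an existing key
miss = ages.get("park", -1)  # default value instead of a KeyError
removed = ages.pop("lee")    # returns the value and deletes the key

a, b = {1, 2, 3, 4}, {3, 4, 5}
union = a.union(b)           # every element of either set
common = a.intersection(b)   # elements in both
only_a = a.difference(b)     # elements of a that are not in b
has_three = 3 in a           # hash lookup, no scan
```

Note that `dic[key]` raises `KeyError` for a missing key, while `dic.get(key)` returns `None` (or the given default).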

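Time complexity of string operations

Strings have some syntax I don't know well, so let me write these up. :(

str1 + str2 : O(len(str1)+len(str2)) concatenates two strings
delimiter.join(string_list) : O(total string length) joins a list of strings with delimiter
str.replace(old, new) : O(n) finds a substring and replaces it
str.split(sep) : O(n) splits the string on sep (returns a list)
str.startswith(prefix) : O(len(prefix)) checks whether the string starts with prefix
str.endswith(suffix) : O(len(suffix)) checks whether the string ends with suffix (compares from the end)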
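
A short run through the string methods above, with a made-up input string:

```python
words = "union find path compression".split(" ")
joined = "-".join(words)
replaced = joined.replace("-", "_")
starts = joined.startswith("union")
ends = joined.endswith("ion")

# Building one long string: join is linear in the total length,
# while repeated += may copy the growing string at every step.
pieces = [str(i) for i in range(5)]
built = "".join(pieces)
```

Strings are immutable, so every one of these methods returns a new string (or list) rather than changing the original.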

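Lambda expressions (functions)

A lambda is defined like this! I never saw these in C, so I'm not good at using them :(

lambda x, y: x + y

You can define a small function with a lambda like the one above, store it in a variable, and call it, or you can pass the lambda as an argument. I feel like I've run into the second more often..?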
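
Both uses look like this in practice. The data and names are made up; passing a lambda as the `key` of `sorted` is probably the form that shows up most in coding tests.

```python
# 1) Store the lambda in a variable and call it like a function.
add = lambda x, y: x + y
total = add(2, 3)

# 2) Pass the lambda as an argument.
planes = [("B", 4), ("A", 2), ("C", 4)]
# Sort by the number descending, then by name ascending.
by_gate_then_name = sorted(planes, key=lambda p: (-p[1], p[0]))
doubled = list(map(lambda v: v * 2, [1, 2, 3]))
```

For anything longer than one expression, a regular `def` is usually easier to read.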

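[Baekjoon] 10775 공항 (Airport)

2022-10-08T00:00:00+00:00

While going through the step-by-step problem set I found one I had gotten wrong before, so I brought it here! It uses union-find, and since I remembered nothing about union-find, I studied the concept first and then solved it. My goal is to work through problems steadily, one at a time.

Problem description

Link An airport has p planes and g gates. The statement is hard to follow on first read: the i-th plane must be able to dock at one of gates 1 through g_i, and if it cannot dock, the airport closes (g_i is given as input). In other words, an input g_i comes in for the i-th plane, and that plane may be parked at any of gates 1 through g_i. The answer is the maximum number of planes that can be parked (up to just before the airport has to close).

Solution

Even knowing union-find, I think this would have been a hard problem.

First, what is union-find? It comes up in graph algorithms.

array[i] : the parent node of node i (the root node, once the path has been compressed)

With that definition, union merges two disjoint sets (two graphs whose root nodes differ), and find is the function that returns the root node of a node's graph. It all comes down to handling parent nodes well. So union(x, y) finds the roots of x and y (call them px and py) and sets the parent of px to py (array[px]=py). For find, if array[i]=i the node is its own root, so the search stops and returns i; otherwise find is called recursively on i's parent (array[i]). The important part is avoiding a time-out: just before returning, store the result of the recursive call as i's parent (path compression, a kind of memoization). The stored parent may only be a parent of a parent at some point, so it keeps getting refreshed toward the root.

This problem can be solved with union-find, though I didn't think of it right away. Plane i can go to any of gates 1 through g_i, but if someone has already parked at gate g_i it has to go to g_i-1, and so on. Searching for a free gate every time a plane arrives takes O(p*g) overall. So for each plane we need to know right away which gate to use. How? If a plane has been parked at gate x, the next plane aimed at x has to go to x-1, so we union gate x (together with the gates already merged into it) with gate x-1. In other words, the parent of the gate I just used becomes the gate right before it. Then the next time someone tries gate x, the search starts from the root of gate x-1.

The code is below.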

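#include <stdio.h>

int a[100001]; // a[i]: parent of gate i; a[0] stays 0 and means "no gate left"

// Returns the root of x, compressing the path on the way back.
int find_(int x){
    if(a[x]==x) return x;
    return a[x] = find_(a[x]);
}

// Attaches the root of x under the root of y.
void union_(int x, int y){
    a[find_(x)]=find_(y);
}

int main(){
    int g, p;
    scanf("%d%d", &g, &p);

    for(int i=1;i<=g;i++) a[i]=i;

    int ans=0;
    while(p--){
        int tmp;
        scanf("%d", &tmp);

        int gate=find_(tmp); // highest free gate among 1..tmp
        if(gate==0) break;   // nothing free: the airport closes

        ans++;
        union_(gate, gate-1); // gate is taken; later lookups fall through to gate-1
    }

    printf("%d", ans);
}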

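[Baekjoon] 2839 설탕 배달 (Sugar Delivery)

2022-09-25T00:00:00+00:00

It has been a really long time since I brought a Baekjoon problem! This one is called Sugar Delivery, and I picked it because I heard it can be solved with dp. I got it wrong long ago and had been avoiding it, assuming it was hard, but what do you know, this time I got it right on the first try! It feels good to see my skills improving bit by bit.😊

Problem description

Link This is a sugar delivery problem. The amount of sugar is given as n grams, and n grams has to be made up of 3-gram bags and 5-gram bags. The goal is to use as few bags as possible!

Solution

Maybe it is because I already knew it was dp, but it is simple!

dp[n] : the number of bags needed to make n grams

Amounts that cannot be made are set to INTMAX (I just used a very large number), and dp is updated in a for loop. The bag count for n grams is either the count for n-3 grams plus one 3-gram bag, or the count for n-5 grams plus one 5-gram bag. We want the fewest bags, so take the smaller of the two! The reason I set unreachable amounts to INTMAX is that, when dp is updated from an unreachable amount, any value at or above INTMAX can simply be ignored. When an unreachable k grams feeds into k+3 or k+5 grams, either the smaller count wins and k is ignored, or the new amount is unreachable too and ends up with a value larger than INTMAX. In the end, any value at or above INTMAX is treated as unusable and -1 is printed. The code is below!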

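#include <stdio.h>

#define INF 100000001 // sentinel meaning "this amount cannot be made"

int dp[5001]; // dp[i]: minimum number of bags for exactly i grams

int main(){
    int n;
    scanf("%d", &n);

    // base cases: only 3 and 5 can be made with a single bag
    dp[0]=INF;
    dp[1]=INF;
    dp[2]=INF;
    dp[3]=1;
    dp[4]=INF;
    dp[5]=1;

    for(int i=6;i<=n;i++){
        // add one 3-gram bag or one 5-gram bag, whichever gives fewer bags
        dp[i]=(dp[i-3]<dp[i-5])?dp[i-3]+1:dp[i-5]+1;
    }

    // anything at or above INF was built from an unreachable amount
    printf("%d\n", (dp[n]>=INF)?-1:dp[n]);

}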

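[Baekjoon] 2559 수열 (Sequence)

2022-07-12T00:00:00+00:00

The problem I happened to pick today turned out to be easy, so I figured I should at least write a post about it! Easy or hard, I want to keep studying and posting regularly. Cheers to me, today and in the future. 👊🏻

Problem description

Link This one is simple! It does not take long to figure out, but I brought it to explain the sliding window idea. A sliding window is not a complicated technique; it just reuses the part that overlaps between consecutive windows (to save repeated computation). So in this problem we use a sliding window to avoid adding the same numbers again. The problem itself is simple: given a sequence of length n, find the contiguous run of length k whose sum is largest and return that sum.

Solution

Simple. Once we have the sum of elements i through i+k-1, the sum of elements i+1 through i+k is obtained by adding the (i+k)-th element and subtracting the i-th. Updating s this way is the sliding window technique. Tracking max while computing s solves the problem in O(n).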

#include <stdio.h>

long long n, k;
int t[100001];

int main(){
    scanf("%lld%lld", &n, &k);
    for(int i=0;i<n;i++) scanf("%d", &t[i]);

    // sum of the first window, t[0..k-1]
    long long sum=0;
    for(int i=0;i<k;i++) sum+=t[i];

    long long max=sum;
    long long s=sum;
    for(int i=k;i<n;i++){
        s=(s+t[i])-t[i-k]; // slide: add the new element, drop the oldest one
        if(s>max) max=s;
    }
    printf("%lld\n", max);
}

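[Baekjoon] 2629 양팔저울 (Two-Pan Balance)

2022-07-11T00:00:00+00:00

I picked the blog back up after a long while to try one commit a day again over the break! I hope steady study gets me to the point where I can solve every problem.

Problem description

Link This is said to be a knapsack problem similar to 평범한 배낭 (Ordinary Knapsack)! I could not remember knapsack, so I went back and reviewed it.. haha If I find the time I should post about 평범한 배낭 too! Anyway, this problem asks whether a bead's weight can be matched by putting weights and the bead on a two-pan balance. Given the number of weights and the mass of each one, check whether the weights can be arranged on the balance to match the bead. Trying every arrangement times out (each weight has three options, so up to 3^30 cases), so we use dp.

Solution

The key to this problem is how to set up the dp. I could not come up with it right away, so I peeked at a solution here and there. (haha) Link The dp can be defined as follows.

bool dp[i][w] : having considered the weights up to the i-th, can weight w be made?

With dp defined this way, we know what to do next after considering up to the i-th weight. If weight w can be made using the weights up to the i-th, dp[i][w] is true. From there we move on to the (i+1)-th weight, and there are three cases.

The (i+1)-th weight is not placed at all.
The (i+1)-th weight is placed (on the pan opposite the bead).
The (i+1)-th weight is placed together with the bead (on the same pan).

The recurrence follows from these three cases. That is,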

void f(int i, int w){
    if(i>cn) return;
    if(w>40000) return;
    if(dp[i][w]) return;

    dp[i][w]=true;

    f(i+1, w+chu[i]);
    f(i+1, w);
    if (w-chu[i]>=0) f(i+1, w-chu[i]);
    else f(i+1, chu[i]-w);
}

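is what we get. The important part is the stopping conditions; without them f never finishes and keeps running. f returns when i goes past the number of weights, when the total weight exceeds 40000 (given in the constraints), and when it reaches a dp entry that is already true. With that, the two-pan balance problem is solved. The code is below.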

#include <stdio.h>
#include <stdbool.h>

int cn;
int chu[31];
int bn;
int b[8];

bool dp[31][50000]={false, }; // dp[i][j] : can weight j be made using the weights up to the i-th?

void f(int i, int w){
    if(i>cn) return;      // past the last weight
    if(w>40000) return;   // beyond the largest bead weight
    if(dp[i][w]) return;  // already visited

    dp[i][w]=true;

    f(i+1, w+chu[i]);     // weight on the pan opposite the bead
    f(i+1, w);            // weight not used
    if (w-chu[i]>=0) f(i+1, w-chu[i]); // weight on the bead's pan
    else f(i+1, chu[i]-w);
}

int main(){
    scanf("%d", &cn);
    for(int i=0;i<cn;i++) scanf("%d", &chu[i]);
    scanf("%d", &bn);
    for(int i=0;i<bn;i++) scanf("%d", &b[i]);

    f(0,0);

    for(int i=0;i<bn;i++){
        if(b[i]>15000) printf("%s ", "N"); // anything above 30*500g is impossible, so there is no need to look at dp
        else if(dp[cn][b[i]]) printf("%s ", "Y");
        else printf("%s ", "N");
    }
    printf("\n");
    
}
