
# What are Diffusion Models?


Diffusion models are a new class of generative models that are flexible enough to learn arbitrarily complex data distributions while remaining analytically tractable to evaluate. Recent work has shown that diffusion models can generate high-quality images, with performance competitive with state-of-the-art GANs.

[Updated on 2021-09-19: Highly recommend this blog post on score-based generative modeling by Yang Song (author of several key papers in the references)].

So far, I’ve written about three types of generative models: GANs, VAEs, and flow-based models. They have shown great success in generating high-quality samples, but each has limitations of its own. GANs are known for potentially unstable training and less diversity in generation due to their adversarial training nature. VAEs rely on a surrogate loss. Flow models have to use specialized architectures to construct reversible transforms.

Diffusion models are inspired by non-equilibrium thermodynamics. They define a Markov chain of diffusion steps to slowly add random noise to data and then learn to reverse the diffusion process to construct desired data samples from the noise. Unlike VAE or flow models, diffusion models are learned with a fixed procedure and the latent variable has high dimensionality (same as the original data).

Fig. 1. Overview of different types of generative models.

## What are Diffusion Models?

Several diffusion-based generative models have been proposed with similar ideas underneath, including diffusion probabilistic models (Sohl-Dickstein et al., 2015), noise-conditioned score networks (NCSN; Song & Ermon, 2019), and denoising diffusion probabilistic models (DDPM; Ho et al., 2020).

### Forward diffusion process

Given a data point sampled from a real data distribution $$\mathbf{x}_0 \sim q(\mathbf{x})$$, let us define a forward diffusion process in which we add a small amount of Gaussian noise to the sample in $$T$$ steps, producing a sequence of noisy samples $$\mathbf{x}_1, \dots, \mathbf{x}_T$$. The step sizes are controlled by a variance schedule $$\{\beta_t \in (0, 1)\}_{t=1}^T$$.

$q(\mathbf{x}_t \vert \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t} \mathbf{x}_{t-1}, \beta_t\mathbf{I}) \quad q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) = \prod^T_{t=1} q(\mathbf{x}_t \vert \mathbf{x}_{t-1})$

The data sample $$\mathbf{x}_0$$ gradually loses its distinguishable features as the step $$t$$ becomes larger. Eventually when $$T \to \infty$$, $$\mathbf{x}_T$$ is equivalent to an isotropic Gaussian distribution.
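To make the forward transition concrete, here is a minimal NumPy sketch (illustrative only, not from the papers; the linear `betas` schedule mirrors the values discussed later in this post) that noises a toy data vector step by step:

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ q(x_t | x_{t-1}) = N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)  # variance schedule beta_1 < ... < beta_T
x = rng.standard_normal(16)            # a toy "data" vector x_0
for beta_t in betas:
    x = forward_step(x, beta_t, rng)
# after T steps, x is approximately a sample from an isotropic Gaussian
```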

Fig. 2. The Markov chain of forward (reverse) diffusion process of generating a sample by slowly adding (removing) noise. (Image source: Ho et al. 2020 with a few additional annotations)

A nice property of the above process is that we can sample $$\mathbf{x}_t$$ at any arbitrary time step $$t$$ in closed form using the reparameterization trick. Let $$\alpha_t = 1 - \beta_t$$ and $$\bar{\alpha}_t = \prod_{i=1}^t \alpha_i$$:

\begin{aligned} \mathbf{x}_t &= \sqrt{\alpha_t}\mathbf{x}_{t-1} + \sqrt{1 - \alpha_t}\mathbf{z}_{t-1} & \text{ ;where } \mathbf{z}_{t-1}, \mathbf{z}_{t-2}, \dots \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) \\ &= \sqrt{\alpha_t \alpha_{t-1}} \mathbf{x}_{t-2} + \sqrt{1 - \alpha_t \alpha_{t-1}} \bar{\mathbf{z}}_{t-2} & \text{ ;where } \bar{\mathbf{z}}_{t-2} \text{ merges two Gaussians (*).} \\ &= \dots \\ &= \sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\mathbf{z} \\ q(\mathbf{x}_t \vert \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_t; \sqrt{\bar{\alpha}_t} \mathbf{x}_0, (1 - \bar{\alpha}_t)\mathbf{I}) \end{aligned}

(*) Recall that when we merge two Gaussians with different variances, $$\mathcal{N}(\mathbf{0}, \sigma_1^2\mathbf{I})$$ and $$\mathcal{N}(\mathbf{0}, \sigma_2^2\mathbf{I})$$, the new distribution is $$\mathcal{N}(\mathbf{0}, (\sigma_1^2 + \sigma_2^2)\mathbf{I})$$. Here the merged standard deviation is $$\sqrt{(1 - \alpha_t) + \alpha_t (1-\alpha_{t-1})} = \sqrt{1 - \alpha_t\alpha_{t-1}}$$.

Usually, we can afford a larger update step when the sample gets noisier, so $$\beta_1 < \beta_2 < \dots < \beta_T$$ and therefore $$\bar{\alpha}_1 > \dots > \bar{\alpha}_T$$.
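The closed form above means $$\mathbf{x}_t$$ can be sampled in one shot from $$\mathbf{x}_0$$, without simulating the chain. A small sketch (NumPy assumed; the schedule values are illustrative):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas_bar = np.cumprod(1.0 - betas)   # abar_t = prod_{i<=t} alpha_i, decreasing in t

def q_sample(x0, t, alphas_bar, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I) directly."""
    z = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * z

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)
x_mid = q_sample(x0, 500, alphas_bar, rng)  # x_t for t = 500, in a single step
```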

#### Connection with stochastic gradient Langevin dynamics

Langevin dynamics is a concept from physics, developed for statistically modeling molecular systems. Combined with stochastic gradient descent, stochastic gradient Langevin dynamics (Welling & Teh 2011) can produce samples from a probability density $$p(\mathbf{x})$$ using only the gradients $$\nabla_\mathbf{x} \log p(\mathbf{x})$$ in a Markov chain of updates:

$\mathbf{x}_t = \mathbf{x}_{t-1} + \frac{\epsilon}{2} \nabla_\mathbf{x} \log p(\mathbf{x}_{t-1}) + \sqrt{\epsilon} \mathbf{z}_t ,\quad\text{where } \mathbf{z}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$

where $$\epsilon$$ is the step size. When $$T \to \infty$$ and $$\epsilon \to 0$$, $$\mathbf{x}_T$$ converges to an exact sample from the true probability density $$p(\mathbf{x})$$.

Compared to standard SGD, stochastic gradient Langevin dynamics injects Gaussian noise into the parameter updates to avoid collapsing into local minima.
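As a sanity check of the update rule, for a standard Gaussian $$\mathcal{N}(\mathbf{0}, \mathbf{I})$$ the score is simply $$-\mathbf{x}$$, so Langevin dynamics can be tested directly (a toy sketch, not from the papers; step size and chain count are arbitrary choices):

```python
import numpy as np

def langevin_sample(score_fn, x0, eps=0.05, n_steps=1000, seed=0):
    """Iterate x_t = x_{t-1} + eps/2 * score(x_{t-1}) + sqrt(eps) * z_t."""
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        z = rng.standard_normal(x.shape)
        x = x + 0.5 * eps * score_fn(x) + np.sqrt(eps) * z
    return x

# Score of the standard Gaussian N(0, I) is grad_x log p(x) = -x.
# 500 chains initialized far from the mode settle to roughly unit variance.
samples = langevin_sample(lambda x: -x, np.full(500, 5.0))
```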

### Reverse diffusion process

If we can reverse the above process and sample from $$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$$, we will be able to recreate the true sample from a Gaussian noise input, $$\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$$. Note that if $$\beta_t$$ is small enough, $$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$$ will also be Gaussian. Unfortunately, we cannot easily estimate $$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t)$$ because it needs to use the entire dataset and therefore we need to learn a model $$p_\theta$$ to approximate these conditional probabilities in order to run the reverse diffusion process.

$p_\theta(\mathbf{x}_{0:T}) = p(\mathbf{x}_T) \prod^T_{t=1} p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) \quad p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$

Fig. 3. An example of training a diffusion model for modeling a 2D swiss roll data. (Image source: Sohl-Dickstein et al., 2015)

It is noteworthy that the reverse conditional probability is tractable when conditioned on $$\mathbf{x}_0$$:

$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \color{blue}{\tilde{\boldsymbol{\mu}}}(\mathbf{x}_t, \mathbf{x}_0), \color{red}{\tilde{\beta}_t} \mathbf{I})$

Using Bayes’ rule, we have:

\begin{aligned} q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) &= q(\mathbf{x}_t \vert \mathbf{x}_{t-1}, \mathbf{x}_0) \frac{ q(\mathbf{x}_{t-1} \vert \mathbf{x}_0) }{ q(\mathbf{x}_t \vert \mathbf{x}_0) } \\ &\propto \exp \Big(-\frac{1}{2} \big(\frac{(\mathbf{x}_t - \sqrt{\alpha_t} \mathbf{x}_{t-1})^2}{\beta_t} + \frac{(\mathbf{x}_{t-1} - \sqrt{\bar{\alpha}_{t-1}} \mathbf{x}_0)^2}{1-\bar{\alpha}_{t-1}} - \frac{(\mathbf{x}_t - \sqrt{\bar{\alpha}_t} \mathbf{x}_0)^2}{1-\bar{\alpha}_t} \big) \Big) \\ &= \exp\Big( -\frac{1}{2} \big( \color{red}{(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}})} \mathbf{x}_{t-1}^2 - \color{blue}{(\frac{2\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{2\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0)} \mathbf{x}_{t-1} + C(\mathbf{x}_t, \mathbf{x}_0) \big) \Big) \end{aligned}

where $$C(\mathbf{x}_t, \mathbf{x}_0)$$ is some function not involving $$\mathbf{x}_{t-1}$$ and details are omitted. Following the standard Gaussian density function, the mean and variance can be parameterized as follows:

\begin{aligned} \tilde{\beta}_t &= 1/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t \\ \tilde{\boldsymbol{\mu}}_t (\mathbf{x}_t, \mathbf{x}_0) &= (\frac{\sqrt{\alpha_t}}{\beta_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}}{1 - \bar{\alpha}_{t-1}} \mathbf{x}_0)/(\frac{\alpha_t}{\beta_t} + \frac{1}{1 - \bar{\alpha}_{t-1}}) = \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \mathbf{x}_0 \end{aligned}

Thanks to the nice property, we can represent $$\mathbf{x}_0 = \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\mathbf{z}_t)$$ and plug it into the above equation to obtain:

\begin{aligned} \tilde{\boldsymbol{\mu}}_t &= \frac{\sqrt{\alpha_t}(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t} \mathbf{x}_t + \frac{\sqrt{\bar{\alpha}_{t-1}}\beta_t}{1 - \bar{\alpha}_t} \frac{1}{\sqrt{\bar{\alpha}_t}}(\mathbf{x}_t - \sqrt{1 - \bar{\alpha}_t}\mathbf{z}_t) \\ &= \color{cyan}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \mathbf{z}_t \Big)} \end{aligned}
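The two forms of $$\tilde{\boldsymbol{\mu}}_t$$ above can be checked numerically: writing $$\mathbf{x}_0$$ in terms of $$\mathbf{x}_t$$ and $$\mathbf{z}$$ must give the same value (a scalar sanity check with an illustrative schedule):

```python
import numpy as np

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)
alphas = 1.0 - betas
alphas_bar = np.cumprod(alphas)

t = 500
x0 = rng.standard_normal()
z = rng.standard_normal()
xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * z  # nice property

# mu_tilde expressed in terms of (x_t, x_0) ...
mu_from_x0 = (np.sqrt(alphas[t]) * (1.0 - alphas_bar[t - 1]) / (1.0 - alphas_bar[t]) * xt
              + np.sqrt(alphas_bar[t - 1]) * betas[t] / (1.0 - alphas_bar[t]) * x0)
# ... and in terms of (x_t, z) after substituting x_0
mu_from_z = (xt - betas[t] / np.sqrt(1.0 - alphas_bar[t]) * z) / np.sqrt(alphas[t])

assert np.isclose(mu_from_x0, mu_from_z)  # both forms agree
```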

As demonstrated in Fig. 2, such a setup is very similar to a VAE, and thus we can use the variational lower bound to optimize the negative log-likelihood.

\begin{aligned} - \log p_\theta(\mathbf{x}_0) &\leq - \log p_\theta(\mathbf{x}_0) + D_\text{KL}(q(\mathbf{x}_{1:T}\vert\mathbf{x}_0) \| p_\theta(\mathbf{x}_{1:T}\vert\mathbf{x}_0) ) \\ &= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_{\mathbf{x}_{1:T}\sim q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T}) / p_\theta(\mathbf{x}_0)} \Big] \\ &= -\log p_\theta(\mathbf{x}_0) + \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} + \log p_\theta(\mathbf{x}_0) \Big] \\ &= \mathbb{E}_q \Big[ \log \frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \\ \text{Let }L_\text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \Big[ \log \frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \geq - \mathbb{E}_{q(\mathbf{x}_0)} \log p_\theta(\mathbf{x}_0) \end{aligned}

It is also straightforward to get the same result using Jensen’s inequality. Say we want to minimize the cross entropy as the learning objective,

\begin{aligned} L_\text{CE} &= - \mathbb{E}_{q(\mathbf{x}_0)} \log p_\theta(\mathbf{x}_0) \\ &= - \mathbb{E}_{q(\mathbf{x}_0)} \log \Big( \int p_\theta(\mathbf{x}_{0:T}) d\mathbf{x}_{1:T} \Big) \\ &= - \mathbb{E}_{q(\mathbf{x}_0)} \log \Big( \int q(\mathbf{x}_{1:T} \vert \mathbf{x}_0) \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})} d\mathbf{x}_{1:T} \Big) \\ &= - \mathbb{E}_{q(\mathbf{x}_0)} \log \Big( \mathbb{E}_{q(\mathbf{x}_{1:T} \vert \mathbf{x}_0)} \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})} \Big) \\ &\leq - \mathbb{E}_{q(\mathbf{x}_{0:T})} \log \frac{p_\theta(\mathbf{x}_{0:T})}{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})} \\ &= \mathbb{E}_{q(\mathbf{x}_{0:T})}\Big[\log \frac{q(\mathbf{x}_{1:T} \vert \mathbf{x}_{0})}{p_\theta(\mathbf{x}_{0:T})} \Big] = L_\text{VLB} \end{aligned}


To convert each term in the equation to be analytically computable, the objective can be further rewritten to be a combination of several KL-divergence and entropy terms (See the detailed step-by-step process in Appendix B in Sohl-Dickstein et al., 2015):

\begin{aligned} L_\text{VLB} &= \mathbb{E}_{q(\mathbf{x}_{0:T})} \Big[ \log\frac{q(\mathbf{x}_{1:T}\vert\mathbf{x}_0)}{p_\theta(\mathbf{x}_{0:T})} \Big] \\ &= \mathbb{E}_q \Big[ \log\frac{\prod_{t=1}^T q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{ p_\theta(\mathbf{x}_T) \prod_{t=1}^T p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t) } \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=1}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t\vert\mathbf{x}_{t-1})}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \Big( \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)}\cdot \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1}\vert\mathbf{x}_0)} \Big) + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_t \vert \mathbf{x}_0)}{q(\mathbf{x}_{t-1} \vert \mathbf{x}_0)} + \log\frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big] \\ &= \mathbb{E}_q \Big[ -\log p_\theta(\mathbf{x}_T) + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} + \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{q(\mathbf{x}_1 \vert \mathbf{x}_0)} + \log \frac{q(\mathbf{x}_1 \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)} \Big]\\ &= \mathbb{E}_q \Big[ \log\frac{q(\mathbf{x}_T \vert \mathbf{x}_0)}{p_\theta(\mathbf{x}_T)} + \sum_{t=2}^T \log \frac{q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)}{p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t)} - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1) \Big] \\ &= \mathbb{E}_q [\underbrace{D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T))}_{L_T} + \sum_{t=2}^T \underbrace{D_\text{KL}(q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_{t-1} \vert\mathbf{x}_t))}_{L_{t-1}} \underbrace{- \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1)}_{L_0} ] \end{aligned}

Let’s label each component in the variational lower bound loss separately:

\begin{aligned} L_\text{VLB} &= L_T + L_{T-1} + \dots + L_0 \\ \text{where } L_T &= D_\text{KL}(q(\mathbf{x}_T \vert \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_T)) \\ L_t &= D_\text{KL}(q(\mathbf{x}_t \vert \mathbf{x}_{t+1}, \mathbf{x}_0) \parallel p_\theta(\mathbf{x}_t \vert\mathbf{x}_{t+1})) \text{ for }1 \leq t \leq T-1 \\ L_0 &= - \log p_\theta(\mathbf{x}_0 \vert \mathbf{x}_1) \end{aligned}

Every KL term in $$L_\text{VLB}$$ (except for $$L_0$$) compares two Gaussian distributions, so it can be computed in closed form. $$L_T$$ is constant and can be ignored during training because $$q$$ has no learnable parameters and $$\mathbf{x}_T$$ is Gaussian noise. Ho et al. (2020) model $$L_0$$ using a separate discrete decoder derived from $$\mathcal{N}(\mathbf{x}_0; \boldsymbol{\mu}_\theta(\mathbf{x}_1, 1), \boldsymbol{\Sigma}_\theta(\mathbf{x}_1, 1))$$.
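The closed-form KL between two diagonal Gaussians is the workhorse here; a small helper (a generic sketch, not the authors' code) illustrates it:

```python
import numpy as np

def gaussian_kl(mu1, var1, mu2, var2):
    """KL( N(mu1, diag(var1)) || N(mu2, diag(var2)) ) in nats."""
    return 0.5 * np.sum(np.log(var2 / var1) + (var1 + (mu1 - mu2) ** 2) / var2 - 1.0)

# identical Gaussians have zero divergence; shifting the mean by 1 in each of
# 3 unit-variance dimensions costs 3 * 1/2 = 1.5 nats
zero = gaussian_kl(np.zeros(3), np.ones(3), np.zeros(3), np.ones(3))
shifted = gaussian_kl(np.ones(3), np.ones(3), np.zeros(3), np.ones(3))
```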

### Parameterization of $$L_t$$ for Training Loss

Recall that we need to learn a neural network to approximate the conditional probability distributions in the reverse diffusion process, $$p_\theta(\mathbf{x}_{t-1} \vert \mathbf{x}_t) = \mathcal{N}(\mathbf{x}_{t-1}; \boldsymbol{\mu}_\theta(\mathbf{x}_t, t), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t))$$. We would like to train $$\boldsymbol{\mu}_\theta$$ to predict $$\tilde{\boldsymbol{\mu}}_t = \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \mathbf{z}_t \Big)$$. Because $$\mathbf{x}_t$$ is available as input at training time, we can reparameterize the Gaussian noise term instead to make the network predict $$\mathbf{z}_t$$ from the input $$\mathbf{x}_t$$ at time step $$t$$:

\begin{aligned} \boldsymbol{\mu}_\theta(\mathbf{x}_t, t) &= \color{cyan}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \mathbf{z}_\theta(\mathbf{x}_t, t) \Big)} \\ \text{Thus }\mathbf{x}_{t-1} &\sim \mathcal{N}(\mathbf{x}_{t-1}; \frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \mathbf{z}_\theta(\mathbf{x}_t, t) \Big), \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)) \end{aligned}

The loss term $$L_t$$ is parameterized to minimize the difference from $$\tilde{\boldsymbol{\mu}}$$:

\begin{aligned} L_t &= \mathbb{E}_{\mathbf{x}_0, \mathbf{z}} \Big[\frac{1}{2 \| \boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) \|^2_2} \| \color{blue}{\tilde{\boldsymbol{\mu}}_t(\mathbf{x}_t, \mathbf{x}_0)} - \color{green}{\boldsymbol{\mu}_\theta(\mathbf{x}_t, t)} \|^2 \Big] \\ &= \mathbb{E}_{\mathbf{x}_0, \mathbf{z}} \Big[\frac{1}{2 \|\boldsymbol{\Sigma}_\theta \|^2_2} \| \color{blue}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \mathbf{z}_t \Big)} - \color{green}{\frac{1}{\sqrt{\alpha_t}} \Big( \mathbf{x}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}} \mathbf{z}_\theta(\mathbf{x}_t, t) \Big)} \|^2 \Big] \\ &= \mathbb{E}_{\mathbf{x}_0, \mathbf{z}} \Big[\frac{ \beta_t^2 }{2 \alpha_t (1 - \bar{\alpha}_t) \| \boldsymbol{\Sigma}_\theta \|^2_2} \|\mathbf{z}_t - \mathbf{z}_\theta(\mathbf{x}_t, t)\|^2 \Big] \\ &= \mathbb{E}_{\mathbf{x}_0, \mathbf{z}} \Big[\frac{ \beta_t^2 }{2 \alpha_t (1 - \bar{\alpha}_t) \| \boldsymbol{\Sigma}_\theta \|^2_2} \|\mathbf{z}_t - \mathbf{z}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\mathbf{z}_t, t)\|^2 \Big] \end{aligned}

#### Simplification


Empirically, Ho et al. (2020) found that training the diffusion model works better with a simplified objective that ignores the weighting term:

$L_t^\text{simple} = \mathbb{E}_{\mathbf{x}_0, \mathbf{z}_t} \Big[\|\mathbf{z}_t - \mathbf{z}_\theta(\sqrt{\bar{\alpha}_t}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\mathbf{z}_t, t)\|^2 \Big]$

The final simple objective is:

$L_\text{simple} = L_t^\text{simple} + C$

where $$C$$ is a constant not depending on $$\theta$$.
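Put together, the training step of Algorithm 1 in Ho et al. (2020) reduces to: sample $$t$$, sample noise $$\mathbf{z}$$, noise $$\mathbf{x}_0$$ in closed form, and regress the network output onto $$\mathbf{z}$$. A sketch with a stand-in `model` callable (the real model is a neural network; the zero predictor below is only a placeholder):

```python
import numpy as np

betas = np.linspace(1e-4, 0.02, 1000)
alphas_bar = np.cumprod(1.0 - betas)

def simple_loss(model, x0, rng):
    """One Monte Carlo estimate of L_simple for a noise-prediction model."""
    t = int(rng.integers(len(betas)))                 # t ~ Uniform over timesteps
    z = rng.standard_normal(x0.shape)                 # z ~ N(0, I)
    xt = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * z
    return np.mean((z - model(xt, t)) ** 2)           # ||z - z_theta(x_t, t)||^2

rng = np.random.default_rng(0)
x0 = rng.standard_normal(16)
loss = simple_loss(lambda xt, t: np.zeros_like(xt), x0, rng)  # placeholder "network"
```

In a real training loop this loss would be backpropagated through `model`; here it only shows the sampling structure of the objective.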

Fig. 4. The training and sampling algorithms in DDPM (Image source: Ho et al. 2020)

#### Connection with noise-conditioned score networks (NCSN)

Song & Ermon (2019) proposed a score-based generative modeling method where samples are produced via Langevin dynamics using gradients of the data distribution estimated with score matching. The score of the density at each sample $$\mathbf{x}$$ is defined as the gradient $$\nabla_{\mathbf{x}} \log p(\mathbf{x})$$, and a score network $$s_\theta: \mathbb{R}^D \to \mathbb{R}^D$$ is trained to estimate it. To make it scalable to high-dimensional data in the deep learning setting, they proposed to use either denoising score matching (add a pre-specified small noise to the data; Vincent, 2011) or sliced score matching (use random projections; Song et al., 2019).

Recall that Langevin dynamics can sample data points from a probability density distribution using only the score $$\nabla_{\mathbf{x}} \log p(\mathbf{x})$$ in an iterative process.

However, according to the manifold hypothesis, most data is expected to concentrate on a low-dimensional manifold, even though the observed data might live in an arbitrarily high-dimensional space. This hurts score estimation, since the data points cannot cover the whole space and the score estimate is unreliable in regions of low data density. Adding a small Gaussian noise makes the perturbed data distribution cover the full space $$\mathbb{R}^D$$, which stabilizes the training of the score estimator network. Song & Ermon (2019) improved on this by perturbing the data with noise of different levels and training a noise-conditioned score network to jointly estimate the scores of the perturbed data at all noise levels.

The schedule of increasing noise levels resembles the forward diffusion process.

### Parameterization of $$\beta_t$$

The forward variances are set to a sequence of linearly increasing constants in Ho et al. (2020), from $$\beta_1=10^{-4}$$ to $$\beta_T=0.02$$. They are relatively small compared to the normalized image pixel values in $$[-1, 1]$$. Diffusion models in their experiments produced high-quality samples but still could not achieve a model log-likelihood competitive with other generative models.

Nichol & Dhariwal (2021) proposed several improvement techniques to help diffusion models obtain a lower NLL. One of the improvements is to use a cosine-based variance schedule. The choice of scheduling function can be arbitrary, as long as it provides a near-linear drop in the middle of the process and subtle changes around $$t=0$$ and $$t=T$$.

$\beta_t = \text{clip}(1-\frac{\bar{\alpha}_t}{\bar{\alpha}_{t-1}}, 0.999) \quad\bar{\alpha}_t = \frac{f(t)}{f(0)}\quad\text{where }f(t)=\cos\Big(\frac{t/T+s}{1+s}\cdot\frac{\pi}{2}\Big)^2$

where the small offset $$s$$ is there to prevent $$\beta_t$$ from being too small near $$t=0$$.
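A sketch of the cosine schedule (with $$s=0.008$$ and a clip at 0.999, following Nichol & Dhariwal, 2021):

```python
import numpy as np

def cosine_betas(T, s=0.008, max_beta=0.999):
    """beta_t = clip(1 - abar_t / abar_{t-1}, max_beta) with abar_t = f(t)/f(0)."""
    f = lambda t: np.cos((t / T + s) / (1 + s) * np.pi / 2) ** 2
    t = np.arange(T + 1)
    alphas_bar = f(t) / f(0)
    return np.clip(1.0 - alphas_bar[1:] / alphas_bar[:-1], 0.0, max_beta)

betas = cosine_betas(1000)
# betas stay tiny near t = 0 and reach the clip value near t = T
```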

Fig. 5. Comparison of linear and cosine-based scheduling of $$\beta_t$$ during training. (Image source: Nichol & Dhariwal, 2021)

### Parameterization of reverse process variance $$\boldsymbol{\Sigma}_\theta$$

Ho et al. (2020) chose to fix $$\beta_t$$ as constants instead of making them learnable and set $$\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) = \sigma^2_t \mathbf{I}$$, where $$\sigma_t$$ is not learned but set to $$\beta_t$$ or $$\tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t$$, because they found that learning a diagonal variance $$\boldsymbol{\Sigma}_\theta$$ leads to unstable training and poorer sample quality.

Nichol & Dhariwal (2021) proposed to learn $$\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t)$$ as an interpolation between $$\beta_t$$ and $$\tilde{\beta}_t$$ by having the model predict a mixing vector $$\mathbf{v}$$:

$\boldsymbol{\Sigma}_\theta(\mathbf{x}_t, t) = \exp(\mathbf{v} \log \beta_t + (1-\mathbf{v}) \log \tilde{\beta}_t)$

However, the simple objective $$L_\text{simple}$$ does not depend on $$\boldsymbol{\Sigma}_\theta$$. To add the dependency, they constructed a hybrid objective $$L_\text{hybrid} = L_\text{simple} + \lambda L_\text{VLB}$$, where $$\lambda=0.001$$ is small, and applied a stop-gradient to $$\boldsymbol{\mu}_\theta$$ in the $$L_\text{VLB}$$ term so that $$L_\text{VLB}$$ only guides the learning of $$\boldsymbol{\Sigma}_\theta$$. Empirically they observed that $$L_\text{VLB}$$ is quite challenging to optimize, likely due to noisy gradients, so they proposed using a time-averaged smoothed version of $$L_\text{VLB}$$ with importance sampling.

Fig. 6. Comparison of negative log-likelihood of improved DDPM with other likelihood-based generative models. NLL is reported in the unit of bits/dim. (Image source: Nichol & Dhariwal, 2021)

## Speed up Diffusion Model Sampling

It is very slow to generate a sample from DDPM by following the Markov chain of the reverse diffusion process, as $$T$$ can be up to one or a few thousand steps. One data point from Song et al. 2020: “For example, it takes around 20 hours to sample 50k images of size 32 × 32 from a DDPM, but less than a minute to do so from a GAN on an Nvidia 2080 Ti GPU.”


One simple way is to run a strided sampling schedule (Nichol & Dhariwal, 2021) by taking the sampling update every $$\lceil T/S \rceil$$ steps to reduce the process from $$T$$ to $$S$$ steps. The new sampling schedule for generation is $$\{\tau_1, \dots, \tau_S\}$$ where $$\tau_1 < \tau_2 < \dots <\tau_S \in [1, T]$$ and $$S < T$$.
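Generating such a strided schedule is a one-liner; here is an illustrative sketch (the function name is mine):

```python
def strided_schedule(T, S):
    """Select S of the T training timesteps with stride ceil(T/S),
    in ascending order, for faster sampling."""
    stride = -(-T // S)  # ceiling division: ceil(T / S)
    return list(range(1, T + 1, stride))
```

For example, `strided_schedule(1000, 250)` keeps every 4th timestep, cutting the number of reverse steps by 4x.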

For another approach, let’s rewrite $$q_\sigma(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0)$$ to be parameterized by a desired standard deviation $$\sigma_t$$ according to the nice property:

\begin{aligned} \mathbf{x}_{t-1} &= \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1}}\mathbf{z}_{t-1} \\ &= \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \mathbf{z}_t + \sigma_t\mathbf{z} \\ &= \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0}{\sqrt{1 - \bar{\alpha}_t}} + \sigma_t\mathbf{z} \\ q_\sigma(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) &= \mathcal{N}(\mathbf{x}_{t-1}; \sqrt{\bar{\alpha}_{t-1}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{t-1} - \sigma_t^2} \frac{\mathbf{x}_t - \sqrt{\bar{\alpha}_t}\mathbf{x}_0}{\sqrt{1 - \bar{\alpha}_t}}, \sigma_t^2 \mathbf{I}) \end{aligned}


Recall that $$q(\mathbf{x}_{t-1} \vert \mathbf{x}_t, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{t-1}; \tilde{\boldsymbol{\mu}}(\mathbf{x}_t, \mathbf{x}_0), \tilde{\beta}_t \mathbf{I})$$; matching the variances therefore gives:

$\tilde{\beta}_t = \sigma_t^2 = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \cdot \beta_t$


Let $$\sigma_t^2 = \eta \cdot \tilde{\beta}_t$$ such that we can adjust $$\eta \in \mathbb{R}^+$$ as a hyperparameter to control the sampling stochasticity. The special case of $$\eta = 0$$ makes the sampling process deterministic. Such a model is named the denoising diffusion implicit model (DDIM; Song et al., 2020). DDIM has the same marginal noise distribution but deterministically maps noise back to the original data samples.
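One reverse update under this parameterization can be sketched as follows, given the model's noise prediction $$\epsilon_\theta(\mathbf{x}_t, t)$$. This is an illustrative NumPy version (function name mine), not the authors' implementation:

```python
import numpy as np

def ddim_step(x_t, eps_pred, alpha_bar_t, alpha_bar_prev, eta=0.0, rng=None):
    """One reverse update following the q_sigma parameterization above.

    eta = 0 gives the deterministic DDIM sampler; eta = 1 recovers the
    DDPM posterior variance tilde{beta}_t.
    """
    # Predict x_0 from x_t and the model-predicted noise eps_pred.
    x0_pred = (x_t - np.sqrt(1 - alpha_bar_t) * eps_pred) / np.sqrt(alpha_bar_t)
    # sigma_t = eta * sqrt(tilde{beta}_t), with beta_t = 1 - alpha_bar_t / alpha_bar_prev.
    sigma = eta * np.sqrt((1 - alpha_bar_prev) / (1 - alpha_bar_t)
                          * (1 - alpha_bar_t / alpha_bar_prev))
    # "Direction pointing to x_t" term, plus optional fresh noise.
    direction = np.sqrt(1 - alpha_bar_prev - sigma ** 2) * eps_pred
    if eta == 0:
        noise = 0.0
    else:
        noise = sigma * (rng or np.random.default_rng()).standard_normal(x_t.shape)
    return np.sqrt(alpha_bar_prev) * x0_pred + direction + noise
```

With `eta=0` the noise term vanishes, so repeated runs from the same latent produce identical samples.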

During generation, we only sample a subset of $$S$$ diffusion steps $$\{\tau_1, \dots, \tau_S\}$$ and the inference process becomes:

$q_{\sigma, \tau}(\mathbf{x}_{\tau_{i-1}} \vert \mathbf{x}_{\tau_i}, \mathbf{x}_0) = \mathcal{N}(\mathbf{x}_{\tau_{i-1}}; \sqrt{\bar{\alpha}_{\tau_{i-1}}}\mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_{\tau_{i-1}} - \sigma_{\tau_i}^2} \frac{\mathbf{x}_{\tau_i} - \sqrt{\bar{\alpha}_{\tau_i}}\mathbf{x}_0}{\sqrt{1 - \bar{\alpha}_{\tau_i}}}, \sigma_{\tau_i}^2 \mathbf{I})$


While all the models are trained with $$T=1000$$ diffusion steps in the experiments, they observed that DDIM ($$\eta=0$$) can produce the best quality samples when $$S$$ is small, while DDPM ($$\eta=1$$) performs much worse on small $$S$$. DDPM does perform better when we can afford to run the full reverse Markov diffusion steps ($$S=T=1000$$). With DDIM, it is possible to train the diffusion model up to any arbitrary number of forward steps but only sample from a subset of steps in the generative process.

Fig. 7. FID scores on CIFAR10 and CelebA datasets by diffusion models of different settings, including $$\color{cyan}{\text{DDIM}}$$ ($$\eta=0$$) and $$\color{orange}{\text{DDPM}}$$ ($$\hat{\sigma}$$). (Image source: Song et al., 2020)

Compared to DDPM, DDIM is able to:

1. Generate higher-quality samples using far fewer sampling steps.
2. Exhibit a “consistency” property: since the generative process is deterministic, multiple samples conditioned on the same latent variable share similar high-level features.
3. Perform semantically meaningful interpolation in the latent variable, thanks to this consistency.

## Conditioned Generation

While training generative models on ImageNet data, it is common to generate samples conditioned on class labels. To explicitly incorporate class information into the diffusion process, Dhariwal & Nichol (2021) trained a classifier $$f_\phi(y \vert \mathbf{x}_t, t)$$ on noisy images $$\mathbf{x}_t$$ and used the gradients $$\nabla_\mathbf{x} \log f_\phi(y \vert \mathbf{x}_t, t)$$ to guide the diffusion sampling process toward the target class label $$y$$. Their ablated diffusion model (ADM) and the one with additional classifier guidance (ADM-G) achieve better results than SOTA generative models (e.g. BigGAN).
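The core of classifier guidance for the DDPM sampler is a shift of the reverse-step Gaussian mean by the classifier gradient, scaled by the step variance. A minimal sketch (the function name and the explicit `scale` argument are mine; the gradient itself would come from backpropagating through the classifier):

```python
import numpy as np

def classifier_guided_mean(mu, sigma2, grad_log_fy, scale=1.0):
    """Shift the reverse-step mean toward class y:
    mu + s * Sigma * grad_x log f_phi(y | x_t), as in Dhariwal & Nichol (2021)."""
    return mu + scale * sigma2 * grad_log_fy
```

The sample for step $$t-1$$ is then drawn from a Gaussian centered at this shifted mean with the original variance.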

Fig. 8. The algorithms use guidance from a classifier to run conditioned generation with DDPM and DDIM. (Image source: Dhariwal & Nichol, 2021)

Additionally, with some modifications to the UNet architecture, Dhariwal & Nichol (2021) showed performance better than GAN with diffusion models. The architecture modifications include larger model depth/width, more attention heads, multi-resolution attention, BigGAN residual blocks for up/downsampling, rescaling residual connections by $$1/\sqrt{2}$$, and adaptive group normalization (AdaGN).

## Quick Summary

• Pros: Tractability and flexibility are two conflicting objectives in generative modeling. Tractable models can be analytically evaluated and cheaply fit to data (e.g. via a Gaussian or Laplace), but they cannot easily describe the structure in rich datasets. Flexible models can fit arbitrary structures in data, but evaluating, training, or sampling from these models is usually expensive. Diffusion models are both analytically tractable and flexible.

• Cons: Diffusion models rely on a long Markov chain of diffusion steps to generate samples, so it can be quite expensive in terms of time and compute. New methods have been proposed to make the process much faster, but the sampling is still slower than GAN.

Cited as:

@article{weng2021diffusion,
title   = "What are diffusion models?",
author  = "Weng, Lilian",
journal = "lilianweng.github.io/lil-log",
year    = "2021",
url     = "https://lilianweng.github.io/lil-log/2021/07/11/diffusion-models.html"
}


## References

[1] Jascha Sohl-Dickstein et al. “Deep Unsupervised Learning using Nonequilibrium Thermodynamics.” ICML 2015.

[2] Max Welling & Yee Whye Teh. “Bayesian learning via stochastic gradient Langevin dynamics.” ICML 2011.

[3] Yang Song & Stefano Ermon. “Generative modeling by estimating gradients of the data distribution.” NeurIPS 2019.

[4] Yang Song & Stefano Ermon. “Improved techniques for training score-based generative models.” NeurIPS 2020.

[5] Jonathan Ho et al. “Denoising diffusion probabilistic models.” arXiv preprint arXiv:2006.11239 (2020). [code]

[6] Jiaming Song et al. “Denoising diffusion implicit models.” arXiv preprint arXiv:2010.02502 (2020). [code]

[7] Alex Nichol & Prafulla Dhariwal. “Improved denoising diffusion probabilistic models.” arXiv preprint arXiv:2102.09672 (2021). [code]

[8] Prafulla Dhariwal & Alex Nichol. “Diffusion Models Beat GANs on Image Synthesis.” arXiv preprint arXiv:2105.05233 (2021). [code]


## Atomic models

There has been a variety of atomic models throughout the history of atomic physics, mainly in the period from the beginning of the 19th century to the first half of the 20th century, when the model of the atom that is used today (or accepted as the most accurate one) was developed. Although awareness of the atom's existence goes back to antiquity (the Greek conception of the atom), this article focuses on five basic atomic models, each of which has contributed to how we perceive the structure of the atom itself: Dalton's billiard ball model, J. J. Thomson's “plum pudding” model, Rutherford's planetary model, Bohr's atomic model, and the electron cloud model/quantum mechanics model.

### John Dalton’s atomic model

Illustration of Dalton’s perception of the atom

John Dalton was an English scientist who came up with the idea that all matter is composed of very small particles. It was the first complete attempt to describe all matter in terms of particles. He called these particles atoms and formed an atomic theory, in which he claims that:

• All matter is made of atoms. Atoms are indivisible and indestructible
• All atoms of a given element are identical in mass and properties
• Compounds are formed by a combination of two or more different kinds of atoms
• A chemical reaction is a rearrangement of atoms

Parts of his theory had to be modified based on the discovery of subatomic particles and isotopes. We now also know that atoms are not indivisible, because they are made up of neutrons, electrons and protons.

Illustration of Thomson’s perception of the atom

### Plum pudding model

After the discovery of the electron in 1897, people realised that atoms are made up of even smaller particles. Shortly after, in 1904, J. J. Thomson proposed his famous “plum pudding model”. In this model, atoms were known to contain negatively charged electrons, but the atomic nucleus had not been discovered yet.
Thomson knew that the atom had an overall neutral charge, so he reasoned that something must counterbalance the negative charge of the electrons. He proposed that the negative particles float within a soup of diffuse positive charge. His model is often called the plum pudding model because of its similarity to a popular English dessert.

### Rutherford’s model of the atom

Rutherford was the first to suggest that Thomson’s plum pudding model was incorrect. His new model introduced the nucleus to atomic theory. The nucleus carries a relatively high central charge concentrated in a very small volume, which also contains the bulk of the atom's mass. The nucleus is surrounded by lighter, negatively charged electrons. His model is sometimes known as the planetary model of the atom.
However, there were still some major problems with this model. For example, Rutherford could not explain why atoms only emit light at certain frequencies. This problem was solved later by the Danish physicist Niels Henrik David Bohr.

### Bohr’s model of the atom

The Bohr model describes the atom as a positively charged nucleus surrounded by electrons. Electrons travel in circular orbits, with the attraction provided by electrostatic forces. The normally occupied energy level of the electron is called the ground state. The electron can move to a less stable level by absorbing energy; this higher-energy level is called the excited state. The electron can return to its original level by releasing the energy. All in all, when an electron jumps between orbits, the jump is accompanied by an emitted or absorbed amount of energy (hν).

### Electron Cloud Model/Quantum Mechanics Model of Atom

The quantum mechanics model of the atom is nowadays taught as the most “realistic” atomic model, describing atomic mechanisms as present science presumes they work. It came to exist as a result of a combination of a number of scientific assumptions:

1. All particles can be perceived as matter waves with a wavelength. (Louis de Broglie)
2. Resulting from the previous assumption, atomic model which treats electrons also as matter waves was proposed. (Erwin Schrödinger, quantum mechanical atomic model emerged from the solution of Schrödinger’s equation for electron in central electrical field of nucleus.)
3. Principle of uncertainty states that we can’t know both the energy and position of an electron. Therefore, as we learn more about the electron’s position, we know less about its energy, and vice versa. (Werner Heisenberg)
4. There exists more than one energy level of electron in the atom. Electrons are assigned certain atomic orbitals, that can differ from one another in energy. (Niels Bohr)
5. Electrons have an intrinsic property called spin, and an electron can have one of two possible spin values: spin-up or spin-down. Any two electrons occupying the same orbital must have opposite spins. (the Stern-Gerlach Experiment)

### Basic description of the quantum mechanical atomic model

Quantum mechanics proposes that electrons move around the nucleus not on specifically defined electron paths (as we saw e.g. in Rutherford's planetary atomic model), but within a certain three-dimensional space (the atomic orbital), in which their occurrence has a certain probability, meaning their position cannot be calculated with 100% accuracy.

Four numbers, called quantum numbers, were introduced to describe the characteristics of electrons and their orbitals.

#### Quantum numbers

##### Principal quantum number: n
• describes the energy level of the electron in an atom (it also describes the average distance of the orbital from the nucleus)
• It has positive whole-number values: 1, 2, 3, 4, … (theoretically the numbers could go to infinity; in practice there are 7 known energy levels). It is sometimes written with capital letters instead of numbers, beginning with K (K, L, M, N, …)
• the n value describes the size of the orbital
##### Angular momentum quantum number: l
• basically describes the shape of the orbital
• This number is limited by the principal quantum number. Its value goes from 0 to n-1. For example, for orbitals with principal quantum number n=2 there can be 2 different shapes of orbitals (2 different values, l=0 and l=1)
• for every number there exists a letter describing the shape of the orbital, as shown in the table below

| Value of l (subshell) | Letter |
|---|---|
| 0 | s |
| 1 | p |
| 2 | d |
| 3 | f |
| 4 | g |

Orbital shapes and their orientation for different angular momentum and magnetic numbers

##### Magnetic quantum number: ml
• describes how the different shapes of orbitals are oriented in space. Its value runs from -l through 0 to +l. For example, for l=1 there exist 3 values, ml = -1, 0, +1, meaning that the shape of that orbital can be oriented in 3 different ways in space.

| Value of l | Values of ml |
|---|---|
| 1 | -1, 0, +1 |
| 3 | -3, -2, -1, 0, +1, +2, +3 |

##### Spin quantum number: ms
• describes the direction in which an electron spins in a magnetic field. That can be either clockwise or counterclockwise, and as a result only 2 values are allowed: -1/2 and +1/2.
• One consequence of electron spin is that a maximum of two electrons can occupy any given orbital, and the two electrons occupying the same orbital must have opposite spin. This is also called the Pauli exclusion principle.
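The four rules above pin down exactly which (n, l, ml, ms) combinations are allowed, and their count reproduces the familiar 2n² shell capacity. A short illustrative sketch (the function name is mine):

```python
def allowed_states(n):
    """List every (n, l, ml, ms) combination permitted by the
    quantum-number rules for a given principal quantum number n."""
    return [(n, l, ml, ms)
            for l in range(n)              # l = 0 .. n-1
            for ml in range((-l), l + 1)   # ml = -l .. +l
            for ms in (-0.5, +0.5)]        # two spin values
```

For n=1 this yields 2 states, for n=2 it yields 8, and for n=3 it yields 18, matching 2n².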

#### Principles of atom structure

Based on our knowledge of quantum numbers, we can now build the electron configuration of an atom, which describes the electron arrangement in the atom. Apart from the Pauli exclusion principle, there are 2 other rules that we must follow:

• Aufbau principle: Each electron occupies the lowest energy orbital available.

Basic ilustration of the Aufbau principle

• Hund’s rule: every orbital in a sublevel is occupied by a single electron, all with the same spin, before any orbital pairs up electrons with opposite spins.

Hund’s Rule of Maximum Spin Multiplicity

Examples of atoms described by quantum numbers
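The filling order illustrated by the Aufbau diagram can also be generated programmatically. This sketch assumes the standard Madelung (n + l) ordering rule, which formalizes the diagram's diagonal arrows; the function name is mine:

```python
def aufbau_order(max_n=4):
    """Order subshells by the Madelung rule: lower n + l fills first,
    with ties broken by lower n."""
    letters = "spdfg"
    subshells = [(n, l) for n in range(1, max_n + 1) for l in range(n)]
    subshells.sort(key=lambda nl: (nl[0] + nl[1], nl[0]))
    return [f"{n}{letters[l]}" for n, l in subshells]
```

This reproduces the familiar sequence 1s, 2s, 2p, 3s, 3p, 4s, 3d, … in which 4s fills before 3d.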

The importance of the quantum mechanical atomic model has 2 main aspects. First, being able to build the electron structure of atoms of specific substances helps us understand how those atoms interact in molecules, bringing us one step closer to a more detailed description of the attributes of those substances. Second, it leaves an open door to further theories which could expand our knowledge and perception of the world and universe surrounding us.

• What are the 6 models of the atom? | Socratic. (2015, May 23). Retrieved October 29, 2017, from https://socratic.org/questions/what-are-the-6-models-of-the-atom
• Kvantově mechanický model atomu. (n.d.). Retrieved October 29, 2017, from http://www.dobreznamky.cz/kvantove-mechanicky-model-atomu/
• Stavba atomového obalu. (2008, October 4). Retrieved October 29, 2017, from https://www.odmaturuj.cz/fyzika/stavba-atomoveho-obalu/
• The quantum mechanical model of the atom. (n.d.). Retrieved October 29, 2017, from https://www.khanacademy.org/science/physics/quantum-physics/quantum-numbers-and-orbitals/a/the-quantum-mechanical-model-of-the-atom
• Benešová, M., & Satrapová, H. (2002). Odmaturuj z chemie. Didaktis.
• Plum pudding model. (2017, October 12). Retrieved October 29, 2017, from https://en.wikipedia.org/wiki/Plum_pudding_model
• Bohr model. (2017, October 19). Retrieved October 29, 2017, from https://en.wikipedia.org/wiki/Bohr_model
• The Editors of Encyclopædia Britannica. (2014, June 05). Bohr atomic model. Retrieved October 29, 2017, from https://www.britannica.com/science/Bohr-atomic-model
• Orbital shapes and their orientation for different angular momentum and magnetic numbers – MEFANET, síť lékařských fakult ČR a SR. (n.d.). Orbital. Retrieved October 29, 2017, from http://www.wikiskripta.eu/w/Orbital#/media/File:Single_electron_orbitals.jpg
• Ilustration of Thomson’s perception of atom – W. (n.d.). Retrieved October 29, 2017, from https://www.meritnation.com/ask-answer/question/why-is-thomsons-model-also-known-as-plum-pudding-model/structure-of-the-atom/1299876
• Basic ilustration of the Aufbau principle – Electronic configuration. (n.d.). Retrieved October 29, 2017, from http://www.chemie.utb.cz/rvicha/Sac/vystprincip.html
• Hund’s Rule of Maximum Spin Multiplicity. (n.d.). Retrieved October 29, 2017, from http://www.eurekasparks.org/2015/07/hunds-rule-of-maximum-spin-multiplicity.html
• Examples of atoms described by quantum numbers – (n.d.). Retrieved October 29, 2017, from http://chemie-obecna.blogspot.cz/2011/08/radioaktivita.html
