# Summary for Differential Dynamic Programming with Temporally Decomposed Dynamics

This is a summary by David Qiu (david@davidqiu.com) of the paper *Differential Dynamic Programming with Temporally Decomposed Dynamics* by Dr. Akihiko Yamaguchi (CMU) and Prof. Christopher G. Atkeson (CMU).

## Problem Formulation

In this paper, it is assumed that we have a previously learnt dynamical model. Such a model may be learnt from samples collected in earlier experiments, which could be conducted with random parameters to gather dynamics data. This dynamical model is formally expressed as

$$F_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right) \rightarrow \boldsymbol{x}_{n+1}$$

where $0 \le n \le N$ is a finite time step of the dynamical system, $\boldsymbol{x}_{n} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| }$ is the state of the dynamical system at time $n$, a normally distributed random variable $\boldsymbol{x}_{n} \sim \mathcal{N} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n} \right)$ with mean value $\overline{\boldsymbol{x}}_{n} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| }$ and covariance matrix $\boldsymbol{\Sigma}_{n} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| \times \left\| \boldsymbol{x} \right\| }$, $\boldsymbol{a}_{n} \in \mathbb{R}^{ \left\| \boldsymbol{a} \right\| }$ is the action taken at time $n$, $\boldsymbol{x}_{n+1} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| }$ is likewise the state of the dynamical system at time $n+1$, a normally distributed random variable $\boldsymbol{x}_{n+1} \sim \mathcal{N} \left( \overline{\boldsymbol{x}}_{n+1}, \boldsymbol{\Sigma}_{n+1} \right)$ with mean value $\overline{\boldsymbol{x}}_{n+1} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| }$ and covariance matrix $\boldsymbol{\Sigma}_{n+1} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| \times \left\| \boldsymbol{x} \right\| }$, and $F_{n} : \left( \mathbb{R}^{ \left\| \boldsymbol{x} \right\| }, \mathbb{R}^{ \left\| \boldsymbol{a} \right\| } \right) \rightarrow \mathbb{R}^{ \left\| \boldsymbol{x} \right\| }$ is the dynamics at time $n$ that transforms the dynamical system from the current state $\boldsymbol{x}_{n}$ to the following state $\boldsymbol{x}_{n+1}$ after taking action $\boldsymbol{a}_{n}$.

The reward function is defined manually so as to instruct the controller to accomplish some tasks, which is formally expressed as

$$R_{n} \left( \boldsymbol{x}_{n} \right) \rightarrow r_{n}$$

where $\boldsymbol{x}_{n} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| }$ is the dynamical system state at time $n$, $r_{n} \in \mathbb{R}$ is the reward gained after reaching state $\boldsymbol{x}_{n}$, and $R_{n} : \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } \rightarrow \mathbb{R}$ is the reward function at time $n$ that defines the reward with respect to a specific state.

The evaluation function (reward-to-gain function) at time $n$ is formally defined as

$$J_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{U}_{n} \right) \doteq \sum_{n'=n+1}^{N} \mathbb{E}_{x_{n'}} \left[ R_{n'} \left( \boldsymbol{x}_{n'} \right) \right]$$

where $\boldsymbol{x}_{n} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| }$ is the dynamical system state at time $n$, $\boldsymbol{U}_{n} \equiv \left\{ \boldsymbol{a}_{n}, \boldsymbol{a}_{n+1}, \dots, \boldsymbol{a}_{N-1} \right\}$ is the action sequence (control sequence) from time $n$ to the last action time $N-1$, $R_{n} : \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } \rightarrow \mathbb{R}$ is the reward function at time $n$, and $J_{n} : \left( \mathbb{R}^{ \left\| \boldsymbol{x} \right\| }, \{ \mathbb{R}^{ \left\| \boldsymbol{a} \right\| } \}^{ N-1-n } \right) \rightarrow \mathbb{R}$ is the function that evaluates the reward-to-gain from state $\boldsymbol{x}_{n}$ by taking an action sequence $\boldsymbol{U}_{n}$.

The purpose of dynamic programming in state $\boldsymbol{x}_{n}$ at time $n$ is to choose an action sequence $\boldsymbol{U}_{n}$ so as to maximize the evaluation function $J_{n}$:

$$\boldsymbol{U}_{n}^{*} \leftarrow \arg\max_{\boldsymbol{U}_{n}} J_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{U}_{n} \right)$$

where $\boldsymbol{U}_{n}^{*}$ is the optimal action sequence in state $\boldsymbol{x}_{n}$ at time $n$.

## Dynamics Estimation

If the dynamic system is deterministic with no noise, we can learn a model of the dynamics

$$F_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right) \rightarrow \boldsymbol{x}_{n+1}$$

that accurately describes the transition from state $\boldsymbol{x}_{n}$ to state $\boldsymbol{x}_{n+1}$ after taking action $\boldsymbol{a}_{n}$, as discussed in the section above. Unfortunately, the existence of process noise, sensor noise and uncertainty of the dynamical system (environment) makes it impossible to obtain such an accurate model. So we need to estimate the dynamics while accounting for noise and uncertainty. As given by the problem assumptions, the noise conforms to a normal distribution, so estimating the transition of the state mean value from $\overline{\boldsymbol{x}}_{n}$ to $\overline{\boldsymbol{x}}_{n+1}$ and the covariance matrix from $\boldsymbol{\Sigma}_{n}$ to $\boldsymbol{\Sigma}_{n+1}$ is equivalent to directly estimating the dynamics of the dynamical system. This is because we can reconstruct the dynamics function in Gaussian form if the mean value and the covariance matrix are given.

### Dynamics Mean Value Estimation

Formally, the state mean value transition estimation function is defined as

$$\tilde{F}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) \rightarrow \overline{\boldsymbol{x}}_{n+1}$$

where $\overline{\boldsymbol{x}}_{n} \in \mathbb{R}^{\left\| \boldsymbol{x} \right\|}$ is the mean value of the dynamic system state at time $n$, $\boldsymbol{\Sigma}_{n} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| \times \left\| \boldsymbol{x} \right\| }$ is the covariance matrix of the dynamic system state at time $n$, $\boldsymbol{a}_{n} \in \mathbb{R}^{\left\| \boldsymbol{a} \right\|}$ is the action taken at time $n$, $\overline{\boldsymbol{x}}_{n+1} \in \mathbb{R}^{\left\| \boldsymbol{x} \right\|}$ is the mean value of the dynamic system state at time $n+1$, and $\tilde{F}_{n} : \left( \mathbb{R}^{\left\| \boldsymbol{x} \right\|}, \mathbb{R}^{ \left\| \boldsymbol{x} \right\| \times \left\| \boldsymbol{x} \right\| }, \mathbb{R}^{\left\| \boldsymbol{a} \right\|} \right) \rightarrow \mathbb{R}^{\left\| \boldsymbol{x} \right\|}$ is the state mean value dynamics estimation function at time $n$ indicating the state mean value transition from state $\boldsymbol{x}_{n}$ with covariance matrix $\boldsymbol{\Sigma}_{n}$ to state $\boldsymbol{x}_{n+1}$ after taking an action $\boldsymbol{a}_{n}$.

### Dynamics Covariance Matrix Estimation

Similarly, the formal expression of the covariance matrix transition estimation function is defined as

$$S_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) \rightarrow \boldsymbol{\Sigma}_{n+1}$$

where $\overline{\boldsymbol{x}}_{n} \in \mathbb{R}^{\left\| \boldsymbol{x} \right\|}$ is the mean value of the dynamic system state at time $n$, $\boldsymbol{\Sigma}_{n} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| \times \left\| \boldsymbol{x} \right\| }$ is the covariance matrix of the dynamic system state at time $n$, $\boldsymbol{a}_{n} \in \mathbb{R}^{\left\| \boldsymbol{a} \right\|}$ is the action taken at time $n$, $\boldsymbol{\Sigma}_{n+1} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| \times \left\| \boldsymbol{x} \right\| }$ is the covariance matrix of the dynamic system state at time $n+1$, and $S_{n} : \left( \mathbb{R}^{\left\| \boldsymbol{x} \right\|}, \mathbb{R}^{ \left\| \boldsymbol{x} \right\| \times \left\| \boldsymbol{x} \right\| }, \mathbb{R}^{\left\| \boldsymbol{a} \right\|} \right) \rightarrow \mathbb{R}^{ \left\| \boldsymbol{x} \right\| \times \left\| \boldsymbol{x} \right\| }$ is the state covariance matrix dynamics estimation function at time $n$ indicating the state covariance matrix transition from state $\boldsymbol{x}_{n}$ to state $\boldsymbol{x}_{n+1}$ after taking an action $\boldsymbol{a}_{n}$.

### Dynamics Model

Practically, the above state mean value dynamics estimation function $\tilde{F}_{n}$ and state covariance matrix dynamics estimation function $S_{n}$ may take the form of a neural network or a linear regression model, trained with samples collected from experiments. In the paper, such a model is referred to as a (differentiable) process model.

If we have these two estimation functions, we can reconstruct the dynamics model formally as

$$F_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right) \rightarrow \boldsymbol{x}_{n+1} \sim \mathcal{N} \left( \tilde{F}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right), S_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) \right)$$

where $\overline{\boldsymbol{x}}_{n} \equiv \mathbb{E} \left[ \boldsymbol{x}_{n} \right]$ and $\boldsymbol{\Sigma}_{n} \equiv \mathrm{cov} \left[ \boldsymbol{x}_{n} \right]$ are the mean value and the covariance matrix of the dynamic system state $\boldsymbol{x}_{n}$ at time $n$, respectively. The normal probability distribution on the right of the equation is identical to $\mathcal{N} \left( \overline{\boldsymbol{x}}_{n+1}, \boldsymbol{\Sigma}_{n+1} \right)$, where $\overline{\boldsymbol{x}}_{n+1}$ and $\boldsymbol{\Sigma}_{n+1}$ are the mean value and covariance matrix of the dynamic system state $\boldsymbol{x}_{n+1}$, respectively, and are given by the state mean value and state covariance matrix dynamics estimation functions above.

To summarize, we are now able to estimate the dynamics $F_{n}$ from the previous state $\boldsymbol{x}_{n}$ to the following state $\boldsymbol{x}_{n+1}$ by estimating the transition of the mean value from $\overline{\boldsymbol{x}}_{n}$ to $\overline{\boldsymbol{x}}_{n+1}$ and of the covariance matrix from $\boldsymbol{\Sigma}_{n}$ to $\boldsymbol{\Sigma}_{n+1}$. Additionally, the mean value transition function $\tilde{F}_{n}$ and the covariance matrix transition function $S_{n}$ are learnt from previously collected samples with techniques such as neural networks. In conclusion, if the mean value $\overline{\boldsymbol{x}}_{0}$ and the covariance matrix $\boldsymbol{\Sigma}_{0}$ of the initial state $\boldsymbol{x}_{0}$, together with an action sequence $\boldsymbol{U}_{0}$, are given, we can estimate the mean values $\overline{\boldsymbol{x}}_{1}, \overline{\boldsymbol{x}}_{2}, \dots, \overline{\boldsymbol{x}}_{N}$ and the covariance matrices $\boldsymbol{\Sigma}_{1}, \boldsymbol{\Sigma}_{2}, \dots, \boldsymbol{\Sigma}_{N}$ of the dynamic system states $\boldsymbol{x}_{1}, \boldsymbol{x}_{2}, \dots, \boldsymbol{x}_{N}$ over the next $N$ steps, and thereby estimate the states themselves, because they obey a normal distribution.
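As a concrete illustration, if the learnt process model happens to be linear-Gaussian, both $\tilde{F}_{n}$ and $S_{n}$ have closed forms, and the whole chain of state estimates follows from one forward pass. Below is a minimal sketch; the matrices `A`, `B`, and `Q` are assumptions for illustration, not taken from the paper:

```python
import numpy as np

# Assumed linear-Gaussian dynamics: x_{n+1} = A x_n + B a_n + w_n, w_n ~ N(0, Q).
# Then the moment transitions have closed forms:
#   F_tilde(xbar, Sigma, a) = A xbar + B a       (state mean)
#   S(xbar, Sigma, a)       = A Sigma A^T + Q    (state covariance)
A = np.array([[1.0, 0.1], [0.0, 1.0]])   # state transition matrix (assumed)
B = np.array([[0.0], [0.1]])             # control matrix (assumed)
Q = 0.01 * np.eye(2)                     # process-noise covariance (assumed)

def F_tilde(xbar, Sigma, a):
    """State mean value dynamics estimation function."""
    return A @ xbar + B @ a

def S(xbar, Sigma, a):
    """State covariance matrix dynamics estimation function."""
    return A @ Sigma @ A.T + Q

def rollout(xbar0, Sigma0, U):
    """Propagate the state mean and covariance through the action sequence U."""
    means, covs = [xbar0], [Sigma0]
    for a in U:
        xbar, Sigma = means[-1], covs[-1]
        means.append(F_tilde(xbar, Sigma, a))
        covs.append(S(xbar, Sigma, a))
    return means, covs
```

Each returned pair $(\overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n})$ fully characterizes the estimated state distribution $\mathcal{N}(\overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n})$.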

## Reward Approximation

As discussed in the previous section, the purpose of dynamic programming is to choose an action sequence $\boldsymbol{U}_{n}$ so as to maximize the evaluation function $J_{n}$:

$$\boldsymbol{U}_{n}^{*} \leftarrow \arg\max_{\boldsymbol{U}_{n}} J_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{U}_{n} \right)$$

where $\boldsymbol{U}_{n}^{*}$ is the optimal action sequence in state $\boldsymbol{x}_{n}$ at time $n$. In order to compute the optimal action sequence $\boldsymbol{U}_{n}^{*}$, it is necessary to have a concrete expression for $J_{n}$, which as discussed is formally

$$J_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{U}_{n} \right) \doteq \sum_{n'=n}^{N-1} \mathbb{E}_{x_{n'+1}} \left[ R_{n'+1} \left( \boldsymbol{x}_{n'+1} \right) \right]$$

where $\boldsymbol{x}_{n} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| }$ is the dynamical system state at time $n$, $\boldsymbol{U}_{n} \equiv \left\{ \boldsymbol{a}_{n}, \boldsymbol{a}_{n+1}, \dots, \boldsymbol{a}_{N-1} \right\}$ is the action sequence (control sequence) from time $n$ to the last action time $N-1$, $R_{n} : \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } \rightarrow \mathbb{R}$ is the reward function at time $n$, and $J_{n} : \left( \mathbb{R}^{ \left\| \boldsymbol{x} \right\| }, \{ \mathbb{R}^{ \left\| \boldsymbol{a} \right\| } \}^{ N-1-n } \right) \rightarrow \mathbb{R}$ is the function that evaluates the reward-to-gain from state $\boldsymbol{x}_{n}$ by taking an action sequence $\boldsymbol{U}_{n}$. Notice that the definition of the iteration variable $n'$ is slightly different from that in the original equation in the problem formulation section, so as to clearly show that the reward-to-gain starts from the next state $\boldsymbol{x}_{n+1}$ at time $n+1$.

What remains to be determined are the terms on the right side of the above equation, i.e. the concrete expression of the expected reward. Since the argument of the reward function obeys a normal distribution, the expected reward can be rewritten formally as

$$\mathbb{E}_{x_{n}} \left[ R_{n} \left( \boldsymbol{x}_{n} \right) \right] = \mathbb{E}_{x_{n}} \left[ R_{n} \left( \boldsymbol{x}_{n} \right) \middle| \boldsymbol{x}_{n} \sim \mathcal{N} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n} \right) \right]$$

We can calculate the expectation using the probability density of the normal distribution, which can be expressed formally as

$$\mathbb{E}_{x_{n}} \left[ R_{n} \left( \boldsymbol{x}_{n} \right) \right] = \int_{ \boldsymbol{u} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } }^{ \left( \left\| \boldsymbol{x} \right\| \right) } R_{n} \left( \boldsymbol{u} \right) G \left( \boldsymbol{u} \middle| \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n} \right) \mathrm{d} \boldsymbol{u}$$

where $\int_{ \boldsymbol{u} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } }^{ \left( \left\| \boldsymbol{x} \right\| \right) } \equiv \iint \cdots \int_{ \boldsymbol{u} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } }$ denotes $\left\| \boldsymbol{x} \right\|$ nested integrals, and $G : \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } \rightarrow \mathbb{R}$ is the multivariate Gaussian function mapping a state value to its probability density given the mean value and the covariance matrix of the state.
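Although this integral rarely has a closed form, it can always be estimated numerically by sampling from $\mathcal{N}(\overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n})$. A minimal Monte Carlo sketch, with a hypothetical non-quadratic reward `R` chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def R(x):
    """Hypothetical reward: a quadratic penalty plus a non-quadratic term."""
    return -np.sum(x ** 2) + np.cos(x[0])

def expected_reward_mc(R, xbar, Sigma, n_samples=100_000):
    """Monte Carlo estimate of E_x[R(x)] for x ~ N(xbar, Sigma)."""
    xs = rng.multivariate_normal(xbar, Sigma, size=n_samples)
    return np.mean([R(x) for x in xs])
```

Such a sampling estimate is too expensive to use inside an optimization loop, but it is useful as a ground truth against which the quadratic approximation below can be checked.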

However, the expression of $R_{n}$ can be sophisticated, so taking the integral directly is intractable in many cases. A common way to obtain the value of the expected reward is therefore approximation; this paper employs a quadratic approximation method.

More generally, if we can devise a method to approximate the reward function, including its expectation (mean value) and variance with respect to the state, the following steps become much more convenient. Similar to the dynamics estimation, a reward model is formally described as

$$R_{n} \left( \boldsymbol{x}_{n} \right) \rightarrow r_{n}$$

where $\boldsymbol{x}_{n} \in \mathbb{R}^{ \left\| \boldsymbol{x} \right\| }$ is the dynamical system state at time $n$ and is a random variable with mean value $\overline{\boldsymbol{x}}_{n}$ and covariance matrix $\boldsymbol{\Sigma}_{n}$, $r_{n} \in \mathbb{R}$ is then also a random variable of the reward at state $\boldsymbol{x}_{n}$, and $R_{n} : \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } \rightarrow \mathbb{R}$ is the reward function at time $n$.

### Taylor Series Quadratic Expansion

To approximate the expectation of the reward function, we consider its second-order Taylor series expansion. For convenience, we first denote the offset of the state random variable $\boldsymbol{x}_{n}$ from its mean value $\overline{\boldsymbol{x}}_{n}$ as $\delta \boldsymbol{x}_{n}$:

$$\delta \boldsymbol{x}_{n} = \boldsymbol{x}_{n} - \overline{\boldsymbol{x}}_{n}$$

Since the dynamic system state $\boldsymbol{x}_{n}$ obeys the normal distribution $\boldsymbol{x}_{n} \sim \mathcal{N} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n} \right)$, the offset is also a normally distributed random variable with the same covariance matrix $\boldsymbol{\Sigma}_{n}$ but zero mean: $\delta \boldsymbol{x}_{n} \sim \mathcal{N} \left( \boldsymbol{0}, \boldsymbol{\Sigma}_{n} \right)$.

The second-order Taylor series expansion of the reward function can then be formally expressed as

$$R_{n} \left( \boldsymbol{x}_{n} \right) \approx R'_{n} \left( \boldsymbol{x}_{n} \right) = R'_{n} \left( \overline{\boldsymbol{x}}_{n} + \delta \boldsymbol{x}_{n} \right) = R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) + \boldsymbol{b}_{n}^{\mathsf{T}} \delta \boldsymbol{x}_{n} + \delta \boldsymbol{x}_{n}^{\mathsf{T}} \boldsymbol{A}_{n} \delta \boldsymbol{x}_{n}$$

where

$$\boldsymbol{b}_{n} \equiv \frac{1}{1!} \nabla_{\boldsymbol{x}} R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) \equiv \frac{1}{1!} \mathrm{D}_{\boldsymbol{x}}^{1} R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) \equiv \begin{bmatrix} \frac{ \partial R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) }{ \partial x_{1} } \\ \frac{ \partial R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) }{ \partial x_{2} } \\ \vdots \\ \frac{ \partial R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) }{ \partial x_{ \left\| \boldsymbol{x} \right\| } } \end{bmatrix}$$

is the first-order partial derivative (with its constant factor) of the reward function $R_{n}$ with respect to the dynamic system state $\boldsymbol{x}$, evaluated at the mean value $\overline{\boldsymbol{x}}_{n}$ of the state $\boldsymbol{x}_{n}$ at time $n$,

$$\boldsymbol{A}_{n} \equiv \frac{1}{2!} \mathrm{H}_{\boldsymbol{x}} \left( R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) \right) \equiv \frac{1}{2!} \mathrm{D}_{\boldsymbol{x}}^{2} R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) \equiv \frac{1}{2} \begin{bmatrix} \frac{ \partial^{2} R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) }{ \partial x_{1} \partial x_{1} } & \frac{ \partial^{2} R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) }{ \partial x_{1} \partial x_{2} } & \cdots & \frac{ \partial^{2} R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) }{ \partial x_{1} \partial x_{ \left\| \boldsymbol{x} \right\| } } \\ \frac{ \partial^{2} R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) }{ \partial x_{2} \partial x_{1} } & \frac{ \partial^{2} R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) }{ \partial x_{2} \partial x_{2} } & \cdots & \frac{ \partial^{2} R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) }{ \partial x_{2} \partial x_{ \left\| \boldsymbol{x} \right\| } } \\ \vdots & \vdots & \ddots & \vdots \\ \frac{ \partial^{2} R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) }{ \partial x_{ \left\| \boldsymbol{x} \right\| } \partial x_{1} } & \frac{ \partial^{2} R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) }{ \partial x_{ \left\| \boldsymbol{x} \right\| } \partial x_{2} } & \cdots & \frac{ \partial^{2} R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) }{ \partial x_{ \left\| \boldsymbol{x} \right\| } \partial x_{ \left\| \boldsymbol{x} \right\| } } \\ \end{bmatrix}$$

is the second-order partial derivative, also known as the Hessian matrix, with its constant factor, of the reward function $R_{n}$ with respect to the dynamic system state $\boldsymbol{x}$, evaluated at the mean value $\overline{\boldsymbol{x}}_{n}$ of the state $\boldsymbol{x}_{n}$ at time $n$, $R_{n} : \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } \rightarrow \mathbb{R}$ is the reward function at time $n$, and $R'_{n} : \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } \rightarrow \mathbb{R}$ is the second-order Taylor series expansion of the reward function $R_{n}$ at position $\overline{\boldsymbol{x}}_{n}$ at time $n$. The notations $\boldsymbol{x}_{n}$, $\overline{\boldsymbol{x}}_{n}$ and $\delta \boldsymbol{x}_{n}$ have been defined in the previous paragraph.
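In practice, $\boldsymbol{b}_{n}$ and $\boldsymbol{A}_{n}$ can be derived analytically or, as a quick fallback, estimated by finite differences. A minimal numerical sketch (the helper names are illustrative, not from the paper):

```python
import numpy as np

def numerical_grad_hess(R, xbar, eps=1e-4):
    """Central finite-difference gradient and Hessian of a scalar reward at xbar."""
    d = len(xbar)
    g = np.zeros(d)
    H = np.zeros((d, d))
    for i in range(d):
        e_i = np.zeros(d)
        e_i[i] = eps
        g[i] = (R(xbar + e_i) - R(xbar - e_i)) / (2 * eps)
        for j in range(d):
            e_j = np.zeros(d)
            e_j[j] = eps
            H[i, j] = (R(xbar + e_i + e_j) - R(xbar + e_i - e_j)
                       - R(xbar - e_i + e_j) + R(xbar - e_i - e_j)) / (4 * eps ** 2)
    return g, H

def quadratic_approx(R, xbar):
    """Return b_n (gradient) and A_n = (1/2!) Hessian of the Taylor expansion."""
    g, H = numerical_grad_hess(R, xbar)
    return g, 0.5 * H
```

For a quadratic reward such as $R(\boldsymbol{x}) = -\boldsymbol{x}^{\mathsf{T}}\boldsymbol{x}$ the expansion is exact, which makes it a convenient test case.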

### Reward Expectation Quadratic Approximation

The approximation of the expected reward is based on the second-order Taylor series expansion of the reward function, and is denoted formally as

$$\mathbb{E}_{ \boldsymbol{x}_{n} } \left[ R_{n} \left( \boldsymbol{x}_{n} \right) \right] \approx \mathbb{E}_{ \boldsymbol{x}_{n} } \left[ R'_{n} \left( \boldsymbol{x}_{n} \right) \right] = \tilde{R}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n} \right) = \mathbb{E}_{ \delta \boldsymbol{x}_{n} } \left[ R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) + \boldsymbol{b}_{n}^{\mathsf{T}} \delta \boldsymbol{x}_{n} + \delta \boldsymbol{x}_{n}^{\mathsf{T}} \boldsymbol{A}_{n} \delta \boldsymbol{x}_{n} \right]$$

where $\tilde{R}_{n}$ is the approximation of the expected reward with respect to the mean value $\overline{\boldsymbol{x}}_{n}$ and the covariance matrix $\boldsymbol{\Sigma}_{n}$ of the state random variable $\boldsymbol{x}_{n}$ at time $n$. The approximation of the expected reward depends only on the mean value and the covariance matrix of the state, because the randomness is removed by taking the expectation. Note that $\tilde{R}_{n}$ is the approximation of the expected reward $\mathbb{E}_{ \boldsymbol{x}_{n} } \left[ R_{n} \left( \boldsymbol{x}_{n} \right) \right]$ rather than the approximation of the reward function itself, which has in turn been denoted as $R'_{n}$; the notation here can be confusing.

Concretely, the approximation of the expected reward can be separated into three components according to the linearity of expectation:

$$\tilde{R}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n} \right) = \mathbb{E}_{ \delta \boldsymbol{x}_{n} } \left[ R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) \right] + \mathbb{E}_{ \delta \boldsymbol{x}_{n} } \left[ \boldsymbol{b}_{n}^{\mathsf{T}} \delta \boldsymbol{x}_{n} \right] + \mathbb{E}_{ \delta \boldsymbol{x}_{n} } \left[ \delta \boldsymbol{x}_{n}^{\mathsf{T}} \boldsymbol{A}_{n} \delta \boldsymbol{x}_{n} \right]$$

where

$$\mathbb{E}_{ \delta \boldsymbol{x}_{n} } \left[ R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) \right] = R_{n} \left( \overline{\boldsymbol{x}}_{n} \right)$$

because $R_{n} \left( \overline{\boldsymbol{x}}_{n} \right)$ is a constant value;

\begin{align*} \mathbb{E}_{ \delta \boldsymbol{x}_{n} } \left[ \boldsymbol{b}_{n}^{\mathsf{T}} \delta \boldsymbol{x}_{n} \right] & = \mathbb{E}_{ \delta \boldsymbol{x}_{n} } \left[ \sum_{i=1}^{ \left\| \boldsymbol{x} \right\| } b_{n, i} \delta x_{n, i} \right] \\ & = \sum_{i=1}^{ \left\| \boldsymbol{x} \right\| } b_{n, i} \mathbb{E}_{ \delta x_{n, i} } \left[ \delta x_{n, i} \right] \\ & = \sum_{i=1}^{ \left\| \boldsymbol{x} \right\| } b_{n, i} \cdot 0 \\ & = 0 \end{align*}

because $\delta \boldsymbol{x}_{n}$ obeys a normal distribution with mean value $\boldsymbol{0}$, and the expected value of a normally distributed variable is its mean value;

\begin{align*} \mathbb{E}_{ \delta \boldsymbol{x}_{n} } \left[ \delta \boldsymbol{x}_{n}^{\mathsf{T}} \boldsymbol{A}_{n} \delta \boldsymbol{x}_{n} \right] &= \mathbb{E}_{ \delta \boldsymbol{x}_{n} } \left[ \sum_{i=1}^{ \left\| \boldsymbol{x} \right\| } \sum_{j=1}^{ \left\| \boldsymbol{x} \right\| } A_{n, i, j} \delta x_{n, i} \delta x_{n, j} \right] \\ &= \sum_{i=1}^{ \left\| \boldsymbol{x} \right\| } \sum_{j=1}^{ \left\| \boldsymbol{x} \right\| } A_{n, i, j} \mathbb{E}_{ \delta x_{n, i}, \delta x_{n, j} } \left[ \delta x_{n, i} \delta x_{n, j} \right] \\ &= \sum_{i=1}^{ \left\| \boldsymbol{x} \right\| } \sum_{j=1}^{ \left\| \boldsymbol{x} \right\| } A_{n, i, j} \mathbb{E}_{ x_{n, i}, x_{n, j} } \left[ \left( x_{n, i} - \overline{x}_{n, i} \right) \left( x_{n, j} - \overline{x}_{n, j} \right) \right] \\ &= \sum_{i=1}^{ \left\| \boldsymbol{x} \right\| } \sum_{j=1}^{ \left\| \boldsymbol{x} \right\| } A_{n, i, j} \Sigma_{n, i, j} \\ &= \sum_{i=1}^{ \left\| \boldsymbol{x} \right\| } \sum_{j=1}^{ \left\| \boldsymbol{x} \right\| } A_{n, i, j} \Sigma_{n, j, i} \\ &= \mathrm{Tr} \left( \boldsymbol{A}_{n} \boldsymbol{\Sigma}_{n} \right) \end{align*}

because $\Sigma_{n, i, j} \doteq \mathbb{E}_{ x_{n, i}, x_{n, j} } \left[ \left( x_{n, i} - \overline{x}_{n, i} \right) \left( x_{n, j} - \overline{x}_{n, j} \right) \right]$ is the definition of the covariance matrix of a multivariate Gaussian distribution, $\Sigma_{n, i, j} = \Sigma_{n, j, i}$ holds because the covariance matrix $\boldsymbol{\Sigma}$ is symmetric, and $\sum_{i=1}^{ \left\| \boldsymbol{x} \right\| } \sum_{j=1}^{ \left\| \boldsymbol{x} \right\| } A_{n, i, j} \Sigma_{n, j, i}$ sums, over $i$, the product of the $i$-th row of matrix $\boldsymbol{A}_{n}$ and the $i$-th column of matrix $\boldsymbol{\Sigma}_{n}$, which is equivalent to the trace $\mathrm{Tr} \left( \boldsymbol{A}_{n} \boldsymbol{\Sigma}_{n} \right)$ of the product of the two matrices.
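The trace identity above is easy to verify numerically by sampling $\delta \boldsymbol{x}_{n} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma}_{n})$; the matrices below are assumed example values:

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed example matrices (symmetric A, valid covariance Sigma).
A = np.array([[0.5, 0.2], [0.2, 1.0]])
Sigma = np.array([[1.0, 0.3], [0.3, 2.0]])

# Sample dx ~ N(0, Sigma) and compare the sample mean of dx^T A dx
# against the closed form Tr(A Sigma).
dx = rng.multivariate_normal(np.zeros(2), Sigma, size=200_000)
mc = np.mean(np.einsum('ni,ij,nj->n', dx, A, dx))
exact = np.trace(A @ Sigma)
```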

To sum up, the expected reward can be approximated by the expected value of the second-order Taylor series expansion of the reward function, which is formally and concretely expressed as

$$\mathbb{E}_{ \boldsymbol{x}_{n} } \left[ R_{n} \left( \boldsymbol{x}_{n} \right) \right] \approx \tilde{R}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n} \right) = R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) + \mathrm{Tr} \left( \boldsymbol{A}_{n} \boldsymbol{\Sigma}_{n} \right)$$

where $\boldsymbol{A}_{n}$ is the Hessian matrix, with its constant factor, of the reward function $R_{n}$ with respect to the dynamic system state $\boldsymbol{x}$, evaluated at the mean value $\overline{\boldsymbol{x}}_{n}$ of the state $\boldsymbol{x}_{n}$ at time $n$, and $\overline{\boldsymbol{x}}_{n}$ and $\boldsymbol{\Sigma}_{n}$ are respectively the mean value and the covariance matrix of the state $\boldsymbol{x}_{n}$.

### Reward Variance Quadratic Approximation

Following a similar derivation based on the second-order Taylor series expansion (the variance of a quadratic form of a Gaussian random variable), the variance of the reward can be approximated as

$$\mathrm{var} \left[ R_{n} \left( \boldsymbol{x}_{n} \right) \right] \approx T \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n} \right) = 2 \mathrm{Tr} \left( \boldsymbol{A}_{n} \boldsymbol{\Sigma}_{n} \boldsymbol{A}_{n} \boldsymbol{\Sigma}_{n} \right) + \boldsymbol{b}_{n}^{\mathsf{T}} \boldsymbol{\Sigma}_{n} \boldsymbol{b}_{n}$$

where $T$ denotes the reward variance approximation function, and $\boldsymbol{b}_{n}$ and $\boldsymbol{A}_{n}$ are defined as above.
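For an exactly quadratic reward this variance formula is exact, which makes it easy to sanity-check by sampling; the reward and matrices below are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Assumed quadratic reward R(x) = 1 + b^T dx + dx^T A dx with dx = x - xbar.
b = np.array([1.0, -0.5])
A = np.array([[0.3, 0.1], [0.1, 0.4]])
Sigma = np.array([[1.0, 0.2], [0.2, 0.5]])
xbar = np.array([0.0, 0.0])

# Monte Carlo variance of the reward under x ~ N(xbar, Sigma).
xs = rng.multivariate_normal(xbar, Sigma, size=300_000)
dx = xs - xbar
rewards = 1.0 + dx @ b + np.einsum('ni,ij,nj->n', dx, A, dx)
mc_var = np.var(rewards)

# Quadratic approximation: var ≈ 2 Tr(A Σ A Σ) + b^T Σ b.
approx_var = 2 * np.trace(A @ Sigma @ A @ Sigma) + b @ Sigma @ b
```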

### Reward Model

After having the quadratic approximations of reward expectation and variance, we can reconstruct a Gaussian model for the approximated reward, which is formally

$$R_{n} \left( \boldsymbol{x}_{n} \right) \overset{\sim}{\rightarrow} \tilde{r}_{n} \sim \mathcal{N} \left( \tilde{R}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n} \right), T \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n} \right) \right)$$

where $\overline{\boldsymbol{x}}_{n} \equiv \mathbb{E} \left[ \boldsymbol{x}_{n} \right]$ and $\boldsymbol{\Sigma}_{n} \equiv \mathrm{cov} \left[ \boldsymbol{x}_{n} \right]$ are the mean value and the covariance matrix of the dynamic system state $\boldsymbol{x}_{n}$ at time $n$, respectively.

To summarize, we are now able to approximate the (complicated) reward function $R_{n}$ as a Gaussian model by the quadratic approximations of the reward expectation and the reward variance. Actually, in this paper, only the reward expectation quadratic approximation is used to estimate the evaluation function $J_{n}$ so as to find the optimal action sequence. And the reward variance quadratic approximation is just a bonus.

## Evaluation Function Gradient

To optimize the action sequence, we can take advantage of the gradient of the evaluation function, for example together with the Newton-Raphson method. In other words, once the gradient of the evaluation function is obtained, we can apply gradient ascent (since we are maximizing $J_{n}$) to optimize the action sequence so as to find the optimal action sequence.

### Bellman Equation Form for Evaluation Function

By definition:

$$J_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{U}_{n} \right) \doteq \sum_{n'=n}^{N-1} \mathbb{E}_{ \boldsymbol{x}_{n'+1} } \left[ R_{n'+1} \left( \boldsymbol{x}_{n'+1} \right) \right]$$

where

$$\mathbb{E}_{ \boldsymbol{x}_{n} } \left[ R_{n} \left( \boldsymbol{x}_{n} \right) \right] \approx \tilde{R}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n} \right) = R_{n} \left( \overline{\boldsymbol{x}}_{n} \right) + \mathrm{Tr} \left( \boldsymbol{A}_{n} \boldsymbol{\Sigma}_{n} \right)$$

So it would be convenient to redefine the evaluation function as

$$J_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{U}_{n} \right) \doteq \sum_{n'=n}^{N-1} \tilde{R}_{n'+1} \left( \overline{\boldsymbol{x}}_{n'+1}, \boldsymbol{\Sigma}_{n'+1} \right)$$

This definition of the evaluation function makes it possible to rewrite it in an iterative form, i.e. the Bellman equation form for the evaluation function, which is formally

$$J_{N-1} \left( \boldsymbol{x}_{N-1}, \boldsymbol{U}_{N-1} \right) = \tilde{R}_{N} \left( \overline{\boldsymbol{x}}_{N}, \boldsymbol{\Sigma}_{N} \right)$$

$$\vdots$$

$$J_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{U}_{n} \right) = \tilde{R}_{n+1} \left( \overline{\boldsymbol{x}}_{n+1}, \boldsymbol{\Sigma}_{n+1} \right) + J_{n+1} \left( \boldsymbol{x}_{n+1}, \boldsymbol{U}_{n+1} \right)$$

$$\vdots$$

$$J_{0} \left( \boldsymbol{x}_{0}, \boldsymbol{U}_{0} \right) = \tilde{R}_{1} \left( \overline{\boldsymbol{x}}_{1}, \boldsymbol{\Sigma}_{1} \right) + J_{1} \left( \boldsymbol{x}_{1}, \boldsymbol{U}_{1} \right)$$
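This recursion is a plain backward loop in code. In the sketch below, `R_tilde[n]` stands in for the already-computed value $\tilde{R}_{n}(\overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n})$; the names are illustrative, not from the paper:

```python
import numpy as np

def backward_J(R_tilde, N):
    """Evaluate J_n for all n via the Bellman form:
    J[N] = 0 (no reward beyond time N), and J[n] = R_tilde[n+1] + J[n+1]."""
    J = np.zeros(N + 1)
    for n in range(N - 1, -1, -1):
        J[n] = R_tilde[n + 1] + J[n + 1]
    return J
```

By construction, `J[n]` equals the cumulative sum $\sum_{n'=n}^{N-1} \tilde{R}_{n'+1}$, matching the non-recursive definition above.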

### Gradient

As derived in the previous section, the Bellman equation form for the evaluation function is formally

\begin{align*} J_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{U}_{n} \right) & = \tilde{R}_{n+1} \left( \overline{\boldsymbol{x}}_{n+1}, \boldsymbol{\Sigma}_{n+1} \right) + J_{n+1} \left( \boldsymbol{x}_{n+1}, \boldsymbol{U}_{n+1} \right) \\ & = \tilde{R}_{n+1} \left( \tilde{F}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right), S_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) \right) + J_{n+1} \left( \tilde{F}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right), \boldsymbol{U}_{n+1} \right) \end{align*}

So the partial derivative of $J_{n}$ with respect to the corresponding action $\boldsymbol{a}_{n}$ at time $n$, according to the chain rule of multivariate calculus, would be

$$\frac{ \partial J_{n} }{ \partial \boldsymbol{a}_{n} } = \frac{ \partial \tilde{F}_{n} }{ \partial \boldsymbol{a}_{n} } \frac{ \partial \tilde{R}_{n+1} }{ \partial \tilde{F}_{n} } + \frac{ \partial S_{n} }{ \partial \boldsymbol{a}_{n} } \frac{ \partial \tilde{R}_{n+1} }{ \partial S_{n} } + \frac{ \partial \tilde{F}_{n} }{ \partial \boldsymbol{a}_{n} } \frac{ \partial J_{n+1} }{ \partial \tilde{F}_{n} }$$

In the paper, it is assumed that the noise at each step of the dynamic process is independent of the action taken, so the dynamics of the covariance matrix is irrelevant to the action $\boldsymbol{a}_{n}$, which means $\frac{ \partial S_{n} }{ \partial \boldsymbol{a}_{n} } = 0$. And keep in mind that $\tilde{F}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) = \overline{\boldsymbol{x}}_{n+1}$, so we can further derive the partial derivative of the evaluation function as

\begin{align*} \frac{ \partial J_{n} }{ \partial \boldsymbol{a}_{n} } & = \frac{ \partial \tilde{F}_{n} }{ \partial \boldsymbol{a}_{n} } \frac{ \partial \tilde{R}_{n+1} }{ \partial \boldsymbol{x}_{n+1} } + \frac{ \partial \tilde{F}_{n} }{ \partial \boldsymbol{a}_{n} } \frac{ \partial J_{n+1} }{ \partial \boldsymbol{x}_{n+1} } \\ & = \frac{ \partial \tilde{F}_{n} }{ \partial \boldsymbol{a}_{n} } \left( \frac{ \partial \tilde{R}_{n+1} }{ \partial \boldsymbol{x}_{n+1} } + \frac{ \partial J_{n+1} }{ \partial \boldsymbol{x}_{n+1} } \right) \\ & = \frac{ \partial \tilde{F}_{n} }{ \partial \boldsymbol{a}_{n} } \boldsymbol{\Omega}_{n} \end{align*}

where $\boldsymbol{\Omega}_{n}$ is defined for convenience in the following derivation. Following the derivative of the evaluation function above, $\boldsymbol{\Omega}_{n}$ can be rewritten in Bellman equation form in terms of $\boldsymbol{\Omega}_{n+1}$, which is formally

\begin{align*} \boldsymbol{\Omega}_{n} & = \frac{ \partial \tilde{R}_{n+1} }{ \partial \boldsymbol{x}_{n+1} } + \frac{ \partial J_{n+1} }{ \partial \boldsymbol{x}_{n+1} } \\ & = \frac{ \partial \tilde{R}_{n+1} }{ \partial \boldsymbol{x}_{n+1} } + \frac{ \partial \tilde{F}_{n+1} }{ \partial \boldsymbol{x}_{n+1} } \boldsymbol{\Omega}_{n+1} \end{align*}

Considering the last state at time $N$, there is no further reward or evaluation to propagate, so the term $\boldsymbol{\Omega}_{N}$ is

$$\boldsymbol{\Omega}_{N} = 0$$

In conclusion, the derivative of the evaluation function $J_{n}$ with respect to the corresponding action is

$$\frac{ \partial J_{n} }{ \partial \boldsymbol{a}_{n} } = \frac{ \partial \tilde{F}_{n} }{ \partial \boldsymbol{a}_{n} } \boldsymbol{\Omega}_{n}$$

where

$$\boldsymbol{\Omega}_{n} = \begin{cases} 0 & \text{ if } n = N \\ \frac{ \partial \tilde{R}_{n+1} }{ \partial \boldsymbol{x}_{n+1} } + \frac{ \partial \tilde{F}_{n+1} }{ \partial \boldsymbol{x}_{n+1} } \boldsymbol{\Omega}_{n+1} & \text{ otherwise } \end{cases}$$

Now we have the gradient of the evaluation function (approximated by $\tilde{R}_{n}$), and can apply gradient ascent to optimize the action sequence so as to find the optimal one.
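As a concrete illustration, the backward recursion for $\boldsymbol{\Omega}_{n}$ and the resulting gradient ascent on the action sequence can be sketched in code. The linear mean dynamics, quadratic step reward, and all numeric values below are toy assumptions standing in for the learnt models in the paper:

```python
import numpy as np

# Toy stand-ins (assumed, not from the paper): linear mean dynamics
# xbar_{n+1} = A x + B a and quadratic step reward R~_n(x) = -x^T Q x.
# The covariance terms drop out of the action gradient, so they are omitted.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = np.eye(2)
N = 20        # horizon
eta = 0.01    # step size

def rollout(x0, actions):
    xs = [x0]
    for a in actions:
        xs.append(A @ xs[-1] + B @ a)
    return xs

def total_reward(xs):
    return sum(-(x @ Q @ x) for x in xs[1:])

def action_gradients(xs):
    """dJ_n/da_n = (dF~_n/da_n) Omega_n via the backward recursion."""
    grads = [None] * N
    omega = np.zeros(2)                    # Omega_N = 0
    for n in reversed(range(N)):
        # Omega_n = dR~_{n+1}/dx_{n+1} + (dF~_{n+1}/dx_{n+1}) Omega_{n+1}
        omega = -2.0 * Q @ xs[n + 1] + A.T @ omega
        grads[n] = B.T @ omega             # dJ_n/da_n
    return grads

x0 = np.array([1.0, 0.0])
actions = [np.zeros(1) for _ in range(N)]
before = total_reward(rollout(x0, actions))
for _ in range(300):                       # gradient ascent on the actions
    grads = action_gradients(rollout(x0, actions))
    actions = [a + eta * g for a, g in zip(actions, grads)]
after = total_reward(rollout(x0, actions))
```

Since the covariance dynamics $S_{n}$ is assumed independent of the actions, it contributes nothing to the action gradient and is left out of this sketch.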

## Reference States

In this paper, two optimization criteria are proposed: the evaluation function and the value function. The evaluation function criterion and its corresponding optimization method have been discussed in the previous sections; the value function criterion for optimizing the action sequence is discussed in this section. (Note that the value function criterion is no longer used in Graph-DDP.)

The last reference state is defined as the state that maximizes the reward function. Every other reference state and reference action, which together render a successive state, are defined as the state and action pair that both maximizes the reward and minimizes the error between the next state and the successive reference state.

### Value Function

To begin with, we need to define a value function that indicates how close a state is to its reference state, which is formally

$$V_{n} \left( \boldsymbol{x}_{n} \right) = - \left( \boldsymbol{x}^{*}_{n} - \boldsymbol{x}_{n} \right)^{ \mathsf{T} } \boldsymbol{W}_{\text{rs}} \left( \boldsymbol{x}^{*}_{n} - \boldsymbol{x}_{n} \right)$$

where $\boldsymbol{x}^{*}_{n}$ and $\boldsymbol{x}_{n}$ are respectively the reference state and the state at time $n$, $\boldsymbol{W}_{\text{rs}}$ is a manually defined weight matrix (whose function is not yet well understood), and $V_{n} : \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } \rightarrow \mathbb{R}$ is the value function that maps a state to a value that indicates how close this state is to its corresponding reference state at time $n$.

### Last Reference State

Notice that the last state at time $n = N$ has no successive state, so there is no need to consider the error between a following state and its corresponding reference state. Instead, we only need to consider the reward of this state, so as to find the optimal state that generates the best reward. Formally, we can define a function as

$$L_{N} \left( \boldsymbol{x}_{N} \right) = R_{N} \left( \boldsymbol{x}_{N} \right)$$

where $\boldsymbol{x}_{N}$ is the last state in the dynamic process, $R_{N} : \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } \rightarrow \mathbb{R}$ is the reward function for the last state at time $N$, and $L_{N} : \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } \rightarrow \mathbb{R}$ is the function that indicates the performance of the last state $\boldsymbol{x}_{N}$ at time $N$.

To find a reference state that maximizes the performance, we can solve an optimization problem for such a state. Formally, the goal of the optimization is

$$\boldsymbol{x}^{*}_{N} \leftarrow \arg\max_{ \boldsymbol{x}_{N} } L_{N} \left( \boldsymbol{x}_{N} \right)$$

where $\boldsymbol{x}^{*}_{N}$ is the desired reference state that maximizes the performance, i.e., the reward at the last state.

To solve this optimization problem, we can employ the gradient ascent method, which suggests repeatedly executing the following update

$$\boldsymbol{x}_{N} \leftarrow \boldsymbol{x}_{N} + \frac{ \partial L_{N} \left( \boldsymbol{x}_{N} \right) }{ \partial \boldsymbol{x}_{N} } = \boldsymbol{x}_{N} + \frac{ \partial R_{N} \left( \boldsymbol{x}_{N} \right) }{ \partial \boldsymbol{x}_{N} }$$

until the value of the last state $\boldsymbol{x}_{N}$ converges, i.e., reaches a (local) maximum. At this point, the converged last state is numerically equal to the reference state $\boldsymbol{x}^{*}_{N}$. So we now have the last reference state.
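As a minimal numeric sketch of this iteration, assume a hypothetical quadratic reward $R_{N} \left( \boldsymbol{x} \right) = - \left\| \boldsymbol{x} - \boldsymbol{g} \right\|^{2}$ whose maximizer is an assumed goal point $\boldsymbol{g}$ (both are illustrative, not from the paper); the update then converges to $\boldsymbol{g}$:

```python
import numpy as np

# Hypothetical quadratic reward R_N(x) = -||x - g||^2 with assumed goal g;
# its maximizer is g itself, so the iteration should converge there.
g = np.array([0.5, -1.0])

def reward_grad_N(x):
    return -2.0 * (x - g)            # dR_N/dx

x = np.zeros(2)                      # initial guess for x_N
eta = 0.1                            # step size
for _ in range(100):
    x = x + eta * reward_grad_N(x)   # x_N <- x_N + eta * dR_N/dx_N

# x now numerically equals the last reference state x*_N (here, g)
```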

### General Reference States

With the value function and the definition of the last reference state in hand, we can now define a function $L_{n}$ for all $n < N$ that combines the reward and the error between the following state and its corresponding reference state in the form of the Bellman equation, which is formally

$$L_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right) = R_{n} \left( \boldsymbol{x}_{n} \right) + V_{n+1} \left( \boldsymbol{x}_{n+1} \right)$$

where $\boldsymbol{x}_{n}$ and $\boldsymbol{a}_{n}$ are respectively the selected state and action at time $n$ for evaluation, $\boldsymbol{x}_{n+1} \leftarrow F_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right)$ is the next state of the dynamic system derived from state $\boldsymbol{x}_{n}$ and action $\boldsymbol{a}_{n}$, $R_{n} : \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } \rightarrow \mathbb{R}$ is the reward function for state $\boldsymbol{x}_{n}$ at time $n$, $V_{n+1} : \mathbb{R}^{ \left\| \boldsymbol{x} \right\| } \rightarrow \mathbb{R}$ is the value function indicating how close the following state $\boldsymbol{x}_{n+1}$ is to its reference state at time $n+1$, and $L_{n} : \left( \mathbb{R}^{ \left\| \boldsymbol{x} \right\| }, \mathbb{R}^{ \left\| \boldsymbol{a} \right\| } \right) \rightarrow \mathbb{R}$ is the function that indicates the performance of the state and action pair $\left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right)$ at time $n$. Notice that the relative importance of the reward and the error is left implicit: there is no explicit scalar indicating how much the reward or the error matters for the performance evaluation. (I suppose indicating which factor matters more is one function of the manually defined weight matrix $\boldsymbol{W}_{\text{rs}}$.)

The goal of the optimization is the same as in the case of the last state, except there is an additional action parameter here. Formally, the goal is defined as

$$\boldsymbol{x}^{*}_{n}, \boldsymbol{a}^{*}_{n} \leftarrow \arg\max_{ \boldsymbol{x}_{n}, \boldsymbol{a}_{n} } L_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right)$$

where $\boldsymbol{x}^{*}_{n}$ and $\boldsymbol{a}^{*}_{n}$ are respectively the reference state and action at time $n$ that maximize the performance, taking into account both the reward and the error between the next state and its corresponding reference state.

By the definition of the value function and the dynamics from $\boldsymbol{x}_{n}$ to $\boldsymbol{x}_{n+1}$ by taking an action $\boldsymbol{a}_{n}$, we can expand the value function term $V_{n+1} \left( \boldsymbol{x}_{n+1} \right)$ into

\begin{align*} V_{n+1} \left( \boldsymbol{x}_{n+1} \right) & = - \left( \boldsymbol{x}^{*}_{n+1} - \boldsymbol{x}_{n+1} \right)^{ \mathsf{T} } \boldsymbol{W}_{\text{rs}} \left( \boldsymbol{x}^{*}_{n+1} - \boldsymbol{x}_{n+1} \right) \\ & = - \left( \boldsymbol{x}^{*}_{n+1} - F_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right) \right)^{ \mathsf{T} } \boldsymbol{W}_{\text{rs}} \left( \boldsymbol{x}^{*}_{n+1} - F_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right) \right) \end{align*}

By applying gradient ascent to solve this optimization problem, we repeatedly execute the following updates

$$\boldsymbol{x}_{n} \leftarrow \boldsymbol{x}_{n} + \frac{ \partial L_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right) }{ \partial \boldsymbol{x}_{n} } = \boldsymbol{x}_{n} + \frac{ \partial R_{n} \left( \boldsymbol{x}_{n} \right) }{ \partial \boldsymbol{x}_{n} } + 2 \frac{ \partial F_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right) }{ \partial \boldsymbol{x}_{n} } \boldsymbol{W}_{\text{rs}} \left( \boldsymbol{x}^{*}_{n+1} - F_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right) \right)$$

$$\boldsymbol{a}_{n} \leftarrow \boldsymbol{a}_{n} + \frac{ \partial L_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right) }{ \partial \boldsymbol{a}_{n} } = \boldsymbol{a}_{n} + 2 \frac{ \partial F_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right) }{ \partial \boldsymbol{a}_{n} } \boldsymbol{W}_{\text{rs}} \left( \boldsymbol{x}^{*}_{n+1} - F_{n} \left( \boldsymbol{x}_{n}, \boldsymbol{a}_{n} \right) \right)$$

until the state $\boldsymbol{x}_{n}$ and action $\boldsymbol{a}_{n}$ converge, i.e., reach a (local) maximum. At this point, the converged state and action pair is numerically equal to the reference state $\boldsymbol{x}^{*}_{n}$ and the reference action $\boldsymbol{a}^{*}_{n}$ at time $n$. So we now have a method to obtain the general reference states.
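The whole backward pass over reference states can be sketched as follows, again under assumed toy linear dynamics, quadratic rewards, an assumed weight matrix, and an assumed last reference state (none of these come from the paper):

```python
import numpy as np

# Toy stand-ins (assumed): linear dynamics F_n(x, a) = A x + B a,
# quadratic rewards R_n(x) = -x^T Q x, and weight matrix W_rs.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.5]])
Q = 0.1 * np.eye(2)
W = np.eye(2)                  # W_rs
N = 5
eta = 0.05

def F(x, a):
    return A @ x + B @ a

def L_val(x, a, x_ref_next):
    err = x_ref_next - F(x, a)
    return -(x @ Q @ x) - err @ W @ err      # R_n(x) + V_{n+1}(F(x, a))

def ref_pair(x_ref_next):
    """Gradient ascent on L_n(x, a) for one time step."""
    x, a = np.zeros(2), np.zeros(1)
    for _ in range(1000):
        err = x_ref_next - F(x, a)
        x = x + eta * (-2.0 * Q @ x + 2.0 * A.T @ W @ err)   # dL/dx
        a = a + eta * (2.0 * B.T @ W @ err)                  # dL/da
    return x, a

x_ref = np.array([1.0, 0.0])   # assumed maximizer of a hypothetical R_N
refs = [x_ref]
for n in reversed(range(N)):   # backward pass: x*_{N-1}, ..., x*_0
    x_ref, a_ref = ref_pair(x_ref)
    refs.append(x_ref)
refs.reverse()                 # refs[n] is x*_n
```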

## Value Function Gradient

In addition to the evaluation function $J_{n}$, the value function $V_{n}$ can serve as another criterion for optimizing the action sequence, where an action sequence is chosen to maximize the value function, i.e., to make the states in the derived state sequence close to the reference states.

The state is a random variable $\boldsymbol{x}_{n} \sim \mathcal{N} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n} \right)$, so we work with its expectation. Under the value function criterion, the expectation of the value function over the state distribution is

$$\mathbb{E}_{ \boldsymbol{x}_{n} } \left[ V_{n} \left( \boldsymbol{x}_{n} \right) \right] = V_{n} \left( \overline{\boldsymbol{x}}_{n} \right) - \mathrm{Tr} \left( \boldsymbol{W}_{\text{rs}} \boldsymbol{\Sigma}_{n} \right)$$

where $\overline{\boldsymbol{x}}_{n}$ and $\boldsymbol{\Sigma}_{n}$ are the mean value and the covariance matrix of the state $\boldsymbol{x}_{n}$ at time $n$.
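This identity follows from $\mathbb{E} \left[ \left( \boldsymbol{x}^{*} - \boldsymbol{x} \right)^{ \mathsf{T} } \boldsymbol{W} \left( \boldsymbol{x}^{*} - \boldsymbol{x} \right) \right] = \left( \boldsymbol{x}^{*} - \overline{\boldsymbol{x}} \right)^{ \mathsf{T} } \boldsymbol{W} \left( \boldsymbol{x}^{*} - \overline{\boldsymbol{x}} \right) + \mathrm{Tr} \left( \boldsymbol{W} \boldsymbol{\Sigma} \right)$ for $\boldsymbol{x} \sim \mathcal{N} \left( \overline{\boldsymbol{x}}, \boldsymbol{\Sigma} \right)$, and can be sanity-checked by Monte Carlo sampling with arbitrary (assumed) values:

```python
import numpy as np

# Monte-Carlo check of E[V_n(x_n)] = V_n(xbar_n) - Tr(W_rs Sigma_n);
# the reference state, mean, covariance, and W_rs are all arbitrary.
rng = np.random.default_rng(0)
W = np.diag([2.0, 0.5])                    # W_rs
x_ref = np.array([1.0, -1.0])              # x*_n
mean = np.array([0.2, 0.3])                # xbar_n
Sigma = np.array([[0.3, 0.1], [0.1, 0.2]])

def V(x):
    e = x_ref - x
    return -e @ W @ e

samples = rng.multivariate_normal(mean, Sigma, size=200_000)
d = x_ref - samples                        # per-sample error x*_n - x_n
mc = np.mean(-np.einsum('ni,ij,nj->n', d, W, d))   # sample estimate of E[V]
closed = V(mean) - np.trace(W @ Sigma)     # closed-form expectation
# mc and closed agree to Monte-Carlo accuracy
```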

Recall that under the value function criterion, the goal is to minimize the error between the following state $\boldsymbol{x}_{n+1}$ and its reference state $\boldsymbol{x}^{*}_{n+1}$ by altering the preceding action $\boldsymbol{a}_{n}$ at time $n$. By the definition of the value function, we can derive the derivative of the value function expectation of the following state with respect to the action being taken, which can be formally described as

\begin{align*} \frac{ \partial \mathbb{E}_{ \boldsymbol{x}_{n+1} } \left[ V_{n+1} \left( \boldsymbol{x}_{n+1} \right) \right] }{ \partial \boldsymbol{a}_{n} } & = \frac{ \partial \left( V_{n+1} \left( \overline{\boldsymbol{x}}_{n+1} \right) - \mathrm{Tr} \left( \boldsymbol{W}_{\text{rs}} \boldsymbol{\Sigma}_{n+1} \right) \right) }{ \partial \boldsymbol{a}_{n} } \\ & = \frac{ \partial V_{n+1} \left( \overline{\boldsymbol{x}}_{n+1} \right) }{ \partial \boldsymbol{a}_{n} } - \frac{ \partial \mathrm{Tr} \left( \boldsymbol{W}_{\text{rs}} \boldsymbol{\Sigma}_{n+1} \right) }{ \partial \boldsymbol{a}_{n} } \\ & = \frac{ \partial V_{n+1} \left( \tilde{F}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) \right) }{ \partial \boldsymbol{a}_{n} } - \frac{ \partial \mathrm{Tr} \left( \boldsymbol{W}_{\text{rs}} S_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) \right) }{ \partial \boldsymbol{a}_{n} } \end{align*}

where it is assumed that the dynamics of the covariance matrix from $\boldsymbol{\Sigma}_{n}$ to $\boldsymbol{\Sigma}_{n+1}$ is independent of $\boldsymbol{a}_{n}$, so $\frac{ \partial S_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) }{ \partial \boldsymbol{a}_{n} } = 0$, which makes the last term in the above equation vanish. So according to the chain rule along with the above conclusion, the value function expectation derivative is then

\begin{align*} \frac{ \partial \mathbb{E}_{ \boldsymbol{x}_{n+1} } \left[ V_{n+1} \left( \boldsymbol{x}_{n+1} \right) \right] }{ \partial \boldsymbol{a}_{n} } & = \frac{ \partial V_{n+1} \left( \tilde{F}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) \right) }{ \partial \boldsymbol{a}_{n} } \\ & = \frac{ \partial \left( - \left( \boldsymbol{x}^{*}_{n+1} - \tilde{F}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) \right)^{ \mathsf{T} } \boldsymbol{W}_{\text{rs}} \left( \boldsymbol{x}^{*}_{n+1} - \tilde{F}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) \right) \right) }{ \partial \boldsymbol{a}_{n} } \\ & = 2 \frac{ \partial \tilde{F}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) }{ \partial \boldsymbol{a}_{n} } \boldsymbol{W}_{\text{rs}} \left( \boldsymbol{x}^{*}_{n+1} - \tilde{F}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) \right) \\ & = 2 \frac{ \partial \tilde{F}_{n} \left( \overline{\boldsymbol{x}}_{n}, \boldsymbol{\Sigma}_{n}, \boldsymbol{a}_{n} \right) }{ \partial \boldsymbol{a}_{n} } \boldsymbol{W}_{\text{rs}} \left( \boldsymbol{x}^{*}_{n+1} - \overline{\boldsymbol{x}}_{n+1} \right) \end{align*}

For simplicity, the above gradient of value function expectation can be rewritten as

$$\frac{ \partial \mathbb{E} \left[ V_{n+1} \right] }{ \partial \boldsymbol{a}_{n} } = 2 \frac{ \partial \tilde{F}_{n} }{ \partial \boldsymbol{a}_{n} } \boldsymbol{W}_{\text{rs}} \left( \boldsymbol{x}^{*}_{n+1} - \overline{\boldsymbol{x}}_{n+1} \right)$$

Now we have the gradient of the value function expectation (estimated using the following reference state $\boldsymbol{x}^{*}_{n+1}$), and can apply gradient ascent to optimize the action sequence so as to find the optimal one.
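A minimal sketch of this update rule for a single time step, assuming toy linear mean dynamics $\overline{\boldsymbol{x}}_{n+1} = A \overline{\boldsymbol{x}}_{n} + B \boldsymbol{a}_{n}$ (so $\frac{ \partial \tilde{F}_{n} }{ \partial \boldsymbol{a}_{n} }$ acts as $B^{\mathsf{T}}$ in the convention above) and an arbitrary assumed reference state:

```python
import numpy as np

# Repeatedly apply a_n <- a_n + eta * 2 (dF~/da) W_rs (x*_{n+1} - xbar_{n+1})
# under assumed linear mean dynamics xbar_{n+1} = A xbar_n + B a_n.
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0, 0.5], [0.5, 0.0]])
W = np.eye(2)                        # W_rs
x_bar = np.array([1.0, 0.0])         # current mean state xbar_n
x_ref_next = np.array([0.0, 1.0])    # reference state x*_{n+1} (assumed)
a = np.zeros(2)
eta = 0.05

for _ in range(500):
    x_next = A @ x_bar + B @ a       # xbar_{n+1}
    a = a + eta * 2.0 * B.T @ W @ (x_ref_next - x_next)

# the mean next state is now driven toward the reference state
x_next = A @ x_bar + B @ a
```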

Notice that we now have two criteria for optimizing the action sequence, and both exist purely to provide gradients. The ultimate goal, as defined in the first section, is to find the action sequence that maximizes the evaluation function $J_{n}$. In the program, the two criteria are merged during the gradient iteration: at each iteration round, both candidate updates are computed, and the criterion that actually engenders the better $J_{n}$ is selected.
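The merging step can be sketched as follows; `grad_J`, `grad_V`, and `J` are hypothetical placeholders standing in for the two gradients and the evaluation derived in this summary, and the toy definitions at the bottom are purely for demonstration:

```python
import numpy as np

def pick_update(actions, grad_J, grad_V, J, eta=0.01):
    """Form one candidate update per criterion, keep the one with larger J."""
    cand_j = [a + eta * g for a, g in zip(actions, grad_J(actions))]
    cand_v = [a + eta * g for a, g in zip(actions, grad_V(actions))]
    return cand_j if J(cand_j) >= J(cand_v) else cand_v

# Toy demonstration: J is maximized at zero actions; one criterion follows
# the true gradient, the other proposes no change.
J = lambda acts: -sum(float(a @ a) for a in acts)
grad_J = lambda acts: [-2.0 * a for a in acts]
grad_V = lambda acts: [np.zeros_like(a) for a in acts]
```

With these toy definitions, `pick_update` always selects the candidate that the evaluation function prefers, so each iteration round can never do worse than the better of the two criteria.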