- An observation is defined as $y=h(x)+w$, where $x\in\mathbb{R}^n$ and $y\in\mathbb{R}^m$ denote the unknown vector and the measurement vector, $h:\mathbb{R}^n\rightarrow\mathbb{R}^m$ is a known function of $x$, and $w$ is the observation noise with probability density $f_w(w)$.
- It is assumed that $x$ is a random variable with an a priori probability density $f_x(x)$ before the observation.
- The goal is to compute the "best" estimate of $x$ using the observation.
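As a concrete illustration of this setup, here is a minimal sketch that simulates one such observation; the cubic $h$, the Gaussian prior, and the noise level are arbitrary choices for illustration, not part of the original problem statement.

```python
import numpy as np

rng = np.random.default_rng(0)

def h(x):
    # Hypothetical nonlinear measurement function h: R -> R (illustration only)
    return x**3 + x

x = rng.normal(0.0, 1.0)      # x drawn from its a priori density f_x (here N(0, 1))
w = rng.normal(0.0, 0.1)      # observation noise with density f_w (here N(0, 0.01))
y = h(x) + w                  # the observation y = h(x) + w
print(f"x = {x:.3f}, y = {y:.3f}")
```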
Optimal Estimation
- The optimal estimate $\hat x_{opt}$ is defined based on a cost function $J$: $\hat x_{opt}=\underset{\hat x}{\arg\min}\; E[J(x-\hat x)]$
- Some typical cost functions:
- Minimum Mean Square Error ($\hat x_{MMSE}$): $J(x-\hat x)=(x-\hat x)^\text{T} W(x-\hat x),\quad W>0$
- Absolute Value ($\hat x_{ABS}$): $J(x-\hat x)=|x-\hat x|$
- Maximum a Posteriori ($\hat x_{MAP}$): $J(x-\hat x)=\begin{cases} 0 & \text{if }|x-\hat x|\leq \epsilon \\ 1 & \text{if }|x-\hat x| > \epsilon\end{cases},\quad \epsilon\rightarrow 0$
- It can be shown that $\hat x_{MMSE}(y)=E[x|y]$ (the posterior mean), that $\hat x_{ABS}(y)$ is the posterior median, i.e. $\int_{-\infty}^{\hat x_{ABS}(y)}f_{x|y}(x|y)\,dx=\int_{\hat x_{ABS}(y)}^{+\infty}f_{x|y}(x|y)\,dx$, and that $\hat x_{MAP}=\arg\max_{x} f_{x|y}(x|y)$ (the posterior mode).
- If the a posteriori density $f_{x|y}(x|y)$ has a single maximum and is symmetric about $E[x|y]$, then all of the above estimates are equal to $E[x|y]$ (illustrated numerically after this list).
- In fact, under these conditions on $f_{x|y}(x|y)$, $E[x|y]$ is the optimal estimate for any cost function $J$ with $J(0)=0$ that is nondecreasing with distance (Sherman's Theorem).
- Maximum Likelihood Estimation: $\hat x_{ML}$ is the value of $x$ that maximizes the probability of observing $y$: $\hat x_{ML}(y)=\underset{x}{\arg\max}\, f_{y|x}(y|x)$
- It can be shown that $\hat x_{ML}=\hat x_{MAP}$ if there is no a priori information about $x$.
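To make the three estimates concrete, here is a small numerical sketch (not from the original notes; the skewed log-normal "posterior" is an arbitrary choice made only so that the mean, median, and mode visibly differ). For a symmetric, unimodal posterior the three values coincide, consistent with Sherman's Theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical skewed posterior f_{x|y}: a log-normal density (illustration only)
samples = rng.lognormal(mean=0.0, sigma=0.75, size=200_000)

x_mmse = samples.mean()        # posterior mean   -> MMSE estimate
x_abs = np.median(samples)     # posterior median -> ABS estimate

# Crude MAP estimate: mode of a histogram approximation of the density
hist, edges = np.histogram(samples, bins=400, density=True)
i = np.argmax(hist)
x_map = 0.5 * (edges[i] + edges[i + 1])

print(f"MMSE (mean)  : {x_mmse:.3f}")
print(f"ABS  (median): {x_abs:.3f}")
print(f"MAP  (mode)  : {x_map:.3f}")
```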
Linear Gaussian Observation
- Consider the following observation: $y=Ax+Bw$, where $w\sim\mathcal{N}(0,I_r)$ is a Gaussian random vector and the matrices $A_{m\times n}$ and $B_{m\times r}$ are known.
- In this observation, $x$ is estimable only if $A$ has full column rank; otherwise the problem has infinitely many solutions.
- If $BB^\text{T}$ is invertible, then: $$f_{y|x}(y|x)=\frac{1}{(2\pi)^{m/2}|BB^\text{T}|^{1/2}}\exp\left(-\frac{1}{2}(y-Ax)^\text{T}(BB^\text{T})^{-1}(y-Ax)\right)$$
- The maximum likelihood estimate can be computed as: $$\hat x_{ML}=\arg\min_x\,(y-Ax)^\text{T}(BB^\text{T})^{-1}(y-Ax)=(A^\text{T}(BB^\text{T})^{-1}A)^{-1}A^\text{T}(BB^\text{T})^{-1}y$$
- Notably, $\hat x_{ML}$ is the Weighted Least Squares (WLS) solution of $y=Ax$ with weight matrix $W=(BB^\text{T})^{-1}$, i.e. $\hat x_{WLS}=\arg\min_x\,(y-Ax)^\text{T} W(y-Ax)$ (see the numerical sketch at the end of this section).
- $\hat x_{ML}$ is an unbiased estimate: $$b(x)=E[x-\hat x_{ML}\mid x]=E\left[x-(A^\text{T}(BB^\text{T})^{-1}A)^{-1}A^\text{T}(BB^\text{T})^{-1}y \mid x\right]=0$$
- The covariance of the estimation error is: $$P_e(x)=E[(x-\hat x_{ML})(x-\hat x_{ML})^\text{T}\mid x]=(A^\text{T}(BB^\text{T})^{-1}A)^{-1}$$
- $\hat x_{ML}$ is efficient in the sense of the Cramér-Rao bound.
- Example: Consider the following scalar linear Gaussian observation: $y=ax+w$, where $a$ is a nonzero real number and $w\sim\mathcal{N}(0,r)$ is the observation noise.
- Maximum a Posteriori Estimation: To compute $\hat x_{MAP}$, it is assumed that the a priori density of $x$ is Gaussian with mean $m_x$ and variance $p_x$: $x\sim\mathcal{N}(m_x,p_x)$.
- The conditions of Sherman's Theorem are satisfied, and therefore: $$\hat x_{MAP}=E[x|y]=m_x+\frac{p_{xy}}{p_y}(y-m_y)=m_x+\frac{ap_x}{a^2p_x+r}(y-am_x)=\frac{ap_xy+m_xr}{a^2p_x+r}$$
- Estimation bias: $$b_{MAP}=E[x-\hat x_{MAP}]=m_x-\frac{ap_xE[y]+rm_x}{a^2p_x+r}=m_x-\frac{a^2p_xm_x+rm_x}{a^2p_x+r}=0$$
- Estimation error covariance: $$p_{MAP}=E[(x-\hat x_{MAP})^2]=p_x-\frac{a^2p_x^2}{a^2p_x+r}=\frac{p_xr}{a^2p_x+r}$$
- Maximum Likelihood Estimation: For this example: $$f_{y|x}(y|x)=f_w(y-ax)=\frac{1}{\sqrt{2\pi r}}\exp\left(-\frac{(y-ax)^2}{2r}\right)$$
- With this information: $$\hat x_{ML}=\arg\max_x f_{y|x}(y|x)=\frac{y}{a}$$
- Estimation bias: $$b_{ML}=E[x-\hat x_{ML}\mid x]=x-\frac{ax}{a}=0$$
- Estimation error covariance: $$p_{ML}=E[(x-\hat x_{ML})^2\mid x]=E\left[\left(x-\frac{ax+w}{a}\right)^2\right]=\frac{r}{a^2}$$
- Comparing $\hat x_{MAP}$ and $\hat x_{ML}$: $$\lim_{p_x\rightarrow +\infty}\hat x_{MAP}=\hat x_{ML}$$ That is, if there is no a priori information about $x$ (infinite prior variance), the two estimates are equal.
- For the error covariances: $$\frac{1}{p_{MAP}}=\frac{1}{p_{ML}}+\frac{1}{p_x}$$
- In other words, the information (inverse covariance) after the observation is the sum of the information provided by the observation and the a priori information.
- Likewise, for the error covariances: $$\lim_{p_x\rightarrow +\infty}p_{MAP}=p_{ML}$$
- It is possible to include a priori information in maximum likelihood estimation.
- The a priori distribution of $x$, $\mathcal{N}(m_x,p_x)$, can be rewritten as the following observation: $m_x=x+v$, where $v\sim\mathcal{N}(0,p_x)$ plays the role of observation noise.
- Combined observation: $$z=Ax+u\quad\text{where}\quad z=\left[\begin{matrix} m_x\\ y\end{matrix}\right],\quad A=\left[\begin{matrix} 1\\ a\end{matrix}\right],\quad u=\left[\begin{matrix} v\\ w\end{matrix}\right]$$
- Assuming that $v$ and $w$ are independent: $$u\sim\mathcal{N}\left(0,\left[\begin{matrix} p_x & 0\\ 0 & r\end{matrix}\right]\right)$$
- Maximum likelihood estimation: $$\begin{align*}\hat x_{MLp}(z)=&\arg\max_x f_{z|x}(z|x)\\ =&\arg\min_x \left(\frac{(m_x-x)^2}{p_x}+\frac{(y-ax)^2}{r}\right)\\ =&\frac{ap_xy+m_xr}{a^2p_x+r}=\hat x_{MAP}\end{align*}$$
- $\hat x_{MLp}$ is unbiased and has the same error covariance as $\hat x_{MAP}$.
- Therefore $\hat x_{MLp}$ and $\hat x_{MAP}$ are equivalent.
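The following is a minimal numerical sketch of these results (not part of the original notes; all numbers are arbitrary). It implements $\hat x_{ML}=(A^\text{T}(BB^\text{T})^{-1}A)^{-1}A^\text{T}(BB^\text{T})^{-1}y$ as a WLS solver, applies it to the scalar example $y=ax+w$, and checks that folding the prior in as the extra observation $m_x=x+v$ reproduces the closed-form $\hat x_{MAP}$.

```python
import numpy as np

def ml_estimate(A, B, y):
    """ML / WLS estimate for y = A x + B w with w ~ N(0, I)."""
    S_inv = np.linalg.inv(B @ B.T)                 # (B B^T)^{-1} is the WLS weight
    return np.linalg.solve(A.T @ S_inv @ A, A.T @ S_inv @ y)

# Scalar example y = a x + w, w ~ N(0, r), prior x ~ N(m_x, p_x)  (arbitrary values)
a, r, m_x, p_x = 2.0, 0.5, 1.0, 4.0
rng = np.random.default_rng(1)
x_true = rng.normal(m_x, np.sqrt(p_x))
y = a * x_true + rng.normal(0.0, np.sqrt(r))

# Plain ML: y = [a] x + [sqrt(r)] w  ->  should equal y / a
x_ml = ml_estimate(np.array([[a]]), np.array([[np.sqrt(r)]]), np.array([y]))[0]

# Prior folded in as an extra observation: z = [m_x, y], A = [1, a]^T
A_aug = np.array([[1.0], [a]])
B_aug = np.diag([np.sqrt(p_x), np.sqrt(r)])
x_mlp = ml_estimate(A_aug, B_aug, np.array([m_x, y]))[0]

x_map = (a * p_x * y + m_x * r) / (a**2 * p_x + r)  # closed-form MAP estimate
print(f"x_ML = {x_ml:.4f} (y/a = {y/a:.4f}),  x_MLp = {x_mlp:.4f},  x_MAP = {x_map:.4f}")
```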
Standard Kalman Filter
- Consider the following linear system: $$\begin{cases}x(k+1)&=&A(k)x(k)+w(k) \\ y(k)&=&C(k)x(k)+v(k)\end{cases}$$
- $w(k)\sim\mathcal{N}(0,Q(k))$ and $v(k)\sim\mathcal{N}(0,R(k))$ are independent Gaussian white noise processes, where $R(k)$ is invertible.
- It is assumed that there is an a priori estimate of $x$, denoted by $\hat x^-(k)$, which is unbiased with a Gaussian estimation error, independent of $w$ and $v$: $e^-(k)\sim\mathcal{N}(0,P^-(k))$, where $P^-(k)$ is invertible.
- The Kalman filter is a recursive algorithm for computing the state estimate.
- Output Measurement: The information in $\hat x^-(k)$ and $y(k)$ can be written as the following observation: $$\left[\begin{matrix}\hat x^-(k)\\ y(k) \end{matrix}\right]=\left[\begin{matrix} I\\ C(k)\end{matrix}\right] x(k)+\left[\begin{matrix} e^-(k)\\ v(k)\end{matrix}\right]$$
- Applying Weighted Least Squares (WLS) to this stacked observation, together with the matrix inversion lemma $(A+BD^{-1}C)^{-1}=A^{-1}-A^{-1}B(D+CA^{-1}B)^{-1}CA^{-1}$, yields the update below (a numerical check appears at the end of this derivation).
- Defining: $$K(k)=P^-(k)C^\text{T}(k)\left[C(k)P^-(k)C^\text{T}(k)+R(k)\right]^{-1}$$
- We have: $$\hat x(k)=\hat x^-(k)+K(k)\left(y(k)-C(k)\hat x^-(k)\right)$$
- The state estimate is the a priori estimate plus the output prediction error scaled by the gain, since the predicted output is $\hat y^-(k)=C(k)\hat x^-(k)$.
- $K(k)$ is the Kalman filter gain.
- Estimation error covariance: $$P(k)=(I-K(k)C(k))P^-(k)$$
- Information: $\hat x(k)=x(k)+e(k)$, where $e(k)\sim\mathcal{N}(0,P(k))$.
- State Update: To complete the recursive algorithm, we need to compute $\hat x^-(k+1)$ and $P^-(k+1)$.
- Information: $$\begin{align*}\hat x(k)=&\,x(k)+e(k)\\ 0=&\left[\begin{matrix} -I& A(k)\end{matrix}\right]\left[\begin{matrix} x(k+1)\\ x(k)\end{matrix}\right]+w(k)\end{align*}$$
- Eliminating $x(k)$ from the above observations gives: $$A(k)\hat x(k)=x(k+1)+A(k)e(k)-w(k)$$
- It follows that: $$\hat x^-(k+1)=A(k)\hat x(k)$$
- Estimation error: $$e^-(k+1)=A(k)e(k)-w(k)$$
- Estimation error covariance: $$P^-(k+1)=A(k)P(k)A^\text{T}(k)+Q(k)$$
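As referenced above, here is a quick numerical check of the measurement update (a sketch with arbitrary matrices, not from the original notes): the gain form of $\hat x(k)$ and $P(k)$ should coincide with the WLS solution of the stacked observation with weight $\mathrm{diag}(P^-(k),R(k))^{-1}$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 3, 2                                   # arbitrary state and output dimensions

P_minus = np.diag([1.0, 2.0, 0.5])            # a priori covariance P^-(k)
C = rng.standard_normal((m, n))               # measurement matrix C(k)
R = np.diag([0.1, 0.2])                       # measurement noise covariance R(k)
x_minus = rng.standard_normal(n)              # a priori estimate
y = rng.standard_normal(m)                    # a measurement

# Kalman-gain form of the measurement update
K = P_minus @ C.T @ np.linalg.inv(C @ P_minus @ C.T + R)
x_kf = x_minus + K @ (y - C @ x_minus)
P_kf = (np.eye(n) - K @ C) @ P_minus

# WLS solution of the stacked observation [x^-; y] = [I; C] x + [e^-; v]
A_stack = np.vstack([np.eye(n), C])
z = np.concatenate([x_minus, y])
W = np.linalg.inv(np.block([[P_minus, np.zeros((n, m))],
                            [np.zeros((m, n)), R]]))
P_wls = np.linalg.inv(A_stack.T @ W @ A_stack)
x_wls = P_wls @ A_stack.T @ W @ z

print(np.allclose(x_kf, x_wls), np.allclose(P_kf, P_wls))   # True True
```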
Summary:
- Initial Conditions: $\hat x^-(k)$ and its error covariance $P^-(k)$.
- Gain Calculation: $$K(k)=P^-(k)C^\text{T}(k)\left[C(k)P^-(k)C^\text{T}(k)+R(k)\right]^{-1}$$
- $\hat x(k)$: $$\begin{align*}\hat x(k)=&\hat x^-(k)+K(k)(y(k)-C(k)\hat x^-(k))\\ P(k)=&(I-K(k)C(k))P^-(k) \end{align*}$$
- $\hat x^-(k+1)$: $$\begin{align*}\hat x^-(k+1)=&A(k)\hat x(k)\\ P^-(k+1)=&A(k)P(k)A^\text{T}(k)+Q(k)\end{align*}$$
- Go back to the gain calculation and continue the loop for $k+1$ (see the implementation sketch after this list).
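The recursion above maps directly to code. Below is a minimal sketch of the standard Kalman filter loop; the double-integrator model, noise levels, horizon, and initial conditions are arbitrary illustrations, not from the original notes.

```python
import numpy as np

def kalman_step(x_minus, P_minus, y, A, C, Q, R):
    """One cycle of the standard Kalman filter: gain, measurement update, time update."""
    # Gain calculation
    K = P_minus @ C.T @ np.linalg.inv(C @ P_minus @ C.T + R)
    # Measurement update: x_hat(k), P(k)
    x_hat = x_minus + K @ (y - C @ x_minus)
    P = (np.eye(len(x_minus)) - K @ C) @ P_minus
    # Time update: a priori estimate and covariance for k+1
    return x_hat, P, A @ x_hat, A @ P @ A.T + Q

# Example: discretized double integrator with position measurements (illustration only)
dt = 0.1
A = np.array([[1.0, dt], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Q = 0.01 * np.eye(2)
R = np.array([[0.25]])

rng = np.random.default_rng(3)
x_true = np.array([0.0, 1.0])
x_minus, P_minus = np.zeros(2), 10.0 * np.eye(2)     # initial conditions

for k in range(100):
    # Simulate the plant and the noisy measurement
    x_true = A @ x_true + rng.multivariate_normal(np.zeros(2), Q)
    y = C @ x_true + rng.multivariate_normal(np.zeros(1), R)
    x_hat, P, x_minus, P_minus = kalman_step(x_minus, P_minus, y, A, C, Q, R)

print("final estimate:", x_hat, " true state:", x_true)
```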
Remarks:
- Estimation residue: $$\gamma(k)=y(k)-C(k)\hat x^-(k)$$
- Residue covariance: $$P_\gamma(k)=C(k)P^-(k)C^\text{T}(k)+R(k)$$
- The residue signal is used for monitoring the performance of the Kalman filter (a small example follows this list).
- Modeling error, round-off error, disturbances, correlation between input and measurement noise, and other factors might cause a biased and colored residue.
- The residue signal can be used in Fault Detection and Isolation (FDI).
- The standard Kalman filter is not numerically robust because it involves matrix inversions. For example, the computed error covariance matrix might lose positive definiteness due to computational errors.
- There are different implementations of the Kalman filter that improve on the standard form in the following aspects:
    - Computational efficiency
    - Dealing with disturbances or unknown inputs
    - Dealing with singular systems (difference-algebraic equations)
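As a small illustration of residue-based monitoring (a sketch, not from the original notes), the snippet below computes the normalized residue $\gamma(k)^\text{T}P_\gamma(k)^{-1}\gamma(k)$. For a correctly tuned filter its sample mean is close to the measurement dimension $m$; a persistent drift away from $m$ suggests modeling errors, bias, or faults.

```python
import numpy as np

def normalized_residue(gamma, P_gamma):
    """Normalized residue gamma^T P_gamma^{-1} gamma for one time step."""
    return float(gamma @ np.linalg.solve(P_gamma, gamma))

# Synthetic check: residues drawn from N(0, P_gamma), as expected for a healthy filter
rng = np.random.default_rng(4)
m = 2
P_gamma = np.array([[0.5, 0.1], [0.1, 0.3]])
scores = [normalized_residue(rng.multivariate_normal(np.zeros(m), P_gamma), P_gamma)
          for _ in range(5000)]
print(np.mean(scores))   # close to m = 2 for a consistent filter
```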