GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.
Fall 2016: Pattern Classification and Machine Learning
[TOC]
- input variable : features, covariates, independent variables, explanatory variables, exogenous variables, predictors, regressors
- output variable : target, label, response, outcome, dependent variables, endogenous variables, measured variables, regressands
- regression : relate input to output for prediction or interpretation
- data : $(y_n,\v x_n)$
  - dimensionality : $D$
  - data-size : $N$
- regression function : $y_n\approx f(\v x_n)\;\forall n$
- correlation $\not=$ causation
- linear regression : model that assumes a linear relationship
  - parameters : $\v w=(w_0,w_1)$
  - simple linear regression : $y_n\approx f(\v x_n)=w_0+w_1x_{n1}$
  - multiple linear regression : $y_n\approx f(\v x_n)=w_0+w_1x_{n1}+\cdots+w_Dx_{nD}=\v x_n^\top \v w$ with $\v x_n=(1,x_{n1},\cdots,x_{nD})^\top$
- learning : estimation, fitting, finding $\v w$
- $D>N$ problem : more parameters than data, under-determined for linear regression, use regularization
- cost function : energy, loss, training objective, quantifies how well the model fits the data (see the loss sketch below)
  - $y$ real-valued : better symmetric around $0$
  - same penalization : large mistakes and very-large mistakes
  - statistical vs computational trade-off
  - outliers : far-away data samples, handling them is a statistical property
- mean square error : $MSE(\v w)=\frac{1}{N}\sum_{n=1}^N [y_n-f(\v x_n)]^2$, sensitive to outliers
- mean absolute error : $MAE(\v w)=\frac{1}{N}\sum_n\abs{y_n-f(\v x_n)}$, robust to outliers
- convexity : $f(\lambda \v u +(1-\lambda)\v v)\le\lambda f(\v u)+(1-\lambda)f(\v v)$ for any $\v u,\v v$ and $0\le\lambda\le1$
  - concave : negative of a convex function
  - unique global minimum value
- error vector : $\v e=\v y -\v X\v w$
- Huber loss : $Huber(e)=\cases{\frac{1}{2}e^2 & \abs{e}\le\delta\\ \delta\abs{e}-\frac{1}{2}\delta^2 & \abs{e}>\delta}$, robust to outliers, convex, differentiable
- Tukey's bisquare loss : $\frac{\partial \L}{\partial e}=\cases{e[1-e^2/\delta^2]^2 & \abs{e}\le\delta\\ 0 & \abs{e}>\delta}$, robust to outliers, non-convex
- Hinge loss : $Hinge(e)=\max(0,e)$, relu
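The losses above translate directly into code. A minimal NumPy sketch (function and variable names are mine) comparing MSE, MAE and the Huber loss on a residual vector $\v e=\v y-\v X\v w$:

```python
import numpy as np

def mse(e):
    """Mean square error of the residuals e = y - Xw."""
    return np.mean(e ** 2)

def mae(e):
    """Mean absolute error, more robust to outliers."""
    return np.mean(np.abs(e))

def huber(e, delta=1.0):
    """Huber loss: quadratic for |e| <= delta, linear beyond."""
    small = np.abs(e) <= delta
    return np.mean(np.where(small, 0.5 * e ** 2,
                            delta * np.abs(e) - 0.5 * delta ** 2))

e = np.array([0.1, -0.3, 0.2, -10.0])   # the last residual plays the outlier
print(mse(e), mae(e), huber(e))          # MSE blows up, MAE/Huber stay moderate
```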
- optimization : $\min_{\v w}\L(\v w)$
- grid search : brute-force, exponential, $10^D$
- local minimum : $\L(\v w^*)\le\L(\v w)\;\forall\v w\st\norm{\v w-\v w^*}<\epsilon$
- global minimum : $\L(\v w^*)\le\L(\v w)\;\forall\v w$
- strict minimum : inequality is strict
- lower bound : required for any cost function
- smooth optimization : differentiable
- gradient : $\nabla\L(\v w)=\left(\frac{\partial\L(\v w)}{\partial w_1},\cdots,\frac{\partial\L(\v w)}{\partial w_D}\right)^\top$
- gradient descent (batch) : $\v w^{(t+1)}=\v w^{(t)}-\gamma \v g$ with $\v g=\nabla\L(\v w^{(t)})$
  - step-size : learning rate, $\gamma>0$
- MSE : $\L(\v w)=\frac{1}{2N}\sum_{n=1}^N(y_n-\v x_n^\top\v w)^2=\frac{1}{2N}\v e^\top\v e$
  - gradient : $\nabla\L(\v w)=-\frac{1}{N}\v X^\top\v e$
  - complexity : $O(ND+D)$
- stochastic gradient descent : $\v g = \nabla\L_n(\v w^{(t)})$ with $E[\nabla\L_n(\v w)]=\nabla\L(\v w)$
  - complexity : $O(D)$
- mini-batch stochastic gradient descent : $\v g = \frac{1}{\abs{B}}\sum_{n\in B}\nabla\L_n(\v w^{(t)})$ with $B\subseteq\{1,\ldots,N\}$ randomly chosen (see the sketch below)
  - complexity : $O(D\abs{B})$
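As a concrete illustration of the variants above, a minimal NumPy sketch of mini-batch SGD on the MSE cost (it reduces to batch gradient descent for `batch_size=N` and to plain SGD for `batch_size=1`); as in the notes, the rows of `X` are the $\v x_n^\top$:

```python
import numpy as np

def mse_gradient(X, y, w):
    """Gradient of L(w) = 1/(2N) ||y - Xw||^2, i.e. -(1/N) X^T e."""
    e = y - X @ w
    return -X.T @ e / len(y)

def minibatch_sgd(X, y, gamma=0.1, batch_size=32, epochs=100, seed=0):
    """Mini-batch SGD for linear regression with the MSE cost."""
    rng = np.random.default_rng(seed)
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(epochs):
        idx = rng.permutation(N)
        for start in range(0, N, batch_size):
            batch = idx[start:start + batch_size]
            w -= gamma * mse_gradient(X[batch], y[batch], w)
    return w
```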
- alternative convexity characterization : $\L(\v u)\ge\L(\v w)+\nabla\L(\v w)^\top(\v u-\v w)\;\forall\v u,\v w$, always above its linearization, for differentiable functions only
- subgradients : $\v g$ st. $\L(\v u)\ge\L(\v w)+\v g^\top(\v u-\v w)\;\forall\v u$, if $\L$ is differentiable, $\v g=\nabla\L(\v w)$
  - MAE : $\v g=\cases{-1 & e<0\\ [-1,1]& e=0\\ 1 & e>0}$
- constrained optimization : $\min_{\v w}\L(\v w)$ st. $\v w\in C$
  - convex sets : $\theta\v u+(1-\theta)\v v\in C$ for any $\v u,\v v\in C$ and $0\le\theta\le1$
  - intersection : convex
- projected gradient descent : unique, efficient, $P_C(\v w')=\arg\min_{\v v\in C}\norm{\v v-\v w'}$
  - update rule : $\v w^{(t+1)}=P_C[\v w^{(t)}-\gamma\nabla\L(\v w^{(t)})]$
- penalty functions : $\min_{\v w}\L(\v w)+I_C(\v w)$
  - brick wall : $I_C(\v w)=\cases{0 & \v w\in C\\ \infty &\v w\not\in C}$, not continuous
  - penalize error : $\min_{\v w}\L(\v w)+\lambda\norm{A\v w-\v b}^2$ with trade-off parameter $\lambda$
- stopping criteria : $\nabla\L(\v w)$ close to $0$, close to the optimum
- optimality
  - necessary : $\nabla\L(\v w^*)=0$
  - sufficient : positive second-order derivative
- step-size selection : guaranteed to converge only when $\gamma <\gamma_{\min}$, application specific
- line-search : step-size chosen automatically
- Robbins-Monro condition : $\v w^{(t+1)}=\v w^{(t)}-\gamma^{(t)}\nabla\L_n(\v w^{(t)})$ with $\sum_{t=1}^\infty\gamma^{(t)}=\infty$ and $\sum_{t=1}^\infty (\gamma^{(t)})^2<\infty$
  - e.g. $\gamma^{(t)}=1/(1+t)^r$ with $r\in(0.5,1)$
- feature normalization : gradient descent is sensitive to ill-conditioning, directions might converge at different speeds, need normalization
- Hessian matrix : $\v H(\v w^*)=\frac{\partial^2\L(\v w^*)}{\partial \v w\partial \v w^\top}$, positive semi-definite at a minimum
- normal equations : $\nabla\L(\v w^*)=0$
  - MSE : $\nabla\L(\v w)=-\frac{1}{N}\v X^\top\v e=0$ gives $\v w^*=(\v X^\top\v X)^{-1}\v X^\top\v y$ (see the sketch below)
  - complexity : $O(ND^2+D^3)$
- Gram matrix : $\v X^\top\v X$ invertible iff $rank(\v X)=D$ (full column rank)
- rank deficiency
  - $D>N$ : always $rank(\v X)<D$
  - ill-conditioned : $D\le N$ but some columns collinear
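A minimal sketch of the normal equations for MSE (assuming the Gram matrix is invertible, i.e. `X` has full column rank):

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y for the MSE minimizer."""
    return np.linalg.solve(X.T @ X, X.T @ y)

# np.linalg.lstsq(X, y, rcond=None) would also handle the rank-deficient case.
```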
- gaussian
  - random variable : $\N(y\mid\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp{\left[-\frac{(y-\mu)^2}{2\sigma^2}\right]}$
  - random vector : $\N(\v y\mid\v \mu,\v \Sigma)=\frac{1}{\sqrt{(2\pi)^D\det(\v\Sigma)}}\exp{\left[-\frac{1}{2}(\v y-\v \mu)^\top\v\Sigma^{-1}(\v y-\v\mu)\right]}$
- probabilistic model : $y_n=\v x_n^\top\v w+\epsilon_n$ with $\epsilon_n\sim\N(0,\sigma^2)$
  - log-likelihood : $\L_{lik}(\v w)=\log p(\v y\mid\v X,\v w)=-\frac{1}{2\sigma^2}\sum_{n=1}^N (y_n-\v x_n^\top\v w)^2+ cnst$
  - maximum likelihood estimator : $\arg\min_{\v w}\L_{mse}(\v w)=\arg\max_{\v w}\L_{lik}(\v w)$, i.e. $\v w_{lse}=\arg\max_{\v w}\log\left[\Pi_{n=1}^N\N(y_n\mid\v x_n^\top\v w,\sigma^2)\right]$
  - expected log-likelihood : $\L_{lik}(\v w)\approx E_{p(y,\v x)}[\log p(y\mid \v x,\v w)]$
  - consistency : $\v w_{mle}\to\v w_{true}$ in probability
  - asymptotically normal : $(\v w_{mle}-\v w_{true})\to\frac{1}{\sqrt{N}}\N(\v w_{mle}\mid \v 0,\v F^{-1}(\v w_{true}))$ with Fisher information $\v F(\v w)=-E_{p(\v y)}[\frac{\partial^2\L}{\partial \v w\partial \v w^\top}]$
- laplace : $p(y_n\mid\v x_n,\v w)=\frac{1}{2b}\exp{\left[-\frac{1}{b}\abs{y_n-\v x_n^\top\v w}\right]}$, corresponds to MAE
- underfit : too limited, not fitting the signal well
- overfit : too powerful, fitting noise in addition to the signal
- get more data : might reduce overfitting
- Occam's razor : "Plurality is not to be posited without necessity"
- regularization : add a term that penalizes complex models, $\min_{\v w}\L(\v w)+\Omega(\v w)$
- basis functions : nonlinear transformations applied to the input variable, $\phi_j(\v x)$, $M$ basis functions
- linear basis function models : $y_n=w_0+\sum_{j=1}^M w_j\phi_j(\v x_n)=\v\phi(\v x_n)^\top\v w$ with $\v\phi(\v x_n)^\top=[1,\phi_1(\v x_n),\cdots,\phi_M(\v x_n)]$
  - linear : in $\v w$
  - nonlinear : in $\v x$
  - dimension of $\v w$ : $M+1\not = D$
  - replace $\v X$ by $\v \Phi=\begin{bmatrix}1&\phi_1(\v x_1)&\cdots&\phi_M(\v x_1)\\ \vdots&\vdots&\ddots&\vdots\\ 1&\phi_1(\v x_N)&\cdots & \phi_M(\v x_N)\end{bmatrix}$ in $\v y=\v\Phi \v w$
- $L_2$-regularization : $\Omega(\v w)=\lambda\norm{\v w}_2^2$ with $\norm{\v w}_2^2=\sum_{j=0}^M w_j^2$ and $\lambda$ the regularization parameter
  - enforce : model to be simpler
- ridge regression : $L_2$-regularization with MSE
  - explicit solution : $\v w_{ridge}^*=(\v\Phi^\top\v\Phi+2N\lambda\v I_M)^{-1}\v\Phi^\top\v y$ (see the sketch below)
  - fights ill-conditioning : eigenvalues are at least $2N\lambda$, improves the condition number of the Gram matrix
  - maximum-a-posteriori : $\v w_{ridge}=\arg\max_{\v w}\log\left[\Pi_{n=1}^N\N(y_n\mid\v x_n^\top\v w,\sigma^2)\cdot\N(\v w\mid 0, \frac{1}{\lambda}\v I)\right]$
- $L_1$-regularization : $\Omega(\v w)=\lambda\norm{\v w}_1$ with $\norm{\v w}_1=\sum_{j=0}^M \abs{w_j}$
  - enforce : sparsity, some elements of $\v w$ to be $0$
- lasso : $L_1$-regularization with MSE
- other regularizations
  - shrinkage
  - dropout
  - weight decay
  - early stopping
- maximum-a-posteriori : $\v w_{map}=\arg\max_{\v w}p(\v y\mid\v w, \v X, \v\lambda)\cdot p(\v w\mid\v\theta)$, maximize both likelihood and prior
  - prior : $p(\v w\mid\v \lambda)$
  - ridge regression : $\v w_{ridge}=\arg\max_{\v w}\log\left[\Pi_{n=1}^N \N(y_n\mid\v x_n^\top\v w,\sigma^2)\cdot\N(\v w\mid 0,\frac{1}{\lambda}\v I)\right]$
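A minimal sketch of ridge regression with a polynomial basis expansion, following the explicit solution above (the 1-D toy data is hypothetical):

```python
import numpy as np

def poly_basis(x, M):
    """Polynomial basis expansion phi(x) = [1, x, ..., x^M] for scalar inputs."""
    return np.vander(x, M + 1, increasing=True)

def ridge(Phi, y, lam):
    """Ridge solution w = (Phi^T Phi + 2 N lam I)^(-1) Phi^T y, as in the notes."""
    N, M1 = Phi.shape
    return np.linalg.solve(Phi.T @ Phi + 2 * N * lam * np.eye(M1), Phi.T @ y)

# hypothetical 1-D example: noisy sine fitted with a degree-9 polynomial
x = np.linspace(0, 1, 50)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).standard_normal(50)
w = ridge(poly_basis(x, 9), y, lam=1e-4)
```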
- underlying distribution : $\mathcal D=\mathcal Y\times\mathcal X$, unknown
- independent samples : $S=\{(y_n,\v x_n)\}^N_{n=1}$
- learning algorithm : $\mathcal A(S)$
- prediction function : $f_S=\mathcal A(S)$
- training set : $S_t$, 80%
- validation set : $S_v$, 20%
- generalization error : $\L_D(f)=E[l(y,f(\v x))]$ with $l$ a loss function, true/expected risk/loss
- training error : $\L_S(f_S)=\frac{1}{N}\sum_{n=1}^N l(y_n,f_S(\v x_n))$, empirical
  - issue : $f_S$ is itself a function of $S$, so the training error is a biased estimate
- validation : $\L_{S_v}(f_{S_t})=\frac{1}{\abs{S_v}}\sum_{(y_n,\v x_n)\in S_v}l(y_n,f_{S_t}(\v x_n))$ with $\L_D(f_{S_t})=E_{S_v\sim D}[\L_{S_v}(f_{S_t})]$
  - finite size variation : $P\left\{\max_k\abs{\L_D(f_k)-\L_{S_v}(f_k)}\ge\sqrt{\frac{\ln(2K/\delta)}{2\abs{S_v}}}\right\}\le\delta$ for $K$ hyper-parameters
- Chernoff bound : $\Phi_i$ iid RV with range $[0,1]$, $P\left\{\abs{\frac{1}{N}\sum_{n=1}^N\Phi_n-E[\Phi]}>\epsilon\right\}\le 2e^{-2N\epsilon^2}$
- $K$-fold cross-validation : train on $K-1$ folds, test on the remaining fold, permute and average the results (see the sketch below)
- Markov inequality : $P(X\ge a)\le\frac{E(X)}{a}$
- Hoeffding lemma : zero-mean RV with range $[a,b]$, $E[e^{sX}]\le e^{s^2(b-a)^2/8}$
- simple model : high bias (bad fit), small variance
- complex model : small bias, high variance
- data generation model : $y=f(\v x)+\epsilon$ with zero-mean noise $\epsilon$
  - bias-variance decomposition : $E_{S_t\sim D,\epsilon}[(f(\v x_0)+\epsilon-f_{S_t}(\v x_0))^2]=var[\epsilon]+(f(\v x_0)-E_{S_t\sim D}[f_{S_t}(\v x_0)])^2+E_{S_t\sim D}[(E_{S_t'\sim D}[f_{S_t'}(\v x_0)]-f_{S_t}(\v x_0))^2]=noise + bias^2 + variance$, all terms non-negative
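A minimal sketch of $K$-fold cross-validation; `fit` and `loss` are placeholders for any training routine and validation loss:

```python
import numpy as np

def kfold_cv(X, y, fit, loss, K=5, seed=0):
    """K-fold cross-validation: train on K-1 folds, evaluate on the held-out
    fold, and average the validation losses.
    `fit(X, y) -> model`, `loss(model, X, y) -> float`."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    scores = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        scores.append(loss(model, X[val], y[val]))
    return np.mean(scores)
```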
- classification : relates input $\v x$ to output $y$, but with $y\in\{C_1,C_2\}$ restricted to discrete values
  - class labels : $C_i$
  - multi-class classification : $y\in\{C_0,\ldots,C_{K-1}\}$
- decision boundaries : decided by a classifier
- MSE : counts positive and negative errors as equally bad (whereas only one is wrong), small MSE $\Rightarrow$ small classification error (the converse is not true)
- nearest neighbors : euclidean distance, useless in high dimension
  - $k$ nearest neighbors
  - smoothing kernels : weighted linear combination of the elements in the neighborhood
- linear decision boundaries : hyperplanes
- support vector machines : take the hyperplane with the highest robustness to wiggling
- kernel trick : augment the feature vector with some non-linear functions
- neural networks : learn an appropriate non-linear transform of the input that makes it linearly separable
- optimal classification : for a known generating model
  - known joint distribution : $p(\v x, y)$
  - optimal choice for class label : $\hat y(\v x)$
  - probability of misclassification : $P\{\hat y(\v x)\not=y\}=\int p(\v x)\sum_{y\in\mathcal Y}p(y\mid\v x)\v 1_{\hat y(\v x)\not=y}\,d\v x$
  - strategy : $\hat y(\v x)=\arg\max_{y\in\mathcal Y}p(y\mid \v x)$
- logistic function : $\sigma(x)=\frac{e^x}{1+e^x}$
- binary posterior probabilities : $p(C_1\mid\v x)=\sigma(\v x^\top\v w)$ and $p(C_2\mid\v x)=1-\sigma(\v x^\top\v w)$
- training : maximize $p(\v y\mid\v X,\v w)=\Pi_{n:y_n=C_1}p(y_n=C_1\mid\v x_n)\Pi_{n:y_n=C_2}p(y_n=C_2\mid \v x_n)=\Pi_{n=1}^N\sigma(\xtw)^{y_n}[1-\sigma(\xtw)]^{1-y_n}$
- cost function : minimize $\L(\v w)=-\sum_{n=1}^N \left\{y_n\log\sigma(\v x_n^\top\v w)+(1-y_n)\log[1-\sigma(\v x_n^\top\v w)]\right\}=\sum_{n=1}^N\log[1+\exp(\v x_n^\top\v w)]-y_n \v x_n^\top\v w$
- gradient : $\nabla\L(\v w)=\sum_{n=1}^N\v x_n(\sigma(\v x_n^\top\v w)-y_n)=\v X^\top[\sigma(\v X\v w)-\v y]$, convex
- gradient descent : $\v w^{(t+1)}=\v w^{(t)}-\gamma^{(t)}\nabla\L(\v w^{(t)})$ (see the sketch below)
- Hessian : $\v H_{i,j}=\frac{\partial^2\L(\v w)}{\partial w_i\partial w_j}$, $D\times D$, symmetric, convex if positive semi-definite
- Taylor series : $\sum_{n=0}^\infty \frac{f^{(n)}(a)}{n!}(x-a)^n$
  - second order : $\L(\v w)\approx\L(\v w^*)+\nabla\L(\v w^*)^\top(\v w-\v w^*)+\frac{1}{2}(\v w-\v w^*)^\top\v H(\v w^*)(\v w-\v w^*)$
- Newton's method : $\v w^{(t+1)}=\v w^{(t)}-\gamma^{(t)}(\v H^{(t)})^{-1}\nabla\L(\v w^{(t)})$
  - Hessian : $\v H(\v w)=\v X^\top\v S\v X$ with $\v S$ an $N\times N$ diagonal matrix with entries $S_{nn}=\sigma(\v x_n^\top\v w)[1-\sigma(\v x_n^\top\v w)]$
- linearly separable issue : no finite optimal weight vector (the weights diverge to infinity)
- penalized logistic regression : $\arg\min_{\v w}-\sum_{n=1}^N\log p(y_n\mid\v x_n,\v w)+\lambda\sum_{d=1}^D w_d^2$
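A minimal sketch of (optionally $L_2$-penalized) logistic regression trained by gradient descent, with labels $y_n\in\{0,1\}$ and the rows of `X` as the $\v x_n^\top$:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def logistic_gradient(X, y, w, lam=0.0):
    """Gradient X^T [sigma(Xw) - y] of the (penalized) logistic cost."""
    return X.T @ (sigmoid(X @ w) - y) + 2 * lam * w

def logistic_regression_gd(X, y, gamma=0.01, lam=0.0, iters=1000):
    """Plain gradient descent on the logistic cost."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= gamma * logistic_gradient(X, y, w, lam)
    return w
```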
- shorthand : $\v\eta=\xtw$
- exponential family : $p(y\mid\v\eta)=h(y)\exp[\v \eta^\top\v\varphi(y)-A(\v\eta)]$
  - 2 degrees of freedom : $h(y)$, $\v\varphi(y)$
  - normalization term : $A(\v\eta)=\log\left[\int_y h(y)\exp[\v\eta^\top\v\varphi(y)]dy\right]<\infty$, cumulant, log-partition function
  - sufficient statistics : $\v\varphi(y)$
  - logistic : $p(y\mid\v\eta)=\frac{e^{\v\eta y}}{1+e^{\v\eta}}=\exp[\v\eta y -\log(1+e^{\v \eta})]$
  - Bernoulli : $p(y\mid\mu)=\mu^y(1-\mu)^{1-y}=\exp[(\log\frac{\mu}{1-\mu})y+\log(1-\mu)]$
  - Gaussian : $p(y\mid\mu)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right]=\exp[(\mu/\sigma^2, -1/(2\sigma^2))(y, y^2)^\top-\frac{\mu^2}{2\sigma^2}-\frac{1}{2}\log(2\pi\sigma^2)]$
  - properties
    - convex cumulant : as a function of $\v\eta$
    - derivative and moments (multi-parameter) : $\nabla A(\v\eta)=E[\v\varphi(y)]$
    - Hessian : $\nabla^2 A(\v\eta)=var[\v\varphi(y)]$
- link function : $\v\eta=g(\v\mu)$ with $\mu=E[\v\varphi(y)]$ when scalar
- generalized linear models : $p(y\mid\v x,\v w)=h(y)\exp[\v x^\top\v w\,\varphi(y)-A(\v x^\top\v w)]$ from the exponential family, have an easy-to-solve maximum likelihood
  - assume samples obey : $p(y_n\mid\v x_n,\v w)=h(y_n)e^{\eta_n\varphi(y_n)-A(\eta_n)}$
  - cost function : $\L(\v w)=-\sum_{n=1}^N\log(h(y_n))+\v x_n^\top\v w\,\varphi(y_n)-A(\xtw)$
  - gradient : $\nabla_{\v w}\L(\v w)=-\sum_{n=1}^N\left[\v x_n\varphi(y_n)-\v x_n g^{-1}(\xtw)\right]=\v X^\top[g^{-1}(\v X\v w)-\varphi(\v y)]$ with $\frac{d A(\eta)}{d\eta}=g^{-1}(\eta)$
- Bayes classifier : MAP classifier
- $k$-nearest neighbors : $f_{S_t,k}(\v x)=\frac{1}{k}\sum_{n:\v x_n\in nbh_{S_t,k}(\v x)} y_n$ with $nbh_{S_t,k}(\v x)$ the set of the $k$ input points in $S_t$ closest to $\v x$
  - variant : weight individual neighbors by their distances
  - classification : $f_{S_t,k}(\v x)$ is the majority element of $\{y_n\mid \v x_n\in nbh_{S_t,k}(\v x)\}$ (for binary labels, $k$ should be odd; see the sketch below)
  - guarantee : $E_{S_t}[\L(f_{S_t})]\le \L(f_\star)+c E_{S_t,\v x\sim D}[\norm{\v x - nbh_{S_t, 1}(\v x)}]\le 2\L(f_\star)+4c\sqrt{d}N^{-1/(d+1)}$ with $\L(f_\star)$ the risk (classification error) of the optimal Bayes classifier
- curse of dimensionality : intuitions fail in high dimensions
  - generalizing correctly becomes exponentially harder as the dimensionality grows because fixed-size training sets cover a dwindling fraction of the input space
  - in high dimension, data points are far from each other : the choice of nearest neighbor becomes effectively random
- Lipschitz continuity : $\abs{\eta(\v x)-\eta(\v x')}\le c\norm{\v x -\v x'}$ with $c$ constant and $\eta(\v x)=P(y=1\mid\v x)$
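A minimal sketch of $k$-nearest-neighbor classification by majority vote (binary labels in $\{0,1\}$, Euclidean distance, `k` odd):

```python
import numpy as np

def knn_classify(X_train, y_train, x, k=5):
    """k-NN classification: majority vote among the k closest training points."""
    dist = np.linalg.norm(X_train - x, axis=1)   # Euclidean distances to x
    nearest = np.argsort(dist)[:k]               # indices of the k neighbors
    return int(np.round(np.mean(y_train[nearest])))  # majority of 0/1 labels
```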
- support vector machine : change the cost from logistic to hinge
  - cost : $\min_{\v w}\sum_{n=1}^N [1-y_n\v x_n^\top\v w]_++\frac{\lambda}{2}\norm{\v w}^2$
  - hinge loss : $[z]_+=\max\{0,z\}$
  - no obvious probabilistic interpretation
  - extension to multi-class non-trivial
- labels
  - logistic : $y_n\in\{0,1\}$
  - svm : $y_n\in\{\pm 1\}$
  - hinge : $Hinge(f)=[1-yf]_+$
  - mse : $MSE(f)=(1-yf)^2$
  - logistic loss : $logisticLoss(f)=\log(1+e^{-yf})$
- duality : for a difficult optimization, define $\L(\v w)=\max_{\v\alpha} G(\v w, \v \alpha)$ and choose between $\min_{\v w}\max_{\v\alpha}G(\v w,\v \alpha)=\max_{\v\alpha}\min_{\v w}G(\v w,\v\alpha)$
  - hinge : $[\nu_n]_+=\max\{0,\nu_n\}=\max_{\alpha_n\in[0,1]}\alpha_n\nu_n\;\forall n$
  - rewrite svm : $\min_{\v w}\max_{\v\alpha\in [0,1]^N}\sum_{n=1}^N \alpha_n(1-y_n\v x_n^\top\v w)+\frac{\lambda}{2}\norm{\v w}^2$
    - convex in $\v w$
    - concave in $\v\alpha$
    - gradient : $\nabla_{\v w}G(\v w,\v\alpha)=-\sum_{n=1}^N\alpha_ny_n\v x_n+\lambda\v w$
    - first-order optimality condition for $\v w$ : $\v w(\v\alpha)=\frac{1}{\lambda}\sum_{n=1}^N\alpha_n y_n\v x_n=\frac{1}{\lambda}\v X\v Y\v \alpha$ with $\v Y=diag(\v y)$ and $\v X$ holding the $N$ data examples as columns
    - dual optimization of svm : $\max_{\v\alpha\in[0,1]^N}\sum_{n=1}^N\alpha_n(1-\frac{1}{\lambda} y_n\v x_n^\top\v X\v Y\v \alpha)+\frac{\lambda}{2}\norm{\frac{1}{\lambda}\v X\v Y\v\alpha}^2=\max_{\v\alpha\in[0,1]^N}\v\alpha^\top\v 1-\frac{1}{2\lambda}\v\alpha^\top\v Y\v X^\top\v X\v Y\v\alpha$
    - differentiable : easy with coordinate descent on $\max_{\v\alpha\in[0,1]^N}\v\alpha^\top\v 1-\frac{1}{2\lambda}\v\alpha^\top\v Q\v\alpha$ with kernel $\v Q=diag(\v y)\v X^\top\v X\,diag(\v y)$ (see the sketch below)
    - solution $\v\alpha$ : generally sparse, non-zero only for examples important for the boundary
    - data vectors
      - non-support vectors : correct side, outside the margin, $\alpha_n=0$
      - essential support vectors : lie on the margin, $\alpha_n\in(0,1)$
      - bound support vectors : strictly inside the margin or on the wrong side, $\alpha_n=1$
  - minmax theorem : okay to switch when one is concave and the other convex (convex conjugate)
- coordinate descent : update one coordinate $\alpha_i$ of $\v\alpha$ at a time, keeping the others fixed
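A minimal sketch of coordinate ascent on the boxed dual above. Note the transposed convention: here the rows of `X` are the examples, so `Q` computed below plays the role of $diag(\v y)\v X^\top\v X\,diag(\v y)$ in the notes:

```python
import numpy as np

def svm_dual_coordinate_ascent(X, y, lam=1.0, epochs=50, seed=0):
    """Maximize alpha^T 1 - 1/(2 lam) alpha^T Q alpha over alpha in [0,1]^N,
    one coordinate at a time, with a closed-form update clipped to the box."""
    N = X.shape[0]
    Yx = y[:, None] * X                    # rows y_n x_n^T
    Q = Yx @ Yx.T                          # Q_ij = y_i y_j x_i^T x_j
    alpha = np.zeros(N)
    rng = np.random.default_rng(seed)
    for _ in range(epochs):
        for n in rng.permutation(N):
            rest = Q[n] @ alpha - Q[n, n] * alpha[n]   # other coordinates' contribution
            if Q[n, n] > 0:
                alpha[n] = np.clip((lam - rest) / Q[n, n], 0.0, 1.0)
    w = X.T @ (y * alpha) / lam            # recover the primal weights w(alpha)
    return alpha, w
```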
- representer theorem : for a $\v w^\star$ minimizing $\min_{\v w}\sum_{n=1}^N L_n(y_n,\v x_n^\top\v w)+\frac{\lambda}{2}\norm{\v w}^2$ there exists $\v\alpha^\star$ st. $\v w^\star =\v X\v\alpha^\star$
  - trick : $\v w^\star=(\v X\v X^\top+\lambda \v I_D)^{-1}\v X\v y=\v X(\v X^\top \v X+\lambda\v I_N)^{-1}\v y=\v X\v\alpha^\star$
  - complexity before : $O(D^3+ND^2)$
  - complexity after : $O(N^3+DN^2)$
  - $\v w^\star=\v X\v\alpha^\star$ where $\v X=\begin{bmatrix}x_{11}&\cdots& x_{1N}\\ \vdots &\ddots&\vdots\\ x_{D1}&\cdots& x_{DN}\end{bmatrix}$
- kernel ridge regression : $\v\alpha^\star=\arg\max_{\v\alpha}-\frac{1}{2}\v\alpha^\top(\v X^\top\v X+\lambda\v I_N)\v\alpha+\lambda\v\alpha^\top\v y$ (see the sketch below)
- kernel function : $\kappa(\v x,\v x')=\phi(\v x)^\top\phi(\v x')$, faster than explicitly computing $\phi$
  - linear : $\v K=\v X^\top\v X$, entries $\v x_i^\top \v x_j$
  - basis expansion : $\v K=\v\Phi^\top\v\Phi$, entries $\phi(\v x_i)^\top\phi(\v x_j)$
  - radial basis function : $\kappa(\v x,\v x')=\exp[-\frac{1}{2}(\v x-\v x')^\top(\v x-\v x')]$
  - properties : must be an inner product in some feature space, ensured by
    - $\v K$ symmetric : $\kappa(\v x,\v x')=\kappa(\v x',\v x)$
    - $\v K$ positive semi-definite : for any input set $\{\v x_n\}$
  - can lead to infinite-dimensional feature spaces
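A minimal sketch of kernel ridge regression with the RBF kernel; the closed form $\v\alpha^\star=(\v K+\lambda\v I_N)^{-1}\v y$ used here is the kernelized counterpart of the representer trick above, with $\v K$ in place of $\v X^\top\v X$ (rows of `X` are the examples):

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel matrix K_ij = exp(-gamma ||a_i - b_j||^2) between rows of A and B."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=0.5):
    """alpha = (K + lam I)^(-1) y."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=0.5):
    """Prediction f(x) = sum_n alpha_n kappa(x_n, x)."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```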
- unsupervised learning
  - density estimation
  - feature learning / representation learning
- cluster : group of points whose inter-point distances are small
- prototypes : $\v \mu_k$ for cluster assignments $z_n\in\{1,\ldots,K\}$
- k-means : $\min_{\v z,\v\mu}\L(\v z,\v\mu)=\sum_{n=1}^N\sum_{k=1}^Kz_{nk}\norm{\v x_n-\v\mu_k}_2^2=\norm{\v X-\v M\v Z^\top}_{Frob}^2$ with $\v\mu_k\in\R^D$, $z_{nk}\in\{0,1\}$ and $\sum_{k=1}^K z_{nk}=1$
  - issues : computationally heavy, clusters forced to be spherical, hard cluster assignments
  - complexity : $O(NKDi)$ with $i$ iterations
  - Frobenius norm : $\norm{A}_F=\sqrt{\sum_i\sum_j \abs{a_{ij}}^2}$
  - algorithm : coordinate descent (see the sketch after this list)
    - initialize $\v\mu_k$
    - compute $\v z_n$ given $\v\mu$ : $z_{nk}=1$ if $k=\arg\min_j\norm{\v x_n-\v\mu_j}_2^2$
    - compute $\v\mu_k$ given $\v z$ : $\v\mu_k=\frac{\sum_{n=1}^N z_{nk}\v x_n}{\sum_{n=1}^N z_{nk}}$
    - iterate
    - convergence : guaranteed, but only to a local optimum
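A minimal sketch of the k-means coordinate descent above (random data points as initial prototypes, no smart initialization or empty-cluster handling):

```python
import numpy as np

def kmeans(X, K, iters=100, seed=0):
    """Plain k-means: assign each point to the nearest prototype, then
    recompute each prototype as the mean of its assigned points."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)].astype(float)      # prototypes
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # N x K distances
        z = np.argmin(d, axis=1)                                    # assignments
        for k in range(K):
            if np.any(z == k):
                mu[k] = X[z == k].mean(axis=0)
    return mu, z
```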
- probabilistic model : $p(\v X\mid\v\mu, \v z)=\Pi_{n}\N(\v x_n\mid\v\mu_{k\st z_{nk}=1},\v I)=\Pi_n\Pi_k \N(\v x_n\mid\v\mu_k,\v I)^{z_{nk}}$, isotropic variance
- elliptical clusters : $p(\v X\mid\v\mu, \v z)=\Pi_n\Pi_k \N(\v x_n\mid\v\mu_k,\v \Sigma_k)^{z_{nk}}$
- soft clustering : $z_{n}$ multinomial with $p(z_n=k)=\pi_k$ and $\sum_k\pi_k=1$
- gaussian mixture model : $p(\v X,\v z\mid\v\mu,\v\Sigma,\v\pi)=\Pi_{n=1}^N p(\v x_n\mid z_n,\v\mu,\v\Sigma)p(z_n\mid\v\pi)=\Pi_{n=1}^N\Pi_{k=1}^K\N(\v x_n\mid\v\mu_k,\v\Sigma_k)^{z_{nk}}\Pi_{k=1}^K\pi_k^{z_{nk}}$
  - observed data vectors : $\v x_n$
  - latent unobserved variables : $z_n$
  - unknown parameters : $\v\theta=\{\v\mu_1,\ldots,\v\mu_K,\v\Sigma_1,\ldots,\v\Sigma_K,\v\pi\}$
  - advantage : $z_n$ can be marginalized out of the cost function as if it did not exist
  - marginalization : $p(\v x_n\mid\v\theta)=\sum_{k=1}^K\pi_k\N(\v x_n\mid\v\mu_k,\v\Sigma_k)$
  - complexity : $O(D^2K)$ parameters, whereas before it depended on $O(N)$
  - maximum likelihood : $\max_{\v\theta}\sum_{n=1}^N\log\sum_{k=1}^K\pi_k \N(\v x_n\mid\v\mu_k,\v\Sigma_k)$, not convex, degenerate, not identifiable (multiple equivalent combinations)
- expectation-maximization : elegant method to optimize problems with a log outside the sum (see the sketch after this list)
  - algorithm
    - start with $\v\theta^{(1)}$
    - expectation step : compute a lower bound to the cost that is tight at the previous $\v\theta^{(t)}$ : $\L(\v\theta)\ge\underline\L(\v\theta,\v\theta^{(t)})$ and $\L(\v\theta^{(t)})=\underline\L(\v\theta^{(t)},\v\theta^{(t)})$
      - per data sample : $\log\sum_k\pi_k\N(\v x_n\mid\v\mu_k,\v\Sigma_k)\ge\sum_k q_{kn}^{(t)}\log\frac{\pi_k\N(\v x_n\mid\v\mu_k,\v\Sigma_k)}{q_{kn}^{(t)}}$
      - equality : $q_{kn}^{(t)}=\frac{\pi_k^{(t)}\N(\v x_n\mid\v\mu_k^{(t)},\v\Sigma_k^{(t)})}{\sum_k \pi_k^{(t)}\N(\v x_n\mid\v\mu_k^{(t)},\v\Sigma_k^{(t)})}$
      - compute : $\L(\v\theta^{(t)})=\sum_n\log\sum_k\pi_k^{(t)}\N(\v x_n\mid\v\mu_k^{(t)},\v\Sigma_k^{(t)})$
    - maximization step : update $\v\theta$, $\v\theta^{(t+1)}=\arg\max_{\v\theta}\underline\L(\v\theta,\v\theta^{(t)})$
      - maximize the lower bound : $\max_{\v\theta}\sum_n\sum_k q_{kn}^{(t)}[\log\pi_k+\log\N(\v x_n\mid\v\mu_k,\v\Sigma_k)]$ for fixed $q_{kn}^{(t)}$
      - $\mu$ update : $\v\mu_k^{(t+1)}=\frac{\sum_n q_{kn}^{(t)}\v x_n}{\sum_n q_{kn}^{(t)}}$
      - $\Sigma$ update : $\v\Sigma_k^{(t+1)}=\frac{\sum_n q_{kn}^{(t)}(\v x_n-\v\mu_k^{(t+1)})(\v x_n-\v\mu_k^{(t+1)})^\top}{\sum_n q_{kn}^{(t)}}$
      - $\pi$ update : $\pi_k^{(t+1)}=\frac{1}{N}\sum_n q_{kn}^{(t)}$, obtained by adding a Lagrangian term for the constraint $\sum_k\pi_k=1$
    - iterate
  - same as k-means : when $\v\Sigma_k=\sigma^2\v I$
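A minimal sketch of EM for a Gaussian mixture, alternating the responsibilities $q_{kn}$ (E-step) with the $\v\mu,\v\Sigma,\v\pi$ updates (M-step); it has no safeguards against the degenerate solutions mentioned above:

```python
import numpy as np

def log_gaussian(X, mu, Sigma):
    """Log-density of N(x | mu, Sigma) for each row of X."""
    D = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(Sigma)
    maha = np.sum(diff @ np.linalg.inv(Sigma) * diff, axis=1)
    return -0.5 * (D * np.log(2 * np.pi) + logdet + maha)

def em_gmm(X, K, iters=100, seed=0):
    """EM for a Gaussian mixture model (rows of X are samples)."""
    N, D = X.shape
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(N, K, replace=False)].astype(float)
    Sigma = np.array([np.eye(D) for _ in range(K)])
    pi = np.full(K, 1.0 / K)
    for _ in range(iters):
        # E-step: responsibilities q_{kn}, computed in log-space for stability
        logp = np.stack([np.log(pi[k]) + log_gaussian(X, mu[k], Sigma[k])
                         for k in range(K)])
        logp -= logp.max(axis=0)
        q = np.exp(logp)
        q /= q.sum(axis=0)
        # M-step: update pi, mu, Sigma with the responsibilities fixed
        Nk = q.sum(axis=1)
        pi = Nk / N
        for k in range(K):
            mu[k] = (q[k][:, None] * X).sum(axis=0) / Nk[k]
            diff = X - mu[k]
            Sigma[k] = (q[k][:, None] * diff).T @ diff / Nk[k]
    return pi, mu, Sigma
```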
- posterior distribution : $p(\v x_n,z_n\mid\v\theta)=p(\v x_n\mid z_n,\v\theta)p(z_n\mid\v\theta)=p(z_n\mid\v x_n,\v\theta)p(\v x_n\mid\v\theta)$
- EM in general : $\v\theta^{(t+1)}=\arg\max_{\v\theta}\sum_n E_{p(z_n\mid\v x_n,\v\theta^{(t)})}[\log p(\v x_n, z_n\mid\v\theta)]$ with joint distribution $p(\v x_n,z_n\mid\v\theta)$
- log concavity : $\log \sum_k q_kt_k\ge\sum_k q_k\log t_k$ when $q_k,t_k$ non-negative and $\sum_k q_k=1$
- matrix : $\v X\in\R^{D\times N}$ with $d$ indexing movies and $n$ indexing users
- factorization : $\v X\approx\v W\v Z^\top$ with $\v W\in\R^{D\times K}$ and $\v Z\in\R^{N\times K}$
- learning : $\min_{\v W,\v Z}\L(\v W,\v Z)=\frac{1}{2}\sum_{(d,n)\in\Omega}[x_{dn}-(\v W\v Z^\top)_{dn}]^2$ with $\Omega$ the set of observed entries, not convex, not identifiable
- latent features : $K$
- regularization : $\frac{1}{2}\sum_{(d,n)\in\Omega}[x_{dn}-(\v W\v Z^\top)_{dn}]^2+\frac{\lambda_w}{2}\norm{\v W}_F^2+\frac{\lambda_z}{2}\norm{\v Z}_F^2$
- stochastic gradient descent
  - objective : $\frac{1}{\abs{\Omega}}\sum_{(d,n)\in\Omega}[x_{dn}-(\v W\v Z^\top)_{dn}]^2$
  - gradient wrt $w_{d',k}$ : $-[x_{dn}-(\v W\v Z^\top)_{dn}]z_{n,k}$ if $d'=d$, else 0
  - gradient wrt $z_{n',k}$ : $-[x_{dn}-(\v W\v Z^\top)_{dn}]w_{d,k}$ if $n'=n$, else 0
  - complexity : $O((D+N)K)$
- alternating least-squares : coordinate descent on $\frac{1}{2}\sum_d\sum_n[x_{dn}-(\v W\v Z^\top)_{dn}]^2=\frac{1}{2}\norm{\v X-\v W\v Z^\top}_F^2$ if there are no missing entries (see the sketch below)
  - $\v Z^\top=(\v W^\top\v W+\lambda_z\v I_K)^{-1}\v W^\top\v X$
  - $\v W^\top=(\v Z^\top\v Z+\lambda_w\v I_K)^{-1}\v Z^\top\v X^\top$
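A minimal sketch of alternating least-squares for the fully observed case, directly iterating the two closed-form updates above:

```python
import numpy as np

def als(X, K=10, lam_w=0.1, lam_z=0.1, iters=20, seed=0):
    """Alternating least-squares for X ≈ W Z^T (no missing entries)."""
    D, N = X.shape
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((D, K))
    Z = rng.standard_normal((N, K))
    for _ in range(iters):
        # Z^T = (W^T W + lam_z I)^(-1) W^T X
        Z = np.linalg.solve(W.T @ W + lam_z * np.eye(K), W.T @ X).T
        # W^T = (Z^T Z + lam_w I)^(-1) Z^T X^T
        W = np.linalg.solve(Z.T @ Z + lam_w * np.eye(K), Z.T @ X.T).T
    return W, Z
```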
- embedding : numerical representation for words, $w_i\mapsto \v w_i\in\R^K$
- co-occurrence matrix : $n_{ij}=$ number of contexts where word $w_i$ occurs together with word $w_j$, sparse, $D\times N$
  - context : document, paragraph, sentence, window
  - words : $w_d$
  - context words : $w_n$
  - vocabulary : $\mathcal V=\{w_1,\ldots,w_D\}$
- matrix factorization : $\min_{\v W,\v Z}\L(\v W,\v Z)=\frac{1}{2}\sum_{(d,n)\in\Omega}f_{dn}[x_{dn}-(\v W\v Z^\top)_{dn}]^2$ with weights $f_{dn}$
  - use log of counts : $x_{dn}=\log(n_{dn})$
  - GloVe model : $f_{dn}=\min\{1, (n_{dn}/n_{max})^\alpha\}$ for $\alpha\in[0,1]$
  - train with SGD or ALS
- skip-gram model : original word2vec, uses binary classification (logistic) to separate good word pairs $(w_d,w_n)$ (appearing together in a context) from bad word pairs (random pairs)
- fasttext : supervised, $\v W\in\R^{1\times K}$ and $\v Z\in\R^{\abs{\mathcal V}\times K}$
  - bag-of-words representation : $\v x_n\in\R^{\abs{\mathcal V}}$ from sentence $s_n=(w_1,\ldots,w_m)$
  - learning : $\min_{\v W,\v Z}\L(\v W,\v Z)=\sum_{s_n}f(y_n\v W\v Z^\top\v x_n)$ with $f$ a linear classifier loss and label $y_n\in\{-1,1\}$
- dimensionality reduction
- principal component analysis : linear mapping from $D$ to $K$ dimensions of maximum variance (decorrelates), bad if the data is not normalized
- unitary matrix : $\v U\v U^\top=\v U^\top\v U=\v I$ with orthonormal rows and columns
  - linear transformation : $\norm{\v U\v x}_2^2=\v x^\top\v U^\top\v U\v x=\v x^\top\v x=\norm{\v x}_2^2$
- singular value decomposition : $\v X=\v U\v S\v V^\top$
  - $\v X$ is $D\times N$ with $D<N$
  - $\v U$ is $D\times D$ unitary : left singular vectors
  - $\v S$ is $D\times N$ diagonal : singular values, by convention $s_1\ge\cdots\ge s_D\ge 0$
  - $\v V^\top$ is $N\times N$ unitary : right singular vectors
- learning : minimize $\norm{\v X-\v R\v C\v X}_F^2$
  - compression : $\v C$ is $K\times D$
  - reconstruction : $\v R$ is $D\times K$
- SVD lemma : $\norm{\v X-\v{\hat{X}}}_F^2\ge\norm{\v X-\v U_K\v U_K^\top\v X}_F^2=\sum_{i\ge K+1}s_i^2$ with $s_i$ the singular values of $\v X$, $\v U_K$ the first $K$ columns of $\v U$, and $\v{\hat{X}}$ any $D\times N$ matrix of rank $K$
  - identity : $\v U_K\v U_K^\top\v U\v S=\v U\v S^{(K)}$
  - link to matrix factorization : $\v X=\v U\v S\v V^\top=\v U\v S_D^{1/2}\v S^{1/2}\v V^\top=\v W\v Z^\top$
  - $K$-rank approximation : $\v X_K=\v U\v S^{(K)}\v V^\top=\v U_K\v{\hat{S}}^{(K)}\v V^\top=\v U_K(\v{\hat{S}}^{(K)})^{1/2}(\v{\hat{S}}^{(K)})^{1/2}\v V^\top=\v W\v Z^\top$
- decorrelation : $N\v K=\v X\v X^\top=\v U\v S\v V^\top\v V\v S^\top\v U^\top=\v U\v S\v S^\top\v U^\top=\v U\v S_D^2\v U^\top$, so $\v{\hat{X}}=\v U^\top\v X$ is uncorrelated
  - mean : $\bar{\v{x}}=\frac{1}{N}\sum_n\v x_n$
  - covariance : $\v K=\frac{1}{N}\sum_n(\v x_n-\bar{\v{x}})(\v x_n-\bar{\v{x}})^\top$
- computation : solve the eigenvector/eigenvalue problem for $\v X\v X^\top$ (see the sketch below)
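A minimal sketch of PCA computed through the SVD of the centred data matrix (columns are samples, matching the $D\times N$ convention above):

```python
import numpy as np

def pca(X, K):
    """PCA via SVD: keep the first K left singular vectors of the centred
    D x N data matrix and project onto them."""
    x_bar = X.mean(axis=1, keepdims=True)
    Xc = X - x_bar
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    U_K = U[:, :K]
    Z = U_K.T @ Xc            # K x N compressed representation
    X_hat = U_K @ Z + x_bar   # D x N reconstruction
    return U_K, Z, X_hat
```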
- input layer : size $D$
- hidden layers : $L$ layers of size $K$
- output layer
- edge : $w_{i,j}^{(l)}$ from node $i$ in layer $l-1$ to node $j$ in layer $l$
- node : $x_j^{(l)}=\phi(\sum_i w_{i,j}^{(l)}x_i^{(l-1)}+b_j^{(l)})$, output of node $j$ at layer $l$
- bias term : $b_j^{(l)}$, a parameter
- activation function : $\phi(\cdot)$, non-linear
  - sigmoid : $\phi(x)=\frac{1}{1+e^{-x}}$
  - sigmoid-like : left limit $0$, right limit $1$
- lemma : a sufficiently smooth function can be approximated by a neural net with one hidden layer, and the approximation error goes down like one over the number of nodes in the hidden layer
  - consider : $f(x)=\phi(w(x-b))$
  - width of the transition : $1/w$
  - approximate rectangle : $\phi(w(x-a))-\phi(w(x-b))$
  - higher dimension : requires an additional hidden layer to filter out the arms along the other dimensions
- rectified linear function : $(x)_+=\max\{0,x\}$
- classical Stone-Weierstrass theorem : for every $\epsilon >0$ there exists a polynomial st. $\abs{f(x)-p(x)}<\epsilon$ for all $x\in[0,1]$
  - can be approximated in $L_\infty$
  - continuous piecewise-linear function : $q(x)=\sum_{i=1}^m(a_ix+b_i)1_{\{r_{i-1}\le x < r_i\}}$
  - continuity requires : $q(x)=\tilde a_1x+\tilde b_1+\sum_{i=2}^m\tilde a_i(x-\tilde b_i)_+$ with $a_i=\sum_{j=1}^i\tilde a_j$ and $\tilde b_i=r_{i-1}$
- standard cost function : $\L=\frac{1}{N}\sum_{n=1}^N(y_n-f(\v x_n))^2$
- SGD : not convex but stable (no outcome difference if a single sample is replaced), thus less prone to overfit
- backpropagation (a minimal sketch follows this list)
  - weight matrix : $\v W^{(l)}$ connects layer $l-1$ to layer $l$, $\v W_{i,j}^{(l)}=w_{i,j}^{(l)}$
  - bias vectors : $\v b^{(l)}$
  - function on each layer : $\v x^{(l)}=f^{(l)}(\v x^{(l-1)})=\phi((\v W^{(l)})^\top\v x^{(l-1)}+\v b^{(l)})$
  - overall function : $f(\v x^{(0)})=f^{(L+1)}\circ\cdots\circ f^{(1)}(\v x^{(0)})$
  - cost function : $\L=\frac{1}{N}\sum_{n=1}^N (y_n-f^{(L+1)}\circ\cdots\circ f^{(1)}(\v x_n^{(0)}))^2$, the algorithm is independent of the form of the loss (only the initialization of the backward pass changes)
  - algorithm
    - start : $\L_n=(y_n-f^{(L+1)}\circ\cdots\circ f^{(1)}(\v x_n^{(0)}))^2$
    - forward pass : $\v z^{(l)}=(\v W^{(l)})^\top\v x^{(l-1)}+\v b^{(l)}$ with $\v x^{(0)}=\v x_n$ and $\v x^{(l-1)}=\phi(\v z^{(l-1)})$
    - corresponding vector at level $l$ : $\delta_j^{(l)}=\frac{\partial\L_n}{\partial z_j^{(l)}}$
    - backward pass : $\delta_j^{(l)}=\frac{\partial\L_n}{\partial z_j^{(l)}}=\sum_k \frac{\partial\L_n}{\partial z_k^{(l+1)}}\frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}=\sum_k\delta_k^{(l+1)}\v W_{j,k}^{(l+1)}\phi'(z_j^{(l)})$
      - MSE : start with $\delta^{(L+1)}=-2(y_n-\v x^{(L+1)})\phi'(z^{(L+1)})$
    - vector form : $\delta^{(l)}=(\v W^{(l+1)}\delta^{(l+1)})\odot\phi'(\v z^{(l)})$
    - Hadamard product : $\odot$, pointwise multiplication of vectors
    - weights : $\frac{\partial \L_n}{\partial w_{i,j}^{(l)}}=\sum_k\frac{\partial \L_n}{\partial z_k^{(l)}}\frac{\partial z_k^{(l)}}{\partial w_{i,j}^{(l)}}=\frac{\partial \L_n}{\partial z_j^{(l)}}\frac{\partial z_j^{(l)}}{\partial w_{i,j}^{(l)}}=\delta_j^{(l)}x_i^{(l-1)}$
    - bias : $\frac{\partial \L_n}{\partial b_j^{(l)}}=\sum_k\frac{\partial \L_n}{\partial z_k^{(l)}}\frac{\partial z_k^{(l)}}{\partial b_j^{(l)}}=\frac{\partial\L_n}{\partial z_j^{(l)}}\frac{\partial z_j^{(l)}}{\partial b_j^{(l)}}=\delta_j^{(l)}$
    - sum dropped : when changing the weight $w_{i,j}^{(l)}$, only $z_j^{(l)}$ changes
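A minimal sketch of one backpropagation/SGD step for a single hidden layer, sigmoid activations and the squared loss, following the notes' conventions ($\v z^{(l)}=(\v W^{(l)})^\top\v x^{(l-1)}+\v b^{(l)}$, $\delta^{(l)}=\partial\L_n/\partial\v z^{(l)}$):

```python
import numpy as np

def phi(z):
    return 1.0 / (1.0 + np.exp(-z))

def phi_prime(z):
    s = phi(z)
    return s * (1.0 - s)

def backprop_step(x, y, W1, b1, W2, b2, gamma=0.1):
    """One SGD step for a 1-hidden-layer network with sigmoid activations
    and the squared loss (y - x^(2))^2."""
    # forward pass
    z1 = W1.T @ x + b1
    x1 = phi(z1)
    z2 = W2.T @ x1 + b2
    x2 = phi(z2)
    # backward pass: delta^(l) = dL/dz^(l)
    delta2 = -2.0 * (y - x2) * phi_prime(z2)
    delta1 = (W2 @ delta2) * phi_prime(z1)
    # gradients: dL/dW_ij = delta_j x_i, dL/db_j = delta_j
    W2 -= gamma * np.outer(x1, delta2)
    b2 -= gamma * delta2
    W1 -= gamma * np.outer(x, delta1)
    b1 -= gamma * delta1
    return W1, b1, W2, b2
```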
- sigmoid : positive, bounded, vanishing gradient problem
- tanh : $tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}=2\phi(2x)-1$, balanced, bounded, vanishing gradient
- relu : rectified linear unit, positive, unbounded, gradient does not vanish
- leaky relu : $f(x)=\max\{\alpha x, x\}$, avoids a null derivative
- maxout : $f(x)=\max\{\v x^\top\v w_1+b_1,\ldots,\v x^\top\v w_k+b_k\}$, generalizes relus
- local processing of data
- convolution : $x^{(1)}[n]=\sum_k f[k]x^{(0)}[n-k]$ (see the sketch below)
- weight sharing : same weights at every position, training by summing up the gradients of the edges that share the same weights
- handling of borders
  - zero padding
  - valid padding : convolve only inside the original data
- multiple channels : one more dimension, typical scenario, decrease size and increase depth like a pyramid
- fast fourier transform : $O(cL\log_2(L))$
  - dft : $\hat x[k]=\sum_{n=0}^{L-1} x[n]\exp{\left[-\frac{2\pi j}{L}kn\right]}$
  - idft : $x[n]=\frac{1}{L}\sum_{k=0}^{L-1}\hat x[k]\exp{\left[\frac{2\pi j}{L}kn\right]}$
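A small numerical check that zero-padded FFT multiplication reproduces the direct 1-D convolution above (the signals are arbitrary random vectors):

```python
import numpy as np

rng = np.random.default_rng(0)
x0 = rng.standard_normal(64)    # input signal x^(0)
f = rng.standard_normal(5)      # filter

direct = np.convolve(x0, f)     # direct convolution, length 64 + 5 - 1

L = len(direct)
via_fft = np.fft.ifft(np.fft.fft(x0, L) * np.fft.fft(f, L)).real

assert np.allclose(direct, via_fft)
```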
- regularization : get smaller weights, avoid overfitting
  - weights : $\frac{1}{2}\sum_{l=1}^{L+1}\mu^{(l)}\norm{\v W^{(l)}}_F^2$ with $\mu^{(l)}$ a constant that depends on the layer
  - bias : commonly not penalized
  - update rule : $\Theta[t+1]=\Theta[t]-\eta(\nabla_\Theta \L+\mu\Theta[t])=(1-\eta\mu)\Theta[t]-\eta\nabla_\Theta\L$ with constant $\mu$ and $\Theta=w_{i,j}^{(l)}$
  - leads to weight decay
- projected gradient descent : weights should not have an $L_2$ norm bigger than a constant, if they do, normalize them back to the constant
- dataset augmentation : create variants (rotation, resize) and learn to be invariant to these transformations, or train a network on some tasks and specialize only the last layers to their respective tasks
- dropout : $p_i^{(l)}$ probability of keeping node $i$ in layer $l$, the others are ignored, drawn at each iteration ($K$ subnetworks), predict by averaging the $K$ subnetworks or by scaling the output by $p$
- bagging : averaging over many models
- graphical models : establish relationships between random variables; Bayes nets, Markov random fields, factor graphs
- Bayes nets
  - random variables : $X_1,\ldots,X_D$
  - joint distribution : $p(X_1,\ldots,X_D)$
  - chain rule : $p(X_1,\ldots,X_D)=p(X_1)p(X_2\mid X_1)\cdots p(X_D\mid X_1,\ldots, X_{D-1})$, any of the $D!$ orders
  - directed edge : from $X_j$ to $X_i$ if $X_j$ (parent) appears in the conditioning of $X_i$ (child)
  - topology : a given order always gives the same graph, always acyclic
- tail-to-tail : both arrows point away (conditioning leads to independence)
- head-to-tail/tail-to-head : one arrow points away, the other in (conditioning leads to independence)
- head-to-head : both arrows point in (conditioning creates dependence)
- descendant : child, child of child, ...
- blocked path : a path from node $X$ to node $Y$ is blocked by $Z$ iff it contains a variable st. the variable is in $Z$ and it is head-to-tail or tail-to-tail, or st. the node is head-to-head and neither this node nor its descendants are in $Z$
- $D$-separation : $X$ and $Y$ are $D$-separated by $Z$ iff every path from any element of $X$ to any element of $Y$ is blocked by $Z$
- independence : $X$ is conditionally independent of the random variables $Y$ conditioned on the variables $Z$ iff $X$ and $Y$ are $D$-separated by $Z$ (symmetric)
- markov blanket : minimal set such that every random variable outside this set is conditionally independent, contains parents, children and co-parents (parents of the children)
- generating samples : storage exponential in the largest number of parents
- variables : $x_1,\ldots,x_D$, can be random variables
- multidimensional function : factorization $f(x_1,\ldots,x_D)=\Pi_{a\in A}f_a(x_{\partial a})$
  - variables : indexed by integers
  - factors : count $\abs{A}$, indexed by lower-case Roman letters
  - subset : $x_{\partial a}$, the variables involved in factor $a$ ($\partial$ means neighbors of)
- factor graph : represents the factorization as a bipartite graph ($D$ variable nodes and $\abs{A}$ factor nodes, with an edge from variable $i$ to factor $a$ if it participates in that factor)
- tree factor graph : if the Bayes net is a tree, its factor graph is a tree; the converse does not hold
- complexity of computing marginals : exponential in $\max_{a\in A}\abs{x_{\partial a}}$ (better to have many small factors)
- sum product algorithm : e.g. compute $f(x_1)=\sum_{\sim x_1}f(x_1,\ldots,x_6)$ out of $f(x_1,\ldots,x_6)=f_1(x_1,x_2,x_3)f_2(x_1,x_4,x_6)f_3(x_4)f_4(x_4,x_5)$
  - brute force : $O(\abs{\mathcal X}^6)$
  - refinement : $f(x_1)=[\sum_{x_2,x_3}f_1(x_1,x_2,x_3)][\sum_{x_4}f_3(x_4)(\sum_{x_6} f_2(x_1,x_4,x_6))(\sum_{x_5} f_4(x_4,x_5))]$ (a numerical check is sketched below)
  - complexity : $O(\abs{\mathcal X}^3)$, proportional to the highest degree
  - marginalization : $\sum_{\sim z}g(z,\ldots)=\sum_{\sim z}\Pi_{k=1}^K[g_k(z,\ldots)]=\Pi_{k=1}^K[\sum_{\sim z}g_k(z,\ldots)]$
  - rules
    - initialization at the leaves
      - factor : $\mu(x)=f(x)$
      - variable : $\mu(x)=1$
    - node processing
      - factor : $\mu(x)=\sum_{\sim x}f(x, x_1,\ldots x_J)\Pi_{j=1}^J\mu_j(x_j)$
      - variable : $\mu(x)=\Pi_{k=1}^K\mu_k(x)$
    - marginalization : $\Pi_{k=1}^{K+1}\mu_k(x)$
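A small numerical check of the refinement above on random factor tables (alphabet size and tables are arbitrary): the factored computation matches the brute-force marginal.

```python
import numpy as np

rng = np.random.default_rng(0)
X = 4  # alphabet size |X|

# random non-negative tables for f = f1(x1,x2,x3) f2(x1,x4,x6) f3(x4) f4(x4,x5)
f1 = rng.random((X, X, X))
f2 = rng.random((X, X, X))
f3 = rng.random(X)
f4 = rng.random((X, X))

# brute force: sum the full product over x2..x6
brute = np.einsum('abc,adf,d,de->a', f1, f2, f3, f4)

# refinement: f(x1) = [sum_{x2,x3} f1] * sum_{x4} f3 (sum_{x6} f2) (sum_{x5} f4)
m1 = f1.sum(axis=(1, 2))                                    # message from f1 to x1
m2 = (f2.sum(axis=2) * (f3 * f4.sum(axis=1))).sum(axis=1)   # message through x4
refined = m1 * m2

assert np.allclose(brute, refined)
```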
- loopy belief propagation : group factors together, approximation