$$ \def\v#1{\boldsymbol{#1}} \def\abs#1{\left |#1\right |} \def\norm#1{\left |\left|#1\right |\right |} \def\cases#1{\begin{cases}#1\end{cases}} \def\L{\mathcal L} \def\N{\mathcal N} \def\R{\mathbb R} \def\st{\text{ st. }} \def\xtw{\v x^\top\v w} $$

CS433

GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.

Fall 2016: Pattern Classification and Machine Learning

[TOC]

Regression

  • input variable : features, covariates, independent variables, explanatory variables, exogenous variables, predictors, regressors
  • output variable : target, label, response, outcome, dependent variables, endogenous variables, measured variables, regressands
  • regression : relate input to output for prediction or interpretation
  • data : $(y_n,\v x_n)$
    • dimensionality : $D$
    • data-size : $N$
  • regression function : $y\approx f(\v x_n);\forall n$
  • correlation $\not=$ causation

Linear regression

  • linear regression : model that assumes a linear relationship
  • parameters : $\v w=(w_0,w_1)$
  • simple linear regression : $y_n\approx f(\v x_n)=w_0+w_1x_{n1}$
  • multi linear regression : $y_n\approx f(\v x_n)=w_0+w_1x_{n1}+\cdots+w_Dx_{nD}=\v x_n^\top \v w$ with $\v x_n=(1,x_{n1},\cdots,x_{nD})^\top$
  • learning : estimation, fitting, find $\v w$
  • $D>N$ problem : more parameters than data, under-determined for linear regression, use regularization

Cost functions

  • cost function : energy, loss, training objective, quantifies the model
    • $y$ real-valued : better symmetric around $0$
    • same penalization : large mistakes and very-large mistakes
  • statistical vs computational trade-off
  • outliers : far away data samples, handling them is a statistical property
  • mean square error : $MSE(\v w)=\frac{1}{N}\sum_{n=1}^N [y_n-f(\v x_n)]^2$, outliers sensitive
  • mean absolute error : $MAE(\v w)=\frac{1}{N}\sum\abs{y_n-f(\v x_n)}$, outliers robust
  • convexity : $f(\lambda \v u +(1-\lambda)\v v)\le\lambda f(\v u)+(1-\lambda)f(\v v)$ for any $\v u,\v v$ and $0\le\lambda\le1$
    • concave : negative convexity
    • unique global minimum value
  • error vector : $\v e=\v y -\v X\v w$
  • Huber loss : $Huber=\cases{\frac{1}{2}\v e^2 & \abs{\v e}\le\delta\\ \delta\abs{\v e}-\frac{1}{2}\delta^2 & \abs{\v e}>\delta}$, outliers robust, convex, differentiable
  • Tukey's bisquare loss : $\frac{\partial \L}{\partial \v e}=\cases{\v e[1-\v e^2/\delta^2]^2 & \abs{\v e}\le\delta\\ 0 & \abs{\v e}>\delta}$, outliers robust, non-convex
  • Hinge loss : $Hinge=\max(0,\v e)$, relu
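
A minimal NumPy sketch of the losses above, applied to an error vector $\v e=\v y-\v X\v w$; the function names and the toy error vector are illustrative only, not from the course material.

```python
import numpy as np

def mse(e):
    """Mean square error: sensitive to outliers."""
    return np.mean(e ** 2)

def mae(e):
    """Mean absolute error: robust to outliers."""
    return np.mean(np.abs(e))

def huber(e, delta=1.0):
    """Huber loss: quadratic near 0, linear in the tails; convex and differentiable."""
    quad = 0.5 * e ** 2
    lin = delta * np.abs(e) - 0.5 * delta ** 2
    return np.mean(np.where(np.abs(e) <= delta, quad, lin))

e = np.array([0.1, -0.3, 5.0])   # the last entry acts as an outlier
print(mse(e), mae(e), huber(e))  # MSE is dominated by the outlier, MAE and Huber stay moderate
```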

Optimization

  • optimization : $\min_{\v w}\L(\v w)$
  • grid search : brute-force, exponential, $10^D$
  • local minimum : $\L(\v w^*)\le\L(\v w);\forall\v w\st\norm{\v w-\v w^*}<\epsilon$
  • global minimum : $\L(\v w^*)\le\L(\v w);\forall\v w$
  • strict minimum : inequality is strict
  • lower bound : required for any cost function
  • smooth optimization : differentiable
  • gradient : $\nabla\L(\v w)=\left(\frac{\partial\L(\v w)}{\partial w_1},\cdots,\frac{\partial\L(\v w)}{\partial w_D}\right)^\top$
  • gradient descent (batch) : $\v w^{(t+1)}=\v w^{(t)}-\gamma \v g$ with $\v g=\nabla\L(\v w^{(t)})$
    • step-size : learning rate, $\gamma>0$
  • MSE : $\L(\v w)=\frac{1}{2N}\sum_{n=1}^N(y_n-\v x_n^\top\v w)^2=\frac{1}{2N}\v e^\top\v e$
    • gradient : $\nabla\L(\v w)=-\frac{1}{N}\v X^\top\v e$
    • complexity : $O(ND+D)$
  • stochastic gradient descent : $\v g = \nabla\L_n(\v w^{(t)})$ with $E[\nabla\L_n(\v w)]=\nabla\L(\v w)$
    • complexity : $O(D)$
  • mini-batch stochastic gradient descent : $\v g = \frac{1}{\abs{B}}\sum_{n\in B}\nabla\L_n(\v w^{(t)})$ with $B\subseteq N$ randomly chosen
    • complexity : $O(D\abs{B})$
  • alternative convexity characterization : $\L(\v u)\ge\L(\v w)+\nabla\L(\v w)^\top(\v u-\v w);\forall\v u,\v w$, always above its linearization, for differentiable functions only
  • subgradients : $\v g$ st. $\L(\v u)\ge\L(\v w)+\v g^\top(\v u-\v w);\forall\v u$, if $\L$ differentiable, $\v g=\nabla\L(\v w)$
  • MAE : $\v g=\cases{-1 & e<0\\ [-1,1]& e=0\\ 1 & e>0}$
  • constrained optimization : $\min_{\v w}\L(\v w)$ st. $\v w\in C$
  • convex sets : $\theta\v u+(1-\theta)\v v\in C$ for any $\v u,\v v\in C$ and $0\le\theta\le1$
    • intersection : convex
  • projected gradient descent : unique, efficient, $P_C(\v w')=\arg\min_{\v v\in C}\norm{\v v-\v w'}$
    • update rule : $\v w^{(t+1)}=P_C[\v w^{(t)}-\gamma\nabla\L(\v w^{(t)})]$
  • penalty functions : $\min_{\v w}\L(\v w)+I_C(\v w)$
    • brick wall : $I_C(\v w)=\cases{0 & \v w\in C\\ \infty &\v w\not\in C}$, not continuous
    • penalize error : $\min_{\v w}\L(\v w)+\lambda\norm{A\v w-\v b}^2$ with trade-off parameter $\lambda$
  • stopping criteria : $\nabla\L(\v w)$ close to $0$, close to optimum
  • optimality
    • necessary : $\nabla\L(\v w^*)=0$
    • sufficient : positive second-order derivative
  • step-size selection : convergence guaranteed only when $\gamma$ is smaller than a problem-dependent threshold, application specific
  • line-search : step-size defined automatically
  • Robbins-Monro condition : $\v w^{(t+1)}=\v w^{(t)}-\gamma^{(t)}\nabla\L_n(\v w^{(t)})$ with $\sum_{t=1}^\infty\gamma^{(t)}=\infty$ and $\sum_{t=1}^\infty (\gamma^{(t)})^2<\infty$
    • $\gamma^{(t)}=1/(1+t)^r$ with $r\in(0.5,1)$
  • feature normalization : gradient sensitive to ill-conditioning, directions might converge at different speed, need normalization
  • Hessian matrix : $\v H(\v w^*)=\frac{\partial^2\L(\v w^*)}{\partial \v w\partial \v w^\top}$ positive semi-definite
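
A minimal NumPy sketch of batch gradient descent and SGD for the MSE cost above, using the gradient $-\frac{1}{N}\v X^\top\v e$; the fixed step sizes and iteration counts are arbitrary illustrative choices, not prescribed by the course.

```python
import numpy as np

def grad_mse(w, X, y):
    """Full gradient of L(w) = 1/(2N) ||y - Xw||^2, i.e. -X^T e / N."""
    e = y - X @ w
    return -X.T @ e / len(y)

def gradient_descent(X, y, gamma=0.1, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= gamma * grad_mse(w, X, y)        # batch update, O(ND) per step
    return w

def sgd(X, y, gamma=0.01, iters=5000, rng=np.random.default_rng(0)):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        n = rng.integers(len(y))              # one sample: unbiased estimate of the full gradient
        e_n = y[n] - X[n] @ w
        w -= gamma * (-X[n] * e_n)            # O(D) per step
    return w
```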

Least squares

  • normal equations : $\nabla\L(\v w^*)=0$
  • MSE : $\nabla\L(\v w)=-\frac{1}{N}\v X^\top\v e=0$ give $\v w^*=(\v X^\top\v X)^{-1}\v X^\top\v y$
    • complexity : $O(ND^2+D^3)$
  • Gram matrix : $\v X^\top\v X$ invertible iff $rank(\v X)=D$ (full column rank)
  • rank deficiency
    • $D>N$ : always $rank(\v X)<D$
    • ill-conditioned : $D\le N$ but some columns collinear
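
A minimal sketch of solving the normal equations with NumPy; solving the linear system (rather than forming the explicit inverse) is an assumption of standard numerical practice, not something the course prescribes.

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y.
    Fails or becomes unstable when the Gram matrix is rank deficient / ill-conditioned."""
    gram = X.T @ X                    # D x D Gram matrix
    return np.linalg.solve(gram, X.T @ y)
```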

Maximum likelihood

  • gaussian
    • random variable : $\N(y\mid\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp{\left[-\frac{(y-\mu)^2}{2\sigma^2}\right]}$
    • random vector : $\N(\v y\mid\v \mu,\v \Sigma)=\frac{1}{\sqrt{(2\pi)^D\det(\Sigma)}}\exp{\left[-\frac{1}{2}(\v y-\v \mu)^\top\Sigma^{-1}(\v y-\v\mu)\right]}$
  • probabilistic model : $y_n=\v x_n^\top\v w+\epsilon_n$ with $\epsilon_n\sim\N(0,\sigma^2)$
  • log-likelihood : $\L_{lik}(\v w)=\log p(\v y\mid\v X,\v w)=-\frac{1}{2\sigma^2}\sum_{n=1}^N (y_n-\v x_n^\top\v w)^2+ cnst\approx E[\log p(y\mid\v x,\v w)]$
  • maximum likelihood estimator : $\arg\min_{\v w}\L_{mse}(\v w)=\arg\max_{\v w}\L_{lik}(\v w)$
    • expected log-likelihood : $\L_{lik}(\v w)\approx E_{p(y, x)}[\log p(y\mid \v x,\v w)]$
    • $\v w_{mle}\to\v w_{true}$ in probability
    • asymptotically normal : $(\v w_{mle}-\v w_{true})\to\frac{1}{\sqrt{N}}\N(\v w_{mle}\mid \v 0,\v F^{-1}(\v w_{true}))$ with Fisher information $\v F(\v w)=-E_{p(\v Y)}[\frac{\partial^2\L}{\partial \v w\partial \v w^\top}]$
    • $\v w_{lse}=\arg\max_{\v w}\log\left[\Pi_{n=1}^N\N(y_n\mid\v x_n^\top\v w,\sigma^2)\right]$
  • laplace : $p(y_n\mid\v x_n,\v w)=\frac{1}{2b}\exp\left(-\frac{1}{b}\abs{y_n-\v x_n^\top\v w}\right)$, for MAE

Overfitting

  • underfit : too limited, not fitting the signal well
  • overfit : too powerful, fitting noise in addition to the signal
  • get more data : might reduce overfitting
  • Occam's razor : "Plurality is not to be posited without necessity"
  • regularization : add term that penalizes complex models, $\min_{\v w}\L(\v w)+\Omega(\v w)$

Ridge regression and Lasso

  • basis functions : nonlinear transformations applied to input variable, $\phi_j(\v x)$, $M$ basis
  • linear basis function models : $y_n=w_0+\sum_{j=1}^M w_j\phi_j(\v x_n)=\v\phi(\v x_n)^\top\v w$ with $\v\phi(\v x_n)^\top=[1,\phi_1(\v x_n),\cdots,\phi_M(\v x_n)]$
    • linear : in $\v w$
    • nonlinear : in $\v x$
    • dimensions of $\v w$ : $M+1\not = D$
    • replace $\v X$ by $\v \Phi=\begin{bmatrix}1&\phi_1(\v x_1)&\cdots&\phi_M(\v x_1)\\ \vdots&\vdots&\ddots&\vdots\\ 1&\phi_1(\v x_N)&\cdots & \phi_M(\v x_N)\end{bmatrix}$, giving $\v y\approx\v\Phi \v w$
  • $L_2$-regularization : $\Omega(\v w)=\lambda\norm{\v w}_2^2$ with $\norm{\v w}_2^2=\sum_{j=0}^M w_j^2$ and $\lambda$ regularization parameter
    • enforce : model to be simpler
    • ridge regression : $L_2$-regularization with MSE
      • explicit solution : $\v w_{ridge}^*=(\v\Phi^\top\v\Phi+2N\lambda\v I_M)^{-1}\v\Phi^\top\v y$
    • fight ill-conditioning : eigenvalues are at least $2N\lambda$, improve condition number of Gram matrix
  • $L_1$-regularization : $\Omega(\v w)=\lambda\norm{\v w}_1$ with $\norm{\v w}_1=\sum_{j=0}^M \abs{w_j}$
    • enforce : sparsity, some elements of $\v w$ to be $0$
  • lasso : $L_1$ with MSE
  • other regularizations
    • shrinkage
    • dropout
    • weight decay
    • early stopping
  • maximum-a-posteriori : $\v w_{map}=\arg\max_{\v w}p(\v y\mid\v w, \v X, \v\lambda)\cdot p(\v w\mid\v\theta)$, maximize both likelihood and prior
    • prior : $p(\v w\mid\v \lambda)$
    • ridge regression : $\v w_{ridge}=\arg\max_{\v w}\log\left[\Pi_{n=1}^N \N(y_n\mid\v x_n^\top\v w,\sigma^2)\cdot\N(\v w\mid 0,\frac{1}{\lambda}\v I)\right]$
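
A minimal sketch of ridge regression on a polynomial feature expansion, using the explicit solution $(\v\Phi^\top\v\Phi+2N\lambda\v I)^{-1}\v\Phi^\top\v y$ above; the degree, toy data, and $\lambda$ value are illustrative assumptions.

```python
import numpy as np

def poly_basis(x, M):
    """Polynomial basis expansion phi(x) = [1, x, x^2, ..., x^M] for scalar inputs."""
    return np.vander(x, M + 1, increasing=True)

def ridge(Phi, y, lam):
    """Explicit ridge solution (Phi^T Phi + 2 N lam I)^{-1} Phi^T y."""
    N, M1 = Phi.shape
    A = Phi.T @ Phi + 2 * N * lam * np.eye(M1)
    return np.linalg.solve(A, Phi.T @ y)

x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=20)
w = ridge(poly_basis(x, 9), y, lam=1e-3)   # high-degree fit kept in check by lambda
```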

Model selection and cross-validation

  • underlying distribution : $\mathcal D=\mathcal Y\times\mathcal X$, unknown
  • independent samples : $S=\{(y_n,\v x_n)\}^N_{n=1}$
  • learning algorithm : $\mathcal A(S)$
  • prediction function : $f_S=\mathcal A(S)$
  • training set : $S_t$, 80%
  • validation set : $S_v$, 20%
  • generalization error : $\L_D(f)=E[l(y,f(\v x))]$ with $l$ loss function, true/expected risk/loss
  • training error : $L_S(f_S)=\frac{1}{N}\sum_{n=1}^N l(y_n,f(\v x_n))$, empirical
    • issue as a function of itself : $L_S(f_S)=\frac{1}{N}\sum_{n=1}^N l(y_n,f_S(\v x_n))$
    • validation : $L_{S_v}(f_{S_t})=\frac{1}{\abs{S_v}}\sum_{(y_n,\v x_n)\in S_v}l(y_n,f_{S_t}(\v x_n))$ with $L_D(f_{S_t})=E_{S_v\sim D}[L_{S_v}(f_{S_t})]$
    • finite size variation : $P\left\{\max_k\abs{L_D(f_k)-L_{S_v}(f_k)}\ge\sqrt{\frac{\ln(2K/\delta)}{2\abs{S_v}}}\right\}\le\delta$ for $K$ hyper-parameters
  • Chernoff bound : $\Phi_i$ iid RV with range $[0,1]$, $P\left\{\abs{\frac{1}{N}\sum_{n=1}^N\Phi_n-E[\Phi]}>\epsilon\right\}\le 2e^{-2N\epsilon^2}$
  • $K$-fold cross-validation : train on $K-1$, test on remaining, permute and average results
  • Markov inequality : $P(X\ge a)\le\frac{E(X)}{a}$
  • Hoeffding lemma : zero mean RV $X\in[a,b]$ satisfies $E[e^{sX}]\le e^{s^2(b-a)^2/8}$
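
A minimal sketch of $K$-fold cross-validation as described above; `fit` is assumed to return a prediction function and `loss` to compare predictions with labels, both supplied by the caller (these names are illustrative, not from the course).

```python
import numpy as np

def kfold_cv(X, y, K, fit, loss, rng=np.random.default_rng(0)):
    """Train on K-1 folds, evaluate on the held-out fold, average the K results."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[tr], y[tr])              # model is a prediction function
        errs.append(loss(y[val], model(X[val])))
    return np.mean(errs)
```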

Bias-variance decomposition

  • simple model : high bias (bad fit), small variance
  • complex model : small bias, high variance
  • data generation model : $y=f(\v x)+\epsilon$ with zero mean noise $\epsilon$
  • decomposition : $E_{S_t\sim D,noise}[(f(\v x_0)+\epsilon-f_{S_t}(\v x_0))^2]=var[\epsilon]+(f(\v x_0)-E_{S_t\sim D}[f_{S_t}(\v x_0)])^2+E_{S_t\sim D}[(E_{S_t'\sim D}[f_{S_t'}(\v x_0)]-f_{S_t}(\v x_0))^2]=noise + bias^2 +variance$, all terms non-negative

Classification

  • classification : relates input $\v x$ to output $y$ but $y\in\{C_1,C_2\}$ restricted to discrete values
  • class labels : $C_i$
  • multi-class classification : $y\in\{C_0,\ldots,C_{K-1}\}$
  • decision boundaries : decided by a classifier
  • MSE : counts positive and negative errors as equally bad (whereas only one side is wrong), small MSE $\Rightarrow$ small classification error (the converse is not true)
  • nearest neighbors : euclidean distance, useless in high dimension
  • $k$ nearest neighbors
  • smoothing kernels : weighted linear combination of elements in the neighborhood
  • linear decision boundaries : hyperplanes
  • support vector machines : take hyperplane with highest robustness to wiggling
  • kernel trick : augment feature vector with some non-linear functions
  • neural networks : learn an appropriate non-linear transform of the input so that it becomes linearly separable
  • optimal classification : for known generating model
    • known joint distribution : $p(\v x, y)$
    • optimal choice for class label : $\hat y(\v x)$
    • probability of misclassification : $P\{\hat y(\v x)\not=y\}=\int p(\v x)\sum_{y\in\mathcal Y}p(y\mid\v x)\v 1_{\hat y(\v x)\not= y}d\v x$
    • strategy : minimized by $\hat y(\v x)=\arg\max_{y\in\mathcal Y}p(y\mid \v x)$

Logistic regression

  • logistic function : $\sigma(x)=\frac{e^x}{1+e^x}$
  • binary posterior probability
    • $p(C_1\mid\v x)=\sigma(\v x^\top \v w)$
    • $p(C_2\mid\v x)=1-\sigma(\v x^\top \v w)$
  • training : maximize $p(\v y\mid\v X,\v w)=\Pi_{n:y_n=C_1}p(y_n=C_1\mid\v x_n)\Pi_{n:y_n=C_2}p(y_n=C_2\mid \v x_n)=\Pi_{n=1}^N\sigma(\xtw)^{y_n}[1-\sigma(\xtw)]^{1-y_n}$
  • cost function : minimize $\L(\v w)=-\sum_{n=1}^N\left[y_n\log\sigma(\v x_n^\top\v w)+(1-y_n)\log[1-\sigma(\v x_n^\top\v w)]\right]=\sum_{n=1}^N\log[1+\exp(\v x_n^\top\v w)]-y_n \v x_n^\top\v w$
  • gradient : $\nabla\L(\v w)=\sum_{n=1}^N\v x_n(\sigma(\v x_n^\top\v w)-y_n)=\v X^\top[\sigma(\v X\v w)-\v y]$, convex
  • gradient descent : $\v w^{(t+1)}=\v w^{(t)}-\gamma^{(t)}\nabla\L(\v w^{(t)})$
  • Hessian : $\v H_{i,j}=\frac{\partial^2\L(\v w)}{\partial\v w_i\partial \v w_j}$, $D\times D$, symmetric, convex if positive semi-definite
  • Taylor series : $\sum_{n=0}^\infty \frac{f^{(n)}(a)}{n!}(x-a)^n$
    • second order : $\L(\v w)\approx\L(\v w^*)+\nabla\L(\v w^*)^\top(\v w-\v w^*)+\frac{1}{2}(\v w-\v w^*)^\top\v H(\v w^*)(\v w-\v w^*)$
  • Newton's method : $\v w^{(t+1)}=\v w^{(t)}-\gamma^{(t)}(\v H^{(t)})^{-1}\nabla\L(\v w^{(t)})$
    • Hessian : $\v H(\v w)=\v X^\top\v S\v X$ with $\v S$ $N\times N$ with diagonal entries $S_{nn}=\sigma(\v x_n^\top\v w)[1-\sigma(\v x_n^\top\v w)]$
  • linearly separable issue : no finite weight vector (infinite)
  • penalized logistic regression : $\arg\min_{\v w}-\sum_{n=1}^N\log p(y_n\mid\v x_n,\v w)+\lambda\sum_{d=1}^D w_d^2$
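
A minimal sketch of Newton's method for logistic regression with labels in $\{0,1\}$, using the gradient and Hessian given above; the small ridge term added to the Hessian is only a numerical guard against the linearly separable case, not part of the course formulas.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, iters=20, gamma=1.0):
    """Newton steps: gradient X^T (sigma(Xw) - y), Hessian X^T S X
    with S_nn = sigma(x_n^T w) (1 - sigma(x_n^T w))."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y)
        S = p * (1 - p)                                    # diagonal of S
        H = X.T @ (S[:, None] * X) + 1e-8 * np.eye(X.shape[1])  # tiny ridge for stability
        w -= gamma * np.linalg.solve(H, grad)
    return w
```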

Generalized linear models

  • shorthand : $\v\eta=\xtw$
  • exponential family : $p(y\mid\v\eta)=h(y)\exp[\v \eta^\top\v\varphi(y)-A(\v\eta)]$
    • 2 degrees of freedom : $h(y)$, $\v\varphi(y)$
    • normalization term : $A(\v\eta)=\log[\int_y h(y)\exp[\v\eta^\top\v\varphi(y)]dy]<\infty$, cumulant, log partition
    • sufficient statistics : $\v\varphi(y)$
    • logistic : $p(y\mid\v\eta)=\frac{e^{\v\eta y}}{1+e^{\v\eta}}=\exp[\v\eta y -\log(1+e^{\v \eta})]$
    • Bernoulli : $p(y\mid\mu)=\mu^y(1-\mu)^{1-y}=\exp[(\log\frac{\mu}{1-\mu})y+\log(1-\mu)]$
    • Gaussian : $p(y\mid\mu)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right]=\exp[(\mu/\sigma^2, -1/(2\sigma^2))(y, y^2)^\top-\frac{\mu^2}{2\sigma^2}-\frac{1}{2}\log(2\pi\sigma^2)]$
    • properties
      • convex cumulant : as function of $\v\eta$
      • derivative and moments (multiparameters) : $\nabla A(\v\eta)=E[\v\varphi(y)]$
      • Hessian : $\nabla^2 A(\v\eta)=var[\v\varphi(y)]$
  • link function : $\v\eta=g(\v\mu)$ with $\mu=E[\v\varphi(y)]$ when scalar
  • generalized linear models : $p(y\mid\v x,\v w)=h(y)\exp[\v x^\top\v w\varphi(y)-A(\v x^\top\v w)]$ from the exponential family have an easy-to-solve maximum likelihood
    • assume samples obey : $p(y_n\mid\v x_n,\v w)=h(y_n)e^{\eta_n\varphi(y_n)-A(\eta_n)}$
    • cost function : $\L(\v w)=-\sum_{n=1}^N\left[\log h(y_n)+\v x_n^\top\v w\,\varphi(y_n)-A(\v x_n^\top\v w)\right]$
    • gradient : $\nabla_{\v w}\L(\v w)=-\sum_{n=1}^N\v x_n\left[\varphi(y_n)-g^{-1}(\v x_n^\top\v w)\right]=\v X^\top[g^{-1}(\v X\v w)-\varphi(\v y)]$ with $\frac{d A(\eta)}{d\eta}=g^{-1}(\eta)$

Curse of dimensionality and $k$-NN

  • Bayes classifier : MAP classifier
  • $k$-nearest neighbor : $f_{S_t,k}(\v x)=\frac{1}{k}\sum_{n:\v x_n\in nbh_{S_t,k}(\v x)} y_n$ with $nbh_{S_t,k}(\v x)$ set of $k$ input point in $S_t$ the closest to $\v x$
    • variant : weight individual by their distances
    • classification : $f_{S_t,k}(\v x)=$ majority element of $\{y_n\mid \v x_n\in nbh_{S_t,k}(\v x)\}$ (for binary, $k$ should be odd)
    • $E_{S_t}[\L(f_{S_t})]\le 2\L(f_\star)+c E_{S_t,\v x\sim D}[\norm{\v x - nbh_{S_t, 1}(\v x)}]\le 2\L(f_\star)+4c\sqrt{d}N^{-1/(d+1)}$ with $\L(f_\star)$ the risk (classification error) of the optimal Bayes classifier
  • curse of dimensionality : intuitions fail in high dimensions
    • generalizing correctly becomes exponentially harder as the dimensionality grows because fixed-size training sets cover a dwindling fraction of the input space
    • in high-dimension, data-points are far from each other : choice of nearest neighbor becomes effectively random
  • Lipschitz continuity : $\abs{\eta(\v x)-\eta(\v x')}\le c\norm{\v x -\v x'}$ with $c$ constant and $\eta(\v x)=P(y=1\mid\v x)$
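
A minimal sketch of $k$-NN classification by majority vote over Euclidean neighbors; brute-force distance computation is assumed, which is exactly what becomes uninformative in high dimension.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]               # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```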

Support vector machines

  • support vector machine : change cost from Logistic to Hinge
    • cost : $\min_{\v w}\sum_{n=1}^N [1-y_n\v x_n^\top\v w]_++\frac{\lambda}{2}\norm{\v w}^2$
    • hinge loss : $[z]_+=\max\{0,z\}$
    • no obvious probabilistic interpretation
    • extension to multi-class non-trivial
  • labels
    • logistic : $y_n\in\{0,1\}$
    • svm : $y_n\in\{\pm 1\}$
      • hinge : $Hinge(f)=[1-yf]_+$
      • mse : $MSE(f)=(1-yf)^2$
      • logistic : $logisticLoss(f)=\log(1+e^{-yf})$
  • duality : difficult optimization, define $\L(\v w)=\max_{\v\alpha} G(\v w, \v \alpha)$ so choose between $\min_{\v w}\max_{\v\alpha}G(\v w,\v \alpha)=\max_{\v\alpha}\min_{\v w}G(\v w,\v\alpha)$
    • hinge : $[\nu_n]_+=\max\{0,\nu_n\}=\max_{\alpha_n\in[0,1]}\alpha_n\nu_n;\forall n$
      • rewrite svm $\min_{\v w}\max_{\v\alpha\in [0,1]^N}\sum_{n=1}^N \alpha_n(1-y_n\v x_n^\top\v w)+\frac{\lambda}{2}\norm{\v w}^2$
      • convex in $\v w$
      • concave in $\v\alpha$
      • gradient : $\nabla_{\v w}G(\v w,\v\alpha)=-\sum_{n=1}^N\alpha_ny_n\v x_n+\lambda\v w$
      • first-order optimality condition for $\v w$ : $\v w(\v\alpha)=\frac{1}{\lambda}\sum_{n=1}^N\alpha_n y_n\v x_n=\frac{1}{\lambda}\v X\v Y\v \alpha$ with $\v Y=diag(\v y)$ and $\v X$ $N$ data examples as columns
      • dual optimization of svm : $\max_{\v\alpha\in[0,1]^N}\sum_{n=1}^N\alpha_n(1-\frac{1}{\lambda} y_n\v x_n^\top\v X\v Y\v \alpha)+\frac{\lambda}{2}\norm{\frac{1}{\lambda}\v X\v Y\v\alpha}^2=\max_{\v\alpha\in[0,1]^N}\v\alpha^\top\v 1-\frac{1}{2\lambda}\v\alpha^\top\v Y\v X^\top\v X\v Y\v\alpha$
      • differentiable : easy with coordinate descent $\max_{\v\alpha\in[0,1]^N}\v\alpha^\top\v 1-\frac{1}{2\lambda}\v\alpha^\top\v Q\v\alpha$ with kernel $\v Q=diag(\v y)\v X^\top\v X\, diag(\v y)$
      • solution $\v\alpha$ : generally sparse, non-zero only for examples important for boundaries
      • data vectors
        • non-support vectors : correct side, outside margin, $\alpha_n=0$
        • essential support vectors : lie on margin $\alpha_n\in(0,1)$
        • bound support vectors : strictly inside margin or wrong side $\alpha_n=1$
    • minmax theorem : okay to switch min and max when $G$ is convex in one argument and concave in the other (convex conjugate)
  • coordinate descent : update one coordinate at a time, keeping others fixed $\alpha_i\in\v\alpha$
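
A minimal sketch of coordinate ascent on the SVM dual above; here the data matrix is stored with examples as rows (the transpose of the convention used in the derivation), and each coordinate update is the exact maximizer of the quadratic in $\alpha_n$, clipped to $[0,1]$. Hyper-parameters are illustrative.

```python
import numpy as np

def svm_dual_coordinate_ascent(X, y, lam=1.0, epochs=50, rng=np.random.default_rng(0)):
    """Coordinate ascent on  max_{alpha in [0,1]^N}  alpha^T 1 - 1/(2 lam) alpha^T Q alpha,
    with Q_nm = y_n y_m x_n^T x_m (examples stored as rows of X)."""
    N = X.shape[0]
    Yx = y[:, None] * X
    Q = Yx @ Yx.T
    alpha = np.zeros(N)
    for _ in range(epochs):
        for n in rng.permutation(N):
            if Q[n, n] == 0:
                continue
            grad_n = 1.0 - Q[n] @ alpha / lam            # partial derivative w.r.t. alpha_n
            alpha[n] = np.clip(alpha[n] + lam * grad_n / Q[n, n], 0.0, 1.0)
    w = X.T @ (alpha * y) / lam                          # primal w(alpha) = (1/lam) sum alpha_n y_n x_n
    return w, alpha
```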

Kernel ridge regression

  • representer theorem : for a $\v w^\star$ minimizing $\min_{\v w}\sum_{n=1}^N L_n(y_n,\v x_n^\top\v w)+\frac{\lambda}{2}\norm{\v w}^2$ there exists $\v\alpha^\star$ st. $\v w^\star =\v X\v\alpha^\star$
  • trick : $\v w^\star=(\v X\v X^\top+\lambda \v I_D)^{-1}\v X\v y=\v X(\v X^\top \v X+\lambda\v I_N)^{-1}\v y=\v X\v\alpha^\star$
    • complexity before : $O(D^3+ND^2)$
    • complexity after : $O(N^3+DN^2)$
    • $\v w^\star=\v X\v\alpha^\star$ where $\v X=\begin{bmatrix}x_{11}&\cdots& x_{1N}\\ \vdots &\ddots&\vdots\\ x_{D1}&\cdots& x_{DN}\end{bmatrix}$
  • kernel ridge regression : $\v\alpha^\star=\arg\max_{\v\alpha}-\frac{1}{2}\v\alpha^\top(\v X^\top\v X+\lambda\v I_N)\v\alpha+\lambda\v\alpha^\top\v y$
  • kernel function : $\kappa(\v x,\v x')=\phi(\v x)^\top\phi(\v x')$, faster than using $\phi$
    • linear : $\v K=\v X^\top\v X$, i.e. $K_{ij}=\v x_i^\top \v x_j$
    • basis : $\v K=\v\Phi^\top\v\Phi$, i.e. $K_{ij}=\phi(\v x_i)^\top\phi(\v x_j)$
    • radial basis function : $\kappa(\v x,\v x')=\exp[-\frac{1}{2}(\v x-\v x')^\top(\v x-\v x')]$
    • properties : must be an inner-product in some feature space, ensured by
      • $\v K$ symmetric : $\kappa(\v x,\v x')=\kappa(\v x',\v x)$
      • $\v K$ positive semi-definite : for any input set $\{\v x_n\}$
    • can lead to infinite-dimensional feature spaces
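
A minimal sketch of kernel ridge regression with the RBF kernel above: fit $\v\alpha^\star=(\v K+\lambda\v I_N)^{-1}\v y$ as in the trick above, predict using kernel evaluations only; the bandwidth and $\lambda$ values are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel matrix K_ij = exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=0.5):
    """alpha = (K + lam I_N)^{-1} y, working entirely with the N x N kernel matrix."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=0.5):
    """f(x) = sum_n alpha_n kappa(x_n, x)."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```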

Unsupervised learning

  • unsupervised learning
    • density estimation
    • feature learning / representation learning

K-means clustering

  • cluster : groups of points whose inter-point distances are small
  • prototypes : $\v \mu_i$ of cluster assignment $z_n\in\{1,\ldots,K\}$
  • k-means : $\min_{\v z,\v\mu}\L(\v z,\v\mu)=\sum_{n=1}^N\sum_{k=1}^Kz_{nk}\norm{\v x_n-\v\mu_k}_2^2=\norm{\v X-\v M\v Z^\top}_{Frob}^2$ with $\v\mu_k\in\R^D$, $z_{nk}\in\{0,1\}$ and $\sum_{k=1}^K z_{nk}=1$
    • issues : computationally heavy, clusters forced to be spherical, hard cluster assignments
    • complexity : $O(nkdi)$ with $i$ iteration
  • Frobenius norm : $\norm{A}_F=\sqrt{\sum_i\sum_j \abs{a_{ij}}^2}$
  • algorithm : coordinate descent
    • initialize $\v\mu_k$
    • compute $\v z_n$ given $\v\mu$ : $z_{nk}=1$ if $k=\arg\min_j\norm{\v x_n-\v\mu_j}_2^2$
    • compute $\v\mu_k$ given $\v z$ : $\v\mu_k=\frac{\sum_{n=1}^N z_{nk}\v x_n}{\sum_{n=1}^N z_{nk}}$
    • iterate
    • convergence : assured, but only to a local optimum
  • probabilistic model : $p(\v x_n\mid\v\mu, \v z)=\Pi_{n}N(\v x_n\mid\v\mu_{k,\st z_{nk}=1},\v I)=\Pi_n\Pi_k \N(\v x_n\mid\v\mu_k,\v I)^{z_{nk}}$, isotropic variance
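
A minimal sketch of the coordinate-descent algorithm above (assignment step, then mean update); initializing the prototypes by sampling $K$ data points is one common choice, not the only one.

```python
import numpy as np

def kmeans(X, K, iters=100, rng=np.random.default_rng(0)):
    """Coordinate descent: alternate cluster assignments z and means mu."""
    mu = X[rng.choice(len(X), K, replace=False)]          # initialize prototypes
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = np.argmin(d, axis=1)                          # assignment step
        mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                       for k in range(K)])                # mean-update step
    return z, mu
```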

Gaussian mixture models

  • elliptical cluster : $p(\v X\mid\v\mu, \v z)=\Pi_n\Pi_k N(\v x_n\mid\v\mu_k,\v \Sigma_k)^{z_{nk}}$
  • soft-clustering : $z_{nk}$ multinomial with $p(z_n=k)=\pi_k$ with $\sum_k\pi_k=1$
  • gaussian mixture model : $p(\v X,\v z\mid\v\mu,\v\Sigma,\v\pi)=\Pi_{n=1}^N p(\v x_n\mid z_n,\v\mu,\v\Sigma)p(z_n\mid\v\pi)=\Pi_{n=1}^N\Pi_{k=1}^K\N(\v x_n\mid\v\mu_k,\v\Sigma_k)^{z_{nk}}\Pi_{k=1}^K\pi_k^{z_{nk}}$
    • observed data vectors : $\v x_n$
    • latent unobserved variables : $z_n$
    • unknown parameters : $\v\theta=\{\v\mu_1,\ldots,\v\mu_K,\v\Sigma_1,\ldots,\v\Sigma_K,\v\pi\}$
    • advantage : $z_n$ can be marginalized out of the cost functions as if not existing
    • marginalization : $p(\v x_n\mid\v\theta)=\sum_{k=1}^K\pi_k\N(\v x_n\mid\v\mu_k,\v\Sigma_k)$
      • complexity : $O(D^2K)$, before depend on $O(N)$
      • maximum likelihood : $\max_{\v\theta}\sum_{n=1}^N\log\sum_{k=1}^K\pi_k N(\v x_n\mid\v\mu_k,\v\Sigma_k)$, not convex, degenerate, not identifiable (multiple combination)

Expectation-maximization algorithm

  • expectation-maximization : elegant method to optimize the sum-inside-the-log problem above
  • algorithm
    • start with $\v\theta^{(1)}$
    • expectation step : compute lower bound to cost st. tight at the previous $\v\theta^{(t)}$ $\L(\v\theta)\ge\underline\L(\v\theta,\v\theta^{(t)})$ and $\L(\v\theta^{(t)})=\underline\L(\v\theta^{(t)},\v\theta^{(t)})$
      • per-sample bound : $\log\sum_k\pi_k\N(\v x_n\mid\v\mu_k,\v\Sigma_k)\ge\sum_k q_{kn}^{(t)}\log\frac{\pi_k\N(\v x_n\mid\v\mu_k,\v\Sigma_k)}{q_{kn}^{(t)}}$
      • equality : $q_{kn}^{(t)}=\frac{\pi_k^{(t)}\N(\v x_n\mid\v\mu_k^{(t)},\v\Sigma_k^{(t)})}{\sum_k \pi_k^{(t)}\N(\v x_n\mid\v\mu_k^{(t)},\v\Sigma_k^{(t)})}$
      • compute : $\L(\v\theta^{(t)})=\sum_n\log\sum_k\pi_k^{(t)}\N(\v x_n\mid\v\mu_k^{(t)},\v\Sigma_k^{(t)})$
    • maximization step : update $\v\theta$, $\v\theta^{(t+1)}=\arg\max_{\v\theta}\underline\L(\v\theta,\v\theta^{(t)})$
      • maximize lower bound : $\max_{\v\theta}\sum_n\sum_k q_{kn}^{(t)}[\log\pi_k+\log\N(\v x_n\mid\v\mu_k,\v\Sigma_k)]$ for fixed $q_{kn}^{(t)}$
      • $\mu$ update : $\v\mu_k^{(t+1)}=\frac{\sum_n q_{kn}^{(t)}\v x_n}{\sum_n q_{kn}^{(t)}}$
      • $\Sigma$ update : $\v\Sigma_k^{(t+1)}=\frac{\sum_n q_{kn}^{(t)}(\v x_n-\v\mu_k^{(t+1)})(\v x_n-\v\mu_k^{(t+1)})^\top}{\sum_n q_{kn}^{(t)}}$
      • $\pi$ update : $\pi_k^{(t+1)}=\frac{1}{N}\sum_n q_{kn}^{(t)}$, need to add a Lagrangian term
    • iterate
    • same as k-means : $\v\Sigma_k=\sigma^2\v I$
  • posterior distribution : $p(\v x_n,z_n\mid\v\theta)=p(\v x_n\mid z_n,\v\theta)p(z_n\mid\v\theta)=p(z_n\mid\v x_n,\v\theta)p(\v x_n\mid\v\theta)$
  • EM in general : $\v\theta^{(t+1)}=\arg\max_{\v\theta}\sum_n E_{p(z_n\mid\v x_n,\v\theta^{(t)})}[\log p(\v x_n, z_n\mid\v\theta)]$ with joint distribution $p(\v x_n,z_n\mid\v\theta)$
  • log concavity : $\log \sum_k q_kt_k\ge\sum_k q_k\log t_k$ when $q_k,t_k$ non-negative and $\sum_k q_k=1$
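
A minimal sketch of EM for a Gaussian mixture following the updates above; it leans on `scipy.stats.multivariate_normal` for the Gaussian densities, and the small diagonal term added to each covariance is only a numerical guard against degenerate clusters, not part of the course updates.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, iters=50, rng=np.random.default_rng(0)):
    """E-step: responsibilities q_kn; M-step: closed-form updates for (pi, mu, Sigma)."""
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.eye(D)] * K)
    for _ in range(iters):
        # E-step: q[n, k] proportional to pi_k N(x_n | mu_k, Sigma_k)
        q = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
        q /= q.sum(axis=1, keepdims=True)
        # M-step
        Nk = q.sum(axis=0)
        pi = Nk / N
        mu = (q.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (q[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, q
```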

Matrix factorization

  • matrix : $\v X=D\times N$ with rows indexed by movie $d$ and columns by user $n$
  • factorization : $\v X\approx\v W\v Z^\top$ with $\v W\in\R^{D\times K}$ and $\v Z\in\R^{N\times K}$
  • learning : $\min_{\v W,\v Z}\L(\v W,\v Z)=\frac{1}{2}\sum_{(d,n)\in\Omega}[x_{dn}-(\v W\v Z^\top)_{dn}]^2$ with $\Omega$ set of observed input, not convex, not identifiable
  • latent feature : $K$
  • regularization : $\frac{1}{2}\sum_{(d,n)\in\Omega}[x_{dn}-(\v W\v Z^\top)_{dn}]^2+\frac{\lambda_w}{2}\norm{\v W}_F^2+\frac{\lambda_z}{2}\norm{\v Z}_F^2$
  • stochastic gradient descent
    • objective : $\frac{1}{\abs{\Omega}}\sum_{(d,n)\in\Omega}[x_{dn}-(\v W\v Z^\top)_{dn}]^2$
    • gradient w.r.t. $w_{d',k}$ : $-[x_{dn}-(\v W\v Z^\top)_{dn}]z_{n,k}$ if $d'=d$, else $0$
    • gradient w.r.t. $z_{n',k}$ : $-[x_{dn}-(\v W\v Z^\top)_{dn}]w_{d,k}$ if $n'=n$, else $0$
    • complexity : $O((D+N)K)$
  • alternating least-squares : coordinate descent $\frac{1}{2}\sum_d\sum_n[x_{dn}-(\v W\v Z^\top)_{dn}]^2=\frac{1}{2}\norm{\v X-\v W\v Z^\top}_F^2$ if no missing entries
    • $\v Z^\top=(\v W^\top\v W+\lambda_z\v I_k)^{-1}\v W^\top\v X$
    • $\v W^\top=(\v Z^\top\v Z+\lambda_w\v I_k)^{-1}\v Z^\top\v X^\top$
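
A minimal sketch of alternating least squares for $\v X\approx\v W\v Z^\top$ with no missing entries, using the two closed-form updates above; the rank, regularizers, and iteration count are illustrative choices.

```python
import numpy as np

def als(X, K, lam_w=0.1, lam_z=0.1, iters=20, rng=np.random.default_rng(0)):
    """Alternating least squares for X ~ W Z^T (assumes no missing entries)."""
    D, N = X.shape
    W = rng.normal(size=(D, K))
    Z = rng.normal(size=(N, K))
    for _ in range(iters):
        # fix W, solve for Z:  Z^T = (W^T W + lam_z I)^{-1} W^T X
        Z = np.linalg.solve(W.T @ W + lam_z * np.eye(K), W.T @ X).T
        # fix Z, solve for W:  W^T = (Z^T Z + lam_w I)^{-1} Z^T X^T
        W = np.linalg.solve(Z.T @ Z + lam_w * np.eye(K), Z.T @ X.T).T
    return W, Z
```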

Text representation learning

  • embedding : numerical representation for words $w_i\mapsto \v w_i\in\R^K$
  • co-occurrence matrix : $n_{ij}=$ #contexts where word $w_i$ occurs together with word $w_j$, sparse $D\times N$
    • context : document, paragraph, sentence, window
    • words : $w_d$
    • context words : $w_n$
    • vocabulary : $\mathcal V=\{w_1,\ldots,w_D\}$
  • matrix factorization : $\min_{\v W,\v Z}\L(\v W,\v Z)=\frac{1}{2}\sum_{(d,n)\in\Omega}f_{dn}[x_{dn}-(\v W\v Z^\top)_{dn}]^2$ with weights $f_{dn}$
    • use log of counts $x_{dn}=\log(n_{dn})$
    • GloVe model : $f_{dn}=\min\{1, (n_{dn}/n_{\max})^\alpha\}$ for $\alpha\in[0,1]$
    • SGD or ALS
  • skip-gram model : original word2vec, use binary classification (logistic) to separate good word pairs $(w_d,w_n)$ (appearing together in a context) from bad word pairs (random apparition)
  • fasttext : supervised, $\v W\in\R^{1\times K}$ and $\v Z\in\R^{\abs{\mathcal V}\times K}$
    • bag-of-words representation : $\v x_n\in\R^{\abs{\mathcal V}}$ from sentence $s_n=(w_1,\ldots,w_m)$
    • learning : $\min_{\v W,\v Z}\L(\v W,\v Z)=\sum_{s_n}f(y_n\v W\v Z^\top\v x_n)$ with $f$ the loss of a linear classifier and label $y_n\in\{-1,1\}$

SVD and PCA

  • dimensionality reduction
  • principal component analysis : linear mapping from $D$ to $K$ dimensions of maximum variance (decorrelates the data), bad if features are not normalized
  • unitary matrix : $\v U\v U^\top=\v U^\top\v U=\v I$ with rows or cols orthonormal
    • linear transformation : $\norm{\v U\v x}_2^2=\v x^\top\v U^\top\v U\v x=\v x^\top\v x=\norm{\v x}_2^2$
  • singular value decomposition : $\v X=\v U\v S\v V^\top$
    • $\v X=D\times N$ and $D<N$
    • $\v U=D\times D$ unitary : left singular vectors
    • $\v S=D\times N$ diagonal : singular values, by convention $s_1\ge\cdots\ge s_D\ge 0$
    • $\v V^\top=N\times N$ unitary : right singular vectors
    • learning : minimize $\norm{\v X-\v R\v C\v X}_F^2$
      • compression : $\v C=K\times D$
      • reconstruction : $\v R=D\times K$
  • SVD lemma : $\norm{\v X-\v{\hat{X}}}_F^2\ge\norm{\v X-\v U_K\v U_K^\top\v X}_F^2=\sum_{i\ge K+1}s_i^2$ with $s_i$ singular values of $\v X$ and $\v U_K$ the first $K$ columns of $\v U$, for any $\v{\hat{X}}=D\times N$ of rank $K$
  • identity : $\v U_K\v U_K^\top\v U\v S=\v U\v S^{(K)}$
  • link to matrix factorization : $\v X=\v U\v S\v V^\top=\v U\v S_D^{1/2}S^{1/2}\v V^\top=\v W\v Z^\top$
    • $K$ rank approximation $\v X_K=\v U\v S^{(K)}\v V^\top$
    • $\v X_K=\v U_K\v{\hat{S}}^{(K)}\v V^\top=\v U_K(\v{\hat{S}}^{(K)})_K^{1/2}(\v{\hat{S}}^{(K)})^{1/2}\v V^\top=\v W\v Z^\top$
  • decorrelation : $N\v K=\v X\v X^\top=\v U\v S\v V^\top\v V\v S^\top\v U^\top=\v U\v S\v S^\top\v U^\top=\v U\v S_D^2\v U^\top$ gives uncorrelated features $\v{\hat{X}}=\v U^\top\v X$
    • mean : $\bar{\v{x}}=\frac{1}{N}\sum_n\v x_n$
    • variance : $\v K=\frac{1}{N}\sum_n(\v x_n-\bar{\v{x}})(\v x_n-\bar{\v{x}})^\top$
  • computation : solving eigenvector/value for $\v X\v X^\top$
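
A minimal sketch of PCA via the SVD with the data stored as a $D\times N$ matrix as above: center, keep the first $K$ left singular vectors, project, and reconstruct.

```python
import numpy as np

def pca(X, K):
    """PCA via SVD on centered D x N data."""
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    U_K = U[:, :K]                      # principal directions (first K left singular vectors)
    codes = U_K.T @ Xc                  # K x N compressed representation
    X_hat = U_K @ codes + mean          # rank-K reconstruction
    return U_K, codes, X_hat
```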

Neural nets - Basic structures

  • input layer : size $D$
  • hidden layer : $L$ layer of size $K$
  • output layer
  • edge : $w_{i,j}^{(l)}$ from node $i$ in layer $l-1$ to node $j$ in layer $l$
  • node : $x_j^{(l)}=\phi(\sum_i w_{i,j}^{(l)}x_i^{(l-1)}+b_j^{(l)})$ at layer $l$ output of node $j$
  • bias term : $b_j^{(l)}$ parameter
  • activation function : $\phi(\cdot)$, non-linear
    • sigmoid : $\phi(x)=\frac{1}{1+e^{-x}}$

Neural nets - Representation power

  • sigmoid-like : left limit $0$ right limit $1$
  • lemma : a sufficiently smooth function can be approximated by a neural net with one hidden layer, and the approximation error goes down like one over the number of nodes in the hidden layer
  • consider : $f(x)=\phi(w(x-b))$
  • width of the transition : $1/w$
  • approximate rectangle : $\phi(w(x-a))-\phi(w(x-b))$
  • higher dimension : requires an additional hidden layer to filter out the arms extending along the other dimensions
  • rectified linear function : $(x)_+=\max\{0,x\}$
  • classical Stone-Weierstrass theorem : for every $\epsilon>0$ there exists a polynomial $p$ with $\abs{f(x)-p(x)}<\epsilon$ for all $x\in[0,1]$
    • can be approximated in $L_\infty$
    • continuous piecewise-linear function : $q(x)=\sum_{i=1}^m(a_ix+b_i)1_{\{r_{i-1}\le x < r_i\}}$
    • continuity requires : $q(x)=\tilde a_1x+\tilde b_1+\sum_{i=2}^m\tilde a_i(x-\tilde b_i)_+$ with $a_i=\sum_{j=1}^i\tilde a_j$ and $\tilde b_i=r_{i-1}$

Neural nets - Training

  • standard cost function : $\L=\frac{1}{N}\sum_{n=1}^N(y_n-f(\v x_n))^2$
  • SGD : not convex but stable (the outcome barely changes if a single sample is replaced), thus less prone to overfitting
  • backpropagation :
    • weight matrix : $\v W^{(l)}$ that connects layer $l-1$ to $l$, $\v W_{i,j}^{(l)}=w_{i,j}^{(l)}$
    • bias vectors : $\v b^{(l)}$
    • function on each layer : $\v x^{(l)}=f^{(l)}(\v x^{(l-1)})=\phi((\v W^{(l)})^\top\v x^{(l-1)}+\v b^{(l)})$
    • overall function : $f(\v x^{(0)})=f^{(L+1)}\circ\cdots\circ f^{(1)}(\v x_n^{(0)})$
    • cost function : $\L=\frac{1}{N}\sum_{n=1}^N (y_n-f^{(L+1)}\circ\cdots\circ f^{(1)}(\v x_n^{(0)}))^2$ independent of the form of loss (only initialization changes)
    • algorithm
      • start : $\L_n=(y_n-f^{(L+1)}\circ\cdots\circ f^{(1)}(\v x_n^{(0)}))^2$
      • forward pass : $\v z^{(l)}=(\v W^{(l)})^\top\v x^{(l-1)}+\v b^{(l)}$ with $\v x^{(0)}=\v x_n$ and $\v x^{(l-1)}=\phi(\v z^{(l-1)})$
      • corresponding vector at level $l$ : $\delta_j^{(l)}=\frac{\partial\L_n}{\partial z_j^{(l)}}$
      • backward pass : $\delta_j^{(l)}=\frac{\partial\L_n}{\partial z_j^{(l)}}=\sum_k \frac{\partial\L_n}{\partial z_k^{(l+1)}}\frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}=\sum_k\delta_k^{(l+1)}\v W_{j,k}^{(l+1)}\phi'(z_j^{(l)})$
        • MSE start with : $\delta^{(L+1)}=-2(y_n-\v x^{(L+1)})\phi'(z^{(L+1)})$
      • vector form : $\delta^{(l)}=(\v W^{(l+1)}\delta^{(l+1)})\odot\phi'(\v z^{(l)})$
      • Hadamard product : $\odot$ pointwise multiplication of vectors
      • weights : $\frac{\partial \L_n}{\partial w_{i,j}^{(l)}}=\sum_k\frac{\partial \L_n}{\partial z_k^{(l)}}\frac{\partial z_k^{(l)}}{\partial w_{i,j}^{(l)}}=\frac{\partial \L_n}{\partial z_j^{(l)}}\frac{\partial z_j^{(l)}}{\partial w_{i,j}^{(l)}}=\delta_j^{(l)}\v x_i^{(l-1)}$
      • bias : $\frac{\partial \L_n}{\partial b_j^{(l)}}=\sum_k\frac{\partial \L_n}{\partial z_k^{(l)}}\frac{\partial z_k^{(l)}}{\partial b_j^{(l)}}=\frac{\partial\L_n}{\partial z_j^{(l)}}\frac{\partial z_j^{(l)}}{\partial b_j^{(l)}}=\delta_j^{(l)}$
      • sum dropped : when changing the weight $w_{i,j}^{(l)}$, only $z_j^{(l)}$ changes
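
A minimal sketch of one backpropagation/SGD step for a single hidden layer with sigmoid activations and squared loss, following the forward pass, the $\delta$ recursion and the weight/bias gradients above; shapes follow the convention $\v x^{(l)}=\phi((\v W^{(l)})^\top\v x^{(l-1)}+\v b^{(l)})$, and the step size is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, gamma=0.1):
    """One SGD step for a 1-hidden-layer net, sigmoid activations, squared loss.
    Shapes: x (D,), y (O,), W1 (D,K), b1 (K,), W2 (K,O), b2 (O,)."""
    # forward pass: z^(l) = W^(l)^T x^(l-1) + b^(l), x^(l) = phi(z^(l))
    z1 = W1.T @ x + b1
    x1 = sigmoid(z1)
    z2 = W2.T @ x1 + b2
    x2 = sigmoid(z2)                          # network output x^(L+1)
    # backward pass: delta^(l) = (W^(l+1) delta^(l+1)) * phi'(z^(l))
    d2 = -2.0 * (y - x2) * x2 * (1.0 - x2)    # output-layer delta for the squared loss
    d1 = (W2 @ d2) * x1 * (1.0 - x1)
    # gradients: dL/dW_ij^(l) = delta_j^(l) x_i^(l-1), dL/db_j^(l) = delta_j^(l)
    W2 -= gamma * np.outer(x1, d2)
    b2 -= gamma * d2
    W1 -= gamma * np.outer(x, d1)
    b1 -= gamma * d1
    return W1, b1, W2, b2
```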

Neural nets - Popular activation functions

  • sigmoid : positive, bounded, vanishing gradient problem
  • tanh : $\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}=2\phi(2x)-1$, balanced, bounded, vanishing gradient
  • relu : rectified linear unit, positive, unbounded, not vanishing
  • leaky relu : $f(x)=\max\{\alpha x, x\}$, avoid null derivative
  • maxout : $f(x)=\max\{\v x^\top\v w_1+b_1,\ldots,\v x^\top\v w_k+b_k\}$, generalizes relus

Neural nets - Convolutional nets

  • local processing of data
  • convolution : $x^{(1)}[n]=\sum_k f[k]x^{(0)}[n-k]$
  • weight sharing : same weights at every position, training by summing up edges that share same weights
  • handling of borders
    • zero padding
    • valid padding : convolve only inside the original data
  • multiple channel : one more dimension, typical scenario, decrease size and increase depth like a pyramid
  • fast fourier transform : $O(cL\log_2(L))$
    • dft : $\hat x[k]=\sum_{n=0}^{L-1} x[n]\exp\left(-\frac{2\pi j}{L}kn\right)$
    • idft : $x[n]=\frac{1}{L}\sum_{k=0}^{L-1}\hat x[k]\exp\left(\frac{2\pi j}{L}kn\right)$
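
A minimal sketch of the 1D convolution above with valid padding, making the weight sharing explicit (every output position reuses the same filter weights); the helper name is illustrative.

```python
import numpy as np

def conv1d_valid(x, f):
    """x1[n] = sum_k f[k] x[n-k], keeping only positions where the filter fully overlaps x.
    Equivalent to np.convolve(x, f, mode='valid')."""
    L, K = len(x), len(f)
    return np.array([np.dot(f, x[n:n + K][::-1]) for n in range(L - K + 1)])
```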

Neural nets - Regularization, data augmentation, dropout

  • regularization : get smaller weights, avoid overfitting
    • weights : $\frac{1}{2}\sum_{l=1}^{L+1}\mu^{(l)}\norm{\v W^{(l)}}_F^2$ with $\mu^{(l)}$ a constant that depends on layer
    • bias : common to not penalize
    • update rule : $\Theta[t+1]=\Theta[t]-\eta(\nabla_\Theta \L+\mu\Theta[t])=(1-\eta\mu)\Theta[t]-\eta\nabla_\Theta\L$ with constant $\mu$ and $\Theta=w_{i,j}^{(l)}$
    • leads to weight decay
  • projected gradient descent : weights should not have a $L_2$ norm bigger than a constant, if so, normalize with the constant
  • dataset augmentation : create some variants (rotation, resize), learn to be invariant to this transformation, or train a network for some tasks and specialize only last layers to their respective tasks
  • dropout : keep node $i$ in layer $l$ with probability $p_i^{(l)}$, ignore the others; draw a new mask at each iteration ($K$ subnetworks), predict by averaging the $K$ subnetworks or by scaling the output by $p$
  • bagging : averaging over many models

Graphical models - Bayes nets

  • graphical models : establish relationships between random variables; Bayes nets, Markov random fields, factor graphs
  • Bayes nets
    • random variables : $X_1,\ldots,X_D$
    • joint distribution : $p(X_1,\ldots,X_D)$
    • chain rule : $p(X_1,\ldots,X_D)=p(X_1)p(X_2\mid X_1)\cdots p(X_D\mid X_1,\ldots, X_{D-1})$, any of $D!$ orders
    • directed edge : from $X_j$ to $X_i$ if $X_j$ (parent) in conditioning of $X_i$ (child)
    • topology : for a fixed ordering the construction always gives the same graph, always acyclic
  • tail-to-tail : both arrows point away (conditioning leads to independence)
  • head-to-tail/tail-to-head : one arrow point away, another in (conditioning leads to independence)
  • head-to-head : both arrows point in (conditioning creates dependence)
  • descendant : child, child of child, ...
  • blocked path : node $X$ to node $Y$ blocked by $Z$ iff it contains a variable st. variable is in $Z$ and it is head-to-tail/tail-to-tail, or st. node is head-to-head and neither this node nor descendants are in $Z$
  • $D$-separation : $X$ and $Y$ $D$-separated by $Z$ iff every path from any element of $X$ to any element of $Y$ is blocked by $Z$
  • independence : $X$ conditionally independent of random variables $Y$ conditioned on variables $Z$ iff $X$ and $Y$ are $D$-separated by $Z$ (symmetric)
  • markov blanket : minimal set such that every random variable outside this set is conditionally independent of the node given the set; contains parents, children, co-parents (other parents of the children)
  • generating sample : storage exponential in largest number of parents

Graphical models - Factor graphs

  • variables : $x_1,\ldots,x_D$ can be random variables
  • multidimensional function : factorization $f(x_1,\ldots,x_D)=\Pi_{a\in A}f_a(x_{\partial a})$
    • variables : indexed by integers
    • factor : count $\abs{A}$, indexed by lower case Roman letters
    • subset : $x_{\partial a}$ involved by factor ($\partial$ means neighbors of)
  • factor graph : represents the factorization with a bipartite graph ($D$ variable nodes and $\abs{A}$ factor nodes, with an edge between variable $i$ and factor $a$ if $x_i$ participates in $f_a$)
  • tree factor graph : if Bayes net is a tree, not the converse
  • complexity of computing marginals : exponential in $\max_{a\in A}\abs{x_{\partial a}}$ (better a lot of factors)
  • sum product algorithm : need $f(x_1)=\sum_{\sim x_1}f(x_1,\ldots,x_6)$ out of $f(x_1,\ldots,x_6)=f_1(x_1,x_2,x_3)f_2(x_1,x_4,x_6)f_3(x_4)f_4(x_4,x_5)$
    • brute force : $O(\abs{\mathcal X}^6)$
    • refinement : $f(x_1)=[\sum_{x_2,x_3}f_1(x_1,x_2,x_3)][\sum_{x_4}f_3(x_4)(\sum_{x_6} f_2(x_1,x_4,x_6))(\sum_{x_5} f_4(x_4,x_5))]$
    • complexity : $O(\abs{\mathcal X}^3)$, proportional to highest degree
    • marginalization : $\sum_{\sim z}g(z,\ldots)=\sum_{\sim z}\Pi_{k=1}^K[g_k(z,\ldots)]=\Pi_{k=1}^K[\sum_{\sim z}g_k(z,\ldots)]$
    • rules
      • initialization at leaf
        • factor : $\mu(x)=f(x)$
        • variable : $\mu(x)=1$
      • node processing
        • factor : $\mu(x)=\sum_{\sim x}f(x, x_1,\ldots x_J)\Pi_{j=1}^J\mu_j(x_j)$
        • variable : $\mu(x)=\Pi_{k=1}^K\mu_k(x)$
      • marginalization : $\Pi_{k=1}^{K+1}\mu_k(x)$
  • loopy belief propagation : group factor together, approximation
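
A small NumPy check of the marginalization example above, assuming arbitrary random factors over an alphabet of size $\abs{\mathcal X}=3$: the brute-force sum over $x_2,\ldots,x_6$ and the refined order give the same marginal $f(x_1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = 3                                   # alphabet size |X|
f1 = rng.random((X, X, X))              # f1(x1, x2, x3)
f2 = rng.random((X, X, X))              # f2(x1, x4, x6)
f3 = rng.random(X)                      # f3(x4)
f4 = rng.random((X, X))                 # f4(x4, x5)

# brute force: sum the full product over x2..x6, O(|X|^6)
brute = np.einsum('abc,adf,d,de->a', f1, f2, f3, f4)

# refined order from the text: push the sums inside the factorization, O(|X|^3)
m1 = f1.sum(axis=(1, 2))                # sum over x2, x3 -> function of x1
m2 = f2.sum(axis=2)                     # sum over x6     -> function of (x1, x4)
m4 = f4.sum(axis=1)                     # sum over x5     -> function of x4
refined = m1 * (m2 @ (f3 * m4))

print(np.allclose(brute, refined))      # True: same marginal f(x1)
```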