$$ \def\v#1{\boldsymbol{#1}} \def\abs#1{\left |#1\right |} \def\norm#1{\left |\left|#1\right |\right |} \def\cases#1{\begin{cases}#1\end{cases}} \def\L{\mathcal L} \def\N{\mathcal N} \def\R{\mathbb R} \def\st{\text{ st. }} \def\xtw{\v x^\top\v w} $$

CS433

GNU General Public License v3.0 licensed. Source available on github.com/zifeo/EPFL.

Fall 2016: Pattern Classification and Machine Learning

[TOC]

Regression

  • input variable : features, covariates, independent variables, explanatory variables, exogenous variables, predictors, regressors
  • output variable : target, label, response, outcome, dependent variables, endogenous variables, measured variables, regressands
  • regression : relate input to output for prediction or interpretation
  • data : $(y_n,\v x_n)$
    • dimensionality : $D$
    • data-size : $N$
  • regression function : $y\approx f(\v x_n);\forall n$
  • correlation $\not=$ causation

Linear regression

  • linear regression : model that assumes a linear relationship
  • parameters : $\v w=(w_0,w_1)$
  • simple linear regression : $y_n\approx f(\v x_n)=w_0+w_1x_{n1}$
  • multi linear regression : $y_n\approx f(\v x_n)=w_0+w_1x_{n1}+\cdots+w_Dx_{nD}=\v x_n^\top \v w$ with $\v x_n=(1,x_{n1},\cdots,x_{nD})^\top$
  • learning : estimation, fitting, find $\v w$
  • $D>N$ problem : more parameters than data, under-determined for linear regression, use regularization

Cost functions

  • cost function : energy, loss, training objective, quantifies the model
    • $y$ real-valued : better symmetric around $0$
    • same penalization : large mistakes and very-large mistakes
  • statistical vs computational trade-off
  • outliers : far away data samples, handling them is a statistical property
  • mean square error : $MSE(\v w)=\frac{1}{N}\sum_{n=1}^N [y_n-f(\v x_n)]^2$, outliers sensitive
  • mean absolute error : $MAE(\v w)=\frac{1}{N}\sum\abs{y_n-f(\v x_n)}$, outliers robust
  • convexity : $f(\lambda \v u +(1-\lambda)\v v)\le\lambda f(\v u)+(1-\lambda)f(\v v)$ for any $\v u,\v v$ and $0\le\lambda\le1$
    • concave : negative convexity
    • unique global minimum value
  • error vector : $\v e=\v y -\v X\v w$
  • Huber loss : $Huber=\cases{\frac{1}{2}\v e^2 & \abs{\v e}\le\delta\\ \delta\abs{\v e}-\frac{1}{2}\delta^2 & \abs{\v e}>\delta}$, outliers robust, convex, differentiable
  • Tukey's bisquare loss : $\frac{\partial \L}{\partial \v e}=\cases{\v e[1-\v e^2/\delta^2]^2 & \abs{\v e}\le\delta\\ 0 & \abs{\v e}>\delta}$, outliers robust, non-convex
  • Hinge loss : $Hinge=\max(0,\v e)$, relu
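
A minimal NumPy sketch of the losses above, applied to an error vector $\v e=\v y-\v X\v w$; the function names and the toy error vector are illustrative only, not from the course material.

```python
import numpy as np

def mse(e):
    """Mean square error: sensitive to outliers."""
    return np.mean(e ** 2)

def mae(e):
    """Mean absolute error: robust to outliers."""
    return np.mean(np.abs(e))

def huber(e, delta=1.0):
    """Huber loss: quadratic near 0, linear in the tails; convex and differentiable."""
    quad = 0.5 * e ** 2
    lin = delta * np.abs(e) - 0.5 * delta ** 2
    return np.mean(np.where(np.abs(e) <= delta, quad, lin))

e = np.array([0.1, -0.3, 5.0])   # the last entry acts as an outlier
print(mse(e), mae(e), huber(e))  # MSE is dominated by the outlier, MAE and Huber stay moderate
```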

Optimization

  • optimization : $\min_{\v w}\L(\v w)$
  • grid search : brute-force, exponential, $10^D$
  • local minimum : $\L(\v w^*)\le\L(\v w);\forall\v w\st\norm{\v w-\v w^*}<\epsilon$
  • global minimum : $\L(\v w^*)\le\L(\v w);\forall\v w$
  • strict minimum : inequality is strict
  • lower bound : required for any cost function
  • smooth optimization : differentiable
  • gradient : $\nabla\L(\v w)=\left(\frac{\partial\L(\v w)}{\partial w_1},\cdots,\frac{\partial\L(\v w)}{\partial w_D}\right)^\top$
  • gradient descent (batch) : $\v w^{(t+1)}=\v w^{(t)}-\gamma \v g$ with $\v g=\nabla\L(\v w^{(t)})$
    • step-size : learning rate, $\gamma>0$
  • MSE : $\L(\v w)=\frac{1}{2N}\sum_{n=1}^N(y_n-\v x_n^\top\v w)^2=\frac{1}{2N}\v e^\top\v e$
    • gradient : $\nabla\L(\v w)=-\frac{1}{N}\v X^\top\v e$
    • complexity : $O(ND+D)$
  • stochastic gradient descent : $\v g = \nabla\L_n(\v w^{(t)})$ with $E[\nabla\L_n(\v w)]=\nabla\L(\v w)$
    • complexity : $O(D)$
  • mini-batch stochastic gradient descent : $\v g = \frac{1}{\abs{B}}\sum_{n\in B}\nabla\L_n(\v w^{(t)})$ with $B\subseteq N$ randomly chosen
    • complexity : $O(D\abs{B})$
  • alternative convexity characterization : $\L(\v u)\ge\L(\v w)+\nabla\L(\v w)^\top(\v u-\v w);\forall\v u,\v w$, always above its linearization, for differentiable functions only
  • subgradients : $\v g$ st. $\L(\v u)\ge\L(\v w)+\v g^\top(\v u-\v w);\forall\v u$, if $\L$ differentiable, $\v g=\nabla\L(\v w)$
  • MAE : $\v g=\cases{-1 & e<0\\ [-1,1]& e=0\\ 1 & e>0}$
  • constrained optimization : $\min_{\v w}\L(\v w)$ st. $\v w\in C$
  • convex sets : $\theta\v u+(1-\theta)\v v\in C$ for any $\v u,\v v\in C$ and $0\le\theta\le1$
    • intersection : convex
  • projected gradient descent : unique, efficient, $P_C(\v w')=\arg\min_{\v v\in C}\norm{\v v-\v w'}$
    • update rule : $\v w^{(t+1)}=P_C[\v w^{(t)}-\gamma\nabla\L(\v w^{(t)})]$
  • penalty functions : $\min_{\v w}\L(\v w)+I_C(\v w)$
    • brick wall : $I_C(\v w)=\cases{0 & \v w\in C\\ \infty &\v w\not\in C}$, not continuous
    • penalize error : $\min_{\v w}\L(\v w)+\lambda\norm{A\v w-\v b}^2$ with trade-off parameter $\lambda$
  • stopping criteria : $\nabla\L(\v w)$ close to $0$, close to optimum
  • optimality
    • necessary : $\nabla\L(\v w^*)=0$
    • sufficient : positive second-order derivative
  • step-size selection : convergence guaranteed only when $\gamma$ is smaller than a problem-dependent threshold, application specific
  • line-search : step-size defined automatically
  • Robbins-Monro condition : $\v w^{(t+1)}=\v w^{(t)}-\gamma^{(t)}\nabla\L_n(\v w^{(t)})$ with $\sum_{t=1}^\infty\gamma^{(t)}=\infty$ and $\sum_{t=1}^\infty (\gamma^{(t)})^2<\infty$
    • $\gamma^{(t)}=1/(1+t)^r$ with $r\in(0.5,1)$
  • feature normalization : gradient sensitive to ill-conditioning, directions might converge at different speed, need normalization
  • Hessian matrix : $\v H(\v w^*)=\frac{\partial^2\L(\v w^*)}{\partial \v w\partial \v w^\top}$ positive semi-definite
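
A minimal NumPy sketch of batch gradient descent and SGD for the MSE cost above, using the gradient $-\frac{1}{N}\v X^\top\v e$; the fixed step sizes and iteration counts are arbitrary illustrative choices, not prescribed by the course.

```python
import numpy as np

def grad_mse(w, X, y):
    """Full gradient of L(w) = 1/(2N) ||y - Xw||^2, i.e. -X^T e / N."""
    e = y - X @ w
    return -X.T @ e / len(y)

def gradient_descent(X, y, gamma=0.1, iters=500):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w -= gamma * grad_mse(w, X, y)        # batch update, O(ND) per step
    return w

def sgd(X, y, gamma=0.01, iters=5000, rng=np.random.default_rng(0)):
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        n = rng.integers(len(y))              # one sample: unbiased estimate of the full gradient
        e_n = y[n] - X[n] @ w
        w -= gamma * (-X[n] * e_n)            # O(D) per step
    return w
```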

Least squares

  • normal equations : $\nabla\L(\v w^*)=0$
  • MSE : $\nabla\L(\v w)=-\frac{1}{N}\v X^\top\v e=0$ give $\v w^*=(\v X^\top\v X)^{-1}\v X^\top\v y$
    • complexity : $O(ND^2+D^3)$
  • Gram matrix : $\v X^\top\v X$ invertible iff $rank(\v X)=D$ (full column rank)
  • rank deficiency
    • $D>N$ : always $rank(\v X)<D$
    • ill-conditioned : $D\le N$ but some columns collinear
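
A minimal sketch of solving the normal equations with NumPy; solving the linear system (rather than forming the explicit inverse) is an assumption of standard numerical practice, not something the course prescribes.

```python
import numpy as np

def least_squares(X, y):
    """Solve the normal equations X^T X w = X^T y.
    Fails or becomes unstable when the Gram matrix is rank deficient / ill-conditioned."""
    gram = X.T @ X                    # D x D Gram matrix
    return np.linalg.solve(gram, X.T @ y)
```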

Maximum likelihood

  • gaussian
    • random variable : $\N(y\mid\mu,\sigma^2)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp{\left[-\frac{(y-\mu)^2}{2\sigma^2}\right]}$
    • random vector : $\N(\v y\mid\v \mu,\v \Sigma)=\frac{1}{\sqrt{(2\pi)^D\det(\Sigma)}}\exp{\left[-\frac{1}{2}(\v y-\v \mu)^\top\Sigma^{-1}(\v y-\v\mu)\right]}$
  • probabilistic model : $y_n=\v x_n^\top\v w+\epsilon_n$ with $\epsilon_n\sim\N(0,\sigma^2)$
  • log-likelihood : $\L_{lik}(\v w)=\log p(\v y\mid\v X,\v w)=-\frac{1}{2\sigma^2}\sum_{n=1}^N (y_n-\v x_n^\top\v w)^2+ cnst\approx E[\log p(y\mid\v x,\v w)]$
  • maximum likelihood estimator : $\arg\min_{\v w}\L_{mse}(\v w)=\arg\max_{\v w}\L_{lik}(\v w)$
    • expected log-likelihood : $\L_{lik}(\v w)\approx E_{p(y, x)}[\log p(y\mid \v x,\v w)]$
    • $\v w_{mle}\to\v w_{true}$ in probability
    • asymptotically normal : $(\v w_{mle}-\v w_{true})\to\frac{1}{\sqrt{N}}\N(\v w_{mle}\mid \v 0,\v F^{-1}(\v w_{true}))$ with Fisher information $\v F(\v w)=-E_{p(\v Y)}[\frac{\partial^2\L}{\partial \v w\partial \v w^\top}]$
    • $\v w_{lse}=\arg\max_{\v w}\log\left[\Pi_{n=1}^N\N(y_n\mid\v x_n^\top\v w,\sigma^2)\right]$
  • laplace : $p(y_n\mid\v x_n,\v w)=\frac{1}{2b}\exp\left(-\frac{1}{b}\abs{y_n-\v x_n^\top\v w}\right)$, for MAE

Overfitting

  • underfit : too limited, not fitting the signal well
  • overfit : too powerful, fitting noise in addition to the signal
  • get more data : might reduce overfitting
  • Occam's razor : "Plurality is not to be posited without necessity"
  • regularization : add term that penalizes complex models, $\min_{\v w}\L(\v w)+\Omega(\v w)$

Ridge regression and Lasso

  • basis functions : nonlinear transformations applied to input variable, $\phi_j(\v x)$, $M$ basis
  • linear basis function models : $y_n=w_0+\sum_{j=1}^M w_j\phi_j(\v x_n)=\v\phi(\v x_n)^\top\v w$ with $\v\phi(\v x_n)^\top=[1,\phi_1(\v x_n),\cdots,\phi_M(\v x_n)]$
    • linear : in $\v w$
    • nonlinear : in $\v x$
    • dimensions of $\v w$ : $M+1\not = D$
    • replace $\v X$ by $\v \Phi=\begin{bmatrix}1&\phi_1(\v x_1)&\cdots&\phi_M(\v x_1)\\ \vdots&\vdots&\ddots&\vdots\\ 1&\phi_1(\v x_N)&\cdots & \phi_M(\v x_N)\end{bmatrix}$, giving $\v y\approx\v\Phi \v w$
  • $L_2$-regularization : $\Omega(\v w)=\lambda\norm{\v w}_2^2$ with $\norm{\v w}_2^2=\sum_{j=0}^M w_j^2$ and $\lambda$ regularization parameter
    • enforce : model to be simpler
    • ridge regression : $L_2$-regularization with MSE
      • explicit solution : $\v w_{ridge}^*=(\v\Phi^\top\v\Phi+2N\lambda\v I_M)^{-1}\v\Phi^\top\v y$
    • fight ill-conditioning : eigenvalues are at least $2N\lambda$, improve condition number of Gram matrix
  • $L_1$-regularization : $\Omega(\v w)=\lambda\norm{\v w}_1$ with $\norm{\v w}_1=\sum_{j=0}^M \abs{w_j}$
    • enforce : sparsity, some elements of $\v w$ to be $0$
  • lasso : $L_1$ with MSE
  • other regularizations
    • shrinkage
    • dropout
    • weight decay
    • early stopping
  • maximum-a-posteriori : $\v w_{map}=\arg\max_{\v w}p(\v y\mid\v w, \v X, \v\lambda)\cdot p(\v w\mid\v\theta)$, maximize both likelihood and prior
    • prior : $p(\v w\mid\v \lambda)$
    • ridge regression : $\v w_{ridge}=\arg\max_{\v w}\log\left[\Pi_{n=1}^N \N(y_n\mid\v x_n^\top\v w,\sigma^2)\cdot\N(\v w\mid 0,\frac{1}{\lambda}\v I)\right]$
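
A minimal sketch of ridge regression on a polynomial feature expansion, using the explicit solution $(\v\Phi^\top\v\Phi+2N\lambda\v I)^{-1}\v\Phi^\top\v y$ above; the degree, toy data, and $\lambda$ value are illustrative assumptions.

```python
import numpy as np

def poly_basis(x, M):
    """Polynomial basis expansion phi(x) = [1, x, x^2, ..., x^M] for scalar inputs."""
    return np.vander(x, M + 1, increasing=True)

def ridge(Phi, y, lam):
    """Explicit ridge solution (Phi^T Phi + 2 N lam I)^{-1} Phi^T y."""
    N, M1 = Phi.shape
    A = Phi.T @ Phi + 2 * N * lam * np.eye(M1)
    return np.linalg.solve(A, Phi.T @ y)

x = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.default_rng(0).normal(size=20)
w = ridge(poly_basis(x, 9), y, lam=1e-3)   # high-degree fit kept in check by lambda
```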

Model selection and cross-validation

  • underlying distribution : $\mathcal D=\mathcal Y\times\mathcal X$, unknown
  • independent samples : $S=\{(y_n,\v x_n)\}^N_{n=1}$
  • learning algorithm : $\mathcal A(S)$
  • prediction function : $f_S=\mathcal A(S)$
  • training set : $S_t$, 80%
  • validation set : $S_v$, 20%
  • generalization error : $\L_D(f)=E[l(y,f(\v x))]$ with $l$ loss function, true/expected risk/loss
  • training error : $L_S(f_S)=\frac{1}{N}\sum_{n=1}^N l(y_n,f(\v x_n))$, empirical
    • issue as a function of itself : $L_S(f_S)=\frac{1}{N}\sum_{n=1}^N l(y_n,f_S(\v x_n))$
    • validation : $L_{S_v}(f_{S_t})=\frac{1}{\abs{S_v}}\sum_{(y_n,\v x_n)\in S_v}l(y_n,f_{S_t}(\v x_n))$ with $L_D(f_{S_t})=E_{S_v\sim D}[L_{S_v}(f_{S_t})]$
    • finite size variation : $P\left\{\max_k\abs{L_D(f_k)-L_{S_v}(f_k)}\ge\sqrt{\frac{\ln(2K/\delta)}{2\abs{S_v}}}\right\}\le\delta$ for $K$ hyper-parameters
  • Chernoff bound : $\Phi_i$ iid RV with range $[0,1]$, $P\left\{\abs{\frac{1}{N}\sum_{n=1}^N\Phi_n-E[\Phi]}>\epsilon\right\}\le 2e^{-2N\epsilon^2}$
  • $K$-fold cross-validation : train on $K-1$, test on remaining, permute and average results
  • Markov inequality : $P(X\ge a)\le\frac{E(X)}{a}$
  • Hoeffding lemma : zero mean RV $X\in[a,b]$ satisfies $E[e^{sX}]\le e^{s^2(b-a)^2/8}$
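
A minimal sketch of $K$-fold cross-validation as described above; `fit` is assumed to return a prediction function and `loss` to compare predictions with labels, both supplied by the caller (these names are illustrative, not from the course).

```python
import numpy as np

def kfold_cv(X, y, K, fit, loss, rng=np.random.default_rng(0)):
    """Train on K-1 folds, evaluate on the held-out fold, average the K results."""
    idx = rng.permutation(len(y))
    folds = np.array_split(idx, K)
    errs = []
    for k in range(K):
        val = folds[k]
        tr = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[tr], y[tr])              # model is a prediction function
        errs.append(loss(y[val], model(X[val])))
    return np.mean(errs)
```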

Bias-variance decomposition

  • simple model : high bias (bad fit), small variance
  • complex model : small bias, high variance
  • data generation model : $y=f(\v x)+\epsilon$ with zero mean noise $\epsilon$
  • decomposition : $E_{S_t\sim D,noise}[(f(\v x_0)+\epsilon-f_{S_t}(\v x_0))^2]=var[\epsilon]+(f(\v x_0)-E_{S_t\sim D}[f_{S_t}(\v x_0)])^2+E_{S_t\sim D}[(E_{S_t'\sim D}[f_{S_t'}(\v x_0)]-f_{S_t}(\v x_0))^2]=noise + bias^2 +variance$, all terms non-negative

Classification

  • classification : relates input $\v x$ to output $y$ but $y\in\{C_1,C_2\}$ restricted to discrete values
  • class labels : $C_i$
  • multi-class classification : $y\in\{C_0,\ldots,C_{K-1}\}$
  • decision boundaries : decided by a classifier
  • MSE : counts positive and negative errors as equally bad (whereas only one side is wrong), small MSE $\Rightarrow$ small classification error (the converse is not true)
  • nearest neighbors : euclidean distance, useless in high dimension
  • $k$ nearest neighbors
  • smoothing kernels : weighted linear combination of elements in the neighborhood
  • linear decision boundaries : hyperplanes
  • support vector machines : take hyperplane with highest robustness to wiggling
  • kernel trick : augment feature vector with some non-linear functions
  • neural networks : learn an appropriate non-linear transform of the input so that it becomes linearly separable
  • optimal classification : for known generating model
    • known joint distribution : $p(\v x, y)$
    • optimal choice for class label : $\hat y(\v x)$
    • probability of misclassification : $P\{\hat y(\v x)\not=y\}=\int p(\v x)\sum_{y\in\mathcal Y}p(y\mid\v x)\v 1_{\hat y(\v x)\not= y}d\v x$
    • strategy : minimized by $\hat y(\v x)=\arg\max_{y\in\mathcal Y}p(y\mid \v x)$

Logistic regression

  • logistic function : $\sigma(x)=\frac{e^x}{1+e^x}$
  • binary posterior probability
    • $p(C_1\mid\v x)=\sigma(\v x^\top \v w)$
    • $p(C_2\mid\v x)=1-\sigma(\v x^\top \v w)$
  • training : maximize $p(\v y\mid\v X,\v w)=\Pi_{n:y_n=C_1}p(y_n=C_1\mid\v x_n)\Pi_{n:y_n=C_2}p(y_n=C_2\mid \v x_n)=\Pi_{n=1}^N\sigma(\xtw)^{y_n}[1-\sigma(\xtw)]^{1-y_n}$
  • cost function : minimize $\L(\v w)=-\sum_{n=1}^N\left[y_n\log\sigma(\v x_n^\top\v w)+(1-y_n)\log[1-\sigma(\v x_n^\top\v w)]\right]=\sum_{n=1}^N\log[1+\exp(\v x_n^\top\v w)]-y_n \v x_n^\top\v w$
  • gradient : $\nabla\L(\v w)=\sum_{n=1}^N\v x_n(\sigma(\v x_n^\top\v w)-y_n)=\v X^\top[\sigma(\v X\v w)-\v y]$, convex
  • gradient descent : $\v w^{(t+1)}=\v w^{(t)}-\gamma^{(t)}\nabla\L(\v w^{(t)})$
  • Hessian : $\v H_{i,j}=\frac{\partial^2\L(\v w)}{\partial\v w_i\partial \v w_j}$, $D\times D$, symmetric, convex if positive semi-definite
  • Taylor series : $\sum_{n=0}^\infty \frac{f^{(n)}(a)}{n!}(x-a)^n$
    • second order : $\L(\v w)\approx\L(\v w^*)+\nabla\L(\v w^*)^\top(\v w-\v w^*)+\frac{1}{2}(\v w-\v w^*)^\top\v H(\v w^*)(\v w-\v w^*)$
  • Newton's method : $\v w^{(t+1)}=\v w^{(t)}-\gamma^{(t)}(\v H^{(t)})^{-1}\nabla\L(\v w^{(t)})$
    • Hessian : $\v H(\v w)=\v X^\top\v S\v X$ with $\v S$ $N\times N$ with diagonal entries $S_{nn}=\sigma(\v x_n^\top\v w)[1-\sigma(\v x_n^\top\v w)]$
  • linearly separable issue : no finite weight vector (infinite)
  • penalized logistic regression : $\arg\min_{\v w}-\sum_{n=1}^N\log p(y_n\mid\v x_n,\v w)+\lambda\sum_{d=1}^D w_d^2$
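
A minimal sketch of Newton's method for logistic regression with labels in $\{0,1\}$, using the gradient and Hessian given above; the small ridge term added to the Hessian is only a numerical guard against the linearly separable case, not part of the course formulas.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_newton(X, y, iters=20, gamma=1.0):
    """Newton steps: gradient X^T (sigma(Xw) - y), Hessian X^T S X
    with S_nn = sigma(x_n^T w) (1 - sigma(x_n^T w))."""
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        p = sigmoid(X @ w)
        grad = X.T @ (p - y)
        S = p * (1 - p)                                    # diagonal of S
        H = X.T @ (S[:, None] * X) + 1e-8 * np.eye(X.shape[1])  # tiny ridge for stability
        w -= gamma * np.linalg.solve(H, grad)
    return w
```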

Generalized linear models

  • shorthand : $\v\eta=\xtw$
  • exponential family : $p(y\mid\v\eta)=h(y)\exp[\v \eta^\top\v\varphi(y)-A(\v\eta)]$
    • 2 degrees of freedom : $h(y)$, $\v\varphi(y)$
    • normalization term : $A(\v\eta)=\log[\int_y h(y)\exp[\v\eta^\top\v\varphi(y)]dy]<\infty$, cumulant, log partition
    • sufficient statistics : $\v\varphi(y)$
    • logistic : $p(y\mid\v\eta)=\frac{e^{\v\eta y}}{1+e^{\v\eta}}=\exp[\v\eta y -\log(1+e^{\v \eta})]$
    • Bernoulli : $p(y\mid\mu)=\mu^y(1-\mu)^{1-y}=\exp[(\log\frac{\mu}{1-\mu})y+\log(1-\mu)]$
    • Gaussian : $p(y\mid\mu)=\frac{1}{\sqrt{2\pi\sigma^2}}\exp\left[-\frac{(y-\mu)^2}{2\sigma^2}\right]=\exp[(\mu/\sigma^2, -1/(2\sigma^2))(y, y^2)^\top-\frac{\mu^2}{2\sigma^2}-\frac{1}{2}\log(2\pi\sigma^2)]$
    • properties
      • convex cumulant : as function of $\v\eta$
      • derivative and moments (multiparameters) : $\nabla A(\v\eta)=E[\v\varphi(y)]$
      • Hessian : $\nabla^2 A(\v\eta)=var[\v\varphi(y)]$
  • link function : $\v\eta=g(\v\mu)$ with $\mu=E[\v\varphi(y)]$ when scalar
  • generalized linear models : $p(y\mid\v x,\v w)=h(y)\exp[\v x^\top\v w\varphi(y)-A(\v x^\top\v w)]$ from the exponential family have an easy-to-solve maximum likelihood
    • assume samples obey : $p(y_n\mid\v x_n,\v w)=h(y_n)e^{\eta_n\varphi(y_n)-A(\eta_n)}$
    • cost function : $\L(\v w)=-\sum_{n=1}^N\left[\log h(y_n)+\v x_n^\top\v w\,\varphi(y_n)-A(\v x_n^\top\v w)\right]$
    • gradient : $\nabla_{\v w}\L(\v w)=-\sum_{n=1}^N\v x_n\left[\varphi(y_n)-g^{-1}(\v x_n^\top\v w)\right]=\v X^\top[g^{-1}(\v X\v w)-\varphi(\v y)]$ with $\frac{d A(\eta)}{d\eta}=g^{-1}(\eta)$

Curse of dimensionality and $k$-NN

  • Bayes classifier : MAP classifier
  • $k$-nearest neighbor : $f_{S_t,k}(\v x)=\frac{1}{k}\sum_{n:\v x_n\in nbh_{S_t,k}(\v x)} y_n$ with $nbh_{S_t,k}(\v x)$ set of $k$ input point in $S_t$ the closest to $\v x$
    • variant : weight individual by their distances
    • classification : $f_{S_t,k}(\v x)=$ majority element of $\{y_n\mid \v x_n\in nbh_{S_t,k}(\v x)\}$ (for binary, $k$ should be odd)
    • $E_{S_t}[\L(f_{S_t})]\le 2\L(f_\star)+c E_{S_t,\v x\sim D}[\norm{\v x - nbh_{S_t, 1}(\v x)}]\le 2\L(f_\star)+4c\sqrt{d}N^{-1/(d+1)}$ with $\L(f_\star)$ the risk (classification error) of the optimal Bayes classifier
  • curse of dimensionality : intuitions fail in high dimensions
    • generalizing correctly becomes exponentially harder as the dimensionality grows because fixed-size training sets cover a dwindling fraction of the input space
    • in high-dimension, data-points are far from each other : choice of nearest neighbor becomes effectively random
  • Lipschitz continuity : $\abs{\eta(\v x)-\eta(\v x')}\le c\norm{\v x -\v x'}$ with $c$ constant and $\eta(\v x)=P(y=1\mid\v x)$
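
A minimal sketch of $k$-NN classification by majority vote over Euclidean neighbors; brute-force distance computation is assumed, which is exactly what becomes uninformative in high dimension.

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training points closest to x (Euclidean distance)."""
    d = np.linalg.norm(X_train - x, axis=1)
    nearest = np.argsort(d)[:k]               # indices of the k nearest neighbors
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```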

Support vector machines

  • support vector machine : change cost from Logistic to Hinge
    • cost : $\min_{\v w}\sum_{n=1}^N [1-y_n\v x_n^\top\v w]_++\frac{\lambda}{2}\norm{\v w}^2$
    • hinge loss : $[z]_+=\max\{0,z\}$
    • no obvious probabilistic interpretation
    • extension to multi-class non-trivial
  • labels
    • logistic : $y_n\in\{0,1\}$
    • svm : $y_n\in\{\pm 1\}$
      • hinge : $Hinge(f)=[1-yf]_+$
      • mse : $MSE(f)=(1-yf)^2$
      • logistic : $logisticLoss(f)=\log(1+e^{-yf})$
  • duality : difficult optimization, define $\L(\v w)=\max_{\v\alpha} G(\v w, \v \alpha)$ so choose between $\min_{\v w}\max_{\v\alpha}G(\v w,\v \alpha)=\max_{\v\alpha}\min_{\v w}G(\v w,\v\alpha)$
    • hinge : $[\nu_n]_+=\max\{0,\nu_n\}=\max_{\alpha_n\in[0,1]}\alpha_n\nu_n;\forall n$
      • rewrite svm $\min_{\v w}\max_{\v\alpha\in [0,1]^N}\sum_{n=1}^N \alpha_n(1-y_n\v x_n^\top\v w)+\frac{\lambda}{2}\norm{\v w}^2$
      • convex in $\v w$
      • concave in $\v\alpha$
      • gradient : $\nabla_{\v w}G(\v w,\v\alpha)=-\sum_{n=1}^N\alpha_ny_n\v x_n+\lambda\v w$
      • first-order optimality condition for $\v w$ : $\v w(\v\alpha)=\frac{1}{\lambda}\sum_{n=1}^N\alpha_n y_n\v x_n=\frac{1}{\lambda}\v X\v Y\v \alpha$ with $\v Y=diag(\v y)$ and $\v X$ $N$ data examples as columns
      • dual optimization of svm : $\max_{\v\alpha\in[0,1]^N}\sum_{n=1}^N\alpha_n(1-\frac{1}{\lambda} y_n\v x_n^\top\v X\v Y\v \alpha)+\frac{\lambda}{2}\norm{\frac{1}{\lambda}\v X\v Y\v\alpha}^2=\max_{\v\alpha\in[0,1]^N}\v\alpha^\top\v 1-\frac{1}{2\lambda}\v\alpha^\top\v Y\v X^\top\v X\v Y\v\alpha$
      • differentiable : easy with coordinate descent $\max_{\v\alpha\in[0,1]^N}\v\alpha^\top\v 1-\frac{1}{2\lambda}\v\alpha^\top\v Q\v\alpha$ with kernel $\v Q=diag(\v y)\v X^\top\v X\, diag(\v y)$
      • solution $\v\alpha$ : generally sparse, non-zero only for examples important for boundaries
      • data vectors
        • non-support vectors : correct side, outside margin, $\alpha_n=0$
        • essential support vectors : lie on margin $\alpha_n\in(0,1)$
        • bound support vectors : strictly inside margin or wrong side $\alpha_n=1$
    • minmax theorem : okay to switch min and max when $G$ is convex in one argument and concave in the other (convex conjugate)
  • coordinate descent : update one coordinate at a time, keeping others fixed $\alpha_i\in\v\alpha$
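
A minimal sketch of coordinate ascent on the SVM dual above; here the data matrix is stored with examples as rows (the transpose of the convention used in the derivation), and each coordinate update is the exact maximizer of the quadratic in $\alpha_n$, clipped to $[0,1]$. Hyper-parameters are illustrative.

```python
import numpy as np

def svm_dual_coordinate_ascent(X, y, lam=1.0, epochs=50, rng=np.random.default_rng(0)):
    """Coordinate ascent on  max_{alpha in [0,1]^N}  alpha^T 1 - 1/(2 lam) alpha^T Q alpha,
    with Q_nm = y_n y_m x_n^T x_m (examples stored as rows of X)."""
    N = X.shape[0]
    Yx = y[:, None] * X
    Q = Yx @ Yx.T
    alpha = np.zeros(N)
    for _ in range(epochs):
        for n in rng.permutation(N):
            if Q[n, n] == 0:
                continue
            grad_n = 1.0 - Q[n] @ alpha / lam            # partial derivative w.r.t. alpha_n
            alpha[n] = np.clip(alpha[n] + lam * grad_n / Q[n, n], 0.0, 1.0)
    w = X.T @ (alpha * y) / lam                          # primal w(alpha) = (1/lam) sum alpha_n y_n x_n
    return w, alpha
```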

Kernel ridge regression

  • representer theorem : for a $\v w^\star$ minimizing $\min_{\v w}\sum_{n=1}^N L_n(y_n,\v x_n^\top\v w)+\frac{\lambda}{2}\norm{\v w}^2$ there exists $\v\alpha^\star$ st. $\v w^\star =\v X\v\alpha^\star$
  • trick : $\v w^\star=(\v X\v X^\top+\lambda \v I_D)^{-1}\v X\v y=\v X(\v X^\top \v X+\lambda\v I_N)^{-1}\v y=\v X\v\alpha^\star$
    • complexity before : $O(D^3+ND^2)$
    • complexity after : $O(N^3+DN^2)$
    • $\v w^\star=\v X\v\alpha^\star$ where $\v X=\begin{bmatrix}x_{11}&\cdots& x_{1N}\\ \vdots &\ddots&\vdots\\ x_{D1}&\cdots& x_{DN}\end{bmatrix}$
  • kernel ridge regression : $\v\alpha^\star=\arg\max_{\v\alpha}-\frac{1}{2}\v\alpha^\top(\v X^\top\v X+\lambda\v I_N)\v\alpha+\lambda\v\alpha^\top\v y$
  • kernel function : $\kappa(\v x,\v x')=\phi(\v x)^\top\phi(\v x')$, faster than using $\phi$
    • linear : $\v K=\v X^\top\v X$, i.e. $K_{ij}=\v x_i^\top \v x_j$
    • basis : $\v K=\v\Phi^\top\v\Phi$, i.e. $K_{ij}=\phi(\v x_i)^\top\phi(\v x_j)$
    • radial basis function : $\kappa(\v x,\v x')=\exp[-\frac{1}{2}(\v x-\v x')^\top(\v x-\v x')]$
    • properties : must be an inner-product in some feature space, ensured by
      • $\v K$ symmetric : $\kappa(\v x,\v x')=\kappa(\v x',\v x)$
      • $\v K$ positive semi-definite : for any input set $\{\v x_n\}$
    • can lead to infinite-dimensional feature spaces
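
A minimal sketch of kernel ridge regression with the RBF kernel above: fit $\v\alpha^\star=(\v K+\lambda\v I_N)^{-1}\v y$ as in the trick above, predict using kernel evaluations only; the bandwidth and $\lambda$ values are illustrative.

```python
import numpy as np

def rbf_kernel(A, B, gamma=0.5):
    """RBF kernel matrix K_ij = exp(-gamma * ||a_i - b_j||^2)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return np.exp(-gamma * sq)

def kernel_ridge_fit(X, y, lam=1e-2, gamma=0.5):
    """alpha = (K + lam I_N)^{-1} y, working entirely with the N x N kernel matrix."""
    K = rbf_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * np.eye(len(y)), y)

def kernel_ridge_predict(X_train, alpha, X_new, gamma=0.5):
    """f(x) = sum_n alpha_n kappa(x_n, x)."""
    return rbf_kernel(X_new, X_train, gamma) @ alpha
```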

Unsupervised learning

  • unsupervised learning
    • density estimation
    • feature learning / representation learning

K-means clustering

  • cluster : groups of points whose inter-point distances are small
  • prototypes : $\v \mu_i$ of cluster assignment $z_n\in\{1,\ldots,K\}$
  • k-means : $\min_{\v z,\v\mu}\L(\v z,\v\mu)=\sum_{n=1}^N\sum_{k=1}^Kz_{nk}\norm{\v x_n-\v\mu_k}_2^2=\norm{\v X-\v M\v Z^\top}_{Frob}^2$ with $\v\mu_k\in\R^D$, $z_{nk}\in\{0,1\}$ and $\sum_{k=1}^K z_{nk}=1$
    • issues : computationally heavy, clusters forced to be spherical, hard cluster assignments
    • complexity : $O(nkdi)$ with $i$ iteration
  • Frobenius norm : $\norm{A}_F=\sqrt{\sum_i\sum_j \abs{a_{ij}}^2}$
  • algorithm : coordinate descent
    • initialize $\v\mu_k$
    • compute $\v z_n$ given $\v\mu$ : $z_{nk}=1$ if $k=\arg\min_j\norm{\v x_n-\v\mu_j}_2^2$
    • compute $\v\mu_k$ given $\v z$ : $\v\mu_k=\frac{\sum_{n=1}^N z_{nk}\v x_n}{\sum_{n=1}^N z_{nk}}$
    • iterate
    • convergence : assured, but only to a local optimum
  • probabilistic model : $p(\v x_n\mid\v\mu, \v z)=\Pi_{n}N(\v x_n\mid\v\mu_{k,\st z_{nk}=1},\v I)=\Pi_n\Pi_k \N(\v x_n\mid\v\mu_k,\v I)^{z_{nk}}$, isotropic variance
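
A minimal sketch of the coordinate-descent algorithm above (assignment step, then mean update); initializing the prototypes by sampling $K$ data points is one common choice, not the only one.

```python
import numpy as np

def kmeans(X, K, iters=100, rng=np.random.default_rng(0)):
    """Coordinate descent: alternate cluster assignments z and means mu."""
    mu = X[rng.choice(len(X), K, replace=False)]          # initialize prototypes
    for _ in range(iters):
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        z = np.argmin(d, axis=1)                          # assignment step
        mu = np.array([X[z == k].mean(axis=0) if np.any(z == k) else mu[k]
                       for k in range(K)])                # mean-update step
    return z, mu
```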

Gaussian mixture models

  • elliptical cluster : $p(\v X\mid\v\mu, \v z)=\Pi_n\Pi_k N(\v x_n\mid\v\mu_k,\v \Sigma_k)^{z_{nk}}$
  • soft-clustering : $z_{nk}$ multinomial with $p(z_n=k)=\pi_k$ with $\sum_k\pi_k=1$
  • gaussian mixture model : $p(\v X,\v z\mid\v\mu,\v\Sigma,\v\pi)=\Pi_{n=1}^N p(\v x_n\mid z_n,\v\mu,\v\Sigma)p(z_n\mid\v\pi)=\Pi_{n=1}^N\Pi_{k=1}^K\N(\v x_n\mid\v\mu_k,\v\Sigma_k)^{z_{nk}}\Pi_{k=1}^K\pi_k^{z_{nk}}$
    • observed data vectors : $\v x_n$
    • latent unobserved variables : $z_n$
    • unknown parameters : $\v\theta=\{\v\mu_1,\ldots,\v\mu_K,\v\Sigma_1,\ldots,\v\Sigma_K,\v\pi\}$
    • advantage : $z_n$ can be marginalized out of the cost functions as if not existing
    • marginalization : $p(\v x_n\mid\v\theta)=\sum_{k=1}^K\pi_k\N(\v x_n\mid\v\mu_k,\v\Sigma_k)$
      • complexity : $O(D^2K)$, before depend on $O(N)$
      • maximum likelihood : $\max_{\v\theta}\sum_{n=1}^N\log\sum_{k=1}^K\pi_k N(\v x_n\mid\v\mu_k,\v\Sigma_k)$, not convex, degenerate, not identifiable (multiple combination)

Expectation-maximization algorithm

  • expectation-maximization : elegant method to optimize the sum-inside-the-log problem above
  • algorithm
    • start with $\v\theta^{(1)}$
    • expectation step : compute lower bound to cost st. tight at the previous $\v\theta^{(t)}$ $\L(\v\theta)\ge\underline\L(\v\theta,\v\theta^{(t)})$ and $\L(\v\theta^{(t)})=\underline\L(\v\theta^{(t)},\v\theta^{(t)})$
      • per-sample bound : $\log\sum_k\pi_k\N(\v x_n\mid\v\mu_k,\v\Sigma_k)\ge\sum_k q_{kn}^{(t)}\log\frac{\pi_k\N(\v x_n\mid\v\mu_k,\v\Sigma_k)}{q_{kn}^{(t)}}$
      • equality : $q_{kn}^{(t)}=\frac{\pi_k^{(t)}\N(\v x_n\mid\v\mu_k^{(t)},\v\Sigma_k^{(t)})}{\sum_k \pi_k^{(t)}\N(\v x_n\mid\v\mu_k^{(t)},\v\Sigma_k^{(t)})}$
      • compute : $\L(\v\theta^{(t)})=\sum_n\log\sum_k\pi_k^{(t)}\N(\v x_n\mid\v\mu_k^{(t)},\v\Sigma_k^{(t)})$
    • maximization step : update $\v\theta$, $\v\theta^{(t+1)}=\arg\max_{\v\theta}\underline\L(\v\theta,\v\theta^{(t)})$
      • maximize lower bound : $\max_{\v\theta}\sum_n\sum_k q_{kn}^{(t)}[\log\pi_k+\log\N(\v x_n\mid\v\mu_k,\v\Sigma_k)]$ for fixed $q_{kn}^{(t)}$
      • $\mu$ update : $\v\mu_k^{(t+1)}=\frac{\sum_n q_{kn}^{(t)}\v x_n}{\sum_n q_{kn}^{(t)}}$
      • $\Sigma$ update : $\v\Sigma_k^{(t+1)}=\frac{\sum_n q_{kn}^{(t)}(\v x_n-\v\mu_k^{(t+1)})(\v x_n-\v\mu_k^{(t+1)})^\top}{\sum_n q_{kn}^{(t)}}$
      • $\pi$ update : $\pi_k^{(t+1)}=\frac{1}{N}\sum_n q_{kn}^{(t)}$, need to add a Lagrangian term
    • iterate
    • same as k-means : $\v\Sigma_k=\sigma^2\v I$
  • posterior distribution : $p(\v x_n,z_n\mid\v\theta)=p(\v x_n\mid z_n,\v\theta)p(z_n\mid\v\theta)=p(z_n\mid\v x_n,\v\theta)p(\v x_n\mid\v\theta)$
  • EM in general : $\v\theta^{(t+1)}=\arg\max_{\v\theta}\sum_n E_{p(z_n\mid\v x_n,\v\theta^{(t)})}[\log p(\v x_n, z_n\mid\v\theta)]$ with joint distribution $p(\v x_n,z_n\mid\v\theta)$
  • log concavity : $\log \sum_k q_kt_k\ge\sum_k q_k\log t_k$ when $q_k,t_k$ non-negative and $\sum_k q_k=1$
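
A minimal sketch of EM for a Gaussian mixture following the updates above; it leans on `scipy.stats.multivariate_normal` for the Gaussian densities, and the small diagonal term added to each covariance is only a numerical guard against degenerate clusters, not part of the course updates.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, iters=50, rng=np.random.default_rng(0)):
    """E-step: responsibilities q_kn; M-step: closed-form updates for (pi, mu, Sigma)."""
    N, D = X.shape
    pi = np.full(K, 1.0 / K)
    mu = X[rng.choice(N, K, replace=False)]
    Sigma = np.array([np.eye(D)] * K)
    for _ in range(iters):
        # E-step: q[n, k] proportional to pi_k N(x_n | mu_k, Sigma_k)
        q = np.stack([pi[k] * multivariate_normal.pdf(X, mu[k], Sigma[k])
                      for k in range(K)], axis=1)
        q /= q.sum(axis=1, keepdims=True)
        # M-step
        Nk = q.sum(axis=0)
        pi = Nk / N
        mu = (q.T @ X) / Nk[:, None]
        for k in range(K):
            Xc = X - mu[k]
            Sigma[k] = (q[:, k, None] * Xc).T @ Xc / Nk[k] + 1e-6 * np.eye(D)
    return pi, mu, Sigma, q
```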

Matrix factorization

  • matrix : $\v X=D\times N$ with rows indexed by movie $d$ and columns by user $n$
  • factorization : $\v X\approx\v W\v Z^\top$ with $\v W\in\R^{D\times K}$ and $\v Z\in\R^{N\times K}$
  • learning : $\min_{\v W,\v Z}\L(\v W,\v Z)=\frac{1}{2}\sum_{(d,n)\in\Omega}[x_{dn}-(\v W\v Z^\top)_{dn}]^2$ with $\Omega$ set of observed input, not convex, not identifiable
  • latent feature : $K$
  • regularization : $\frac{1}{2}\sum_{(d,n)\in\Omega}[x_{dn}-(\v W\v Z^\top)_{dn}]^2+\frac{\lambda_w}{2}\norm{\v W}_F^2+\frac{\lambda_z}{2}\norm{\v Z}_F^2$
  • stochastic gradient descent
    • objective : $\frac{1}{\abs{\Omega}}\sum_{(d,n)\in\Omega}[x_{dn}-(\v W\v Z^\top)_{dn}]^2$
    • gradient w.r.t. $w_{d',k}$ : $-[x_{dn}-(\v W\v Z^\top)_{dn}]z_{n,k}$ if $d'=d$, else $0$
    • gradient w.r.t. $z_{n',k}$ : $-[x_{dn}-(\v W\v Z^\top)_{dn}]w_{d,k}$ if $n'=n$, else $0$
    • complexity : $O((D+N)K)$
  • alternating least-squares : coordinate descent $\frac{1}{2}\sum_d\sum_n[x_{dn}-(\v W\v Z^\top)_{dn}]^2=\frac{1}{2}\norm{\v X-\v W\v Z^\top}_F^2$ if no missing entries
    • $\v Z^\top=(\v W^\top\v W+\lambda_z\v I_k)^{-1}\v W^\top\v X$
    • $\v W^\top=(\v Z^\top\v Z+\lambda_w\v I_k)^{-1}\v Z^\top\v X^\top$
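
A minimal sketch of alternating least squares for $\v X\approx\v W\v Z^\top$ with no missing entries, using the two closed-form updates above; the rank, regularizers, and iteration count are illustrative choices.

```python
import numpy as np

def als(X, K, lam_w=0.1, lam_z=0.1, iters=20, rng=np.random.default_rng(0)):
    """Alternating least squares for X ~ W Z^T (assumes no missing entries)."""
    D, N = X.shape
    W = rng.normal(size=(D, K))
    Z = rng.normal(size=(N, K))
    for _ in range(iters):
        # fix W, solve for Z:  Z^T = (W^T W + lam_z I)^{-1} W^T X
        Z = np.linalg.solve(W.T @ W + lam_z * np.eye(K), W.T @ X).T
        # fix Z, solve for W:  W^T = (Z^T Z + lam_w I)^{-1} Z^T X^T
        W = np.linalg.solve(Z.T @ Z + lam_w * np.eye(K), Z.T @ X.T).T
    return W, Z
```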

Text representation learning

  • embedding : numerical representation for words $w_i\mapsto \v w_i\in\R^K$
  • co-occurrence matrix : $n_{ij}=$ #contexts where word $w_i$ occurs together with word $w_j$, sparse $D\times N$
    • context : document, paragraph, sentence, window
    • words : $w_d$
    • context words : $w_n$
    • vocabulary : $\mathcal V=\{w_1,\ldots,w_D\}$
  • matrix factorization : $\min_{\v W,\v Z}\L(\v W,\v Z)=\frac{1}{2}\sum_{(d,n)\in\Omega}f_{dn}[x_{dn}-(\v W\v Z^\top)_{dn}]^2$ with weights $f_{dn}$
    • use log of counts $x_{dn}=\log(n_{dn})$
    • GloVe model : $f_{dn}=\min\{1, (n_{dn}/n_{\max})^\alpha\}$ for $\alpha\in[0,1]$
    • SGD or ALS
  • skip-gram model : original word2vec, use binary classification (logistic) to separate good word pairs $(w_d,w_n)$ (appearing together in a context) from bad word pairs (random apparition)
  • fasttext : supervised, $\v W\in\R^{1\times K}$ and $\v Z\in\R^{\abs{\mathcal V}\times K}$
    • bag-of-words representation : $\v x_n\in\R^{\abs{\mathcal V}}$ from sentence $s_n=(w_1,\ldots,w_m)$
    • learning : $\min_{\v W,\v Z}\L(\v W,\v Z)=\sum_{s_n}f(y_n\v W\v Z^\top\v x_n)$ with $f$ the loss of a linear classifier and label $y_n\in\{-1,1\}$

SVD and PCA

  • dimensionality reduction
  • principal component analysis : linear mapping from $D$ to $K$ dimensions of maximum variance (decorrelates the data), bad if features are not normalized
  • unitary matrix : $\v U\v U^\top=\v U^\top\v U=\v I$ with rows or cols orthonormal
    • linear transformation : $\norm{\v U\v x}_2^2=\v x^\top\v U^\top\v U\v x=\v x^\top\v x=\norm{\v x}_2^2$
  • singular value decomposition : $\v X=\v U\v S\v V^\top$
    • $\v X=D\times N$ and $D<N$
    • $\v U=D\times D$ unitary : left singular vectors
    • $\v S=D\times N$ diagonal : singular values, by convention $s_1\ge\cdots\ge s_D\ge 0$
    • $\v V^\top=N\times N$ unitary : right singular vectors
    • learning : minimize $\norm{\v X-\v R\v C\v X}_F^2$
      • compression : $\v C=K\times D$
      • reconstruction : $\v R=D\times K$
  • SVD lemma : $\norm{\v X-\v{\hat{X}}}_F^2\ge\norm{\v X-\v U_K\v U_K^\top\v X}_F^2=\sum_{i\ge K+1}s_i^2$ with $s_i$ singular values of $\v X$ and $\v U_K$ the first $K$ columns of $\v U$, for any $\v{\hat{X}}=D\times N$ of rank $K$
  • identity : $\v U_K\v U_K^\top\v U\v S=\v U\v S^{(K)}$
  • link to matrix factorization : $\v X=\v U\v S\v V^\top=\v U\v S_D^{1/2}S^{1/2}\v V^\top=\v W\v Z^\top$
    • $K$ rank approximation $\v X_K=\v U\v S^{(K)}\v V^\top$
    • $\v X_K=\v U_K\v{\hat{S}}^{(K)}\v V^\top=\v U_K(\v{\hat{S}}^{(K)})_K^{1/2}(\v{\hat{S}}^{(K)})^{1/2}\v V^\top=\v W\v Z^\top$
  • decorrelation : $N\v K=\v X\v X^\top=\v U\v S\v V^\top\v V\v S^\top\v U^\top=\v U\v S\v S^\top\v U^\top=\v U\v S_D^2\v U^\top$ gives uncorrelated features $\v{\hat{X}}=\v U^\top\v X$
    • mean : $\bar{\v{x}}=\frac{1}{N}\sum_n\v x_n$
    • variance : $\v K=\frac{1}{N}\sum_n(\v x_n-\bar{\v{x}})(\v x_n-\bar{\v{x}})^\top$
  • computation : solving eigenvector/value for $\v X\v X^\top$
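
A minimal sketch of PCA via the SVD with the data stored as a $D\times N$ matrix as above: center, keep the first $K$ left singular vectors, project, and reconstruct.

```python
import numpy as np

def pca(X, K):
    """PCA via SVD on centered D x N data."""
    mean = X.mean(axis=1, keepdims=True)
    Xc = X - mean
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    U_K = U[:, :K]                      # principal directions (first K left singular vectors)
    codes = U_K.T @ Xc                  # K x N compressed representation
    X_hat = U_K @ codes + mean          # rank-K reconstruction
    return U_K, codes, X_hat
```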

Neural nets - Basic structures

  • input layer : size $D$
  • hidden layer : $L$ layer of size $K$
  • output layer
  • edge : $w_{i,j}^{(l)}$ from node $i$ in layer $l-1$ to node $j$ in layer $l$
  • node : $x_j^{(l)}=\phi(\sum_i w_{i,j}^{(l)}x_i^{(l-1)}+b_j^{(l)})$ at layer $l$ output of node $j$
  • bias term : $b_j^{(l)}$ parameter
  • activation function : $\phi(\cdot)$, non-linear
    • sigmoid : $\phi(x)=\frac{1}{1+e^{-x}}$

Neural nets - Representation power

  • sigmoid-like : left limit $0$ right limit $1$
  • lemma : a sufficiently smooth function can be approximated by a neural net with one hidden layer, and the approximation error goes down like one over the number of nodes in the hidden layer
  • consider : $f(x)=\phi(w(x-b))$
  • width of the transition : $1/w$
  • approximate rectangle : $\phi(w(x-a))-\phi(w(x-b))$
  • higher dimension : requires an additional hidden layer to filter out the arms extending along the other dimensions
  • rectified linear function : $(x)_+=\max\{0,x\}$
  • classical Stone-Weierstrass theorem : for every $\epsilon>0$ there exists a polynomial $p$ with $\abs{f(x)-p(x)}<\epsilon$ for all $x\in[0,1]$
    • can be approximated in $L_\infty$
    • continuous piecewise-linear function : $q(x)=\sum_{i=1}^m(a_ix+b_i)1_{\{r_{i-1}\le x < r_i\}}$
    • continuity requires : $q(x)=\tilde a_1x+\tilde b_1+\sum_{i=2}^m\tilde a_i(x-\tilde b_i)_+$ with $a_i=\sum_{j=1}^i\tilde a_j$ and $\tilde b_i=r_{i-1}$

Neural nets - Training

  • standard cost function : $\L=\frac{1}{N}\sum_{n=1}^N(y_n-f(\v x_n))^2$
  • SGD : not convex but stable (the outcome barely changes if a single sample is replaced), thus less prone to overfitting
  • backpropagation :
    • weight matrix : $\v W^{(l)}$ that connects layer $l-1$ to $l$, $\v W_{i,j}^{(l)}=w_{i,j}^{(l)}$
    • bias vectors : $\v b^{(l)}$
    • function on each layer : $\v x^{(l)}=f^{(l)}(\v x^{(l-1)})=\phi((\v W^{(l)})^\top\v x^{(l-1)}+\v b^{(l)})$
    • overall function : $f(\v x^{(0)})=f^{(L+1)}\circ\cdots\circ f^{(1)}(\v x_n^{(0)})$
    • cost function : $\L=\frac{1}{N}\sum_{n=1}^N (y_n-f^{(L+1)}\circ\cdots\circ f^{(1)}(\v x_n^{(0)}))^2$ independent of the form of loss (only initialization changes)
    • algorithm
      • start : $\L_n=(y_n-f^{(L+1)}\circ\cdots\circ f^{(1)}(\v x_n^{(0)}))^2$
      • forward pass : $\v z^{(l)}=(\v W^{(l)})^\top\v x^{(l-1)}+\v b^{(l)}$ with $\v x^{(0)}=\v x_n$ and $\v x^{(l-1)}=\phi(\v z^{(l-1)})$
      • corresponding vector at level $l$ : $\delta_j^{(l)}=\frac{\partial\L_n}{\partial z_j^{(l)}}$
      • backward pass : $\delta_j^{(l)}=\frac{\partial\L_n}{\partial z_j^{(l)}}=\sum_k \frac{\partial\L_n}{\partial z_k^{(l+1)}}\frac{\partial z_k^{(l+1)}}{\partial z_j^{(l)}}=\sum_k\delta_k^{(l+1)}\v W_{j,k}^{(l+1)}\phi'(z_j^{(l)})$
        • MSE start with : $\delta^{(L+1)}=-2(y_n-\v x^{(L+1)})\phi'(z^{(L+1)})$
      • vector form : $\delta^{(l)}=(\v W^{(l+1)}\delta^{(l+1)})\odot\phi'(\v z^{(l)})$
      • Hadamard product : $\odot$ pointwise multiplication of vectors
      • weights : $\frac{\partial \L_n}{\partial w_{i,j}^{(l)}}=\sum_k\frac{\partial \L_n}{\partial z_k^{(l)}}\frac{\partial z_k^{(l)}}{\partial w_{i,j}^{(l)}}=\frac{\partial \L_n}{\partial z_j^{(l)}}\frac{\partial z_j^{(l)}}{\partial w_{i,j}^{(l)}}=\delta_j^{(l)}\v x_i^{(l-1)}$
      • bias : $\frac{\partial \L_n}{\partial b_j^{(l)}}=\sum_k\frac{\partial \L_n}{\partial z_k^{(l)}}\frac{\partial z_k^{(l)}}{\partial b_j^{(l)}}=\frac{\partial\L_n}{\partial z_j^{(l)}}\frac{\partial z_j^{(l)}}{\partial b_j^{(l)}}=\delta_j^{(l)}$
      • sum dropped : when changing the weight $w_{i,j}^{(l)}$, only $z_j^{(l)}$ changes
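
A minimal sketch of one backpropagation/SGD step for a single hidden layer with sigmoid activations and squared loss, following the forward pass, the $\delta$ recursion and the weight/bias gradients above; shapes follow the convention $\v x^{(l)}=\phi((\v W^{(l)})^\top\v x^{(l-1)}+\v b^{(l)})$, and the step size is illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, y, W1, b1, W2, b2, gamma=0.1):
    """One SGD step for a 1-hidden-layer net, sigmoid activations, squared loss.
    Shapes: x (D,), y (O,), W1 (D,K), b1 (K,), W2 (K,O), b2 (O,)."""
    # forward pass: z^(l) = W^(l)^T x^(l-1) + b^(l), x^(l) = phi(z^(l))
    z1 = W1.T @ x + b1
    x1 = sigmoid(z1)
    z2 = W2.T @ x1 + b2
    x2 = sigmoid(z2)                          # network output x^(L+1)
    # backward pass: delta^(l) = (W^(l+1) delta^(l+1)) * phi'(z^(l))
    d2 = -2.0 * (y - x2) * x2 * (1.0 - x2)    # output-layer delta for the squared loss
    d1 = (W2 @ d2) * x1 * (1.0 - x1)
    # gradients: dL/dW_ij^(l) = delta_j^(l) x_i^(l-1), dL/db_j^(l) = delta_j^(l)
    W2 -= gamma * np.outer(x1, d2)
    b2 -= gamma * d2
    W1 -= gamma * np.outer(x, d1)
    b1 -= gamma * d1
    return W1, b1, W2, b2
```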

Neural nets - Popular activation functions

  • sigmoid : positive, bounded, vanishing gradient problem
  • tanh : $\tanh(x)=\frac{e^x-e^{-x}}{e^x+e^{-x}}=2\phi(2x)-1$, balanced, bounded, vanishing gradient
  • relu : rectified linear unit, positive, unbounded, not vanishing
  • leaky relu : $f(x)=\max\{\alpha x, x\}$, avoid null derivative
  • maxout : $f(x)=\max\{\v x^\top\v w_1+b_1,\ldots,\v x^\top\v w_k+b_k\}$, generalizes relus

Neural nets - Convolutional nets

  • local processing of data
  • convolution : $x^{(1)}[n]=\sum_k f[k]x^{(0)}[n-k]$
  • weight sharing : same weights at every position, training by summing up edges that share same weights
  • handling of borders
    • zero padding
    • valid padding : convolve only inside the original data
  • multiple channel : one more dimension, typical scenario, decrease size and increase depth like a pyramid
  • fast fourier transform : $O(cL\log_2(L))$
    • dft : $\hat x[k]=\sum_{n=0}^{L-1} x[n]\exp\left(-\frac{2\pi j}{L}kn\right)$
    • idft : $x[n]=\frac{1}{L}\sum_{k=0}^{L-1}\hat x[k]\exp\left(\frac{2\pi j}{L}kn\right)$
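
A minimal sketch of the 1D convolution above with valid padding, making the weight sharing explicit (every output position reuses the same filter weights); the helper name is illustrative.

```python
import numpy as np

def conv1d_valid(x, f):
    """x1[n] = sum_k f[k] x[n-k], keeping only positions where the filter fully overlaps x.
    Equivalent to np.convolve(x, f, mode='valid')."""
    L, K = len(x), len(f)
    return np.array([np.dot(f, x[n:n + K][::-1]) for n in range(L - K + 1)])
```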

Neural nets - Regularization, data augmentation, dropout

  • regularization : get smaller weights, avoid overfitting
    • weights : $\frac{1}{2}\sum_{l=1}^{L+1}\mu^{(l)}\norm{\v W^{(l)}}_F^2$ with $\mu^{(l)}$ a constant that depends on layer
    • bias : common to not penalize
    • update rule : $\Theta[t+1]=\Theta[t]-\eta(\nabla_\Theta \L+\mu\Theta[t])=(1-\eta\mu)\Theta[t]-\eta\nabla_\Theta\L$ with constant $\mu$ and $\Theta=w_{i,j}^{(l)}$
    • leads to weight decay
  • projected gradient descent : weights should not have a $L_2$ norm bigger than a constant, if so, normalize with the constant
  • dataset augmentation : create some variants (rotation, resize), learn to be invariant to this transformation, or train a network for some tasks and specialize only last layers to their respective tasks
  • dropout : keep node $i$ in layer $l$ with probability $p_i^{(l)}$, ignore the others; draw a new mask at each iteration ($K$ subnetworks), predict by averaging the $K$ subnetworks or by scaling the output by $p$
  • bagging : averaging over many models

Graphical models - Bayes nets

  • graphical models : establish relationships between random variables; Bayes nets, Markov random fields, factor graphs
  • Bayes nets
    • random variables : $X_1,\ldots,X_D$
    • joint distribution : $p(X_1,\ldots,X_D)$
    • chain rule : $p(X_1,\ldots,X_D)=p(X_1)p(X_2\mid X_1)\cdots p(X_D\mid X_1,\ldots, X_{D-1})$, any of $D!$ orders
    • directed edge : from $X_j$ to $X_i$ if $X_j$ (parent) in conditioning of $X_i$ (child)
    • topology : for a fixed ordering the construction always gives the same graph, always acyclic
  • tail-to-tail : both arrows point away (conditioning leads to independence)
  • head-to-tail/tail-to-head : one arrow point away, another in (conditioning leads to independence)
  • head-to-head : both arrows point in (conditioning creates dependence)
  • descendant : child, child of child, ...
  • blocked path : node $X$ to node $Y$ blocked by $Z$ iff it contains a variable st. variable is in $Z$ and it is head-to-tail/tail-to-tail, or st. node is head-to-head and neither this node nor descendants are in $Z$
  • $D$-separation : $X$ and $Y$ $D$-separated by $Z$ iff every path from any element of $X$ to any element of $Y$ is blocked by $Z$
  • independence : $X$ conditionally independent of random variables $Y$ conditioned on variables $Z$ iff $X$ and $Y$ are $D$-separated by $Z$ (symmetric)
  • markov blanket : minimal set such that every random variable outside this set is conditionally independent of the node given the set; contains parents, children, co-parents (other parents of the children)
  • generating sample : storage exponential in largest number of parents

Graphical models - Factor graphs

  • variables : $x_1,\ldots,x_D$ can be random variables
  • multidimensional function : factorization $f(x_1,\ldots,x_D)=\Pi_{a\in A}f_a(x_{\partial a})$
    • variables : indexed by integers
    • factor : count $\abs{A}$, indexed by lower case Roman letters
    • subset : $x_{\partial a}$ involved by factor ($\partial$ means neighbors of)
  • factor graph : represents the factorization with a bipartite graph ($D$ variable nodes and $\abs{A}$ factor nodes, with an edge between variable $i$ and factor $a$ if $x_i$ participates in $f_a$)
  • tree factor graph : if Bayes net is a tree, not the converse
  • complexity of computing marginals : exponential in $\max_{a\in A}\abs{x_{\partial a}}$ (better a lot of factors)
  • sum product algorithm : need $f(x_1)=\sum_{\sim x_1}f(x_1,\ldots,x_6)$ out of $f(x_1,\ldots,x_6)=f_1(x_1,x_2,x_3)f_2(x_1,x_4,x_6)f_3(x_4)f_4(x_4,x_5)$
    • brute force : $O(\abs{\mathcal X}^6)$
    • refinement : $f(x_1)=[\sum_{x_2,x_3}f_1(x_1,x_2,x_3)][\sum_{x_4}f_3(x_4)(\sum_{x_6} f_2(x_1,x_4,x_6))(\sum_{x_5} f_4(x_4,x_5))]$
    • complexity : $O(\abs{\mathcal X}^3)$, proportional to highest degree
    • marginalization : $\sum_{\sim z}g(z,\ldots)=\sum_{\sim z}\Pi_{k=1}^K[g_k(z,\ldots)]=\Pi_{k=1}^K[\sum_{\sim z}g_k(z,\ldots)]$
    • rules
      • initialization at leaf
        • factor : $\mu(x)=f(x)$
        • variable : $\mu(x)=1$
      • node processing
        • factor : $\mu(x)=\sum_{\sim x}f(x, x_1,\ldots x_J)\Pi_{j=1}^J\mu_j(x_j)$
        • variable : $\mu(x)=\Pi_{k=1}^K\mu_k(x)$
      • marginalization : $\Pi_{k=1}^{K+1}\mu_k(x)$
  • loopy belief propagation : group factor together, approximation
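
A small NumPy check of the marginalization example above, assuming arbitrary random factors over an alphabet of size $\abs{\mathcal X}=3$: the brute-force sum over $x_2,\ldots,x_6$ and the refined order give the same marginal $f(x_1)$.

```python
import numpy as np

rng = np.random.default_rng(0)
X = 3                                   # alphabet size |X|
f1 = rng.random((X, X, X))              # f1(x1, x2, x3)
f2 = rng.random((X, X, X))              # f2(x1, x4, x6)
f3 = rng.random(X)                      # f3(x4)
f4 = rng.random((X, X))                 # f4(x4, x5)

# brute force: sum the full product over x2..x6, O(|X|^6)
brute = np.einsum('abc,adf,d,de->a', f1, f2, f3, f4)

# refined order from the text: push the sums inside the factorization, O(|X|^3)
m1 = f1.sum(axis=(1, 2))                # sum over x2, x3 -> function of x1
m2 = f2.sum(axis=2)                     # sum over x6     -> function of (x1, x4)
m4 = f4.sum(axis=1)                     # sum over x5     -> function of x4
refined = m1 * (m2 @ (f3 * m4))

print(np.allclose(brute, refined))      # True: same marginal f(x1)
```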