perceptron: Finds a linear classifier that separates 2 classes labeled {-1, +1}. The classifier is found with the perceptron algorithm, which updates the parameters whenever a training point is misclassified. The predicted y-value for a point is determined by which side of the learned separator it falls on (+1 on the side the normal vector points toward, -1 on the opposite side). A minimal sketch is given below.
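A minimal perceptron sketch, assuming X is a d x n array whose columns are data points and y is a length-n array of labels in {-1, +1} (variable names are illustrative, not from the source):

    import numpy as np

    def perceptron(X, y, T=100):
        d, n = X.shape
        th = np.zeros(d)   # normal vector of the separator
        th0 = 0.0          # offset
        for _ in range(T):
            for i in range(n):
                # mistake: point i is on the wrong side of (or on) the separator
                if y[i] * (th @ X[:, i] + th0) <= 0:
                    th = th + y[i] * X[:, i]
                    th0 = th0 + y[i]
        return th, th0

    def predict(X, th, th0):
        # sign of the signed distance: +1 on the normal-vector side, -1 on the other side
        return np.sign(th @ X + th0)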
nll loss: Also finds a linear classifier separating 2 classes, here labeled {0, 1}. Predictions are made by passing th@x + th0 through the sigmoid; since the output is a probability rather than a discrete label, 0.5 is used as the classification cutoff. The classifier is found by gradient descent on the NLL (negative log-likelihood) loss. It has the advantage over the perceptron that a separator with a larger margin to the points is preferred, since the loss keeps decreasing as points move farther from the boundary on the correct side. As with the perceptron, the predicted label comes from which side of the learned separator a point falls on (1 on the side the normal vector points toward, 0 on the opposite side). A sketch follows below.
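A minimal logistic-classifier sketch trained by gradient descent on the NLL loss, assuming X is d x n and y holds labels in {0, 1} (names and hyperparameters are illustrative):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def nll_gradient_descent(X, y, lr=0.1, steps=1000):
        d, n = X.shape
        th, th0 = np.zeros(d), 0.0
        for _ in range(steps):
            g = sigmoid(th @ X + th0)          # predicted probabilities, shape (n,)
            # gradient of the mean NLL loss with respect to th and th0
            grad_th = X @ (g - y) / n
            grad_th0 = np.sum(g - y) / n
            th -= lr * grad_th
            th0 -= lr * grad_th0
        return th, th0

    def predict(X, th, th0):
        # 0.5 cutoff on the sigmoid output = side of the learned separator
        return (sigmoid(th @ X + th0) > 0.5).astype(int)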
regression: y-values are in R instead of being one of 2 classes. Use mean squared error as the objective function; adding a regularization term lam*||th||^2 gives the ridge regression objective. Use RMSE as the score when testing. The minimizing th can be found with an analytical solution when the data has low dimensionality; with high dimensionality, computing the analytical solution is too expensive and gradient descent should be used instead. The prediction for a point is the value of the learned linear function th@x + th0 (see the sketch below).
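A minimal ridge-regression sketch using the analytical solution (offset omitted for brevity), assuming X is d x n, y is length n, and lam is the regularization weight; these names are illustrative:

    import numpy as np

    def ridge_analytic(X, y, lam):
        d, n = X.shape
        # th = (X X^T + n*lam*I)^-1 X y minimizes mean squared error + lam*||th||^2
        return np.linalg.solve(X @ X.T + n * lam * np.eye(d), X @ y)

    def rmse(X, y, th):
        # root mean squared error, used as the test score
        return np.sqrt(np.mean((th @ X - y) ** 2))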
neural_networks: Essentially a way of stacking many linear layers (X.T@W, as used in the perceptron, nll loss, and regression) back-to-back-to-back. Each linear layer is followed by an activation function; the activation is essential so that the resulting Ypred is not just a linear function of the X values. Popular activation functions for middle layers are ReLU and tanh. The last layer usually has a sigmoid activation (if the output is 1/0), a softmax activation (if the output is 1 of N classes), or no activation (if the output is a real value, i.e. regression). The loss function is usually NLL for binary classification, NLLM (multiclass NLL) for N-class classification, and squared error for regression. The weights are updated via gradient descent, with gradients computed by backpropagation. A forward-pass sketch is given below.
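A minimal forward-pass sketch of a two-layer network with a ReLU hidden layer and a softmax output, assuming X is d x n, W1 is d x m with bias b1 of shape (m, 1), and W2 is m x k with bias b2 of shape (k, 1); shapes and names are assumptions for illustration:

    import numpy as np

    def relu(Z):
        return np.maximum(Z, 0)

    def softmax(Z):
        e = np.exp(Z - Z.max(axis=0, keepdims=True))   # subtract max for numerical stability
        return e / e.sum(axis=0, keepdims=True)

    def forward(X, W1, b1, W2, b2):
        Z1 = W1.T @ X + b1      # first linear layer (pre-activation), shape m x n
        A1 = relu(Z1)           # nonlinearity keeps Ypred from being linear in X
        Z2 = W2.T @ A1 + b2     # second linear layer, shape k x n
        return softmax(Z2)      # class probabilities for a k-class output

    def nllm_loss(Ypred, Y):
        # multiclass NLL (cross-entropy); Y is one-hot with shape k x n
        return -np.mean(np.sum(Y * np.log(Ypred), axis=0))

In training, the gradient of this loss with respect to W1, b1, W2, b2 would be computed by backpropagation and the weights updated by gradient descent, as described above.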