Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[github actions] deployed from uu-sml/course-sml
uu-sml committed Nov 11, 2023
1 parent 0169693, commit 3084aee
Showing 1 changed file with 1 addition and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
{"cells": [{"cell_type": "markdown", "id": "d09dcd10", "metadata": {"id": "d09dcd10", "tags": []}, "source": "# Notebook: F5 -- Generalization performance\nAuthors: Hugo Toll, Paul H\u00e4usner<br>\nDate: Nov 2023\n\nThis notebook complements lecture F5 on assessing the generalization performance of machine learning models and highlights the key concepts. The focus is on\n\n- How to estimate the expected error on new data $E_{new}$ for a given model\n- How to both train a model and estimate its expected error based on data\n\nPlease read the instructions and experiment with the notebook where indicated."}, {"cell_type": "code", "execution_count": null, "id": "cc60fa9a", "metadata": {"id": "cc60fa9a", "tags": []}, "outputs": [], "source": "# import the necessary libraries\nimport numpy as np\nimport matplotlib.pyplot as plt"}, {"cell_type": "markdown", "id": "09be5700", "metadata": {"id": "09be5700", "tags": []}, "source": "---\n\n## 1. Creating data and training a regression model on it\n\nWhen applying machine learning techniques, we usually assume that there exists a joint distribution of the inputs and outputs $p(x, y)$. 
In practice, however, we have no access to this distribution and can only obtain samples from it.\n\nFor the sake of this notebook, we assume that we know the true joint probability distribution of the inputs and the outputs, which is given by a joint normal distribution\n\n$$ [x, y]^T \\sim \\mathcal{N}(\\mu, \\Sigma) $$\n\nwhere $\\mu$ is the mean of the distribution and $\\Sigma$ is the covariance matrix.\n\nTo make things a little more concrete, assume that $x$ stands for the area of a house in square meters and $y$ stands for its selling price (in thousands).\n\nThe following code specifies this distribution and allows us to sample points from it."}, {"cell_type": "code", "execution_count": null, "id": "a9290f64", "metadata": {"id": "a9290f64", "tags": []}, "outputs": [], "source": "# mean and covariance of the joint distribution of (x, y)\nmu = np.array([100, 200])\nSigma = np.array([[100, 50], [50, 100]])\n\n# draw a single (x, y) pair\ndef sample():\n    s = np.random.multivariate_normal(mu, Sigma)\n    return s[0], s[1]\n\n# draw n (x, y) pairs; optionally fix the seed for reproducibility\ndef dataset(n, seed=None):\n    if seed is not None:\n        np.random.seed(seed)\n    d = [sample() for i in range(n)]\n    x_d = [s[0] for s in d]\n    y_d = [s[1] for s in d]\n    return np.array(x_d), np.array(y_d)"}, {"cell_type": "markdown", "id": "8492a35a", "metadata": {"id": "8492a35a", "tags": []}, "source": "Now, let's sample $n=5$ data points from our joint distribution and plot each point in the graph."}, {"cell_type": "code", "execution_count": null, "id": "882888da", "metadata": {"id": "882888da", "tags": []}, "outputs": [], "source": "x, y = dataset(5, seed=42)"}, {"cell_type": "code", "execution_count": null, "id": "b0cd6c3c", "metadata": {"colab": {"base_uri": "https://localhost:8080/", "height": 430}, "id": "b0cd6c3c", "outputId": "48b4457d-0fa2-4c35-eca8-0ff1efb4a7b0", "scrolled": false, "tags": []}, "outputs": [], "source": "# plotting some samples from the distribution\nplt.plot(x, y, \"x\", ms=7)\nplt.grid(alpha=.3)\nplt.xlabel(\"x\")\nplt.ylabel(\"y\")\nplt.show()"}, {"cell_type": "markdown", "id": 
"2e031d97", "metadata": {"id": "2e031d97", "tags": []}, "source": "Our aim is to find a linear regression model with two parameters (offset and slope), which we usually denote by $\\theta \\in \\mathbb{R}^2$, and to evaluate how well it performs when seeing new samples from the joint probability distribution.\n\nQuestion:\n- Look at the samples from the distribution above. What could be potential values for the linear regression coefficients?"}, {"cell_type": "markdown", "id": "d971e597", "metadata": {"id": "d971e597", "tags": []}, "source": "---\n\n## 2. Evaluating a fixed model\n\nLet's assume that a friend of ours is an expert in the domain and tells us that the parameters for the regression problem should be $\\theta = [90, 1.5 ]^T $. We now want to check whether these parameters are a good choice.\n\nTherefore, we want to compute the expected error the model makes when seeing new samples from the joint distribution. This error is called $E_{new}$ and is given by\n\n$$E_{new}(\\theta) = \\mathbb{E}_\\star \\left[ E(x_\\star, y_\\star; \\theta) \\right].$$\n\nHowever, we cannot compute the expected value analytically. 
Instead, we have to estimate $E_{new}$ using a set of independent and identically distributed (iid) samples $\\{ (x_i, y_i) \\}_{i=1}^n$ from the data distribution, whose average error approximates the expected value\n\n$$E_{new}(\\theta) \\approx \\frac1n \\sum_{i=1}^n E(x_i, y_i; \\theta).$$"}, {"cell_type": "code", "execution_count": null, "id": "9fd6a47a", "metadata": {"id": "9fd6a47a", "tags": []}, "outputs": [], "source": "# given model parameters\ntheta = np.array([90, 1.5])"}, {"cell_type": "code", "execution_count": null, "id": "90f8efa2", "metadata": {"id": "90f8efa2", "tags": []}, "outputs": [], "source": "# compute the error of the prediction\n# function E(x, y, \\theta)\ndef compute_error(theta, x, y):\n    model_prediction = (theta[0] + theta[1] * x)\n    return np.mean(np.abs(model_prediction - y))"}, {"cell_type": "code", "execution_count": null, "id": "dfd5fff7", "metadata": {"colab": {"base_uri": "https://localhost:8080/"}, "id": "dfd5fff7", "outputId": "f9924801-5a48-4804-ae1e-66c0246b1a1a", "scrolled": true, "tags": []}, "outputs": [], "source": "n = 3  # change me!!!\nnp.random.seed(553)\n\nerrors = []\nfor i in range(n):\n    x_star, y_star = sample()\n    error_i = compute_error(theta, x_star, y_star)\n    errors.append(error_i)\n\n# running estimate of E_new after 1, ..., n samples:\nE_new = [np.mean(errors[:(i+1)]) for i in range(n)]\nprint(f\"Estimated $E_new$ after {n} steps: {E_new[-1]}\")"}, {"cell_type": "code", "execution_count": null, "id": "30888a5f", "metadata": {"colab": {"base_uri": "https://localhost:8080/", "height": 453}, "id": "30888a5f", "outputId": "fbfbf5c5-f5df-40ad-d92d-fb9ea4e434cb", "scrolled": false, "tags": []}, "outputs": [], "source": "# Plot the estimated new error after n steps\nplt.plot(range(1, n+1), E_new, \".-\", label=\"$E_{new}$\")\nplt.grid(alpha=.3)\nplt.xlabel(\"Number of samples: $n$\")\nplt.legend()\nplt.show()"}, {"cell_type": "markdown", "id": "378e7f27", "metadata": {"id": "378e7f27", "tags": []}, 
"source": "Questions:\n- What happens if you change $n$ in the code above? What is the ideal value of $n$?\n- What is the drawback of large $n$ values? What is the drawback of small $n$ values?\n- Which error metric is used in the example above?\n- Go to the code above and increase $n$. Do the results match your answer to the first question?\n\n---"}, {"cell_type": "markdown", "id": "7dafc37d", "metadata": {"id": "7dafc37d", "tags": []}, "source": "## 3. Evaluating a learned model\n\nWe now additionally want to train the linear regression model on sampled data instead of using a fixed set of model parameters provided to us. This is achieved by solving the normal equations. For details, please refer to the material on linear regression from the previous lectures."}, {"cell_type": "code", "execution_count": null, "id": "a4581d30", "metadata": {"id": "a4581d30", "tags": []}, "outputs": [], "source": "# train a linear regression model given data x and y\ndef fit_lr(x, y):\n    x_augmented = np.vstack((np.ones_like(x), x)).T\n    theta = np.linalg.solve(x_augmented.T@x_augmented, x_augmented.T@y)\n    return theta"}, {"cell_type": "code", "execution_count": null, "id": "407d117c", "metadata": {"colab": {"base_uri": "https://localhost:8080/", "height": 430}, "id": "407d117c", "outputId": "0cceb83c-f2c4-4679-c5f9-5213f99f6bfc", "scrolled": false, "tags": []}, "outputs": [], "source": "# create data to train model on\nx, y = dataset(15, seed=42)\n\n# find model parameters\ntheta = fit_lr(x, y)\n\n# plot the data and the linear regression model\nplt.plot(x, y, \"x\", ms=7, label=\"data\")\nplt.plot(np.linspace(np.min(x), np.max(x)), theta[0] + theta[1] * np.linspace(np.min(x), np.max(x)),\n         label=\"linear regression\")\nplt.grid(alpha=.3)\nplt.xlabel(\"x\")\nplt.ylabel(\"y\")\nplt.legend()\nplt.show()"}, {"cell_type": "markdown", "id": "2cda5970", "metadata": {"id": "2cda5970", "tags": []}, "source": "However, one problem is that when we train a model on data 
$\\mathcal{T}$, the parameters of that model -- which are here denoted by $\\theta(\\mathcal{T})$ to indicate that they are obtained from training on the samples $\\mathcal{T}$ -- become dependent on the dataset used to train them, i.e.,\n\n$$E_{new}(\\theta(\\mathcal{T})) = \\mathbb{E}_\\star \\left[ E(x_\\star, y_\\star; \\theta(\\mathcal{T})) \\right].$$\n\nThis means we cannot estimate $E_{new}$ with the sampled dataset $\\mathcal{T}$ we used to train the model, since the samples directly influence the model weights. When evaluating the model with parameters $\\theta(\\mathcal{T})$ on the dataset $\\mathcal{T}$, we obtain the training error, denoted $E_{train}$, which typically underestimates $E_{new}$."}, {"cell_type": "code", "execution_count": null, "id": "59957387", "metadata": {"colab": {"base_uri": "https://localhost:8080/"}, "id": "59957387", "outputId": "35d08012-c492-45c0-e01e-eadc0997b71d", "scrolled": true, "tags": []}, "outputs": [], "source": "train_error = compute_error(theta, x, y)\nprint(f\"The training error of the trained LR model is {train_error}\")"}, {"cell_type": "markdown", "id": "14300d93", "metadata": {"id": "14300d93", "tags": []}, "source": "Similarly to the previous task, to estimate $E_{new}$ we need to sample new points from the distribution $p(x, y)$, which are used to numerically approximate the expected value as before. 
Since we have access to the data distribution, we can easily achieve this."}, {"cell_type": "code", "execution_count": null, "id": "e7e662a6", "metadata": {"colab": {"base_uri": "https://localhost:8080/"}, "id": "e7e662a6", "outputId": "f8c2430e-eda9-4dca-96bd-3474a22b04b5", "scrolled": true, "tags": []}, "outputs": [], "source": "# in order to produce a good estimate of E_new, we use 1000 samples here.\ndef test(theta, n=1000):\n    x_test, y_test = dataset(n, seed=42**2)\n    test_error = compute_error(theta, x_test, y_test)\n    return test_error\n\nprint(f\"The expected new error of the trained LR model is approximately {test(theta)}\")"}, {"cell_type": "markdown", "id": "2d1aeeee", "metadata": {"id": "2d1aeeee", "tags": []}, "source": "We can see that the test error, our estimate of $E_{new}$, is usually higher than the training error $E_{train}$. We call the difference between them the **generalization gap**.\n\nNow, we want to observe what happens with $E_{new}$ and $E_{train}$ when we increase the number of samples we are training the model parameters on."}, {"cell_type": "code", "execution_count": null, "id": "7da49f12", "metadata": {"id": "7da49f12", "tags": []}, "outputs": [], "source": "n = 3  # change me!!\n\ne_new = []\ne_train = []\n\nfor i in range(2, n+1):\n    np.random.seed(42)\n\n    x_i, y_i = dataset(i)\n    theta_i = fit_lr(x_i, y_i)\n    e_train.append(compute_error(theta_i, x_i, y_i))\n    e_new.append(test(theta_i))"}, {"cell_type": "code", "execution_count": null, "id": "625fa34a", "metadata": {"colab": {"base_uri": "https://localhost:8080/", "height": 430}, "id": "625fa34a", "outputId": "48f68537-7275-4eac-835a-81ec47ca83fd", "scrolled": false, "tags": []}, "outputs": [], "source": "plt.plot(range(2, n+1), e_train, \".-\", label=\"E_train\")\nplt.plot(range(2, n+1), e_new, \".-\", label=\"E_new\")\nplt.grid(alpha=0.3)\nplt.ylabel(\"\")\nplt.xlabel(\"Number of samples: $n$\")\nplt.legend()\nplt.show()"}, {"cell_type": "markdown", "id": "66399b80", 
"tags": []}, "source": "Questions:\n- What is $E_{train}$ for $n=2$ and why?\n- What happens when we change the number of samples when estimating the parameters of the model?\n- What happens if you rerun the code? Do you get the same results? Why / why not?\n\nRerun the code above with a different choice for $n$ and see if the results match your answers to the questions.\n\nIn practice, we usually only have one dataset, and it is usually not trivial to obtain new samples. Therefore, to get an estimate of $E_{new}$, we need to split our dataset $\\mathcal{T}$ into two parts: the training data and the hold-out (test) data. We use the training data $\\mathcal{T}_{train}$ to train the linear regression model and use $\\mathcal{T}_{hold-out}$ to estimate $E_{new}$. This is also called **hold-out validation**.\n\n**Warning**: When we choose the hyperparameters (such as regularization strength, hand-engineered features, etc.) of a learned model based on the hold-out data, we also influence the model using the very data with which we estimate the expected new error. \n\nTherefore, two separate hold-out sets are usually used: the first one is called the validation set and is used to select which model to use. The second one, the test set, is used to estimate the performance of the chosen model. 
It is important that no model selection is made based on this final number."}, {"cell_type": "markdown", "id": "9d4e8533", "metadata": {"id": "9d4e8533", "tags": []}, "source": "---\n\n## Take-home messages:\n\n- We cannot compute the expected new error $E_{new}$ of a model analytically and therefore need to approximate it using data.\n- When we both train a model on data and evaluate its expected error on new samples, we need to make sure that the data used for evaluation is independent of the data used to train the model.\n- The generalization gap is the difference between the error the model makes on the training data and the error the model makes on new, unseen data.\n\n**Recommended reading**:\n- Machine Learning - A First Course for Engineers and Scientists, Chapter 4"}], "metadata": {"celltoolbar": "Tags", "colab": {"provenance": []}, "kernelspec": {"display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3"}, "language_info": {"codemirror_mode": {"name": "ipython", "version": 3}, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.9.12"}}, "nbformat": 4, "nbformat_minor": 5} |
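
The hold-out procedure described in the notebook's closing section (a training set, a validation set for model selection, and a final test set) can be sketched in plain NumPy as follows. This is a minimal sketch and not part of the committed notebook: the helper names `three_way_split` and `mean_abs_error`, the 60/20/20 split fractions, the dataset size of 200, and the use of `np.random.default_rng` are assumptions made for illustration. The distribution parameters and the fixed candidate $\theta = [90, 1.5]^T$ are taken from the notebook.

```python
import numpy as np

# Distribution parameters as in the notebook.
mu = np.array([100, 200])
Sigma = np.array([[100, 50], [50, 100]])

def fit_lr(x, y):
    """Least-squares fit of y = theta[0] + theta[1] * x."""
    X = np.column_stack((np.ones_like(x), x))
    theta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return theta

def mean_abs_error(theta, x, y):
    """Hypothetical helper: average absolute prediction error."""
    return np.mean(np.abs(theta[0] + theta[1] * x - y))

def three_way_split(x, y, rng, frac_train=0.6, frac_val=0.2):
    """Hypothetical helper: shuffle indices, then cut into train/validation/test."""
    idx = rng.permutation(len(x))
    n_train = int(frac_train * len(x))
    n_val = int(frac_val * len(x))
    train, val, test = np.split(idx, [n_train, n_train + n_val])
    return (x[train], y[train]), (x[val], y[val]), (x[test], y[test])

rng = np.random.default_rng(0)  # assumed seed, for reproducibility only
data = rng.multivariate_normal(mu, Sigma, size=200)
x, y = data[:, 0], data[:, 1]
(x_tr, y_tr), (x_val, y_val), (x_te, y_te) = three_way_split(x, y, rng)

# Two candidate models: the fixed parameters from section 2 and a fitted model.
candidates = {
    "fixed": np.array([90, 1.5]),
    "fitted": fit_lr(x_tr, y_tr),
}

# Model selection uses the validation set only.
val_errors = {name: mean_abs_error(th, x_val, y_val) for name, th in candidates.items()}
best = min(val_errors, key=val_errors.get)

# The test set is touched once, after selection, to estimate E_new.
e_test = mean_abs_error(candidates[best], x_te, y_te)
print(f"selected model: {best}, estimated E_new: {e_test:.2f}")
```

Keeping the test indices out of both fitting and selection is what makes the final number an honest estimate; if `e_test` were used to pick between the candidates, it would suffer from the same optimism as the training error.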