Merge pull request #59 from uc-python/update_environment

updated environment.yaml
uc-python · Jan 6, 2024 · c16127b
2 parents 85815a5 + 5313e78
commit c16127b

Showing 3 changed files with 109 additions and 38 deletions.
diff --git a/environment.yaml b/environment.yaml
@@ -3,17 +3,17 @@ channels:
     - defaults
     - conda-forge
 dependencies:
-    - category_encoders>=2.2
-    - ipykernel>=6.4
-    - matplotlib>=3.5
+    - python=3.11
+    - category_encoders>=2.6
+    - ipykernel>=6.28
+    - matplotlib>=3.8
     - missingno>=0.4
-    - mlflow=1.22
-    - nbconvert>=6.1
-    - numpy>=1.21
-    - pandas>=1.3
-    - pip>=21.2
-    - plotnine>=0.8
-    - pytest>=6.2
-    - python=3.9
-    - scikit-learn>=1.0
-    - seaborn>=0.11
+    - mlflow=2.9
+    - nbconvert>=7.14
+    - numpy>=1.26
+    - pandas>=2.1
+    - pip>=23.3
+    - plotnine>=0.12
+    - pytest>=7.4
+    - scikit-learn>=1.3
+    - seaborn>=0.13
diff --git a/notebooks/09-ml_lifecycle_mgt.ipynb b/notebooks/09-ml_lifecycle_mgt.ipynb
@@ -231,7 +231,7 @@
    "source": [
     "import mlflow\n",
     "\n",
-    "mlflow.set_experiment(\"Predicting income\")"
+    "experiment = mlflow.set_experiment(\"Predicting income\")"
    ]
   },
   {
@@ -1210,7 +1210,7 @@
     }
    ],
    "source": [
-    "df = mlflow.search_runs(experiment_ids='1')\n",
+    "df = mlflow.search_runs(experiment_ids=experiment.experiment_id)\n",
     "df"
    ]
   },
@@ -1294,7 +1294,7 @@
     }
    ],
    "source": [
-    "model_path = f'mlruns/1/{run_id}/artifacts/best_estimator'\n",
+    "model_path = f'mlruns/{experiment.experiment_id}/{run_id}/artifacts/best_estimator'\n",
     "model = mlflow.sklearn.load_model(model_path)\n",
     "model"
    ]
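Note on the hunks above: replacing the hard-coded `'1'` with `experiment.experiment_id` means the notebook no longer depends on the experiment ID being `1`. A minimal sketch of the same path construction, using only the standard library. `local_model_path` is a hypothetical helper name, and the `mlruns/<experiment_id>/<run_id>/artifacts/...` layout is assumed from the path in the diff; it may not hold for other tracking backends.

```python
from pathlib import PurePosixPath


def local_model_path(experiment_id, run_id, artifact="best_estimator"):
    """Build the artifact path the notebook uses for a local mlruns store."""
    # str() so that integer-like experiment IDs are accepted as well
    path = PurePosixPath("mlruns") / str(experiment_id) / run_id / "artifacts" / artifact
    return str(path)


# Placeholder IDs for illustration only; real values come from MLflow
# (experiment.experiment_id and the run_id of the chosen run).
print(local_model_path("1", "abc123"))
```

The resulting string can be passed to `mlflow.sklearn.load_model`, as the notebook cell does.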

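Note on the Case Study notebook changes that follow: steps 3 and 4 ask for a parametrized test of `loguniform_int`. A sketch of what such a test could look like, assuming `scipy` and `pytest` are installed. The class is repeated from the notebook so the block runs on its own (in the exercise it would be imported from `loguniform_int.py`), and the test name `test_distribution_args` is a made-up example.

```python
import pytest
from scipy.stats import loguniform


class loguniform_int:
    """Integer valued version of the log-uniform distribution"""
    def __init__(self, a, b):
        self._distribution = loguniform(a, b)

    def rvs(self, *args, **kwargs):
        """Random variable sample"""
        return self._distribution.rvs(*args, **kwargs).astype(int)


# One test function, run once per (a, b) pair.
@pytest.mark.parametrize("a, b", [(2, 30), (1, 100)])
def test_distribution_args(a, b):
    lu = loguniform_int(a, b)
    # The frozen scipy distribution keeps the constructor arguments in .args
    assert lu._distribution.args == (a, b)
```

Saved in `tests.py`, this would be collected by running `pytest tests.py`.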
diff --git a/notebooks/Case Study.ipynb b/notebooks/Case Study.ipynb
@@ -373,57 +373,128 @@
   },
   {
    "cell_type": "markdown",
-   "id": "aa69e649",
+   "id": "ed954c73-b660-4edc-93f4-b6869c0dc9d3",
    "metadata": {},
    "source": [
-    "### Unit Tests\n",
+    "### Modular code & unit tests\n",
     "\n",
-    "1. TBD\n",
-    "1. TBD\n",
-    "1. TBD"
+    "1. Move the `loguniform_int` class we defined above into a new module, `loguniform_int.py`. We haven't put classes into modules before, but it's no different than a function; just paste it along with any imports it needs."
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "98334504",
+   "id": "4bd495e6-eb4b-4ddb-83b5-dde8e586ef36",
    "metadata": {},
    "source": [
-    "### ML lifecycle management"
+    "Your new module should contain something like:\n",
+    "\n",
+    "```python\n",
+    "from scipy.stats import loguniform\n",
+    "\n",
+    "class loguniform_int:\n",
+    "    \"\"\"Integer valued version of the log-uniform distribution\"\"\"\n",
+    "    def __init__(self, a, b):\n",
+    "        self._distribution = loguniform(a, b)\n",
+    "\n",
+    "    def rvs(self, *args, **kwargs):\n",
+    "        \"\"\"Random variable sample\"\"\"\n",
+    "        return self._distribution.rvs(*args, **kwargs).astype(int)\n",
+    "```"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "42c8a06f",
+   "id": "7375923a-f673-4c83-abd3-cf4f099a5b9c",
    "metadata": {},
    "source": [
-    "1. Create and set an MLflow experiment titled \"UC Advanced Python Case Study\"\n",
-    "2. Re-perform the random hyperparameter search executed above while logging the hyperparameter search experiment with MLflow's autologging. Title this run \"rf_hyperparameter_tuning\"."
+    "2. Import your module and make sure you can use it in code by (re)running the below:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 34,
+   "id": "31f07d64-f468-4b4a-a60e-e338f2f00cb2",
+   "metadata": {
+    "tags": [
+     "ci-skip"
+    ]
+   },
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Fitting 5 folds for each of 10 candidates, totalling 50 fits\n"
+     ]
+    }
+   ],
+   "source": [
+    "from loguniform_int import loguniform_int\n",
+    "\n",
+    "param_distributions = {\n",
+    "    'rf__n_estimators': loguniform_int(50, 1000),\n",
+    "    'rf__max_features': loguniform(.1, .8),\n",
+    "    'rf__max_depth': loguniform_int(2, 30),\n",
+    "    'rf__min_samples_leaf': loguniform_int(1, 100),\n",
+    "    'rf__max_samples': loguniform(.5, 1),\n",
+    "}\n",
+    "\n",
+    "random_search = RandomizedSearchCV(\n",
+    "    pipeline, \n",
+    "    param_distributions=param_distributions, \n",
+    "    n_iter=10, # lower this to 10 so it's faster\n",
+    "    cv=5, \n",
+    "    scoring='neg_root_mean_squared_error',\n",
+    "    verbose=1,\n",
+    "    n_jobs=-1,\n",
+    ")\n",
+    "\n",
+    "results2 = random_search.fit(X_train, y_train)"
    ]
   },
   {
    "cell_type": "markdown",
-   "id": "60677940",
+   "id": "ca9dc10a-42dd-4cc4-b957-0451046cc5f9",
    "metadata": {},
    "source": [
-    "### Reproducibility with dependency tracking\n",
+    "3. Create a `tests.py` file in which you add the tests we already created for `get_features_and_target` (you can just copy them), along with a new test that asserts that `loguniform_int` objects have a `._distribution.args` attribute that holds the original numbers passed into them -- confirming that we did indeed create the kind of distribution we expected. Run the tests when finished.\n",
     "\n",
-    "1. TBD\n",
-    "1. TBD\n",
-    "1. TBD"
+    "```python\n",
+    ">>> lu = loguniform_int(2, 30)\n",
+    ">>> lu._distribution.args\n",
+    "(2, 30)\n",
+    "```"
    ]
   },
   {
-   "cell_type": "code",
-   "execution_count": null,
-   "id": "b8271687",
+   "cell_type": "markdown",
+   "id": "c7d3dd7d-11c9-471f-8391-c5a23219acd6",
    "metadata": {},
-   "outputs": [],
-   "source": []
+   "source": [
+    "4. Parametrize this test. Create one `loguniform_int` with `(2, 30)` as the arguments and another with `(1, 100)` as the arguments. Confirm that in both cases, the resulting `._distribution.args` attribute holds a tuple with the same numbers that were supplied initially. Rerun your tests."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "98334504",
+   "metadata": {},
+   "source": [
+    "### ML lifecycle management"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "42c8a06f",
+   "metadata": {},
+   "source": [
+    "1. Create and set an MLflow experiment titled \"UC Advanced Python Case Study\"\n",
+    "2. Re-perform the random hyperparameter search executed above while logging the hyperparameter search experiment with MLflow's autologging. Title this run \"rf_hyperparameter_tuning\"."
+   ]
   }
  ],
  "metadata": {
   "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
+   "display_name": "Python 3",
    "language": "python",
    "name": "python3"
   },
@@ -437,7 +508,7 @@
    "name": "python",
    "nbconvert_exporter": "python",
    "pygments_lexer": "ipython3",
-   "version": "3.9.7"
+   "version": "3.9.2"
   }
  },
  "nbformat": 4,