A machine learning and data science project template that makes it easy to work with multiple Docker based VSCode Dev Containers in the same repository. The template also makes it easy to transition projects to the cloud and production by including automated code quality checks, pytest configuration, CI pipeline templates and a sample for running on Azure Machine Learning.
- Dev Containers for ML feasibility study with VS Code
This repository provides a VSCode Dev Container based project template that can help accelerate your Machine Learning inner-loop development phase. The template covers the phases from early ML experimentation (local training/testing) until production oriented ML model training (cloud based training/testing with bigger CPUs and GPUs).
During the early phase of Machine Learning project, you may face challenges such as each data scientist creating various different python environments that span across CPU and GPU that tend to have different setup procedures. With the power of Dev Containers, you can automate environment setup process across the team and every data scientist will get the exact same environment automatically. This template provides both CPU and GPU Dev Container setup as examples. To support multiple different ML approaches with different python environments to be experimented in one project, this solution allows multiple different Dev Containers to be used in one repository while having a "common" module that will be installed into all Dev Container to enable code reuse across different Dev Containers.
Another challenge you may face is each data scientist creating a low quality codebase. That is fine during the experimentation stage to keep the team agility high and maximize your team’s experimentation throughput. But when you move to the model productionization stage, you experience the burden of bringing code quality up to production level. With the power of python tools and VSCode extensions configured for this template on top of Dev Containers, you can keep the code quality high automatically without losing your team’s agility and experimentation throughput and ease the transition to the productionization phase.
- Multiple Dev Container samples (both CPU and GPU) with many common config steps already configured as following:
- Automated code quality checks (linter and auto formatter) and automated fix when possible with ruff on VSCode on save
- Automated code quality checks (linter and auto formatter) with ruff as precommit hook
- Zero effort transition from local env to Azure Machine Learning (cloud based env) by leveraging the same Dockerfile
- Pre-configured VSCode extensions installed such as python, jupyter, shellcheck, code-spell-checker, git tools etc
- Github Actions and Azure DevOps CI pipelines that run linter (ruff) and pytest with test result reporting and coverage reporting
- Pull Request templates that helps you to write a good PR description for both Github and Azure DevOps
This template automates all tedious setup process as much as possible and saves time and reduces setup errors for the entire data scientist team.
This section provides a comprehensive guide on how to set up a development environment using Dev Containers in Visual Studio Code with step-by-step instructions.
- Install Visual Studio Code
- If your team has a commercial license for Docker Desktop, follow VS Code Remote Containers | Docker Desktop. Otherwise, go to VS Code Remote Containers | Rancher Desktop Docs and finish the first step (Install and launch Rancher Desktop. Select dockerd (moby) as the Container Runtime from the Kubernetes Settings menu.)
- Install VSCode Remote - Containers extension on vscode
- If you forget this step, you will get an error when you try to build the container so make sure you have
.env
at root of this directory before you move on to the next step.. Copy.env.example
and rename it to.env
. This is where you store your credentials etc..env
is automatically loaded into dev container as environment variables. When you add new environment variables.env
, update.env.example
as well to share that with others but don't include any credentials there..env
is gitignored so your credentials in that file won't be accidentally committed. - Run
Dev Containers: Open Folder in Container...
from the Command Palette (F1) and select thenotebooks
directory. - VS Code will then build and start up a container, connect this window to Dev Container:
notebooks
, and install VS Code extensions specified innotebooks/.devcontainer/devcontainer.json
.pre-commit install --overwrite
runs as part ofpostCreateCommand
indevcontainer.json
and this will setup your git precommit hook automatically. - Now set up is done. If you want to develop in another directory for example under
src
, runDev Containers: Open Folder in Container...
and go to that directory that has.devcontainer
and that will setup an dev environment for that directory. - When you or others update either
requirements.txt
orDockerfile
in your working directory, make sure to rebuild your container to apply those changes to container. RunDev Containers: Rebuild and Reopen in Container...
for that.
- Copy
src/sample_cpu_project/
undersrc
and rename it. If you need gpu environment, base off ofsrc/sample_pytorch_gpu_project
instead - Update
COPY sample_cpu_project/.devcontainer/requirements.txt
inDockerfile
with a new path - Update other parts of
Dockerfile
if you need - Update
requirements.txt
if you need - Run
Dev Containers: Open Folder in Container...
from the Command Palette (F1) and select the new directory and make sure you can successfully open the new directory on VS Code running in a container
This section gives you overview of the directory structure of this template. Only essential files are covered in this structure graph for simplicity. The directory structure is as follows:
.
├── .azuredevops # CI pipelines for Azure DevOps. Details at section: How to Configure Azure DevOps CI Pipeline
├── .github # CI pipelines for Github Actions. Details at section: How to Configure Github Actions CI Pipeline
├── .pre-commit-config.yaml # pre-commit config file with formatting and linting. Setup is covered in Section: Getting Started
├── .env.example # Example of .env file. Setup is covered in Section: Getting Started
├── ci-tests.sh # Details at Section: Running all unit tests with ci-tests.sh
├── data # Directory to keep your data for local training etc. This directory is gitignored
├── notebooks # Setup process is covered in Section: How to setup dev environment?
│ ├── .devcontainer # dev container related configuration files goes to here following VSCode convention
│ │ ├── devcontainer.json # dev container configuration and VS Code settings, extensions etc.
│ │ ├── Dockerfile # referred in devcontainer.json
│ │ └── requirements.txt # includes python package list for notebooks. used in Dockerfile
│ └── sample_notebook.py # example of interactive python script
├── pyproject.toml # Setting file for ruff, pytest and pytest-cov
└── src
├── common # this module is accessible from all modules under src. put functions you want to import across the projects here
│ └── requirements.txt # python package list for common module. installed in all Dockerfile under src. python tools for src goes to here too
├── sample_cpu_project # cpu project example. Setup process is covered in Section: How to setup dev environment?
│ ├── .devcontainer # dev container related configuration files goes to here following VSCode convention
│ │ ├── devcontainer.json # dev container configuration and VS Code settings, extensions etc.
│ │ ├── Dockerfile # referred in devcontainer.json. Supports only CPU
│ │ └── requirements.txt # includes python package list for sample_cpu_project. used in Dockerfile
│ ├── sample_main.py
│ └── tests # pytest scripts for sample_cpu_project goes here
│ └── test_dummy.py # pytest script example
└── sample_pytorch_gpu_project # gpu project example with pytorch. Setup process is covered in Section: How to setup dev environment?
├── README.md # README for AML example contained in sample_pytorch_gpu_project
├── .devcontainer # dev container related configuration files goes to here following VSCode convention
│ ├── devcontainer.json # dev container configuration and VS Code settings, extensions etc.
│ ├── Dockerfile # referred in devcontainer.json. Supports GPU
│ └── requirements.txt # includes python package list for sample_pytorch_gpu_project. used in Dockerfile
├── aml_example/ # Sample AML CLI v2 Components-based pipeline, including setup YAML. See sample_pytorch_gpu_project/README for full details of files in this directory.
├── sample_main.py
├── inference.py # Example pytorch inference/eval script that also works with aml_example
├── train.py # Example pytorch model training script that also works with aml_example
└── tests # pytest scripts for sample_pytorch_gpu_project goes here
└── test_dummy.py # pytest script example
There are two places to put python scripts/modules in this template. The notebooks
directory is for experimental or throw-away python scripts and jupyter notebooks that you want to run cell by cell interactively. For example, EDA, one-off visualization codes, new model approaches you are not certain yet if you want to maintain over time typically go to this directory. The src
directory is for python scripts and modules that you want to reuse and maintain over time. The src
directory is also where you would put unit tests (typically under a src/your_module/tests
directory).
Given the nature of each directory's responsibility, there is also a different quality governance required. One big difference is that pre-commit hooks and CI pipelines run ruff check
(linter) over src
but not over notebooks
(ruff format
still run for both). For scripts in notebooks
, we recommend you use interactive python scripts where you can have jupyter-like code cells within .py
files rather than jupyter notebooks .ipynb
. Interactive python files gives you the following benefits:
- Comes with full benefits of python extension in VSCode such as code completion, linting, auto formatting, debugging etc
- pre-commit hooks and CI pipelines will work as they run over
.py
files (but not perfectly over.ipynb
files) - Python file format is easier to review during a pull request review
Interactive python scripts and jupyter notebooks are interchangeable as described in Convert Jupyter notebooks to Python code file so you can switch between them easily too if you want to use both formats during the development.
An Azure Machine Learning (AML) example is provided under src/sample_pytorch_gpu_example
. The example is a AML Components-based ML pipeline, that runs a pytorch based training step followed by a inference/evaluation step. This example shows the seamless transition of moving from a local run (inside the Dev Container) of pytorch based training/inference and running in the cloud in the exact same Docker environment with flexible compute options. See the AML Components-based Pipeline Example README for a detailed explanation and instructions of the example code.
This repository contains templates for running a Continuous Integration (CI) pipeline on either Azure DevOps (under .azuredevops
directory) or on Github Actions (under .github
directory). Each of the CI pipeline configurations provided have the following features at a high level:
- Run code quality checks (
ruff check
) over the repository - Find all subdirectories under
src
and run all pytest tests inside the associated Docker containers - Publish test results and code coverage statistics
We recommend setting up pipeline triggers for PR creation, editing and merging. This will ensure the pipeline runs continuously and will help catch any issues earlier in your development process.
See the sections below for links on how to setup pipelines with Azure DevOps and Github Actions. Note that if you are only using one of these platforms to host a pipeline (or neither), you can safely delete either (or both) the .azuredevops
directory or the .github
directory.
As multiple independent directories can be added under src
, each with its own Dockerfile and requirements, running unit tests for each directory under src
needs to be done within the Docker container of each src
subdirectory. The ci-tests.sh
script automates this task of running all unit tests for the repository with the following steps:
- Finds all subdirectories under
src
that have at least onetest_*.py
under atests
folder - Builds each Docker image for each subdirectory with tests, using the Dockerfile in the associated
.devcontainer
directory - Runs pytest for each subdirectory with tests, inside the matching Docker container built in step 2
- Combine all test results and coverage reports from step 3, with reports in a valid format for publishing in either Azure DevOps or Github Actions hosted pipeline
Note that the ci-test.sh
script can be run locally as well and it is assumed that all tests are written with pytest.
See create your first pipeline for how to setup a pipeline in Azure DevOps. Note that to use the provided template in this repository, you will need to specify the path to .azuredevops/ado-ci-pipeline-ms-hosted.yml
during the pipeline setup process in Azure DevOps.
There are two templates for running a CI pipeline in Azure DevOps, a pipeline configuration that uses a Microsoft hosted agent to run the pipeline (.azuredevops/ado-ci-pipeline-ms-hosted.yml
) and a pipeline configuration that uses a self-hosted agent to run the pipeline (.azuredevops/ado-ci-pipeline-self-hosted.yml
).
The Microsoft hosted version is easiest to start with and recommended. Where you may consider switching to the self-hosted version, is when you have added several directories under src
that have individual containers and the size of all the docker builds in the CI pipeline comes up against the 10GB disk storage limit for Microsoft hosted pipelines (see resource limitations of Microsoft hosted agents). In this case (or when other resource constraints are met) switching to a self-hosted agent pipeline may be an option and the template at .azuredevops/ado-ci-pipeline-self-hosted.yml
includes additional steps to help manage space consumed by CI pipeline runs. The two versions are otherwise identitical in terms of building each docker container under src
, running pytest within each of these containers and publishing test results and coverage information.
Github Actions CI pipeline is defined in .github/workflows/ci.yaml
. As long as this repository is hosted in github, the pipeline will be automatically triggered when a PR is made or updated as well as when a PR is merged into your main branch with the setting below, so no additional setting is required.
on:
push:
branches:
- main
pull_request:
branches:
- main
If you have connected to the origin repository using SSH authentication, you will need to do a bit of setup to reuse your local SSH key inside a Dev Container automatically, which will allow you to interact with the origin repository (git push, git pull etc.) inside the Dev Container.
- Try the recommendations in the official docs for sharing git credentials
- If the previous step doesn't work, try the below method, that includes a bit of additional code to add keys to the SSH agent.
Add the following to your ~/.bash_profile, ~/.profile, ~/.zprofile or similar (by default most WSL users will have only a ~/.profile) so an ssh-agent will be started when needed and default keys will be added to the agent. The ssh-agent will then automatically forward keys to your Dev Container when its launched.
# this part taken from https://code.visualstudio.com/remote/advancedcontainers/sharing-git-credentials
# check that link for the latest version or updates
if [ -z "$SSH_AUTH_SOCK" ]; then
# Check for a currently running instance of the agent
RUNNING_AGENT="`ps -ax | grep 'ssh-agent -s' | grep -v grep | wc -l | tr -d '[:space:]'`"
if [ "$RUNNING_AGENT" = "0" ]; then
# Launch a new instance of the agent
ssh-agent -s &> $HOME/.ssh/ssh-agent
fi
eval `cat $HOME/.ssh/ssh-agent`
fi
# ADD SSH Keys to the SSH agent
# if using non-default SSH key, add it to ssh-add command like:
# ssh-add /path/to/your/ssh-key
ssh-add
- Add Docker build caching to Azure DevOps MS hosted CI pipeline
- Investigate making
src/common
installed withpip -e
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines.
Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.
Any use of third-party trademarks or logos are subject to those third-party's policies.