WIP: Create GitHub actions for ROCm #30

Open
wants to merge 40 commits into
base: 2-hardware-agnostic-front-and-backend
40 commits
e535712
dockerfile.dev support rocm
samutamm Nov 14, 2024
4536dc8
Revert "dockerfile.dev support rocm"
Nov 15, 2024
21600a7
rocm specific Dockerfile.rocm
samutamm Nov 15, 2024
a014368
install amdsmi python package
samutamm Nov 18, 2024
4d4a946
Fix permission and AMD SMI issue with Dockerfile.rocm
jakki-amd Nov 21, 2024
b16d6a6
Add ROCM installation to each stage
jakki-amd Nov 22, 2024
7e82be8
Remove extra whitespace
jakki-amd Nov 22, 2024
5af9893
Fix typos
jakki-amd Nov 22, 2024
2b42b63
Add modifications to main Dockerfile and build script
jakki-amd Nov 25, 2024
b9f2af2
Add modifications to main Dockerfile and build script
jakki-amd Nov 25, 2024
9afb983
Remove unnecessary part
jakki-amd Nov 25, 2024
6ae43d2
Reorder and fix installation process
jakki-amd Nov 26, 2024
57a7b1d
Fix missing -y flags from apt remove command
jakki-amd Nov 26, 2024
1455909
Update documentation
jakki-amd Nov 26, 2024
3272c32
Update CI image
jakki-amd Nov 27, 2024
34fe74f
Streamline Dockerfile.rocm
jakki-amd Nov 27, 2024
f9afc30
Add missing step to production images
jakki-amd Nov 27, 2024
cb498f9
Fix missing -y flags from apt remove
jakki-amd Nov 27, 2024
54b44fa
Fix issue with Dockerfile.rocm build
jakki-amd Nov 27, 2024
b302e6f
Remove library installation
jakki-amd Nov 27, 2024
44b5eae
Fix apt lock issue
jakki-amd Nov 27, 2024
07a753f
Add missing line
jakki-amd Nov 27, 2024
0225501
Small fixes
jakki-amd Nov 28, 2024
42a7d88
Test installing amdsmi via apt
jakki-amd Nov 28, 2024
899576f
Replace os.replace with shutil.move to support moving files across fi…
jakki-amd Nov 28, 2024
87222c3
Fix amd-smi-lib install
jakki-amd Nov 28, 2024
d6ecd7f
Removed cached from the installs
jakki-amd Nov 28, 2024
6ce3f8e
Revert to old way of installing
jakki-amd Nov 29, 2024
b701ca8
Remove apt cache
jakki-amd Nov 29, 2024
5983653
Fix typo
jakki-amd Nov 29, 2024
4bcd297
Add wget
jakki-amd Nov 29, 2024
92827a0
Remove sudo install
jakki-amd Nov 29, 2024
5ff11c8
Change base image
jakki-amd Nov 29, 2024
76724d3
Prevent apt from accessing same cache file with lock
jakki-amd Nov 29, 2024
8c81965
Add step to install amdsmi python lib
jakki-amd Nov 29, 2024
4974857
Modify move_logs to merge contents
jakki-amd Nov 29, 2024
05b59f1
Add more error handling for moving logs
jakki-amd Nov 29, 2024
fc1958b
Update Dockerfile to fix rocm installation issues
jakki-amd Dec 4, 2024
d12d701
Add multiple GPU options for GH actions
jakki-amd Nov 26, 2024
50f8279
Modify runs-on labels
jakki-amd Nov 26, 2024
15 changes: 13 additions & 2 deletions .github/workflows/ci_gpu.yml
@@ -16,7 +16,13 @@ concurrency:

jobs:
ci-gpu:
runs-on: [self-hosted, ci-gpu]
runs-on:
- self-hosted
- ci-gpu
- ${{ matrix.gpu-type }}
strategy:
matrix:
gpu-type: [cuda, rocm]
steps:
- name: Clean up previous run
run: |
@@ -41,9 +47,14 @@ jobs:
uses: actions/checkout@v3
with:
submodules: recursive
- name: Install dependencies
- name: Install dependencies for CUDA
if: matrix.gpu-type == 'cuda'
run: |
python ts_scripts/install_dependencies.py --environment=dev --cuda=cu121
- name: Install dependencies for ROCm
if: matrix.gpu-type == 'rocm'
run: |
python ts_scripts/install_dependencies.py --environment=dev --rocm=rocm62
- name: Torchserve Sanity
uses: nick-fields/retry@v3
with:
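The two conditional install steps in this hunk pick one accelerator flag per matrix entry. The selection can be sketched in plain shell as follows, assuming a hypothetical helper `gpu_install_flag` that is not part of this PR; the flag values are the ones shown in the diff.

```shell
#!/bin/sh
# Hypothetical helper (not in the PR): map a matrix gpu-type value to the
# install_dependencies.py flag used by the workflow steps in this diff.
gpu_install_flag() {
    case "$1" in
        cuda) echo "--cuda=cu121" ;;
        rocm) echo "--rocm=rocm62" ;;
        *)    echo "unknown gpu type: $1" >&2; return 1 ;;
    esac
}

# Print the command each matrix entry would run.
for gpu_type in cuda rocm; do
    echo "python ts_scripts/install_dependencies.py --environment=dev $(gpu_install_flag "$gpu_type")"
done
```

A single `run:` block calling such a helper with `${{ matrix.gpu-type }}` would be one way to avoid the duplicated step; keeping two explicit steps, as the PR does, makes each variant easier to spot in the job log.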
8 changes: 7 additions & 1 deletion .github/workflows/kserve_gpu_tests.yml
@@ -8,7 +8,13 @@ on:

jobs:
kserve-gpu-tests:
runs-on: [self-hosted, regression-test-gpu]
runs-on:
- self-hosted
- regression-test-gpu
- ${{ matrix.gpu-type }}
strategy:
matrix:
gpu-type: [cuda, rocm]
steps:
- name: Clean up previous run
run: |
15 changes: 13 additions & 2 deletions .github/workflows/regression_tests_gpu.yml
@@ -17,7 +17,13 @@ concurrency:
jobs:
regression-gpu:
# creates workflows on self hosted runner
runs-on: [self-hosted, regression-test-gpu]
runs-on:
- self-hosted
- regression-test-gpu
- ${{ matrix.gpu-type }}
strategy:
matrix:
gpu-type: [cuda, rocm]
steps:
- name: Clean up previous run
run: |
@@ -44,9 +50,14 @@ jobs:
uses: actions/checkout@v3
with:
submodules: recursive
- name: Install dependencies
- name: Install dependencies for CUDA
if: matrix.gpu-type == 'cuda'
run: |
python ts_scripts/install_dependencies.py --environment=dev --cuda=cu121
- name: Install dependencies for ROCm
if: matrix.gpu-type == 'rocm'
run: |
python ts_scripts/install_dependencies.py --environment=dev --rocm=rocm62
- name: Torchserve Regression Tests
run: |
export TS_RUN_IN_DOCKER=False
73 changes: 64 additions & 9 deletions docker/Dockerfile
@@ -37,12 +37,12 @@ ARG BRANCH_NAME
ARG REPO_URL=https://github.com/pytorch/serve.git
ENV PYTHONUNBUFFERED TRUE

RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt \
RUN --mount=type=cache,sharing=locked,id=apt-dev,target=/var/cache/apt \
apt-get update && \
apt-get upgrade -y && \
apt-get install software-properties-common -y && \
add-apt-repository -y ppa:deadsnakes/ppa && \
apt remove python-pip python3-pip && \
apt remove -y python-pip python3-pip && \
DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
ca-certificates \
g++ \
@@ -55,6 +55,13 @@ RUN --mount=type=cache,id=apt-dev,target=/var/cache/apt \
git \
&& rm -rf /var/lib/apt/lists/*

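# Declare the build arg before its first use: within a build stage, an ARG is
# only visible to instructions that come after its declaration.
ARG USE_ROCM_VERSION=""

RUN --mount=type=cache,sharing=locked,id=apt-dev,target=/var/cache/apt \
if [ "$USE_ROCM_VERSION" ]; then \
apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y rocm-dev amd-smi-lib \
&& rm -rf /var/lib/apt/lists/* ; \
fi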

# Make the virtual environment and "activating" it by adding it first to the path.
# From here on the python$PYTHON_VERSION interpreter is used and the packages
# are installed in /home/venv which is what we need for the "runtime-image"
@@ -67,6 +74,7 @@ RUN python -m pip install -U pip setuptools
RUN export USE_CUDA=1

ARG USE_CUDA_VERSION=""
ARG USE_ROCM_VERSION=""

COPY ./ serve

@@ -76,7 +84,6 @@ RUN \
git clone --recursive $REPO_URL -b $BRANCH_NAME serve; \
fi


WORKDIR "serve"

RUN cp docker/dockerd-entrypoint.sh /usr/local/bin/dockerd-entrypoint.sh
@@ -90,6 +97,14 @@ RUN \
else \
python ./ts_scripts/install_dependencies.py;\
fi; \
elif echo "${BASE_IMAGE}" | grep -q "rocm/"; then \
# Install ROCm version specific binary when ROCm version is specified as a build arg
if [ "$USE_ROCM_VERSION" ]; then \
python ./ts_scripts/install_dependencies.py --rocm $USE_ROCM_VERSION;\
# Install the binary with the latest CPU image on a ROCm base image
else \
python ./ts_scripts/install_dependencies.py; \
fi; \
# Install the CPU binary
else \
python ./ts_scripts/install_dependencies.py; \
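The branch added in this hunk chooses an install command from the base image name and the version build args. A minimal shell sketch of that dispatch follows. The function name `select_install_cmd` and the sample image names are illustrative, not from the PR, and the `cuda:` pattern and `--cuda` flag for the CUDA branch are assumptions, since that part of the `RUN` falls outside the visible hunk.

```shell
#!/bin/sh
# Hypothetical helper (not in the PR): print the install command the
# compile stage would run for a given base image and version args.
select_install_cmd() {
    base_image="$1"
    cuda_version="$2"
    rocm_version="$3"
    script="python ./ts_scripts/install_dependencies.py"

    case "$base_image" in
        *cuda:*)
            # Assumed pattern and flag; the CUDA condition is not in this hunk.
            if [ -n "$cuda_version" ]; then
                echo "$script --cuda $cuda_version"
            else
                echo "$script"
            fi
            ;;
        *rocm/*)
            # Mirrors the diff: grep -q "rocm/" on BASE_IMAGE, then the
            # USE_ROCM_VERSION check.
            if [ -n "$rocm_version" ]; then
                echo "$script --rocm $rocm_version"
            else
                echo "$script"
            fi
            ;;
        *)
            # CPU fallback.
            echo "$script"
            ;;
    esac
}

# Illustrative image name; any name containing "rocm/" takes the ROCm branch.
select_install_cmd "rocm/dev-ubuntu-22.04" "" "rocm62"
```

Note that a ROCm base image without `USE_ROCM_VERSION` falls through to the plain install command, the same one the CPU path runs.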
@@ -111,13 +126,14 @@ FROM ${BASE_IMAGE} AS production-image
# Re-state ARG PYTHON_VERSION to make it active in this build-stage (uses default define at the top)
ARG PYTHON_VERSION
ENV PYTHONUNBUFFERED TRUE
ARG USE_ROCM_VERSION

RUN --mount=type=cache,target=/var/cache/apt \
RUN --mount=type=cache,sharing=locked,target=/var/cache/apt \
apt-get update && \
apt-get upgrade -y && \
apt-get install software-properties-common -y && \
add-apt-repository ppa:deadsnakes/ppa -y && \
apt remove python-pip python3-pip && \
apt remove -y python-pip python3-pip && \
DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
python$PYTHON_VERSION \
python3-distutils \
@@ -130,13 +146,25 @@ RUN --mount=type=cache,target=/var/cache/apt \
&& rm -rf /var/lib/apt/lists/* \
&& cd /tmp

RUN --mount=type=cache,sharing=locked,id=apt-dev,target=/var/cache/apt \
if [ "$USE_ROCM_VERSION" ]; then \
apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y rocm-dev amd-smi-lib \
&& rm -rf /var/lib/apt/lists/* ; \
fi

RUN useradd -m model-server \
&& mkdir -p /home/model-server/tmp

COPY --chown=model-server --from=compile-image /home/venv /home/venv
COPY --from=compile-image /usr/local/bin/dockerd-entrypoint.sh /usr/local/bin/dockerd-entrypoint.sh
ENV PATH="/home/venv/bin:$PATH"

RUN \
if [ "$USE_ROCM_VERSION" ]; then \
python -m pip install /opt/rocm/share/amd_smi; \
fi

RUN chmod +x /usr/local/bin/dockerd-entrypoint.sh \
&& chown -R model-server /home/model-server

@@ -157,13 +185,14 @@ FROM ${BASE_IMAGE} AS ci-image
ARG PYTHON_VERSION
ARG BRANCH_NAME
ENV PYTHONUNBUFFERED TRUE
ARG USE_ROCM_VERSION

RUN --mount=type=cache,target=/var/cache/apt \
RUN --mount=type=cache,sharing=locked,target=/var/cache/apt \
apt-get update && \
apt-get upgrade -y && \
apt-get install software-properties-common -y && \
add-apt-repository -y ppa:deadsnakes/ppa && \
apt remove python-pip python3-pip && \
apt remove -y python-pip python3-pip && \
DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y \
python$PYTHON_VERSION \
python3-distutils \
@@ -183,13 +212,24 @@ RUN --mount=type=cache,target=/var/cache/apt \
&& rm -rf /var/lib/apt/lists/* \
&& cd /tmp

RUN --mount=type=cache,sharing=locked,id=apt-dev,target=/var/cache/apt \
if [ "$USE_ROCM_VERSION" ]; then \
apt-get update && \
DEBIAN_FRONTEND=noninteractive apt-get install --no-install-recommends -y rocm-dev amd-smi-lib \
&& rm -rf /var/lib/apt/lists/* ; \
fi

COPY --from=compile-image /home/venv /home/venv

ENV PATH="/home/venv/bin:$PATH"

RUN python -m pip install --no-cache-dir -r https://raw.githubusercontent.com/pytorch/serve/$BRANCH_NAME/requirements/developer.txt

RUN \
if [ "$USE_ROCM_VERSION" ]; then \
python -m pip install /opt/rocm/share/amd_smi; \
fi

RUN mkdir /home/serve
ENV TS_RUN_IN_DOCKER True

@@ -203,11 +243,13 @@ ARG PYTHON_VERSION
ARG BRANCH_NAME
ARG BUILD_FROM_SRC
ARG LOCAL_CHANGES
ARG USE_ROCM_VERSION
ARG BUILD_WITH_IPEX
ARG IPEX_VERSION=1.11.0
ARG IPEX_URL=https://software.intel.com/ipex-whl-stable
ENV PYTHONUNBUFFERED TRUE
RUN --mount=type=cache,target=/var/cache/apt \

RUN --mount=type=cache,sharing=locked,target=/var/cache/apt \
apt-get update && \
apt-get upgrade -y && \
apt-get install software-properties-common -y && \
@@ -227,9 +269,15 @@ RUN --mount=type=cache,target=/var/cache/apt \
# https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=1009905
openjdk-17-jdk \
build-essential \
wget \
curl \
vim \
numactl \
nodejs \
npm \
zip \
unzip \
&& npm install -g [email protected] newman-reporter-htmlextra markdown-link-check \
&& if [ "$BUILD_WITH_IPEX" = "true" ]; then apt-get update && apt-get install -y libjemalloc-dev libgoogle-perftools-dev libomp-dev && ln -s /usr/lib/x86_64-linux-gnu/libjemalloc.so /usr/lib/libjemalloc.so && ln -s /usr/lib/x86_64-linux-gnu/libtcmalloc.so /usr/lib/libtcmalloc.so && ln -s /usr/lib/x86_64-linux-gnu/libiomp5.so /usr/lib/libiomp5.so; fi \
&& rm -rf /var/lib/apt/lists/*

@@ -243,10 +291,17 @@ RUN \

COPY --from=compile-image /home/venv /home/venv
ENV PATH="/home/venv/bin:$PATH"

RUN \
if [ "$USE_ROCM_VERSION" ]; then \
python -m pip install /opt/rocm/share/amd_smi; \
fi

WORKDIR "serve"

RUN python -m pip install -U pip setuptools \
&& python -m pip install --no-cache-dir -r requirements/developer.txt \
&& python ts_scripts/install_from_src.py \
&& python ts_scripts/install_from_src.py --environment=dev\
&& useradd -m model-server \
&& mkdir -p /home/model-server/tmp \
&& cp docker/dockerd-entrypoint.sh /usr/local/bin/dockerd-entrypoint.sh \