running on MI300X #52

Open · wants to merge 2 commits into master
2 changes: 1 addition & 1 deletion .gitignore
@@ -1,4 +1,4 @@
 fineweb10B/
 pylog124M/
 __pycache__/
-logs/
+logs/*/*.pt
15 changes: 15 additions & 0 deletions README.md
@@ -58,6 +58,21 @@ sudo docker run -it --rm --gpus all -v $(pwd):/modded-nanogpt modded-nanogpt sh
```
---

## Running on AMD MI300X

To install and run the training, use the following commands, adapted from the H100 setup above.
They should all complete within <20min on an 8xMI300X node with a decent internet connection.
If the torch install command updates your ROCm installation, you may need to reboot.
```bash
git clone https://github.com/KellerJordan/modded-nanogpt.git && cd modded-nanogpt
pip install uv
uv pip install -r requirements.txt
uv pip install --pre torch==2.6.0.dev20241122+rocm6.2 --index-url https://download.pytorch.org/whl/nightly/rocm6.2 --upgrade # nightly torch 2.6.0 built against ROCm 6.2
python data/cached_fineweb10B.py 10 # downloads only the first 1.0B training tokens to save time
./run-rocm.sh
```
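The contents of `run-rocm.sh` are not shown in this hunk. As a rough sketch, assuming it mirrors the CUDA `run.sh`, it would launch the trainer across all eight GPUs with `torchrun`; the actual script in this PR may additionally set ROCm-specific environment variables:

```bash
#!/bin/bash
# Hypothetical sketch of run-rocm.sh, assuming it mirrors run.sh;
# refer to the script shipped in this PR for the authoritative version.
torchrun --standalone --nproc_per_node=8 train_gpt2.py
```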
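Before launching training, it can help to confirm that the ROCm build of torch was actually picked up and that all eight GPUs are visible. A minimal check (the exact version string printed will depend on the nightly installed above):

```bash
# Verify the ROCm torch build and GPU visibility; device_count() should print 8.
python -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.device_count())"
rocm-smi  # should list all eight MI300X devices
```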


## World record history

The following is the progression of world records for the task of *training a model with 124M active parameters to 3.28 validation loss on FineWeb in the minimal amount of time on an 8xH100 machine.*