This repository provides the paper and code of Token Splitting Effect, a study that explores prompt tuning across multiple Large Language Models (LLMs) for improved interpretability and transferability.
Prompt tuning in natural language processing allows leveraging Large Language Models (LLMs) efficiently. Yet soft prompts struggle with transferability between models and interpretability. This study introduces a method of tuning soft prompts across multiple models to uncover the "token splitting effect". Prompt tokens align with specific models, lying in their embedding spaces and playing a crucial role in their performance. We show how this hinders transferability but allows prompt compression after training. Our research highlights how soft prompts function across multiple models and sets a foundation for optimizing soft prompts through strategic compression and a deeper understanding of token importance.
All results and artifacts can be found in logs/explainable-soft-prompts. All training configurations used are stored in cfgs.
To reproduce the paper results, use the scripts from the scripts folder or run_emotion.sh, run_imdb.sh and run_mnli.sh.
For detailed usage, see the documentation of the NLP research template that this work builds upon.
The recommended usage is:
- Clone the repository and change into it:
git clone https://github.com/FSadrieh/explainable-soft-prompts.git
cd explainable-soft-prompts
- Run:
bash ./scripts/run-in-docker.sh -g <wanted-gpus> bash
- Run the desired script from the scripts folder to reproduce specific results.
If you use the code in this repository, please cite:
@software{ExplainableSoftPrompts2024,
author = {Reimann, Raphael and Sadrieh, Frederic},
title = {{Explainable Soft Prompts}},
url = {https://github.com/FSadrieh/explainable-soft-prompts},
month = {02},
year = {2024}
}
The code in this repository builds on an NLP Research Template.
Template documentation
NLP research template for training language models using PyTorch + Lightning + Weights & Biases + HuggingFace. It's built to be customized but provides comprehensive, sensible default functionality.
If you are not doing NLP or want to use your own training code or template, the setup and environment tooling with Docker, mamba, and conda-lock in this template might still be interesting for you.
It's recommended to use mamba to manage dependencies. mamba is a drop-in replacement for conda re-written in C++ to speed things up significantly (you can stick with conda though). To provide reproducible environments, we use conda-lock to generate lockfiles for each platform.
Installing mamba
On Unix-like platforms, run the snippet below. Otherwise, visit the mambaforge repo. Note this does not use the Anaconda installer, which reduces bloat.
curl -L -O "https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh"
bash Mambaforge-$(uname)-$(uname -m).sh
Installing conda-lock
The preferred method is to install conda-lock using pipx install conda-lock. For other options, visit the conda-lock repo. For basic usage, have a look at the commands below:
conda-lock install --name gpt5 conda-lock.yml # create environment with name gpt5 based on lockfile
conda-lock # create new lockfile based on environment.yml
conda-lock --update <package-name> # update specific packages in lockfile
Lockfiles are an easy way to exactly reproduce an environment.
After having installed mamba and conda-lock, you can create a mamba environment named gpt5 from a lockfile with all necessary dependencies installed like this:
conda-lock install --name gpt5 conda-lock.yml
You can then activate your environment with
mamba activate gpt5
To generate new lockfiles after updating the environment.yml file, simply run conda-lock -f environment.yml.
Setup on ppc64le
If you're not using a PowerPC machine, do not worry about this.
Whenever you create an environment for a different processor architecture, some packages (especially pytorch) need to be compiled specifically for that architecture. IBM PowerPC machines, for example, use a processor architecture called ppc64le.
Setting up the environment for ppc64le is a bit tricky because the official channels do not provide packages compiled for ppc64le. However, we can use the Open-CE channel instead. A lockfile containing the relevant dependencies is already prepared in ppc64le.conda-lock.yml and the environment can again be installed with:
conda-lock install --name gpt5-ppc64le ppc64le.conda-lock.yml
Dependencies for ppc64le should go into the separate ppc64le.environment.yml file. Use the following command to generate a new lockfile after updating the dependencies:
conda-lock --file ppc64le.environment.yml --lockfile ppc64le.conda-lock.yml
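The choice between the two lockfiles can also be scripted. The following is a minimal sketch, not part of the template: pick_lockfile and install_command are hypothetical helpers that select the lockfile from the architecture reported by Python's platform module. The file and environment names are the ones used in the commands above.

```python
import platform


def pick_lockfile(machine=None):
    """Return (environment name, lockfile) for a CPU architecture.

    Hypothetical helper: defaults to the architecture of the current machine.
    """
    machine = machine or platform.machine()
    if machine == "ppc64le":
        return "gpt5-ppc64le", "ppc64le.conda-lock.yml"
    return "gpt5", "conda-lock.yml"


def install_command(machine=None):
    """Build the conda-lock install command as an argument list."""
    name, lockfile = pick_lockfile(machine)
    return ["conda-lock", "install", "--name", name, lockfile]


if __name__ == "__main__":
    # Print the command for the machine this script runs on.
    print(" ".join(install_command()))
```

The argument list can be passed to subprocess.run directly, which avoids any shell quoting concerns.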
For fully reproducible environments and running on HPC clusters, we provide pre-built docker images at konstantinjdobler/nlp-research-template. We also provide a Dockerfile that allows you to build new docker images with updated dependencies:
# first update `environment.yml` with your dependencies
# then this command will create a new conda-lock.yml file
conda-lock -f environment.yml
# this automatically uses your latest conda-lock.yml to create a reproducible docker image
docker build --tag <username>/<imagename>:<tag> --platform="linux/amd64" .
The specified username should be your personal dockerhub username. This will make distribution and usage of your images easier with docker push/pull <your image>.
We also provide shell commands and a convenience script to run all your training commands inside docker (recommended).
After all of this setup you are finally ready for some training. First of all, you need to create your data directory with a train.txt and dev.txt. Then you can start a training run in your environment with:
python train.py -n <run-name> -d /path/to/data --model roberta-base --offline
To see an overview of all options and their defaults, run python train.py --help or have a look inside args.py. We have disabled Weights & Biases syncing with the --offline flag. If you want to log your results, enable W&B as described below and omit the --offline flag.
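Before launching a run, it can help to confirm that the data directory described above has the expected layout. This is a minimal sketch that assumes only what is stated here (a directory containing train.txt and dev.txt). The helper name missing_data_files is hypothetical, and train.py may perform its own checks.

```python
from pathlib import Path

# File names taken from the setup description above.
REQUIRED_FILES = ("train.txt", "dev.txt")


def missing_data_files(data_dir):
    """Return the required files that are absent from data_dir."""
    root = Path(data_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]
```

An empty list means the directory is ready to be passed to the -d option.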
Using GPUs for hardware acceleration
By default, train.py tries to use a single CUDA GPU if available. If you want to train on multiple GPUs, increase the --num_devices flag (this then uses DistributedDataParallel under the hood). IMPORTANT: you should always select the GPUs that are visible to the script via the CUDA_VISIBLE_DEVICES environment variable (e.g. CUDA_VISIBLE_DEVICES=0,2 python train.py ...) or via the docker flags if training inside a container (recommended). To use different hardware accelerators, use the --accelerator flag. You can use advanced parallel training strategies with --distributed_strategy.
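To make the relationship between CUDA_VISIBLE_DEVICES and --num_devices concrete, here is a small sketch. Both helpers are hypothetical and not part of train.py. They only handle the comma-separated form shown above and do not validate the individual entries.

```python
import os


def visible_gpu_ids(env=None):
    """Parse CUDA_VISIBLE_DEVICES into a list of device identifiers.

    Returns None when the variable is unset (CUDA then exposes all GPUs)
    and an empty list when it is set to an empty string (no GPUs visible).
    """
    env = os.environ if env is None else env
    value = env.get("CUDA_VISIBLE_DEVICES")
    if value is None:
        return None
    return [part.strip() for part in value.split(",") if part.strip()]


def check_num_devices(num_devices, env=None):
    """Raise ValueError if more devices are requested than are visible."""
    ids = visible_gpu_ids(env)
    if ids is not None and num_devices > len(ids):
        raise ValueError(
            f"requested {num_devices} devices, but only {len(ids)} visible: {ids}"
        )
```

With CUDA_VISIBLE_DEVICES=0,2, a value of --num_devices up to 2 passes the check, while 3 raises an error.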
To conveniently run the training code inside a docker container, you can use the run-in-docker.sh script.
# execute the training inside your container
# -g 2 means only GPU 2 is visible to the script
# -g 0,2 would make the GPUs 0 and 2 visible
bash ./scripts/run-in-docker.sh -g 2 python train.py --num_devices 1 -n <run-name> -d /path/to/data/ --model roberta-base --offline
By default (no -g flag), no GPUs are available inside the container. You probably want to adjust the run-in-docker.sh script to add your own mounts for data and other things you want to load / save.
Docker + GPUs: You should always select specific GPUs to be visible inside the container. When using the run-in-docker.sh script, use the -g flag. When using docker natively, use e.g. --gpus='"device=0,7"' (for the GPUs 0 and 7) and adjust the --num_devices flag according to your number of selected GPUs. Yes, the weird format of --gpus='"device=0,7"' is important, otherwise the shell might not pass the flag correctly to nvidia-docker (official Nvidia recommendation).
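The effect of the nested quotes can be checked with Python's shlex module, which follows POSIX shell splitting rules. The sketch below shows that the shell consumes the outer single quotes while the inner double quotes survive as part of the argument. The helper gpus_flag is hypothetical and builds the argument as docker receives it, for example when passing an argument list to subprocess.run.

```python
import shlex


def gpus_flag(device_ids):
    """Build the --gpus argument for specific device ids, as docker receives it."""
    joined = ",".join(str(i) for i in device_ids)
    return f'--gpus="device={joined}"'


# What is typed on the command line, including both layers of quotes.
typed = """--gpus='"device=0,7"'"""

# What the shell hands to docker: the inner double quotes are preserved.
received = shlex.split(typed)

# With only one layer of quotes, the shell strips them entirely.
stripped = shlex.split('--gpus="device=0,7"')
```

Here received holds a single argument that still contains the double quotes around device=0,7, whereas stripped holds the argument without any quotes.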
Single-line docker command
You can start a script inside a docker container in a single command:
docker run -it --user $(id -u):$(id -g) --ipc host -v "$(pwd)":/workspace -w /workspace --gpus='"device=7"' konstantinjdobler/nlp-research-template:latest python train.py --num_devices=1 ...
Since we have not mounted any cache directories (only the current working directory with $(pwd)), nothing that is written to disk outside $(pwd) is persistent in this example. You can add those with -v or --mount.
Using Docker with SLURM / pyxis
For security reasons, docker might be disabled on your HPC cluster. You might be able to use the SLURM plugin pyxis instead like this:
srun ... --container-image konstantinjdobler/nlp-research-template:latest python train.py ...
This uses enroot under the hood to import your docker image and run your code inside the container. See the pyxis documentation for more options, such as --container-mounts or --container-writable.
It might take a long time to start the container. You can speed this up by importing the image ahead of time with enroot import docker://konstantinjdobler/nlp-research-template:latest -o prepared-image.sqsh and then modifying the srun command:
srun ... --container-image /path/to/prepared-image.sqsh python train.py ...
If you want to run an interactive session with bash, don't forget the --pty flag.
Weights & Biases allows you to easily log metrics, training results, checkpoints, and hyperparameters. To enable Weights & Biases, enter your WANDB_ENTITY and WANDB_PROJECT in train.py and omit the --offline flag for training.
Weights & Biases + Docker
When using docker, we also have to get our WANDB_API_KEY inside the container. You can find your personal API key at wandb.ai/authorize. Set WANDB_API_KEY on your host machine and use the docker flag --env WANDB_API_KEY when starting your run. Or just use the run-in-docker.sh script, which will try to parse the WANDB_API_KEY from your ~/.netrc file (or get it from the environment).
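For illustration, the lookup can be sketched in Python with the standard-library netrc module. This is not the logic of run-in-docker.sh, which is a shell script. The function name find_wandb_api_key is hypothetical, the host name api.wandb.ai is an assumption about the entry that W&B writes to the netrc file, and checking the environment first is a choice made for this sketch.

```python
import netrc
import os


def find_wandb_api_key(netrc_path=None, host="api.wandb.ai"):
    """Return the W&B API key from the environment or a netrc file, else None."""
    # Assumption of this sketch: the environment variable takes priority.
    key = os.environ.get("WANDB_API_KEY")
    if key:
        return key
    path = netrc_path or os.path.expanduser("~/.netrc")
    try:
        auth = netrc.netrc(path).authenticators(host)
    except (OSError, netrc.NetrcParseError):
        # Missing, unreadable, or malformed netrc file.
        return None
    # authenticators() returns (login, account, password); the key is the password.
    return auth[2] if auth else None
```

A None result means the key has to be supplied another way, for example by exporting WANDB_API_KEY before starting the container.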
To save the exact configurations of experiments and save yourself some time typing out arguments in the command line, you can use .yml style config files supplied via the --config_path argument. You can also combine multiple configs. The order of importance is default args < config args (multiple configs are resolved in order) < command line args.
python train.py --config_path ./cfgs/example.yml ./cfgs/llama-from-scratch.yml --devices 8 -n my-training-run ...
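The precedence rule can be pictured as successive dictionary updates. This is a simplified sketch, not the template's argument parser: resolve_args is a hypothetical name, the keys are illustrative, and the configs are shown as already-loaded dictionaries rather than .yml files.

```python
def resolve_args(defaults, configs, cli_args):
    """Merge settings so that later sources override earlier ones.

    Order of importance: defaults < configs (in the order given) < command line.
    """
    resolved = dict(defaults)
    for config in configs:
        resolved.update(config)
    resolved.update(cli_args)
    return resolved


# Illustrative values only; the real options are defined in args.py.
defaults = {"model": "roberta-base", "num_devices": 1, "offline": False}
first_config = {"num_devices": 2, "offline": True}
second_config = {"num_devices": 4}
cli_args = {"num_devices": 8}

resolved = resolve_args(defaults, [first_config, second_config], cli_args)
```

In this example the model comes from the defaults, the offline setting from the first config, and the device count from the command line, since it is the last source applied.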
If you want to connect to a remote host machine with GPUs for development, we recommend the VS Code Remote-SSH extension.
Ideally, you should also do your development inside the same docker container to reduce mismatches between training and development. For this, use VS Code Dev Containers. They allow you to develop in VS Code inside a docker container with full support for IntelliSense, type hints and more. The template already contains a .devcontainer directory, where all the settings for it are stored, so you can start right away!
VS Code Dev Container example
After installing the Remote-SSH and Dev Containers extensions, set up your Dev Container in the following way:
- Establish the SSH connection with the host by opening your VS Code command palette and typing Remote-SSH: Connect to Host. Now you can connect to your host machine.
- Open the folder that contains this template on the host machine.
- VS Code will automatically detect the .devcontainer directory and ask you to reopen the folder in a Dev Container. Alternatively, use the command palette and type Dev Containers.
- Press Reopen in Container and wait for VS Code to set everything up. The first time, or whenever you change devcontainer.json, you will need to do Rebuild and reopen in Container.
There is a bit of setup: for a proper dev environment, you will need to configure mounts (cache directories, your datasets, ...) and environment variables, just as you would for a regular docker run command. Have a look inside .devcontainer/devcontainer.json.
conda-lock is automatically installed for you but you have to add the --micromamba flag inside the Dev Container (e.g. conda-lock --micromamba -f environment.yml).
If you want to use GPUs for development, you also need to specify the GPU you want to use in .devcontainer/devcontainer.json. However, this is a bit cumbersome if you are often switching between GPUs. Alternatively, you can edit your code in the Dev Container (without a GPU) but start all actual development runs of your script like you would for training with run-in-docker.sh and select the GPU ad hoc. The advantage of Dev Containers is that you are still using the exact same docker container for both.
Sometimes it's just quicker or unavoidable to create an environment via conda-lock install --name gpt5 conda-lock.yml instead of using Docker. In most cases, this is fine since we are using lockfiles, but there might be some tricky edge cases depending on the platform and OS. Just be careful to keep any local environments and your docker containers in sync. Docker containers also allow more advanced support for compiled CUDA kernels such as FlashAttention.
We use the ruff linter and black formatter. You should install their VS Code extensions and enable "Format on Save" inside VS Code.