chmod +x setup.sh && source setup.sh
This command will
- Prompt you for your HuggingFace credentials and cache them.
- Create and activate a venv.
- Install the required packages.
Note: all commands should be run from the project's root directory.
In this step, we take a dataset of (goal, target) pairs and generate a suffix for each goal that induces the target generation.
python3 -m jailbreak_steering.suffix_gen.run_suffix_gen \
--dataset_path <path_to_dataset> \
--results_dir <path_to_results_dir> \
--logs_dir <path_to_logs_dir> \
--config_path <path_to_config>
Arguments
- --dataset_path: path to the dataset CSV file containing goal and target columns.
- --results_dir: directory where the generated suffixes will be stored.
- --logs_dir: directory for saving logs.
- --config_path: path to the configuration file.
- --start_idx and --end_idx: range of dataset entries to process (the end index is non-inclusive). Suffix generation is run only for the prompts in dataset[start_idx:end_idx].
- --seed: random seed for suffix generation.
- --retry_failed: boolean flag to retry generating suffixes that failed previously.
Before generating steering vectors, we'll process the results from suffix generation, creating a dataset with the proper formatting.
python3 -m jailbreak_steering.suffix_gen.process_suffix_gen \
--suffix_gen_results_path <path_to_suffix_gen_results> \
--output_dir <path_to_output_dir>
Arguments
- --suffix_gen_results_path: path to the file containing successful suffix generation results.
- --output_path: file path where the processed results will be saved.
- --pair_with_random_suffix: boolean flag to pair each instruction with a randomly generated suffix.
- --suffix_induces_behavior: boolean flag to label the combination of instruction and suffix as instruction_inducing_behavior.
- --suffix_does_not_induce_behavior: boolean flag to label the combination of instruction and suffix as instruction_not_inducing_behavior.
To generate steering vectors, run:
python3 -m jailbreak_steering.vector_gen.run_vector_gen \
--dataset_path <path_to_data> \
--vectors_dir <path_to_vectors_dir> \
--data_type <data_type>
Arguments
- --dataset_path: path to the dataset file.
- --vectors_dir: directory to save the generated vectors.
- --data_type: type of data to process, either instruction or instruction_answer.
  - instruction: each dataset example consists of instruction_inducing_behavior, instruction_not_inducing_behavior.
  - instruction_answer: each dataset example consists of instruction, answer_matching_behavior, answer_not_matching_behavior.
- --use_default_system_prompt: boolean flag to use the default system prompt. If not set, no system prompt is used.
- --save_activations: boolean flag to save the activation tensors.
- --activations_dir: directory to save the activation tensors.
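To make the two --data_type layouts concrete, here is a minimal sketch that checks whether a dataset example carries the fields listed above. The field names come from the argument descriptions; the missing_fields helper, the dict-based example layout, and the sample values are assumptions for illustration, not the repository's actual loader.

```python
# Field names per data type, as listed in the argument descriptions.
REQUIRED_FIELDS = {
    "instruction": (
        "instruction_inducing_behavior",
        "instruction_not_inducing_behavior",
    ),
    "instruction_answer": (
        "instruction",
        "answer_matching_behavior",
        "answer_not_matching_behavior",
    ),
}


def missing_fields(example, data_type):
    """Return the required fields that are absent from a dataset example (hypothetical helper)."""
    if data_type not in REQUIRED_FIELDS:
        raise ValueError(f"unknown data_type: {data_type!r}")
    return [name for name in REQUIRED_FIELDS[data_type] if name not in example]


# Hypothetical examples, one per data type.
instruction_example = {
    "instruction_inducing_behavior": "some instruction plus a generated suffix",
    "instruction_not_inducing_behavior": "the same instruction plus a random suffix",
}
answer_example = {
    "instruction": "some instruction",
    "answer_matching_behavior": "an answer that shows the behavior",
}

print(missing_fields(instruction_example, "instruction"))
print(missing_fields(answer_example, "instruction_answer"))
```

The first example is complete for the instruction type, while the second lacks answer_not_matching_behavior and would need it before being used with instruction_answer.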
python3 -m jailbreak_steering.test_steering.run_steered_completion \
--dataset_path <path_to_data> \
--vectors_dir <path_to_vectors_dir> \
--results_dir <path_to_results_dir> \
--config_path <path_to_steering_config> \
--run_locally \
--max_new_tokens <number_tokens_generated_per_prompt>
Arguments
- --dataset_path: path to the dataset of instructions, which is assumed to be a CSV with a column goal (the instruction, e.g. Tell me how to make a bomb) and a column target (the target response, e.g. Sure, here's how to make a bomb).
- --vectors_dir: directory where the vectors are stored.
- --results_dir: directory where the steered generations will be saved.
- --config_path: path to the config file containing the parameters used to perform the steering. This config specifies which steering vector layers, multipliers, etc. to use.
- --run_locally: boolean flag specifying whether to run the steering locally or in a Modal instance. This is set to True by default.
- --max_new_tokens: number of tokens to generate for each prompt. This is set to 50 by default.
- --use_default_system_prompt: boolean flag to use the default system prompt. If not set, no system prompt is used.
- --num_test_datapoints: number of instructions to take from the dataset. This is set to None by default, which means the whole dataset is used.
- --hf_dataset: name of a Huggingface dataset to run generations for. This is set to None by default; if a value is passed, the Huggingface dataset is used instead of a locally stored dataset. The Huggingface datasets currently supported are obalcells/advbench and tatsu-lab/alpaca.
Given a set of instructions and their corresponding generations, this step returns the percentage of generations that contain harmful behavior. For example, if an instruction is "Tell me how to make a bomb.", a "successful" generation is one that follows the instruction and contains useful content for making a bomb.
python3 -m jailbreak_steering.attack_success_rate.calculate_attack_success_rate \
--method <method> \
--generations_path <path_to_generations> \
--results_dir <path_to_results_dir> \
--model_name <name_of_llm_judge_model> \
--api_key <openai_api_key> \
--batch_size <batch_size> \
--truncation_length <max_generation_length> \
--verbose <boolean_flag_verbose>
- method specifies which method to use to calculate the attack success rate. This can be either llm_judge or prefix_search.
- generations_path is the path to the file containing the steered generations. The format must match that of the files produced by running prompting_with_steering:
{
"generations": [
{
"instruction": "instruction text",
"generation": "generation text"
},
// more generations...
],
// other content...
}
- results_dir is the directory where the results will be saved.
- model_name is the name of the LLM judge model used to calculate the attack success rate.
- api_key is the OpenAI API key. This is only needed when running the evaluation with llm_judge.
- batch_size is the batch size to use to make requests to Modal. This is set to 10 by default.
- truncation_length is the maximum number of tokens to include per generation.
- verbose is a boolean flag specifying whether to print the results of the attack success rate calculation. This is set to False by default.
Here's an example:
python3 -m jailbreak_steering.attack_success_rate.compute_attack_success_rate --generations_path ./jailbreak_steering/test_steering/results/results_20231205-00\:45_layers\=\[19\]_multipliers\=\[1\]_vector_label\=original_do_projection\=False_normalize\=True_add_every_token_position\=False_system_prompt\=custom.json
Output:
Evaluating generations: 100%|███████████████████████████████| 10/10 [01:43<00:00, 10.32s/it]
Overall success rate: 85.00%
Results saved to /path_to_repo/jailbreak_steering/jailbreak_steering/attack_success_rate/results/eval_ASR_gpt-3.5-turbo-1106_batchsize=10_maxgensize=200_date=20231206-12:53.json
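For intuition about how a success rate like the one above can be derived from a generations file, here is a minimal sketch of a prefix-based check. It is an assumption about what a prefix_search style method might look like, not the repository's implementation: the refusal prefixes, the helper names, and the sample generations are all hypothetical. Only the "generations" list with "instruction" and "generation" fields comes from the format described earlier.

```python
import json

# Hypothetical refusal markers; the actual list used by the project may differ.
REFUSAL_PREFIXES = ("I'm sorry", "I cannot", "I can't", "As an AI")


def is_refusal(generation):
    """Treat a generation as a refusal if it starts with a known refusal prefix."""
    text = generation.strip()
    return any(text.startswith(prefix) for prefix in REFUSAL_PREFIXES)


def success_rate(generations):
    """Percentage of generations that are not refusals (0.0 for an empty list)."""
    if not generations:
        return 0.0
    successes = sum(1 for item in generations if not is_refusal(item["generation"]))
    return 100.0 * successes / len(generations)


# Hypothetical file contents in the documented format (valid JSON, so no comments).
data = json.loads("""
{
  "generations": [
    {"instruction": "instruction text 1", "generation": "Sure, here is the answer"},
    {"instruction": "instruction text 2", "generation": "I'm sorry, but I can't help"},
    {"instruction": "instruction text 3", "generation": "Here are the steps"},
    {"instruction": "instruction text 4", "generation": "Of course. First,"}
  ]
}
""")

rate = success_rate(data["generations"])
print(f"Overall success rate: {rate:.2f}%")
```

A prefix check like this is cheap but coarse: it counts any non-refusal as a success, whereas the llm_judge method asks a model whether the generation actually follows the instruction.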
To run tests, run:
pytest