This notebook demonstrates how we can use representation engineering techniques to control an LLM's emotional state and observe the impact on its behavior.
Specifically, we first extract representation vectors for different emotions using LAT (Linear Artificial Tomography) scans on the LLaMA-2-Chat model: we gather emotional text stimuli, wrap them in a LAT task template, pass them through the model, and read out a direction in activation space that tracks each emotion.

We then use these emotion vectors to steer the model's behavior with the RepControl pipeline. Adding the vector for a specific emotion (e.g., happiness) to the model's hidden representations elevates that emotion, and we observe the effect on the model's tone and on its willingness to comply with harmful instructions. This provides evidence that the model has internal representations of emotions that causally influence its behavior. It also reveals a notable vulnerability: emotional manipulation can potentially help circumvent the model's alignment and make it more prone to generating harmful content. A minimal end-to-end sketch of both steps is given below.
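The following is a minimal sketch of both steps using the `repe` package's `rep-reading` and `rep-control` pipelines, not the notebook's actual code. The model checkpoint, the toy happiness stimuli, the chosen layers, and the steering coefficient are illustrative assumptions; the argument names follow the example notebooks in the representation-engineering repository and may differ across versions.

```python
# Sketch only: stimuli, layer choices, and coefficients below are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
from repe import repe_pipeline_registry

repe_pipeline_registry()  # registers the "rep-reading" and "rep-control" pipelines

model_name = "meta-llama/Llama-2-13b-chat-hf"  # assumed checkpoint
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
tokenizer.pad_token_id = tokenizer.eos_token_id

# --- 1) LAT scan: extract an emotion direction from paired stimuli ---------------
# Toy stimuli (placeholders): each pair contrasts a happiness scenario with a
# non-happiness scenario, wrapped in a simple LAT task template.
template = "Consider the emotion of the following scenario:\nScenario: {s}\nAnswer:"
pairs = [
    ("You just got your dream job.", "You missed the last train home."),
    ("Your best friend surprises you with a gift.", "You lost your wallet."),
]
train_data = [template.format(s=s) for pair in pairs for s in pair]
train_labels = [[True, False] for _ in pairs]  # which element of each pair is "happiness"

rep_reading_pipeline = pipeline("rep-reading", model=model, tokenizer=tokenizer)
hidden_layers = list(range(-1, -model.config.num_hidden_layers, -1))
happiness_reader = rep_reading_pipeline.get_directions(
    train_data,
    rep_token=-1,               # read the representation at the last token
    hidden_layers=hidden_layers,
    n_difference=1,             # take differences within each stimulus pair
    train_labels=train_labels,
    direction_method="pca",     # first principal component of the differences
)

# --- 2) RepControl: add the happiness direction during generation ----------------
layer_ids = list(range(-11, -30, -1))  # mid-to-late layers; an assumed choice
coeff = 8.0                            # steering strength; an assumed value
activations = {
    layer: torch.tensor(
        coeff
        * happiness_reader.directions[layer]
        * happiness_reader.direction_signs[layer]
    ).to(model.device).half()
    for layer in layer_ids
}

rep_control_pipeline = pipeline(
    "rep-control",
    model=model,
    tokenizer=tokenizer,
    layers=layer_ids,
    control_method="reading_vec",
)

prompt = "[INST] Tell me about your day. [/INST]"
baseline = rep_control_pipeline(
    [prompt], batch_size=1, max_new_tokens=64, do_sample=False
)
steered = rep_control_pipeline(
    [prompt], activations=activations, batch_size=1, max_new_tokens=64, do_sample=False
)
print(baseline[0][0]["generated_text"])
print(steered[0][0]["generated_text"])
```

In this kind of activation addition, increasing `coeff` generally strengthens the steering effect, while a negative coefficient tends to suppress the emotion instead; the useful range depends on the model and layers chosen.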
For more details, please check out section 6.1 of our RepE paper.