This repo contains the accompanying code for the paper. The pursuit of interpretability in much alignment research closely resembles topic modelling. Our work applies methodologies from topic modelling to 'explain' neurons in Transformer Language Models, discovering superpositions within individual neurons.
For each neuron and each polarity of activation, we:
- Project to a corpus -- from the activated tokens, we construct a word pool w.r.t. the corpus (a hedged sketch of this step follows the example pool below)
accounts administration agreement artist asks attributed bass basses bassist beat beatings bird blocks bloomberg book booked booking bookings boxes broad broadband cat circuit client clients commented connecting consist consists crowd cursed curses direct doesn ease entire facebook finding finds fragments gentle guest habit helps historians histories impression includes increases intention intermediate joins leave leaves lift loves loving moment mounted net occupied officials oil organ pick picking platforms plays plus preserve preserved prevent prevented protest protested protesting provides registers relationships reports sales sam says series shows size skeletal sleep sleeping sons sparked stop strikes students tells threatened touch track tracked treat treaties tries users uses valuable victim videos views villa vista voice witness witnessed witnesses
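Below is a minimal sketch of how such a pool could be assembled; the record format, `vocab`, `top_k`, and the token-cleaning step are illustrative assumptions, not the repo's actual API.

```python
import re
from collections import Counter

def build_word_pool(activation_records, vocab, top_k=1000):
    """Collect corpus words for the tokens on which a neuron activates strongly.

    activation_records: iterable of (token_string, activation_value) pairs,
    e.g. gathered from a forward pass over the corpus (illustrative format).
    vocab: set of corpus words (e.g. the Wikipedia-V40K vocabulary) to project onto.
    """
    # Keep the highest-activating token occurrences.
    ranked = sorted(activation_records, key=lambda r: r[1], reverse=True)[:top_k]
    pool = Counter()
    for token, _ in ranked:
        word = re.sub(r"[^a-z]", "", token.lower())  # strip BPE markers / punctuation
        if word in vocab:
            pool[word] += 1
    return sorted(pool)  # alphabetical word pool, as in the example above
```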
- Shortlist plausible word sets -- using our star heuristic, we shortlist disjoint word sets from the pools (a simplified greedy sketch follows the example sets below)
book boot box certain current display displays editing intended introduced like menu operating possible provides registry reviewed save secret shows similar single special target temporary folder (26)
absolute branch computation continuous direct discussed discussion exact fact geometry introduce note property showed showing speaking straightforward subject true work algebraic (21)
...
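The star heuristic itself is described in the paper and realised with the diversity-constrained greedy code linked below; the following is only a simplified greedy stand-in to illustrate carving disjoint, mutually coherent word sets out of a pool. `pair_score` is an assumed interface (e.g. an NPMI lookup built with topic-metrics).

```python
def greedy_shortlist(pool, pair_score, set_size=20, num_sets=5):
    """Greedily grow disjoint word sets whose members score well together.

    pair_score(w1, w2) -> float is assumed to return a co-occurrence-based
    score (e.g. NPMI); this is an illustrative interface only.
    """
    remaining = set(pool)
    shortlists = []
    for _ in range(num_sets):
        if not remaining:
            break
        # Seed each set with the word that pairs best with the rest of the pool.
        seed = max(remaining,
                   key=lambda w: sum(pair_score(w, v) for v in remaining if v != w))
        current = [seed]
        remaining.remove(seed)
        while len(current) < set_size and remaining:
            best = max(remaining, key=lambda w: sum(pair_score(w, v) for v in current))
            current.append(best)
            remaining.remove(best)
        shortlists.append(current)
    return shortlists
```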
- Exact solving -- find an optimal set of words that can be used for comparison to 'infer' the role(s) of the neuron (example shown: LLaMA-13B 3-226; a hedged CVXPY sketch follows the sets below)
certain intended menu boot possible display displays operating folder target 0.0738
note geometry true continuous straightforward absolute computation fact exact algebraic 0.0876
...
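The exact-solving step can be posed as selecting k words from a shortlist so that the sum of pairwise coherence scores is maximised. The sketch below uses one standard MILP linearisation through CVXPY; it is not necessarily the exact formulation in the paper, and the `score` matrix and `k` are assumptions.

```python
import cvxpy as cp
import numpy as np

def solve_word_set(words, score, k=10, solver=cp.SCIP):
    """Pick k of the shortlisted words maximising summed pairwise scores.

    words: list of n candidate words; score: symmetric (n, n) array of
    pairwise coherence values. Hedged sketch of a standard linearisation,
    not necessarily the paper's exact formulation.
    """
    n = len(words)
    x = cp.Variable(n, boolean=True)       # 1 if word i is selected
    y = cp.Variable((n, n), nonneg=True)   # linearised product x[i] * x[j]

    constraints = [cp.sum(x) == k]
    for i in range(n):
        for j in range(i + 1, n):
            constraints += [y[i, j] <= x[i],
                            y[i, j] <= x[j],
                            y[i, j] >= x[i] + x[j] - 1]

    objective = cp.Maximize(cp.sum(cp.multiply(np.triu(score, 1), y)))
    prob = cp.Problem(objective, constraints)
    prob.solve(solver=solver)              # SCIP requires PySCIPOpt; cp.GUROBI also works
    chosen = [w for w, flag in zip(words, x.value) if flag > 0.5]
    return chosen, prob.value
```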
Wikipedia-V40K (18.1 GB), count graphs of a 40K vocabulary and some entities
Our results (3.9 GB) from mining GPT-2 via projecting to Wikipedia-V40K
Our results (4.3 GB) from mining LLaMA via projecting to Wikipedia-V40K
Results are structured as {size}/{mode}/{layer}_{id}_{polarity}{file}; an empty file means nothing was extracted
- size: e.g. 1558M, 7B
- mode: i/ii/iii/iiii (Hi-C, Hi-I, Lo-C, Lo-I respectively)
- layer/id: int
- polarity: 'pos' or 'neg'
- file: pool ('.pkl') or '_topics.csv' or '_isets.csv'
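As an illustration of the layout above, a hedged sketch for walking the released results (the mode-name mapping follows the list above; only the `_topics.csv` files are matched here):

```python
from pathlib import Path

MODE_NAMES = {"i": "Hi-C", "ii": "Hi-I", "iii": "Lo-C", "iiii": "Lo-I"}

def iter_topic_results(root):
    """Yield one record per non-empty {layer}_{id}_{polarity}_topics.csv file."""
    for path in Path(root).glob("*/*/*_topics.csv"):
        if path.stat().st_size == 0:
            continue  # empty file: nothing was extracted for this neuron
        size, mode = path.parts[-3], path.parts[-2]
        layer, neuron_id, polarity = path.stem.replace("_topics", "").split("_")
        yield {"size": size, "mode": MODE_NAMES.get(mode, mode),
               "layer": int(layer), "id": int(neuron_id),
               "polarity": polarity, "path": path}
```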
Jay Mody's picoGPT for GPT-2 (since we only need the forward pass, a small codebase is easier to tinker with)
Meta's LLaMA v1 (Note: forward-pass modifications to it fall under GNU GPL v3)
Diversity-constrained Greedy (EMNLP'22) from https://github.com/PreferredAI/ReIntNTM
PreferredAI topic-metrics (ACL'23) from https://github.com/PreferredAI/topic-metrics/
For convenience, we include a CVXPY implementation that serves as an interface to different solvers (Gurobi, CPLEX, etc.). Among non-commercial solvers, we recommend SCIP via PySCIPOpt (Apache 2.0 license).
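For instance, swapping solvers only changes the `solver=` argument passed to CVXPY; a minimal sanity check that SCIP is wired up (assuming `cvxpy` and `pyscipopt` are installed) could look like:

```python
import cvxpy as cp

# Tiny toy problem to confirm a mixed-integer solver is available through CVXPY.
x = cp.Variable(3, boolean=True)
prob = cp.Problem(cp.Maximize(cp.sum(x)), [cp.sum(x) <= 2])
prob.solve(solver=cp.SCIP)   # swap in cp.GUROBI or cp.CPLEX if licensed
print(prob.value, x.value)
```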
If you found these resources helpful, we'd appreciate a citation!
@inproceedings{lim-lauw-2023-disentangling,
title = "Disentangling Transformer Language Models as Superposed Topic Models",
author = "Lim, Jia Peng and
Lauw, Hady",
booktitle = "Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing",
month = dec,
year = "2023",
address = "Singapore, Singapore",
publisher = "Association for Computational Linguistics"
}