Awesome Explanatory Supervision

Overview of literature on learning from supervision on the model's explanations. A .bib file of the papers below can be downloaded here.

Warning: permanent WIP.

Did we miss a relevant paper? Please submit a new entry in the following format:

- **An Artificially-intelligent Means to Escape Discreetly from the Departmental Holiday Party; guide for the socially awkward**
  Eve Armstrong; arXiv 2020 [paper](https://arxiv.org/abs/2003.14169)
  `Notes: it is a joke; a pretty good joke actually.`

Tutorials:

  • Tutorial on Explanations in Interactive Machine Learning at AAAI-22 website Notes: includes recording.

Approaches that supervise the model's explanations (a minimal code sketch of the general idea follows this list):

  • Rationalizing Neural Predictions Tao Lei, Regina Barzilay, Tommi Jaakkola; EMNLP 2016 paper code Notes: they learn an "explanation module" for text classification from explanatory supervision, namely rationales.

  • Right for the right reasons: training differentiable models by constraining their explanations Andrew Slavin Ross, Michael C. Hughes, and Finale Doshi-Velez; IJCAI 2017 paper code

  • Tell me where to look: Guided attention inference network Kunpeng Li, Ziyan Wu, Kuan-Chuan Peng, Jan Ernst, Yun Fu; CVPR 2018 paper

  • e-SNLI: Natural Language Inference with Natural Language Explanations Oana-Maria Camburu, Tim Rocktäschel, Thomas Lukasiewicz, Phil Blunsom; NeurIPS 2018 paper code

  • Learning credible models Jiaxuan Wang, Jeeheh Oh, Haozhu Wang, and Jenna Wiens; KDD 2018 paper code

  • Not Using the Car to See the Sidewalk--Quantifying and Controlling the Effects of Context in Classification and Segmentation Rakshith Shetty, Bernt Schiele, Mario Fritz; CVPR 2019 paper Notes: not exactly about explanations, learns from ground-truth object annotations.

  • Taking a HINT: Leveraging Explanations to Make Vision and Language Models More Grounded Ramprasaath R. Selvaraju, Stefan Lee, Yilin Shen, Hongxia Jin, Shalini Ghosh, Larry Heck, Dhruv Batra, and Devi Parikh; ICCV 2019 pdf

  • Learning credible deep neural networks with rationale regularization Mengnan Du, Ninghao Liu, Fan Yang, Xia Hu; ICDM 2019 paper

  • Deriving Machine Attention from Human Rationales Yujia Bao, Shiyu Chang, Mo Yu, and Regina Barzilay; ACL 2019 paper code

  • TED: Teaching AI to explain its decisions Michael Hind, Dennis Wei, Murray Campbell, Noel Codella, Amit Dhurandhar, Aleksandra Mojsilović, Karthikeyan Ramamurthy, Kush Varshney; AIES 2019 paper

  • Incorporating Priors with Feature Attribution on Text Classification Frederick Liu, Besim Avci; ACL 2019 paper

  • Saliency Learning: Teaching the Model Where to Pay Attention Reza Ghaeini, Xiaoli Fern, Hamed Shahbazi, Prasad Tadepalli; NAACL 2019 paper

  • Do Human Rationales Improve Machine Explanations? Julia Strout, Ye Zhang, Raymond Mooney; ACL Workshop BlackboxNLP 2019 paper

  • CARE: Class attention to regions of lesion for classification on imbalanced data Jiaxin Zhuang, Jiabin Cai, Ruixuan Wang, Jianguo Zhang, Weishi Zheng; International Conference on Medical Imaging with Deep Learning, 2019. paper

  • GradMask: Reduce Overfitting by Regularizing Saliency Becks Simpson, Francis Dutil, Yoshua Bengio, Joseph Paul Cohen; International Conference on Medical Imaging with Deep Learning, 2019. paper

  • Learning Global Transparent Models Consistent with Local Contrastive Explanations Tejaswini Pedapati, Avinash Balakrishnan, Karthikeyan Shanmugam, Amit Dhurandhar; NeurIPS 2020 paper

  • Model Agnostic Multilevel Explanations Karthikeyan Natesan Ramamurthy, Bhanukiran Vinzamuri, Yunfeng Zhang, Amit Dhurandhar; NeurIPS 2020 paper Notes: implicitly learns to generalize across multiple local explanations.

  • Interpretations are useful: penalizing explanations to align neural networks with prior knowledge Laura Rieger, Chandan Singh, William Murdoch, Bin Yu; ICML 2020 paper code

  • Remembering for the Right Reasons: Explanations Reduce Catastrophic Forgetting Sayna Ebrahimi, Suzanne Petryk, Akash Gokul, William Gan, Joseph Gonzalez, Marcus Rohrbach; ICLR 2021 paper code Notes: uses saliency guided replay for continual learning.

  • Learning to Faithfully Rationalize by Construction Sarthak Jain, Sarah Wiegreffe, Yuval Pinter, Byron Wallace. ACL 2020 paper code

  • Reflective-Net: Learning from Explanations Johannes Schneider, Michalis Vlachos; arXiv 2020 paper

  • Learning Interpretable Concept-based Models with Human Feedback Isaac Lage, Finale Doshi-Velez; arXiv 2020 paper Notes: incrementally acquires side-information about per-concept feature dependencies; side-information is per-concept, not per-instance.

  • Improving performance of deep learning models with axiomatic attribution priors and expected gradients Gabriel Erion, Joseph D. Janizek, Pascal Sturmfels, Scott Lundberg, Su-In Lee; Nature Machine Intelligence 2021 paper preprint code

  • GLocalX - From Local to Global Explanations of Black Box AI Models Mattia Setzu, Riccardo Guidotti, Anna Monreale, Franco Turini, Dino Pedreschi, and Fosca Giannotti; Artificial Intelligence 2021 page code Notes: converts a set of local explanations to a global explanation / white-box model.

  • IAIA-BL: A Case-based Interpretable Deep Learning Model for Classification of Mass Lesions in Digital Mammography Alina Barnett, Fides Schwartz, Chaofan Tao, Chaofan Chen, Yinhao Ren, Joseph Lo, Cynthia Rudin; Nature Machine Intelligence 2021 paper code

  • Debiasing Concept-based Explanations with Causal Analysis Mohammad Taha Bahadori, and David E. Heckerman; ICLR 2021 paper

  • Teaching with Commentaries Aniruddh Raghu, Maithra Raghu, Simon Kornblith, David Duvenaud, and Geoffrey Hinton; ICLR 2021 paper code

  • Saliency is a possible red herring when diagnosing poor generalization Joseph Viviano, Becks Simpson, Francis Dutil, Yoshua Bengio, Joseph Paul Cohen; ICLR 2021 paper code

  • Towards Robust Classification Model by Counterfactual and Invariant Data Generation Chun-Hao Chang, George Alexandru Adam, Anna Goldenberg; CVPR 2021 paper code

  • Global Explanations with Decision Rules: a Co-learning Approach Géraldin Nanfack, Paul Temple, Benoît Frénay; UAI 2021 paper code

  • Explain and Predict, and then Predict Again Zijian Zhang, Koustav Rudra, Avishek Anand; WSDM 2021 paper code

  • Explanation-Based Human Debugging of NLP Models: A Survey Piyawat Lertvittayakumjorn, Francesca Toni; arXiv 2021 paper

  • When Can Models Learn From Explanations? A Formal Framework for Understanding the Roles of Explanation Data Peter Hase, Mohit Bansal; arXiv 2021 paper code

  • Enjoy the Salience: Towards Better Transformer-based Faithful Explanations with Word Salience George Chrysostomou, Nikolaos Aletras; arXiv 2021 paper code

  • Influence Tuning: Demoting Spurious Correlations via Instance Attribution and Instance-Driven Updates Xiaochuang Han, Yulia Tsvetkov; arXiv 2021 paper code

  • Saliency Guided Experience Packing for Replay in Continual Learning Gobinda Saha, Kaushik Roy; arXiv 2021 paper Notes: leverages saliency for experience replay in continual learning.

  • What to Learn, and How: Toward Effective Learning from Rationales Samuel Carton, Surya Kanoria, Chenhao Tan; arXiv 2021 paper

  • Supervising Model Attention with Human Explanations for Robust Natural Language Inference Joe Stacey, Yonatan Belinkov, Marek Rei; AAAI 2022 paper code

  • Finding and removing Clever Hans: Using explanation methods to debug and improve deep models Christopher Anders, Leander Weber, David Neumann, Wojciech Samek, Klaus-Robert Müller, Sebastian Lapuschkin; Information Fusion 2022 paper code

  • Toward Learning Human-aligned Cross-domain Robust Models by Countering Misaligned Features Haohan Wang, Zeyi Huang, Hanlin Zhang, Eric P. Xing; UAI 2022 paper code

  • A survey on improving NLP models with human explanations Mareike Hartmann, Daniel Sonntag; arXiv 2022 paper

  • VisFIS: Visual Feature Importance Supervision with Right-for-the-Right-Reason Objectives Zhuofan Ying, Peter Hase, and Mohit Bansal; arXiv 2022 paper code

  • Identifying Spurious Correlations and Correcting them with an Explanation-based Learning Misgina Tsighe Hagos, Kathleen Curran, Brian Mac Namee; arXiv 2022 paper

  • Using Explanations to Guide Models Sukrut Rao, Moritz Böhle, Amin Parchami-Araghi, Bernt Schiele; ICCV 2023 paper code

  • Learning with Explanation Constraints Rattana Pukdee, Dylan Sam, Zico Kolter, Maria-Florina Balcan, Pradeep Ravikumar; arXiv 2023 paper

  • Spuriosity Didn't Kill the Classifier: Using Invariant Predictions to Harness Spurious Features Cian Eastwood, Shashank Singh, Andrei Liviu Nicolicioiu, Marin Vlastelica, Julius von Kügelgen, Bernhard Schölkopf; arXiv 2023 paper

  • Spurious Features Everywhere - Large-Scale Detection of Harmful Spurious Features in ImageNet Yannic Neuhaus, Maximilian Augustin, Valentyn Boreiko, Matthias Hein; ICCV 2023 paper code

  • Targeted Activation Penalties Help CNNs Ignore Spurious Signals Dekai Zhang, Matt Williams, and Francesca Toni; AAAI 2024 paper code

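Most of the entries above supervise a differentiable explanation, typically an input-gradient or saliency map, by penalizing explanation mass that falls where annotators say it should not. As a rough illustration only (not any specific paper's implementation), a right-for-the-right-reasons-style loss in PyTorch could look like the sketch below; `model`, `x`, `y`, and the binary `irrelevant_mask` (1 on features the annotator marked as irrelevant) are assumed, hypothetical inputs.

```python
import torch
import torch.nn.functional as F

def explanation_supervised_loss(model, x, y, irrelevant_mask, lambda_expl=10.0):
    """Cross-entropy plus a penalty on input-gradient mass that falls on
    features an annotator marked as irrelevant (mask == 1).

    A minimal sketch in the spirit of "right for the right reasons"-style
    losses; real implementations differ in the explainer and the penalty.
    """
    x = x.clone().requires_grad_(True)
    logits = model(x)
    ce = F.cross_entropy(logits, y)

    # Gradient-based "explanation": d log p(y|x) / dx, kept in the graph
    # so the penalty itself can be backpropagated through.
    log_probs = F.log_softmax(logits, dim=-1)
    (input_grads,) = torch.autograd.grad(log_probs.sum(), x, create_graph=True)

    # Penalize explanation mass on features the annotation marks as irrelevant.
    expl_penalty = (irrelevant_mask * input_grads).pow(2).sum()
    return ce + lambda_expl * expl_penalty
```

The hyperparameter `lambda_expl` trades off fitting the labels against agreeing with the annotated explanations.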

Approaches that combine supervision on the explanations with interactive machine learning:

  • Principles of Explanatory Debugging to Personalize Interactive Machine Learning Todd Kulesza, Margaret Burnett, Weng-Keen Wong, Simone Stumpf; IUI 2015 paper

  • Explanatory Interactive Machine Learning Stefano Teso, Kristian Kersting; AIES 2019 paper code Notes: introduces explanatory interactive learning, focuses on active learning setup.

  • Toward Faithful Explanatory Active Learning with Self-explainable Neural Nets Stefano Teso; IAL Workshop 2019. paper code Notes: explanatory active learning with self-explainable neural networks.

  • Making deep neural networks right for the right scientific reasons by interacting with their explanations Patrick Schramowski, Wolfgang Stammer, Stefano Teso, Anna Brugger, Franziska Herbert, Xiaoting Shao, Hans-Georg Luigs, Anne-Katrin Mahlein, Kristian Kersting; Nature Machine Intelligence 2020 paper code Notes: introduces end-to-end explanatory interactive learning, fixes clever Hans deep neural nets.

  • Embedding Human Knowledge into Deep Neural Network via Attention Map Masahiro Mitsuhara, Hiroshi Fukui, Yusuke Sakashita, Takanori Ogata, Tsubasa Hirakawa, Takayoshi Yamashita, Hironobu Fujiyoshi; arXiv 2019 paper

  • One explanation does not fit all Kacper Sokol, Peter Flach; Künstliche Intelligenz 2020 paper

  • FIND: Human-in-the-loop Debugging Deep Text Classifiers Piyawat Lertvittayakumjorn, Lucia Specia, Francesca Toni; EMNLP 2020 paper

  • Human-driven FOL explanations of deep learning Gabriele Ciravegna, Francesco Giannini, Marco Gori, Marco Maggini, Stefano Melacci; IJCAI 2020 paper Notes: first-order logic.

  • Cost-effective Interactive Attention Learning with Neural Attention Process Jay Heo, Junhyeon Park, Hyewon Jeong, Kwang Joon Kim, Juho Lee, Eunho Yang, Sung Ju Hwang; ICML 2020 paper code Notes: attention, interaction.

  • Soliciting human-in-the-loop user feedback for interactive machine learning reduces user trust and impressions of model accuracy Donald Honeycutt, Mahsan Nourani, Eric Ragan; AAAI Conference on Human Computation and Crowdsourcing 2020 paper

  • ALICE: Active Learning with Contrastive Natural Language Explanations Weixin Liang, James Zou, Zhou Yu; EMNLP 2020 paper

  • Machine Guides, Human Supervises: Interactive Learning with Global Explanations Teodora Popordanoska, Mohit Kumar, Stefano Teso; arXiv 2020 paper code Notes: introduces narrative bias and explanatory guided learning, focuses on human-initiated interaction and global explanations.

  • Teaching an Active Learner with Contrastive Examples Chaoqi Wang, Adish Singla, Yuxin Chen. NeurIPS 2021. paper

  • Right for the Right Concept: Revising Neuro-Symbolic Concepts by Interacting with their Explanations Wolfgang Stammer, Patrick Schramowski, and Kristian Kersting; CVPR 2021 paper code Notes: first-order logic, attention.

  • Right for Better Reasons: Training Differentiable Models by Constraining their Influence Function Xiaoting Shao, Arseny Skryagin, Patrick Schramowski, Wolfgang Stammer, Kristian Kersting; AAAI 2021 paper

  • User Driven Model Adjustment via Boolean Rule Explanations Elizabeth Daly, Massimiliano Mattetti, Öznur Alkan, Rahul Nair; AAAI 2021 paper

  • Explainable Active Learning (XAL): Toward AI Explanations as Interfaces for Machine Teachers Bhavya Ghai, Vera Liao, Yunfeng Zhang, Rachel Bellamy, Klaus Mueller. Proc. ACM Hum.-Comput. Interact. 2021 paper

  • Bandits for Learning to Explain from Explanations Freya Behrens, Stefano Teso, Davide Mottin; XAI Workshop 2021 paper code Notes: preliminary.

  • HILDIF: Interactive Debugging of NLI Models Using Influence Functions Hugo Zylberajch, Piyawat Lertvittayakumjorn, Francesca Toni; InterNLP Workshop 2021 paper code

  • Refining Neural Networks with Compositional Explanations Huihan Yao, Ying Chen, Qinyuan Ye, Xisen Jin, Xiang Ren; arXiv 2021 paper code

  • Interactive Label Cleaning with Example-based Explanations Stefano Teso, Andrea Bontempelli, Fausto Giunchiglia, Andrea Passerini; NeurIPS 2021 paper code

  • Symbols as a Lingua Franca for Bridging Human-AI Chasm for Explainable and Advisable AI Systems Subbarao Kambhampati, Sarath Sreedharan, Mudit Verma, Yantian Zha, Lin Guan; AAAI 2022 paper

  • Toward a Unified Framework for Debugging Gray-box Models Andrea Bontempelli, Fausto Giunchiglia, Andrea Passerini, Stefano Teso; AAAI-22 Workshop on Interactive Machine Learning paper

  • Active Learning by Acquiring Contrastive Examples Katerina Margatina, Giorgos Vernikos, Loïc Barrault, Nikolaos Aletras; EMNLP 2021 paper code

  • Finding and Fixing Spurious Patterns with Explanations Gregory Plumb, Marco Tulio Ribeiro, Ameet Talwalkar; arXiv 2021 paper

  • Interactively Generating Explanations for Transformer Language Models Patrick Schramowski, Felix Friedrich, Christopher Tauchmann, Kristian Kersting; arXiv 2021 paper

  • Interaction with Explanations in the XAINES Project Mareike Hartmann, Ivana Kruijff-Korbayová, Daniel Sonntag; arXiv 2021 paper

  • A Rationale-Centric Framework for Human-in-the-loop Machine Learning Jinghui Lu, Linyi Yang, Brian Mac Namee, Yue Zhang; ACL 2022 paper code

  • A Typology to Explore and Guide Explanatory Interactive Machine Learning Felix Friedrich, Wolfgang Stammer, Patrick Schramowski, Kristian Kersting; arXiv 2022 paper

  • CAIPI in Practice: Towards Explainable Interactive Medical Image Classification Emanuel Slany, Yannik Ott, Stephan Scheele, Jan Paulus, Ute Schmid; IFIP International Conference on Artificial Intelligence Applications and Innovations, 2022 paper

  • Semantic Interactive Learning for Text Classification: A Constructive Approach for Contextual Interactions Sebastian Kiefer, Mareike Hoffmann, Ute Schmid; Machine Learning and Knowledge Extraction, 2022 paper

  • Impact of Feedback Type on Explanatory Interactive Learning Misgina Tsighe Hagos, Kathleen Curran, Brian Mac Namee; ISMIS 2022 paper

  • Leveraging Explanations in Interactive Machine Learning: An Overview Stefano Teso, Öznur Alkan, Wolfgang Stammer, Elizabeth Daly; Frontiers in AI 2023 paper preprint

  • Concept-level Debugging of Part-prototype Networks Andrea Bontempelli, Stefano Teso, Fausto Giunchiglia, Andrea Passerini; ICLR 2023 paper code

  • Learning to Intervene on Concept Bottlenecks David Steinmann, Wolfgang Stammer, Felix Friedrich, Kristian Kersting; arXiv 2023 paper


Approaches that leverage explanations in human-in-the-loop reinforcement learning:

  • Explanation Augmented Feedback in Human-in-the-Loop Reinforcement Learning Lin Guan, Mudit Verma, Sihang Guo, Ruohan Zhang, Subbarao Kambhampati; Human And Machine in-the-Loop Evaluation and Learning Strategies paper

  • Learning from explanations and demonstrations: A pilot study Silvia Tulli, Sebastian Wallkötter, Ana Paiva, Francisco Melo, Mohamed Chetouani; Workshop on Interactive Natural Language Technology for Explainable Artificial Intelligence 2020 paper

  • Widening the Pipeline in Human-Guided Reinforcement Learning with Explanation and Context-Aware Data Augmentation Lin Guan, Mudit Verma, Sihang Guo, Ruohan Zhang, Subbarao Kambhampati; NeurIPS 2021 pdf


Work that uses explanations as a supervision signal for model reconstruction and evaluation:

  • Model reconstruction from model explanations Smitha Milli, Ludwig Schmidt, Anca D. Dragan, Moritz Hardt; FAccT 2019 paper

  • Evaluating Explanations: How much do explanations from the teacher aid students? Danish Pruthi, Bhuwan Dhingra, Livio Baldini Soares, Michael Collins, Zachary C. Lipton, Graham Neubig, and William W. Cohen; arXiv 2020 paper Notes: defines importance of different kinds of explanations by measuring their impact when used as supervision.


Approaches that regularize the model's explanations in an unsupervised manner, often for improved interpretability (see the sketch after this list):

  • Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients Andrew Ross and Finale Doshi-Velez. AAAI 2018 paper

  • Towards robust interpretability with self-explaining neural networks David Alvarez-Melis, Tommi Jaakkola; NeurIPS 2018 paper

  • Beyond sparsity: Tree regularization of deep models for interpretability Mike Wu, Michael Hughes, Sonali Parbhoo, Maurizio Zazzi, Volker Roth, Finale Doshi-Velez; AAAI 2018 paper

  • Regional tree regularization for interpretability in deep neural networks Mike Wu, Sonali Parbhoo, Michael Hughes, Ryan Kindle, Leo Celi, Maurizio Zazzi, Volker Roth, Finale Doshi-Velez; AAAI 2020 paper

  • Regularizing black-box models for improved interpretability Gregory Plumb, Maruan Al-Shedivat, Ángel Alexander Cabrera, Adam Perer, Eric Xing, Ameet Talwalkar; NeurIPS 2020 paper

  • Don't Judge an Object by Its Context: Learning to Overcome Contextual Bias Krishna Kumar Singh, Dhruv Mahajan, Kristen Grauman, Yong Jae Lee, Matt Feiszli, Deepti Ghadiyaram; CVPR 2020 paper code

  • Trustworthy convolutional neural networks: A gradient penalized-based approach Nicholas Halliwell, Freddy Lecue; arXiv 2020 paper

  • Explainable Models with Consistent Interpretations Vipin Pillai, Hamed Pirsiavash; AAAI 2021 paper code

  • Explanation Consistency Training: Facilitating Consistency-based Semi-supervised Learning with Interpretability Tao Han, Wei-Wei Tu, Yu-Feng Li; AAAI 2021 paper

  • Improving Deep Learning Interpretability by Saliency Guided Training Aya Abdelsalam Ismail, Hector Corrada Bravo, Soheil Feizi; NeurIPS 2021 paper code

  • Generating Deep Networks Explanations with Robust Attribution Alignment Guohang Zeng, Yousef Kowsar, Sarah Erfani, James Bailey; ACML 2021 paper

  • Learning by Self-Explaining Wolfgang Stammer, Felix Friedrich, David Steinmann, Hikaru Shindo, Kristian Kersting; arXiv 2023 paper

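A common unsupervised counterpart of the supervised losses earlier in this list simply penalizes the norm of the input gradient, with no annotation at all. The following PyTorch snippet is a minimal sketch of that idea, assuming a classifier `model`, inputs `x`, and labels `y`; it is not the implementation of any particular paper above.

```python
import torch
import torch.nn.functional as F

def gradient_regularized_loss(model, x, y, lambda_grad=0.1):
    """Cross-entropy plus an unsupervised penalty on the input-gradient norm.

    A minimal sketch of input-gradient regularization: no explanation
    annotations are used, only a smoothness prior on the explanation.
    """
    x = x.clone().requires_grad_(True)
    ce = F.cross_entropy(model(x), y)

    # Gradient of the loss w.r.t. the input, kept differentiable so the
    # penalty also trains the model.
    (input_grads,) = torch.autograd.grad(ce, x, create_graph=True)

    return ce + lambda_grad * input_grads.pow(2).sum()
```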

Machine teaching with explanations:

  • Interpretable Machine Teaching via Feature Feedback Shihan Su, Yuxin Chen, Oisin Mac Aodha, Pietro Perona, Yisong Yue; Workshop on Teaching Machines, Robots, and Humans 2017 paper

  • Teaching Categories to Human Learners with Visual Explanations Oisin Mac Aodha, Shihan Su, Yuxin Chen, Pietro Perona, Yisong Yue; CVPR 2018 paper Notes: this is *inverse* teaching, i.e., machine teaches human.


  • Improving a neural network model by explanation-guided training for glioma classification based on MRI data Frantisek Sefcik, Wanda Benesova; arXiv 2021 paper Notes: based on layer-wise relevance propagation.

Explanation-based learning, which focuses on logic-based formalisms and learning strategies:

  • Explanation-based generalization: A unifying view Tom Mitchell, Richard Keller, Smadar Kedar-Cabelli; MLJ 1986 paper

  • Explanation-based learning: An alternative view Gerald DeJong, Raymond Mooney; MLJ 1986 paper

  • Explanation-based learning: A survey of programs and perspectives Thomas Ellman; ACM Computing Surveys 1989 paper

  • Probabilistic explanation based learning Angelika Kimmig, Luc De Raedt, Hannu Toivonen; ECML 2007 paper

Injecting invariances / feature constraints into models:

  • Tangent Prop - A formalism for specifying selected invariances in an adaptive network Patrice Simard, Bernard Victorri, Yann Le Cun, John Denker; NeurIPS 1992 paper Notes: injects invariances into a neural net by regularizing its gradient; precursor to learning from gradient-based explanations.

  • Training invariant support vector machines Dennis DeCoste, Bernhard Schölkopf; MLJ 2002 paper

  • The constrained weight space SVM: learning with ranked features Kevin Small, Byron Wallace, Carla Brodley, Thomas Trikalinos; ICML 2011 paper

Dual label-feature feedback:

  • Active learning with feedback on features and instances Hema Raghavan, Omid Madani, Rosie Jones; JMLR 2006 paper

  • An interactive algorithm for asking and incorporating feature feedback into support vector machines Hema Raghavan, James Allan; ACM SIGIR 2007 paper

  • Learning from labeled features using generalized expectation criteria Gregory Druck, Gideon Mann, Andrew McCallum; ACM SIGIR 2008 paper

  • Active learning by labeling features Gregory Druck, Burr Settles, Andrew McCallum; EMNLP 2009 paper

  • A unified approach to active dual supervision for labeling features and examples Josh Attenberg, Prem Melville, Foster Provost; ECML-PKDD 2010 paper

  • Closing the loop: Fast, interactive semi-supervised annotation with queries on features and instances Burr Settles; EMNLP 2011 paper

  • Learning from discriminative feature feedback Sanjoy Dasgupta, Akansha Dey, Nicholas Roberts, Sivan Sabato; NeurIPS 2018 paper

  • Robust Learning from Discriminative Feature Feedback Sanjoy Dasgupta, Sivan Sabato; AISTATS 2020 paper

  • Practical Benefits of Feature Feedback Under Distribution Shift Anurag Katakkar, Weiqin Wang, Clay Yoo, Zachary Lipton, Divyansh Kaushik; arXiv 2021 paper

Learning from rationales:

  • Using “annotator rationales” to improve machine learning for text categorization Omar Zaidan, Jason Eisner, Christine Piatko; NAACL 2007 paper

  • Modeling annotators: A generative approach to learning from annotator rationales Omar Zaidan, Jason Eisner; EMNLP 2008 paper

  • Active learning with rationales for text classification Manali Sharma, Di Zhuang, Mustafa Bilgic; NAACL 2015 paper

Counterfactual augmentation:

  • Learning The Difference That Makes A Difference With Counterfactually-Augmented Data Divyansh Kaushik, Eduard Hovy, Zachary Lipton; ICLR 2019 paper code

  • Explaining the Efficacy of Counterfactually Augmented Data Divyansh Kaushik, Amrith Setlur, Eduard H. Hovy, Zachary Lipton; ICLR 2021. paper code

  • An Investigation of the (In)effectiveness of Counterfactually-augmented Data Nitish Joshi, He He; arXiv 2021 paper

Critiquing in recommenders:

  • Critiquing-based recommenders: survey and emerging trends Li Chen, Pearl Pu; User Modeling and User-Adapted Interaction 2012 paper

  • Coactive critiquing: Elicitation of preferences and features Stefano Teso, Paolo Dragone, Andrea Passerini; AAAI 2017 paper

Gray-box models:

  • Concept bottleneck models Pang Wei Koh, Thao Nguyen, Yew Siang Tang, Stephen Mussmann, Emma Pierson, Been Kim, and Percy Liang; ICML 2020 paper

A selection of general resources on Explainable AI focusing on overviews, surveys, societal implications, and critiques:

  • Survey and critique of techniques for extracting rules from trained artificial neural networks Robert Andrews, Joachim Diederich, Alan B. Tickle; Knowledge-Based Systems 1995 page

  • Toward harnessing user feedback for machine learning Simone Stumpf, Vidya Rajaram, Lida Li, Margaret Burnett, Thomas Dietterich, Erin Sullivan, Russell Drummond, Jonathan Herlocker; IUI 2007 paper

  • The Mythos of Model Interpretability Zachary Lipton; CACM 2016 paper

  • A survey of methods for explaining black box models Riccardo Guidotti, Anna Monreale, Salvatore Ruggieri, Franco Turini, Fosca Giannotti, and Dino Pedreschi; ACM Computing Surveys 2018 paper

  • Sanity checks for saliency maps Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, Been Kim; NeurIPS 2018 paper code

  • Recognition in terra incognita Sara Beery, Grant Van Horn, Pietro Perona; ECCV 2018 paper

  • Explanation in Artificial Intelligence: Insights from the Social Sciences Tim Miller; Artificial Intelligence, 2019 paper

  • Unmasking clever hans predictors and assessing what machines really learn Sebastian Lapuschkin, Stephan Wäldchen, Alexander Binder, Grégoire Montavon, Wojciech Samek, Klaus-Robert Müller; Nature Communications 2019 paper

  • Interpretation of neural networks is fragile Amirata Ghorbani, Abubakar Abid, James Zou; AAAI 2019 paper

  • A Benchmark for Interpretability Methods in Deep Neural Networks Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, Been Kim; NeurIPS 2019 paper code

  • Is Attention Interpretable? Sofia Serrano, Noah A. Smith; ACL 2019 paper

  • Attention is not Explanation Sarthak Jain, and Byron C. Wallace; ACL 2019 paper

  • Attention is not not Explanation Sarah Wiegreffe, and Yuval Pinter; EMNLP-IJCNLP 2019 paper

  • The (un)reliability of saliency methods Pieter-Jan Kindermans, Sara Hooker, Julius Adebayo, Maximilian Alber, Kristof T. Schütt, Sven Dähne, Dumitru Erhan, and Been Kim; Explainable AI: Interpreting, Explaining and Visualizing Deep Learning 2019 paper

  • Explanations can be manipulated and geometry is to blame Ann-Kathrin Dombrowski, Maximillian Alber, Christopher Anders, Marcel Ackermann, Klaus-Robert Müller, and Pan Kessel; NeurIPS 2019 paper

  • Fooling Neural Network Interpretations via Adversarial Model Manipulation Juyeon Heo, Sunghwan Joo, and Taesup Moon; NeurIPS 2019 paper

  • Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead Cynthia Rudin; Nature Machine Intelligence 2019 page

  • The Principles and Limits of Algorithm-in-the-loop Decision Making Ben Green, Yiling Chen; PACM HCI 2019 paper

  • Shortcut learning in deep neural networks Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, Felix Wichmann; Nature Machine Intelligence 2020 page

  • When Explanations Lie: Why Many Modified BP Attributions Fail Leon Sixt, Maximilian Granz, Tim Landgraf. ICML 2020 paper

  • The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? Jasmijn Bastings, Katja Filippova; Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP 2020 paper

  • Why Attention is Not Explanation: Surgical Intervention and Causal Reasoning about Neural Models Christopher Grimsley, Elijah Mayfield, Julia Bursten; Language Resources and Evaluation Conference 2020 paper

  • AI for radiographic COVID-19 detection selects shortcuts over signal Alex DeGrave, Joseph Janizek, Su-In Lee; Nature Machine Intelligence 2021 paper code

  • How Well do Feature Visualizations Support Causal Understanding of CNN Activations? Roland Zimmermann, Judy Borowski, Robert Geirhos, Matthias Bethge, Thomas Wallis, Wieland Brendel; arXiv 2021 paper

  • Post hoc explanations may be ineffective for detecting unknown spurious correlation Julius Adebayo, Michael Muelly, Harold Abelson, and Been Kim; ICLR 2022 paper code

  • Where is the Truth? The Risk of Getting Confounded in a Continual World Florian Peter Busch, Roshni Kamath, Rupert Mitchell, Wolfgang Stammer, Kristian Kersting, Martin Mundt



Not Yet Sorted

  • Multimodal explanations: Justifying decisions and pointing to the evidence Dong Huk Park, Lisa Anne Hendricks, Zeynep Akata, Anna Rohrbach, Bernt Schiele, Trevor Darrell, Marcus Rohrbach; CVPR 2018 paper

  • Learning Deep Attribution Priors Based On Prior Knowledge Ethan Weinberger, Joseph Janizek, Su-In Lee; NeurIPS 2020 paper


TODO

  • Crawl & reference work on NLP.

Comments

This list is directly inspired by all the awesome awesome lists out there!