This repo includes papers, tools, and blogs about Synthetic Data of LLMs, by LLMs, for LLMs.
Thanks to all the great contributors on GitHub!🔥⚡🔥
- Best Practices and Lessons Learned on Synthetic Data for Language Models. Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, Andrew M. Dai. Arxiv 2024.
- On LLMs-Driven Synthetic Data Generation, Curation, and Evaluation: A Survey. Lin Long, Rui Wang, Ruixuan Xiao, Junbo Zhao, Xiao Ding, Gang Chen, Haobo Wang. Arxiv 2024.
- Large Language Models for Data Annotation: A Survey. Zhen Tan, Dawei Li, Song Wang, Alimohammad Beigi, Bohan Jiang, Amrita Bhattacharjee, Mansooreh Karami, Jundong Li, Lu Cheng, Huan Liu. Arxiv 2024.
- Generative AI for Synthetic Data Generation: Methods, Challenges and the Future. Xu Guo, Yiqiang Chen. Arxiv 2024.
- Comprehensive Exploration of Synthetic Data Generation: A Survey. André Bauer, Simon Trapp, Michael Stenger, Robert Leppich, Samuel Kounev, Mark Leznik, Kyle Chard, Ian Foster. Arxiv 2024.
- STaR: Bootstrapping Reasoning With Reasoning. Eric Zelikman, Yuhuai Wu, Jesse Mu, Noah D. Goodman. NeurIPS 2022.
- Symbolic Knowledge Distillation: from General Language Models to Commonsense Models. Peter West, Chandra Bhagavatula, Jack Hessel, Jena D. Hwang, Liwei Jiang, Ronan Le Bras, Ximing Lu, Sean Welleck, Yejin Choi. NAACL 2022.
- Generating Training Data with Language Models: Towards Zero-Shot Language Understanding. Yu Meng, Jiaxin Huang, Yu Zhang, Jiawei Han. NeurIPS 2022.
- ZeroGen: Efficient Zero-shot Learning via Dataset Generation. Jiacheng Ye, Jiahui Gao, Qintong Li, Hang Xu, Jiangtao Feng, Zhiyong Wu, Tao Yu, Lingpeng Kong. EMNLP 2022.
- Large Language Models Can Self-Improve. Jiaxin Huang, Shixiang Gu, Le Hou, Yuexin Wu, Xuezhi Wang, Hongkun Yu, Jiawei Han. EMNLP 2023.
- Self-Play Fine-Tuning Converts Weak Language Models to Strong Language Models. Zixiang Chen, Yihe Deng, Huizhuo Yuan, Kaixuan Ji, Quanquan Gu. ICML 2024.
- Self-Rewarding Language Models. Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, Jason Weston. Arxiv 2024.
- Synthetic Data (Almost) from Scratch: Generalized Instruction Tuning for Language Models. Haoran Li, Qingxiu Dong, Zhengyang Tang, Chaojun Wang, Xingxing Zhang, Haoyang Huang, Shaohan Huang, Xiaolong Huang, Zeqiang Huang, Dongdong Zhang, Yuxian Gu, Xin Cheng, Xun Wang, Si-Qing Chen, Li Dong, Wei Lu, Zhifang Sui, Benyou Wang, Wai Lam, Furu Wei. Arxiv 2024.
- Self-instruct: Aligning language models with self-generated instructions. Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi. ACL 2023.
- TarGEN: Targeted Data Generation with Large Language Models. Himanshu Gupta, Kevin Scaria, Ujjwala Anantheswaran, Shreyas Verma, Mihir Parmar, Saurabh Arjun Sawant, Chitta Baral, Swaroop Mishra. COLM 2024.
- Automatic Instruction Evolving for Large Language Models. Weihao Zeng, Can Xu, Yingxiu Zhao, Jian-Guang Lou, Weizhu Chen. Arxiv 2024.
- Scaling Synthetic Data Creation with 1,000,000,000 Personas. Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, Dong Yu. Arxiv 2024.
- Self-playing Adversarial Language Game Enhances LLM Reasoning. Pengyu Cheng, Tianhao Hu, Han Xu, Zhisong Zhang, Yong Dai, Lei Han, Nan Du. Arxiv 2024.
- Source2Synth: Synthetic Data Generation and Curation Grounded in Real Data Sources. Alisia Lupidi, Carlos Gemmell, Nicola Cancedda, Jane Dwivedi-Yu, Jason Weston, Jakob Foerster, Roberta Raileanu, Maria Lomeli. Arxiv 2024.
- CodecLM: Aligning Language Models with Tailored Synthetic Data. Zifeng Wang, Chun-Liang Li, Vincent Perot, Long T. Le, Jin Miao, Zizhao Zhang, Chen-Yu Lee, Tomas Pfister. Findings of NAACL 2024.
- WizardLM: Empowering Large Language Models to Follow Complex Instructions. Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Daxin Jiang. Arxiv 2023.
- MuggleMath: Assessing the Impact of Query and Response Augmentation on Math Reasoning. Chengpeng Li, Zheng Yuan, Hongyi Yuan, Guanting Dong, Keming Lu, Jiancan Wu, Chuanqi Tan, Xiang Wang, Chang Zhou. ACL 2024.
- MathGenie: Generating Synthetic Data with Question Back-translation for Enhancing Mathematical Reasoning of LLMs. Zimu Lu, Aojun Zhou, Houxing Ren, Ke Wang, Weikang Shi, Junting Pan, Mingjie Zhan, Hongsheng Li. ACL 2024.
- MetaMath: Bootstrap Your Own Mathematical Questions for Large Language Models. Longhui Yu, Weisen Jiang, Han Shi, Jincheng Yu, Zhengying Liu, Yu Zhang, James T. Kwok, Zhenguo Li, Adrian Weller, Weiyang Liu. ICLR 2024.
- Augmenting Math Word Problems via Iterative Question Composing. Haoxiong Liu, Yifan Zhang, Yifan Luo, Andrew Chi-Chih Yao. DPFM@ICLR 2024.
- Language Models Can Teach Themselves to Program Better. Patrick Haluptzok, Matthew Bowers, Adam Tauman Kalai. ICLR 2023.
- Genetic Instruct: Scaling up Synthetic Generation of Coding Instructions for Large Language Models. Somshubra Majumdar, Vahid Noroozi, Sean Narenthiran, Aleksander Ficek, Jagadeesh Balam, Boris Ginsburg. Arxiv 2024.
- Learning Performance-Improving Code Edits. Alexander Shypula, Aman Madaan, Yimeng Zeng, Uri Alon, Jacob Gardner, Milad Hashemi, Graham Neubig, Parthasarathy Ranganathan, Osbert Bastani, Amir Yazdanbakhsh. ICLR 2024.
- Synthesizing Text-to-SQL Data from Weak and Strong LLMs. Jiaxi Yang, Binyuan Hui, Min Yang, Jian Yang, Junyang Lin, Chang Zhou. ACL 2024.
- Constitutional AI: Harmlessness from AI Feedback. Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosuite, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemi Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, Jared Kaplan. Arxiv 2022.
- Principle-Driven Self-Alignment of Language Models from Scratch with Minimal Human Supervision. Zhiqing Sun, Yikang Shen, Qinhong Zhou, Hongxin Zhang, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan. NeurIPS 2023.
- SALMON: Self-Alignment with Instructable Reward Models. Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Cox, Yiming Yang, Chuang Gan. ICLR 2024.
- Refined Direct Preference Optimization with Synthetic Data for Behavioral Alignment of LLMs. Víctor Gallego. Arxiv 2024.
- Self-play with Execution Feedback: Improving Instruction-following Capabilities of Large Language Models. Guanting Dong, Keming Lu, Chengpeng Li, Tingyu Xia, Bowen Yu, Chang Zhou, Jingren Zhou. Arxiv 2024.
- West-of-N: Synthetic Preference Generation for Improved Reward Modeling. Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, Aliaksei Severyn. Arxiv 2024.
- Make Your LLM Fully Utilize the Context. Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, Jian-Guang Lou. Arxiv 2024.
- Impossible Distillation for Paraphrasing and Summarization: How to Make High-quality Lemonade out of Small, Low-quality Models. Jaehun Jung, Peter West, Liwei Jiang, Faeze Brahman, Ximing Lu, Jillian Fisher, Taylor Sorensen, Yejin Choi. NAACL 2024.
- Toolformer: Language Models Can Teach Themselves to Use Tools. Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, Thomas Scialom. NeurIPS 2023.
- GPT4Tools: Teaching Large Language Model to Use Tools via Self-instruction. Rui Yang, Lin Song, Yanwei Li, Sijie Zhao, Yixiao Ge, Xiu Li, Ying Shan. Arxiv 2024.
- Gorilla: Large Language Model Connected with Massive APIs. Shishir G. Patil, Tianjun Zhang, Xin Wang, Joseph E. Gonzalez. Arxiv 2023.
- ToolAlpaca: Generalized Tool Learning for Language Models with 3000 Simulated Cases. Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, Le Sun. Arxiv 2023.
- Voyager: An Open-Ended Embodied Agent with Large Language Models. Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, Anima Anandkumar. Arxiv 2023.
- Visual Instruction Tuning. Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee. NeurIPS 2023.
- MiniGPT-4: Enhancing Vision-Language Understanding with Advanced Large Language Models. Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny. ICLR 2024.
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou. Arxiv 2023.
- G-LLaVA: Solving Geometric Problem with Multi-Modal Large Language Model. Jiahui Gao, Renjie Pi, Jipeng Zhang, Jiacheng Ye, Wanjun Zhong, Yufei Wang, Lanqing Hong, Jianhua Han, Hang Xu, Zhenguo Li, Lingpeng Kong. Arxiv 2023.
- Enhancing Large Vision Language Models with Self-Training on Image Comprehension. Yihe Deng, Pan Lu, Fan Yin, Ziniu Hu, Sheng Shen, James Zou, Kai-Wei Chang, Wei Wang. Arxiv 2024.
- Fine-tuning Language Models for Factuality. Katherine Tian, Eric Mitchell, Huaxiu Yao, Christopher D. Manning, Chelsea Finn. Arxiv 2023.
- MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents. Liyan Tang, Philippe Laban, Greg Durrett. Arxiv 2024.
- Synthetic-Text-To-SQL: A synthetic dataset for training language models to generate SQL queries from natural language prompts. Yev Meyer, Marjan Emadi, Dhruv Nathawani, Lipika Ramaswamy, Kendrick Boyd, Maarten Van Segbroeck, Matthew Grossman, Piotr Mlocek, Drew Newberry. Huggingface 2024.
- DataDreamer: A Tool for Synthetic Data Generation and Reproducible LLM Workflows. Ajay Patel, Colin Raffel, Chris Callison-Burch. ACL 2024.
- AgentInstruct: Toward Generative Teaching with Agentic Flows. Arindam Mitra, Luciano Del Corro, Guoqing Zheng, Shweti Mahajan, Dany Rouhana, Andres Codas, Yadong Lu, Wei-ge Chen, Olga Vrousgos, Corby Rosset, Fillipe Silva, Hamed Khanpour, Yash Lara, Ahmed Awadallah. Arxiv 2024.
- Synthetic dataset generation techniques: Self-Instruct. Daniel van Strien. 2024.
- LLM-Driven Synthetic Data Generation, Curation & Evaluation. Cobus Greyling. 2024.
- The Rise of Agentic Data Generation. Maxime Labonne. 2024.