A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?

Yunfei Xie*, Juncheng Wu*, Haoqin Tu*, Siwei Yang*, Bingchen Zhao, Yongshuo Zong, Qiao Jin, Cihang Xie, Yuyin Zhou

* Equal technical contribution
1 UC Santa Cruz, 2 University of Edinburgh, 3 National Institutes of Health


πŸ“’ Breaking News

Performance Overview


Figure 1: Overall results of o1 and 4 other strong LLMs. We show performance on 12 medical datasets spanning diverse domains. o1 demonstrates a clear performance advantage over both closed- and open-source models.


Figure 2: Average accuracy of o1 and 4 other strong LLMs. o1 achieves the highest average accuracy of 73.3% across 19 medical datasets.


πŸ₯ Our Pipeline

Figure 3: Our evaluation pipeline comprises different evaluation (a) aspects, each containing various tasks. We collect multiple (b) datasets for each task and combine them with various (c) prompt strategies to evaluate the latest (d) language models. We leverage a comprehensive set of (e) evaluations to present a holistic view of model progress in the medical domain.


πŸš€ Performances of o1

Table 1: Accuracy (Acc.) or F1 results on 4 tasks across 2 aspects. Model results marked with * are taken from Wu et al. (2024) for reference. We also present the average score (Average) of each metric in the table.

Table 2: BLEU-1 (B-1) and ROUGE-1 (R-1) results on 3 tasks across 2 aspects. We use the gray background to highlight o1 results. We also present the average score (Average) of each metric.

Table 3: Accuracy of models on the multilingual task XMedBench (Wang et al., 2024).

Table 4: Accuracy of LLMs on two agentic benchmarks.

Table 5: Accuracy results of model outputs with/without CoT prompting on 5 knowledge QA datasets.


πŸ” Case Study

Figure 4: Comparison of the answers from o1 and GPT-4 for a question from NEJM. o1 provides a more concise and accurate reasoning process compared to GPT-4.

Figure 5: Comparison of the answers from o1 and GPT-4 for a case from the Chinese dataset AI Hospital, along with its English translation. o1 offers a more precise diagnosis and practical treatment suggestions compared to GPT-4.


πŸ› οΈ Setup

Setting Up Evals

To set up the evaluation framework, clone our repository and run the setup script:

git clone https://github.com/UCSC-VLAA/o1_eval.git
cd o1_eval
bash setup.sh

Configuring OpenAI API

Create a .env file in the root directory and add your OpenAI API credentials:

OPENAI_ORGANIZATION_ID=...
OPENAI_API_KEY=...
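
The evaluation scripts pick up these credentials from the environment. As a quick sanity check that they are set correctly, something like the following works (a minimal sketch, not part of this repository; it assumes the python-dotenv and openai Python packages are installed):

import os
from dotenv import load_dotenv
from openai import OpenAI

load_dotenv()  # reads OPENAI_API_KEY / OPENAI_ORGANIZATION_ID from .env

client = OpenAI(
    api_key=os.environ["OPENAI_API_KEY"],
    organization=os.environ.get("OPENAI_ORGANIZATION_ID"),
)
print(client.models.list().data[0].id)  # fails fast if the key is invalid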

πŸ“Š Evaluation on Existing Data

Overview of Datasets

We include the prompts and inquiries used in our paper. The datasets are listed below; LancetQA and NEJMQA are omitted due to copyright restrictions.

Running Evaluations on No-Agent Tasks

The eval_bash directory contains an evaluation script for each dataset. Run the corresponding script to perform the evaluation:

bash eval_bash/eval_dataset_name/eval_script.sh
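
To run every dataset's script in one pass, a small driver like the one below can help (a minimal sketch based on the eval_bash/<dataset>/<script>.sh layout shown above; adjust the glob if the actual script names differ):

import subprocess
from pathlib import Path

# Run each evaluation script under eval_bash/ sequentially.
for script in sorted(Path("eval_bash").glob("*/*.sh")):
    print(f"Running {script} ...")
    subprocess.run(["bash", str(script)], check=True)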

Running Evaluations on Agent Tasks

AgentClinic

  1. Clone the AgentClinic repository:

    git clone https://github.com/SamuelSchmidgall/AgentClinic/

    Follow the installation instructions provided in the repository's README.md.

  2. To run the evaluation, execute the following command with the specified parameters:

    python agentclinic.py --doctor_llm o1-preview \
         --patient_llm o1-preview --inf_type llm \
         --agent_dataset dataset --doctor_image_request False \
         --num_scenarios 220 \
         --total_inferences 20 --openai_client
    • --agent_dataset: You can choose between MedQA or NEJM_Ext.
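
To evaluate both supported datasets back to back, the command can be wrapped in a small loop (a minimal sketch; it simply replays the flags shown above for each --agent_dataset value):

import subprocess

# Run AgentClinic once per supported --agent_dataset value.
for dataset in ["MedQA", "NEJM_Ext"]:
    subprocess.run(
        [
            "python", "agentclinic.py",
            "--doctor_llm", "o1-preview",
            "--patient_llm", "o1-preview",
            "--inf_type", "llm",
            "--agent_dataset", dataset,
            "--doctor_image_request", "False",
            "--num_scenarios", "220",
            "--total_inferences", "20",
            "--openai_client",
        ],
        check=True,
    )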

AI Hospital

  1. Clone the AI Hospital repository:

    git clone https://github.com/LibertFan/AI_Hospital

    Follow the installation instructions provided in the repository's README.md.

  2. To run the evaluation, execute the following command with the specified parameters:

    python run.py --patient_database ./data/patients_sample_200.json \
     --doctor_openai_api_key $OPENAI_API_KEY \
     --doctor Agent.Doctor.GPT --doctor_openai_model_name o1-preview \
     --patient Agent.Patient.GPT --patient_openai_model_name gpt-3 \
     --reporter Agent.Reporter.GPT --reporter_openai_model_name gpt-3 \
     --save_path outputs/dialog_history_iiyi/dialog_history_gpto1_200.jsonl \
     --max_conversation_turn 10 --max_workers 2 --parallel
    • Note: We evaluated only the first 200 records from AI Hospital due to cost constraints.
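
If you want to reproduce the 200-record subset, the sample file can be cut from the full patient database along these lines (a minimal sketch; the source filename patients.json and the top-level list structure are assumptions, so check the AI_Hospital repository for the actual data layout):

import json

# Keep only the first 200 patient records (filename and structure assumed).
with open("data/patients.json", encoding="utf-8") as f:
    patients = json.load(f)

with open("data/patients_sample_200.json", "w", encoding="utf-8") as f:
    json.dump(patients[:200], f, ensure_ascii=False, indent=2)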

πŸ“ˆ Evaluation on New Data

Our evaluation framework is fully based on OpenAI Evals. OpenAI Evals provides a framework for evaluating large language models (LLMs) or systems built using LLMs. It offers an existing registry of evaluations to test different dimensions of OpenAI models and the ability to write your own custom evaluations for use cases you care about. You can also use your data to build private evaluations representing the common LLM patterns in your workflow without exposing any of that data publicly.
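
For a concrete sense of the data format, the basic match-style evals in that framework read a JSONL file in which each line pairs a chat-formatted input with an ideal answer. Below is a minimal sketch for preparing such a file from your own question-answer pairs (the questions are placeholders, and registering the eval itself is described in the OpenAI Evals documentation):

import json

# Placeholder QA pairs; replace with your own data.
samples = [
    {"question": "Which vitamin deficiency causes scurvy?", "answer": "Vitamin C"},
    {"question": "Which organ produces insulin?", "answer": "The pancreas"},
]

with open("samples.jsonl", "w", encoding="utf-8") as f:
    for s in samples:
        line = {
            "input": [
                {"role": "system", "content": "Answer the medical question concisely."},
                {"role": "user", "content": s["question"]},
            ],
            "ideal": s["answer"],
        }
        f.write(json.dumps(line) + "\n")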

For detailed instructions on creating and running custom evaluations, please refer to the OpenAI Evals documentation.


πŸ™ Acknowledgement

This work is partially supported by the OpenAI Researcher Access Program and Microsoft Accelerate Foundation Models Research Program. Q.J. is supported by the NIH Intramural Research Program, National Library of Medicine. The content is solely the responsibility of the authors and does not necessarily represent the official views of the funding agencies.


πŸ“œ Citation

If you find this work useful for your research and applications, please cite using this BibTeX:

@misc{xie2024preliminarystudyo1medicine,
      title={A Preliminary Study of o1 in Medicine: Are We Closer to an AI Doctor?}, 
      author={Yunfei Xie and Juncheng Wu and Haoqin Tu and Siwei Yang and Bingchen Zhao and Yongshuo Zong and Qiao Jin and Cihang Xie and Yuyin Zhou},
      year={2024},
      eprint={2409.15277},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2409.15277}, 
}

πŸ”— Related Projects

