The update 2.2 -> 3.1 added stays with anchor_year_group != "2020 - 2022" #1838

Open
1 task done
mlondschien opened this issue Dec 22, 2024 · 0 comments

Description

Thank you for publishing the MIMIC dataset. We updated from version 2.2 to 3.1. According to the anchor_year_group column in the patients table, the update appears to have added not only stays from 2020 - 2022 but also stays from the earlier year groups. See the code below:

import pandas as pd
from pathlib import Path
import gzip

path_new = Path("/path/to/miiv/")
path_old = Path("/path/to/miiv-v2.2/")

with gzip.open(path_new / "icu" / "icustays.csv.gz") as f:
    icustays_new = pd.read_csv(f)

with gzip.open(path_old / "icu" / "icustays.csv.gz") as f:
    icustays_old = pd.read_csv(f)

with gzip.open(path_new / "hosp" / "patients.csv.gz") as f:
    patients_new = pd.read_csv(f)

with gzip.open(path_old / "hosp" / "patients.csv.gz") as f:
    patients_old = pd.read_csv(f)

merged_new = pd.merge(
    left=icustays_new,
    right=patients_new,
    on="subject_id",
    how="left",
    validate="m:1"
)

merged_old = pd.merge(
    left=icustays_old,
    right=patients_old,
    on="subject_id",
    how="left",
    validate="m:1"
)

print("old:")
print(merged_old["anchor_year_group"].value_counts())

print("new:")
print(merged_new["anchor_year_group"].value_counts())

print(f"old stay_ids: {merged_old['stay_id'].nunique()}")
print(f"new stay_ids: {merged_new['stay_id'].nunique()}")

merged = pd.merge(
    left=merged_old,
    right=merged_new,
    on="stay_id",
    how="outer",
    validate="1:1",
    indicator=True,
    suffixes=["_old", "_new"]
)
merged["anchor_year_group"] = merged["anchor_year_group_old"].fillna(merged["anchor_year_group_new"])

print(merged.groupby(["anchor_year_group", "_merge"]).size())

This prints

old:
anchor_year_group
2008 - 2010    26710
2011 - 2013    17215
2014 - 2016    15989
2017 - 2019    13267
Name: count, dtype: int64

new:
anchor_year_group
2008 - 2010    30002
2011 - 2013    19475
2014 - 2016    18136
2017 - 2019    16048
2020 - 2022    10797
Name: count, dtype: int64

old stay_ids: 73181
new stay_ids: 94458

anchor_year_group  _merge    
2008 - 2010        left_only        79
                   right_only     3371
                   both          26631
2011 - 2013        left_only        71
                   right_only     2331
                   both          17144
2014 - 2016        left_only       124
                   right_only     2271
                   both          15865
2017 - 2019        left_only       436
                   right_only     3217
                   both          12831
2020 - 2022        left_only         0
                   right_only    10797
                   both              0
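
To make the size of the discrepancy explicit, the counts above can be summed by category. The snippet below is only a sketch that re-adds the numbers printed above; the variable names are illustrative and it does not read the dataset itself. Here `left_only` means a stay that appears only in v2.2 and `right_only` one that appears only in v3.1.

```python
# Counts copied from the groupby output above: (left_only, right_only, both).
counts = {
    "2008 - 2010": (79, 3371, 26631),
    "2011 - 2013": (71, 2331, 17144),
    "2014 - 2016": (124, 2271, 15865),
    "2017 - 2019": (436, 3217, 12831),
    "2020 - 2022": (0, 10797, 0),
}

# Stays present in v2.2 but missing from v3.1.
removed = sum(left for left, _, _ in counts.values())

# Stays new in v3.1, split by whether the group is "2020 - 2022".
added_2020_2022 = counts["2020 - 2022"][1]
added_earlier = sum(
    right for group, (_, right, _) in counts.items() if group != "2020 - 2022"
)

# Stays present in both versions.
shared = sum(both for _, _, both in counts.values())

print(f"removed: {removed}")                  # removed: 710
print(f"added (2020 - 2022): {added_2020_2022}")  # added (2020 - 2022): 10797
print(f"added (earlier groups): {added_earlier}")  # added (earlier groups): 11190
print(f"old total: {shared + removed}")       # old total: 73181
print(f"new total: {shared + added_earlier + added_2020_2022}")  # new total: 94458
```

The totals match the stay_id counts printed above. By these numbers, roughly half of the newly added stays fall outside the 2020 - 2022 group, and 710 stays from v2.2 no longer appear in v3.1.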

This behaviour is documented neither on PhysioNet nor in the documentation (whose changelog is out of date). Is it expected, or could it be an error in the anchor_year_group variable?

Note: It appears that the mapping anchor_year -> anchor_year_group is not unique.
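
One way to check this is to count the distinct anchor_year_group values per anchor_year. The sketch below assumes a DataFrame with the anchor_year and anchor_year_group columns of the patients table; the helper name `ambiguous_anchor_years` and the toy rows are made up for illustration and are not taken from MIMIC.

```python
import pandas as pd


def ambiguous_anchor_years(patients: pd.DataFrame) -> pd.Series:
    """Return, for each anchor_year that maps to more than one
    anchor_year_group, the number of distinct groups it maps to."""
    groups_per_year = patients.groupby("anchor_year")["anchor_year_group"].nunique()
    return groups_per_year[groups_per_year > 1]


# Toy rows for illustration only (hypothetical values, not real MIMIC data).
toy = pd.DataFrame(
    {
        "subject_id": [1, 2, 3, 4],
        "anchor_year": [2150, 2150, 2151, 2152],
        "anchor_year_group": [
            "2008 - 2010",
            "2011 - 2013",
            "2008 - 2010",
            "2014 - 2016",
        ],
    }
)

ambiguous = ambiguous_anchor_years(toy)
print(ambiguous)
```

On the real data this would be called as `ambiguous_anchor_years(patients_new)`; an empty result would mean the mapping is unique.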
