The update 2.2 -> 3.1 added stays with `anchor_year_group != "2020 - 2022"` #1838

mlondschien · 2024-12-22T17:21:29Z

Prerequisites

Put an X between the brackets on this line if you have done all of the following:
- Checked the online documentation: https://mimic.mit.edu/
- Checked that your issue isn't already addressed: https://github.com/MIT-LCP/mimic-code/issues?utf8=%E2%9C%93&q=

Description

Thank you for publishing the MIMIC dataset. We updated from version 2.2 to 3.1. It appears that the update did not only add stays from the years 2020 - 2022: judging by the `anchor_year_group` column in the `patients` table, stays were also added to the earlier groups. See the code below:

import pandas as pd
from pathlib import Path
import gzip

path_new = Path("/path/to/miiv/")
path_old = Path("/path/to/miiv-v2.2/")

with gzip.open(path_new / "icu" / "icustays.csv.gz") as f:
    icustays_new = pd.read_csv(f)

with gzip.open(path_old / "icu" / "icustays.csv.gz") as f:
    icustays_old = pd.read_csv(f)

with gzip.open(path_new / "hosp" / "patients.csv.gz") as f:
    patients_new = pd.read_csv(f)

with gzip.open(path_old / "hosp" / "patients.csv.gz") as f:
    patients_old = pd.read_csv(f)

merged_new = pd.merge(
    left=icustays_new,
    right=patients_new,
    on="subject_id",
    how="left",
    validate="m:1"
)

merged_old = pd.merge(
    left=icustays_old,
    right=patients_old,
    on="subject_id",
    how="left",
    validate="m:1"
)

print("old:")
print(merged_old["anchor_year_group"].value_counts())

print("new:")
print(merged_new["anchor_year_group"].value_counts())

print(f"old stay_ids: {merged_old['stay_id'].nunique()}")
print(f"new stay_ids: {merged_new['stay_id'].nunique()}")

merged = pd.merge(
    left=merged_old,
    right=merged_new,
    on="stay_id",
    how="outer",
    validate="1:1",
    indicator=True,
    suffixes=["_old", "_new"]
)
merged["anchor_year_group"] = merged["anchor_year_group_old"].fillna(merged["anchor_year_group_new"])

print(merged.groupby(["anchor_year_group", "_merge"]).size())

This prints

old:
anchor_year_group
2008 - 2010    26710
2011 - 2013    17215
2014 - 2016    15989
2017 - 2019    13267
Name: count, dtype: int64

new:
anchor_year_group
2008 - 2010    30002
2011 - 2013    19475
2014 - 2016    18136
2017 - 2019    16048
2020 - 2022    10797
Name: count, dtype: int64

old stay_ids: 73181
new stay_ids: 94458

anchor_year_group  _merge    
2008 - 2010        left_only        79
                   right_only     3371
                   both          26631
2011 - 2013        left_only        71
                   right_only     2331
                   both          17144
2014 - 2016        left_only       124
                   right_only     2271
                   both          15865
2017 - 2019        left_only       436
                   right_only     3217
                   both          12831
2020 - 2022        left_only         0
                   right_only    10797
                   both              0
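
As a sanity check, the per-group counts can be tallied against the stay totals. The sketch below only hard-codes the numbers printed above (no new data); the variable names are chosen for illustration.

```python
# Counts copied from the printed groupby output above:
# (left_only, right_only, both) per anchor_year_group.
counts = {
    "2008 - 2010": (79, 3371, 26631),
    "2011 - 2013": (71, 2331, 17144),
    "2014 - 2016": (124, 2271, 15865),
    "2017 - 2019": (436, 3217, 12831),
    "2020 - 2022": (0, 10797, 0),
}

removed = sum(left for left, _, _ in counts.values())   # only in 2.2
added = sum(right for _, right, _ in counts.values())   # only in 3.1
kept = sum(both for _, _, both in counts.values())      # in both releases
added_pre_2020 = added - counts["2020 - 2022"][1]

# The tallies reconcile with the printed stay_id totals.
assert kept + removed == 73181  # stays in 2.2
assert kept + added == 94458    # stays in 3.1

print(f"stays only in 2.2: {removed}")
print(f"stays only in 3.1: {added}")
print(f"  of which outside 2020 - 2022: {added_pre_2020}")
```

So 11190 of the 21987 stays that exist only in 3.1 fall into anchor year groups before 2020 - 2022, and 710 stays from 2.2 are absent from 3.1.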

This behaviour is not documented on PhysioNet or in the MIMIC documentation (the changelog in the latter is out of date). Is this expected, or is it possibly an error in the anchor_year_group variable?
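
One way to narrow this down might be to check whether the stays that exist only in 3.1 belong to patients already present in 2.2 or to patients that are new in 3.1. The helper below is only a sketch: `split_new_stays` is a hypothetical name, and the small frames are made-up data, not MIMIC records. With the real data one would pass `icustays_new`, `icustays_old` and `patients_old` from the script above.

```python
import pandas as pd


def split_new_stays(icustays_new, icustays_old, patients_old):
    """Count stays present only in the new release, split by whether
    their subject_id already existed in the old patients table."""
    only_new = icustays_new[~icustays_new["stay_id"].isin(icustays_old["stay_id"])]
    known = only_new["subject_id"].isin(patients_old["subject_id"])
    return pd.Series(
        {"known_subject": int(known.sum()), "new_subject": int((~known).sum())}
    )


# Made-up toy data (hypothetical IDs), for illustration only.
toy_icustays_old = pd.DataFrame({"stay_id": [1, 2], "subject_id": [10, 11]})
toy_patients_old = pd.DataFrame({"subject_id": [10, 11]})
toy_icustays_new = pd.DataFrame(
    {"stay_id": [1, 2, 3, 4, 5], "subject_id": [10, 11, 11, 12, 13]}
)

result = split_new_stays(toy_icustays_new, toy_icustays_old, toy_patients_old)
print(result)
```

If most of the additional stays in the earlier groups belonged to subjects that are new in 3.1, that would point towards newly included patients rather than a changed `anchor_year_group` for existing ones.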

Note: It appears that the mapping anchor_year -> anchor_year_group is not unique.
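
A minimal sketch of how this can be checked, assuming a `patients` frame with the `anchor_year` and `anchor_year_group` columns. The function name `groups_per_anchor_year` and the toy rows are hypothetical; with the real data one would pass `patients_new` or `patients_old`.

```python
import pandas as pd


def groups_per_anchor_year(patients):
    """Return anchor_year values that map to more than one anchor_year_group."""
    n_groups = patients.groupby("anchor_year")["anchor_year_group"].nunique()
    return n_groups[n_groups > 1]


# Made-up toy rows, not MIMIC records.
toy_patients = pd.DataFrame(
    {
        "subject_id": [1, 2, 3, 4],
        "anchor_year": [2150, 2150, 2160, 2160],
        "anchor_year_group": [
            "2008 - 2010",
            "2014 - 2016",
            "2011 - 2013",
            "2011 - 2013",
        ],
    }
)

ambiguous = groups_per_anchor_year(toy_patients)
print(ambiguous)
```

A non-empty result means that at least one `anchor_year` value occurs with several `anchor_year_group` values.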
