Support for partitioned parquet #47

aecorn · 2023-02-14T09:49:18Z

On request from @ohvssb
@BjornRoarJoneid also probably has some interest.

Partitioned parquet is supported on Google Cloud, and a partitioned dataset can be filtered row-wise on read, so only the matching data is loaded and loading is faster. For the larger datasets this could be a way to avoid moving to a database.

Writing to a partitioned dataset is pretty simple:

pyarrow.parquet.write_to_dataset(
    pyarrow.Table.from_pandas(df),
    outpath,
    partition_cols=['FODT_AAR'],
    filesystem=gcs_file_system,
)

Getting back to an ordinary pandas dataframe is a bit harder. You would have to:

decide which filter to apply to the partition columns, for example filters=[('FODT_AAR', '>', 1995)]
call read().combine_chunks() on the pyarrow.parquet.ParquetDataset
set the column used in partition_cols back into the data with an appropriate datatype (another option is to duplicate the column before writing, which is probably simpler but adds data)
call .to_pandas() on the resulting pyarrow table

Here is an experiment:
https://github.com/statisticsnorway/utd_nudb/blob/carl_experiments_daplaprod/experiments/partition_parquet.ipynb
