Partitioned Parquet is possible on Google Cloud Storage, and it can be filtered on a row basis on read for faster loading. For the larger datasets this could be a way to avoid having to transition to databases.
Writing to a partitioned dataset is pretty simple:
```python
import pyarrow.parquet as pq
from pyarrow import Table

table = Table.from_pandas(df)
pq.write_to_dataset(
    table, root_path=outpath, partition_cols=["FODT_AAR"], filesystem=gcs_file_system
)
```
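The write above assumes a pandas DataFrame `df`, a GCS path `outpath`, and a filesystem object `gcs_file_system` already exist. A minimal sketch of that setup using gcsfs (the project name, bucket path and sample data below are placeholders, not values from the actual experiment):

```python
import gcsfs
import pandas as pd

# Placeholder project/bucket names; substitute your own.
gcs_file_system = gcsfs.GCSFileSystem(project="my-gcp-project")
outpath = "my-bucket/partitioned_dataset"
df = pd.DataFrame({"FODT_AAR": [1994, 1996, 2001], "value": [1, 2, 3]})
```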
Getting back to an ordinary pandas DataFrame is a bit harder; you would have to (see the sketch after this list):
- decide on the filter to apply to the partitioned columns, like `filters=[('FODT_AAR', '>', 1995)]`
- call `read().combine_chunks()` on the `pyarrow.parquet.ParquetDataset`
- set the column used as `partition_cols` back into the data, with an appropriate datatype (another option is to duplicate the column before writing, which is probably simpler but adds data)
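A minimal read-back sketch under those assumptions, reusing `outpath` and `gcs_file_system` from the write example and assuming `FODT_AAR` was originally an integer year column:

```python
import pyarrow.parquet as pq

dataset = pq.ParquetDataset(
    outpath,
    filesystem=gcs_file_system,
    filters=[("FODT_AAR", ">", 1995)],  # row filter applied on read
)
df = dataset.read().combine_chunks().to_pandas()
# The partition column is reconstructed from the directory names, so cast it
# back to the datatype it had before writing (int64 assumed here).
df["FODT_AAR"] = df["FODT_AAR"].astype("int64")
```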
On request from @ohvssb
@BjornRoarJoneid also probably has some interest.
Here is an experiment:
https://github.com/statisticsnorway/utd_nudb/blob/carl_experiments_daplaprod/experiments/partition_parquet.ipynb