Hi! First, let me thank you for this amazing project.

I was experimenting with reading parquet from GCS and the performance looks very poor compared to downloading the file and loading it from disk. In the example below, reading my 10 MB parquet file takes around 10 s, although downloading the file takes less than a second. It is therefore much faster to download the file and then load it from disk (1 s + ~70 ms). I added tracing to the object store and I see more than 506 range requests. I was wondering why there are so many range requests, and what determines the chunk size? Thanks for your help.

```rust
let store = Arc::new({
    GoogleCloudStorageBuilder::new()
        .with_bucket_name(BUCKET_NAME)
        .build()?
});
let url = format!("gs://{BUCKET_NAME}/").parse()?;
let registry = DefaultObjectStoreRegistry::new();
registry.register_store(&url, store);
let runtime_config = RuntimeConfig::default().with_object_store_registry(Arc::new(registry));
let runtime_env = RuntimeEnv::new(runtime_config)?;
let ctx = SessionContext::new_with_config_rt(SessionConfig::default(), Arc::new(runtime_env));
let df = timeit!(
    "read parquet",
    ctx.read_parquet(
        args.input,
        ParquetReadOptions {
            // file_sort_order: vec![vec![col("time").sort(true, true)]],
            parquet_pruning: true.into(),
            ..Default::default()
        },
    )
    .await?
);
println!("{schema}", schema = df.schema());
let df = timeit!(
    "projection",
    df.select(vec![
        cast(
            col("time"),
            DataType::Timestamp(TimeUnit::Millisecond, None),
        )
        .alias("time"),
        col("asset_id"),
        col("asset_type"),
        col("value"),
    ])?
    .cache()
    .await?
);
```
Replies: 1 comment
Perhaps you could run the file through parquet-layout. It sounds like the file might have been written with very small row groups.