Hi! First, let me thank you for this amazing project.

I was experimenting with reading parquet from GCS and the performance looks very poor compared to downloading the file and loading it from disk. In the example below, reading my 10 MB parquet file takes around 10 s, although downloading the file takes less than a second. It is therefore much faster to download the file and then load it from disk (1 s + ~70 ms). I added tracing to the object store and I see more than 506 range requests. I was wondering why there are so many range requests, and what determines the chunk size? Thanks for your help.

```rust
let store = Arc::new({
    GoogleCloudStorageBuilder::new()
        .with_bucket_name(BUCKET_NAME)
        .build()?
});
let url = format!("gs://{BUCKET_NAME}/").parse()?;
let registry = DefaultObjectStoreRegistry::new();
registry.register_store(&url, store);
let runtime_config = RuntimeConfig::default().with_object_store_registry(Arc::new(registry));
let runtime_env = RuntimeEnv::new(runtime_config)?;
let ctx = SessionContext::new_with_config_rt(SessionConfig::default(), Arc::new(runtime_env));
let df = timeit!(
    "read parquet",
    ctx.read_parquet(
        args.input,
        ParquetReadOptions {
            // file_sort_order: vec![vec![col("time").sort(true, true)]],
            parquet_pruning: true.into(),
            ..Default::default()
        },
    )
    .await?
);
println!("{schema}", schema = df.schema());
let df = timeit!(
    "projection",
    df.select(vec![
        cast(
            col("time"),
            DataType::Timestamp(TimeUnit::Millisecond, None),
        )
        .alias("time"),
        col("asset_id"),
        col("asset_type"),
        col("value"),
    ])?
    .cache()
    .await?
);
```
Replies: 1 comment
Perhaps you could run the file through parquet-layout. It sounds like the file might have been written with very small row groups.