-
The DataFusion query engine is fantastic and I've been using it to implement a data streaming layer for the Nautilus trading engine. However, upon further testing I'm seeing very high total memory allocations from the DataFusion code.
The file I'm reading is about 116 MB and contains about 10M records. The test code streams record batches from the file and counts the total number of records retrieved. Something like this:
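(A minimal sketch of the test; the table name, file path, and the timestamp column, here called `ts`, are placeholders.)

```rust
use datafusion::prelude::{ParquetReadOptions, SessionContext};
use futures::StreamExt;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Register the ~116 MB parquet file as a table.
    ctx.register_parquet("quotes", "data/quotes.parquet", ParquetReadOptions::default())
        .await?;

    // Stream record batches and count rows without collecting them;
    // `ts` stands in for the sort column.
    let df = ctx.sql("SELECT * FROM quotes ORDER BY ts").await?;
    let mut stream = df.execute_stream().await?;
    let mut count = 0;
    while let Some(batch) = stream.next().await {
        count += batch?.num_rows();
    }
    println!("total rows: {count}");
    Ok(())
}
```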
Ideally, my expectation is that only the memory needed to process one chunk of data will be allocated: retrieving it from disk, processing it, and collecting it into a `Vec`. But analysis with bytehound shows that the memory keeps growing, and based on the stack trace it is growing inside the DataFusion library logic. The total allocation for the program grows to about 1 GB (for a 116 MB file 🙃), and about 700 MB of it comes from a few lines in the DataFusion parquet reader. The full backtrace and graphs are attached below. Is there any way to keep the memory from growing?
-
It turns out the ORDER BY clause was the culprit: the sort has to buffer its entire input before it can emit anything. Removing the ORDER BY keeps the memory flat. However, this means that the user must ensure that the file is pre-sorted. Is there any query or clause that can check this assumption while querying the file and fail if it's not sorted?
-
Thank you @twitu -- this is a great analysis, and thank you for posting your results (and not leaving us hanging!). You can tell DataFusion how your file is sorted using its APIs: either by registering the listing table directly with a declared sort order, or via SQL DDL as described in https://arrow.apache.org/datafusion/user-guide/sql/ddl.html:
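Something like this (a sketch of the SQL DDL route; the table name, file path, and sort column `ts` are placeholders, and `WITH ORDER` requires a DataFusion version that supports it):

```rust
use datafusion::prelude::SessionContext;

#[tokio::main]
async fn main() -> datafusion::error::Result<()> {
    let ctx = SessionContext::new();
    // Declare the file's existing sort order when creating the table,
    // so DataFusion can skip re-sorting when a query's ORDER BY matches.
    ctx.sql(
        "CREATE EXTERNAL TABLE quotes \
         STORED AS PARQUET \
         WITH ORDER (ts ASC) \
         LOCATION 'data/quotes.parquet'",
    )
    .await?;

    // This query should now plan without a Sort operator.
    let df = ctx.sql("SELECT * FROM quotes ORDER BY ts").await?;
    df.show_limit(5).await?;
    Ok(())
}
```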
If you tell DataFusion the file is sorted by the relevant column, it will not add a sort to the plan when your query's ORDER BY matches that order.
-
Hi @alamb, it appears the fix I mentioned above was spurious. For some reason a simple streaming query still grows the memory continuously. Here's my script: it takes a file name, loads the file into the context, and runs a simple `SELECT * FROM <table_name>` query.

```rust
use std::path::PathBuf;

use datafusion::prelude::{ParquetReadOptions, SessionContext};
use futures::StreamExt;

fn main() {
    let file_path: PathBuf = std::env::var("FILE_PATH").unwrap().into();
    let file_name = file_path.file_stem().unwrap().to_str().unwrap();

    let runtime = tokio::runtime::Builder::new_multi_thread()
        .enable_all()
        .build()
        .unwrap();

    let session_ctx = SessionContext::default();
    let parquet_options = ParquetReadOptions::<'_> {
        skip_metadata: Some(false),
        ..Default::default()
    };
    runtime
        .block_on(session_ctx.register_parquet(
            file_name,
            file_path.to_str().unwrap(),
            parquet_options,
        ))
        .unwrap();

    // Stream the batches and count rows; each batch is dropped right after
    // it is counted, so memory use should stay roughly constant.
    let default_query = format!("SELECT * FROM {}", &file_name);
    let query = runtime.block_on(session_ctx.sql(&default_query)).unwrap();
    let mut batch_stream = runtime.block_on(query.execute_stream()).unwrap();
    let mut count = 0;
    while let Some(Ok(batch)) = runtime.block_on(batch_stream.next()) {
        count += batch.num_rows();
    }
    println!("{}", count);
}
```

But when I check the memory profile it keeps growing, to almost 600 MB for a parquet file of 680 MB. As you can see from the step increases in memory, some `Vec` is growing and reallocating at double its size. This logic is somewhere deep inside the parquet reader.
The unit of IO is the page if the offset index is enabled; otherwise it falls back to reading the entire column chunk. What did you use to write the file? This behaviour would make sense if the file has only a few very large row groups and no offset index.
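For example, bounding the row group size when writing keeps column chunks small, so a reader never has to buffer one huge chunk at a time. A sketch using the `parquet` crate's `ArrowWriter`; the schema, row-group cap, and output path are placeholders:

```rust
use std::{fs::File, sync::Arc};

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // A stand-in schema with a single timestamp-like column.
    let schema = Arc::new(Schema::new(vec![Field::new("ts", DataType::Int64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        vec![Arc::new(Int64Array::from_iter_values(0..1_000_000))],
    )?;

    // Cap row groups at ~64k rows instead of writing a few very large ones.
    let props = WriterProperties::builder()
        .set_max_row_group_size(64 * 1024)
        .build();

    let file = File::create("data/quotes.parquet")?;
    let mut writer = ArrowWriter::try_new(file, schema, Some(props))?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```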