Slowdown when using `alloc!` #33

sumiya11 · 2024-01-07T23:24:54Z

Hi,

I have the following example where I observe a 2x slowdown with Bumper.alloc!.

Could you please confirm that I use the package correctly?
Do you have ideas on how to fix this?

Thank you !

using BenchmarkTools, Bumper

function work0(polys; use_custom_allocator=false)
    if use_custom_allocator
        custom_allocator = Bumper.SlabBuffer()
        @no_escape custom_allocator begin
            work1(polys, custom_allocator)
        end
    else
        work1(polys)
    end
end

# Very important work
function work1(polys, custom_allocator=nothing)
    res = 0
    for poly in polys
        new_poly = work2(poly, custom_allocator)
        res += sum(new_poly)
    end
    res
end

function work2(poly::Vector{T}, ::Nothing) where {T}
    new_poly = Vector{T}(undef, length(poly))
    work3!(new_poly)
end

function work2(poly::Vector{T}, custom_allocator) where {T}
    new_poly = Bumper.alloc!(custom_allocator, T, length(poly))
    work3!(new_poly)
end

function work3!(poly::AbstractVector{T}) where {T}
    poly[1] = one(T)
    for i in 2:length(poly)
        poly[i] = convert(T, i)^3 - poly[i - 1]
    end
    poly
end

###

m, n = 1_000, 10_000
polys = [rand(UInt32, rand(1:m)) for _ in 1:n];

@btime work0(polys, use_custom_allocator=false)
#   6.461 ms (10001 allocations: 20.26 MiB)
# 0x0000e2e1c67cdb19

@btime work0(polys, use_custom_allocator=true)
#   14.154 ms (6 allocations: 608 bytes)
# 0x0000e2e1c67cdb19

Running on

julia> versioninfo()
Julia Version 1.10.0
Commit 3120989f39 (2023-12-25 18:01 UTC) 
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i5-8250U CPU @ 1.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

The text was updated successfully, but these errors were encountered:

tjjarvinen · 2024-01-11T04:37:22Z

Your problem is that you don't release the allocated arrays, and thus end up allocating a quite bit out of the buffer. If you correct this, you will get a little performance increase over the "normal" version

using BenchmarkTools, Bumper

function work0(polys; use_custom_allocator=false)
    if use_custom_allocator
        custom_allocator = Bumper.SlabBuffer()
        work1(polys, custom_allocator)
    else
        work1(polys)
    end
end

# Very important work
function work1(polys, custom_allocator)
    res = 0
    for poly in polys
        @no_escape custom_allocator begin
            new_poly = work2(poly, custom_allocator)
            res += sum(new_poly)
        end
    end
    res
end

function work1(polys)
    res = 0
    for poly in polys
        new_poly = work2(poly)
        res += sum(new_poly)
    end
    res
end


function work2(poly::Vector{T}) where {T}
    new_poly = Vector{T}(undef, length(poly))
    work3!(new_poly)
end

function work2(poly::Vector{T}, custom_allocator) where {T}
    new_poly = Bumper.alloc!(custom_allocator, T, length(poly))
    work3!(new_poly)
end

function work3!(poly::AbstractVector{T}) where {T}
    poly[1] = one(T)
    for i in 2:length(poly)
        poly[i] = convert(T, i)^3 - poly[i - 1]
    end
    poly
end

## #

m, n = 1_000, 10_000
polys = [rand(UInt32, rand(1:m)) for _ in 1:n];

@btime work0(polys, use_custom_allocator=false)
#   3.430 ms (10001 allocations: 20.20 MiB)
#0x0000e20ca5f991fb

@btime work0(polys, use_custom_allocator=true)
#  3.032 ms (4 allocations: 192 bytes)
#0x0000e20ca5f991fb

The code is not that well optimized, so it will give you only a very small performance improvement.

sumiya11 · 2024-01-11T05:03:39Z

Your problem is that you don't release the allocated arrays, and thus end up allocating a quite bit out of the buffer.

But that is precisely the point !

tjjarvinen · 2024-01-11T05:26:20Z

The problem is that, when you don't release arrays that are out of scope, you end up using much more memory than you actually need. In this example your data should fit into L1 cache (little over 8kB). But, if you don't release the temporary data, you end up with a buffer that does not fit into cache at all (about 20MB).

edit. The default SlabBuffer is about 8MB, so it should fit in L3 cache.

sumiya11 · 2024-01-11T05:52:49Z

Thanks for your answer, sorry for being not clear, let me explain.

My example is purposefully inefficient.
In my original problem, releasing intermediate arrays cannot be done easily.
The allocation pattern in my example here roughly resembles this.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Slowdown when using `alloc!` #33

Slowdown when using `alloc!` #33

sumiya11 commented Jan 7, 2024

tjjarvinen commented Jan 11, 2024

sumiya11 commented Jan 11, 2024

tjjarvinen commented Jan 11, 2024 •

edited

Loading

sumiya11 commented Jan 11, 2024

Slowdown when using alloc! #33

Slowdown when using alloc! #33

Comments

sumiya11 commented Jan 7, 2024

tjjarvinen commented Jan 11, 2024

sumiya11 commented Jan 11, 2024

tjjarvinen commented Jan 11, 2024 • edited Loading

sumiya11 commented Jan 11, 2024

Slowdown when using `alloc!` #33

Slowdown when using `alloc!` #33

tjjarvinen commented Jan 11, 2024 •

edited

Loading