
Add disk cache infrastructure back with tests #351

Closed · vchuravy wants to merge 11 commits into master from vc/diskcache2

Conversation

vchuravy (Member) commented Aug 2, 2022

This uses Preferences.jl instead of environment variables, and splits the cache on a user-defined key, the GPUCompiler version, and the Julia version.
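For illustration, a minimal usage sketch of how a downstream project could opt in. The preference names shown here are illustrative; see the docs added in this PR for the exact ones.

using Preferences, GPUCompiler

# Enable the on-disk cache and pick a user-defined shard key; both are assumed
# to be preferences stored in the project's LocalPreferences.toml.
set_preferences!(GPUCompiler, "disk_cache" => true, "cache_key" => "myapp-v1")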

vchuravy requested a review from maleadt on August 2, 2022 14:54
codecov bot commented Aug 2, 2022

Codecov Report

Patch coverage has no change; project coverage change: -85.86% ⚠️

Comparison: base (bec672c) 85.85% vs. head (051e795) 0.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #351       +/-   ##
==========================================
- Coverage   85.85%   0.00%   -85.86%     
==========================================
  Files          24      24               
  Lines        2871    2680      -191     
==========================================
- Hits         2465       0     -2465     
- Misses        406    2680     +2274     
Impacted Files Coverage Δ
src/GPUCompiler.jl 0.00% <ø> (-100.00%) ⬇️
src/cache.jl 0.00% <0.00%> (-95.32%) ⬇️

... and 22 files with indirect coverage changes


vchuravy force-pushed the vc/diskcache2 branch 2 times, most recently from cb25a34 to 49f3f67 on August 2, 2022 17:06
Resolved review threads:
src/cache.jl (outdated)
src/cache.jl (outdated)
test/CacheEnv/LocalPreferences.toml
test/CacheEnv/LocalPreferences.toml (outdated)
vchuravy (Member, Author) commented Aug 2, 2022

Without caching:

vchuravy@odin ~/s/s/j/GemmDenseCUDA (main)> julia --project gemm-dense-cuda.jl 10000 10000 10000 5
args = ["10000", "10000", "10000", "5"]
Time to allocate A  0.670872 seconds (328.92 k allocations: 17.177 MiB, 80.62% compilation time)
Time to allocate B  0.001136 seconds (5 allocations: 176 bytes)
Time to initialize C  0.003191 seconds (638 allocations: 37.242 KiB, 66.78% compilation time)
Time to fill A  0.114808 seconds (4.73 k allocations: 260.202 KiB, 20.44% gc time, 62.84% compilation time)
Time to fill B  0.000006 seconds
Time to simple gemm  14.005771 seconds (14.90 M allocations: 784.978 MiB, 2.13% gc time, 21.18% compilation time)

First run with caching (cold cache):

vchuravy@odin ~/s/s/j/GemmDenseCUDA (vc/micro_optim)> julia --project gemm-dense-cuda.jl 10000 10000 10000 5
args = ["10000", "10000", "10000", "5"]
Time to allocate A  0.706839 seconds (328.92 k allocations: 17.177 MiB, 80.50% compilation time)
Time to allocate B  0.001365 seconds (5 allocations: 176 bytes)
Time to initialize C  0.003525 seconds (638 allocations: 37.242 KiB, 67.51% compilation time)
Time to fill A  0.130957 seconds (4.73 k allocations: 260.202 KiB, 22.79% gc time, 59.73% compilation time)
Time to fill B  0.000006 seconds
Time to simple gemm  18.979182 seconds (19.35 M allocations: 1008.772 MiB, 2.35% gc time, 17.06% compilation time)

Second run with caching (hot cache):

vchuravy@odin ~/s/s/j/GemmDenseCUDA (vc/micro_optim) [SIGINT]> julia --project gemm-dense-cuda.jl 10000 10000 10000 5
args = ["10000", "10000", "10000", "5"]
Time to allocate A  0.654325 seconds (328.92 k allocations: 17.177 MiB, 80.73% compilation time)
Time to allocate B  0.001132 seconds (5 allocations: 176 bytes)
Time to initialize C  0.003681 seconds (638 allocations: 37.242 KiB, 65.31% compilation time)
Time to fill A  0.108716 seconds (4.73 k allocations: 260.202 KiB, 27.39% gc time, 56.61% compilation time)
Time to fill B  0.000004 seconds
Time to simple gemm   3.616108 seconds (722.24 k allocations: 45.187 MiB, 0.60% gc time, 24.34% compilation time)

vchuravy (Member, Author) commented Aug 2, 2022

In discussion with @williamfgc: maybe we shouldn't make the cache_key static, so that an application can set it at startup? I would most likely put the application's git hash in there.
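For illustration, a sketch of what that could look like at application startup, assuming set_cache_key (referenced in the docs below) simply takes a string:

using GPUCompiler

# Hypothetical startup snippet: shard the disk cache by the application's
# current git revision so each revision gets a fresh cache.
app_rev = strip(read(`git -C $(@__DIR__) rev-parse --short HEAD`, String))
GPUCompiler.set_cache_key("myapp-$(app_rev)")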

Resolved review threads:
test/runtests.jl (outdated)
src/cache.jl (outdated)
maleadt (Member) commented Aug 3, 2022

What causes the 5s regression going from 'without cache' to 'first run'?

claforte commented Aug 3, 2022

We were discussing this with @jpsamaroo. I'm not sure whether it's already covered in this PR, but it would be nice if, during development, we had an easy way to specify which kernels we're working on so that they always override the cache, e.g. through a Preferences.jl always_overwrite_kernels list or an optional argument to @kernel.
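Purely to illustrate the idea (this is not implemented in the PR; the preference name and value are hypothetical):

using Preferences, GPUCompiler

# Hypothetical developer preference: kernels listed here would always be
# recompiled and would overwrite any existing disk-cache entry.
set_preferences!(GPUCompiler, "always_overwrite_kernels" => ["gpu_my_kernel!"])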

vchuravy (Member, Author) commented Aug 3, 2022

> We were discussing this with @jpsamaroo. I'm not sure whether it's already covered in this PR, but it would be nice if, during development, we had an easy way to specify which kernels we're working on so that they always override the cache, e.g. through a Preferences.jl always_overwrite_kernels list or an optional argument to @kernel.

I think that would be rather hard to do. This is still a stop-gap towards proper precompilation caching support.

Commit: Apply suggestions from code review
vchuravy (Member, Author) commented

On Julia 1.9 and the current CUDA#master, first compilation got a lot faster even with no disk cache.

args = ["10000", "10000", "10000", "5"]
Time to allocate A  0.080495 seconds (14.08 k allocations: 1002.329 KiB)
Time to allocate B  0.001020 seconds (7 allocations: 256 bytes)
Time to initialize C  0.001061 seconds (7 allocations: 256 bytes)
Time to fill A  0.079274 seconds (3.64 k allocations: 192.344 KiB, 16.84% gc time)
Time to fill B  0.000005 seconds
Time to simple gemm   7.802547 seconds (8.92 M allocations: 546.678 MiB, 1.71% gc time, 0.39% compilation time)
Time to simple gemm 2.620980927
Time to simple gemm 2.634474094
Time to simple gemm 2.648787405
Time to simple gemm 2.669124524
GFLOPS: 756.618023173782 steps: 5 average_time: 2.6433417375
Time to total time  18.620834 seconds (8.97 M allocations: 549.802 MiB, 0.79% gc time, 0.16% compilation time)

vchuravy (Member, Author) commented

Now first run with caching:

args = ["10000", "10000", "10000", "5"]
Time to allocate A  0.083496 seconds (14.08 k allocations: 1002.329 KiB)
Time to allocate B  0.001083 seconds (7 allocations: 256 bytes)
Time to initialize C  0.001120 seconds (7 allocations: 256 bytes)
Time to fill A  0.084755 seconds (3.64 k allocations: 192.344 KiB, 20.16% gc time)
Time to fill B  0.000006 seconds
Time to simple gemm   8.316279 seconds (9.18 M allocations: 564.005 MiB, 1.53% gc time, 0.36% compilation time)
Time to simple gemm 2.621605666
Time to simple gemm 2.644468266
Time to simple gemm 2.656315144
Time to simple gemm 2.670673464
GFLOPS: 755.2112497959444 steps: 5 average_time: 2.648265635
Time to total time  19.164910 seconds (9.22 M allocations: 567.129 MiB, 0.75% gc time, 0.16% compilation time)

Second run hitting the cache:

args = ["10000", "10000", "10000", "5"]
Time to allocate A  0.083945 seconds (14.08 k allocations: 1002.329 KiB)
Time to allocate B  0.001022 seconds (7 allocations: 256 bytes)
Time to initialize C  0.001109 seconds (7 allocations: 256 bytes)
Time to fill A  0.081859 seconds (3.64 k allocations: 192.344 KiB, 20.37% gc time)
Time to fill B  0.000006 seconds
Time to simple gemm   3.225041 seconds (176.45 k allocations: 12.828 MiB, 0.90% compilation time)
Time to simple gemm 2.683764144
Time to simple gemm 2.694396815
Time to simple gemm 2.714404264
Time to simple gemm 2.725327305
GFLOPS: 739.5155737860738 steps: 5 average_time: 2.7044731320000004
Time to total time  14.291853 seconds (221.96 k allocations: 15.949 MiB, 0.12% gc time, 0.20% compilation time)

So: 7.80 s with no disk cache, 8.32 s with a cold cache, and 3.23 s with a hot cache. Subtracting the ~2.6 s baseline kernel time, that is 5.2 s of compilation normally, 5.7 s with a cold cache, and 0.6 s with a hot cache.

vchuravy (Member, Author) commented

On an Oceananigans test case, setup time spent in cufunction went from 150 s to 1.5 s.


!!! warning
The disk cache is not automatically invalidated. It is sharded upon
`cache_key` (see [`set_cache_key``](@ref)), the GPUCompiler version
Review comment (Member):

Suggested change
`cache_key` (see [`set_cache_key``](@ref)), the GPUCompiler version
`cache_key` (see [`set_cache_key`](@ref)), the GPUCompiler version

end

key(ver::VersionNumber) = "$(ver.major)_$(ver.minor)_$(ver.patch)"
cache_path() = @get_scratch!(cache_key() * "-kernels-" * key(VERSION))
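For example (cache key purely illustrative), the scratch directory name this composes would look like:

julia> key(v"1.9.0")
"1_9_0"

julia> cache_key() * "-kernels-" * key(VERSION)   # with cache_key() == "myapp" on Julia 1.9.0
"myapp-kernels-1_9_0"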
Review comment (Member):

Maybe include "cache" in the directory name? Or make this a subdirectory of the existing compile_cache scratch directory? That way the cache would also get wiped on reset_runtime, which is done when recompiling CUDA.jl. Or is that unwanted?

Review comment (Member):

Might also be confusing to have `compiled` in the scratch dir (containing the runtime bitcode) next to `cache` for compiled kernels :-) Maybe cache/{runtime,jobs}?

I know we're bikeshedding here :-)

Reply (Member, Author):

> That way the cache would also get wiped on reset_runtime, which is done when recompiling CUDA.jl. Or is that unwanted?

I was trying to add a dependency on the version of GPUCompiler/CUDA. Cache invalidation is a big potential footgun here.

@@ -173,7 +206,18 @@ end
job = CompilerJob(src, cfg)

asm = nothing
# TODO: consider loading the assembly from an on-disk cache here
# can we load from the disk cache?
if disk_cache()
Review comment (Member):

Can we use @static here?

Reply (Member, Author):

I was frustrated by the need to recompile GPUCompiler to turn caching on and off. I originally had it as a compile-time preference, which is what we would need for it to be @static.

Review comment (Member):

I didn't realize we had non-compile-time preferences...
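For context, a minimal sketch of the difference (not the PR's exact code; it assumes Preferences.jl's @load_preference and Serialization for the cache format). A preference read inside a function body is looked up at run time, whereas one bound to a const at module top level is baked in during precompilation, which is what @static would need:

using Preferences, Serialization   # inside the GPUCompiler module

# Run-time flavour: the preference is read each time the function is called,
# so toggling the cache only needs a restart, not a recompile of GPUCompiler.
disk_cache() = @load_preference("disk_cache", false)

# Compile-time flavour: the value is fixed when the package is precompiled;
# only then can call sites branch with @static.
const DISK_CACHE = @load_preference("disk_cache", false)

function maybe_load_from_cache(path)
    @static if DISK_CACHE              # branch resolved at compile time
        isfile(path) && return deserialize(path)
    end
    return nothing
end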

@@ -182,6 +226,10 @@ end
end

asm = compiler(job)

if disk_cache() && !isfile(path)
Review comment (Member):

Same here.
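And a sketch of the guarded store, again assuming Serialization and that path points into cache_path(); writing to a temporary file first and then renaming keeps concurrent readers from ever seeing a half-written entry (relevant on shared filesystems, see below):

using Serialization

# Hypothetical helper mirroring the hunk above: store the compiled asm once.
function store_asm!(path, asm)
    mkpath(dirname(path))
    tmp = tempname(dirname(path))
    serialize(tmp, asm)                # write the artifact to a temp file ...
    mv(tmp, path; force=true)          # ... then atomically move it into place
end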

vchuravy (Member, Author) commented

ERROR: LoadError: CUDA error: named symbol not found (code 500, ERROR_NOT_FOUND)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/N71Iw/lib/cudadrv/libcuda.jl:27
  [2] macro expansion
    @ ~/.julia/packages/CUDA/N71Iw/lib/cudadrv/libcuda.jl:35 [inlined]
  [3] cuModuleGetFunction(hfunc::Base.RefValue{Ptr{CUDA.CUfunc_st}}, hmod::CUDA.CuModule, name::String)
    @ CUDA ~/.julia/packages/CUDA/N71Iw/lib/utils/call.jl:26
  [4] CuFunction
    @ ~/.julia/packages/CUDA/N71Iw/lib/cudadrv/module/function.jl:19 [inlined]
  [5] link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry, :external_gvars), Tuple{Vector{UInt8}, String, Vector{String}}})
    @ CUDA ~/.julia/packages/CUDA/N71Iw/src/compiler/compilation.jl:235
  [6] (::GPUCompiler.var"#123#124"{Dict{UInt64, Any}, UInt64, typeof(CUDA.link), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})()
    @ GPUCompiler ~/.julia/packages/GPUCompiler/81n3h/src/cache.jl:250
  [7] lock(f::GPUCompiler.var"#123#124"{Dict{UInt64, Any}, UInt64, typeof(CUDA.link), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}, l::ReentrantLock)
    @ Base ./lock.jl:229
  [8] actual_compilation(cache::Dict{UInt64, Any}, key::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, ft::Type, tt::Type, world::UInt64, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/81n3h/src/cache.jl:247
  [9] cached_compilation(cache::Dict{UInt64, Any}, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, ft::Type, tt::Type, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/81n3h/src/cache.jl:200
 [10] macro expansion
    @ ~/.julia/packages/CUDA/N71Iw/src/compiler/execution.jl:310 [inlined]
 [11] macro expansion
    @ ./lock.jl:267 [inlined]
 [12] cufunction(f::typeof(Oceananigans.TurbulenceClosures.gpu_compute_ri_number!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(2162, 902, 102)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{3, KernelAbstractions.NDIteration.StaticSize{(136, 57, 102)}, KernelAbstractions.NDIteration.StaticSize{(16, 16, 1)}, Nothing, Nothing}}, NamedTuple{(:κ, :ν, :Ri), Tuple{OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}}}, Tuple{Int64, Int64, Int64}, ImmersedBoundaryGrid{Float64, FullyConnected, FullyConnected, Bounded, LatitudeLongitudeGrid{Float64, FullyConnected, FullyConnected, Bounded, OffsetArrays.OffsetVector{Float64, CUDA.CuDeviceVector{Float64, 1}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, CUDA.CuDeviceVector{Float64, 1}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, CUDA.CuDeviceVector{Float64, 1}}, Nothing}, GridFittedBottom{typeof(OceanScalingTests.double_drake_bathymetry), Oceananigans.ImmersedBoundaries.CenterImmersedCondition}, Nothing, Nothing}, RiBasedVerticalDiffusivity{VerticallyImplicitTimeDiscretization, Float64, Oceananigans.TurbulenceClosures.HyperbolicTangentRiDependentTapering}, NamedTuple{(:u, :v, :w), Tuple{OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}}}, NamedTuple{(:T, :S), Tuple{OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}}}, Buoyancy{SeawaterBuoyancy{Float64, SeawaterPolynomials.BoussinesqEquationOfState{SeawaterPolynomials.TEOS10.TEOS10SeawaterPolynomial{Float64}, Float64}, Nothing, Nothing}, Oceananigans.Grids.ZDirection}, NamedTuple{(:T, :S), Tuple{BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Oceananigans.BoundaryConditions.DiscreteBoundaryFunction{Float64, typeof(OceanScalingTests.T_relaxation)}}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Oceananigans.BoundaryConditions.DiscreteBoundaryFunction{NTuple{4, NTuple{4, Float64}}, typeof(OceanScalingTests.surface_salinity_flux)}}}}, NamedTuple{(:time, :iteration, :stage), Tuple{Float64, Int64, Int64}}}}; kwargs::Base.Pairs{Symbol, Integer, Tuple{Symbol, Symbol}, NamedTuple{(:always_inline, :maxthreads), Tuple{Bool, Int64}}})
    @ CUDA ~/.julia/packages/CUDA/N71Iw/src/compiler/execution.jl:306
 [13] macro expansion
    @ ~/.julia/packages/CUDA/N71Iw/src/compiler/execution.jl:104 [inlined]
 [14] (::KernelAbstractions.Kernel{CUDA.CUDAKernels.CUDABackend, KernelAbstractions.NDIteration.StaticSize{(16, 16)}, KernelAbstractions.NDIteration.StaticSize{(2162, 902, 102)}, typeof(Oceananigans.TurbulenceClosures.gpu_compute_ri_number!)})(::NamedTuple{(:κ, :ν, :Ri), Tuple{Field{Center, Center, Face, Nothing, ImmersedBoundaryGrid{Float64, FullyConnected, FullyConnected, Bounded, LatitudeLongitudeGrid{Float64, FullyConnected, FullyConnected, Bounded, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, GridFittedBottom{typeof(OceanScalingTests.double_drake_bathymetry), Oceananigans.ImmersedBoundaries.CenterImmersedCondition}, Nothing, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, Tuple{Colon, Colon, Colon}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, 
CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}}}, Field{Center, Center, Face, Nothing, ImmersedBoundaryGrid{Float64, FullyConnected, FullyConnected, Bounded, LatitudeLongitudeGrid{Float64, FullyConnected, FullyConnected, Bounded, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, GridFittedBottom{typeof(OceanScalingTests.double_drake_bathymetry), Oceananigans.ImmersedBoundaries.CenterImmersedCondition}, Nothing, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, Tuple{Colon, Colon, Colon}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}  

Found by @simone-silvestri when running with a large number of nodes and a shared filesystem.

maleadt (Member) commented Apr 17, 2023

That CuFunction look-up constructor should probably do its own error handling (i.e., call unsafe_cuModuleGetFunction and print the requested function name; sadly I don't think we can list the available ones).
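For illustration, a hypothetical helper along those lines (assuming the generated unsafe_cuModuleGetFunction wrapper mentioned above and the standard CUDA_SUCCESS / CUDA_ERROR_NOT_FOUND driver enum values; this is not the actual CUDA.jl constructor):

# Look up a kernel by name and report the missing symbol ourselves.
function lookup_kernel(mod::CuModule, name::String)
    handle_ref = Ref{Ptr{CUfunc_st}}()
    res = unsafe_cuModuleGetFunction(handle_ref, mod, name)
    if res == CUDA_ERROR_NOT_FOUND
        error("kernel '$name' not found in the loaded module; a stale or ",
              "mismatched disk-cache image is a likely cause")
    elseif res != CUDA_SUCCESS
        throw_api_error(res)
    end
    return handle_ref[]
end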

vchuravy (Member, Author) commented Apr 3, 2024

Replaced by #557

@vchuravy vchuravy closed this Apr 3, 2024