
Add disk cache infrastructure back with tests #351

Closed · vchuravy wants to merge 11 commits into master from vc/diskcache2

Conversation

vchuravy (Member) commented Aug 2, 2022

This uses Preferences.jl instead of environment variables, and splits the cache on a user-defined key, the GPUCompiler version, and the Julia version.
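For illustration, a minimal usage sketch of how a downstream project could opt in. The preference names shown here are illustrative; see the docs added in this PR for the exact ones.

using Preferences, GPUCompiler

# Enable the on-disk cache and pick a user-defined shard key; both are assumed
# to be preferences stored in the project's LocalPreferences.toml.
set_preferences!(GPUCompiler, "disk_cache" => true, "cache_key" => "myapp-v1")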

vchuravy requested a review from maleadt on August 2, 2022 14:54
codecov bot commented Aug 2, 2022

Codecov Report

Patch coverage has no change; project coverage change: -85.86% ⚠️

Comparison: base (bec672c) 85.85% vs. head (051e795) 0.00%.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #351       +/-   ##
==========================================
- Coverage   85.85%   0.00%   -85.86%     
==========================================
  Files          24      24               
  Lines        2871    2680      -191     
==========================================
- Hits         2465       0     -2465     
- Misses        406    2680     +2274     
Impacted Files Coverage Δ
src/GPUCompiler.jl 0.00% <ø> (-100.00%) ⬇️
src/cache.jl 0.00% <0.00%> (-95.32%) ⬇️

... and 22 files with indirect coverage changes


vchuravy force-pushed the vc/diskcache2 branch 2 times, most recently from cb25a34 to 49f3f67 on August 2, 2022 17:06
Resolved review threads:
src/cache.jl (outdated)
src/cache.jl (outdated)
test/CacheEnv/LocalPreferences.toml
test/CacheEnv/LocalPreferences.toml (outdated)
vchuravy (Member, Author) commented Aug 2, 2022

Without caching:

vchuravy@odin ~/s/s/j/GemmDenseCUDA (main)> julia --project gemm-dense-cuda.jl 10000 10000 10000 5
args = ["10000", "10000", "10000", "5"]
Time to allocate A  0.670872 seconds (328.92 k allocations: 17.177 MiB, 80.62% compilation time)
Time to allocate B  0.001136 seconds (5 allocations: 176 bytes)
Time to initialize C  0.003191 seconds (638 allocations: 37.242 KiB, 66.78% compilation time)
Time to fill A  0.114808 seconds (4.73 k allocations: 260.202 KiB, 20.44% gc time, 62.84% compilation time)
Time to fill B  0.000006 seconds
Time to simple gemm  14.005771 seconds (14.90 M allocations: 784.978 MiB, 2.13% gc time, 21.18% compilation time)

First run with caching (cold cache):

vchuravy@odin ~/s/s/j/GemmDenseCUDA (vc/micro_optim)> julia --project gemm-dense-cuda.jl 10000 10000 10000 5
args = ["10000", "10000", "10000", "5"]
Time to allocate A  0.706839 seconds (328.92 k allocations: 17.177 MiB, 80.50% compilation time)
Time to allocate B  0.001365 seconds (5 allocations: 176 bytes)
Time to initialize C  0.003525 seconds (638 allocations: 37.242 KiB, 67.51% compilation time)
Time to fill A  0.130957 seconds (4.73 k allocations: 260.202 KiB, 22.79% gc time, 59.73% compilation time)
Time to fill B  0.000006 seconds
Time to simple gemm  18.979182 seconds (19.35 M allocations: 1008.772 MiB, 2.35% gc time, 17.06% compilation time)

Second run with caching (hot cache):

vchuravy@odin ~/s/s/j/GemmDenseCUDA (vc/micro_optim) [SIGINT]> julia --project gemm-dense-cuda.jl 10000 10000 10000 5
args = ["10000", "10000", "10000", "5"]
Time to allocate A  0.654325 seconds (328.92 k allocations: 17.177 MiB, 80.73% compilation time)
Time to allocate B  0.001132 seconds (5 allocations: 176 bytes)
Time to initialize C  0.003681 seconds (638 allocations: 37.242 KiB, 65.31% compilation time)
Time to fill A  0.108716 seconds (4.73 k allocations: 260.202 KiB, 27.39% gc time, 56.61% compilation time)
Time to fill B  0.000004 seconds
Time to simple gemm   3.616108 seconds (722.24 k allocations: 45.187 MiB, 0.60% gc time, 24.34% compilation time)

vchuravy (Member, Author) commented Aug 2, 2022

In discussion with @williamfgc: maybe we shouldn't make the cache_key static, so that an application can set it at startup? I would most likely put the application's git hash in there.
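For illustration, a sketch of what that could look like at application startup, assuming set_cache_key (referenced in the docs below) simply takes a string:

using GPUCompiler

# Hypothetical startup snippet: shard the disk cache by the application's
# current git revision so each revision gets a fresh cache.
app_rev = strip(read(`git -C $(@__DIR__) rev-parse --short HEAD`, String))
GPUCompiler.set_cache_key("myapp-$(app_rev)")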

Resolved review threads:
test/runtests.jl (outdated)
src/cache.jl (outdated)
maleadt (Member) commented Aug 3, 2022

What causes the 5s regression going from 'without cache' to 'first run'?

claforte commented Aug 3, 2022

We were discussing this with @jpsamaroo. I'm not sure whether it's already covered in this PR, but it would be nice if, during development, we had an easy way to specify which kernels we're working on so that they always override the cache, e.g. through a Preferences.jl always_overwrite_kernels list or an optional argument to @kernel.
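Purely to illustrate the idea (this is not implemented in the PR; the preference name and value are hypothetical):

using Preferences, GPUCompiler

# Hypothetical developer preference: kernels listed here would always be
# recompiled and would overwrite any existing disk-cache entry.
set_preferences!(GPUCompiler, "always_overwrite_kernels" => ["gpu_my_kernel!"])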

vchuravy (Member, Author) commented Aug 3, 2022

> We were discussing this with @jpsamaroo. I'm not sure whether it's already covered in this PR, but it would be nice if, during development, we had an easy way to specify which kernels we're working on so that they always override the cache, e.g. through a Preferences.jl always_overwrite_kernels list or an optional argument to @kernel.

I think that would be rather hard to do. This is still a stop-gap towards proper precompilation caching support.

Commit: Apply suggestions from code review
vchuravy (Member, Author) commented

On Julia 1.9 and the current CUDA#master, first compilation got a lot faster even with no disk cache.

args = ["10000", "10000", "10000", "5"]
Time to allocate A  0.080495 seconds (14.08 k allocations: 1002.329 KiB)
Time to allocate B  0.001020 seconds (7 allocations: 256 bytes)
Time to initialize C  0.001061 seconds (7 allocations: 256 bytes)
Time to fill A  0.079274 seconds (3.64 k allocations: 192.344 KiB, 16.84% gc time)
Time to fill B  0.000005 seconds
Time to simple gemm   7.802547 seconds (8.92 M allocations: 546.678 MiB, 1.71% gc time, 0.39% compilation time)
Time to simple gemm 2.620980927
Time to simple gemm 2.634474094
Time to simple gemm 2.648787405
Time to simple gemm 2.669124524
GFLOPS: 756.618023173782 steps: 5 average_time: 2.6433417375
Time to total time  18.620834 seconds (8.97 M allocations: 549.802 MiB, 0.79% gc time, 0.16% compilation time)

vchuravy (Member, Author) commented

Now first run with caching:

args = ["10000", "10000", "10000", "5"]
Time to allocate A  0.083496 seconds (14.08 k allocations: 1002.329 KiB)
Time to allocate B  0.001083 seconds (7 allocations: 256 bytes)
Time to initialize C  0.001120 seconds (7 allocations: 256 bytes)
Time to fill A  0.084755 seconds (3.64 k allocations: 192.344 KiB, 20.16% gc time)
Time to fill B  0.000006 seconds
Time to simple gemm   8.316279 seconds (9.18 M allocations: 564.005 MiB, 1.53% gc time, 0.36% compilation time)
Time to simple gemm 2.621605666
Time to simple gemm 2.644468266
Time to simple gemm 2.656315144
Time to simple gemm 2.670673464
GFLOPS: 755.2112497959444 steps: 5 average_time: 2.648265635
Time to total time  19.164910 seconds (9.22 M allocations: 567.129 MiB, 0.75% gc time, 0.16% compilation time)

Second run hitting the cache:

args = ["10000", "10000", "10000", "5"]
Time to allocate A  0.083945 seconds (14.08 k allocations: 1002.329 KiB)
Time to allocate B  0.001022 seconds (7 allocations: 256 bytes)
Time to initialize C  0.001109 seconds (7 allocations: 256 bytes)
Time to fill A  0.081859 seconds (3.64 k allocations: 192.344 KiB, 20.37% gc time)
Time to fill B  0.000006 seconds
Time to simple gemm   3.225041 seconds (176.45 k allocations: 12.828 MiB, 0.90% compilation time)
Time to simple gemm 2.683764144
Time to simple gemm 2.694396815
Time to simple gemm 2.714404264
Time to simple gemm 2.725327305
GFLOPS: 739.5155737860738 steps: 5 average_time: 2.7044731320000004
Time to total time  14.291853 seconds (221.96 k allocations: 15.949 MiB, 0.12% gc time, 0.20% compilation time)

So: 7.80 s with no disk cache, 8.32 s with a cold cache, and 3.23 s with a hot cache. Subtracting the ~2.6 s baseline kernel time, that is 5.2 s of compilation normally, 5.7 s with a cold cache, and 0.6 s with a hot cache.

vchuravy (Member, Author) commented

On an Oceananigans test case, setup time spent in cufunction went from 150 s to 1.5 s.


!!! warning
The disk cache is not automatically invalidated. It is sharded upon
`cache_key` (see [`set_cache_key``](@ref)), the GPUCompiler version
Review comment (Member):

Suggested change
`cache_key` (see [`set_cache_key``](@ref)), the GPUCompiler version
`cache_key` (see [`set_cache_key`](@ref)), the GPUCompiler version

end

key(ver::VersionNumber) = "$(ver.major)_$(ver.minor)_$(ver.patch)"
cache_path() = @get_scratch!(cache_key() * "-kernels-" * key(VERSION))
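For example (cache key purely illustrative), the scratch directory name this composes would look like:

julia> key(v"1.9.0")
"1_9_0"

julia> cache_key() * "-kernels-" * key(VERSION)   # with cache_key() == "myapp" on Julia 1.9.0
"myapp-kernels-1_9_0"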
Review comment (Member):

Maybe include "cache" in the directory name? Or make this a subdirectory of the existing compile_cache scratch directory? That way the cache would also get wiped on reset_runtime, which is done when recompiling CUDA.jl. Or is that unwanted?

Review comment (Member):

Might also be confusing to have `compiled` in the scratch dir (containing the runtime bitcode) next to `cache` for compiled kernels :-) Maybe cache/{runtime,jobs}?

I know we're bikeshedding here :-)

Reply (Member, Author):

> That way the cache would also get wiped on reset_runtime, which is done when recompiling CUDA.jl. Or is that unwanted?

I was trying to add a dependency on the version of GPUCompiler/CUDA. Cache invalidation is a big potential footgun here.

@@ -173,7 +206,18 @@ end
job = CompilerJob(src, cfg)

asm = nothing
# TODO: consider loading the assembly from an on-disk cache here
# can we load from the disk cache?
if disk_cache()
Review comment (Member):

Can we use @static here?

Reply (Member, Author):

I was frustrated by the need to recompile GPUCompiler to turn caching on and off. I originally had it as a compile-time preference, which is what we would need for it to be @static.

Review comment (Member):

I didn't realize we had non-compile-time preferences...
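For context, a minimal sketch of the difference (not the PR's exact code; it assumes Preferences.jl's @load_preference and Serialization for the cache format). A preference read inside a function body is looked up at run time, whereas one bound to a const at module top level is baked in during precompilation, which is what @static would need:

using Preferences, Serialization   # inside the GPUCompiler module

# Run-time flavour: the preference is read each time the function is called,
# so toggling the cache only needs a restart, not a recompile of GPUCompiler.
disk_cache() = @load_preference("disk_cache", false)

# Compile-time flavour: the value is fixed when the package is precompiled;
# only then can call sites branch with @static.
const DISK_CACHE = @load_preference("disk_cache", false)

function maybe_load_from_cache(path)
    @static if DISK_CACHE              # branch resolved at compile time
        isfile(path) && return deserialize(path)
    end
    return nothing
end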

@@ -182,6 +226,10 @@ end
end

asm = compiler(job)

if disk_cache() && !isfile(path)
Review comment (Member):

Same here.
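And a sketch of the guarded store, again assuming Serialization and that path points into cache_path(); writing to a temporary file first and then renaming keeps concurrent readers from ever seeing a half-written entry (relevant on shared filesystems, see below):

using Serialization

# Hypothetical helper mirroring the hunk above: store the compiled asm once.
function store_asm!(path, asm)
    mkpath(dirname(path))
    tmp = tempname(dirname(path))
    serialize(tmp, asm)                # write the artifact to a temp file ...
    mv(tmp, path; force=true)          # ... then atomically move it into place
end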

vchuravy (Member, Author) commented

ERROR: LoadError: CUDA error: named symbol not found (code 500, ERROR_NOT_FOUND)
Stacktrace:
  [1] throw_api_error(res::CUDA.cudaError_enum)
    @ CUDA ~/.julia/packages/CUDA/N71Iw/lib/cudadrv/libcuda.jl:27
  [2] macro expansion
    @ ~/.julia/packages/CUDA/N71Iw/lib/cudadrv/libcuda.jl:35 [inlined]
  [3] cuModuleGetFunction(hfunc::Base.RefValue{Ptr{CUDA.CUfunc_st}}, hmod::CUDA.CuModule, name::String)
    @ CUDA ~/.julia/packages/CUDA/N71Iw/lib/utils/call.jl:26
  [4] CuFunction
    @ ~/.julia/packages/CUDA/N71Iw/lib/cudadrv/module/function.jl:19 [inlined]
  [5] link(job::GPUCompiler.CompilerJob, compiled::NamedTuple{(:image, :entry, :external_gvars), Tuple{Vector{UInt8}, String, Vector{String}}})
    @ CUDA ~/.julia/packages/CUDA/N71Iw/src/compiler/compilation.jl:235
  [6] (::GPUCompiler.var"#123#124"{Dict{UInt64, Any}, UInt64, typeof(CUDA.link), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}})()
    @ GPUCompiler ~/.julia/packages/GPUCompiler/81n3h/src/cache.jl:250
  [7] lock(f::GPUCompiler.var"#123#124"{Dict{UInt64, Any}, UInt64, typeof(CUDA.link), GPUCompiler.CompilerJob{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}}, l::ReentrantLock)
    @ Base ./lock.jl:229
  [8] actual_compilation(cache::Dict{UInt64, Any}, key::UInt64, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, ft::Type, tt::Type, world::UInt64, compiler::typeof(CUDA.compile), linker::typeof(CUDA.link))
    @ GPUCompiler ~/.julia/packages/GPUCompiler/81n3h/src/cache.jl:247
  [9] cached_compilation(cache::Dict{UInt64, Any}, cfg::GPUCompiler.CompilerConfig{GPUCompiler.PTXCompilerTarget, CUDA.CUDACompilerParams}, ft::Type, tt::Type, compiler::Function, linker::Function)
    @ GPUCompiler ~/.julia/packages/GPUCompiler/81n3h/src/cache.jl:200
 [10] macro expansion
    @ ~/.julia/packages/CUDA/N71Iw/src/compiler/execution.jl:310 [inlined]
 [11] macro expansion
    @ ./lock.jl:267 [inlined]
 [12] cufunction(f::typeof(Oceananigans.TurbulenceClosures.gpu_compute_ri_number!), tt::Type{Tuple{KernelAbstractions.CompilerMetadata{KernelAbstractions.NDIteration.StaticSize{(2162, 902, 102)}, KernelAbstractions.NDIteration.DynamicCheck, Nothing, Nothing, KernelAbstractions.NDIteration.NDRange{3, KernelAbstractions.NDIteration.StaticSize{(136, 57, 102)}, KernelAbstractions.NDIteration.StaticSize{(16, 16, 1)}, Nothing, Nothing}}, NamedTuple{(:κ, :ν, :Ri), Tuple{OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}}}, Tuple{Int64, Int64, Int64}, ImmersedBoundaryGrid{Float64, FullyConnected, FullyConnected, Bounded, LatitudeLongitudeGrid{Float64, FullyConnected, FullyConnected, Bounded, OffsetArrays.OffsetVector{Float64, CUDA.CuDeviceVector{Float64, 1}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, CUDA.CuDeviceVector{Float64, 1}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, CUDA.CuDeviceVector{Float64, 1}}, Nothing}, GridFittedBottom{typeof(OceanScalingTests.double_drake_bathymetry), Oceananigans.ImmersedBoundaries.CenterImmersedCondition}, Nothing, Nothing}, RiBasedVerticalDiffusivity{VerticallyImplicitTimeDiscretization, Float64, Oceananigans.TurbulenceClosures.HyperbolicTangentRiDependentTapering}, NamedTuple{(:u, :v, :w), Tuple{OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}}}, NamedTuple{(:T, :S), Tuple{OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuDeviceArray{Float64, 3, 1}}}}, Buoyancy{SeawaterBuoyancy{Float64, SeawaterPolynomials.BoussinesqEquationOfState{SeawaterPolynomials.TEOS10.TEOS10SeawaterPolynomial{Float64}, Float64}, Nothing, Nothing}, Oceananigans.Grids.ZDirection}, NamedTuple{(:T, :S), Tuple{BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Oceananigans.BoundaryConditions.DiscreteBoundaryFunction{Float64, typeof(OceanScalingTests.T_relaxation)}}, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Oceananigans.BoundaryConditions.DiscreteBoundaryFunction{NTuple{4, NTuple{4, Float64}}, typeof(OceanScalingTests.surface_salinity_flux)}}}}, NamedTuple{(:time, :iteration, :stage), Tuple{Float64, Int64, Int64}}}}; kwargs::Base.Pairs{Symbol, Integer, Tuple{Symbol, Symbol}, NamedTuple{(:always_inline, :maxthreads), Tuple{Bool, Int64}}})
    @ CUDA ~/.julia/packages/CUDA/N71Iw/src/compiler/execution.jl:306
 [13] macro expansion
    @ ~/.julia/packages/CUDA/N71Iw/src/compiler/execution.jl:104 [inlined]
 [14] (::KernelAbstractions.Kernel{CUDA.CUDAKernels.CUDABackend, KernelAbstractions.NDIteration.StaticSize{(16, 16)}, KernelAbstractions.NDIteration.StaticSize{(2162, 902, 102)}, typeof(Oceananigans.TurbulenceClosures.gpu_compute_ri_number!)})(::NamedTuple{(:κ, :ν, :Ri), Tuple{Field{Center, Center, Face, Nothing, ImmersedBoundaryGrid{Float64, FullyConnected, FullyConnected, Bounded, LatitudeLongitudeGrid{Float64, FullyConnected, FullyConnected, Bounded, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, GridFittedBottom{typeof(OceanScalingTests.double_drake_bathymetry), Oceananigans.ImmersedBoundaries.CenterImmersedCondition}, Nothing, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, Tuple{Colon, Colon, Colon}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}, NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}, 
CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}}}}, Field{Center, Center, Face, Nothing, ImmersedBoundaryGrid{Float64, FullyConnected, FullyConnected, Bounded, LatitudeLongitudeGrid{Float64, FullyConnected, FullyConnected, Bounded, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Float64, Float64, Float64, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, StepRangeLen{Float64, Base.TwicePrecision{Float64}, Base.TwicePrecision{Float64}, Int64}}, OffsetArrays.OffsetVector{Float64, CUDA.CuArray{Float64, 1, CUDA.Mem.DeviceBuffer}}, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, GridFittedBottom{typeof(OceanScalingTests.double_drake_bathymetry), Oceananigans.ImmersedBoundaries.CenterImmersedCondition}, Nothing, Oceananigans.Distributed.DistributedArch{GPU, Int64, Tuple{Int64, Int64, Int64}, Tuple{Int64, Int64, Int64}, Oceananigans.Distributed.RankConnectivity{Int64, Int64, Int64, Int64, Nothing, Nothing, Int64, Int64, Int64, Int64}, MPI.Comm, true, Vector{MPI.Request}, Vector{Int64}}}, Tuple{Colon, Colon, Colon}, OffsetArrays.OffsetArray{Float64, 3, CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}}, Float64, FieldBoundaryConditions{BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, BoundaryCondition{Oceananigans.BoundaryConditions.DistributedCommunication, Oceananigans.Distributed.HaloCommunicationRanks{Int64, Int64}}, Nothing, Nothing, BoundaryCondition{Oceananigans.BoundaryConditions.Flux, Nothing}}, Nothing, Oceananigans.Fields.FieldBoundaryBuffers{NamedTuple{(:send, :recv), Tuple{CUDA.CuArray{Float64, 3, CUDA.Mem.DeviceBuffer}  

Found by @simone-silvestri when running with a large number of nodes and a shared filesystem.

maleadt (Member) commented Apr 17, 2023

That CuFunction look-up constructor should probably do its own error handling (i.e., call unsafe_cuModuleGetFunction and print the requested function name; sadly I don't think we can list the available ones).
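For illustration, a hypothetical helper along those lines (assuming the generated unsafe_cuModuleGetFunction wrapper mentioned above and the standard CUDA_SUCCESS / CUDA_ERROR_NOT_FOUND driver enum values; this is not the actual CUDA.jl constructor):

# Look up a kernel by name and report the missing symbol ourselves.
function lookup_kernel(mod::CuModule, name::String)
    handle_ref = Ref{Ptr{CUfunc_st}}()
    res = unsafe_cuModuleGetFunction(handle_ref, mod, name)
    if res == CUDA_ERROR_NOT_FOUND
        error("kernel '$name' not found in the loaded module; a stale or ",
              "mismatched disk-cache image is a likely cause")
    elseif res != CUDA_SUCCESS
        throw_api_error(res)
    end
    return handle_ref[]
end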

vchuravy (Member, Author) commented Apr 3, 2024

Replaced by #557

@vchuravy vchuravy closed this Apr 3, 2024