-
Notifications
You must be signed in to change notification settings - Fork 37
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Wrong results from sum(skipmissing(raster))
#812
Comments
just to add to this a little bit more, I also noticed that doing (maybe a totally different issue, but I mention it because is also a
|
Think of the sheer number of floating point operations happening in summing your 10_000 * 10_000 array. Theres a lot of room for error to creep in There is a correctness optimisation to reduce the error in that process: Maybe we don't hit that path with the Rasters inbuilt There could be specialised methods of |
But seems we don't actually use kahan summation, but pairwise. And not on a regular iteration: https://discourse.julialang.org/t/kahan-summation-in-sum/102723 |
Yes it only affects really big rasters, an error of more than 5% seemed crazy, but I also haven't been able to reproduce it. |
Could you try reproducing this on the original data using AccurateArithmetic.jl? Or convert the values to Float64 before summing? |
For me it was a fraction of a percent, but still big. But its a real problem, thus all the algorithms to fix it. Try looking in Base for what sum does on |
Base has all of this machinery implemented for mapreduce on With It's quite a lot of machinery to add and maintain in Rasters.jl, though. To clarify: with the example above, it's 0.1% off or so, with the population data it was way more, maybe because a few pixels have very high values. Anyway it seems important that these things are accurate. |
|
Hmm it's hard to do this with lazy iteration without essentially copying the Base implementation. One can always define It also seems that sum on Base skipmissing has about the same performance as Rasters' internal skipmissingval, and generates the correct result as well. But you have to call julia> @be sum($(Base.SkipMissing(replace_missing(ras))))
Benchmark: 3 samples with 1 evaluation
47.097 ms (5 allocs: 128 bytes)
47.157 ms (5 allocs: 128 bytes)
47.242 ms (5 allocs: 128 bytes)
julia> @be sum($(skipmissing(replace_missing(ras))))
Benchmark: 3 samples with 1 evaluation
46.943 ms (5 allocs: 128 bytes)
47.107 ms (5 allocs: 128 bytes)
47.122 ms (5 allocs: 128 bytes) As far as I remember, in the What is the value of Rasters skipmissing (in a post- |
I imagine a lot of people still won't use We cant collect, we just need to see what SkipMissing does in sum and add the method. (We already fixed this in DiskArrays too, doing nested block |
This method falls back to the base implementation. If a raster has
|
I'm happy to duplicate. But we would also need to do chunked DiskArrays handling inside that (will also be much faster as well as more correct on disk data). |
I was getting some mysterious results from summing over a raster of gridded population counts from WorldPop (which can be downloaded here)
This is a big raster, with Float32 data and a Float32 missingval. The output of
sum(skipmissing(x))
is not identical to the sum after aggregating, or the sum of the sum of the slices, or the sum ofcollect(skipmissing(x))
In the worldpop data, the result is off by like 5 percent, but when I try to reproduce I can only get it off by like 0.1%.
MWE:
What is going on here?
The text was updated successfully, but these errors were encountered: