Enable custom aggregate functions (take 2) #529

llimllib · 2022-09-05T01:31:20Z

This PR builds on #407, updating it to current HEAD, adding more tests, and documenting it.

This PR closes #204, and is composed mostly of the work that @AnyhowStep did so long ago in that thread.

#407 notes that aggregate window functions do not work, but I think that's a different issue that should be tackled in another PR.

The full list of changes follows:

Update README.md to document create_aggregate
extract parseFunctionArguments into a function that can be used by both create_aggregate and create_function
- Please verify that nothing substantial has changed here - I'm worried I missed some small divergence between Add: aggregate functions. #407 and current HEAD and this function should only be an extraction, not changing any behavior
extract setFunctionResults into a function
- same caveats apply
export the sqlite3_aggregate_context function
Add the create_aggregate function
- The signature: create_aggregate(function_name, initial_value, { step: (state, ...value) =>, finalize: (state) =>})
- create_aggregate calls sqlite3_aggregate_context to allocate 1 byte of memory per instance of the aggregate function, which it uses as a key for its state array. sqlite3_aggregate_context documentation
Add tests

llimllib · 2022-09-05T01:43:45Z

cc @lovasoa

lovasoa

This looks good, thank you !

A few things:

Can we avoid calling init in create_aggregate itself ? I think this is a very surprising behavior for the programmer. init should be called every time the aggregate function is called.
Can we make the step function return the new state, like the standard reduce ? This would avoid the awkward function init() { return {sum: 0} }
Taking three functions as arguments makes the code using create_aggregate a little bit hard to read. Can we take an object instead ?

db.create_aggregate("json_array", {
    init: () => [],
    step: (state, value) => [...state, value],
    finalize: state => JSON.stringify(state)
})

Can we make init and finalize optional ? init should default to () => null and finalize to state => state. In the end we should be able to call

db.create_aggregate("js_sum", { step: (a,b) => a+b })

Can we add memory leak tests like the ones we have in test/test_functions_recreate.js

llimllib · 2022-09-05T22:36:28Z

Can we avoid calling init in create_aggregate itself ? I think this is a very surprising behavior for the programmer. init should be called every time the aggregate function is called.

Yes, thank you! Bad mental model striking there.

I like all your proposed changes, will do.

llimllib · 2022-09-06T01:32:44Z

A question: when an instance of create_aggregate is returned, do I have to worry about it being thread-safe?

That is, if I say: x = create_aggregate_func("x", ...), can I rely on that function only being called by one caller from init to finalize?

I'm assuming I can? If not, storing the state is tricky, because I need to account for this scenario:

x = create_aggregate_func("x", ...)
On thread Foo, SELECT x(...) from y runs, and sets a state var defined in the create_aggregate scope to the value of init (like the code I'm starting from does)
Before this finishes, thread Bar runs SELECT x(...) from z
it sets state to its own init value - but unless we're getting a copy of the create_aggregate function (are we?), this is going to overwrite the state of the previous function
Foo finishes, and returns an invalid value because its state got overwritten

I would try to answer this for myself but the web platform is a really tricky runtime so I'm asking in case you know off hand?

llimllib · 2022-09-06T01:33:59Z

The way that proper sqlite3 extensions do it is by hanging their state off the execution context like so.

Currently, sqlite3_aggregate_context isn't exposed and using it seems like it would be tricky, but doable

llimllib · 2022-09-06T02:16:02Z

Can we make the step function return the new state, like the standard reduce?

I don't think so? We're not the ones calling the step function, that's sqlite doing it

Can we make init and finalize optional

We can make init optional (non-existant actually) if we can figure out how to store state on the sqlite context between steps - otherwise callers will need to have a place to accumulate state between steps

We can't make finalize optional without getting pretty aggressive in overriding sqlite, because otherwise the aggregate function has no return value - finalize is where we call setFunctionResult with the reduced value of the accumulated state.

(Maybe I'm not being clever enough and there is a way for both!)

Basically it seems that the sqlite extension pattern of 'allocate a struct and stick it in the context pointer' is not going to work for us here. I wonder if using the id of the pointer returned by sqlite3_aggregate_context would be enough? Since no two functions could use the same pointer, per https://www.sqlite.org/c3ref/aggregate_context.html ?

llimllib · 2022-09-06T04:04:35Z

I managed to make init optional at least, and I think the solution I gave here to use the value of the pointer returned by sqlite3_aggregate_context is clever at least?

This branch is WIP status, and I apologize for the noise on it.

I will look into getting it down to a single function argument tomorrow.

lovasoa · 2022-09-06T08:05:33Z

About defaults for init and finalize: Can't we just have create_aggregate_func start like that :

function create_aggregate_func<T>(
   name: string,
   methods: {
       init?: () => T,
       step: (state: T, value: any) => T,
       finalize?: (state: T) => any 
   }
) {
    let init = methods.init || (() => null);
    let step = (state: T, value: any) => { 
        state = methods.step(state, value);
        __update_sqlite_internal_state(state);
    };
    let finalize = methods.finalize || (state => state);
    ...
}

About thread-safety: there is no such thing as multi-threading with concurrent access in javascript. The javascript runtime guarantees that there is a single execution thread that can access mutable javascript objects. That said, even within a single thread, there can be multiple aggregate functions that are at various stages of there execution simultaneously. The following has to work:

SELECT js_agg_fn(x),  js_agg_fn(y) FROM t;

A test case like this one should probably be added to the tests too.

llimllib · 2022-09-06T13:29:41Z

Having to return values from step makes it more awkward in the very common "accumulate values" case. Here's json_agg with returning values (and a default state of null):

db.create_aggregate(
    "json_agg", {
        step: function(state, val) { state = (state || []); state.push(val); return state; },
        finalize: function(state) { return JSON.stringify(state); }
    }
);

And with modifying state (and with state defaulting to [] instead of null):

db.create_aggregate(
    "json_agg", {
        step: function(state, val) { state.push(val); },
        finalize: function(state) { return JSON.stringify(state); }
    }
);

I think we might even want to make step default to pushing the values into an array? That way many functions could be written as just a finalizer - you get the list of values and pull the median, or serialize them, or calculate the standard deviation.

sum could be implemented this way, but it would be very memory inefficient, so it's important to give functions the ability at least to avoid creating an array of all values.

(I also wonder if there's efficency trickery we could do to accumulate values in a typed array?)

edit: standard deviation is another example of an aggregation that can be completed without accumulating all values

lovasoa · 2022-09-06T14:01:04Z

Having to return values from step makes it more awkward in the very common "accumulate values" case. Here's json_agg with returning values (and a default state of null):
db.create_aggregate(
    "json_agg", {
        step: function(state, val) { state = (state || []); state.push(val); return state; },
        finalize: function(state) { return JSON.stringify(state); }
    }
);

No, that's ugly, you wouldn't write it like that ! 🙂 This is a very common pattern in javascript, in many state-management libraries. You would write it as (state, val) => [...state, val]

And with modifying state (and with state defaulting to [] instead of null):
db.create_aggregate(
    "json_agg", {
        step: function(state, val) { state.push(val); },
        finalize: function(state) { return JSON.stringify(state); }
    }
);
I think we might even want to make step default to pushing the values into an array? That way many functions could be written as just a finalizer - you get the list of values and pull the median, or serialize them, or calculate the standard deviation.

sum could be implemented this way, but it would be very memory inefficient, so it's important to give functions the ability at least to avoid creating an array of all values.

(I also wonder if there's efficency trickery we could do to accumulate values in a typed array?)

edit: standard deviation is another example of an aggregation that can be completed without accumulating all values

Let's just stick to what developers are used to. Reduce (sometimes called fold) is a very common pattern, and users will appreciate it if you don't try to reinvent the wheel here. undefined as the default initial value and no default reducer.

lovasoa · 2022-09-06T14:12:52Z

Can we also add a test with

an init function that throws
a step function that throws
a finalize function that throws
many redefinitions of the same function in a loop

llimllib · 2022-09-07T12:27:09Z

init is a function in C because you have to allocate memory, but in almost every other reduce interface, it's provided as a value, so I changed the signature of the function to be create_aggregate(name, initial_value, aggregateFunctions) and eliminated the init function from the interface entirely.

llimllib · 2022-09-07T12:28:44Z

You can see from the changes in ac548d4 how much that simplifies the step functions

llimllib · 2022-09-07T13:21:32Z

OK, I think this is ready for re-review now. Thanks for working through this with me!

src/api.js

test/test_aggregate_functions.js

lovasoa

It looks good to me ! Can you just fix the documentation, I'll read it one last time, and we'll merge :)

README.md

Co-authored-by: Ophir LOJKINE <[email protected]>

README.md

dogquery and others added 6 commits September 3, 2022 19:25

initial commit.

24a6a12

documentation

fad9ba6

remove no-longer-valid type

d191caa

close over state initialization for performance

0d937a7

link documentation in comment

8fd3f8a

more testing

ba733ba

llimllib marked this pull request as ready for review September 5, 2022 01:43

lovasoa requested changes Sep 5, 2022

View reviewed changes

run tests if they're main

9e6b462

llimllib added 3 commits September 5, 2022 22:17

accept a single arg

573afa7

a middle road sqlite3_agg_context solution

9daf01f

llimllib added 4 commits September 6, 2022 09:42

try out auto-updating state

ec5c72b

improve quantile test, add multiple agg test

a927950

add a null to the test

e643bd9

acorn fails to parse ||=, whatever

2cbdb0e

make eslint happy

b9ccd48

make initial_value an argument

ac548d4

llimllib added 4 commits September 7, 2022 08:55

add memory leak test

55858e9

update docs to current interface

9a0c185

delete state in exception handlers

2445107

remove null state

5b62cf6

lovasoa requested changes Sep 7, 2022

View reviewed changes

src/api.js Outdated Show resolved Hide resolved

src/api.js Outdated Show resolved Hide resolved

lovasoa requested changes Sep 7, 2022

View reviewed changes

test/test_aggregate_functions.js Outdated Show resolved Hide resolved

llimllib added 5 commits September 7, 2022 10:51

return init function and document object

062f147

more tests and update back to init function

7aff1ae

update redefinition test for new interface

67f85e5

update README to match fixed signature

b8692d4

more consistent test formatting

b41e5cf

lovasoa requested changes Sep 7, 2022

View reviewed changes