Expose initializer primitive #131

Closed · wants to merge 4 commits into from

Conversation

@neNasko1
Contributor

This change exposes the initializer primitive to go along with argument. It provides more complete coverage of the ONNX standard and a way to optimise op.const-s in the graph into initializer-s.

Comparing op.const vs initializer:

# Imports added for completeness; the opset version is only an example, and
# initializer is imported as exposed by this PR (otherwise available via spox._future).
import numpy as np
import onnx
from spox import Tensor, argument, build, initializer
import spox.opset.ai.onnx.v18 as op

# big1.onnx: one Constant node per literal
a = argument(Tensor(np.int64, ('N',)))

c = a
for x in range(10000):
    c = op.mul(c, op.const(1))

model: onnx.ModelProto = build(inputs={'a': a}, outputs={'c': c})
onnx.save(model, "big1.onnx")

# big2.onnx: one initializer per literal
a = argument(Tensor(np.int64, ('N',)))

c = a
for x in range(10000):
    c = op.mul(c, initializer(1))

model: onnx.ModelProto = build(inputs={'a': a}, outputs={'c': c})
onnx.save(model, "big2.onnx")

File sizes:

-rw-r--r--   1 staff  1363443 Jan 20 21:47 big1.onnx
-rw-r--r--   1 staff   934553 Jan 20 21:46 big2.onnx
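To see where the difference comes from, the two saved models can be inspected directly (a minimal sketch, not from the original PR description; it assumes the scripts above were run as-is):

import onnx

m1 = onnx.load("big1.onnx")   # Constant nodes
m2 = onnx.load("big2.onnx")   # initializers

# big1 holds one Constant node per literal (roughly 10000 Mul + 10000 Constant nodes);
# big2 holds the same Mul nodes plus roughly 10000 entries in graph.initializer instead.
print(len(m1.graph.node), len(m1.graph.initializer))
print(len(m2.graph.node), len(m2.graph.initializer))

# The in-memory protobuf sizes should match the file sizes listed above.
print(m1.ByteSize(), m2.ByteSize())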

Checklist

  • Added a CHANGELOG.rst entry

@neNasko1 changed the title from "Initial exposition of initializer" to "Expose initializer primitive" on Jan 20, 2024
@jbachurski
Collaborator

Thanks for the PR!
Maybe to make it more complete, could you consider/add:

  • Why is there a difference between Constant and initializer in the first place? Is it really just because of the number of "Constant" string literals in the protobuf output, since you're adding a really large number of small constants (and is this a practical scenario)? If there are a lot of small constants, perhaps running a common-subexpression-elimination optimiser would be the way to go (since I imagine many constants are -1/0/1)?
  • Perhaps a couple more practical tests, as this feature has not really been used anywhere, despite having been in _future for a while.

Of course, if this has been battle-tested downstream and shown its worth in practice, I trust you that it's fine.

I don't see anything problematic in the approach of just moving what's in _future to public.
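
For illustration, here is a minimal sketch of the kind of common-subexpression-elimination pass mentioned above (not part of this PR or of Spox): it merges Constant nodes with identical attributes in the top-level graph. Subgraphs inside attributes and constants that feed graph outputs directly are not handled.

import onnx

def dedupe_constants(model: onnx.ModelProto) -> onnx.ModelProto:
    """Merge Constant nodes with identical attributes (top-level graph only)."""
    graph = model.graph
    seen: dict[bytes, str] = {}     # serialized attributes -> surviving output name
    renames: dict[str, str] = {}    # dropped output name -> surviving output name
    kept: list[onnx.NodeProto] = []
    for node in graph.node:
        if node.op_type == "Constant":
            key = b"".join(attr.SerializeToString() for attr in node.attribute)
            if key in seen:
                # An identical constant was already kept: remember the rename and drop this node.
                renames[node.output[0]] = seen[key]
                continue
            seen[key] = node.output[0]
        copy = onnx.NodeProto()
        copy.CopyFrom(node)
        kept.append(copy)
    # Point consumers of dropped constants at the surviving output name.
    for node in kept:
        for i, name in enumerate(node.input):
            node.input[i] = renames.get(name, name)
    del graph.node[:]
    graph.node.extend(kept)
    return model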

src/spox/_public.py (review comment resolved)
@@ -4,7 +4,8 @@
 import numpy
 import pytest

-from spox._future import initializer
+from spox import build, initializer
 from spox.opset.ai.onnx import v17 as op

@adityagoel4512
Member
Jan 20, 2024
The ONNX specification says:

A list of named tensor values. When an initializer has the same name as a graph input, it specifies a default value for that input. When an initializer has a name different from all graph inputs, it specifies a constant value. The order of the list is unspecified.

Edit: It looks like the initializer names generated by Spox are unlikely to conflict with an input name (e.g. "Initializer_0_arg"). What should happen if a user goes ahead and has an input named "Initializer_0_arg" as well (for whatever reason)? I think a warning would be sufficient, saying that this initializer is being used as a default value for that graph input and may be overridden.

Scoping seems weird with initializers (unlike constants). We should test these edge cases explicitly.

In models with IR version >= 4, in nested subgraphs used as attribute values, users MUST NOT use the same name as both a subgraph initializer and subgraph input unless the corresponding op's specification explicitly allows it.

and

In models with IR version <= 3, users MAY use the same name as both a subgraph initializer and subgraph input, but this is restricted to support constants via initializers that are not intended to correspond to any actual inputs passed from the node into the subgraph.

The naming generated by Spox should be fine for almost all general cases, but we should aim to raise warnings and errors in case users hit these edge cases.
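
As a rough illustration of the check being proposed here (a hypothetical helper, not part of this PR), one could warn when an initializer name collides with a graph input name:

import warnings

import onnx

def check_initializer_input_collisions(model: onnx.ModelProto) -> None:
    """Warn if an initializer shares a name with a graph input (ONNX default-value semantics)."""
    input_names = {value_info.name for value_info in model.graph.input}
    for init in model.graph.initializer:
        if init.name in input_names:
            warnings.warn(
                f"Initializer {init.name!r} has the same name as a graph input; "
                "it will be treated as a default value for that input and may be overridden."
            )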

@neNasko1
Contributor Author

It seems most suitable at this stage to prevent initializer and argument from sharing identical names.

I've tackled the other issues by incorporating more tests. Please don't hesitate to extend the discussion if there are any other edge cases that we've overlooked.

@cbourjau
Collaborator

Thanks @neNasko1 for putting the PR together and thanks for your comments, @jbachurski and @adityagoel4512 !

The context is that we want to make onnxruntime session creation faster. It is a bit of a depressing endeavor: there is no inherent reason why it needs to be slow, other than protobuf not being a good fit for this use case. We did see some improvements when loading models that onnxruntime optimized. One of those optimizations converts Constant nodes into initializers. But I just ran some more benchmarks and I think we misattributed the speedup to that transformation; it is more likely the common-subexpression elimination suggested above.

Benchmark code
from time import perf_counter

import onnxruntime as ort
import numpy as np
import onnx

from spox import argument, Tensor, build
import spox.opset.ai.onnx.v18 as op
from spox._future import initializer


def make_model(n, const_fun):
    a = argument(Tensor(np.float64, ("N",)))
    cond = argument(Tensor(np.bool_, ()))

    def do_sum():
        return op.sum([op.add(a, const_fun()) for _ in range(n)])

    (b, ) = op.if_(
        cond,
        then_branch=lambda: [do_sum()],
        else_branch=lambda: [do_sum()],
    )

    out = op.add(b, do_sum())

    return build({"a": a, "cond": cond}, {"out": out})


def bench(msg, const_fun, N=1000):
    m = make_model(N, const_fun)
    model_file = "bench_model.onnx"
    opt_model_file = "bench_model_optimized.onnx"
    onnx.save(m, model_file)

    # Create optimized version
    opts = ort.SessionOptions()
    opts.optimized_model_filepath = opt_model_file
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
    ort.InferenceSession(model_file, opts)

    t0 = perf_counter()
    sess = ort.InferenceSession(model_file)
    t1 = perf_counter()
    print(f"{msg}; {N=}: {t1 - t0}s")

    t0 = perf_counter()
    sess = ort.InferenceSession(opt_model_file)
    t1 = perf_counter()
    print(f"{msg}; ORT optimized; {N=}: {t1 - t0}s")

print(f"{ort.__version__=}")

# No reuse
bench("Constant nodes; no global", lambda: op.const(1.0))
bench("Initializers; no global", lambda: initializer(1.0))

# Reusing globals
g_const = op.const(1.0)
bench("Constant node; global", lambda: g_const)
g_init = initializer(1.0)
bench("Initializer; global", lambda: g_init)


print()
print(f"Unrelated bug: {onnx.load('bench_model_optimized.onnx').opset_import=}")

Main output:

ort.__version__='1.16.3'
Constant nodes; no global; N=1000: 0.3524372920510359s
Constant nodes; no global; ORT optimized; N=1000: 0.3217784999869764s
Initializers; no global; N=1000: 0.33844133402453735s
Initializers; no global; ORT optimized; N=1000: 0.3136827909620479s
Constant node; global; N=1000: 0.051705374964512885s
Constant node; global; ORT optimized; N=1000: 0.018141582957468927s
Initializer; global; N=1000: 0.05279454204719514s
Initializer; global; ORT optimized; N=1000: 0.01927170902490616s

As you can see, using initializers may be slightly faster, but reusing constants is a much more significant gain. Interestingly, the optimizer does not seem smart enough to eliminate identical constant nodes. However, when using "global constants" it does remove redundant subexpressions as expected. As an aside, the optimizer also adds a bunch of unrelated opset imports to the model:

Unrelated bug in the onnxruntime optimizer:

Unrelated bug: onnx.load('bench_model_optimized.onnx').opset_import=[domain: ""
version: 16
, domain: "com.microsoft.experimental"
version: 1
, domain: "ai.onnx.ml"
version: 3
, domain: "ai.onnx.training"
version: 1
, domain: "com.microsoft"
version: 1
, domain: "com.ms.internal.nhwc"
version: 19
, domain: "ai.onnx.preview.training"
version: 1
, domain: "com.microsoft.nchwc"
version: 1
, domain: "org.pytorch.aten"
version: 1
]

@adityagoel4512
Member

adityagoel4512 commented Jan 21, 2024

That makes sense.

We did see some improvements when loading models that onnxruntime optimized.

I'm curious what sort of model this was then (given your benchmark indicates that the optimized model wasn't significantly faster than the unoptimized one).

Interestingly, the optimizer does not seem smart enough to eliminate identical constant nodes.

Is the If node in the benchmark intentional? The optimizer can indeed eliminate these outside of subgraphs. Changing make_model to:

def make_model(n, const_fun):
    a = argument(Tensor(np.float64, ("N",)))
    cond = argument(Tensor(np.bool_, ()))

    def do_sum():
        return op.sum([op.add(a, const_fun()) for _ in range(n)])
    b = do_sum()
    out = op.add(b, do_sum())
    return build({"a": a, "cond": cond}, {"out": out})

the output is:

ort.__version__='1.16.3'
Constant nodes; no global; N=1000: 0.07471766602247953s
Constant nodes; no global; ORT optimized; N=1000: 0.012379958992823958s
Initializers; no global; N=1000: 0.0753192079719156s
Initializers; no global; ORT optimized; N=1000: 0.011552749900147319s
Constant node; global; N=1000: 0.04821212496608496s
Constant node; global; ORT optimized; N=1000: 0.01119362493045628s
Initializer; global; N=1000: 0.05039608292281628s
Initializer; global; ORT optimized; N=1000: 0.011134125059470534s

N=5 graph screenshots (images not reproduced here):

  • optimized, no global, Constant nodes ("bench_model_optimized")
  • unoptimized, no global, Constant nodes ("bench_model")

@neNasko1
Contributor Author

  • Why is there a difference between Constant and initializer in the first place? Is it really just because of the number of "Constant" string literals in the protobuf output, since you're adding a really large number of small constants (and is this a practical scenario)? If there are a lot of small constants, perhaps running a common-subexpression-elimination optimiser would be the way to go (since I imagine many constants are -1/0/1)?

The data above indicates that each conversion from const to initialiser results in a reduction of approximately 40 bytes in the model size. This reduction doesn't seem to be solely due to string lengths: looking at the serialized graphs, it is evident that an initializer is cheaper to encode than a Constant node.

initializer {
  data_type: 7
  int64_data: 1
  name: "Initializer_0_arg"
}

and

node {
  output: "Constant_0_output"
  name: "Constant_0"
  op_type: "Constant"
  attribute {
    name: "value"
    t {
      data_type: 7
      int64_data: 1
      name: ""
    }
    type: TENSOR
  }
  domain: ""
}
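
The per-constant overhead can be verified directly on the two encodings (a sketch using onnx.helper; the names mirror the dumps above):

from onnx import helper, TensorProto

# Scalar int64 initializer, as in the first dump.
init = helper.make_tensor("Initializer_0_arg", TensorProto.INT64, dims=[], vals=[1])

# Constant node wrapping the same scalar tensor in a "value" attribute, as in the second dump.
node = helper.make_node(
    "Constant",
    inputs=[],
    outputs=["Constant_0_output"],
    name="Constant_0",
    value=helper.make_tensor("", TensorProto.INT64, dims=[], vals=[1]),
)

# The NodeProto adds the attribute wrapper, output, name, and op_type on top of the
# tensor payload, which is roughly where the ~40 bytes per constant come from.
print(init.ByteSize(), node.ByteSize())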

Also running an external optimiser is not ideal as it introduces another dependency in a deployment pipeline.

@cbourjau
Collaborator

The data above indicates that each conversion from const to initialiser results in a reduction of approximately 40 bytes in the model size.

I doubt that the byte size is the issue. We are not bound by IO but by whatever computation protobuf is then doing with those bytes. More bytes does not necessarily indicate more work. That said, in this case, it is true that a NodeProto is more complicated than a TensorProto by itself. However, the benchmarks above show that deserializing the latter is not necessarily significantly faster.

Is the If node in the benchmark intentional? The optimizer can indeed eliminate these outside of subgraphs.

Yes, the If node was intentional since that would be closer to our production models which make heavy use of subgraphs. But it is very interesting that the onnxruntime optimizer gets tripped up by the subgraphs!

@jbachurski
Collaborator

Also running an external optimiser is not ideal as it introduces another dependency in a deployment pipeline.

Hard disagree with this. It seems to be an obvious step to use optimiser tools for deployment - Spox was never developed to perform such a role.

@neNasko1
Contributor Author

neNasko1 commented Jan 22, 2024

Hard disagree with this. It seems to be an obvious step to use optimiser tools for deployment - Spox was never developed to perform such a role.

I agree that it is not Spox's job to optimise the produced graph; I just think it's crucial that we avoid over-reliance on the optimizer. Instead, we should let users decide by exposing as much of the standard as possible.

I doubt that the byte size is the issue. We are not bound by IO but by whatever computation protobuf is then doing with those bytes. More bytes does not necessarily indicate more work. That said, in this case, it is true that a NodeProto is more complicated than a TensorProto by itself. However, the benchmarks above show that deserializing the latter is not necessarily significantly faster.

I was just commenting on the reasons that the model got smaller.

@cbourjau
Collaborator

Instead, we should let users decide by exposing as much of the standard as possible.

The benchmark does show that no matter which choice the user takes, it makes no notable difference. Without the parsing benefits, I don't see any use case that Initializer would enable that is not more obviously served with other parts of the standard. It touches on a point to which we don't have an explicit answer: Should Spox expose the entire standard including its historically accumulated mistakes and foot guns, or should Spox strive to expose an obvious and orthogonal set of features that covers every ONNX use case? The latter would be in the Pythonic spirit of there being one and only one obvious way to do something.

@neNasko1
Contributor Author

The benchmark does show that no matter which choice the user takes, it makes no notable difference. Without the parsing benefits, I don't see any use case that Initializer would enable that is not more obviously served with other parts of the standard. It touches on a point to which we don't have an explicit answer: Should Spox expose the entire standard including its historically accumulated mistakes and foot guns, or should Spox strive to expose an obvious and orthogonal set of features that covers every ONNX use case? The latter would be in the Pythonic spirit of there being one and only one obvious way to do something.

Our current focus seems heavily tilted towards the existing infrastructure. While this change might seem unnecessary if we treat onnxruntime as the definitive consumer, that isn't always the case. Users should be able to save a model using initialisers instead of constants, which is particularly beneficial when conserving memory is a priority, since the resulting models are smaller in memory.

Additionally, we should entertain the possibility of users wanting to build a Spox-based optimiser, in which case exposing initialiser-s is something Spox should definitely support.

@neNasko1
Contributor Author

Closing as this feature seems unneeded at this point in time.

@neNasko1 neNasko1 closed this Jan 25, 2024
@cbourjau cbourjau deleted the expose-initializer branch January 25, 2024 08:53
@cbourjau
Collaborator

For added context: We had some more discussions around this in a meeting. At this point, there does not appear to be a use case for initializers which would not be (better) covered using existing features. The size argument is true in theory, but in practice, a model's size is vastly dominated by large constant tensors of some trained state.
