Expose initializer primitive #131

Closed · wants to merge 4 commits into from

Conversation

@neNasko1
Contributor

This change exposes the initializer primitive to go along with argument. It provides more complete coverage of the ONNX standard and a way to optimise op.const-s in the graph into initializer-s.

Comparing op.const vs initializer:

# Imports added for completeness; the opset version is only an example, and
# initializer is imported as exposed by this PR (otherwise available via spox._future).
import numpy as np
import onnx
from spox import Tensor, argument, build, initializer
import spox.opset.ai.onnx.v18 as op

# big1.onnx: one Constant node per literal
a = argument(Tensor(np.int64, ('N',)))

c = a
for x in range(10000):
    c = op.mul(c, op.const(1))

model: onnx.ModelProto = build(inputs={'a': a}, outputs={'c': c})
onnx.save(model, "big1.onnx")

# big2.onnx: one initializer per literal
a = argument(Tensor(np.int64, ('N',)))

c = a
for x in range(10000):
    c = op.mul(c, initializer(1))

model: onnx.ModelProto = build(inputs={'a': a}, outputs={'c': c})
onnx.save(model, "big2.onnx")

File sizes:

-rw-r--r--   1 staff  1363443 Jan 20 21:47 big1.onnx
-rw-r--r--   1 staff   934553 Jan 20 21:46 big2.onnx
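To see where the difference comes from, the two saved models can be inspected directly (a minimal sketch, not from the original PR description; it assumes the scripts above were run as-is):

import onnx

m1 = onnx.load("big1.onnx")   # Constant nodes
m2 = onnx.load("big2.onnx")   # initializers

# big1 holds one Constant node per literal (roughly 10000 Mul + 10000 Constant nodes);
# big2 holds the same Mul nodes plus roughly 10000 entries in graph.initializer instead.
print(len(m1.graph.node), len(m1.graph.initializer))
print(len(m2.graph.node), len(m2.graph.initializer))

# The in-memory protobuf sizes should match the file sizes listed above.
print(m1.ByteSize(), m2.ByteSize())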

Checklist

  • Added a CHANGELOG.rst entry

@neNasko1 changed the title from "Initial exposition of initializer" to "Expose initializer primitive" on Jan 20, 2024
@jbachurski
Collaborator

Thanks for the PR!
Maybe to make it more complete, could you consider/add:

  • Why is there a difference between Constant and initializer in the first place? Is it really just because of the number of "Constant" string literals in the protobuf output, since you're adding a really large number of small constants (and is this a practical scenario)? If there are a lot of small constants, perhaps running a common-subexpression-elimination optimiser would be the way to go (since I imagine many constants are -1/0/1)?
  • Perhaps a couple more practical tests, as this feature has not really been used anywhere, despite having been in _future for a while.

Of course, if this has been battle-tested downstream and shown its worth in practice, I trust you that it's fine.

I don't see anything problematic in the approach of just moving what's in _future to public.
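
For illustration, here is a minimal sketch of the kind of common-subexpression-elimination pass mentioned above (not part of this PR or of Spox): it merges Constant nodes with identical attributes in the top-level graph. Subgraphs inside attributes and constants that feed graph outputs directly are not handled.

import onnx

def dedupe_constants(model: onnx.ModelProto) -> onnx.ModelProto:
    """Merge Constant nodes with identical attributes (top-level graph only)."""
    graph = model.graph
    seen: dict[bytes, str] = {}     # serialized attributes -> surviving output name
    renames: dict[str, str] = {}    # dropped output name -> surviving output name
    kept: list[onnx.NodeProto] = []
    for node in graph.node:
        if node.op_type == "Constant":
            key = b"".join(attr.SerializeToString() for attr in node.attribute)
            if key in seen:
                # An identical constant was already kept: remember the rename and drop this node.
                renames[node.output[0]] = seen[key]
                continue
            seen[key] = node.output[0]
        copy = onnx.NodeProto()
        copy.CopyFrom(node)
        kept.append(copy)
    # Point consumers of dropped constants at the surviving output name.
    for node in kept:
        for i, name in enumerate(node.input):
            node.input[i] = renames.get(name, name)
    del graph.node[:]
    graph.node.extend(kept)
    return model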

src/spox/_public.py (review comment resolved)
@@ -4,7 +4,8 @@
 import numpy
 import pytest

-from spox._future import initializer
+from spox import build, initializer
 from spox.opset.ai.onnx import v17 as op

@adityagoel4512
Member
Jan 20, 2024
The ONNX specification says:

A list of named tensor values. When an initializer has the same name as a graph input, it specifies a default value for that input. When an initializer has a name different from all graph inputs, it specifies a constant value. The order of the list is unspecified.

Edit: It looks like the initializer names generated by Spox are unlikely to conflict with an input name (e.g. "Initializer_0_arg"). What should happen if a user goes ahead and has an input named "Initializer_0_arg" as well (for whatever reason)? I think a warning would be sufficient, saying that this initializer is being used as a default value for that graph input and may be overridden.

Scoping seems weird with initializers (unlike constants). We should test these edge cases explicitly.

In models with IR version >= 4, in nested subgraphs used as attribute values, users MUST NOT use the same name as both a subgraph initializer and subgraph input unless the corresponding op's specification explicitly allows it.

and

In models with IR version <= 3, users MAY use the same name as both a subgraph initializer and subgraph input, but this is restricted to support constants via initializers that are not intended to correspond to any actual inputs passed from the node into the subgraph.

The naming generated by Spox should be fine for almost all general cases, but we should aim to raise warnings and errors in case users hit these edge cases.
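
As a rough illustration of the check being proposed here (a hypothetical helper, not part of this PR), one could warn when an initializer name collides with a graph input name:

import warnings

import onnx

def check_initializer_input_collisions(model: onnx.ModelProto) -> None:
    """Warn if an initializer shares a name with a graph input (ONNX default-value semantics)."""
    input_names = {value_info.name for value_info in model.graph.input}
    for init in model.graph.initializer:
        if init.name in input_names:
            warnings.warn(
                f"Initializer {init.name!r} has the same name as a graph input; "
                "it will be treated as a default value for that input and may be overridden."
            )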

@neNasko1
Contributor Author

It seems most suitable at this stage to prevent initializer and argument from sharing identical names.

I've tackled the other issues by incorporating more tests. Please don't hesitate to extend the discussion if there are any other edge cases that we've overlooked.

@cbourjau
Collaborator

Thanks @neNasko1 for putting the PR together and thanks for your comments, @jbachurski and @adityagoel4512 !

The context is that we want to make onnxruntime session creation faster. It is a bit of a depressing endeavor: there is no inherent reason why it needs to be slow, other than protobuf not being a good fit for this use case. We did see some improvements when loading models that onnxruntime optimized. One of those optimizations converts Constant nodes into initializers. But I just ran some more benchmarks and I think we misattributed the speedup to that transformation; it is more likely the common-subexpression elimination suggested above.

Benchmark code
from time import perf_counter

import onnxruntime as ort
import numpy as np
import onnx

from spox import argument, Tensor, build
import spox.opset.ai.onnx.v18 as op
from spox._future import initializer


def make_model(n, const_fun):
    a = argument(Tensor(np.float64, ("N",)))
    cond = argument(Tensor(np.bool_, ()))

    def do_sum():
        return op.sum([op.add(a, const_fun()) for _ in range(n)])

    (b, ) = op.if_(
        cond,
        then_branch=lambda: [do_sum()],
        else_branch=lambda: [do_sum()],
    )

    out = op.add(b, do_sum())

    return build({"a": a, "cond": cond}, {"out": out})


def bench(msg, const_fun, N=1000):
    m = make_model(N, const_fun)
    model_file = "bench_model.onnx"
    opt_model_file = "bench_model_optimized.onnx"
    onnx.save(m, model_file)

    # Create optimized version
    opts = ort.SessionOptions()
    opts.optimized_model_filepath = opt_model_file
    opts.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_BASIC
    ort.InferenceSession(model_file, opts)

    t0 = perf_counter()
    sess = ort.InferenceSession(model_file)
    t1 = perf_counter()
    print(f"{msg}; {N=}: {t1 - t0}s")

    t0 = perf_counter()
    sess = ort.InferenceSession(opt_model_file)
    t1 = perf_counter()
    print(f"{msg}; ORT optimized; {N=}: {t1 - t0}s")

print(f"{ort.__version__=}")

# No reuse
bench("Constant nodes; no global", lambda: op.const(1.0))
bench("Initializers; no global", lambda: initializer(1.0))

# Reusing globals
g_const = op.const(1.0)
bench("Constant node; global", lambda: g_const)
g_init = initializer(1.0)
bench("Initializer; global", lambda: g_init)


print()
print(f"Unrelated bug: {onnx.load('bench_model_optimized.onnx').opset_import=}")

Main output:

ort.__version__='1.16.3'
Constant nodes; no global; N=1000: 0.3524372920510359s
Constant nodes; no global; ORT optimized; N=1000: 0.3217784999869764s
Initializers; no global; N=1000: 0.33844133402453735s
Initializers; no global; ORT optimized; N=1000: 0.3136827909620479s
Constant node; global; N=1000: 0.051705374964512885s
Constant node; global; ORT optimized; N=1000: 0.018141582957468927s
Initializer; global; N=1000: 0.05279454204719514s
Initializer; global; ORT optimized; N=1000: 0.01927170902490616s

As you can see, using initializers may be slightly faster, but reusing constants is a much more significant gain. Interestingly, the optimizer does not seem smart enough to eliminate identical constant nodes. However, when using "global constants" it does remove redundant subexpressions as expected. As an aside, the optimizer also adds a bunch of unrelated opset imports to the model:

Unrelated bug in the onnxruntime optimizer:

Unrelated bug: onnx.load('bench_model_optimized.onnx').opset_import=[domain: ""
version: 16
, domain: "com.microsoft.experimental"
version: 1
, domain: "ai.onnx.ml"
version: 3
, domain: "ai.onnx.training"
version: 1
, domain: "com.microsoft"
version: 1
, domain: "com.ms.internal.nhwc"
version: 19
, domain: "ai.onnx.preview.training"
version: 1
, domain: "com.microsoft.nchwc"
version: 1
, domain: "org.pytorch.aten"
version: 1
]

@adityagoel4512
Member

adityagoel4512 commented Jan 21, 2024

That makes sense.

We did see some improvements when loading models that onnxruntime optimized.

I'm curious what sort of model this was then (given your benchmark indicates that the optimized model wasn't significantly faster than the unoptimized one).

Interestingly, the optimizer does not seem smart enough to eliminate identical constant nodes.

Is the If node in the benchmark intentional? The optimizer can indeed eliminate these outside of subgraphs. Changing make_model to:

def make_model(n, const_fun):
    a = argument(Tensor(np.float64, ("N",)))
    cond = argument(Tensor(np.bool_, ()))

    def do_sum():
        return op.sum([op.add(a, const_fun()) for _ in range(n)])
    b = do_sum()
    out = op.add(b, do_sum())
    return build({"a": a, "cond": cond}, {"out": out})

the output is:

ort.__version__='1.16.3'
Constant nodes; no global; N=1000: 0.07471766602247953s
Constant nodes; no global; ORT optimized; N=1000: 0.012379958992823958s
Initializers; no global; N=1000: 0.0753192079719156s
Initializers; no global; ORT optimized; N=1000: 0.011552749900147319s
Constant node; global; N=1000: 0.04821212496608496s
Constant node; global; ORT optimized; N=1000: 0.01119362493045628s
Initializer; global; N=1000: 0.05039608292281628s
Initializer; global; ORT optimized; N=1000: 0.011134125059470534s

N=5 graph screenshots (images not reproduced here):

  • optimized, no global, Constant nodes ("bench_model_optimized")
  • unoptimized, no global, Constant nodes ("bench_model")

@neNasko1
Contributor Author

  • Why is there a difference between Constant and initializer in the first place? Is it really just because of the number of "Constant" string literals in the protobuf output, since you're adding a really large number of small constants (and is this a practical scenario)? If there are a lot of small constants, perhaps running a common-subexpression-elimination optimiser would be the way to go (since I imagine many constants are -1/0/1)?

The data above indicates that each conversion from const to initialiser results in a reduction of approximately 40 bytes in the model size. This reduction doesn't seem to be solely due to string lengths: looking at the serialized graphs, it is evident that an initializer is cheaper to encode than a Constant node.

initializer {
  data_type: 7
  int64_data: 1
  name: "Initializer_0_arg"
}

and

node {
  output: "Constant_0_output"
  name: "Constant_0"
  op_type: "Constant"
  attribute {
    name: "value"
    t {
      data_type: 7
      int64_data: 1
      name: ""
    }
    type: TENSOR
  }
  domain: ""
}
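
The per-constant overhead can be verified directly on the two encodings (a sketch using onnx.helper; the names mirror the dumps above):

from onnx import helper, TensorProto

# Scalar int64 initializer, as in the first dump.
init = helper.make_tensor("Initializer_0_arg", TensorProto.INT64, dims=[], vals=[1])

# Constant node wrapping the same scalar tensor in a "value" attribute, as in the second dump.
node = helper.make_node(
    "Constant",
    inputs=[],
    outputs=["Constant_0_output"],
    name="Constant_0",
    value=helper.make_tensor("", TensorProto.INT64, dims=[], vals=[1]),
)

# The NodeProto adds the attribute wrapper, output, name, and op_type on top of the
# tensor payload, which is roughly where the ~40 bytes per constant come from.
print(init.ByteSize(), node.ByteSize())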

Also running an external optimiser is not ideal as it introduces another dependency in a deployment pipeline.

@cbourjau
Collaborator

The data above indicates that each conversion from const to initialiser results in a reduction of approximately 40 bytes in the model size.

I doubt that the byte size is the issue. We are not bound by IO but by whatever computation protobuf is then doing with those bytes. More bytes does not necessarily indicate more work. That said, in this case, it is true that a NodeProto is more complicated than a TensorProto by itself. However, the benchmarks above show that deserializing the latter is not necessarily significantly faster.

Is the If node in the benchmark intentional? The optimizer can indeed eliminate these outside of subgraphs.

Yes, the If node was intentional since that would be closer to our production models which make heavy use of subgraphs. But it is very interesting that the onnxruntime optimizer gets tripped up by the subgraphs!

@jbachurski
Collaborator

Also running an external optimiser is not ideal as it introduces another dependency in a deployment pipeline.

Hard disagree with this. It seems to be an obvious step to use optimiser tools for deployment - Spox was never developed to perform such a role.

@neNasko1
Contributor Author

neNasko1 commented Jan 22, 2024

Hard disagree with this. It seems to be an obvious step to use optimiser tools for deployment - Spox was never developed to perform such a role.

I agree that it is not Spox's job to optimise the produced graph; I just think it's crucial that we avoid over-reliance on the optimizer. Instead, we should let users decide by exposing as much of the standard as possible.

I doubt that the byte size is the issue. We are not bound by IO but by whatever computation protobuf is then doing with those bytes. More bytes does not necessarily indicate more work. That said, in this case, it is true that a NodeProto is more complicated than a TensorProto by itself. However, the benchmarks above show that deserializing the latter is not necessarily significantly faster.

I was just commenting on the reasons that the model got smaller.

@cbourjau
Collaborator

Instead, we should let users decide by exposing as much of the standard as possible.

The benchmark does show that no matter which choice the user takes, it makes no notable difference. Without the parsing benefits, I don't see any use case that Initializer would enable that is not more obviously served with other parts of the standard. It touches on a point to which we don't have an explicit answer: Should Spox expose the entire standard including its historically accumulated mistakes and foot guns, or should Spox strive to expose an obvious and orthogonal set of features that covers every ONNX use case? The latter would be in the Pythonic spirit of there being one and only one obvious way to do something.

@neNasko1
Contributor Author

The benchmark does show that no matter which choice the user takes, it makes no notable difference. Without the parsing benefits, I don't see any use case that Initializer would enable that is not more obviously served with other parts of the standard. It touches on a point to which we don't have an explicit answer: Should Spox expose the entire standard including its historically accumulated mistakes and foot guns, or should Spox strive to expose an obvious and orthogonal set of features that covers every ONNX use case? The latter would be in the Pythonic spirit of there being one and only one obvious way to do something.

Our current focus seems heavily tilted towards the existing infrastructure. While this change might seem unnecessary if we treat onnxruntime as the definitive consumer, that isn't always the case. Users should be able to save a model using initialisers instead of constants, which is particularly beneficial when conserving memory is a priority, since the resulting models are smaller in memory.

Additionally, we should entertain the possibility of users wanting to build a Spox-based optimiser, in which case exposing initialiser-s is something Spox should definitely support.

@neNasko1
Contributor Author

Closing as this feature seems unneeded at this point in time.

@neNasko1 neNasko1 closed this Jan 25, 2024
@cbourjau cbourjau deleted the expose-initializer branch January 25, 2024 08:53
@cbourjau
Collaborator

For added context: We had some more discussions around this in a meeting. At this point, there does not appear to be a use case for initializers which would not be (better) covered using existing features. The size argument is true in theory, but in practice, a model's size is vastly dominated by large constant tensors of some trained state.
