[DISCUSS] Should we allow options to be declared at the top level of a function definition #673

jacques-n · 2024-07-28T23:21:25Z

The rounding blocks in this file:
https://github.com/substrait-io/substrait/blob/main/extensions/functions_rounding.yaml

Make my eyes hurt and destroy my DRY soul. How about we add support for options to be declared at the same level as function name so these patterns aren't this awful...

Thoughts @EpsilonPrime @vbarua @westonpace ?

The text was updated successfully, but these errors were encountered:

jacques-n · 2024-07-28T23:22:30Z

The argument descriptions feel similarly painful...

westonpace · 2024-07-29T13:00:53Z

There are some arithmetic functions where the options do not apply to all implementations (fp addition does not specify overflow behavior for example).

I don't know if there are any implementations where the arguments differ based on the implementation.

What if argument / option descriptions were at the function level but each implementation still listed which options / arguments apply?

    name: "round"
    description: >
      Rounding the value `x` to `s` decimal places.
    args:
          - name: "x"
            description: >
              Numerical expression to be rounded.
          - name: "s"
            description: >
              Number of decimal places to be rounded to.

              When `s` is a positive number, nothing will happen
              since `x` is an integer value.

              When `s` is a negative number, the rounding is
              performed to the nearest multiple of `10^(-s)`.    
    options:
      rounding:
        description: >
          When a boundary is computed to lie somewhere between two values,
          and this value cannot be exactly represented, this specifies how
          to round it.

            - TIE_TO_EVEN: round to nearest value; if exactly halfway, tie
              to the even option.
            - TIE_AWAY_FROM_ZERO: round to nearest value; if exactly
              halfway, tie away from zero.
            - TRUNCATE: always round toward zero.
            - CEILING: always round toward positive infinity.
            - FLOOR: always round toward negative infinity.
            - AWAY_FROM_ZERO: round negative values with FLOOR rule, round positive values with CEILING rule
            - TIE_DOWN: round ties with FLOOR rule
            - TIE_UP: round ties with CEILING rule
            - TIE_TOWARDS_ZERO: round ties with TRUNCATE rule
            - TIE_TO_ODD: round to nearest value; if exactly halfway, tie
              to the odd option.
    impls:
      - args:
          - value: i8
            name: "x"
          - value: i32
            name: "s"
        options:
          rounding:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: i8?
      - args:
          - value: i16
            name: "x"
          - value: i32
            name: "s"
        options:
          rounding:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: i16?
      - args:
          - value: i32
            name: "x"
          - value: i32
            name: "s"
        options:
          rounding:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: i32?
      - args:
          - value: i64
            name: "x"
          - value: i32
            name: "s"
        options:
          rounding:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: i64?
      - args:
          - value: fp32
            name: "x"
          - value: i32
            name: "s"
        options:
          rounding:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: fp32?
      - args:
          - value: fp64
            name: "x"
          - value: i32
            name: "s"
        options:
          rounding:
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: fp64?

We could also leave the option's "values" at the function level (so each function just has an array of option names)

The only programmatic tool I know of that would be impacted is the BFT and I think it already reports options / arguments at the function level so there would be some benefit to align the two models.

richtia · 2024-07-29T19:53:29Z

I'm a fan of argument/option descriptions at the function level. We have a fun of other functions where all this information is put into the descriptions that's already at the function level, and it seems like it could be organized better.

Some of the string functions for example: https://github.com/substrait-io/substrait/blob/main/extensions/functions_string.yaml#L61

Blizzara · 2024-07-30T07:57:58Z

For the sake of discussion, an alternative would be to allow multiple types in the args' value field, or generic types. Though that might require allowing to specify output type in terms of an input type. I.e. something like:

name: "round"
    description: >
      Rounding the value `x` to `s` decimal places.
    impls:
      - args:
          - value: integer # alternative 1: specify a "logical" type
            name: "x"
            description: >
              Numerical expression to be rounded.
          - value: i32
            name: "s"
            description: >
              Number of decimal places to be rounded to.

              When `s` is a positive number, nothing will happen
              since `x` is an integer value.

              When `s` is a negative number, the rounding is
              performed to the nearest multiple of `10^(-s)`.
        options:
          rounding:
            description: >
              When a boundary is computed to lie somewhere between two values,
              and this value cannot be exactly represented, this specifies how
              to round it.

                - TIE_TO_EVEN: round to nearest value; if exactly halfway, tie
                  to the even option.
                - TIE_AWAY_FROM_ZERO: round to nearest value; if exactly
                  halfway, tie away from zero.
                - TRUNCATE: always round toward zero.
                - CEILING: always round toward positive infinity.
                - FLOOR: always round toward negative infinity.
                - AWAY_FROM_ZERO: round negative values with FLOOR rule, round positive values with CEILING rule
                - TIE_DOWN: round ties with FLOOR rule
                - TIE_UP: round ties with CEILING rule
                - TIE_TOWARDS_ZERO: round ties with TRUNCATE rule
                - TIE_TO_ODD: round to nearest value; if exactly halfway, tie
                  to the odd option.
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: arg_typeof("x")?
      - args:
          - value: # alternative 2: specify a list of types
             - fp32
             - fp64 
            name: "x"
            description: >
              Numerical expression to be rounded.
          - value: i32
            name: "s"
            description: >
              Number of decimal places to be rounded to.

              When `s` is a positive number, the rounding
              is performed to a `s` number of decimal places.

              When `s` is a negative number, the rounding is
              performed to the left side of the decimal point
              as specified by `s`.
        options:
          rounding:
            description: >
              When a boundary is computed to lie somewhere between two values,
              and this value cannot be exactly represented, this specifies how
              to round it.

                - TIE_TO_EVEN: round to nearest value; if exactly halfway, tie
                  to the even option.
                - TIE_AWAY_FROM_ZERO: round to nearest value; if exactly
                  halfway, tie away from zero.
                - TRUNCATE: always round toward zero.
                - CEILING: always round toward positive infinity.
                - FLOOR: always round toward negative infinity.
                - AWAY_FROM_ZERO: round negative values with FLOOR rule, round positive values with CEILING rule
                - TIE_DOWN: round ties with FLOOR rule
                - TIE_UP: round ties with CEILING rule
                - TIE_TOWARDS_ZERO: round ties with TRUNCATE rule
                - TIE_TO_ODD: round to nearest value; if exactly halfway, tie
                  to the odd option.
            values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
              AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
        nullability: DECLARED_OUTPUT
        return: arg_typeof("x")?

If multiple args specify lists of types, then all combinations of those would be considered valid. This could be useful especially if the function takes two or more arguments and they don't necessarily need to be of the same type (though maybe then the output type gets complicated). The "logical" types could be e.g. integers (i8 .. i64), numerics (i8 .. i64, fp32, fp64, decimal?), strings (string, varchar, fixedchar).

tokoko · 2024-10-05T09:34:34Z

I'd rather treat this as a yaml problem and solve with anchors/aliases. for example this works fine for me in python w/o any changes to the code. The new declarations field will probably have to be added to json schema for the files not to fail validation, though.

declarations:
  options: &rounding_options
    rounding:
      description: >
        When a boundary is computed to lie somewhere between two values,
        and this value cannot be exactly represented, this specifies how
        to round it.

          - TIE_TO_EVEN: round to nearest value; if exactly halfway, tie
            to the even option.
          - TIE_AWAY_FROM_ZERO: round to nearest value; if exactly
            halfway, tie away from zero.
          - TRUNCATE: always round toward zero.
          - CEILING: always round toward positive infinity.
          - FLOOR: always round toward negative infinity.
          - AWAY_FROM_ZERO: round negative values with FLOOR rule, round positive values with CEILING rule
          - TIE_DOWN: round ties with FLOOR rule
          - TIE_UP: round ties with CEILING rule
          - TIE_TOWARDS_ZERO: round ties with TRUNCATE rule
          - TIE_TO_ODD: round to nearest value; if exactly halfway, tie
            to the odd option.
      values: [ TIE_TO_EVEN, TIE_AWAY_FROM_ZERO, TRUNCATE, CEILING, FLOOR,
        AWAY_FROM_ZERO, TIE_DOWN, TIE_UP, TIE_TOWARDS_ZERO, TIE_TO_ODD ]
  round_s_description: &round_s_description >
    Number of decimal places to be rounded to.

    When `s` is a positive number, nothing will happen
    since `x` is an integer value.

    When `s` is a negative number, the rounding is
    performed to the nearest multiple of `10^(-s)`.

scalar_functions:
  -
    name: "round"
    description: >
      Rounding the value `x` to `s` decimal places.
    impls:
      - args:
          - value: i8
            name: "x"
            description: >
              Numerical expression to be rounded.
          - value: i32
            name: "s"
            description: *round_s_description
        options:
          <<: *rounding_options
        nullability: DECLARED_OUTPUT
        return: i8?
      - args:
          - value: i16
            name: "x"
            description: >
              Numerical expression to be rounded.
          - value: i32
            name: "s"
            description: *round_s_description
        options:
          <<: *rounding_options
        nullability: DECLARED_OUTPUT
        return: i16?
      - args:
          - value: i32
            name: "x"
            description: >
              Numerical expression to be rounded.
          - value: i32
            name: "s"
            description: *round_s_description
        options:
          <<: *rounding_options
        nullability: DECLARED_OUTPUT
        return: i32?
      - args:
          - value: i64
            name: "x"
            description: >
              Numerical expression to be rounded.
          - value: i32
            name: "s"
            description: *round_s_description
        options:
          <<: *rounding_options
        nullability: DECLARED_OUTPUT
        return: i64?
      - args:
          - value: fp32
            name: "x"
            description: >
              Numerical expression to be rounded.
          - value: i32
            name: "s"
            description: *round_s_description
        options:
          <<: *rounding_options
        nullability: DECLARED_OUTPUT
        return: fp32?
      - args:
          - value: fp64
            name: "x"
            description: >
              Numerical expression to be rounded.
          - value: i32
            name: "s"
            description: *round_s_description
        options:
          <<: *rounding_options
        nullability: DECLARED_OUTPUT
        return: fp64?

jacques-n · 2024-10-08T15:53:17Z

I'd rather treat this as a yaml problem and solve with anchors/aliases

We evaluated this when we originally wrote the format. We found that anchors/aliases are not reliably implemented in many yaml parsers. I wonder if the status is any better today...

jacques-n changed the title ~~[DISCUSS] Should be allow options to be declared at the top level of a function definition~~ [DISCUSS] Should we allow options to be declared at the top level of a function definition Aug 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[DISCUSS] Should we allow options to be declared at the top level of a function definition #673

[DISCUSS] Should we allow options to be declared at the top level of a function definition #673

jacques-n commented Jul 28, 2024

jacques-n commented Jul 28, 2024

westonpace commented Jul 29, 2024

richtia commented Jul 29, 2024

Blizzara commented Jul 30, 2024 •

edited

Loading

tokoko commented Oct 5, 2024

jacques-n commented Oct 8, 2024

[DISCUSS] Should we allow options to be declared at the top level of a function definition #673

[DISCUSS] Should we allow options to be declared at the top level of a function definition #673

Comments

jacques-n commented Jul 28, 2024

jacques-n commented Jul 28, 2024

westonpace commented Jul 29, 2024

richtia commented Jul 29, 2024

Blizzara commented Jul 30, 2024 • edited Loading

tokoko commented Oct 5, 2024

jacques-n commented Oct 8, 2024

Blizzara commented Jul 30, 2024 •

edited

Loading