CLI arguments problem in distinguishing strings and data nodes #73

agoscinski · 2024-12-19T13:50:05Z

Hello @leclairm, @GeigerJ2 and me started to look at the cli_arguments again and we have a design question. The problem in supporting a non data node in the cli arguments as in this example

cli_arguments:
  positional:
    - not_a_data_node
    - data_node
...
data:
  available:
    - data_node:
      ...

# --> command not_a_data_node {data_node}

is that we need to check if not_a_data_node is in data or not to know how to interpret it internally. This could cause the problem that a user uses the name of names a data node the same as a string in the cli_arguments. This will result in an error which will be hard to debug for the user, we therefore would like to avoid this.

The suggestion we had during the meeting to put it to the flags leaves the question open how to handle the same problem positional arguments where the order matters

cli_arguments:
  positional:
    - data_node
  flags:
    - not_a_data_node
    
...
data:
  available:
    - data_node:
      ...

# --> command data_node not_a_data_node
# or command not_a_data_node data_node
# Ambiguous!

One solution would be do also specify strings as data nodes

cli_arguments:
  positional:
    - not_a_data_node
    - data_node
...
data:
  available:
    - not_a_data_node:
        value: not_a_data_node
        type: str
    - data_node:
      ...

# --> command {not_a_data_node} {data_node}

This however is a bit cumbersome to write. We therefore iterated to marking the data node somehow. Most naturally the string interpolation notation came from Python (and similar in other languages) came up

cli_arguments:
  positional:
    - not_a_data_node
    - {data_node}
...
data:
  available:
    - data_node:
      ...

# --> command not_a_data_node {data_node}

But if we already need to use such a notation to mark data nodes then we could also simplify this using one string for the cli_arguments

cli_arguments: "-ps1 --option {another_data_node} --ps 2  not_a_data_node {data_node}"
# --> command -ps1 --option {another_data_node} --ps 2 not_a_data_node {data_node}

It seems also the most intuitive for the user as one writes it like in the terminal. It is basically string interpolation, a concept from programming languages the user might have seen already so this notation might be already intuitive. We were only worried that during parsing of this string we could run into behavior that is hard to debug but it seems so far that the parsing is very simple, thus there is not much room for errors to appear. See

# splits whitespaces and removes redundant ones
# e.g. "a  {b}" --> ["a", "{b}"]
strings = [string for string in string.split(" ") if string != '']
for string in strings:
    if string.startswith("{") and string.endswith("}"):
        # check if string is in data nodes, if not raise error

The text was updated successfully, but these errors were encountered:

GeigerJ2 · 2024-12-19T14:14:08Z

Already made a draft PR for the alternative solution: #72
(still needs the resolving of the nodes, though, as mentioned in the last code snippet in this issue)

And to add here something @agoscinski and I just discussed:
One way to make the former approach, using _CliArgsBaseModel work, is to only allow passing keyword arguments, which do not refer to AiiDA entities via flags. That is, things like, --verbosity=high must be passed in the flags section, not under keywords, while entries under keywords must always refer to AiiDA entities (or data/files in that sense).

And, for positional arguments, we don't see any use case, where one would pass a positional argument to a script that is not a file. So passing str/int/float as positional arguments might not even need to be a case that we need to cover.

Lastly, flags and source_files to set environment variables are clear, so with the approach outlined above, we might even be able to even resolve the ambiguity outlined in this issue. Then, it is more about design, if we want the user to write the explicit syntax currently in place, or a single string as proposed in #72.

leclairm · 2024-12-19T19:51:24Z

Thanks @agoscinski @GeigerJ2
I quite like this proposition actually. We could even go one step further and put everything in the command item like

tasks:
  - my_postproc:
      src: postproc/my_script.sh
      command: "./my_script.sh --verbosity 3 --input {data_1}"

agoscinski · 2024-12-20T09:20:31Z

@leclairm Shouldn't it then be?

tasks:
  - my_postproc:
      src: postproc
      command: "./my_script.sh --verbosity 3 --input {data_1}"
# folder structure is postproc/my_script.sh, but potentially could contain more

So we copy the whole folder and specify the script relative to the src folder. Linking does not make much sense to me in this case because the only reason you have a folder is to have files that the script references to. But maybe that is a discussion for another issue.

The `cli_arguments` are specified now as a single string using string interpolation to reference data items. The string is parsed and separated internally. Implements suggestion in issue #73.

leclairm · 2025-01-07T07:59:26Z

@leclairm Shouldn't it then be?
tasks:
  - my_postproc:
      src: postproc
      command: "./my_script.sh --verbosity 3 --input {data_1}"
# folder structure is postproc/my_script.sh, but potentially could contain more
So we copy the whole folder and specify the script relative to the src folder. Linking does not make much sense to me in this case because the only reason you have a folder is to have files that the script references to. But maybe that is a discussion for another issue.

I fully agree

agoscinski mentioned this issue Jan 1, 2025

Replace cli_arguments in the yaml to a single string #75

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CLI arguments problem in distinguishing strings and data nodes #73

CLI arguments problem in distinguishing strings and data nodes #73

agoscinski commented Dec 19, 2024 •

edited by GeigerJ2

Loading

GeigerJ2 commented Dec 19, 2024 •

edited

Loading

leclairm commented Dec 19, 2024

agoscinski commented Dec 20, 2024 •

edited

Loading

leclairm commented Jan 7, 2025

CLI arguments problem in distinguishing strings and data nodes #73

CLI arguments problem in distinguishing strings and data nodes #73

Comments

agoscinski commented Dec 19, 2024 • edited by GeigerJ2 Loading

GeigerJ2 commented Dec 19, 2024 • edited Loading

leclairm commented Dec 19, 2024

agoscinski commented Dec 20, 2024 • edited Loading

leclairm commented Jan 7, 2025

agoscinski commented Dec 19, 2024 •

edited by GeigerJ2

Loading

GeigerJ2 commented Dec 19, 2024 •

edited

Loading

agoscinski commented Dec 20, 2024 •

edited

Loading