Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

CLI arguments problem in distinguishing strings and data nodes #73

Open
agoscinski opened this issue Dec 19, 2024 · 4 comments
Open

CLI arguments problem in distinguishing strings and data nodes #73

agoscinski opened this issue Dec 19, 2024 · 4 comments

Comments

@agoscinski
Copy link
Collaborator

agoscinski commented Dec 19, 2024

Hello @leclairm, @GeigerJ2 and me started to look at the cli_arguments again and we have a design question. The problem in supporting a non data node in the cli arguments as in this example

cli_arguments:
  positional:
    - not_a_data_node
    - data_node
...
data:
  available:
    - data_node:
      ...

# --> command not_a_data_node {data_node}

is that we need to check if not_a_data_node is in data or not to know how to interpret it internally. This could cause the problem that a user uses the name of names a data node the same as a string in the cli_arguments. This will result in an error which will be hard to debug for the user, we therefore would like to avoid this.

The suggestion we had during the meeting to put it to the flags leaves the question open how to handle the same problem positional arguments where the order matters

cli_arguments:
  positional:
    - data_node
  flags:
    - not_a_data_node
    
...
data:
  available:
    - data_node:
      ...

# --> command data_node not_a_data_node
# or command not_a_data_node data_node
# Ambiguous!

One solution would be do also specify strings as data nodes

cli_arguments:
  positional:
    - not_a_data_node
    - data_node
...
data:
  available:
    - not_a_data_node:
        value: not_a_data_node
        type: str
    - data_node:
      ...

# --> command {not_a_data_node} {data_node}

This however is a bit cumbersome to write. We therefore iterated to marking the data node somehow. Most naturally the string interpolation notation came from Python (and similar in other languages) came up

cli_arguments:
  positional:
    - not_a_data_node
    - {data_node}
...
data:
  available:
    - data_node:
      ...

# --> command not_a_data_node {data_node}

But if we already need to use such a notation to mark data nodes then we could also simplify this using one string for the cli_arguments

cli_arguments: "-ps1 --option {another_data_node} --ps 2  not_a_data_node {data_node}"
# --> command -ps1 --option {another_data_node} --ps 2 not_a_data_node {data_node}

It seems also the most intuitive for the user as one writes it like in the terminal. It is basically string interpolation, a concept from programming languages the user might have seen already so this notation might be already intuitive. We were only worried that during parsing of this string we could run into behavior that is hard to debug but it seems so far that the parsing is very simple, thus there is not much room for errors to appear. See

# splits whitespaces and removes redundant ones
# e.g. "a  {b}" --> ["a", "{b}"]
strings = [string for string in string.split(" ") if string != '']
for string in strings:
    if string.startswith("{") and string.endswith("}"):
        # check if string is in data nodes, if not raise error
@GeigerJ2
Copy link
Collaborator

GeigerJ2 commented Dec 19, 2024

Already made a draft PR for the alternative solution: #72
(still needs the resolving of the nodes, though, as mentioned in the last code snippet in this issue)

And to add here something @agoscinski and I just discussed:
One way to make the former approach, using _CliArgsBaseModel work, is to only allow passing keyword arguments, which do not refer to AiiDA entities via flags. That is, things like, --verbosity=high must be passed in the flags section, not under keywords, while entries under keywords must always refer to AiiDA entities (or data/files in that sense).

And, for positional arguments, we don't see any use case, where one would pass a positional argument to a script that is not a file. So passing str/int/float as positional arguments might not even need to be a case that we need to cover.

Lastly, flags and source_files to set environment variables are clear, so with the approach outlined above, we might even be able to even resolve the ambiguity outlined in this issue. Then, it is more about design, if we want the user to write the explicit syntax currently in place, or a single string as proposed in #72.

@leclairm
Copy link
Contributor

Thanks @agoscinski @GeigerJ2
I quite like this proposition actually. We could even go one step further and put everything in the command item like

tasks:
  - my_postproc:
      src: postproc/my_script.sh
      command: "./my_script.sh --verbosity 3 --input {data_1}"

@agoscinski
Copy link
Collaborator Author

agoscinski commented Dec 20, 2024

@leclairm Shouldn't it then be?

tasks:
  - my_postproc:
      src: postproc
      command: "./my_script.sh --verbosity 3 --input {data_1}"
# folder structure is postproc/my_script.sh, but potentially could contain more

So we copy the whole folder and specify the script relative to the src folder. Linking does not make much sense to me in this case because the only reason you have a folder is to have files that the script references to. But maybe that is a discussion for another issue.

agoscinski added a commit that referenced this issue Jan 1, 2025
The `cli_arguments` are specified now as a single string using string
interpolation to reference data items. The string is parsed and separated
internally. Implements suggestion in issue #73.
agoscinski added a commit that referenced this issue Jan 1, 2025
The `cli_arguments` are specified now as a single string using string
interpolation to reference data items. The string is parsed and separated
internally. Implements suggestion in issue #73.
@leclairm
Copy link
Contributor

leclairm commented Jan 7, 2025

@leclairm Shouldn't it then be?

tasks:
  - my_postproc:
      src: postproc
      command: "./my_script.sh --verbosity 3 --input {data_1}"
# folder structure is postproc/my_script.sh, but potentially could contain more

So we copy the whole folder and specify the script relative to the src folder. Linking does not make much sense to me in this case because the only reason you have a folder is to have files that the script references to. But maybe that is a discussion for another issue.

I fully agree

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants