Bulk update extras of Nodes #6587

Open

GeigerJ2 opened this issue Oct 22, 2024 · 1 comment

@GeigerJ2 (Contributor)

As mentioned by @giovannipizzi, it would be nice if bulk-updating Node extras could be achieved using, e.g., a dictionary of the form:

{
    <UUID>: {
        'extra_key1': <VALUE>,
        'extra_key2': <VALUE>,
    },
    <UUID>: ...
}
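
For illustration, a call accepting such a mapping could look as follows; `bulk_update_extras` is a hypothetical helper name for the proposed feature, not an existing API:

# Hypothetical helper, illustrating the proposed interface only.
bulk_update_extras({
    'da44c605-70ee-4a24-9844-045c47fbebd9': {'extra_key1': 1, 'extra_key2': 'foo'},
    '56e619b0-f894-4417-bb85-aa430f4bcacb': {'extra_key1': 2, 'extra_key2': 'bar'},
})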

Currently, the various storage backends already implement bulk_insert and bulk_update methods, which are used when importing an archive. With the existing bulk_update method, the following works to replace the extras:

# `storage_backend` is the loaded profile's storage backend, obtainable
# e.g. via `get_manager().get_profile_storage()` from `aiida.manage`.
from aiida.orm.entities import EntityTypes

storage_backend.bulk_update(
    EntityTypes.NODE,
    [
        {
            # "uuid": "da44c605-70ee-4a24-9844-045c47fbebd9",
            "id": 1,
            "extras": {"a": 1, "b": 2, "c": 3},
        },
        {
            # "uuid": "56e619b0-f894-4417-bb85-aa430f4bcacb",
            "id": 2,
            "extras": {"a": 4, "b": 5, "c": 6},
        },
    ],
)

However, in the current implementation, node selection only works via the id (probably for efficiency reasons), and existing extras, e.g. _aiida_hash, are overwritten, so it does not behave like an extend operation, which is usually what one wants. One could either add a dedicated method for updating the node extras or extend the current one. Care has to be taken to keep it efficient, though, and not iterate over individual nodes in the implementation, which could slow things down.
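As a stop-gap with the current overwrite semantics, one can emulate extending by merging on the client side before calling bulk_update. A minimal sketch, assuming the nodes' current extras have already been fetched (the variable names are illustrative):

# current_extras: pk -> extras currently stored for each node (fetched
# beforehand, e.g. with the QueryBuilder); updates: pk -> extras to add.
current_extras = {1: {'_aiida_hash': 'abc'}, 2: {'_aiida_hash': 'def'}}
updates = {1: {'a': 1}, 2: {'a': 4}}

rows = [
    {'id': pk, 'extras': {**current_extras[pk], **new}}  # shallow merge, new keys win
    for pk, new in updates.items()
]
storage_backend.bulk_update(EntityTypes.NODE, rows)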

As all other Node properties should be immutable once stored, I currently cannot think of bulk modifications other than changing the extras that would be of interest for nodes.

@rabbull (Contributor) commented Dec 5, 2024

As discussed with @GeigerJ2, there are two problems regarding bulk updates, and the following are our plans to address them:

  1. Bulk updates currently only support selecting nodes by their primary key.
    This limitation might not be a significant issue, as users typically already have the ids at hand. Given that a round trip to the database is generally more expensive than looping over an in-memory array to gather ids, the current implementation may suffice. While supporting selection by UUID would be beneficial, it is acceptable to deprioritize or drop it if it proves too time-consuming to implement; a sketch of resolving UUIDs to ids up front is shown after this list. I will explore this further after addressing the second problem.

  2. Extending vs. overwriting JSON fields.
    Users are more likely to expect the existing JSON object to be extended (particularly the extras field, since the other JSON fields are mostly immutable in storage) rather than overwritten entirely. To maintain compatibility with the current behaviour and still allow key deletion in JSON fields, the plan is to introduce a flag in the bulk_update method indicating whether JSON fields should be extended or overwritten.

    • For PostgreSQL, this can be implemented using the || operator or jsonb_concat.
    • For SQLite, json_patch will be used.

    Note, however, that there are slight disparities between json_patch and jsonb_concat. For instance, with json_patch, assigning a null value to a key removes it, whereas jsonb_concat retains the key with a null value. These differences need to be carefully accounted for in the implementation; the second sketch after this list makes them concrete.
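For the first point, a sketch of resolving UUIDs to primary keys with a single QueryBuilder query before calling bulk_update; `extras_by_uuid` is illustrative, and `storage_backend`/`EntityTypes` are as in the snippet above:

from aiida.orm import Node, QueryBuilder

# The user-facing mapping proposed above: uuid -> new extras.
extras_by_uuid = {
    'da44c605-70ee-4a24-9844-045c47fbebd9': {'a': 1},
    '56e619b0-f894-4417-bb85-aa430f4bcacb': {'a': 4},
}

# Resolve all UUIDs to primary keys in one query, then build the rows.
qb = QueryBuilder().append(
    Node, filters={'uuid': {'in': list(extras_by_uuid)}}, project=['uuid', 'id']
)
uuid_to_pk = dict(qb.all())

rows = [{'id': uuid_to_pk[uuid], 'extras': extras} for uuid, extras in extras_by_uuid.items()]
storage_backend.bulk_update(EntityTypes.NODE, rows)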
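For the second point, a pure-Python sketch of the two merge semantics (a reference model of the database operators, not the actual backend implementation) to make the null-handling disparity concrete:

def jsonb_concat(target: dict, patch: dict) -> dict:
    # PostgreSQL `target || patch`: shallow, top-level merge; keys from
    # patch win, and a null (None) value is kept as-is.
    return {**target, **patch}

def json_patch(target: dict, patch: dict) -> dict:
    # SQLite json_patch (RFC 7396 merge patch): recursive merge into
    # nested objects; a null (None) value deletes the key instead.
    result = dict(target)
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        elif isinstance(value, dict) and isinstance(result.get(key), dict):
            result[key] = json_patch(result[key], value)
        else:
            result[key] = value
    return result

extras = {'a': 1, '_aiida_hash': 'abc'}
print(jsonb_concat(extras, {'a': None}))  # {'a': None, '_aiida_hash': 'abc'}
print(json_patch(extras, {'a': None}))    # {'_aiida_hash': 'abc'}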
