ClickHouse: pause KeeperMap insertion momentarily during backup #252

Merged: 3 commits into main from aris-pause-keepermap-tables on Nov 4, 2024

Conversation

aris-aiven (Contributor)

The PR alters replicated users' privileges (through GRANT & REVOKE) during the backup. This ensures the KeeperMap tables' snapshot is consistent with the frozen table data.

[DDB-1237]

aris-aiven force-pushed the aris-pause-keepermap-tables branch 2 times, most recently from c91dac7 to f48b634 on October 29, 2024 16:09
aris-aiven marked this pull request as ready for review on October 29, 2024 16:12
aris-aiven requested a review from a team on October 29, 2024 16:12
aris-aiven force-pushed the aris-pause-keepermap-tables branch from f48b634 to 2ca258a on October 29, 2024 16:14
@@ -137,6 +139,7 @@ def get_backup_steps(self, *, context: OperationContext) -> Sequence[Step[Any]]:
FreezeTablesStep(
clients=clickhouse_clients, freeze_name=self.freeze_name, freeze_unfreeze_timeout=self.freeze_timeout
),
KeeperMapTablesReadOnlyStep(clients=clickhouse_clients, allow_writes=True),
joelynch (Contributor) commented on Oct 30, 2024

I feel like we need stronger guarantees that this will run if there is a failure in RetrieveKeeperMapTableDataStep or FreezeTablesStep. Maybe each Step could optionally push a cleanup function onto a stack, and in the case of a failure all the functions are popped from the stack and run.

Or you could have a step that wraps other steps?
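
For example, a wrapping step could look roughly like this (a sketch only; the run_step/handle_step_failure signatures are assumptions based on this thread, not astacus APIs):

from typing import Any, Sequence


class UnwindOnFailureStep:
    """Hypothetical wrapper: runs the wrapped steps in order and, if any of them
    fails, runs their failure handlers in reverse order before re-raising."""

    def __init__(self, wrapped: Sequence[Any]) -> None:
        self.wrapped = wrapped

    async def run_step(self, cluster: Any, context: Any) -> None:
        started: list[Any] = []
        try:
            for step in self.wrapped:
                started.append(step)
                await step.run_step(cluster, context)
        except Exception:
            # Unwind newest-first, so cleanup mirrors setup order.
            for step in reversed(started):
                await step.handle_step_failure(cluster, context)
            raise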

aris-aiven (Contributor, Author)

The only exception KeeperMapTablesReadOnlyStep should emit is ClickHouseClientQueryError, so handle_step_failure should effectively be noexcept.
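
For illustration, a failure handler that never lets that exception escape could look roughly like this (a sketch only, not necessarily the PR's exact code; the import path and the allow_writes helper are assumptions):

import logging

from astacus.coordinator.plugins.clickhouse.client import ClickHouseClientQueryError  # import path assumed

logger = logging.getLogger(__name__)


class KeeperMapTablesReadOnlyStep:
    async def handle_step_failure(self, cluster, context) -> None:
        try:
            # Best effort: re-issue the GRANTs so the KeeperMap tables accept writes again.
            await self.allow_writes(cluster, context)  # hypothetical helper
        except ClickHouseClientQueryError:
            # Swallow the only exception this step is expected to raise,
            # so the failure path itself cannot fail.
            logger.warning("Could not restore write access to KeeperMap tables")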

joelynch self-requested a review on October 30, 2024 12:06
await user_client.execute(
b"INSERT INTO `keeperdata`.`keepertable` SETTINGS wait_for_async_insert=1 SELECT *, materialize(2) FROM numbers(3)"
)
with pytest.raises(ClickHouseClientQueryError, match=".*ACCESS_DENIED.*"):
Khatskevich (Contributor) commented on Oct 30, 2024

So we are basically doing this for the ClickHouse Kafka connector. I have not read its code, but I suspect it introduces new problems during normal operation.
What does exactly-once mean: every part is processed, and no part is processed twice.
Before this commit, if there was no backup restore and no errors during the operation, it was exactly-once.
Now, during a backup, some part will not be able to record that its work is done.
How exactly will it behave? In theory it could:

  1. fail immediately, and someone else repeats the whole processing, or
  2. be restarted, and the processing is done again.

If so, it is no better than it was, because previously this was not happening during normal operation.

aris-aiven (Contributor, Author)

The sink connector pauses writes to ClickHouse for these exceptions: https://github.com/ClickHouse/clickhouse-kafka-connect/blob/main/src/main/java/com/clickhouse/kafka/connect/util/Utils.java#L68

If the exception is retriable, the connector won't commit the offsets to the KeeperMap table and won't try to write that part to ClickHouse until later. The design doc describes the state machine transitions: https://github.com/ClickHouse/clickhouse-kafka-connect/blob/main/docs/DESIGN.md

aris-aiven (Contributor, Author) commented on Oct 30, 2024

Critically, the raised ACCESS_DENIED corresponds to exception code 497, which isn't included in the retriable exceptions list, so we need to change the approach a bit.

statements = [
privilege_altering_fun(table, escape_sql_identifier(user))
for table in keeper_map_table_names
for user in replicated_users_names
Contributor

  1. This looks a bit dangerous, because num_users * num_tables may blow up in cases we do not control.
  2. It might not only take long, but also grow the replication queue very fast.
  3. That replication queue not only has to be stored, but also replayed by new replicas, repeating all those steps (as I understand it).

Contributor

If the other problems are solved, maybe it makes sense to add a special setting for making KeeperMap read-only?

aris-aiven (Contributor, Author)

I don't think that's a problem in practice; we usually have few users and few or no KeeperMap tables.

for table in keeper_map_table_names
for user in replicated_users_names
]
await asyncio.gather(*(self.clients[0].execute(statement) for statement in statements))
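
For reference, privilege_altering_fun presumably boils down to building one GRANT or REVOKE INSERT statement per (table, user) pair; a rough sketch (helper names and exact SQL are illustrative, not necessarily what the step executes):

def revoke_insert(escaped_table: str, escaped_user: str) -> bytes:
    # Blocks writes to one KeeperMap table for one replicated user for the duration of the backup.
    return f"REVOKE INSERT ON {escaped_table} FROM {escaped_user}".encode()


def grant_insert(escaped_table: str, escaped_user: str) -> bytes:
    # Restores writes once the frozen table data and the KeeperMap snapshot have been taken.
    return f"GRANT INSERT ON {escaped_table} TO {escaped_user}".encode()

With allow_writes=False the step would presumably pick the REVOKE variant as privilege_altering_fun, and with allow_writes=True the GRANT variant that restores normal operation.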
Contributor

The effect is not immediate. Ideally, we should wait until this statement propagates to all the replicas.

aris-aiven (Contributor, Author)

Right, that makes sense, will update the logic to wait for the effect. Thanks!

aris-aiven (Contributor, Author) commented on Oct 31, 2024

I've played around with waiting for the correct grants to appear in the system.grants table, or by using SHOW GRANTS FOR ALL. Unfortunately that doesn't work for inherited grants, since they're computed on the fly and the table is "compressed".
The other solution would be to have a sentinel value that we try to insert into the table, but that's not possible since we don't know the table structure: we would have to go read the types to create a valid INSERT statement.

aris-aiven (Contributor, Author)

Anyhow, there's no good way to wait for a GRANT statement to propagate. Let me know if the updated solution doesn't look too bad (I don't think we can aim for much better).

aris-aiven force-pushed the aris-pause-keepermap-tables branch 4 times, most recently from 76d7ad4 to ee5469f on October 31, 2024 12:28
aris-aiven force-pushed the aris-pause-keepermap-tables branch from ee5469f to 1f613c2 on October 31, 2024 15:44
aris-aiven requested review from a team and Khatskevich on October 31, 2024 15:47
aris-aiven force-pushed the aris-pause-keepermap-tables branch from 1f613c2 to e92f68c on October 31, 2024 23:11
joelynch (Contributor) left a review comment

Some suggestions that can be covered in a follow-up.

logger.info("Step %s failed: %s", step, str(e))
except (StepFailedError, WaitResultError) as exc:
logger.info("Step %s failed: %s", step, str(exc))
await step.handle_step_failure(cluster, context)
Contributor

Oh, I was thinking more of using a stack, like:

failure_functions = []
for i, step in enumerate(self.steps, 1):
    failure_functions.append(step.handle_step_failure)
    ...
    try:
        ...
    except (StepFailedError, WaitResultError):
        # Unwind every handler registered so far, newest first.
        for func in reversed(failure_functions):
            await func(cluster, context)

But this is probably good enough.

Contributor

maybe the except needs to be more broad?

aris-aiven (Contributor, Author) commented on Nov 4, 2024

Reading the code, if I understand correctly there are two scenarios:

  1. If the failure is Transient then the step will be retried, so there's nothing to put on the stack yet.
  2. If the failure is permanent (astacus.common.exceptions.PermanentException subclass) then there won't be a next step. The exception clause in question sets the step to failed, and higher up the caller stack the whole Op is aborted (in this case the BackupOp).

Which case do you think could result in multiple steps needing to be unwound?

> maybe the except needs to be more broad?

Looking at the usages of other PermanentException subclasses, they're either raised during the restore steps or by the Rohmu storage subclasses (which would be caught when the server is being started/configured, I think). Might be worth broadening StepFailedError to PermanentException.

joelynch merged commit 44520ef into main on Nov 4, 2024
2 checks passed
joelynch deleted the aris-pause-keepermap-tables branch on November 4, 2024 15:21