ClickHouse: pause KeeperMap insertion momentarily during backup #252
Conversation
@@ -137,6 +139,7 @@ def get_backup_steps(self, *, context: OperationContext) -> Sequence[Step[Any]]:
    FreezeTablesStep(
        clients=clickhouse_clients, freeze_name=self.freeze_name, freeze_unfreeze_timeout=self.freeze_timeout
    ),
    KeeperMapTablesReadOnlyStep(clients=clickhouse_clients, allow_writes=True),
I feel like we need stronger guarantees that this will run if there is a failure in RetrieveKeeperMapTableDataStep or FreezeTablesStep. Maybe each Step could optionally push a function onto a stack, and on failure all the functions are popped from the stack and run.
Or you could have a step that wraps other steps?
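For illustration, here is a minimal sketch of the "step that wraps other steps" idea; the Step interface shown (plain async callables plus a cleanup callable) is a simplification for the example, not the actual astacus API:

from collections.abc import Awaitable, Callable, Sequence


class CleanupOnFailureStep:
    """Runs the wrapped steps and guarantees a cleanup action runs if any of them fails."""

    def __init__(
        self,
        steps: Sequence[Callable[[], Awaitable[None]]],
        cleanup: Callable[[], Awaitable[None]],
    ) -> None:
        self.steps = steps
        self.cleanup = cleanup

    async def run(self) -> None:
        try:
            for step in self.steps:
                await step()
        except Exception:
            # e.g. re-enable KeeperMap writes before propagating the failure
            await self.cleanup()
            raise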
The only exception the KeeperMapTablesReadOnlyStep should be emitting is ClickHouseClientQueryError, so the handle_step_failure should be noexcept.
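As a rough sketch (not the actual astacus implementation), a "noexcept" failure handler can catch the one expected exception type and log it; ClickHouseClientQueryError is stubbed below to keep the example self-contained:

import logging

logger = logging.getLogger(__name__)


class ClickHouseClientQueryError(Exception):
    """Stub standing in for the real client exception class."""


async def reenable_keeper_map_writes() -> None:
    ...  # would issue the GRANT statements through the ClickHouse client


async def handle_step_failure() -> None:
    # Swallow the only exception the step is expected to raise, so the
    # failure handler itself never propagates an error.
    try:
        await reenable_keeper_map_writes()
    except ClickHouseClientQueryError:
        logger.warning("Could not restore write access to KeeperMap tables")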
await user_client.execute(
    b"INSERT INTO `keeperdata`.`keepertable` SETTINGS wait_for_async_insert=1 SELECT *, materialize(2) FROM numbers(3)"
)
with pytest.raises(ClickHouseClientQueryError, match=".*ACCESS_DENIED.*"):
So we are basically doing this for the "ClickHouse connector for Kafka". I have not read its code, but I suspect this introduces new problems during normal operation.
What exactly-once means: every part is processed, and no part is processed twice.
Before this commit, if there were no backup/restore operations and no errors, the behaviour was exactly-once.
Now, during a backup, some part will not be able to record that its work is done.
How exactly will it behave? In theory it could:
- fail instantly, and someone else repeats the whole processing
- be restarted, and the processing is done again
If so, it is no better than before, because previously this was not happening during normal operation.
The sink connector pauses writes to ClickHouse for these exceptions: https://github.com/ClickHouse/clickhouse-kafka-connect/blob/main/src/main/java/com/clickhouse/kafka/connect/util/Utils.java#L68
If the exception is retriable, the connector won't commit the offsets to the KeeperMap table, and won't try to write that part to ClickHouse until later. The design doc describes the state machine transitions: https://github.com/ClickHouse/clickhouse-kafka-connect/blob/main/docs/DESIGN.md
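To make that behaviour concrete, here is an illustrative sketch (in Python, although the connector itself is Java) of the commit/retry decision described above; the retriable-code set is a placeholder, as the real list lives in Utils.java:

from collections.abc import Callable

RETRIABLE_ERROR_CODES: set[int] = set()  # placeholder; see Utils.java for the real list


def on_insert_result(
    error_code: int | None,
    commit_offsets: Callable[[], None],
    schedule_retry: Callable[[], None],
) -> None:
    if error_code is None:
        commit_offsets()  # success: progress is recorded in the KeeperMap table
    elif error_code in RETRIABLE_ERROR_CODES:
        schedule_retry()  # offsets are not committed; the same records are written later
    else:
        # non-retriable: the task fails instead of retrying
        raise RuntimeError(f"Non-retriable ClickHouse error code: {error_code}")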
Critically, the raised ACCESS_DENIED corresponds to exception code 497, which isn't included in the retriable exceptions list. So we need to change the approach a bit.
statements = [
    privilege_altering_fun(table, escape_sql_identifier(user))
    for table in keeper_map_table_names
    for user in replicated_users_names
- this looks a bit dangerous, because num_users*num_tables may blow up in cases we do not control
- and it is not only potentially slow, it will also grow the replication queue very fast
- that replication queue not only has to be stored, it also has to be replayed by new replicas, repeating all those steps (as I understand it)
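As a quick illustration of the scaling concern: the comprehension issues one statement per (table, user) pair, and each of those becomes a replicated DDL entry. The SQL text below is schematic, not the exact statement built by privilege_altering_fun:

keeper_map_table_names = [f"keeperdata.table_{i}" for i in range(50)]
replicated_users_names = [f"user_{i}" for i in range(100)]

statements = [
    f"REVOKE INSERT ON {table} FROM {user}"  # schematic SQL, for illustration only
    for table in keeper_map_table_names
    for user in replicated_users_names
]
print(len(statements))  # 5000 statements for 50 tables x 100 users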
If the other problems are solved, maybe it makes sense to add a special setting for making KeeperMap read-only?
I don't think that's a problem in practice; we usually have few users and few or no KeeperMap tables.
    for table in keeper_map_table_names
    for user in replicated_users_names
]
await asyncio.gather(*(self.clients[0].execute(statement) for statement in statements))
The effect is not immediate. Ideally, we should wait until this statement propagates to all the replicas.
Right, that makes sense, will update the logic to wait for the effect. Thanks!
I've played around with waiting for the correct grants to appear in the system.grants table, or with using SHOW GRANTS FOR ALL. Unfortunately that doesn't work for inherited grants, since they're computed on the fly and the table is "compressed".
The other solution would be to have a sentinel value that we try to insert into the table, but that's not possible since we don't know the table structure: we would have to go read the types to create a valid INSERT statement.
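For reference, this is roughly what the polling approach described above would look like: query system.grants on every replica until the expected row disappears or a deadline passes. The client interface and quoting are simplified assumptions, and as noted it breaks down for inherited grants:

import asyncio
import time


async def wait_for_revoke_to_propagate(clients, table: str, user: str, timeout: float = 30.0) -> None:
    async def replica_updated(client) -> bool:
        # Naive quoting, for illustration only.
        query = (
            "SELECT count() FROM system.grants "
            f"WHERE access_type = 'INSERT' AND user_name = '{user}' AND table = '{table}'"
        ).encode()
        rows = await client.execute(query)
        return rows[0][0] == 0

    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if all(await asyncio.gather(*(replica_updated(c) for c in clients))):
            return
        await asyncio.sleep(0.5)
    raise TimeoutError("REVOKE did not propagate to all replicas in time")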
Anyhow, there's no good way to wait for a GRANT statement to propagate. Let me know if the updated solution doesn't look too bad (I don't think we can aim for much better).
The PR alters replicated users' privileges (through GRANT & REVOKE) during the backup. This ensures the KeeperMap tables' snapshot is consistent with the frozen table data. [DDB-1237]
Some suggestions that can be covered in a follow-up:
    logger.info("Step %s failed: %s", step, str(e))
except (StepFailedError, WaitResultError) as exc:
    logger.info("Step %s failed: %s", step, str(exc))
    await step.handle_step_failure(cluster, context)
Oh, I was thinking of maybe using a stack, like:
failure_functions = []
for i, step in enumerate(self.steps, 1):
    failure_functions.append(step.handle_step_failure)
    ...
    try:
        ...
    except ...:
        for func in reversed(failure_functions):
            func(cluster, context)
But this is probably good enough.
maybe the except needs to be broader?
Reading the code, if I understand correctly there are two scenarios:
- If the failure is Transient, then the step will be retried, so there's nothing to put on the stack yet.
- If the failure is permanent (an astacus.common.exceptions.PermanentException subclass), then there won't be a next step. The except clause in question sets the step to failed, and higher up the call stack the whole Op is aborted (in this case the BackupOp).
Which case do you think could result in multiple steps needing to be unwound?

> maybe the except needs to be broader?

Looking at the usages of the other PermanentException subclasses, they're either raised during the restore steps or by the Rohmu storage subclasses (which would be caught while the server is being started/configured, I think). Might be worth broadening StepFailedError to PermanentException.
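For concreteness, the broadened clause could look roughly like this; the PermanentException import path is taken from the thread, while the loop structure, step.run_step signature, and whether WaitResultError is already a PermanentException subclass are assumptions rather than verified facts:

import logging

from astacus.common.exceptions import PermanentException

logger = logging.getLogger(__name__)


async def run_steps(steps, cluster, context) -> None:
    for step in steps:
        try:
            await step.run_step(cluster, context)
        except PermanentException as exc:  # previously: (StepFailedError, WaitResultError)
            logger.info("Step %s failed: %s", step, exc)
            await step.handle_step_failure(cluster, context)
            raise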