Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Async replication setup fails when there's write load to primary #399

Open
paulomach opened this issue Mar 27, 2024 · 5 comments
Open

Async replication setup fails when there's write load to primary #399

paulomach opened this issue Mar 27, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@paulomach
Copy link
Contributor

Steps to reproduce

  1. Deploy two clusters with more then one unit each
  2. add write load to the one selected to become the primary
  3. and relate to form a cluster-set

Expected behavior

Cluster set is formed

Actual behavior

Joining instances on secondary cluster fail to join the cluster

@paulomach paulomach added the bug Something isn't working label Mar 27, 2024
Copy link
Contributor

paulomach added a commit that referenced this issue Apr 8, 2024
@paulomach
Copy link
Contributor Author

paulomach commented Apr 11, 2024

Upstream bug report: http://bugs.mysql.com/114624

Oracles internal bug report reference: 36403327

paulomach added a commit that referenced this issue Apr 15, 2024
* poc k8s-k8s async repl

* isort had mismatched line_length config

* (WIP) improved support

* (WIP) support more cases

* scale up/down for repl clusters
* idle state for 2nds
* instance label when creating repl cluster
* more typing

* (wip) replica secondaries support

* allow replica secondaries to rejoin

* (WIP) sync cluster-set-name and better deal with repl 2ndaries

* (WIP) fix secondaries join and messaging

* fix: removed unneeded flag

* support for relation broken

* dissolved replica cluster stays blocked
* added the promote standby action
* support for cluster set name config
* minor refactors and typing

* async common methods, single module/file

* removed dup action and lint fixes

* check for user data on replica side

* moved async_replication to a library (owned by vm charm)

* sync with vm code

* lint fixes

* scale-in locks write to global primary

* address pr comments

* second batch of PR comments adressing

* fence/unfence actions

* fix for single unit replica cluster

* workaround: secrets in relation data

* partial unit test fixes

* normalized cluster name for dict reference

* remove test due to be refactored

* allows rejoin after unrelate

* fix lock instance reference

* allow unrelated cluster rejoin

* reset cluster-set name

* dealing with secret not found

* ensure secrets are shared with full uri

* fix recreation and unrelation after promotion

* ensure all online unis

* unset read only after unfence

* refactor remove instance to acommodate changing

* fix race condition

* automatic deal with clusters with the same name

* bump

* test for mysql version and cluster-set-name on rejoin

* avoid handling on unit removal

* sync local root

* use method from lib

* lint fixes

* addressing pr feedback

* using node mode instead of role

* covering edge cases for recovery/failover

* add rejoin invalidated cluster action

* the integration test

* workaround for issue #399

* bump libpatch

* add group marks

* missing default

* set flag to avoid lock release

* chore: fixes old issue on test

* fix markers

* fix marker import

* fix retry process on removal

* fix user creation after refactor

`root@%` created only when required.

* lint/grammar fixes

* pr comment
@taurus-forever
Copy link
Contributor

@paulomach do we have an update from MySQL team here? I see no activity in http://bugs.mysql.com/114624

@paulomach
Copy link
Contributor Author

paulomach commented Jun 19, 2024

@taurus-forever they've pinged directly on slack.

tl;dr; they cannot reproduce on sandboxed environment. There's something special about the way we are setting things. I've allocated some time to compare their instructions to the way we do it.

For the record, the instructions:

dba.deploy_sandbox_instance(3310, {'password': "", 'sandboxDir': "/tmp/sandbox"});
dba.deploy_sandbox_instance(3320, {'password': "", 'sandboxDir': "/tmp/sandbox"});
dba.deploy_sandbox_instance(3330, {'password': "", 'sandboxDir': "/tmp/sandbox"});

# 1. Deploy Group Replication cluster (A)

shell.connect("[email protected]:3310", "")

cluster = dba.create_cluster("my_cluster")

cluster.add_instance("127.0.0.1:3320", {'recoveryMethod':'incremental'});
cluster.add_instance("127.0.0.1:3330", {'recoveryMethod':'incremental'});

# 2. Create ClusterSet on A

clusterset = cluster.create_cluster_set("foobar")

# 3. Deploy standalone instance B1

dba.deploy_sandbox_instance(3340, {'password': "", 'sandboxDir': "/tmp/sandbox"});

# 4. Create ReplicaCluster on B1 from A (cluster_set.create_replica_cluster)

clusterset.create_replica_cluster("127.0.0.1:3340", "replica", {'recoveryMethod':'incremental'});

# 5. Add some write load to cluster A

# \sql
my_session = mysql.get_classic_session('root@localhost:3310', '');
my_session.run_sql("CREATE DATABASE test");
my_session.run_sql("USE test;");
my_session.run_sql("CREATE TABLE t1 (c1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY, c2 INT) ENGINE=InnoDB;");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");

On other terminal generate load with mysqlslap:

mysqlslap -S /tmp/sandbox/3310/sandboxdata/mysqld.sock --create-schema=test --delimiter=";" --iterations=300000 --query="INSERT INTO t1 (c2) SELECT c2 FROM t1 LIMIT 100" --concurrency=100 &

And on mysqlshell execute final steps:

shell.connect("[email protected]:3310", "")

dba.deploy_sandbox_instance(3350, {'password': "", 'sandboxDir': "/tmp/sandbox"});

cluster = dba.get_cluster("my_cluster")

cluster.add_instance("127.0.0.1:3350", {'recoveryMethod':'incremental'})

@paulomach
Copy link
Contributor Author

Reproducible using:

dba.deploy_sandbox_instance(3310, {'password': "", 'sandboxDir': "/tmp/sandbox"});
dba.deploy_sandbox_instance(3320, {'password': "", 'sandboxDir': "/tmp/sandbox"});
dba.deploy_sandbox_instance(3330, {'password': "", 'sandboxDir': "/tmp/sandbox"});

# 1. Deploy Group Replication cluster (A)
shell.connect("[email protected]:3310", "")
cluster = dba.create_cluster("my_cluster")
cluster.add_instance("127.0.0.1:3320", {'recoveryMethod':'incremental'});
cluster.add_instance("127.0.0.1:3330", {'recoveryMethod':'incremental'});

# 2. Create ClusterSet on A
clusterset = cluster.create_cluster_set("foobar")

# 3. Deploy standalone instance B1
dba.deploy_sandbox_instance(3340, {'password': "", 'sandboxDir': "/tmp/sandbox"});

# 4. Create ReplicaCluster on B1 from A (cluster_set.create_replica_cluster)
clusterset.create_replica_cluster("127.0.0.1:3340", "replica", {'recoveryMethod':'incremental'});

# 5. Add some write load to cluster A

# \sql
my_session = mysql.get_classic_session('root@localhost:3310', '');
my_session.run_sql("CREATE DATABASE test");
my_session.run_sql("USE test;");
my_session.run_sql("CREATE TABLE t1 (c1 INT NOT NULL AUTO_INCREMENT PRIMARY KEY, c2 INT) ENGINE=InnoDB;");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) values (2);");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");
my_session.run_sql("INSERT INTO t1 (c2) SELECT c2 FROM t1;");

On other terminal generate load with mysqlslap:

mysqlslap -S /tmp/sandbox/3310/sandboxdata/mysqld.sock --create-schema=test --delimiter=";" --iterations=300000 --query="INSERT INTO t1 (c2) SELECT c2 FROM t1 LIMIT 100" --concurrency=100 &

And on mysqlshell execute final steps:

shell.connect("[email protected]:3340", "")
dba.deploy_sandbox_instance(3350, {'password': "", 'sandboxDir': "/tmp/sandbox"});
# First difference - I'm adding instances to the replica cluster
cluster = dba.get_cluster("replica")
cluster.add_instance("127.0.0.1:3350", {'recoveryMethod':'incremental'})

# Second difference - sometimes It was needed to add a third instance to the replica cluster
dba.deploy_sandbox_instance(3360, {'password': "", 'sandboxDir': "/tmp/sandbox"});
cluster.add_instance("127.0.0.1:3360", {'recoveryMethod':'incremental'})

session.run_sql("SELECT max(c1) FROM test.t1")
# PK stop incrementing while mysqlslap is running

# connect to `my_cluster` primary
shell.connect("[email protected]:3310", "")
# PK still incrementing while mysqlslap is running
session.run_sql("SELECT max(c1) FROM test.t1")

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants