ege.dev

Friday, 27 March 2026

Point-in-time recovery for MySQL on Kubernetes

Since the v1.0.0 release of the new MySQL Operator (K8SPS), point-in-time recovery (PiTR) has been the most anticipated feature. Naturally, we decided to implement it in the upcoming v1.1.0 release.

PiTR relies on two processes:

  1. Reliably collecting binary logs from MySQL servers and storing them somewhere safe
  2. When the recovery is triggered, applying those binary logs up to a specific point

Collecting binary logs

Our Galera replication-based MySQL Operator (K8SPXC) has a binary log collector that was developed by the Cloud Team. It has worked reliably for years but has a certain limitation that is inherent in its design: it depends on flushing the binary logs to collect them. This leads to huge numbers of binary logs on the MySQL server, which becomes a headache after running the collector for a few weeks. We mitigated this for the users by maintaining a cache for binary logs, but it didn’t remove the pain—it just became ours.

At Percona, #FindABetterWay is one of our core values. In the spirit of finding a better way, when we first started to think about PiTR in K8SPS, we decided to improve the process of collecting binary logs. These discussions eventually led to the birth of a new product: Percona Binlog Server (PBS).

PBS works by connecting to the MySQL server as a replica and streaming events. It either uploads these events to S3 or stores them on the filesystem. It supports replication source switchovers and is able to continue from where it left off. On top of these, it provides helper commands to search for a particular GTID or timestamp in the collected binary logs.

Applying binary logs

The official MySQL docs suggest converting each binary log to text using mysqlbinlog and piping them into the mysql client for PiTR. This is already what we do in K8SPXC.

I decided to check if there’s a better way. First, I checked old posts on the Percona Blog to see if our experts had written anything about a different PiTR approach. It’s not surprising that they did. Marcelo wrote about an approach leveraging replication appliers for recovery. It seemed much better than piping mysqlbinlog output into the client, since with replication we can have multithreading and parallel appliers for recovery. My only problem with this approach was that it required two mysqld instances. Of course it’s possible, but I would love to not have to care about the state of two MySQL servers.

Luckily, lefred commented that there’s an “even better” approach that requires only one MySQL server!

At a high level, the process looks like this:

  1. Restore the full backup into the MySQL datadir
  2. Start a temporary mysqld instance
  3. Download binary logs and put them as relay logs in the datadir
  4. Start replication via CHANGE REPLICATION SOURCE TO RELAY_LOG_FILE=..., SOURCE_HOST='dummy'
  5. Start applying binary logs via START REPLICA SQL_THREAD UNTIL <GTID>
  6. Wait for the SQL thread to stop
  7. STOP REPLICA; RESET REPLICA ALL;

I have already created an unpolished but working PoC. There is also an RFC for explaining various decisions and tracking open questions that need answers. There’s still work to do, but I’m confident that this approach is good and we’ll release this as a tech preview in the upcoming v1.1.0 release. If you have thoughts on the RFC or want to try the PoC, we’d love to hear your feedback.

Thursday, 12 March 2026

gremlins in HAProxy

We are in the process of certifying our operators for <redacted>. We started with PostgreSQL Operator and it worked just fine without any adjustments. Then we moved on to our MySQL Operators and it surfaced a problem in HAProxy.

HAProxy is used by default in our MySQL clusters. They sit in front of MySQL instances as the proxy to have read/write splitting. We have our own external scripts to perform checks for each backend for determining if a MySQL server is good for that particular backend.

After deploying the operator on <redacted>, we realized that our HAProxy pods are failing to get ready because all external checks are failing due to timeouts. But why?

It’s gotta be the DNS. It’s always DNS, isn’t it? Turns out, no. I tested DNS queries from the HAProxy container and it seems they were fast.

Could this be AppArmor? Maybe <redacted> has stricter AppArmor profiles? I configured HAProxy pods to be Unconfined. It didn’t help either.

Then I decided to increase the timeout from 10 seconds to 30 seconds, just to see how much time the script needs to finish. To my surprise, the script was taking 810 milliseconds to finish! How could a check time out in 10 seconds but finish in less than a second with 30 seconds timeout?

In every deep debugging session, there is a moment when engineers start to believe in spiritual beings or gremlins who mess with the software in the system. One needs to resist this temptation of irrationality. In computers there’s always a rational explanation for problems, it’s just very deep and caused by unlucky combinations of design choices and/or bugs.

At this point I decided to attach a debug container into the pod and check what HAProxy is doing with strace. This resulted in the first breakthrough of the problem: the child process of HAProxy that runs the external check command was doing shitloads of poll syscalls until it was killed due to timeout. It wasn’t even running the script, it was simply stuck.

This realization made me shift my focus to HAProxy itself. I started to read the source code to see what’s going on.

src/extcheck.c

int fd;
sa_family_t family;

/* close all FDs. Keep stdin/stdout/stderr in verbose mode */
fd = (global.mode & (MODE_QUIET|MODE_VERBOSE)) == MODE_QUIET ? 0 : 3;

my_closefrom(fd);

/* restore the initial FD limits */
limit.rlim_cur = rlim_fd_cur_at_boot;
limit.rlim_max = rlim_fd_max_at_boot;
if (raise_rlim_nofile(NULL, &limit) != 0) {
	getrlimit(RLIMIT_NOFILE, &limit);
	ha_warning("External check: failed to restore initial FD limits (cur=%u max=%u), using cur=%u max=%u\n",
		   rlim_fd_cur_at_boot, rlim_fd_max_at_boot,
		   (unsigned int)limit.rlim_cur, (unsigned int)limit.rlim_max);
}

HAProxy attempts to close all FDs. What does all mean? And then it restores limits to some value? What’s that value?

All FDs means closing all from FD 0 to FD soft limit. my_closefrom function first polls the FD and then closes it if poll doesn’t fail with POLLNVAL. OK, this explains the excessive polling I see with strace. But it’s the default HAProxy behavior for a long time, why didn’t we see the same problem on some other platform, i.e GKE?

The answer is simple. Turns out, <redacted> has a much higher soft limit than GKE. On GKE ulimit -n returns 1048576, on <redacted> 1073741816! But still there was something that troubled me at this point: I vaguely remembered a configuration option in HAProxy that limits the number of FDs that the process will use. The option is fd-hard-limit and docs say its default value is 1048576. If it had any effect, I wouldn’t have seen any problems. Something was wrong in HAProxy.

Remember that the external check process was doing something to restore limits to some value assigned at boot? Those values are assigned in the main function here:

src/haproxy.c

/* take a copy of initial limits before we possibly change them */
getrlimit(RLIMIT_NOFILE, &limit);

if (limit.rlim_max == RLIM_INFINITY)
	limit.rlim_max = limit.rlim_cur;
rlim_fd_cur_at_boot = limit.rlim_cur;
rlim_fd_max_at_boot = limit.rlim_max;

The code that limits FDs with fd-hard-limit is a few lines below:

if (global.fd_hard_limit && limit.rlim_cur > global.fd_hard_limit)
	limit.rlim_cur = global.fd_hard_limit;

This means external checks restore limits to the value unbounded by fd-hard-limit. Oh, this looks like a bug in HAProxy and explains why we had this issue!

Wednesday, 21 July 2021

Cluster Statuses in Percona Kubernetes Operators

In Kubernetes, all resources have a status field separated from their spec. The status field is an interface both for humans or applications to read the perceived state of the resource.

When you deploy our Percona Kubernetes Operators – Percona Operator for MongoDB or Percona Operator for MySQL – in your Kubernetes cluster, you’re creating a custom resource (CR for short) and it has its own status, too. Since Kubernetes operators mimic the human operator and aim to have the required expertise to run software in a Kubernetes cluster; the status of the custom resources should be smart.

You can get cluster status with the commands below, or via (Kubernetes API) for Percona Operator for MySQL:

% kubectl get pxc
NAME            ENDPOINT                                   STATUS   PXC   PROXYSQL   HAPROXY   AGE
lisette-18537   lisette-18537-haproxy.subjectivism-22940   ready    3                3         87m

% kubectl get pxc <cluster-name> -o jsonpath='{.status}'
{
  "backup": {
    "version": "8.0.23"
  },
  "conditions": [
    {
      "lastTransitionTime": "2021-07-12T13:13:46Z",
      "status": "True",
      "type": "initializing"
    }
  ],
  "haproxy": {
    "labelSelectorPath": "...",
    "ready": 3,
    "size": 3,
    "status": "ready"
  },
  "host": "lisette-18537-haproxy.subjectivism-22940",
  "logcollector": {
    "version": "1.8.0"
  },
  "observedGeneration": 2,
  "pmm": {
    "version": "2.12.0"
  },
  "proxysql": {},
  "pxc": {
    "image": "percona/percona-xtradb-cluster:8.0.22-13.1",
    "labelSelectorPath": "...",
    "ready": 2,
    "size": 3,
    "status": "initializing",
    "version": "8.0.22-13.1"
  },
  "ready": 5,
  "size": 6,
  "state": "initializing"
}

And for Percona Operator for MongoDB:

% kubectl get psmdb
NAME             ENDPOINT                                                     STATUS   AGE
cynodont-26997   cynodont-26997-mongos.subjectivism-22940.svc.cluster.local   ready    85m


% kubectl get psmdb <cluster-name> -o jsonpath='{.status}'
{
  "conditions": [
    {
      "lastTransitionTime": "2021-07-12T13:13:39Z",
      "status": "True",
      "type": "initializing"
    }
  ],
  "host": "cynodont-26997-mongos.subjectivism-22940.svc.cluster.local",
  "mongoImage": "percona/percona-server-mongodb:4.4.6-8",
  "mongoVersion": "4.4.6-8",
  "mongos": {
    "ready": 1,
    "size": 3,
    "status": "initializing"
  },
  "observedGeneration": 2,
  "ready": 3,
  "replsets": {
    "cfg": {
      "ready": 1,
      "size": 3,
      "status": "initializing"
    },
    "rs0": {
      "initialized": true,
      "ready": 2,
      "size": 3,
      "status": "initializing"
    }
  },
  "size": 6,
  "state": "initializing"
}

As you can see there are several fields in the output: conditions, cluster size, number of ready cluster members, statuses and versions of different components, and the “state”. In the following sections, we’ll take a look at every possible value of the state field.

Initializing

While the cluster is progressing to readiness, CR status is “initializing”. It includes creating the cluster, scaling it up or down, and updating the CR that triggers a rolling restart of pods (for instance updating Percona Operator for MySQL memory limits).

Percona Operator for MongoDB also reconfigures the replica set config if necessary (for instance it adds the new pods as members to replset or removes terminated ones). Replica set in MongoDB is a set of servers that implements replication and automatic failover. Although they have the same name, it’s different from the Kubernetes replica set. While this configuration is happening or if there is an unknown/unpredicted error during it, the status is also “initializing”.

Since version 1.7.0, the Percona Operator for MySQL can handle full crash recovery if necessary. If a pod waits for the recovery, the cluster status is “initializing”.

Ready

The operator keeps track of the status of each component in the cluster. Percona Operator for MongoDB has the following components:

  1. mongod StatefulSet
  2. configsvr StatefulSet if sharding is enabled
  3. mongos Deployment if sharding is enabled

Percona Operator for MySQL components:

  1. PXC StatefulSet
  2. HAProxy StatefulSet if enabled
  3. ProxySQL StatefulSet if enabled

All components need to be in “ready” status for CR to be “ready”. If the number of ready pods controlled by the stateful set reaches the desired number, the operator marks the component as ready. The readiness of the pods is tracked by Kubernetes using readiness probes for each container in the pod. For example, for a Percona XtraDB Cluster container to be ready “wsrep_cluster_status” needs to be “Primary” and “wsrep_local_state” should be “Synced” or “Donor”. For a Percona Server for MongoDB container to be ready, accepting TCP connections on 27017 is enough.

But ready as the CR status means more than that. CR “ready” means the cluster (Percona Server for MongoDB or Percona XtraDB Cluster) is up and running and ready to receive traffic. So, even if all components are ready, the cluster status can be “initializing”. In the Percona Operator for MongoDB, the replica set needs to be initialized and its config up-to-date. Also, with the 1.9.0 release of both operators, the load balancer needs to be ready if the cluster is exposed with exposeType: LoadBalancer.

Stopping

Version 1.9.0 introduced two new statuses:

  1. Stopping
  2. Paused

Stopping means the cluster is paused or deleted and its pods are terminating right now.

If you run kubectl delete psmdb <cluster-name> or `kubectl delete pxc `` the resource can be deleted quickly without a chance to see “stopping” status. If you had finalizers (for example “delete-pxc-pods-in-order” in Percona Operator for MySQL) deletion will be blocked until the finalizer list is exhausted and you can observe “stopping” status.

Paused

Once the cluster is paused and all pods are terminated, the CR status becomes “paused”.

To pause the cluster: kubectl patch <psmdb|pxc> <cluster-name> --type=merge -p '{"spec": {"pause": true}}'

Keep in mind, when the cluster is paused and exposeType is LoadBalancer – Load balancers are still there and you continue to pay for them.

Error

Before 1.9.0, “error” status could mean two different things:

  1. An error occurred in the operator during the reconciliation of the CR
  2. One or more pods in a component are not schedulable

With 1.9.0, the “error” status means only the operator errors. If there is an unschedulable pod, the cluster’s status will be initializing. If the cluster is stuck in initializing for too long, it’s better to check the operator logs to investigate.

% kubectl logs <operator-pod-name>
...
{"level":"info","ts":1626095618.9982307,"logger":"controller_psmdb","msg":"Created a new mongo key","Request.Namespace":"subjectivism-22940","Request.Name":"cynodont-26997","KeyName":"cynodont-26997-mongodb-keyfile"}
{"level":"info","ts":1626095619.0032709,"logger":"controller_psmdb","msg":"Created a new mongo key","Request.Namespace":"subjectivism-22940","Request.Name":"cynodont-26997","KeyName":"cynodont-26997-mongodb-encryption-key"}
{"level":"info","ts":1626095687.3783236,"logger":"controller_psmdb","msg":"initiating replset","replset":"rs0","pod":"cynodont-26997-rs0-1"}
{"level":"info","ts":1626095694.020591,"logger":"controller_psmdb","msg":"replset was initialized","replset":"rs0","pod":"cynodont-26997-rs0-1"}
{"level":"error","ts":1626095694.622869,"logger":"controller_psmdb","msg":"failed to reconcile cluster","Request.Namespace":"subjectivism-22940","Request.Name":"cynodont-26997","replset":"rs0","error":"undefined state of the replset member cynodont-26997-rs0-0.cynodont-26997-rs0.subjectivism-22940.svc.cluster.local:27017: 6","errorVerbose":"undefined state of the replset member cynodont-26997-rs0-0.cynodont-26997-rs0.subjectivism-22940.svc.cluster.local:27017: 6\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).reconcileCluster\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/mgo.go:210\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:449\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:451\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}
% kubectl logs <operator-pod-name>
 
...
{"level":"info","ts":1626095618.9982307,"logger":"controller_psmdb","msg":"Created a new mongo key","Request.Namespace":"subjectivism-22940","Request.Name":"cynodont-26997","KeyName":"cynodont-26997-mongodb-keyfile"}
{"level":"info","ts":1626095619.0032709,"logger":"controller_psmdb","msg":"Created a new mongo key","Request.Namespace":"subjectivism-22940","Request.Name":"cynodont-26997","KeyName":"cynodont-26997-mongodb-encryption-key"}
{"level":"info","ts":1626095687.3783236,"logger":"controller_psmdb","msg":"initiating replset","replset":"rs0","pod":"cynodont-26997-rs0-1"}
{"level":"info","ts":1626095694.020591,"logger":"controller_psmdb","msg":"replset was initialized","replset":"rs0","pod":"cynodont-26997-rs0-1"}
{"level":"error","ts":1626095694.622869,"logger":"controller_psmdb","msg":"failed to reconcile cluster","Request.Namespace":"subjectivism-22940","Request.Name":"cynodont-26997","replset":"rs0","error":"undefined state of the replset member cynodont-26997-rs0-0.cynodont-26997-rs0.subjectivism-22940.svc.cluster.local:27017: 6","errorVerbose":"undefined state of the replset member cynodont-26997-rs0-0.cynodont-26997-rs0.subjectivism-22940.svc.cluster.local:27017: 6\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).reconcileCluster\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/mgo.go:210\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:449\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88\nruntime.goexit\n\t/usr/local/go/src/runtime/asm_amd64.s:1371","stacktrace":"github.com/go-logr/zapr.(*zapLogger).Error\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/github.com/go-logr/zapr/zapr.go:128\ngithub.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb.(*ReconcilePerconaServerMongoDB).Reconcile\n\t/go/src/github.com/percona/percona-server-mongodb-operator/pkg/controller/perconaservermongodb/psmdb_controller.go:451\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:256\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:232\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).worker\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:211\nk8s.io/apimachinery/pkg/util/wait.JitterUntil.func1\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:152\nk8s.io/apimachinery/pkg/util/wait.JitterUntil\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:153\nk8s.io/apimachinery/pkg/util/wait.Until\n\t/go/src/github.com/percona/percona-server-mongodb-operator/vendor/k8s.io/apimachinery/pkg/util/wait/wait.go:88"}

You can try new statuses in version 1.9.0 of both Percona Operator for MongoDB and Percona Operator for MySQL. Percona Operator for MongoDB was released in June and Percona Operator for MySQL is on the way.