Running a MySQL cluster in production eventually teaches you that MySQL’s built-in tooling is optimised for the happy path. Replication works, until it doesn’t. Schema changes are fast, until the table has 80 million rows. SHOW SLAVE STATUS reports lag, but the number isn’t always telling you the truth.
Percona Toolkit is a collection of command-line tools that covers exactly these gaps — the operational situations where the native toolset leaves you guessing. I’ve been running MySQL replication clusters for long enough to have a working set of tools I reach for reflexively. These are the ones that have earned a permanent place in that set.
pt-table-checksum
The first thing you need to know about replication drift is that MySQL will not tell you it’s happening. A replica can fall silently out of sync — different row counts, different column values — while SHOW SLAVE STATUS reports Seconds_Behind_Master: 0 and no errors. The binlog says everything is fine because everything that was replicated was applied correctly. The drift came from something else: a direct write to the replica, a failed transaction that rolled back on the primary but committed on the replica, a binary log format mismatch.
pt-table-checksum detects this. It runs chunked checksum queries on the primary — REPLACE ... SELECT statements built around a CRC32-based aggregate — and writes the results to a percona.checksums table. Those statements replicate, so each replica computes the same checksums over its own data when it applies them. The tool then connects to each replica and compares the replica's computed checksums against the primary's values.
pt-table-checksum \
--user=root \
--password=secret \
--host=primary.db.internal \
--databases=myapp \
--chunk-size=1000 \
--max-lag=1s \
--replicate=percona.checksums
--max-lag is important. The tool throttles itself if replica lag exceeds the threshold, which prevents the checksum queries themselves from causing the lag they’re trying to measure around. --chunk-size controls how many rows each checksum covers — smaller chunks mean finer granularity on which rows have drifted, at the cost of more queries.
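The throttle-and-chunk loop can be sketched as follows — a toy model, where `get_replica_lag` and `checksum_chunk` are hypothetical callbacks standing in for the real lag check and checksum queries:

```python
import time

def checksum_with_throttle(chunks, get_replica_lag, checksum_chunk,
                           max_lag=1.0, poll=0.5):
    """Checksum each chunk, pausing whenever replica lag exceeds max_lag.

    Toy model of pt-table-checksum's --max-lag behaviour: the tool checks
    replica lag between chunks and sleeps until replicas catch up before
    issuing the next checksum query.
    """
    results = []
    for chunk in chunks:
        while get_replica_lag() > max_lag:   # throttle before each chunk
            time.sleep(poll)
        results.append(checksum_chunk(chunk))
    return results
```

The point of the structure is that the expensive work is interruptible between chunks, which is what makes the tool safe to run against a live cluster.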
The output tells you which tables have diffs and on which replicas. The next step is pt-table-sync.
pt-table-sync
Once you know which tables have drifted, pt-table-sync fixes them — without stopping replication and without long table locks.
The key thing to understand about pt-table-sync is that it generates the minimal set of statements (chiefly REPLACE and DELETE) required to bring the replica into agreement with the primary. With --sync-to-master — the mode used below — those statements execute on the primary, where they have no effect on the primary's own data, and flow through replication to repair the replica. That routing also keeps any further replicas downstream of the repaired one consistent.
pt-table-sync \
--execute \
--sync-to-master \
--user=root \
--password=secret \
h=replica.db.internal,D=myapp,t=orders
--sync-to-master tells the tool to make the replica match the primary. --execute is required to actually apply changes — without an action flag the tool refuses to do anything. The dry-run action is --print, which shows the statements it would execute. Always run the dry-run first:
pt-table-sync \
--print \
--sync-to-master \
--user=root \
--password=secret \
h=replica.db.internal,D=myapp,t=orders
Review the output carefully. pt-table-sync is doing the right thing in almost every case, but “almost every case” is not “every case”. On a table with drift caused by a bad migration, the fix it generates is exactly what you want. On a table where the drift is a symptom of an ongoing application bug, fixing the symptom without understanding the cause will cost you the next time the tool runs.
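The diffing idea itself reduces to a keyed comparison. A toy sketch, with rows as primary-key-indexed dicts rather than real tables (the actual tool works from checksummed chunks, never full row sets):

```python
def sync_statements(primary_rows, replica_rows, table):
    """Generate the minimal INSERT/UPDATE/DELETE set that makes the
    replica's rows match the primary's. Rows are {pk: row_dict}.
    Toy model only: pt-table-sync locates differing chunks via
    checksums before comparing individual rows."""
    stmts = []
    for pk, row in primary_rows.items():
        if pk not in replica_rows:
            stmts.append(("INSERT", table, pk, row))      # missing on replica
        elif replica_rows[pk] != row:
            stmts.append(("UPDATE", table, pk, row))      # differing values
    for pk in replica_rows:
        if pk not in primary_rows:
            stmts.append(("DELETE", table, pk, None))     # extra on replica
    return stmts
```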
pt-query-digest
Slow query analysis is the other half of cluster management. MySQL’s slow query log captures what you tell it to capture, but the log file itself is not particularly useful to read — it’s verbose, unordered, and gives you per-query execution time rather than aggregate cost. pt-query-digest turns the raw log into an actionable summary.
pt-query-digest /var/log/mysql/slow.log
The output groups queries by fingerprint — a normalised form with literals replaced by ? placeholders — and reports aggregate statistics: total time, average time, worst-case time, rows examined, rows sent, and a count of how often the query pattern appeared. The report is sorted by total time consumed, so the first entries are the ones that matter most.
# Profile
# Rank Query ID           Response time   Calls R/Call V/M   Item
# ==== ================== =============== ===== ====== ===== ====
#    1 0x89C9E...         142.3249 40.1%   5234 0.0272  0.06 SELECT orders
#    2 0xF3B7A...          89.1031 25.1%    801 0.1113  0.15 SELECT products JOIN categories
#    3 0x2C14E...          47.8822 13.5%   1200 0.0399  0.04 UPDATE sessions
From there, you can look at specific query examples, execution plans, and the distribution of response times. The variance-to-mean ratio (V/M) is worth paying attention to — a high ratio means the query is sometimes fast and sometimes catastrophically slow, which is usually an index selectivity problem or a locking issue rather than a fundamentally bad query.
pt-query-digest can also read from SHOW PROCESSLIST output, binary logs, and TCP dumps — useful when you can’t enable the slow query log in a running production environment.
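The fingerprinting step can be approximated with a few regex substitutions — a rough sketch only; the real tool handles many more cases (IN-lists, comments, multi-value inserts):

```python
import re

def fingerprint(query):
    """Crude query fingerprint: lowercase, collapse string and numeric
    literals to ?, normalise whitespace. pt-query-digest's real
    algorithm is far more thorough, but the grouping idea is the same:
    queries differing only in literals share a fingerprint."""
    q = query.lower()
    q = re.sub(r"'(?:[^'\\]|\\.)*'", "?", q)   # string literals
    q = re.sub(r"\b\d+\b", "?", q)             # numeric literals
    q = re.sub(r"\s+", " ", q).strip()         # whitespace
    return q
```

Two queries that differ only in their bound values hash to the same fingerprint, which is what makes aggregate statistics per query pattern possible.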
pt-online-schema-change
ALTER TABLE on a large MySQL table holds a metadata lock for the duration of the operation. On a table with millions of rows, that's minutes — and once a write is queued behind the lock, new reads queue behind the write. In practice, this means you cannot run non-trivial schema migrations during business hours without taking downtime, which becomes increasingly difficult to justify as the application grows.
pt-online-schema-change (pt-osc) solves this. The approach: create a shadow table with the new schema, copy rows in chunks from the original, install triggers on the original to replicate concurrent writes to the shadow, then rename the shadow table into place. The original never gets locked; writes to the original are captured and replayed on the shadow during the copy phase.
pt-online-schema-change \
--alter="ADD COLUMN last_seen_at DATETIME" \
--user=root \
--password=secret \
--execute \
D=myapp,t=users
The --alter string is exactly what you'd put after ALTER TABLE tablename in a standard MySQL statement. You can add columns, drop columns, add indexes, change column types — nearly anything ALTER TABLE supports, with a few documented exceptions (renaming the table itself is one).
pt-online-schema-change \
--alter="ADD INDEX idx_users_email (email)" \
--chunk-size=2000 \
--max-lag=2s \
--user=root \
--password=secret \
--execute \
D=myapp,t=users
--chunk-size controls the batch size for the row copy; --max-lag pauses the copy if replica lag exceeds the threshold. Both are worth tuning for your cluster’s capacity.
The main caveat: tables with foreign keys require extra care. pt-osc handles foreign keys, but the strategy (--alter-foreign-keys-method) matters — rebuild_constraints is safer but slower; drop_swap is faster but briefly removes the constraint. Know which tables in your schema have foreign keys before running this.
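The copy-plus-trigger mechanism can be modelled with in-memory tables — a toy simulation, not the real tool, with dicts standing in for the original and shadow tables and a callback standing in for concurrent application writes:

```python
def online_copy(original, apply_writes, chunk_size=2):
    """Toy model of pt-osc's copy phase: copy the original in chunks
    while a trigger-like hook forwards concurrent writes to the shadow.

    `original` is {pk: row}; `apply_writes` is a hypothetical callback
    invoked between chunks, returning writes {pk: row} that arrived
    meanwhile. As with pt-osc's triggers, each write lands on both the
    original and the shadow, so the shadow never falls behind."""
    shadow = {}
    pks = sorted(original)
    for i in range(0, len(pks), chunk_size):
        for pk in pks[i:i + chunk_size]:
            shadow[pk] = original[pk]            # chunked copy
        for pk, row in apply_writes().items():   # concurrent writes
            original[pk] = row
            shadow[pk] = row                     # trigger keeps shadow current
    return shadow  # finally: an atomic RENAME swaps the shadow into place
```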
pt-heartbeat
Seconds_Behind_Master in SHOW SLAVE STATUS is derived from comparing the timestamp on the most recently executed binlog event against the current time. The number is meaningless when the replica is idle — if no binlog events have arrived recently, the counter shows zero regardless of actual lag. It’s also susceptible to clock skew between the primary and replica.
pt-heartbeat measures lag differently. It writes a timestamp to a heartbeat table on the primary at a configurable interval. Replication carries that write to replicas. The monitor process on each replica reads the heartbeat table and reports how far behind the replica’s copy is from the current wall-clock time.
Start the heartbeat writer on the primary (--create-table creates the heartbeat table on the first run if it doesn't already exist):
pt-heartbeat \
--user=root \
--password=secret \
--host=primary.db.internal \
--database=percona \
--create-table \
--update \
--daemonize
Monitor lag on a replica:
pt-heartbeat \
--user=root \
--password=secret \
--host=replica.db.internal \
--database=percona \
--monitor
Output is a continuous stream of lag measurements: 0.00s, 0.01s, or — when something goes wrong — 3.47s climbing. This is the number you want in your monitoring stack, not Seconds_Behind_Master.
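The measurement itself is just wall-clock arithmetic — a sketch of the replica-side computation, assuming the heartbeat row carries the timestamp the primary wrote:

```python
from datetime import datetime, timezone

def heartbeat_lag(heartbeat_ts, now=None):
    """Replica-side lag: current wall-clock time minus the timestamp
    the primary wrote, as replicated into the heartbeat table. Assumes
    both hosts keep synchronised clocks (pt-heartbeat depends on NTP
    or similar for accuracy)."""
    now = now or datetime.now(timezone.utc)
    return max(0.0, (now - heartbeat_ts).total_seconds())
```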
pt-kill
Long-running queries on a replica can block DDL operations or — if you’re using the replica for analytics — monopolise connections that the application needs. pt-kill monitors SHOW PROCESSLIST and kills queries that meet specified criteria.
pt-kill \
--host=replica.db.internal \
--user=root \
--password=secret \
--busy-time=30 \
--kill \
--victims=oldest \
--interval=5 \
--daemonize
--busy-time=30 targets queries that have been running for more than 30 seconds. --victims=oldest kills the longest-running match first. --interval=5 checks every five seconds. --daemonize runs it as a background process.
There’s a dry-run mode:
pt-kill \
--host=replica.db.internal \
--user=root \
--password=secret \
--busy-time=30 \
--print
This prints what would be killed without actually killing anything. Run the print mode first, confirm the targets are what you expect, then switch to --kill.
The obvious risk: killing a legitimate long-running analytics query that a colleague is waiting on. The --match-command, --match-user, and --ignore-user flags give you finer control over which connections are candidates. Setting --ignore-user=analytics_ro exempts read-only analytics connections from the kill policy while still catching runaway application queries.
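The selection logic amounts to filtering the processlist — a toy sketch of the busy-time and ignore-user policy, with processlist rows as plain dicts (real pt-kill matches on many more attributes: command, state, host, query text):

```python
def kill_candidates(processlist, busy_time=30, ignore_users=()):
    """Toy model of pt-kill's filter: active queries running longer
    than busy_time seconds, excluding exempt users. Returns matches
    longest-running first, mirroring --victims=oldest."""
    victims = [p for p in processlist
               if p["command"] == "Query"          # skip Sleep, Binlog Dump, ...
               and p["time"] > busy_time
               and p["user"] not in ignore_users]
    return sorted(victims, key=lambda p: p["time"], reverse=True)
```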
pt-duplicate-key-checker
Indexes accumulate over time. A column gets an index for a query that no longer exists. A composite index is added that makes a single-column index on the same leading column redundant. Nobody deletes the old one because nobody is certain it’s safe to drop. pt-duplicate-key-checker finds these.
pt-duplicate-key-checker \
--user=root \
--password=secret \
--host=primary.db.internal \
--databases=myapp
The output identifies duplicate and redundant keys with the ALTER TABLE ... DROP INDEX statement ready to execute:
# Duplicate of idx_users_email
# Key definitions:
# KEY `idx_users_email_created` (`email`,`created_at`)
# KEY `idx_users_email` (`email`)
# Column types:
# `email` varchar(255)
# `created_at` datetime
# To remove this duplicate key, execute:
ALTER TABLE `myapp`.`users` DROP INDEX `idx_users_email`;
Here the single-column index on email is redundant because idx_users_email_created has email as its leading column and covers any query the single-column index would serve. Dropping idx_users_email is safe.
Verify with EXPLAIN before dropping anything in production. The tool is accurate, but the decision of which duplicate to remove matters — and the tool defaults to suggesting removal of the shorter index, which is not always the right choice when query patterns are uneven.
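The redundancy test at the heart of the tool is a leading-prefix comparison — a minimal sketch over index column lists:

```python
def is_redundant(index_cols, other_cols):
    """True if `index_cols` is a leading prefix of `other_cols`,
    meaning the wider index can serve every query the narrower one
    serves. Holds for simple B-tree indexes; uniqueness constraints
    and index options change the answer, which is why the tool's
    suggestions still need review before you drop anything."""
    return (len(index_cols) <= len(other_cols)
            and list(other_cols[:len(index_cols)]) == list(index_cols))
```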
pt-archiver
Deleting rows from a large table with a plain DELETE FROM ... WHERE ... statement can be catastrophic. It holds row-level locks for the duration of the statement, writes a large transaction to the binlog, and causes replica lag proportional to the size of the operation. On a table with millions of stale rows, you do not want to do this in a single statement.
pt-archiver deletes (or archives) rows in configurable chunks, with configurable sleeps between chunks, respecting replica lag throughout.
pt-archiver \
--source h=primary.db.internal,D=myapp,t=events \
--purge \
--where "created_at < DATE_SUB(NOW(), INTERVAL 90 DAY)" \
--limit=500 \
--sleep=0.1 \
--user=root \
--password=secret
--purge deletes without archiving to a destination. --limit=500 processes 500 rows per chunk. --sleep=0.1 pauses 100ms between chunks. For archiving to another table or another server rather than deleting, replace --purge with --dest.
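The chunked-delete pattern pt-archiver automates looks roughly like this — a sketch in which `execute` is a hypothetical callback that runs `DELETE ... WHERE ... LIMIT n` and returns the affected row count:

```python
import time

def purge_in_chunks(execute, limit=500, sleep=0.1):
    """Delete matching rows in small batches until none remain.
    Each batch is its own small transaction, so locks stay short,
    binlog events stay small, and replicas stay close behind."""
    total = 0
    while True:
        deleted = execute(limit)
        total += deleted
        if deleted < limit:        # final, partial batch: nothing left
            break
        time.sleep(sleep)          # pause to let replicas catch up
    return total
```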
For ongoing data retention policies, pt-archiver running as a nightly cron job is considerably safer than a scheduled DELETE query.
What the Toolkit Doesn’t Do
These tools are operational — they help you manage a cluster that exists. They do not help you design a replication topology that avoids drift in the first place, choose a failover strategy, or decide between semi-synchronous and asynchronous replication. That thinking has to happen before the cluster is live.
pt-table-sync in particular has a sharp edge worth repeating: it repairs drift, but if drift is recurring, you have an upstream problem that sync won’t solve. Finding and fixing the cause — a replica receiving direct writes, an application ignoring transaction boundaries, a schema difference between primary and replica — is more important than how quickly you can run the repair.
The toolkit is well-maintained and the documentation is thorough. If you’re running MySQL clusters and not using it, you’re solving problems with worse tools than you need to be.