前言
在 MySQL 主从架构中,从库(Slave)突然崩溃且重启后陷入“启动-崩溃-再启动”的死循环,往往令人惊心动魄。最近在维护一个基于 MySQL 5.7.24 (Percona Server) 的生产环境时,就遭遇了一次由于 FEDERATED 存储引擎引起的从库核心崩溃。
本文将还原这次故障的现场,深入分析其背后的底层原因,并提供紧急恢复与免重启动态根治的完整方案。
一、 故障现场:崩溃日志还原
某天,从库(3306 端口)突然离线,查看错误日志(Error Log),发现了以下密密麻麻的堆栈信息:
2026-06-10T18:23:30.271709+08:00 1 [Note] Slave I/O thread for channel '': connected to master 'slave@192.168.23.91:3306',replication started in log 'mysql-bin.017232' at position 690661736
2026-06-10T18:23:30.293590+08:00 0 [Note] Event Scheduler: Loaded 0 events
2026-06-10T18:23:30.294407+08:00 0 [Note] /usr/local/webserver/mysql/bin/mysqld: ready for connections.
Version: '5.7.24-27-log' socket: '/tmp/mysql.sock' port: 3306 Percona Server (GPL), Release 27, Revision bd42700
2026-06-10T18:23:30.306095+08:00 2 [Note] Slave SQL thread for channel '' initialized, starting replication in log 'mysql-bin.017232' at position 689961274, relay log './db92-relay-bin.045838' position: 689961535
10:23:30 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.
Please help us make Percona Server better by reporting any
bugs at http://bugs.percona.com/
key_buffer_size=33554432
read_buffer_size=8388608
max_used_connections=0
max_threads=1001
thread_count=18
connection_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 12334946 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.
Thread pointer: 0x7f5cb500e000
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f6780ce6cf0 thread_stack 0x80000
/usr/local/webserver/mysql/bin/mysqld(my_print_stacktrace+0x2c)[0xedf27c]
/usr/local/webserver/mysql/bin/mysqld(handle_fatal_signal+0x461)[0xd206f1]
/lib64/libpthread.so.0(+0xf5d0)[0x7f678f98a5d0]
/usr/local/webserver/mysql/bin/mysqld(_ZN12ha_federated7rnd_posEPhS0_+0x2f)[0xfb159f]
/usr/local/webserver/mysql/bin/mysqld(_ZN7handler10ha_rnd_posEPhS0_+0x172)[0x7aa3a2]
/usr/local/webserver/mysql/bin/mysqld(_ZN7handler17rnd_pos_by_recordEPh+0x36)[0x7b2d66]
/usr/local/webserver/mysql/bin/mysqld(_ZN14Rows_log_event24do_index_scan_and_updateEPK14Relay_log_info+0x1c6)[0xe58ec6]
/usr/local/webserver/mysql/bin/mysqld(_ZN14Rows_log_event14do_apply_eventEPK14Relay_log_info+0x65e)[0xe5818e]
/usr/local/webserver/mysql/bin/mysqld(_Z27slave_worker_exec_job_groupP12Slave_workerP14Relay_log_info+0x17f)[0xeb519f]
/usr/local/webserver/mysql/bin/mysqld(handle_slave_worker+0x35b)[0xe9829b]
/usr/local/webserver/mysql/bin/mysqld(pfs_spawn_thread+0x1b4)[0xef6d34]
/lib64/libpthread.so.0(+0x7dd5)[0x7f678f982dd5]
/lib64/libc.so.6(clone+0x6d)[0x7f678db51ead]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7f5cb73a81a0): is an invalid pointer
Connection ID (thread ID): 3
Status: NOT_KILLED
You may download the Percona Server operations manual by visiting
http://www.percona.com/software/percona-server/. You may find information
in the manual which will help you identify the cause of the crash.
关键线索提取
mysqld got signal 11:这意味着服务触发了操作系统的段错误(Segment Fault),即内存越界访问,导致进程被系统强制杀死。handle_slave_worker/do_apply_event:表明崩溃发生在从库的多线程复制(MTS)Worker 线程重放 Binlog 事件时。do_index_scan_and_update:从库正在对某张表进行索引扫描以执行UPDATE或DELETE操作。ha_federated::rnd_pos:罪魁祸首现身! 从库当前正在同步的表使用了FEDERATED(联邦/远程映射)存储引擎。
二、 核心原因深挖:为什么会崩溃?
排查业务建表语句,涉事表结构如下:
CREATE TABLE `fd_statistics_order` (
`order_id` int(11) NOT NULL COMMENT '订单ID',
...
PRIMARY KEY (`order_id`),
KEY `ben_statistics_order_pay_time_index` (`pay_time`) USING BTREE
) ENGINE=FEDERATED DEFAULT CHARSET=utf8mb4 CONNECTION='mysql://user:passwd@192.168.23.92:3307/shop_stats/shop_statistics_order';
该表通过网络将数据实时映射到了 192.168.23.92:3307 上的远程表。
崩溃原理
这是 MySQL 5.7 早期版本中一个著名的官方已知 Bug。当主库对一张 Federated 表执行更新或删除,并且 Binlog 格式为 ROW 时,从库的 SQL 线程在重放该事件时会调用 ha_federated::rnd_pos() 函数去定位远程行的位置。
然而,由于 Federated 引擎对 Row 格式 Binlog 传入的行指针(Row Pointer)校验和内存管理存在漏洞,导致了内存越界访问。操作系统为了保护内存安全,直接向 mysqld 发送了 Signal 11 信号,导致服务瞬间崩溃。
由于崩溃时该条 Binlog 尚未被标记为“已执行”,从库重启后会再次读取该条日志,再次崩溃,从而陷入无限死循环。
三、 紧急恢复从库服务
当从库因坏账日志陷入死循环时,我们需要“阻断”它的自动同步,先把服务拉起来。
步骤 1:阻止从库自动启动复制
修改从库的 my.cnf 配置文件,在 [mysqld] 模块下临时加入:
[mysqld]
skip-slave-start = 1
重启 MySQL 服务。此时从库能正常启动,但不会自动运行 Replication,从而避开坏账日志。
mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
Slave_IO_State:
Master_Host: 192.168.23.91
Master_User: slave
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.017232
Read_Master_Log_Pos: 690661736
Relay_Log_File: db92-relay-bin.045838
Relay_Log_Pos: 689961535
Relay_Master_Log_File: mysql-bin.017232
Slave_IO_Running: No
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table: information_schema.%,performance_schema.%,sys.%,percona.%
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 689961274
Relay_Log_Space: 690662481
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 0
Master_UUID: 4d83ee2d-11ad-11e9-953c-1866dae7c89c
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 4d83ee2d-11ad-11e9-953c-1866dae7c89c:4522980-13586051:13586057-42130492:42130497-42134293:42134317-1187029942
Executed_Gtid_Set: 047d948c-23ec-11e9-8e05-0242fb3aada3:1-7,
4d83ee2d-11ad-11e9-953c-1866dae7c89c:1-1187029942,
6c679363-11a5-11e9-8b86-1866dae7c89c:1-2
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
步骤 2:主从复制不同步联邦表
由于 Federated 表本质上只是一个远程映射,主库修改后远程数据已经改变,从库完全不需要通过 Binlog 重放来同步它。因此,最佳实践是在从库上直接过滤掉所有 Federated 表。
这里的联邦表命名是fd_开头,过滤所有库及所有表的fd开头 '%.fd%'
上面命令 SHOW SLAVE STATUS\G 可以看到 Slave_IO_Running 和 Slave_SQL_Running 已经为 No,这里直接配置主从复制不同步联邦表
mysql> CHANGE REPLICATION FILTER REPLICATE_WILD_IGNORE_TABLE = (
'information_schema.%',
'performance_schema.%',
'sys.%',
'percona.%',
'%.fd_%'
);
mysql> START SLAVE;
mysql> show slave status\G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 192.168.23.91
Master_User: slave
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.017232
Read_Master_Log_Pos: 725755336
Relay_Log_File: db92-relay-bin.045842
Relay_Log_Pos: 35786096
Relay_Master_Log_File: mysql-bin.017232
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table: information_schema.%,performance_schema.%,sys.%,percona.%,%.fd_%
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 725746868
Relay_Log_Space: 726457362
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 9106
Master_UUID: 4d83ee2d-11ad-11e9-953c-1866dae7c89c
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 4d83ee2d-11ad-11e9-953c-1866dae7c89c:4522980-13586051:13586057-42130492:42130497-42134293:42134317-1187034035
Executed_Gtid_Set: 047d948c-23ec-11e9-8e05-0242fb3aada3:1-7,
4d83ee2d-11ad-11e9-953c-1866dae7c89c:1-1187034035,
6c679363-11a5-11e9-8b86-1866dae7c89c:1-2
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
确认 SHOW SLAVE STATUS\G 中的 Slave_IO_Running 和 Slave_SQL_Running 恢复为 Yes 后,记得去 my.cnf 中删除 skip-slave-start = 1。
由于内存变量重启后会丢失,必须将其持久化到 my.cnf 中(多条规则需分多行编写):
[mysqld]
replicate-wild-ignore-table = information_schema.%
replicate-wild-ignore-table = performance_schema.%
replicate-wild-ignore-table = sys.%
replicate-wild-ignore-table = percona.%
replicate-wild-ignore-table = %.fd_%
# skip-slave-start = 1
步骤3:在线动态调整其他健康的从库
如果我们有其他健康的从库也需要配置过滤,但不希望重启服务影响业务,可以采用以下“免重启动态平滑配置”方案:
在 MySQL 5.7+ 中,我们可以利用 CHANGE REPLICATION FILTER 语法在线修改过滤规则。
⚠️ 重要避坑提示:
CHANGE REPLICATION FILTER是覆盖性的命令。执行时必须把原有的过滤规则和新加的规则写在一起,否则旧规则会丢失!
-- 1. 暂停 SQL 线程(不影响 IO 线程接收日志,业务无感知)
STOP SLAVE SQL_THREAD;
-- 2. 动态注入通配符过滤规则(假设所有联邦表都以 fd_ 开头,这里保留了原有的系统过滤,并追加了 '%.fd_%')
CHANGE REPLICATION FILTER REPLICATE_WILD_IGNORE_TABLE = (
'information_schema.%',
'performance_schema.%',
'sys.%',
'percona.%',
'%.fd_%'
);
-- 3. 重新启动 SQL 线程
START SLAVE SQL_THREAD;
执行后输入 SHOW SLAVE STATUS\G,可以看到 Replicate_Wild_Ignore_Table 已经动态更新!
my.cnf 持久化
# /etc/my.cnf
[mysqld]
replicate-wild-ignore-table = %.fd_%
总结与反思
- 架构设计避坑:
FEDERATED存储引擎在现代 MySQL 架构中应避免在主从复制环境下的表中使用它。 - 版本红利:该内存越界 Bug 在 MySQL
5.7.35及后续版本中已被官方修复。如果条件允许,建议将老旧的5.7.24升级至 5.7 的终极稳定版5.7.44。 - 运维工具箱:熟练掌握
CHANGE REPLICATION FILTER可以在不重启生产数据库的前提下,优雅地调整复制流,是 DBA 必备的微创手术技能。
SundayHK