Cause: cd or ls on a file path responds extremely slowly, and killing the related processes does not help.
GitLab went down (slow disk response).
root@demo18:~# tail /var/log/gitlab/unicorn/unicorn_stderr.log -f
I, [2024-07-22T13:36:25.674206 #41065] INFO -- : master complete
I, [2024-07-22T13:36:27.064958 #47425] INFO -- : Refreshing Gem list
I, [2024-07-22T13:36:45.467352 #47425] INFO -- : listening on addr=127.0.0.1:8181 fd=17
I, [2024-07-22T13:36:45.467488 #47425] INFO -- : unlinking existing socket=/var/opt/gitlab/gitlab-rails/sockets/gitlab.socket
I, [2024-07-22T13:36:45.467569 #47425] INFO -- : listening on addr=/var/opt/gitlab/gitlab-rails/sockets/gitlab.socket fd=18
I, [2024-07-22T13:36:45.523851 #47594] INFO -- : worker=0 ready
I, [2024-07-22T13:36:45.527721 #47597] INFO -- : worker=1 ready
I, [2024-07-22T13:36:45.533209 #47600] INFO -- : worker=2 ready
I, [2024-07-22T13:36:45.534730 #47425] INFO -- : master process ready
I, [2024-07-22T13:36:45.538788 #47603] INFO -- : worker=3 ready
E, [2024-07-22T13:37:54.549813 #47425] ERROR -- : worker=3 PID:47603 timeout (61s > 60s), killing
E, [2024-07-22T13:37:54.580907 #47425] ERROR -- : reaped #<Process::Status: pid 47603 SIGKILL (signal 9)> worker=3
I, [2024-07-22T13:37:54.594292 #48375] INFO -- : worker=3 ready
E, [2024-07-22T13:38:03.600771 #47425] ERROR -- : worker=0 PID:47594 timeout (61s > 60s), killing
E, [2024-07-22T13:38:03.629095 #47425] ERROR -- : reaped #<Process::Status: pid 47594 SIGKILL (signal 9)> worker=0
I, [2024-07-22T13:38:03.641272 #48485] INFO -- : worker=0 ready
E, [2024-07-22T13:38:05.640754 #47425] ERROR -- : worker=2 PID:47600 timeout (61s > 60s), killing
E, [2024-07-22T13:38:05.668951 #47425] ERROR -- : reaped #<Process::Status: pid 47600 SIGKILL (signal 9)> worker=2
I, [2024-07-22T13:38:05.682336 #48492] INFO -- : worker=2 ready
E, [2024-07-22T13:38:38.686827 #47425] ERROR -- : worker=1 PID:47597 timeout (61s > 60s), killing
E, [2024-07-22T13:38:38.712657 #47425] ERROR -- : reaped #<Process::Status: pid 47597 SIGKILL (signal 9)> worker=1
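To see how often workers are being killed, the timeout lines in the log above can be tallied per worker. This is a minimal sketch, not part of the original troubleshooting; the function name count_worker_timeouts is hypothetical, and it assumes the log lines keep the "worker=N ... timeout ... killing" shape shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read Unicorn stderr log text on stdin and print
# "worker=N <count>" for every worker killed by the master for a timeout.
count_worker_timeouts() {
    awk '/ERROR/ && /timeout/ && /killing/ {
        # Find the worker=N token on the line and count it.
        for (i = 1; i <= NF; i++)
            if ($i ~ /^worker=[0-9]+$/) count[$i]++
    }
    END { for (w in count) print w, count[w] }' | sort
}
```

Example use: count_worker_timeouts < /var/log/gitlab/unicorn/unicorn_stderr.log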
HDD/SSD
MegaCli -PDlist -aAll | grep -i "Media Error Count"
MegaCli -PDlist -aAll | grep -i "Predictive Failure Count"
MegaCli -PDlist -aAll | grep -i "Other Error Count"
perccli /c0/eall/sall show all | grep -i "Media Error Count"
perccli /c0/eall/sall show all | grep -i "Predictive Failure Count"
perccli /c0/eall/sall show all | grep -i "Other Error Count"
mkdir -p /tmp/raid
perccli /c0 show termlog > /tmp/raid/termlog.txt
perccli /c0 show events > /tmp/raid/events.txt
perccli /c0 show all > /tmp/raid/all.txt
perccli /c0 show alilog > /tmp/raid/alilog.txt
perccli /c0/vall show all > /tmp/raid/vall.txt
Check whether any of these counts is non-zero; if so, run the following commands:
MegaCli -PDlist -aAll | grep -i "Media Error Count"
MegaCli -PDlist -aAll | grep -i "Predictive Failure Count"
MegaCli -PDlist -aAll | grep -i "Other Error Count"
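Inspection showed that the disk in slot 4 has bad sectors.
GitLab pull speed was usable, but push still stalled. This server runs RAID 5; after hot-removing the disk in slot 4, push speed returned to normal.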
root@demo18:/data/shell# MegaCli -PDlist -aAll | grep -i "Media Error Count"
Media Error Count: 0
Media Error Count: 0
Media Error Count: 0
Media Error Count: 0
Media Error Count: 8831
Media Error Count: 0
Media Error Count: 0
Media Error Count: 0
root@demo18:~# MegaCli -PDInfo -PhysDrv [32:4] -aAll
Enclosure Device ID: 32
Slot Number: 4
Drive's position: DiskGroup: 0, Span: 0, Arm: 4
Enclosure position: 1
Device Id: 4
WWN: 500003972bd811bd
Sequence Number: 2
Media Error Count: 8945
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
PD Type: SATA
Raw Size: 5.458 TB [0x2baa0f4b0 Sectors]
Non Coerced Size: 5.457 TB [0x2ba90f4b0 Sectors]
Coerced Size: 5.457 TB [0x2ba900000 Sectors]
Sector Size: 512
Logical Sector Size: 512
Physical Sector Size: 4096
Firmware state: Online, Spun Up
Device Firmware Level: FS6D
Shield Counter: 0
Successful diagnostics completion on : N/A
SAS Address(0): 0x4433221105000000
Connected Port Number: 7(path0)
Inquiry Data: 768EK28EFE1CTOSHIBA MG04ACA600E FS6D
FDE Capable: Not Capable
FDE Enable: Disable
Secured: Unsecured
Locked: Unlocked
Needs EKM Attention: No
Foreign State: None
Device Speed: 6.0Gb/s
Link Speed: 6.0Gb/s
Media Type: Hard Disk Device
Drive Temperature :36C (96.80 F)
PI Eligibility: No
Drive is formatted for PI information: No
PI: No PI
Port-0 :
Port status: Active
Port's Linkspeed: 6.0Gb/s
Drive has flagged a S.M.A.R.T alert : No
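Media Error Count: the disk has media errors, most likely bad sectors. The larger the value, the greater the risk. As a rule, request a replacement as soon as it is greater than 0, because once a disk starts failing the count keeps climbing.
Predictive Failure Count: the number of predictive failure warnings raised for the disk. As a rule, request a replacement as soon as it is greater than 0.
Last Predictive Failure Event Seq Number: the event sequence number of the most recent predictive failure warning. If this value is non-zero, Predictive Failure Count is necessarily non-zero as well.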
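The grep commands earlier print the counters without saying which slot they belong to. The sketch below pairs each non-zero counter with its enclosure and slot, applying the "greater than 0" rule described above. The function name flag_bad_drives is hypothetical, and it assumes the field names and ordering seen in the MegaCli output above (Enclosure Device ID and Slot Number printed before the counters).

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `MegaCli -PDList -aAll` output on stdin and print
# one line per non-zero error counter, prefixed with [enclosure:slot].
flag_bad_drives() {
    awk -F': *' '
        { sub(/[ \t\r]+$/, "") }                 # drop trailing whitespace
        /^Enclosure Device ID:/ { enc = $2 }
        /^Slot Number:/         { slot = $2 }
        /^(Media Error|Other Error|Predictive Failure) Count:/ {
            if ($2 + 0 > 0)
                printf "[%s:%s] %s = %d\n", enc, slot, $1, $2
        }'
}
```

Example use: MegaCli -PDList -aAll | flag_bad_drives
For the drive shown above this would report [32:4] Media Error Count = 8945, which can be passed directly to -PhysDrv.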
NVMe
nvme list
nvme smart-log /dev/nvme0n1
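The same "non-zero means trouble" check can be applied to the smart-log output. This is a rough sketch under assumptions: the function name check_nvme_smart is hypothetical, and it assumes the plain-text output contains critical_warning and media_errors lines in "name : value" form, as nvme-cli commonly prints them; verify the field names against your nvme-cli version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `nvme smart-log` text on stdin, print any
# non-zero critical_warning or media_errors value, and return 1 if found.
check_nvme_smart() {
    awk -F':' '
        {
            key = $1; gsub(/[ \t]+/, "", key)    # field name without padding
            val = $2; gsub(/[ \t\r]+/, "", val)  # value without padding
        }
        (key == "critical_warning" || key == "media_errors") &&
        val !~ /^(0|0x0+)$/ {
            print key "=" val
            bad = 1
        }
        END { exit (bad ? 1 : 0) }'
}
```

Example use: nvme smart-log /dev/nvme0n1 | check_nvme_smart || echo "check this drive"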