Environment
- Ubuntu 22.04.2 LTS
- Linode (Nanode 1 GB)
One day I found that RabbitMQ on the VM had gone down.
root@localhost:~# sudo service rabbitmq-server status
● rabbitmq-server.service - RabbitMQ Messaging Server
Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
Active: activating (auto-restart) (Result: exit-code) since Sat 2023-08-12 04:44:46 UTC; 5s ago
Process: 3097879 ExecStart=/usr/lib/rabbitmq/bin/rabbitmq-server (code=exited, status=1/FAILURE)
Main PID: 3097879 (code=exited, status=1/FAILURE)
Status: "Startup in progress"
CPU: 1.502s
Aug 12 04:44:46 localhost systemd[1]: rabbitmq-server.service: Main process exited, code=exited, status=1/FAILURE
Aug 12 04:44:46 localhost systemd[1]: rabbitmq-server.service: Failed with result 'exit-code'.
Aug 12 04:44:46 localhost systemd[1]: Failed to start RabbitMQ Messaging Server.
Aug 12 04:44:46 localhost systemd[1]: rabbitmq-server.service: Consumed 1.502s CPU time.
Restarting it did not help; the service still failed to come up.
sudo service rabbitmq-server restart
Next I ran the command suggested in the failure message, but it gave no useful information either.
root@localhost:~# journalctl -xeu rabbitmq-server.service
Aug 12 04:45:14 localhost systemd[1]: rabbitmq-server.service: Consumed 1.597s CPU time.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit rabbitmq-server.service completed and consumed the indicated resources.
Aug 12 04:45:14 localhost systemd[1]: Starting RabbitMQ Messaging Server...
░░ Subject: A start job for unit rabbitmq-server.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit rabbitmq-server.service has begun execution.
░░
░░ The job identifier is 578874.
Aug 12 04:45:21 localhost systemd[1]: rabbitmq-server.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit rabbitmq-server.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Aug 12 04:45:21 localhost systemd[1]: rabbitmq-server.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit rabbitmq-server.service has entered the 'failed' state with result 'exit-code'.
Aug 12 04:45:21 localhost systemd[1]: Failed to start RabbitMQ Messaging Server.
░░ Subject: A start job for unit rabbitmq-server.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit rabbitmq-server.service has finished with a failure.
░░
░░ The job identifier is 578874 and the job result is failed.
Then I went digging through the logs.
cat /var/log/rabbitmq/rabbitmq-server.error.log
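The block below kept repeating, presumably because auto-restart was launching the process over and over. Honestly I could not make sense of it, so I pasted it into ChatGPT.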
error:{badmatch,{error,{{shutdown,{failed_to_start_child,ra_log_sup,{shutdown,{failed_to_start_child,ra_log_wal_sup,{shutdown,{failed_to_start_child,ra_log_wal,{{badmatch,{error,enospc}},[{ra_log_wal,make_tmp,1,[{file,"src/ra_log_wal.erl"},{line,608}]},{ra_log_wal,prepare_file,2,[{file,"src/ra_log_wal.erl"},{line,593}]},{ra_log_wal,open_wal,3,[{file,"src/ra_log_wal.erl"},{line,586}]},{ra_log_wal,roll_over,2,[{file,"src/ra_log_wal.erl"},{line,564}]},{ra_log_wal,recover_wal,2,[{file,"src/ra_log_wal.erl"},{line,333}]},{ra_log_wal,init,1,[{file,"src/ra_log_wal.erl"},{line,257}]},{gen_batch_server,init_it,6,[{file,"src/gen_batch_server.erl"},{line,151}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}}}}}}},{child,undefined,coordination,{ra_system_sup,start_link,[#{data_dir => "/var/lib/rabbitmq/mnesia/rabbit@localhost/coordination/rabbit@localhost",name => coordination,names => #{closed_mem_tbls => ra_coordination_log_closed_mem_tables,directory => ra_coordination_directory,directory_rev => ra_coordination_directory_reverse,log_ets => ra_coordination_log_ets,log_meta => ra_coordination_log_meta,log_sup => ra_coordination_log_sup,open_mem_tbls => ra_coordination_log_open_mem_tables,segment_writer => ra_coordination_segment_writer,server_sup => ra_coordination_server_sup_sup,wal => ra_coordination_log_wal,wal_sup => ra_coordination_log_wal_sup},segment_max_entries => 4096,wal_compute_checksums => true,wal_data_dir => "/var/lib/rabbitmq/mnesia/rabbit@localhost/coordination/rabbit@localhost",wal_max_batch_size => 4096,wal_max_entries => undefined,wal_max_size_bytes => 64000000,wal_sync_method => datasync,wal_write_strategy => default}]},permanent,false,infinity,supervisor,[ra_system_sup]}}}}
rabbit:run_prelaunch_second_phase/0, line 387
rabbit:start/2, line 847
application_master:start_it_old/4, line 293
Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{{badmatch,{error,{{shutdown,{failed_to_start_child,ra_log_sup,{shutdown,{failed_to_start_child,ra_log_wal_sup,{shutdown,{failed_to_start_child,ra_log_wal,{{badmatch,{error,enospc}},[{ra_log_wal,make_tmp,1,[{file,"src/ra_log_wal.erl"},{line,608}]},{ra_log_wal,prepare_file,2,[{file,"src/ra_log_wal.erl"},{line,593}]},{ra_log_wal,open_wal,3,[{file,"src/ra_log_wal.erl"},{line,586}]},{ra_log_wal,roll_over,2,[{file,"src/ra_log_wal.erl"},{line,564}]},{ra_log_wal,recover_wal,2,[{file,"src/ra_log_wal.erl"},{line,333}]},{ra_log_wal,init,1,[{file,"src/ra_log_wal.erl"},{line,257}]},{gen_batch_server,init_it,6,[{file,"src/gen_batch_server.erl"},{line,151}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}}}}}}},{child,undefined,coordination,{ra_system_sup,start_link,[#{data_dir => "/var/lib/rabbitmq/mnesia/rabbit@localhost/coordination/rabbit@localhost",name => coordination,names => #{closed_mem_tbls => ra_coordination_log_
Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done
BOOT FAILED
=done
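The Erlang crash term above is long, but the actual cause is usually a short POSIX error atom buried inside it. As a rough sketch (the function name `find_posix_errors` and the short list of atoms are my own assumptions, not something RabbitMQ provides), a small grep can pull those atoms out and count them:

```shell
# Hypothetical helper: count the POSIX error atoms that appear in an Erlang crash log.
# The atom list is an illustrative selection, not an exhaustive one.
find_posix_errors() {
  log_file="$1"
  grep -oE '\{error,(enospc|eacces|emfile|enoent|enomem)\}' "$log_file" \
    | sort | uniq -c | sort -rn
}

# Example usage (log path taken from this post):
# find_posix_errors /var/log/rabbitmq/rabbitmq-server.error.log
```

In this case `{error,enospc}` would come out on top, and `enospc` is the "no space left on device" error.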
The answer: the disk was full, so the service could not start.
The error message you've posted indicates that RabbitMQ is failing to start due to an issue related to the log subsystem. Specifically, it seems to be encountering a "no space left on device" (enospc) error while trying to work with its log files. This error is often caused by running out of available disk space on the device where RabbitMQ's log files are located.
Here's the relevant part of the error message:
{{badmatch,{error,enospc}},[{ra_log_wal,make_tmp,1,[{file,"src/ra_log_wal.erl"},{line,608}]},...
The error occurs in the `ra_log_wal` module while trying to create temporary files. This indicates that the disk where RabbitMQ stores its logs might be full or close to full, leading to the "no space left on device" error.
To investigate, I first ran `df -h` to see which filesystem was running out of space.
root@localhost:~# df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 97M 1.4M 96M 2% /run
/dev/sda 25G 24G 144M 99% /
tmpfs 485M 224K 485M 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
overlay 25G 24G 144M 99% /var/lib/docker/overlay2/ec5835fcd2075ec4fdf04531ffb1d0d3f5146eae41fc266a8733a717eba74a76/merged
overlay 25G 24G 144M 99% /var/lib/docker/overlay2/066a05a0f407607ca25a340fc454f50a7339888a2b4a418691956ed01fd9325d/merged
overlay 25G 24G 144M 99% /var/lib/docker/overlay2/26fb503a8ef6c3ac3bc05285296f6a831337df4c9d9390c178ba9771e60d8ed2/merged
tmpfs 97M 4.0K 97M 1% /run/user/0
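With many overlay mounts the `df` output gets noisy. A minimal sketch for picking out only the nearly full filesystems follows; the function name `flag_full_filesystems` and the default threshold of 90 are my own choices, and the parsing assumes mount points without spaces:

```shell
# Hypothetical helper: read `df -P` style output on stdin and print
# "mountpoint use%" for every filesystem at or above the given threshold.
flag_full_filesystems() {
  threshold="${1:-90}"
  awk -v limit="$threshold" 'NR > 1 {
    use = $5
    sub(/%/, "", use)          # strip the percent sign before comparing
    if (use + 0 >= limit) print $6, $5
  }'
}

# Example usage:
# df -P | flag_full_filesystems 90
```

Each Docker overlay mount reports the usage of the underlying disk, so on this VM the root filesystem and the overlay entries would all be flagged with the same 99%.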
Then I used the following command to find the directories taking up the most space:
root@localhost:~# sudo du -h --max-depth=1 / | sort -hr
25G /
20G /var
4.0G /usr
1.3G /snap
252M /boot
6.3M /etc
5.4M /home
1.4M /run
128K /root
76K /tmp
28K /dev
16K /opt
16K /lost+found
4.0K /srv
4.0K /mnt
4.0K /media
0 /sys
0 /proc
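Repeating that `du` by hand one level at a time gets tedious. Here is a small sketch that wraps it; the name `largest_dirs` is made up for illustration, and it assumes GNU `du` and `sort`. The `-x` flag keeps `du` on a single filesystem, which skips `/proc`, `/sys`, and other mounts:

```shell
# Hypothetical helper: show the N largest entries one level below a directory.
largest_dirs() {
  dir="${1:-/}"
  count="${2:-10}"
  # -x: stay on one filesystem; errors from unreadable paths are discarded
  du -x -h --max-depth=1 "$dir" 2>/dev/null | sort -hr | head -n "$count"
}

# Example usage, drilling down from the biggest directory found above:
# sudo sh -c '. ./helpers.sh; largest_dirs /var 5'
# largest_dirs /var/lib 5
```

The first line of the output is the total for the directory itself, and the lines after it are its largest children.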
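I then used `cd` to enter the largest directory and repeated the same approach, level by level. In the end it turned out that a lot of unused Docker images and containers had piled up, so I removed them with the commands below.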
docker container prune
docker image prune
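Before deleting anything, it can help to see how much space Docker thinks is reclaimable. The following is only a sketch, with `docker_disk_report` being a name I made up; it calls nothing but read-only Docker commands:

```shell
# Hypothetical helper: read-only overview of what Docker is holding on disk.
docker_disk_report() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "docker not found" >&2
    return 1
  fi
  docker system df                          # images, containers, volumes, build cache
  docker ps -a --filter status=exited --format '{{.ID}} {{.Image}} {{.Status}}'
  docker images --filter dangling=true      # untagged image layers
}

# Example usage:
# docker_disk_report
#
# Note on the prune commands:
# docker image prune       # removes dangling images only (the default)
# docker image prune -a    # removes every image not used by any container
```

If `docker image prune` frees less than expected, the `-a` variant is probably what is needed, at the cost of having to pull those images again later.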
Once the space was freed up, RabbitMQ came back to life.
root@localhost:~# df -h
Filesystem Size Used Avail Use% Mounted on
tmpfs 97M 1.3M 96M 2% /run
/dev/sda 25G 11G 13G 48% /
tmpfs 485M 28K 485M 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
overlay 25G 11G 13G 48% /var/lib/docker/overlay2/ec5835fcd2075ec4fdf04531ffb1d0d3f5146eae41fc266a8733a717eba74a76/merged
tmpfs 97M 4.0K 97M 1% /run/user/0
overlay 25G 11G 13G 48% /var/lib/docker/overlay2/246c2e3303187b230b4d3e881d290734e53d183256e90cc8f693a8ab614f1037/merged
overlay 25G 11G 13G 48% /var/lib/docker/overlay2/a84bba981cffa456fcf2ab7c9e960223a6690cc2a15b3d101a584dd469363bc0/merged
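Because the unit had been stuck in an auto-restart loop, a single status check right after the cleanup might catch it between attempts. As a hedged sketch (the function `wait_for_active` and its defaults are my own, not part of systemd or RabbitMQ), polling `systemctl is-active` a few times gives a more reliable answer:

```shell
# Hypothetical helper: poll a systemd unit until it reports "active".
# Arguments: unit name, number of attempts (default 10), delay in seconds (default 3).
wait_for_active() {
  unit="$1"
  tries="${2:-10}"
  delay="${3:-3}"
  i=0
  while [ "$i" -lt "$tries" ]; do
    if systemctl is-active --quiet "$unit"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  return 1
}

# Example usage:
# wait_for_active rabbitmq-server && echo "rabbitmq-server is up"
```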