RabbitMQ would not start on a VM, and the cause turned out to be a full disk

Environment

Ubuntu 22.04.2 LTS
Linode (Nanode 1 GB)

One day I noticed that RabbitMQ on the VM had gone down.

root@localhost:~# sudo service rabbitmq-server status
● rabbitmq-server.service - RabbitMQ Messaging Server
     Loaded: loaded (/lib/systemd/system/rabbitmq-server.service; enabled; vendor preset: enabled)
     Active: activating (auto-restart) (Result: exit-code) since Sat 2023-08-12 04:44:46 UTC; 5s ago
    Process: 3097879 ExecStart=/usr/lib/rabbitmq/bin/rabbitmq-server (code=exited, status=1/FAILURE)
   Main PID: 3097879 (code=exited, status=1/FAILURE)
     Status: "Startup in progress"
        CPU: 1.502s

Aug 12 04:44:46 localhost systemd[1]: rabbitmq-server.service: Main process exited, code=exited, status=1/FAILURE
Aug 12 04:44:46 localhost systemd[1]: rabbitmq-server.service: Failed with result 'exit-code'.
Aug 12 04:44:46 localhost systemd[1]: Failed to start RabbitMQ Messaging Server.
Aug 12 04:44:46 localhost systemd[1]: rabbitmq-server.service: Consumed 1.502s CPU time.

Restarting it did not help; it still would not come up.

sudo service rabbitmq-server restart

Next I ran the command suggested in the hint, but it gave no useful information either.

root@localhost:~# journalctl -xeu rabbitmq-server.service
Aug 12 04:45:14 localhost systemd[1]: rabbitmq-server.service: Consumed 1.597s CPU time.
░░ Subject: Resources consumed by unit runtime
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit rabbitmq-server.service completed and consumed the indicated resources.
Aug 12 04:45:14 localhost systemd[1]: Starting RabbitMQ Messaging Server...
░░ Subject: A start job for unit rabbitmq-server.service has begun execution
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit rabbitmq-server.service has begun execution.
░░
░░ The job identifier is 578874.
Aug 12 04:45:21 localhost systemd[1]: rabbitmq-server.service: Main process exited, code=exited, status=1/FAILURE
░░ Subject: Unit process exited
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ An ExecStart= process belonging to unit rabbitmq-server.service has exited.
░░
░░ The process' exit code is 'exited' and its exit status is 1.
Aug 12 04:45:21 localhost systemd[1]: rabbitmq-server.service: Failed with result 'exit-code'.
░░ Subject: Unit failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ The unit rabbitmq-server.service has entered the 'failed' state with result 'exit-code'.
Aug 12 04:45:21 localhost systemd[1]: Failed to start RabbitMQ Messaging Server.
░░ Subject: A start job for unit rabbitmq-server.service has failed
░░ Defined-By: systemd
░░ Support: http://www.ubuntu.com/support
░░
░░ A start job for unit rabbitmq-server.service has finished with a failure.
░░
░░ The job identifier is 578874 and the job result is failed.
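
The ░░ lines come from the -x flag, which appends systemd's catalog explanations to each entry; they describe the job, not the reason RabbitMQ exited. Below is a minimal sketch for getting a shorter view, assuming the same unit name as above. The function names strip_catalog and unit_tail are hypothetical helpers, not part of systemd.

```shell
#!/usr/bin/env bash
# strip_catalog (hypothetical name): read journalctl output on stdin and
# drop the "░░" explanation lines that the -x flag adds.
strip_catalog() {
    # grep exits 1 when every line was filtered out; that is not an error here.
    grep -v '^░░' || true
}

# unit_tail UNIT [N] (hypothetical name): print the last N entries
# (default 50) for a unit, without -x and without a pager.
unit_tail() {
    journalctl -u "$1" -n "${2:-50}" --no-pager
}
```

For example, unit_tail rabbitmq-server.service prints the recent entries, and journalctl -xeu rabbitmq-server.service --no-pager | strip_catalog cleans up the output shown above. Either way systemd only knows the exit status, so the real reason still has to come from RabbitMQ's own log.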

Then I went digging through the log.

cat /var/log/rabbitmq/rabbitmq-server.error.log

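The block below repeats over and over, presumably because auto-restart keeps relaunching the process. Honestly I could not make sense of it, so I pasted it into ChatGPT and asked.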

error:{badmatch,{error,{{shutdown,{failed_to_start_child,ra_log_sup,{shutdown,{failed_to_start_child,ra_log_wal_sup,{shutdown,{failed_to_start_child,ra_log_wal,{{badmatch,{error,enospc}},[{ra_log_wal,make_tmp,1,[{file,"src/ra_log_wal.erl"},{line,608}]},{ra_log_wal,prepare_file,2,[{file,"src/ra_log_wal.erl"},{line,593}]},{ra_log_wal,open_wal,3,[{file,"src/ra_log_wal.erl"},{line,586}]},{ra_log_wal,roll_over,2,[{file,"src/ra_log_wal.erl"},{line,564}]},{ra_log_wal,recover_wal,2,[{file,"src/ra_log_wal.erl"},{line,333}]},{ra_log_wal,init,1,[{file,"src/ra_log_wal.erl"},{line,257}]},{gen_batch_server,init_it,6,[{file,"src/gen_batch_server.erl"},{line,151}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}}}}}}},{child,undefined,coordination,{ra_system_sup,start_link,[#{data_dir => "/var/lib/rabbitmq/mnesia/rabbit@localhost/coordination/rabbit@localhost",name => coordination,names => #{closed_mem_tbls => ra_coordination_log_closed_mem_tables,directory => ra_coordination_directory,directory_rev => ra_coordination_directory_reverse,log_ets => ra_coordination_log_ets,log_meta => ra_coordination_log_meta,log_sup => ra_coordination_log_sup,open_mem_tbls => ra_coordination_log_open_mem_tables,segment_writer => ra_coordination_segment_writer,server_sup => ra_coordination_server_sup_sup,wal => ra_coordination_log_wal,wal_sup => ra_coordination_log_wal_sup},segment_max_entries => 4096,wal_compute_checksums => true,wal_data_dir => "/var/lib/rabbitmq/mnesia/rabbit@localhost/coordination/rabbit@localhost",wal_max_batch_size => 4096,wal_max_entries => undefined,wal_max_size_bytes => 64000000,wal_sync_method => datasync,wal_write_strategy => default}]},permanent,false,infinity,supervisor,[ra_system_sup]}}}}

    rabbit:run_prelaunch_second_phase/0, line 387
    rabbit:start/2, line 847
    application_master:start_it_old/4, line 293

Kernel pid terminated (application_controller) ({application_start_failure,rabbit,{{badmatch,{error,{{shutdown,{failed_to_start_child,ra_log_sup,{shutdown,{failed_to_start_child,ra_log_wal_sup,{shutdown,{failed_to_start_child,ra_log_wal,{{badmatch,{error,enospc}},[{ra_log_wal,make_tmp,1,[{file,"src/ra_log_wal.erl"},{line,608}]},{ra_log_wal,prepare_file,2,[{file,"src/ra_log_wal.erl"},{line,593}]},{ra_log_wal,open_wal,3,[{file,"src/ra_log_wal.erl"},{line,586}]},{ra_log_wal,roll_over,2,[{file,"src/ra_log_wal.erl"},{line,564}]},{ra_log_wal,recover_wal,2,[{file,"src/ra_log_wal.erl"},{line,333}]},{ra_log_wal,init,1,[{file,"src/ra_log_wal.erl"},{line,257}]},{gen_batch_server,init_it,6,[{file,"src/gen_batch_server.erl"},{line,151}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}}}}}}},{child,undefined,coordination,{ra_system_sup,start_link,[#{data_dir => "/var/lib/rabbitmq/mnesia/rabbit@localhost/coordination/rabbit@localhost",name => coordination,names => #{closed_mem_tbls => ra_coordination_log_

Crash dump is being written to: /var/log/rabbitmq/erl_crash.dump...done

BOOT FAILED
=done

The answer was that the disk was full, so the service could not start.

The error message you've posted indicates that RabbitMQ is failing to start due to an issue related to the log subsystem. Specifically, it seems to be encountering a "no space left on device" (enospc) error while trying to work with its log files. This error is often caused by running out of available disk space on the device where RabbitMQ's log files are located.

Here's the relevant part of the error message:

{{badmatch,{error,enospc}},[{ra_log_wal,make_tmp,1,[{file,"src/ra_log_wal.erl"},{line,608}]},...

The error occurs in the `ra_log_wal` module while trying to create temporary files. This indicates that the disk where RabbitMQ stores its logs might be full or close to full, leading to the "no space left on device" error.
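
The part of the trace that matters is the short error atom. Erlang reports file errors as POSIX-style atoms, so a quick grep can surface them without reading the whole term. This is a rough sketch, assuming the log path used above; count_posix_errors is a hypothetical helper name and the atom list is a small illustrative subset.

```shell
#!/usr/bin/env bash
# count_posix_errors FILE (hypothetical name): count how often a few
# common Erlang POSIX error atoms appear in FILE, most frequent first.
#   enospc = no space left on device
#   eacces = permission denied
#   enoent = no such file or directory
#   emfile = too many open files
#   erofs  = read-only file system
count_posix_errors() {
    # -o prints only the match, -w requires whole-word matches
    grep -owE 'enospc|eacces|enoent|emfile|erofs' "$1" \
        | sort | uniq -c | sort -rn
}
```

Running count_posix_errors /var/log/rabbitmq/rabbitmq-server.error.log prints each atom it finds together with a count. Seeing enospc there points at disk space before any other theory.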

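I started with df -h to see how full each filesystem was. The file RabbitMQ failed to create is its write-ahead log under /var/lib/rabbitmq/mnesia (the wal_data_dir in the trace), which sits on the root filesystem, so that is the row to watch.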

root@localhost:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            97M  1.4M   96M   2% /run
/dev/sda         25G   24G  144M  99% /
tmpfs           485M  224K  485M   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
overlay          25G   24G  144M  99% /var/lib/docker/overlay2/ec5835fcd2075ec4fdf04531ffb1d0d3f5146eae41fc266a8733a717eba74a76/merged
overlay          25G   24G  144M  99% /var/lib/docker/overlay2/066a05a0f407607ca25a340fc454f50a7339888a2b4a418691956ed01fd9325d/merged
overlay          25G   24G  144M  99% /var/lib/docker/overlay2/26fb503a8ef6c3ac3bc05285296f6a831337df4c9d9390c178ba9771e60d8ed2/merged
tmpfs            97M  4.0K   97M   1% /run/user/0
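
To look only at the filesystem that matters, df accepts a path and reports the filesystem holding it. The sketch below builds a small threshold check on top of that. It assumes RabbitMQ's data lives under /var/lib/rabbitmq as in the trace; the function names and the default threshold of 90 are illustrative choices, not RabbitMQ settings.

```shell
#!/usr/bin/env bash
# parse_use_pct: read `df -P` output on stdin and print the use
# percentage of the first filesystem row, digits only.
parse_use_pct() {
    awk 'NR == 2 { gsub("%", "", $5); print $5 }'
}

# fs_use_pct PATH: use percentage of the filesystem that holds PATH.
fs_use_pct() {
    df -P "$1" | parse_use_pct
}

# warn_if_full PATH [THRESHOLD]: return 1 and print a warning when the
# filesystem holding PATH is at or above THRESHOLD percent (default 90).
# Returns 2 when the percentage could not be read.
warn_if_full() {
    local pct
    pct=$(fs_use_pct "$1")
    case $pct in
        '' | *[!0-9]*) return 2 ;;
    esac
    if [ "$pct" -ge "${2:-90}" ]; then
        echo "WARNING: filesystem holding $1 is ${pct}% full" >&2
        return 1
    fi
    return 0
}
```

A call such as warn_if_full /var/lib/rabbitmq 90 could be run from cron so that a filling disk is noticed before the broker refuses to boot.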

Next I used du to find the directories taking up the most space.

root@localhost:~# sudo du -h --max-depth=1 / | sort -hr
25G	/
20G	/var
4.0G	/usr
1.3G	/snap
252M	/boot
6.3M	/etc
5.4M	/home
1.4M	/run
128K	/root
76K	/tmp
28K	/dev
16K	/opt
16K	/lost+found
4.0K	/srv
4.0K	/mnt
4.0K	/media
0	/sys
0	/proc
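
Repeating du one level at a time gets tedious, so the search can be scripted. The following is a sketch rather than the exact procedure used here: biggest_child and drill_down are hypothetical helper names, and the -x flag is an addition that keeps du on a single filesystem so /proc and the Docker overlay mounts are not counted.

```shell
#!/usr/bin/env bash
# biggest_child DIR: print the largest immediate subdirectory of DIR,
# or nothing when DIR has no subdirectories.
biggest_child() {
    du -x -k --max-depth=1 "$1" 2>/dev/null \
        | sort -rn \
        | awk -F'\t' -v top="$1" '$2 != top { print $2; exit }'
}

# drill_down DIR [DEPTH]: follow the largest subdirectory DEPTH levels
# down (default 4), printing the size and path at each step.
drill_down() {
    local dir=$1 depth=${2:-4} next
    while [ "$depth" -gt 0 ]; do
        next=$(biggest_child "$dir")
        [ -n "$next" ] || break      # reached a directory with no children
        du -xsh "$next" 2>/dev/null
        dir=$next
        depth=$((depth - 1))
    done
}
```

Called as sudo bash -c with the functions loaded, or from a root shell, drill_down / 4 prints one line per level and stops early when it reaches a directory without subdirectories.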

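I then cd'd into the largest directory and kept searching the same way, level by level. It turned out that a lot of unused Docker images and containers had piled up, so I ran the following commands to remove them.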

docker container prune
docker image prune
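
Before deleting anything it can help to see how much Docker itself thinks is reclaimable. A small sketch, assuming the docker CLI is on the PATH; docker_space_report is a hypothetical wrapper name.

```shell
#!/usr/bin/env bash
# docker_space_report (hypothetical name): summarize the disk space used
# by images, containers, local volumes and build cache, including how
# much of it is reclaimable. Returns 127 when docker is not installed.
docker_space_report() {
    if ! command -v docker >/dev/null 2>&1; then
        echo "docker not found on PATH" >&2
        return 127
    fi
    docker system df
}
```

One detail worth knowing: docker image prune on its own removes only dangling images. Adding -a extends it to every image not used by any container, which frees more space but means those images have to be pulled or built again later.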

Once the space was freed, RabbitMQ came back to life.

root@localhost:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
tmpfs            97M  1.3M   96M   2% /run
/dev/sda         25G   11G   13G  48% /
tmpfs           485M   28K  485M   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
overlay          25G   11G   13G  48% /var/lib/docker/overlay2/ec5835fcd2075ec4fdf04531ffb1d0d3f5146eae41fc266a8733a717eba74a76/merged
tmpfs            97M  4.0K   97M   1% /run/user/0
overlay          25G   11G   13G  48% /var/lib/docker/overlay2/246c2e3303187b230b4d3e881d290734e53d183256e90cc8f693a8ab614f1037/merged
overlay          25G   11G   13G  48% /var/lib/docker/overlay2/a84bba981cffa456fcf2ab7c9e960223a6690cc2a15b3d101a584dd469363bc0/merged

Updated: 2023/08/12