SSH on Azure Linux VM suddenly failed
When deploying a Linux virtual machine on Microsoft Azure, you may have applied some best practices:
- You disable SELinux
- You change the default SSH port
- You also tune some TCP settings and install a lot of software on your VM
Your Linux VM worked just fine, until one day you could not SSH into it despite many tries…
You try restarting the VM through the Azure Portal. It doesn’t work!
You try redeploying the VM. That doesn’t work either!
You’re still locked out of SSH and have no clue what to do next… so you start thinking about cloning the VM or creating a new one from scratch and reinstalling all your software on it.
Hey friend, if that’s what you’re thinking, please wait… Follow my story first, because I had the same SSH trouble as you, and you may get SSH working again… like a charm!
The 1st story
Logs are a developer’s best friend; they always tell the truth.
Logs are your best friend, so first things first: I want to see some logs from the Linux VM to find out what happened to the SSHD service. To do that, I go to the Azure Portal and enable the VM’s Serial console:
- Enable Boot diagnostics
- Ensure that the storage account firewall is disabled
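If you prefer the command line to the Portal, the first step can also be sketched with the Azure CLI. This is only a sketch under assumptions: it expects the `az` CLI to be installed and logged in, the names `my-rg` and `my-vm` are placeholders I made up, and the two helper functions are hypothetical wrappers, not anything Azure ships.

```shell
# Hypothetical names; replace them with your own resource group and VM.
RG="my-rg"
VM="my-vm"

# Turn on boot diagnostics for the VM (the Serial console relies on it).
enable_boot_diag() {
    az vm boot-diagnostics enable --resource-group "$1" --name "$2"
}

# Print the boot (serial) log without opening the Portal.
show_boot_log() {
    az vm boot-diagnostics get-boot-log --resource-group "$1" --name "$2"
}

# Usage (commented out so nothing runs by accident):
# enable_boot_diag "$RG" "$VM" && show_boot_log "$RG" "$VM"
```

The storage account firewall setting still has to be checked separately; this sketch does not touch it.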
Then, digging into the Serial console log, I find several services that failed to start:
[ OK ] Started Azure Linux Agent.
[FAILED] Failed to start Login Service.
See 'systemctl status systemd-logind.service' for details.
[FAILED] Failed to start Cleanup of Temporary Directories.
See 'systemctl status systemd-tmpfiles-clean.service' for details.
[FAILED] Failed to start OpenSSH server daemon.
See 'systemctl status sshd.service' for details.
[ OK ] Started D-Bus System Message Bus.
[ OK ] Stopped Login Service.
Starting Login Service...
[ OK ] Started Permit User Sessions.
[FAILED] Failed to start OMI CIM Server.
See 'systemctl status omid.service' for details.
[ OK ] Started D-Bus System Message Bus.
[ OK ] Started Job spooling tools.
[ OK ] Started Command Scheduler.
Starting Wait for Plymouth Boot Screen to Quit...
Starting Terminate Plymouth Boot Screen...
[FAILED] Failed to start Login Service.
See 'systemctl status systemd-logind.service' for details.
[ OK ] Stopped Login Service.
Starting Login Service...
[ OK ] Started D-Bus System Message Bus.
With logs this obvious, I know that the OpenSSH server failed to start. That’s why we could not get SSH access to the VM.
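On a long serial log it is easy to miss a failed unit. The following is a minimal sketch for pulling the unit names out of a saved copy of the log. The function name `failed_units` is my own invention, and it assumes the log has the same shape as the excerpt above: a `[FAILED]` line directly followed by a `See 'systemctl status ...'` line.

```shell
# Print each unit named on a "See 'systemctl status <unit>' for details." line
# that directly follows a [FAILED] line. Repeated units are printed once.
# Reads the files given as arguments, or stdin when there are none.
failed_units() {
    awk -F"'" '
        /\[FAILED\]/ { after_failed = 1; next }
        after_failed && /systemctl status/ {
            # With a single quote as separator, $2 is "systemctl status <unit>".
            n = split($2, words, " ")
            unit = words[n]
            if (!(unit in seen)) { seen[unit] = 1; print unit }
        }
        { after_failed = 0 }
    ' "$@"
}

# Usage, assuming the serial log was saved to a file called serial.log:
# failed_units serial.log
```

Run against the excerpt above, this should list systemd-logind.service, systemd-tmpfiles-clean.service, sshd.service and omid.service, which makes it clear that SSHD is not the only casualty.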
The 2nd story
Always find the problem’s root cause
Now the next two questions are: why did the OpenSSH server fail to start? What is the root cause of the problem?
To find the answers, I have no choice but to access the VM in Single User Mode (CentOS 7 was installed on the VM). To do that:
- Click on the turn-off icon, then choose hard-reset
- At the GRUB boot screen, quickly press the e key
- Find the kernel line (starts with linux16), append rw init=/bin/bash to the end of this line
- Press ctrl + x to continue
After booting into single user mode, I try to restart the SSHD service, then check /var/log/messages and see a message saying that /etc/passwd doesn’t seem to exist… Sounds strange! After googling for a while, it seems that Microsoft installs some services automatically on my VM, and maybe one of them hit an error. Luckily, a backup of the old file exists at /etc/passwd-. Now, as you may easily guess, I end up with the command
# cp /etc/passwd- /etc/passwd
and click the hard-reset button again.
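A more cautious version of that copy could look like the sketch below. It is an illustration, not what I actually ran: the function name `restore_passwd` is made up, the root-entry check is just a sanity test I find reasonable, and the `sync` at the end is a precaution, since a shell started with init=/bin/bash has no init system to flush writes before a hard reset.

```shell
# Restore a passwd-style file from its backup, but only if the backup looks sane.
# Both paths default to the real files; pass other paths to rehearse safely.
restore_passwd() {
    target="${1:-/etc/passwd}"
    backup="${2:-/etc/passwd-}"

    # Refuse to copy a missing or empty backup over the target.
    if [ ! -s "$backup" ]; then
        echo "backup $backup is missing or empty" >&2
        return 1
    fi
    # A usable passwd file should at least contain a root entry.
    if ! grep -q '^root:' "$backup"; then
        echo "backup $backup has no root entry" >&2
        return 1
    fi

    cp "$backup" "$target" || return 1
    # Flush pending writes to disk before any hard reset.
    sync
}

# Usage: restore_passwd            (real files)
#        restore_passwd /tmp/p /tmp/p-   (dry run on copies)
```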
I wait a minute while the VM restarts… and nothing changes. I still get the message:
[FAILED] Failed to start OpenSSH server daemon.
See 'systemctl status sshd.service' for details.
At that moment, I think about re-creating and re-installing the VM from scratch. But after taking a deep breath, I decide not to give up.
The 3rd story
Never, never and never give up… too early
Booting into Single User Mode again, I discover that the /etc/passwd file is still there, but it has no content. Quite strange.
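"Exists but empty" is easy to confuse with "missing" when you only glance at a directory listing. A tiny helper like the one below tells the three cases apart; `file_state` is a hypothetical name of mine, not a standard tool.

```shell
# Report whether a file is missing, empty, or has content.
file_state() {
    if [ ! -e "$1" ]; then
        echo "missing"
    elif [ ! -s "$1" ]; then
        echo "empty"
    else
        echo "has content"
    fi
}

# Usage: file_state /etc/passwd; file_state /etc/passwd-
```

Checking both the file and its backup this way, right before resetting, shows whether a previous copy really landed on disk.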
I decide to change the file’s permissions:
# chmod 444 /etc/passwd
and reset the VM once again…