SSH on Azure Linux VM suddenly failed

When deploying a Linux virtual machine on Microsoft Azure, you may have applied some common setup steps:

  • You disable SELinux
  • You change the default SSH port
  • You tune some TCP settings and deploy a lot of software on your VM

Your Linux VM worked just fine, until one day you could not SSH into it despite many tries…

You try restarting the VM through the Azure Portal. It doesn’t work!

You try redeploying the VM. That doesn’t work either!

You’re still locked out of SSH and have no clue what to do next… so you start thinking about cloning the VM, or creating a new one from scratch and reinstalling all your software on it.

Hey friend, if that’s where you are, please wait… Follow my story, because I ran into the same SSH trouble as you, and by the end you should have SSH working again… like a charm!

The 1st story

Logs are a developer’s best friend; they always tell the truth.

Logs are your best friend, so first things first: I want to look at the Linux VM’s logs to see what happened to the SSHD service. To do that, I go to the Azure Portal and enable the VM’s Serial console:

  • Enable Boot diagnostics
  • Ensure that the storage account firewall is disabled

Then, digging into the Serial console log, I find several services that failed to start:

[  OK  ] Started Azure Linux Agent.
[FAILED] Failed to start Login Service.
See 'systemctl status systemd-logind.service' for details.
[FAILED] Failed to start Cleanup of Temporary Directories.
See 'systemctl status systemd-tmpfiles-clean.service' for details.
[FAILED] Failed to start OpenSSH server daemon.
See 'systemctl status sshd.service' for details.
[ OK ] Started D-Bus System Message Bus.
[ OK ] Stopped Login Service.
Starting Login Service...
[ OK ] Started Permit User Sessions.
[FAILED] Failed to start OMI CIM Server.
See 'systemctl status omid.service' for details.
[ OK ] Started D-Bus System Message Bus.
[ OK ] Started Job spooling tools.
[ OK ] Started Command Scheduler.
Starting Wait for Plymouth Boot Screen to Quit...
Starting Terminate Plymouth Boot Screen...
[FAILED] Failed to start Login Service.
See 'systemctl status systemd-logind.service' for details.
[ OK ] Stopped Login Service.
Starting Login Service...
[ OK ] Started D-Bus System Message Bus.

The logs make it obvious: the OpenSSH server failed to start. That’s why we could not get SSH access to the VM.
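On a long boot log, the failed units are easy to miss among the OK lines. Here is a minimal sketch that pulls the unit names out of the "See 'systemctl status …' for details." lines. It assumes you have saved the Serial console output to a local file; the function name failed_units and the file name serial-console.log are made up for illustration.

```shell
# Hypothetical helper: print the unique systemd unit names mentioned in
# "See 'systemctl status <unit>' for details." lines of a saved console log.
# $1 is the path to the saved log file (an assumption, not an Azure feature).
failed_units() {
  sed -n "s/.*systemctl status \(.*\)' for details.*/\1/p" "$1" | LC_ALL=C sort -u
}

# Example call, assuming the log was saved as serial-console.log:
#   failed_units serial-console.log
```

For the log above, this would list sshd.service alongside omid.service, systemd-logind.service and systemd-tmpfiles-clean.service, which hints that the problem is wider than SSH alone.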

The 2nd story

Always find the problem’s root cause

Now the next two questions are: why did the OpenSSH server fail to start? And what is the root cause of the problem?

To find the answers, I have no choice but to access the VM in Single User Mode (CentOS 7 was installed on the VM). To do that:

  • Click the turn-off (power) icon, then choose hard reset
  • When the GRUB boot menu appears, quickly press the e key
  • Find the kernel line (it starts with linux16) and append rw init=/bin/bash to the end of that line
  • Press Ctrl + X to continue

The VM then boots into Single User Mode.

After booting into Single User Mode, I try to restart the SSHD service, then check /var/log/messages and see a message saying that /etc/passwd doesn’t seem to exist… Sounds strange! After googling for a while, it seems that Microsoft installs some services automatically on my VM, and maybe one of them hit an error. Luckily, a backup of the old file is still there at /etc/passwd- (this is the backup copy that tools such as useradd keep when they modify /etc/passwd). Now, as you can easily guess, I end up running this command

# cp /etc/passwd- /etc/passwd

and click the hard-reset button again.
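A blind cp works here, but a slightly more careful variant is sketched below: it only restores when the target is missing or empty, and refuses to copy from a backup that is itself empty. The function name restore_if_empty is hypothetical; cp -p is used so the restored file keeps the backup's mode and timestamps.

```shell
# Hypothetical helper: restore a file from its backup only when the file is
# missing or empty. $1 is the target file, $2 is the backup to copy from.
restore_if_empty() {
  target=$1
  backup=$2
  if [ -s "$target" ]; then
    echo "$target already has content, leaving it alone"
    return 0
  fi
  if [ ! -s "$backup" ]; then
    echo "$backup is missing or empty, nothing to restore from" >&2
    return 1
  fi
  # -p preserves mode and timestamps of the backup on the restored copy.
  cp -p "$backup" "$target" && echo "restored $target from $backup"
}

# In the rescue shell this would be called as:
#   restore_if_empty /etc/passwd /etc/passwd-
```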

I wait a minute while the VM restarts… and nothing has changed. I still get the message:

[FAILED] Failed to start OpenSSH server daemon.
See 'systemctl status sshd.service' for details.

At that moment, I think about re-creating and re-installing the VM from scratch. But after taking a deep breath, I decide not to give up.

The 3rd story

Never, never and never give up… too early

Booting into Single User Mode again, I discover that the /etc/passwd file is still there, but it is empty. Quite strange.

I decide to restore its content from the backup again and change the file’s permissions to read-only:

# chmod 444 /etc/passwd

and reset the VM once again…
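Since the file had silently come back empty once already, it can be worth checking its state before each reset. The sketch below assumes GNU stat (the one CentOS 7 ships with coreutils); both function names are hypothetical.

```shell
# Hypothetical helper: print "<octal mode> <size in bytes>" for a file.
# Relies on GNU stat's -c option (coreutils), which is an assumption about
# the system you run it on.
mode_and_size() {
  stat -c '%a %s' "$1"
}

# Hypothetical helper: succeed only if the file is non-empty and contains
# at least one line that starts with "root:".
passwd_looks_sane() {
  [ -s "$1" ] && grep -q '^root:' "$1"
}

# In the rescue shell, before resetting the VM:
#   mode_and_size /etc/passwd
#   passwd_looks_sane /etc/passwd && echo "passwd looks fine"
```

After the chmod above, mode_and_size should report a mode of 444 and a size greater than zero.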

Ah ha, finally… It’s working… like a charm!

Backend Leader @ Pingcom, Runner