Updated: Aug 23, 2021
VMware HA uses datastore heartbeats to differentiate between host isolation and host failure events. Let's take a closer look at how this works.
First, let's understand how the datastore heartbeat works. When we configure HA on a cluster, one of the settings we configure is "select datastore for heartbeats". Depending on our configuration, the heartbeat datastores are selected manually or automatically. If selected manually, the datastore(s) must be accessible to all hosts in the cluster.
Once HA is successfully configured and enabled, each host in the HA-enabled cluster creates a "lock" file in the heartbeat datastore. Each file is owned by the host that created it, and only that host has access rights to the file. You can view these files in your heartbeat datastore, but DO NOT DELETE OR MODIFY THEM.
Browse to your datastores --> select the heartbeat datastore attached to the HA hosts --> click the Files tab. You will notice a folder named ".vSphere-HA". Under this folder, you will see a subfolder "FDM-GUID". Open that folder to view the lock files. Each lock file name is based on the host's 'MoRef' (Managed Object Reference) ID, for example "host-6008-hb".
Let's understand the process by which VMware HA determines and differentiates between host isolation and failure events:
After the HA election, one host in the HA cluster becomes the "Master" and all other hosts become "Slaves". The master host sends heartbeats to all slave hosts over the "Management" network, allowing the slave hosts to determine that the master host is alive and able to communicate. At the same time, all slave hosts send heartbeats to the master host, so the master knows that every slave connected to it is alive.
When the master host stops receiving heartbeats from a slave host, the master host tries to 'ping' the slave host over the management network. If the 'ping' fails, the master host tries to access the slave's lock file in the heartbeat datastore. This is when the master decides if the host is isolated or failed.
If the master is unable to access the slave's lock file, the slave host is still alive, since only the slave host that owns the lock file can access it. This means the slave host is simply isolated on the management network. The master host then takes the necessary actions based on the host isolation settings in our HA configuration.
If the master is able to access the slave's lock file, the slave host is no longer online to hold ownership of the file. This is when the master determines that the particular slave host has failed. The master host then takes the necessary actions based on the host failure settings in our HA configuration.
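The decision flow described above can be summarized in a few lines of Python. This is only a minimal sketch of the logic as explained in this post, not how the HA agent is actually implemented; the function name, the enum, and the three boolean inputs are hypothetical and exist only for illustration.

```python
from enum import Enum


class HostState(Enum):
    REACHABLE = "reachable"
    ISOLATED = "isolated"
    FAILED = "failed"


def classify_slave(heartbeat_received: bool, ping_ok: bool,
                   lock_file_accessible: bool) -> HostState:
    """Hypothetical helper that mirrors the master's checks in order."""
    # Heartbeat or ping over the management network still works,
    # so the master never needs to look at the datastore.
    if heartbeat_received or ping_ok:
        return HostState.REACHABLE
    # Both network checks failed: fall back to the heartbeat datastore.
    # The master can only get at the lock file if its owner is gone.
    if lock_file_accessible:
        return HostState.FAILED
    # The lock is still held, so the slave is running but cut off.
    return HostState.ISOLATED


print(classify_slave(False, False, lock_file_accessible=False).value)
print(classify_slave(False, False, lock_file_accessible=True).value)
```

The first call models a slave that still holds its lock (isolated), and the second models a slave whose lock file has become accessible to the master (failed).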
Hope I was able to make the understanding of HA datastore heartbeats slightly easier for you!
Reference: Configure Heartbeat Datastores (vmware.com)