VM Stun – The cause I keep seeing, the why and how to fix it!
Posted on September 25, 2020 by moodraman
VM Stun typically happens when a VM is being snapshotted. If you’re using backup software that leverages VM snapshots (as most do for VMware today), then you run the risk of VM Stun.
*Please note: I’m trying my best to explain this in layman’s terms.
In order to take a snapshot of a VM and have it continue to run, VMware creates a checkpoint and saves all new writes to that VM in a delta file as they come in. When the snapshot is later consolidated, VMware must roll these changes back into the original disk. At some point in the consolidation process, VMware must flip the writes from the delta file back to the original disk, which pauses the VM very briefly. The larger the VM, and the longer the VM snapshot stays open, the more impactful the consolidation will be.
In highly transactional environments, this additional load and even the slightest pause of a VM can cause a VM Stun, where the VM is briefly unresponsive. This can cause a myriad of problems.
NOTE: < sales rant > Anytime you take a VM snapshot in vSphere, you run the risk of VM Stun. I work for Veeam, and we tell people we can help mitigate this risk by leveraging Backup from Storage Snapshots (this process is explained later), but I have seen marketing for OTHER “Modern” backup solutions claim they can “Completely eliminate VM Stun with Backup from Storage Snapshot”. Unless their process runs agents or is snapshotless in VMware (certain architectures of Veeam can do this), this is a flat-out lie. Remember, if ANY snapshot is taken in vSphere, you still run the risk of VM Stun. < /sales rant >
Back to that Customer First theme 😉
Modern infrastructures and the hypervisors supporting them have improved dramatically since the early days of virtualization. Shortly after Storage vMotion was added to VMware, the consolidation process was changed to leverage some of these new efficiencies and mitigate this issue. On the infrastructure side, storage performance and network or fibre performance have drastically improved, and compute has become more powerful.
In a nutshell, if you are running modern infrastructure with the latest version of a hypervisor, you should not be experiencing VM Stun. If you are, then either you’re working with a VM so sensitive that it can never withstand a VM snapshot (in which case another backup methodology for this data is the best bet), or something is wrong with your infrastructure or your backup product.
How to decrease the likelihood of VM Stuns
The best way to decrease the likelihood that you’ll experience a stun is to keep the amount of time the VMware snapshot is actually open as short as possible. Modern data protection products have started integrating with storage vendors so that they can leverage native storage snapshot capabilities to decrease the time a VM snapshot is needed.
Enter… Backup From Storage Snapshot with Veeam.
How does this work?
The short version is above. Veeam still calls a snap in VMware, but this snap no longer needs to stay open for the entire backup job. Once the needed information is snagged from VMware (CBT info and datastore/storage volume info), Veeam calls a storage snap and closes the VM snapshot.
I’ve seen this happen in as little as 6 seconds.
So now, instead of your VMware infrastructure having an open snap from which it must commit a large amount of data back into the VMDK file, the snap is only open for a few seconds and very little delta data is created during that time, thus drastically decreasing the likelihood of Snap Stun. But just remember, when you see someone’s marketing saying “Do away with Snap Stun completely with our product…” they’re lying. The ability to do away with a lot of the impact in VMware is available, but whenever there is a snap there is a chance of a stun.
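To put rough numbers on that, here is a back-of-the-napkin sketch in Python. It assumes the delta file grows at most as fast as the VM writes, so it is an upper bound (blocks that get rewritten do not grow the delta again). The 50 MB/s write rate and the 2 hour backup window are made-up values for illustration; only the 6 seconds comes from what I described above.

```python
def delta_size_gb(write_rate_mb_s: float, snapshot_open_seconds: float) -> float:
    """Upper bound on delta-file growth while a VM snapshot is open.

    Every new write lands in the delta file, so the delta can grow
    by at most (write rate) x (time the snapshot stays open).
    """
    return write_rate_mb_s * snapshot_open_seconds / 1024


WRITE_RATE_MB_S = 50.0            # hypothetical sustained write rate of the VM
FULL_JOB_SECONDS = 2 * 60 * 60    # hypothetical: snapshot open for a 2 hour backup
STORAGE_SNAP_SECONDS = 6          # snapshot open only until the storage snap is taken

full_job_delta = delta_size_gb(WRITE_RATE_MB_S, FULL_JOB_SECONDS)
storage_snap_delta = delta_size_gb(WRITE_RATE_MB_S, STORAGE_SNAP_SECONDS)

print(f"Snapshot open for the whole job: up to {full_job_delta:.1f} GB to consolidate")
print(f"Snapshot open for 6 seconds:     up to {storage_snap_delta:.2f} GB to consolidate")
```

Under those assumptions the delta to consolidate drops from hundreds of GB to a fraction of a GB. It never drops to zero, which is the whole point: less to consolidate means less risk, not no risk.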
Now on to the most common time I personally see snap stun, and how to fix it.
Most common cause: NFSv3 & HotAdd Mode
The most common mistake I have seen in Veeam environments that are experiencing stun is running the NFSv3 storage protocol together with HotAdd backup.
The issue occurs when the proxy doing HotAdd (also called Virtual Appliance mode) is NOT on the same host as the machine it’s backing up. VMware states this is an issue in the NFSv3 locking mechanism.
What do I do?
If you’re experiencing VM stun and you do have NFSv3, the best thing to do is configure what is called “Direct NFS” mode. This allows for the Proxy to connect directly to the storage volumes via NFS, and thus does not leverage HotAdd.
Direct NFS mode allows the proxy to pull the data blocks DIRECTLY from the storage, whereas HotAdd mode attaches the VM’s disks to the proxy for backup. The act of attaching those disks across hosts during HotAdd is what causes the stun. With Direct NFS, the proxy connects DIRECTLY to the volumes instead of having VM disks attached to it, so there is no stun from the usage of NFSv3.
If Direct NFS mode is not your thing, maybe because of some environmental restrictions, we have a patch that will force Network Mode on any VM where a proxy is NOT on the host where the VM lives. Alternatively, you could install proxies on each host and set affinity rules.
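The decision described above for VMs on NFSv3 datastores can be sketched in a few lines of Python. This is only my illustration of the logic, not Veeam code or a Veeam API: the `Proxy` class, the `pick_transport_mode` function, the host names and the mode strings are all hypothetical names I made up for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proxy:
    """Hypothetical description of a backup proxy (not a Veeam object)."""
    name: str
    host: str                 # host the proxy VM runs on
    direct_nfs: bool = False  # True if the proxy can reach the NFS volumes directly


def pick_transport_mode(vm_host: str, proxy: Proxy) -> str:
    """Pick a transport mode for a VM living on an NFSv3 datastore."""
    if proxy.direct_nfs:
        # Proxy reads blocks straight from the storage, no disk attach at all.
        return "direct_nfs"
    if proxy.host == vm_host:
        # Same host: HotAdd does not attach disks across hosts.
        return "hotadd"
    # Different host: fall back to Network Mode to avoid the cross-host attach.
    return "network"


print(pick_transport_mode("esx01", Proxy("proxy-a", "esx01")))
print(pick_transport_mode("esx01", Proxy("proxy-b", "esx02")))
print(pick_transport_mode("esx01", Proxy("proxy-c", "esx02", direct_nfs=True)))
```

Putting a proxy on each host with affinity rules is simply a way of making sure the second branch is the one that always matches.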
ANYTIME a VM is snapped in VMware there is a possibility of VM Stun.
Be careful with vendor marketing, and actually understand how vendors accomplish the features and functions they’re claiming. I’ve even seen vendors who cannot do Backup from Storage Snapshots claim that their products do backups so fast that stun is eliminated. (They’re still not talking about 6 seconds fast, and they are still leveraging a VM snap.) A quick Google search will tell you who I’m talking about.
Leverage your infrastructure to the fullest. If you have the ability to set up direct storage access with your Veeam environment, DO IT!