# TL;DR
- Disable automatic system updates
- Disable automatic package updates
- Disable automatic kernel updates
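For example, on a Debian/Ubuntu image these are the usual knobs. This is a minimal sketch assuming the stock `unattended-upgrades` setup; adjust for your distribution:

```bash
# Stop the unattended-upgrades service from applying packages automatically
sudo systemctl disable --now unattended-upgrades

# Tell apt's periodic jobs not to refresh or upgrade on their own
sudo tee /etc/apt/apt.conf.d/20auto-upgrades <<'EOF'
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
EOF

# Pin the kernel so it is only upgraded deliberately
sudo apt-mark hold linux-image-generic
```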
# Background
Our company had a massive service degradation this year. We spent a few days recovering our service and asked the cloud service provider for help.
It was unclear why so many instances went down at the same time, since we had no deployment or scheduled patch during that period.
# Expected
We roll out the Terraform scripts to rebuild our instances from scratch and route the new traffic to them. At that point everything should be back to normal, and we can have a cup of coffee.
# Actual
After we bootstrapped the new instances and enabled network traffic, some of the instances still went down for no apparent reason.
# Diagnosis
We ran `ps aux` to check that all the essential services were running normally.
It showed that one of the core services was in a failed state.
To understand why, we reviewed that service's log file and found that its mount point was gone.
As a result, the service could no longer write data to that path as it did before.
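On a systemd-based distribution, the same check can be made more directly. The unit name and mount point below are hypothetical stand-ins for our own:

```bash
# List services that entered a failed state
systemctl --failed

# Read the recent log entries of the suspect service (hypothetical unit name)
journalctl -u core-service.service --since "1 hour ago"

# Verify that the expected mount point actually exists (hypothetical path)
findmnt /data
```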
# Lesson we learned
Device paths in Linux aren’t guaranteed to be consistent across restarts.
In practice, that means any mount that references a raw device name can silently break after an instance reboots.
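udev already maintains stable aliases for every block device; these are what a mount should reference instead of `/dev/sdX` names:

```bash
# Stable symlinks that survive reboots and device re-enumeration
ls -l /dev/disk/by-uuid/   # filesystem UUIDs
ls -l /dev/disk/by-label/  # filesystem labels
ls -l /dev/disk/by-id/     # hardware/cloud-provided identifiers
```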
# Resolution
We should mount the path by UUID or another persistent identifier (such as a filesystem label) so the OS can always find the correct device.
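A minimal sketch of such a mount, assuming an ext4 data disk. The device name, UUID, and `/data` mount point below are made up; read the real UUID with `blkid`:

```bash
# Read the real UUID of the data disk (device name here is hypothetical)
sudo blkid /dev/sdc1

# Mount by UUID instead of device name; `nofail` keeps the boot from
# hanging if the disk is missing (UUID and mount point are made up)
echo 'UUID=0b3da000-1a2b-4c5d-8e6f-0123456789ab /data ext4 defaults,nofail 0 2' \
  | sudo tee -a /etc/fstab
sudo mount -a   # verify the entry mounts cleanly
```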
# Bonus
Some Linux distributions do not guarantee that a kernel upgrade will always be backward compatible.
As M$ states in the following:
> Azure Linux VM that's running the 3.10-based kernel crashes after a host node upgrade in Azure.
# The right way
According to the immutable infrastructure definition from DigitalOcean (DO):
> An immutable infrastructure is another infrastructure paradigm in which servers are never modified after they're deployed. If something needs to be updated, fixed, or modified in any way, new servers built from a common image with the appropriate changes are provisioned to replace the old ones. After they're validated, they're put into use and the old ones are decommissioned.
To put it the Node.js way, consider each instance an application. We can use `package-lock.json` to lock the version of every dependency, which keeps untested packages from being loaded into the application and breaking something down the line.
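`npm ci` is the command that enforces this: it installs exactly the versions recorded in `package-lock.json` and fails if the lock file and `package.json` disagree.

```bash
# Reproducible install: uses package-lock.json verbatim, never resolves anew
npm ci
```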
If we apply the same logic to our infrastructure, we can likewise avoid unexpected punches from updates, as sketched below.
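A minimal Terraform sketch of that idea: reference one exact, pre-tested image instead of a "latest" lookup, so every replacement instance is built from the same artifact. The AMI ID and instance type below are made up:

```hcl
resource "aws_instance" "app" {
  # Exact, tested image baked ahead of time; never a "latest" data lookup
  ami           = "ami-0123456789abcdef0"
  instance_type = "t3.medium"
}
```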
# But what if I want to update the kernel and packages?
Test the change first, then update your Terraform files so that everything you roll out has already been tested.
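In practice that means baking the new kernel or packages into a fresh image, testing it, and only then pointing Terraform at it. The package and version below are hypothetical:

```bash
# 1. In the image build, install the exact versions you intend to ship
sudo apt-get install -y nginx=1.18.0-6ubuntu14   # pinned, not "latest"

# 2. Run your test suite against an instance built from the new image

# 3. Bump the image reference in your Terraform file and roll the instances
```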
# How AWS approaches it
> By default, AWS OpsWorks Stacks automatically installs the latest updates during setup, after an instance finishes booting. AWS OpsWorks Stacks does not automatically install updates after an instance is online, to avoid interruptions such as restarting application servers. Instead, you manage updates to your online instances yourself, so you can minimize any disruptions.
See also: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/update-management.html
# Reference
- https://docs.microsoft.com/en-us/troubleshoot/azure/virtual-machines/troubleshoot-device-names-problems#cause
- https://serverfault.com/questions/809812/windows-azure-data-disk-mount-point-was-changed-after-reboot
- https://docs.microsoft.com/en-us/troubleshoot/azure/virtual-machines/linux-kernel-panics-upgrade
- https://www.digitalocean.com/community/tutorials/what-is-immutable-infrastructure
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/update-management.html
- https://docs.aws.amazon.com/opsworks/latest/userguide/workingsecurity-updates.html