# TL;DR
- Disable automatic system updates
- Disable automatic package updates
- Disable automatic kernel updates
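For example, on a Debian/Ubuntu image these are the usual knobs. This is a minimal sketch assuming the stock `unattended-upgrades` setup; adjust for your distribution:

```bash
# Stop the unattended-upgrades service from applying packages automatically
sudo systemctl disable --now unattended-upgrades

# Tell apt's periodic jobs not to refresh or upgrade on their own
sudo tee /etc/apt/apt.conf.d/20auto-upgrades <<'EOF'
APT::Periodic::Update-Package-Lists "0";
APT::Periodic::Unattended-Upgrade "0";
EOF

# Pin the kernel so it is only upgraded deliberately
sudo apt-mark hold linux-image-generic
```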
# Background
Our company had a massive service degradation this year. We spent a few days recovering our service and asked the cloud service provider for help.
It was unclear why so many instances went down at the same time, since we had no deployment or scheduled patch during that period.
# Expected
We roll out the Terraform scripts to rebuild our instances from scratch and route the new traffic to them. At that point everything should be back to normal, and we can have a cup of coffee.
# Actual
After we bootstrapped the new instances and enabled network traffic, some of the instances still went down for no apparent reason.
# Diagnosis
We ran `ps aux` to check that all the essential services were running normally.
It showed that one of the core services was in a failed state.
To understand why, we reviewed that service's log file and found that its mount point was gone.
As a result, the service could no longer write data to that path as it did before.
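On a systemd-based distribution, the same check can be made more directly. The unit name and mount point below are hypothetical stand-ins for our own:

```bash
# List services that entered a failed state
systemctl --failed

# Read the recent log entries of the suspect service (hypothetical unit name)
journalctl -u core-service.service --since "1 hour ago"

# Verify that the expected mount point actually exists (hypothetical path)
findmnt /data
```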
# Lesson we learned
Device paths in Linux aren’t guaranteed to be consistent across restarts.
In practice, that means any mount that references a raw device name can silently break after an instance reboots.
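udev already maintains stable aliases for every block device; these are what a mount should reference instead of `/dev/sdX` names:

```bash
# Stable symlinks that survive reboots and device re-enumeration
ls -l /dev/disk/by-uuid/   # filesystem UUIDs
ls -l /dev/disk/by-label/  # filesystem labels
ls -l /dev/disk/by-id/     # hardware/cloud-provided identifiers
```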
# Resolution
We should mount the path by UUID or another persistent identifier (such as a filesystem label) so the OS can always find the correct device.
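A minimal sketch of such a mount, assuming an ext4 data disk. The device name, UUID, and `/data` mount point below are made up; read the real UUID with `blkid`:

```bash
# Read the real UUID of the data disk (device name here is hypothetical)
sudo blkid /dev/sdc1

# Mount by UUID instead of device name; `nofail` keeps the boot from
# hanging if the disk is missing (UUID and mount point are made up)
echo 'UUID=0b3da000-1a2b-4c5d-8e6f-0123456789ab /data ext4 defaults,nofail 0 2' \
  | sudo tee -a /etc/fstab
sudo mount -a   # verify the entry mounts cleanly
```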
# Bonus
Some Linux distributions do not guarantee that a kernel upgrade will always be backward compatible.
As M$ states in the following:
> Azure Linux VM that's running the 3.10-based kernel crashes after a host node upgrade in Azure.
# The right way
According to the immutable infrastructure definition from DigitalOcean (DO):
> An immutable infrastructure is another infrastructure paradigm in which servers are never modified after they're deployed. If something needs to be updated, fixed, or modified in any way, new servers built from a common image with the appropriate changes are provisioned to replace the old ones. After they're validated, they're put into use and the old ones are decommissioned.
To put it the Node.js way, consider each instance an application. We can use `package-lock.json` to lock the version of every dependency, which keeps untested packages from being loaded into the application and breaking something down the line.
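`npm ci` is the command that enforces this: it installs exactly the versions recorded in `package-lock.json` and fails if the lock file and `package.json` disagree.

```bash
# Reproducible install: uses package-lock.json verbatim, never resolves anew
npm ci
```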
If we apply the same logic to our infrastructure, we can likewise avoid unexpected punches from updates, as sketched below.
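A minimal Terraform sketch of that idea: reference one exact, pre-tested image instead of a "latest" lookup, so every replacement instance is built from the same artifact. The AMI ID and instance type below are made up:

```hcl
resource "aws_instance" "app" {
  # Exact, tested image baked ahead of time; never a "latest" data lookup
  ami           = "ami-0123456789abcdef0"
  instance_type = "t3.medium"
}
```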
# But what if I want to update the kernel and packages?
Test the change first, then update your Terraform files so that everything you roll out has already been tested.
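In practice that means baking the new kernel or packages into a fresh image, testing it, and only then pointing Terraform at it. The package and version below are hypothetical:

```bash
# 1. In the image build, install the exact versions you intend to ship
sudo apt-get install -y nginx=1.18.0-6ubuntu14   # pinned, not "latest"

# 2. Run your test suite against an instance built from the new image

# 3. Bump the image reference in your Terraform file and roll the instances
```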
# How AWS approaches it
> By default, AWS OpsWorks Stacks automatically installs the latest updates during setup, after an instance finishes booting. AWS OpsWorks Stacks does not automatically install updates after an instance is online, to avoid interruptions such as restarting application servers. Instead, you manage updates to your online instances yourself, so you can minimize any disruptions.
See also: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/update-management.html
# Reference
- https://docs.microsoft.com/en-us/troubleshoot/azure/virtual-machines/troubleshoot-device-names-problems#cause
- https://serverfault.com/questions/809812/windows-azure-data-disk-mount-point-was-changed-after-reboot
- https://docs.microsoft.com/en-us/troubleshoot/azure/virtual-machines/linux-kernel-panics-upgrade
- https://www.digitalocean.com/community/tutorials/what-is-immutable-infrastructure
- https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/update-management.html
- https://docs.aws.amazon.com/opsworks/latest/userguide/workingsecurity-updates.html