
Why using virtual machines in the cloud is bad practice

Joost van der Waal

4 Oct 2024

6 min read

In my work I still see people easily spinning up virtual machines in the cloud. For me this is a horrible sight.

I’ve been working in the cloud for 7 years now and have never had any use for a virtual machine.

In this blog I will explain why using virtual machines is expensive, inflexible and maintenance-heavy. Of course I will offer some alternatives, but those details will be explained in follow-ups. All examples are either general or on AWS, but the same concepts apply to other platforms like Azure as well.

Reasons why VMs are not the right cloud tool

Financial
Inflexible
Security
Complexity
Maintenance

Financial reasons

When running a virtual machine you often see usage graphs like this:

The graph in this case shows a peak during breakfast and dinner. The details will differ per use case, but this is a common pattern. The specifications of the virtual machine are usually chosen so that it performs decently during peak hours. This means there is constant overcapacity during quiet hours.

If you visualize the unused reserved capacity, you end up with this:

You are still paying for all the unused capacity!

Example: A customer had some (smaller) servers running for around €150 each month. After going fully serverless, the monthly costs dropped to less than €5.
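To put a rough number on that unused capacity, here is a minimal sketch. The hourly demand profile is made up for illustration (it only mimics the breakfast and dinner peaks described above), and the instance is assumed to be sized exactly for the busiest hour.

```python
# Hypothetical hourly load over one day, as a percentage of a reference machine.
# The numbers are invented; they only mimic a breakfast and a dinner peak.
HOURLY_DEMAND = [5, 5, 5, 5, 5, 10, 40, 80, 60, 20, 15, 15,
                 20, 15, 15, 20, 40, 80, 90, 50, 20, 10, 5, 5]


def unused_fraction(demand, capacity):
    """Return the fraction of paid-for capacity that is never used."""
    paid = capacity * len(demand)
    used = sum(min(load, capacity) for load in demand)
    return (paid - used) / paid


# Assumption: the VM is sized so that it just handles the peak hour.
capacity = max(HOURLY_DEMAND)
idle = unused_fraction(HOURLY_DEMAND, capacity)
print(f"Unused capacity: {idle:.0%}")
```

With this invented profile, roughly 70% of the capacity you pay for sits idle. Real numbers will differ per use case, but the shape of the calculation stays the same.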

Inflexibility reasons

When starting a product, it is usually not clear how much capacity is needed. During development the smallest instance is probably used, and the production instance size is decided with a guesstimate.

At first this goes well, but at some point this might be the usage chart:

Your application becomes slow during peak moments due to limited capacity.

Just get a bigger instance!

This is probably the start of a more expensive journey. The easy solution is to “just get a bigger instance”. But as seen under financial reasons, this will also increase the amount of overcapacity during quiet hours.

It would be better to scale horizontally and introduce a load balancer in front of multiple smaller instances that are spun up on demand. But this is not always as simple as it sounds:

We now need a database server (it was embedded)
Software is not always designed to share a database connection
Setting up proper scaling is hard
A VPC is needed to secure inter-instance traffic
Session management is terrible: Let’s also introduce a mem-cache instance
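The difference between “a bigger instance” and scaling horizontally can be sketched with some simple arithmetic. All capacities, prices and the demand profile below are assumptions chosen for illustration, and the sketch deliberately ignores the extra costs from the list above (load balancer, database server, cache).

```python
import math

# Hypothetical hourly load in arbitrary "load units" (invented numbers).
HOURLY_LOAD = [5, 5, 5, 5, 5, 10, 40, 80, 60, 20, 15, 15,
               20, 15, 15, 20, 40, 80, 90, 50, 20, 10, 5, 5]

SMALL_CAPACITY = 25  # assumed load units one small instance can handle
SMALL_PRICE = 1.0    # assumed price per hour of a small instance
BIG_PRICE = 4.0      # assumed: a 4x bigger instance costs 4x as much


def vertical_cost(load_per_hour):
    """One big instance that is always on, regardless of load."""
    return len(load_per_hour) * BIG_PRICE


def horizontal_cost(load_per_hour):
    """Just enough small instances per hour, with at least one running."""
    total = 0.0
    for load in load_per_hour:
        instances = max(1, math.ceil(load / SMALL_CAPACITY))
        total += instances * SMALL_PRICE
    return total


print("bigger instance:", vertical_cost(HOURLY_LOAD))
print("scaled out     :", horizontal_cost(HOURLY_LOAD))
```

Under these assumptions the scaled-out setup needs well under half of the instance cost of the single big machine. Whether that saving survives the added components is exactly the trade-off described in the list above.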

Security reasons

Running software in a virtual machine actually means one credential per outgoing connection. There is no real distinction between the (sub)processes running inside the machine. Therefore, really fine-grained access control is not possible.

Comparison: With a microservices architecture you can create many fine-grained credentials. For example, a specific microservice could be allowed only to read data from the database. Even if that microservice is somehow breached, it still can’t modify the data.
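As an illustration of such a fine-grained credential, the sketch below builds a read-only IAM policy document for a single microservice. It assumes the database is a DynamoDB table, which the example above does not specify; the table name, region and account id are placeholders.

```python
import json

# Read-only DynamoDB actions; no Put, Update or Delete actions are granted.
READ_ONLY_ACTIONS = [
    "dynamodb:GetItem",
    "dynamodb:BatchGetItem",
    "dynamodb:Query",
    "dynamodb:Scan",
]


def read_only_table_policy(table_arn):
    """Build an IAM policy document that only allows reading one table."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": READ_ONLY_ACTIONS,
                "Resource": table_arn,
            }
        ],
    }


# Placeholder ARN: region, account id and table name are hypothetical.
ORDERS_TABLE_ARN = "arn:aws:dynamodb:eu-west-1:123456789012:table/orders"
policy = read_only_table_policy(ORDERS_TABLE_ARN)
print(json.dumps(policy, indent=2))
```

Attached to the role of one microservice, a policy like this means a breach of that service can expose the data in that one table but cannot change it. On a VM, every process would share whatever the machine’s single role allows.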

Complexity reasons

Which one looks more complex?

Cloud	Virtual Machine

The left image has more visual components. For the right side, my answer would be: I don’t know. Knowing what is in your architecture is very important. In my experience the individual elements on the left side are:

scalable
minimal configuration possibilities
always available

On the right side I have to guess. This could be:

Database server
Nginx/Apache server
Bash scripts
Cron jobs
Configuration in /etc /opt /var
…

Basically it’s a black box with lots of configuration.

Maintenance reasons

Let’s assume that during the development phase everything goes smoothly, all software is on the latest LCM version and a cool automation bash script is built to configure everything.

A lot of virtual machines run 24/7 for many days/weeks/months or even years.

Software needs patching.

Some patches are easy enough, but others introduce breaking changes. The longer a server is active, the harder and scarier it gets to update the machine.

If a fix needs to be applied, people are suddenly confronted with that cool automation bash script, which no longer makes any sense and does not have unit tests.

During the lifetime of a virtual machine you need to employ people with enough knowledge of server management. They need to work costly night shifts for that once-a-year moment when something has to be fixed.

Recap

Above I basically explained why I do not think a virtual machine is ever a good idea. Not for small companies, but also not for large companies. Even in enterprises you can easily save money by being properly skeptical when people ask for a virtual machine. Between the lines I’ve hinted at some alternatives. These alternatives probably need a shift in thinking about software, but they are worth it. I will dedicate some blog posts to them as well.

What to do with…

Sometimes there are difficulties to take into consideration.

Virtual desktop environments

This is probably an ok-ish use case. However, consider whether it can be done with containers that start and stop automatically.

“Vendor software requires a VM”

Ask the vendor to consider static containers. This already enforces some infrastructure as code through, for example, the Dockerfile. Containers can also work with smaller units of work, as long as the application is scalable.

When choosing the VM route

Sometimes you will still take the VM route. In that case I advise the following:

1. Leverage machine images

Create a machine image per configuration set. This improves stability and startup speed. Patch the machine images as often as possible.

2. Always use a load balancer

Even with a single node, use a load balancer. Make sure you mark the running node “dirty” every day. This will trigger the start of a new node from the predefined machine image with the patched software. With this approach you force yourself to make the startup perfect. If not, you will be fixing the misconfiguration every day.
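The daily “dirty” marking could be driven by a small scheduled job. The sketch below only shows the selection logic: which nodes have been running longer than a day. The node ids and launch times are hypothetical. On AWS, assuming the nodes sit in an Auto Scaling group behind the load balancer, one option for the actual marking is the set_instance_health call of the boto3 autoscaling client with HealthStatus="Unhealthy"; that call is left as a comment so the sketch runs without cloud access.

```python
from datetime import datetime, timedelta, timezone

MAX_NODE_AGE = timedelta(days=1)  # recycle every node at least daily


def nodes_to_recycle(launch_times, now, max_age=MAX_NODE_AGE):
    """Return the ids of nodes that have been running longer than max_age."""
    return sorted(
        node_id
        for node_id, launched in launch_times.items()
        if now - launched > max_age
    )


# Hypothetical node ids and launch times, for illustration only.
now = datetime(2024, 10, 4, 12, 0, tzinfo=timezone.utc)
launch_times = {
    "node-a": now - timedelta(hours=30),  # older than a day
    "node-b": now - timedelta(hours=2),   # freshly started
}

for node_id in nodes_to_recycle(launch_times, now):
    # Assumption: here you would mark the node unhealthy, for example via
    # boto3: autoscaling.set_instance_health(InstanceId=node_id,
    #                                        HealthStatus="Unhealthy")
    print(f"mark {node_id} as dirty")
```

Keeping the selection logic separate from the cloud call makes it easy to unit test, which is exactly what the “cool automation bash script” usually lacks.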

3. Use a cloud orchestration tool

Use cloud tools like Terraform, CloudFormation/CDK or ARM/Bicep to deploy your infrastructure and build your machine images. When this is properly automated, you can easily set up multiple types of tests, such as:

Acceptance test for patched machine images
Daily Disaster Recovery (DR) test where you just duplicate the whole cluster

End note

Thanks for reading, I hope it was useful. Please drop me a note on LinkedIn if you have additional questions or remarks.

~ Joost van der Waal (Cloud guru)
