Remember when the Hollow Knight sequel crashed all of the game storefronts?


The new Hollow Knight game crashed the digital storefront of every major game portal it’s offered on! Hundreds of thousands, if not millions, of people eagerly trying to download it right on launch day. Exciting! Embarrassing? Sometimes disappointing? It can be all of these things! And also an opportunity to talk about scalable architecture and best practices! Sorry to be such an insufferable geek, but you know who you followed. Let’s go for the teachable moment.*

Look, first of all, if you’re hosting everything in an on-premises datacentre, you’re a little bit to blame here, right? You have a very clear ceiling on how much demand you can handle. But if we’re talking cloud, we still have a few things to keep in mind.

Historical traffic, capacity planning, and infrastructure spend patterns are not necessarily indicative of future needs. You may see unprecedented load because a cool product gets linked in a gift guide and makes your eCommerce site go viral, or because a long-awaited indie smash hit seemingly out-performs major AAA releases. Rad! Good job, devs!!!

… Less good job, infra architects (or whatever financial person forced them to make such stringently capped decisions.) Plan your tech stack for major success, even if you don’t usually experience it. (And also plan for an excellent security posture so nobody runs up your cloud bills with a DDoS attack because of those success visions, heh. Those initial limits and quotas don’t just protect us from our own success, they also protect against evil attackers, so shore up those defences when you increase your capacity.)

Don’t put things in a fixed-number compute target (like serving your game from 3 VMs continuously). Use scalable choices like scaling sets of virtual machines (if you must) – Managed Instance Groups in Google, VM scale sets in Azure (I think? It’s been a minute), surely Elastic Something in AWS…*** better yet, containerise into something that can handle major load (k8s anyone? Sorry not sorry)
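Since I brought up k8s: here’s a toy sketch of the proportional scaling rule the Kubernetes Horizontal Pod Autoscaler documents (desired = ceil(current × observed/target)). The function name and the launch-day numbers are mine, invented purely for illustration:

```python
import math

def desired_replicas(current_replicas: int, observed_cpu: float, target_cpu: float) -> int:
    """HPA-style proportional scaling: grow the fleet by the ratio of
    observed load to target load, never dropping below one replica."""
    return max(1, math.ceil(current_replicas * (observed_cpu / target_cpu)))

# Launch day: 3 fixed VMs pinned at 95% CPU against a 60% target --
# an autoscaler would ask for 5; a fixed-number deployment just burns.
print(desired_replicas(3, 0.95, 0.60))
```

The point of the formula is that it works in both directions: when the midnight rush dies down, the same rule shrinks the fleet and your bill along with it.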

Even when you do that, watch out for those quotas, per region, per project, per whatever the unit is in your cloud(s) of choice. Don’t get stuck behind some unforeseen cap.
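To make that concrete with a hypothetical back-of-the-envelope check (the function and all the numbers are invented, not any cloud’s real API): an autoscaler can only scale up to whatever the region will actually grant, so a forgotten default quota bites at the exact moment traffic peaks.

```python
def granted_instances(desired: int, regional_quota: int) -> int:
    """How many of `desired` instances a region will actually grant
    under a hard per-region cap (a stand-in for real cloud quotas)."""
    return min(desired, regional_quota)

# The autoscaler asks for 500 instances on launch day, but nobody ever
# raised the default regional cap past 64:
print(granted_instances(500, 64))
```

Everything above the cap turns into failed requests and broken load screens, no matter how elastic the rest of your stack is.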

Lastly, invest in great SRE talent, tooling, and processes that can spot this and take care of it before it becomes an outage that a non-journalist like me ends up writing a LinkedIn post about while running errands, because it’s that egregious, right? (Which, ofc, involves being willing to take SRE seriously enough to spend some real money in advance of any such incident.)

Cool, back to sitting in front of your broken load screens waiting for Silksong to drop, have fun folks! (And seriously, great job, team who made this game. I don’t even know the name of the dev studio, but this is damn impressive and someone should give you a prize.)****

*Nb: I did literally no research for this post, as I was out running errands when Grant Roberts told me about the major outages, but I thought it was too funny and timely not to share. (He knows me so well.) Come at me in the comments if you must!

***It is, in fact, Amazon EC2 (or Elastic Compute Cloud) Auto Scaling groups.

****They are Team Cherry and do indeed deserve prizes!

This was originally posted on LinkedIn; I’m finally migrating it here. I cleaned up a bit of un-researched detail from the original, but tried to keep the same spirit of random unresearched post-on-the-go. :)
Very annoyed to discover a current WordPress bug that won’t let me include the cute emojis originally in my LinkedIn post. There were many cute emojis. Boo! Cross-platform Unicode nonsense art for all!
