Ivan on the Server Side

iximiuz Labs update: dev and ops horror stories and an illustrated guide on how to use playgrounds


Hello πŸ‘‹

Ivan's here with a mid-month iximiuz Labs update.

In this email, you'll find a few debugging horror stories, a couple of brief incident postmortems, and a bunch of illustrated tips and tricks on how and why to use iximiuz Labs playgrounds. The popularity of the latter keeps growing, and so does the number of questions I get, so it's time for me to start documenting the answers πŸ€“

Debugging horror

Traditionally, a period of stabilization follows a big release (persistent playgrounds this time). And one did follow, but the issues I had to iron out weren't anywhere close to what I had initially expected.

Syntax highlighting and 100% CPU usage

Unlike me, iximiuz Labs did participate in KubeCon North America 2025. And imagine my surprise when a couple of days before the event, I started seeing these spikes on the CPU usage graphs of the API server:

Luckily, after "just a few hours" of profiling, I accidentally noticed the same pattern on another graph, which finally allowed me to pinpoint the offender:

The /api/_mdc/highlight path is not a real API endpoint - Nuxt MDC (a custom markdown components library) invokes it as a regular function in SSR mode to highlight code blocks in iximiuz Labs tutorials, challenges, and similar pages. However, the actual syntax highlighting is done not by MDC itself but by Shiki, a pretty popular JS library.

After spending several more hours trying to come up with a labs-agnostic reproduction, I managed to produce a tiny gist that takes about a minute to highlight a relatively short code block (20 lines). From there on, the path forward was clear - I gave the gist to Claude Code and asked it to identify the problematic piece of code in Shiki. And ~10 minutes later, I had this commit with a fix for exponential regex performance degradation on lines with long whitespace sequences. If only the Shiki project were accepting PRs from external contributors... But a "hot-patch" it is for now.
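
For the curious, the failure mode is classic catastrophic regex backtracking. Here's a minimal self-contained illustration (a textbook pattern, not Shiki's actual regex):

    // Nested quantifiers force the engine to try every possible way of
    // splitting the whitespace into groups once the final match fails.
    const pattern = /^(\s+)*$/;
    const line = ' '.repeat(28) + 'x'; // a long whitespace run + a non-matching tail

    const start = Date.now();
    pattern.test(line); // ~2^27 backtracking paths before giving up
    console.log(`took ${Date.now() - start} ms`); // grows exponentially with the run length

The standard cure is to rewrite such patterns so that the same input can be consumed in only one way.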

Fly.io load balancing and idle SSE connections

Around the same time, I spotted another anomaly in the API server charts - the "App Concurrency" metric started growing like crazy:

Initially, I thought it might be related to the elevated CPU usage, but the concurrency level remained unreasonably high even after the CPU issue was resolved.

SSH-ing into one of the Fly machines showed that there were indeed many established TCP connections, mostly idle. And the most puzzling part was that there had been no code changes (other than simple HTML/CSS adjustments) around the time the problem first appeared on the graph.

Cutting a long story short (and I also don't have hard proof yet), something changed on Fly's side, and their managed load-balancing layer stopped closing idle SSE connections even when the client that established the connection was already long gone πŸ€¦β€β™‚οΈ

The fix was to start sending an empty SSE heartbeat message, making it impossible for the load balancer to miss the fact that there is no one left on the client side to receive the event.
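
Here's a minimal sketch of the idea in plain Node.js (the port and the 15-second interval are arbitrary illustration values, not the labs' actual setup):

    import { createServer } from 'node:http';

    createServer((req, res) => {
      res.writeHead(200, {
        'Content-Type': 'text/event-stream',
        'Cache-Control': 'no-cache',
      });

      // Lines starting with ':' are SSE comments - clients silently ignore
      // them, but pushing bytes through the proxy surfaces dead connections.
      const heartbeat = setInterval(() => res.write(': keep-alive\n\n'), 15_000);

      req.on('close', () => clearInterval(heartbeat));
    }).listen(8080);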

Nuxt 3 to 4 upgrade - the devil was in the details

I had been postponing the upgrade to Nuxt 4 since the summer, but Nuxt 3 reaches end of life in February 2026, so the usually quiet December offered a good and much-needed opportunity to upgrade my key frontend & API server library.

But there was an obstacle...

From its inception back in 2023, iximiuz Labs relied on the nuxt-theme/typography plugin to render good-looking HTML from the markdown source (of tutorials, challenges, courses, etc.). However, this project got abandoned, along with its underlying styling library - pinceau - leaving me in an unfortunate situation where a simple npm update would immediately break the entire frontend app and the API server (yes, shame on me for not splitting them long ago).

Getting rid of the typography plugin completely and restyling all possible markdown elements by hand was (and probably still is) the right way forward, but I'm not that proficient at UI design. Besides, it wasn't nuxt-typography itself that was problematic - it's just a collection of mostly static Vue components for styled links, paragraphs, headers, and other HTML blocks - but the underlying TS-to-CSS compiler, pinceau, which also looked way too versatile a dependency for my Nuxt-specific use case. So, I decided to keep the typography plugin as-is and replace only its abandoned build-time dependency.

A quick look for an alternative library that would render the below CSS-ish TypeScript into a real piece of CSS revealed only that pinceau itself was a rewrite of another abandoned project - stitches. Yes, that's why we all love the Node.js ecosystem.
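
Something along these lines (an illustrative approximation of the pinceau-style source, not the exact iximiuz Labs code):

    // css() is pinceau's styling function (typically auto-available inside a
    // Vue <style lang="ts"> block); token paths in curly braces are resolved
    // into real CSS values at build time.
    css({
      '.prose-a': {
        color: '{color.primary.500}',
        borderBottom: '1px solid {color.primary.200}',
        '&:hover': { color: '{color.primary.600}' },
      },
    })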

But the actual reason I'm ranting about this problem for so long is how I got it solved, which you may find amusing. I'm no expert in frontend bundlers, compilers, and the like. So my only hope was that Claude Code could do the rewrite for me. However, I needed to verify the results somehow, and here is what I came up with:

  • Build the frontend app using the abandoned libraries (this is how it's been done historically).
  • Capture the generated CSS styles of the markdown components as expected output.
  • Generate a comprehensive test suite that rebuilds the project and compares the generated CSS styles (actual output) with the captured styles (see the sketch after this list).
  • Remove the abandoned dependency (profit).
  • Ask Claude Code to implement a TS-to-CSS compiler that would support all cases in the nuxt-typography plugin (pointing it at github.com/Tahul/pinceau for a reference impl).
  • Keep Claude Code tweaking the implementation until all tests pass.
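
Each test case boiled down to something like the following (vitest assumed; compileCss and the file paths are hypothetical names for illustration):

    import { readFileSync } from 'node:fs';
    import { describe, expect, it } from 'vitest';
    import { compileCss } from '../src/ts-to-css';

    describe('ts-to-css compiler', () => {
      it('emits the same CSS as the old pinceau build', () => {
        // Captured once from the last successful build with the old dependency.
        const expected = readFileSync('tests/fixtures/prose-a.css', 'utf8');
        const actual = compileCss(readFileSync('components/ProseA.ts', 'utf8'));
        expect(actual.trim()).toBe(expected.trim());
      });
    });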

It took me a couple of days to research the issue and come up with the above plan, and even with completely verifiable success criteria, Claude Code managed to derail itself half a dozen times. But after a few more hours, we produced ~1000 lines of code that made all the tests pass πŸŽ‰

To be clear, AI was a huge time saver here! Without a coding agent, I'd probably have spent a whole week, if not more, figuring out how to write this code by hand.

Ah, and the Nuxt upgrade itself was rather seamless - I had to tweak several places in the code, following the migration guide, but all the adjustments were trivial.

Docker 29 and (mostly unpleasant so far) surprises

I always try to keep the official iximiuz Labs playground collection up to date. And most of the time, a rootfs re-bake doesn't bring many surprises. But the recent upgrade to Docker 29 (and containerd 2.2) was packed with breaking changes.

Roughly 30% of the challenges stopped passing after the upgrade, and I spent an entire day adjusting them, my helper scripts, and even the main codebase to make everything work with the new Docker behavior. Chances are, your automation is affected as well. Here is what I found:

  • The default output format of docker images has changed, and it broke a whole bunch of docker images | grep ... checks in my scripts. A quick fix is to switch to docker images --format table (which mimics the historical format).
  • The containerd-snapshotter has become the default Docker storage driver, and it has moved the on-disk location of containers from /var/lib/docker to /var/lib/containerd. If you were mounting a dedicated partition at /var/lib/docker, be ready to remount it at the new location.
  • The .GraphDriver section of the docker container inspect output is completely gone now (and it also broke a bunch of challenges).
  • The .NetworkSettings.IPAddress attribute is gone from the docker container inspect output (the explicit .NetworkSettings.Networks.<network>.IPAddress should be used instead, which makes perfect sense, but it's such a breaking change...).
  • ctr image pull/push started treating multi-platform images differently. You cannot ctr image pull then ctr image tag and ctr image push anymore - an explicit ctr image convert (from a manifest list into a single-platform variant) is now required.
  • ctr image mount started failing with a bizarre error if the mount target is provided (but ctr image mount without an explicit target works, mounting the image at some random location under /run/containerd/...).

There are probably more breaking changes, so it's a good idea to read the release notes in full before the upgrade.

Four worker node outages in two weeks

In the 3 years of iximiuz Labs' existence, there had been only 1 worker node outage (when a bare metal server went completely missing). That is, until this month. Now there are 5 πŸ€¦β€β™‚οΈ

But I'm taking the platform's reliability very seriously!

Below, you'll find the root cause analysis of the incidents and the actions that were taken to prevent similar issues in the future.

Farewell to Hetzner Auctioned servers

Three out of the four December outages were caused by Hetzner's auctioned servers suddenly halting. All three servers were of the same profile (Intel i9-12900K, 128 GB RAM, running somewhere in Finland), and in all cases, the malfunctioning component was the NIC (which would just stop sending/receiving packets). A simple hardware reset revived the servers, but I decided to replace them anyway.

Luckily, this time, I could afford Hetzner's prime bare metal offering - thanks to all of you who decided to support iximiuz Labs with premium memberships! πŸ€—

The new server profile is a bit more powerful and also noticeably more expensive. But my main hope is that it'll work much more reliably than the (dated) auctioned servers.

Interestingly, the first incident also caused partial unavailability of the control plane, and hence of the API server and the frontend app, too. To make the platform handle similar incidents without disruption, I applied several adjustments, and the two dead-server incidents that followed shortly afterward proved the changes had the intended effect.

Asia Pacific regional outage

On Friday, December 19th, iximiuz Labs had its very first region-wide availability incident. Yay!

Unlike the other 3 worker node outages in December, this time the issue was caused by a flaw in the platform's load-balancing logic (authored by yours truly).

Live workshops create a unique load pattern that differs significantly from a typical, much more evenly distributed load from individual playground users. The control plane has always had a backpressure mechanism built in, but it's based on current CPU, memory, and disk usage on the worker nodes and doesn't account for potential future spikes in already scheduled playgrounds.

If a worker node is approaching its healthy limits, it pushes back on the next playground placement attempt, and the control plane tries to schedule the playground run on another node. However, this mechanism doesn't protect against situations where N > 10 initially idle playgrounds are started on the same worker node simultaneously and then, after some initial delay, begin to consume a bunch of resources.

And this is exactly what happens during most workshops - each participant clicks the Start button, then listens to the instructor for a few minutes, and only after that starts hitting the keyboard, loading the playground. Multiply by ten, and you'll get a worker node outage.
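
In pseudocode, the gap (and the eventual fix described below) looks roughly like this - an illustrative sketch with made-up thresholds, not the actual control-plane code:

    interface WorkerNode {
      cpuUsage: number; // 0..1, observed right now
      memUsage: number; // 0..1, observed right now
      vmCount: number;  // scheduled VMs, idle ones included
    }

    const MAX_VMS_PER_NODE = 10; // hypothetical value

    function canPlacePlayground(node: WorkerNode): boolean {
      // The old check: backpressure on observed usage only. A node full of
      // idle VMs passes it easily - and then they all spike together.
      const withinUsageLimits = node.cpuUsage < 0.8 && node.memUsage < 0.8;
      // The fix: also cap the VM count, regardless of current usage.
      return withinUsageLimits && node.vmCount < MAX_VMS_PER_NODE;
    }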

To reduce the probability of similar incidents in the future, I now limit the maximum number of VMs per server, regardless of whether they are idling or not. Plus, I added an extra server to the Asia Pacific region, bringing the regional total to 3 πŸŽ‰ That's 3x the pre-December server headcount, by the way.

The above beauty costs me "just" $258/mo, bringing the regional bill to a whopping ~$750/mo.

By the way, this availability incident perfectly illustrates why the Premium Seats that instructors can purchase for their students cost significantly more than individual premium access for the same period. The former assumes synchronous access, which requires a bunch of overprovisioning, while the latter tends to produce a much more evenly spread load thanks to its self-paced nature.

iximiuz Labs Playgrounds keep getting better

To a large extent, the above issues were caused by a steep increase in playground usage that happened over just a couple of weeks in late November and early December. What was historically a record level of simultaneously running playgrounds has now become the new baseline... and this is great! πŸš€

More people using the platform also means a more diverse set of use cases, which often means I have to work around the clock to ship a missing feature. But I couldn't be more grateful for this opportunity - it's the most natural way to develop the product, and my end goal with playgrounds remains to provide as realistic a server-side experience as possible.

The current collection of playgrounds ranges from a simple Linux VM to a multi-node Kubernetes cluster, and always comes with the latest and greatest:

You can use iximiuz Labs playgrounds to:

  • Practice Linux and networking (either in a free-play mode or by solving guided hands-on challenges)
  • Build a "remote" homelab (actually, unlike with most physical homelabs, at iximiuz Labs, you can build as many as you like)
  • Create your public DevOps portfolio (CodePen/StackBlitz, but for Linux projects)
  • Experiment with new tools, confine coding agents, do security research, and whatnot.

And if you are a newcomer or want to improve your iximiuz Labs game, I've tried to explain what playgrounds are in a visual way below.

iximiuz Labs playgrounds in a nutshell

At its simplest, a playground is just a VM running on a large bare-metal server. Since it's a full-fledged VM and not a container, you can run most typical workloads in it, including Docker and Kubernetes, without the annoying limitations of Docker-in-Docker or a shared kernel.

Each playground VM gets a root drive and its own kernel, a network adapter with a local IP address assigned, 2-4 vCPUs, and 4-8 GB of RAM. Accessing a playground VM is no different from SSH-ing into a regular Linux server, and you can also:

When a single drive is not enough

For many tasks, the above basic VM will already be enough, but if you want to practice more advanced sysadmin topics like disk partitioning or try using different filesystems, you can easily add extra drives to the playground VM using either the constructor UI or a Kubernetes-style manifest:

A single VM is not a limitation

Most of the real-world setups you'll be dealing with are distributed systems with more than one node. The good news is that on iximiuz Labs, you can easily start a playground with up to 5 VMs in it using the so-called Flexbox base (video):

Multi-VM playgrounds are where iximiuz Labs starts to noticeably outperform its alternatives:

  • Renting multiple VPSes (e.g., DigitalOcean droplets or EC2 instances) to reproduce a similar setup is significantly more expensive.
  • Using local virtualization software (e.g., VirtualBox) requires a relatively powerful laptop or PC.
  • Slicing a remote server into multiple VMs with KVM is tricky, and the resulting setup will likely be less flexible.

Complex network topologies

The two-VM setup from the above diagram was only scratching the surface. On iximiuz Labs, you can connect a VM to an arbitrary number of bridge networks, simulating rather complex network topologies.

This is a much-needed capability for practicing routing problems or exploring real-world hierarchical topologies.

Preserving your playground progress

For a long time, iximiuz Labs playgrounds were fully ephemeral. You'd start a new environment, perform some tasks in it for up to 8 hours, but eventually, the playground would have to be terminated and its data completely removed.

While ephemeral playgrounds remain a completely valid (and still dominant) way to use the labs, since November, it's also possible to save your playground progress and resume it after lunch, the next day, or even after a lengthy vacation.

The playground termination dialog now offers two actions:

  • [new] Stop - the playground VMs will be terminated, but their drives will be snapshotted and offloaded to remote storage.
  • Destroy - historically the only option; it completely disposes of the playground's data after terminating the VMs.

Instantly fork the playground run

Imagine you're working on a task - it can be a server configuration issue, a particularly involved Kubernetes cluster, or a coding problem. You've been on it for hours - cloning GitHub repos, running ad hoc shell commands, restarting services, etc. Finally, you manage to produce a certain state you want to preserve. You hit the Stop button, enjoying the playground persistence, and go to sleep.

But the next day, you wake up not with 1 but with 3 ideas for how to proceed. If only it were possible to "clone" the state of your playground and try all three hypotheses against the same system.

Infrastructure as Code is the way, but there is a big problem with this approach - you need to know upfront that the setup should be scripted. Clearly, we're already past that point.

Luckily, on iximiuz Labs, you can clone stopped playgrounds with just one click. A replica takes no extra space (copy-on-write) and can be created instantly. So now you can have as many copies of your accidental but valuable snowflake setup as you like!

Create custom playgrounds

Last but not least, some accidental (or intentional) "clickops" setups are worth preserving as reusable templates for a long time. This is where another capability of persistent playgrounds comes in handy - saving a stopped run as a custom playground.

Cloning a run ad hoc creates two independently mutable playground runs, while saving a stopped run as a custom playground creates a read-only template that can then be used to branch out as many mutable runs as needed. And as with any playground, you can also share it with others or embed it in tutorials, challenges, or course lessons. Magic! πŸ§™

Wrapping up

Phew, that was a long one!

My original plans for December also included redesigning the inner and outer content navigation, shipping independent author machinery (public profile, internal dashboard with tools and stats, the ability to monetize content), and, of course, replacing the good old Premium plan with two more fine-grained options (Tinkerer and Learner). In actuality, the stabilization and upgrade of the platform took all my time, so I'm only starting to work on these features now, but hey, there are ten more days to go! πŸ’ͺ

By the way, this wasn't a year wrap-up yet! I'll do my best to send one more email with "iximiuz Labs in numbers" and also the platform's plans for 2026. Stay tuned!

Happy holidays!

Ivan

P.S. The all-inclusive iximiuz Labs Premium plan is going away soon - don't miss your chance to benefit from the early-day supporter offer.


A satellite project of labs.iximiuz.com - an indie learning platform to master Linux, Containers, and Kubernetes the hands-on way πŸš€
