When we started building Fintropy, the default answer for container orchestration was Kubernetes. GKE. Autoscaling. The whole ecosystem. We had it half-configured before we ran our first serious cost estimate on the dev environment.

Idle nodes: $800/month with nothing running on them.

For a pre-revenue startup building a platform that tells enterprises they’re wasting cloud money, that felt like a bad joke. We switched to Cloud Run.

And immediately hit a problem nobody warned us about.


Why Cloud Run Won on Paper

The case was straightforward:

  • Scale to zero. No requests, no containers, no bill.
  • No cluster management. No node pools to tune, no control plane to maintain.
  • Per-100ms billing. You pay for what you use, not what you provision.
  • Managed TLS, IAM, and traffic splitting. Things GKE makes you configure yourself.

For a lean team building fast, this was the right trade. We weren’t at the scale where Kubernetes’ operational overhead would pay for itself.


The Problem We Didn’t See Coming

Cloud Run scales to zero. When a request comes in, it starts a new container instance. That’s the cold start.

Our Celery workers were importing cloud SDKs at module level:

import boto3
from azure.mgmt.compute import ComputeManagementClient
from google.cloud import monitoring_v3

These are heavy. azure.mgmt.compute alone pulls in hundreds of megabytes of dependencies. On a cold start, Python was loading all of them before the worker could accept its first task.

Cold start time: 70+ seconds.

Cloud Run’s SIGTERM timeout: 60 seconds.

The worker was being killed before it finished booting. Silently. Jobs were disappearing into a void with no error logs because the process never reached the point where it could log anything.
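
A minimal guard would at least have made the failure visible. This is a sketch rather than something we had at the time, and it assumes Cloud Run delivers SIGTERM before the kill, as described above: register a handler at the very top of the worker entrypoint, before any heavy import runs.

# First lines of the worker entrypoint, before any cloud SDK import.
import logging
import signal
import sys

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("worker.entrypoint")

def _on_sigterm(signum, frame):
    # If the platform kills us mid-boot, leave a trace instead of vanishing silently.
    log.error("SIGTERM received before the worker finished booting; exiting")
    sys.exit(143)  # 128 + SIGTERM (15), the conventional exit code

signal.signal(signal.SIGTERM, _on_sigterm)

# Heavy imports and Celery bootstrapping go below this point.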

It took us three days to diagnose this. We were looking at task queue depths, Redis connectivity, Celery configuration — all wrong. The fix was embarrassingly simple once we found it.
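
If you ever need to chase this down yourself, CPython's built-in import profiler (python -X importtime) prints a per-module breakdown of import cost. A cruder version is to time the suspects in isolation; the module names below are just the ones from our imports, and the numbers only mean something when run inside the worker image:

# Rough diagnostic sketch: time each suspect SDK import on its own.
import importlib
import time

for module_name in ("boto3", "azure.mgmt.compute", "google.cloud.monitoring_v3"):
    start = time.perf_counter()
    importlib.import_module(module_name)
    print(f"{module_name}: {time.perf_counter() - start:.2f}s")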


The Fix: Lazy Imports Everywhere

Every cloud SDK import had to move inside the function body that actually uses it:

# Before (module level — kills cold start)
import boto3

def scan_ec2_instances(region: str):
    client = boto3.client("ec2", region_name=region)
    ...

# After (lazy — imported only when the function is called)
def scan_ec2_instances(region: str):
    import boto3  # noqa: PLC0415
    client = boto3.client("ec2", region_name=region)
    ...

The # noqa: PLC0415 comment tells our linter (ruff) that yes, we know this import isn’t at the top of the file, and yes, it’s intentional.

Cold start after the fix: under 8 seconds.
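
One refinement on top of the pattern above (a suggestion, not a verbatim excerpt from our codebase): the lazy import and the client construction only need to happen once per process, so wrapping them in a cached helper keeps warm instances from paying even the small repeat cost.

from functools import lru_cache

@lru_cache(maxsize=None)
def _ec2_client(region: str):
    import boto3  # noqa: PLC0415  (lazy on purpose, see above)
    return boto3.client("ec2", region_name=region)

def scan_ec2_instances(region: str):
    client = _ec2_client(region)
    ...

The first task on a fresh instance still pays the import; every task after that reuses the same client.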


The Second-Order Problem: Test Patching

Lazy imports broke our tests in a non-obvious way.

When boto3 is imported at module level, the name becomes an attribute of the importing module, and that is where you patch it:

# When boto3 is at module level, this works:
with patch("app.services.aws_scanner.boto3") as mock_boto:
    ...

When the import is lazy (inside the function), that name never becomes an attribute of the module, so patching app.services.aws_scanner.boto3 fails with an AttributeError. You have to patch at the source:

# With lazy imports, patch at the source package:
with patch("boto3.client") as mock_client:
    ...

We had to audit and fix ~40 test files. Not painful, but time-consuming, and easy to get wrong.
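
Concretely, a test against the lazy version ends up looking like this. It is a sketch built on the scan_ec2_instances example from earlier; the module path and region are the illustrative ones already used in this post.

from unittest.mock import patch

from app.services.aws_scanner import scan_ec2_instances

def test_scan_ec2_instances_builds_regional_client():
    with patch("boto3.client") as mock_client:
        # The lazy `import boto3` inside the function still resolves to the
        # patched attribute, because patch mutates the real boto3 module.
        scan_ec2_instances("eu-west-1")
    mock_client.assert_called_once_with("ec2", region_name="eu-west-1")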


What We Put in the Rules

This was important enough to go into our CLAUDE.md — the instruction file that guides every engineer and AI agent working on the codebase:

Lazy imports: All cloud SDK imports (boto3, azure.mgmt.*, google.cloud.*) must be inside function bodies, not module level — Cloud Run cold-start SIGTERM at ~60s kills the worker otherwise. Test patches must target source package (patch('boto3.client') not patch('app.services.aws_scanner.boto3')).

It’s a rule now, not a reminder.
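
A rule in a doc only holds if something enforces it, so a check in CI is the natural follow-up. The sketch below is one way to do it, not our actual pipeline (the app/ package path is taken from the patch targets above): walk the source tree with ast and fail the build on any module-level import of a cloud SDK.

# ci/check_lazy_imports.py (illustrative): fail on module-level cloud SDK imports.
import ast
import pathlib
import sys

BANNED_PREFIXES = ("boto3", "azure.mgmt", "google.cloud")

def module_level_sdk_imports(path: pathlib.Path):
    tree = ast.parse(path.read_text())
    for node in tree.body:  # top-level statements only; function bodies are not visited
        if isinstance(node, ast.Import):
            names = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            names = [node.module or ""]
        else:
            continue
        yield from (name for name in names if name.startswith(BANNED_PREFIXES))

violations = [
    (path, name)
    for path in pathlib.Path("app").rglob("*.py")
    for name in module_level_sdk_imports(path)
]
for path, name in violations:
    print(f"{path}: module-level import of {name}")
sys.exit(1 if violations else 0)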


The Actual Tradeoff

What we gained from Cloud Run:

  • Dev environment costs dropped from ~$800/month to ~$60/month
  • Zero cluster maintenance overhead
  • Managed traffic splitting for canary deploys
  • Built-in HTTPS, IAM, and Secret Manager integration

What we gave up:

  • Long-running stateful workloads (Celery still needs a persistent worker process — we run that separately)
  • The Kubernetes ecosystem (Helm charts, operators, dashboards)
  • Control over the underlying infrastructure when things go wrong at the OS level

The hidden cost: Lazy imports everywhere. It’s a codebase-wide convention that every new developer has to learn. Kubernetes would have hidden this problem; Cloud Run forced us to solve it permanently.


The Broader Lesson

Your infrastructure choice rewrites your code architecture.

Kubernetes tolerates heavy module-level imports because containers stay warm. Cloud Run punishes them because it starts from zero. Neither is objectively correct — they’re optimised for different failure modes.

The lesson wasn’t “Cloud Run is better.” It was: understand what your platform penalises, and design your code to avoid it. When you don’t, the platform will make you understand eventually. Preferably in dev, not in production at 2am.


Fintropy is a multi-cloud FinOps platform in private beta. We help companies detect cloud waste, track SLA breaches, and automatically file credit claims across AWS, Azure, and GCP. Learn more at nuvikatech.com