What Happens When Your Auto-Filing System Fails at 2am

There’s a class of engineering decisions that only reveal their importance when something goes wrong at the worst possible time.

Fintropy’s auto-claim filing is one of them. When a cloud service recovers from an SLA breach, we resolve the breach record in our database, compute the credit value, and then automatically file the claim with the provider. All of this happens inside the breach resolution transaction.

The question is: what happens when the filing step fails?

The naive answer is to let the exception propagate and roll back the transaction. The breach record stays unresolved, the customer gets an error, and someone investigates.

This is wrong. Here’s why, and what we did instead.

The Constraint

Breach resolution and claim filing are not the same operation. They have different reliability characteristics:

Breach resolution is a database write. It’s deterministic. It either works or it doesn’t, and if it doesn’t, we know immediately. The inputs are fully controlled — we computed them ourselves.

Claim filing is an external API call to AWS, Azure, or GCP. It can fail for reasons entirely outside our control:

Customer is on AWS Basic support plan (API access requires Business or above)
GCP SDK not installed in this deployment
Azure service principal credentials misconfigured
Provider API is temporarily unavailable
Network timeout

If we tie these together in one transaction, an AWS support plan issue will prevent the breach from ever being marked “Resolved.” The customer’s breach record would be stuck in limbo indefinitely.
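To make the coupling concrete, here is a minimal sketch of what the single-transaction version would do when filing fails. Everything in it is a stand-in: FakeTransaction, file_claim, the dict-based breach record, and the "Open" starting status are hypothetical, not Fintropy's actual code.

```python
class FakeTransaction:
    """Hypothetical stand-in for a DB transaction: restores the record if the block raises."""

    def __init__(self, record: dict):
        self.record = record
        self._snapshot: dict = {}

    def __enter__(self) -> dict:
        self._snapshot = dict(self.record)
        return self.record

    def __exit__(self, exc_type, exc, tb) -> bool:
        if exc_type is not None:
            # Roll back: undo every change made inside the block.
            self.record.clear()
            self.record.update(self._snapshot)
        return False  # let the exception propagate, as the naive design would


def file_claim(record: dict) -> None:
    """Hypothetical provider call that fails for a reason outside our control."""
    raise PermissionError("support plan does not allow API access")


def resolve_coupled(record: dict) -> None:
    """Naive design: resolution and filing succeed or fail together."""
    with FakeTransaction(record) as rec:
        rec["status"] = "Resolved"
        file_claim(rec)  # raises, so the status change above is rolled back


breach = {"id": 1, "status": "Open"}  # "Open" is an assumed pre-resolution status
try:
    resolve_coupled(breach)
except PermissionError:
    pass
print(breach["status"])  # prints "Open"
```

The record never reaches "Resolved" even though the resolution write itself was fine. It will keep failing the same way until the support plan changes.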

The Solution: Total Error Containment

The auto-filing code is wrapped in a try/except that catches everything and never re-raises:

def _maybe_auto_file(self, breach: SLABreach) -> None:
    """Fire auto-claim filing. Errors are caught and logged — never propagated."""
    try:
        # ... check tenant settings, build connector, file claim ...

        if result.get("status") == "submitted":
            breach.status = "Filed"
            breach.claim_reference = case_id
            # ... update evidence ...
            create_alert(db=self.db, alert_type=AlertType.SLABreach,
                        title="SLA Claim Submitted", ...)

        elif result.get("status") == "assisted":
            breach.status = "Assisted Filing Required"
            create_alert(db=self.db, alert_type=AlertType.ActionRequired,
                        title="SLA Claim Requires Manual Filing", ...)

    except Exception as e:
        logger.warning("Auto-file failed for breach %s: %s", breach.id, e)
        breach.status = "Assisted Filing Required"
        try:
            create_alert(db=self.db, alert_type=AlertType.ActionRequired,
                        title="SLA Claim Requires Manual Filing",
                        message=f"Auto-filing failed: {breach.service}. Please file manually.")
        except Exception:
            pass  # Alert failure must also not propagate

Several things are deliberate here:

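The outer except Exception catches every ordinary error, including ConnectorBuildError, network errors, and RuntimeError. Only BaseException subclasses such as KeyboardInterrupt and SystemExit pass through, and those are meant to stop the process anyway.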

The inner try/except around create_alert is necessary because the alert service itself could fail (database connectivity, for example). By that point the filing failure has already been logged and the status already set, so we contain the alert error and move on. An alert failure must never undo or hide the breach status.

breach.status = "Assisted Filing Required" is set on any exception path. The breach is never left in a state where the auto-file was attempted but the outcome is unknown.
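One detail about the scope of except Exception is worth seeing in isolation. In Python it covers ordinary errors but not exceptions that derive directly from BaseException, such as KeyboardInterrupt and SystemExit. The sketch below uses hypothetical helper names (contained, raise_runtime, raise_interrupt) purely for illustration.

```python
def contained(action) -> str:
    """Run action; swallow ordinary errors the way an `except Exception` handler does."""
    try:
        action()
        return "ok"
    except Exception:
        return "contained"


def raise_runtime() -> None:
    raise RuntimeError("connector failed")


def raise_interrupt() -> None:
    raise KeyboardInterrupt()


print(contained(raise_runtime))  # prints "contained"

try:
    contained(raise_interrupt)
    escaped = False
except KeyboardInterrupt:
    # KeyboardInterrupt derives from BaseException, so the handler above did not catch it.
    escaped = True
print(escaped)  # prints True
```

This is usually the desired behavior: a shutdown signal should stop the worker rather than be converted into an "Assisted Filing Required" status.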

The State Machine

Every path out of “Resolved” leads to a defined, actionable state. There are no stuck states, no ambiguous outcomes.
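The transitions can be written down as a small mapping from the outcome of an auto-file attempt to the next status. This is a sketch, not the production code: next_status is a hypothetical helper, and sending unknown or missing result statuses to the assisted path is an assumption (the snippet above does not show that branch).

```python
from typing import Optional

RESOLVED = "Resolved"
FILED = "Filed"
ASSISTED = "Assisted Filing Required"


def next_status(result: Optional[dict], error: Optional[BaseException] = None) -> str:
    """Map the outcome of one auto-file attempt to the breach's next status."""
    if error is not None:
        return ASSISTED  # any exception path
    status = (result or {}).get("status")
    if status == "submitted":
        return FILED
    # "assisted" comes from the article; treating unknown or missing
    # statuses the same way is an assumption of this sketch.
    return ASSISTED


outcomes = [
    next_status({"status": "submitted"}),
    next_status({"status": "assisted"}),
    next_status({"status": "something-unexpected"}),
    next_status(None, error=TimeoutError("network timeout")),
]
print(outcomes)
print(all(s in (FILED, ASSISTED) for s in outcomes))  # prints True
```

No outcome of an attempt maps back to "Resolved", which is the property the section describes.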

The Assisted Filing Path

“Assisted Filing Required” isn’t a failure state — it’s a defined outcome. When automated filing isn’t possible (wrong support plan, missing credentials, provider doesn’t support API filing), Fintropy generates a pre-filled support case with:

The claim description ready to paste
Evidence attached (uptime metrics, incident timeline)
Direct link to the provider’s support portal
Filing deadline prominently displayed

For GCP, that deadline is 30 days. Fintropy sends a reminder alert as it approaches.

The customer can file manually in under 5 minutes using the generated materials. It’s not as seamless as auto-filing, but it’s dramatically better than starting from scratch.
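As a rough illustration, the generated materials could be modeled like this. The field names, the build_assisted_case helper, the 7-day reminder lead time, and counting the window from the resolution date are all assumptions of this sketch. Only the 30-day GCP window comes from the text above.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Filing windows in days. Only the GCP value comes from the article.
FILING_WINDOW_DAYS = {"gcp": 30}
REMINDER_LEAD_DAYS = 7  # hypothetical: how long before the deadline to remind


@dataclass
class AssistedCase:
    description: str      # claim text ready to paste
    evidence: list[str]   # uptime metrics, incident timeline
    portal_url: str       # link to the provider's support portal
    deadline: date        # last day to file
    reminder_on: date     # when to send the reminder alert


def build_assisted_case(provider: str, service: str, resolved_on: date,
                        evidence: list[str], portal_url: str) -> AssistedCase:
    """Assemble a pre-filled case; the deadline is counted from resolved_on (assumed)."""
    deadline = resolved_on + timedelta(days=FILING_WINDOW_DAYS[provider])
    return AssistedCase(
        description=f"SLA credit claim for {service}",
        evidence=evidence,
        portal_url=portal_url,
        deadline=deadline,
        reminder_on=deadline - timedelta(days=REMINDER_LEAD_DAYS),
    )


case = build_assisted_case(
    provider="gcp",
    service="Cloud SQL",
    resolved_on=date(2025, 1, 10),
    evidence=["uptime-metrics.csv", "incident-timeline.md"],
    portal_url="https://example.com/support",  # placeholder, not a real portal link
)
print(case.deadline)     # prints 2025-02-09
print(case.reminder_on)  # prints 2025-02-02
```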

What This Design Costs

Total error containment has a real tradeoff: it makes silent failures possible.

If the filing code has a bug that raises on every call, the exception is caught and every breach is set to “Assisted Filing Required” without anyone noticing that auto-filing never succeeds. The customer sees alerts, files manually, and everything appears to work, but the automation is broken.

We mitigate this with:

Logging: Every caught exception is logged with the breach ID and the full error message. We alert on “Auto-file failed” log lines in Cloud Monitoring.
Metrics: We track the ratio of Filed vs Assisted Filing Required status transitions. A sudden shift is a signal.
Integration tests: We test the full auto-file path in our own GCP environment with real credentials before every release.
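The ratio check can be sketched in a few lines. The function names and the thresholds (max_increase, min_samples) are hypothetical. A real deployment would tune them and read the status transitions from its own metrics store.

```python
from collections import Counter

FILED = "Filed"
ASSISTED = "Assisted Filing Required"


def assisted_share(statuses: list[str]) -> float:
    """Fraction of auto-file outcomes that fell back to assisted filing."""
    counts = Counter(statuses)
    total = counts[FILED] + counts[ASSISTED]
    return counts[ASSISTED] / total if total else 0.0


def looks_broken(baseline: list[str], recent: list[str],
                 max_increase: float = 0.3, min_samples: int = 5) -> bool:
    """Flag a sudden shift toward assisted filing. Thresholds are hypothetical."""
    if len(recent) < min_samples:
        return False  # too few outcomes to judge
    return assisted_share(recent) - assisted_share(baseline) > max_increase


baseline = [FILED] * 8 + [ASSISTED] * 2  # 20% assisted
recent = [ASSISTED] * 6                  # 100% assisted
print(looks_broken(baseline, recent))    # prints True
```

A check like this catches the failure mode described above: the exception handler hides individual errors, but it cannot hide the aggregate shift in outcomes.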

But fundamentally, we made a choice: a failed auto-file that leaves the breach in a defined state is better than a failed auto-file that rolls back the resolution and leaves the breach record stuck. The customer loses some automation value. The data stays correct.

The Principle

For operations that are side effects of primary operations:

The side effect must never be able to break the primary operation. If it fails, it should fail gracefully and leave the system in a defined, actionable state.

Filing a claim is a side effect of resolving a breach. The breach resolution is the primary operation. We designed accordingly.
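The same principle can be packaged as a reusable wrapper. The decorator below is a generic sketch, not something taken from Fintropy's codebase: contain_side_effect, mark_assisted, and the dict-based state are hypothetical names chosen for illustration.

```python
import functools
import logging
from typing import Callable, Optional

logger = logging.getLogger(__name__)


def contain_side_effect(on_failure: Optional[Callable[[Exception], None]] = None):
    """Decorator: the wrapped side effect may fail, but never raises to its caller."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception as exc:
                logger.warning("Side effect %s failed: %s", func.__name__, exc)
                if on_failure is not None:
                    try:
                        on_failure(exc)  # e.g. move the record to a fallback state
                    except Exception:
                        logger.exception("Fallback for %s also failed", func.__name__)
                return None
        return wrapper
    return decorator


state = {"status": "Resolved"}


def mark_assisted(exc: Exception) -> None:
    state["status"] = "Assisted Filing Required"


@contain_side_effect(on_failure=mark_assisted)
def file_claim() -> None:
    raise TimeoutError("provider API timed out")


file_claim()            # does not raise
print(state["status"])  # prints "Assisted Filing Required"
```

The fallback callback is what keeps containment from becoming silence: the primary operation survives, and the record still ends up in a state someone can act on.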

Fintropy is a multi-cloud FinOps platform in private beta. Learn more at nuvikatech.com
