Software shipping stories usually skip the interesting parts. You see the before and after, not the false starts, the design decisions that required three conversations to resolve, or the code review that caught a bug that would have silently written wrong data to production.

Here’s the real story of shipping Fintropy’s SLA claim filing module: 9 commits, 3 providers, one Cloud Scheduler job, and the bugs that nearly shipped.


What We Built

The claim filing module auto-files SLA credit claims with AWS, Azure, and GCP when a breach is detected. The full flow:

  1. Breach resolves → auto-file check runs
  2. Tenant settings checked (claim_filing_enabled, auto_file_claims)
  3. Provider connector built with customer credentials
  4. Claim filed via provider Support API
  5. Cloud Scheduler polls daily for approval/denial

Nothing in that list sounds complicated. The complexity is in the details.
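To make the flow concrete, here's a minimal sketch of steps 2-4. Every name in it (`load_tenant_settings`, `build_connector`, `FakeConnector`) is illustrative, not the real Fintropy API:

```python
def load_tenant_settings(tenant: dict) -> dict:
    # Step 2: read the two flags from tenant settings, defaulting to off.
    settings = tenant.get("settings") or {}
    return {
        "claim_filing_enabled": settings.get("claim_filing_enabled", False),
        "auto_file_claims": settings.get("auto_file_claims", False),
    }

class FakeConnector:
    def file_claim(self, breach: dict) -> dict:
        # Stand-in for the real provider Support API call (step 4).
        return {"status": "submitted", "provider": breach["provider"]}

def build_connector(provider: str, tenant: dict) -> FakeConnector:
    # Step 3: the real module builds this from customer credentials.
    return FakeConnector()

def maybe_auto_file(breach: dict, tenant: dict):
    s = load_tenant_settings(tenant)
    if not (s["claim_filing_enabled"] and s["auto_file_claims"]):
        return None  # tenant has not opted in; do nothing
    connector = build_connector(breach["provider"], tenant)
    return connector.file_claim(breach)
```

The details that make each step hard come later.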


The 9 Commits

Commit 1: ConnectorFactory

feat(claim-filing): add ConnectorFactory for all three providers

Before this, connector construction was scattered across two files with different credential-handling logic for each provider. The factory unified it:

  • AWS: reads credentials from subscription.auth_metadata
  • Azure: reads tenant_id from auth_metadata + service principal from env vars
  • GCP: reads service_account_key JSON from auth_metadata

Bug caught in code review: The initial implementation called strategy.authenticate(credentials) but ignored the return value. If authentication failed (wrong credentials, SDK unavailable), we’d return a broken strategy object with None clients instead of raising ConnectorBuildError. Fixed before merge.
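The shape of the fix, sketched with a stand-in strategy class (the real ones wrap the boto3, Azure, and GCP SDKs):

```python
class ConnectorBuildError(Exception):
    """Raised when a provider strategy fails to authenticate."""

class FakeStrategy:
    # Illustrative stand-in for a provider connector strategy.
    def __init__(self, should_succeed: bool):
        self.should_succeed = should_succeed

    def authenticate(self, credentials: dict) -> bool:
        return self.should_succeed

def build(provider: str, strategy, credentials: dict):
    # The review fix: authenticate() returns a success flag. Discarding it
    # (the original bug) hands back a strategy with None clients.
    if not strategy.authenticate(credentials):
        raise ConnectorBuildError(f"{provider}: authentication failed")
    return strategy
```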


Commit 2: TenantClaimSettings

feat(claim-filing): add TenantClaimSettings helper

A thin dataclass that reads claim_filing_enabled, auto_file_claims, and claim_filing_providers from Tenant.settings JSONB with safe defaults. 20 lines of code, 6 tests. Nothing surprising here — but critical to have as a well-tested primitive before building on top of it.
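Roughly this shape (field names from the post; the constructor name is an assumption):

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TenantClaimSettings:
    # Safe defaults: filing stays off unless the tenant explicitly opts in.
    claim_filing_enabled: bool = False
    auto_file_claims: bool = False
    claim_filing_providers: List[str] = field(default_factory=list)

    @classmethod
    def from_settings(cls, settings: Optional[dict]) -> "TenantClaimSettings":
        # Tolerates a missing or null JSONB settings column.
        s = settings or {}
        return cls(
            claim_filing_enabled=bool(s.get("claim_filing_enabled", False)),
            auto_file_claims=bool(s.get("auto_file_claims", False)),
            claim_filing_providers=list(s.get("claim_filing_providers", [])),
        )
```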


Commit 3: AWS service code expansion

feat(claim-filing): expand AWS service code map to 27 services

The AWS Support API requires specific service codes when creating cases. Our existing map had 8 entries. We expanded to 27: DynamoDB, ElastiCache, Redshift, SageMaker, API Gateway, SNS, SQS, Kinesis, Glue, Athena, EMR, OpenSearch, Batch, Step Functions, and aliases for common naming variants.

Bug noted but deferred: The map exists in two files (providers/aws.py and claim_filing.py). DRY violation. We noted it, decided the fix was lower priority than shipping, left a comment.
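An illustrative slice of the map with alias handling. The real AWS Support service codes come from the Support `DescribeServices` API; the literal code strings and alias entries below are assumptions, not the shipped values:

```python
AWS_SERVICE_CODES = {
    "dynamodb": "amazon-dynamodb",
    "sqs": "amazon-simple-queue-service",
    "sns": "amazon-simple-notification-service",
}

# Aliases for common naming variants, mapped to canonical keys.
ALIASES = {
    "simple queue service": "sqs",
    "simple notification service": "sns",
}

def resolve_service_code(service_name: str):
    key = service_name.strip().lower()
    key = ALIASES.get(key, key)
    # None signals "no mapping"; the caller can fall back to a generic code.
    return AWS_SERVICE_CODES.get(key)
```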


Commit 4: GCP get_claim_status()

feat(claim-filing): implement GCP get_claim_status via Cloud Support API v2

Replaced the stub that returned {"status": "unknown"} with a real implementation using google.cloud.support_v2.CaseServiceClient.get_case().

GCP state mapping:

  • SOLUTION_PROVIDED → approved
  • CLOSED → denied
  • anything else → pending

Bug caught in code review: The credentials fallback path (when _sa_credentials is None, fall back to parsing service_account_key from self.credentials) had no test. The code was correct; the test was missing. Added before merge.
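The state mapping above reduces to a small lookup with a "pending" default (a sketch; the helper name is illustrative):

```python
# Only these two GCP case states map to a terminal credit status;
# everything else normalises to "pending".
GCP_STATE_MAP = {
    "SOLUTION_PROVIDED": "approved",
    "CLOSED": "denied",
}

def map_gcp_state(state: str) -> str:
    return GCP_STATE_MAP.get(state, "pending")
```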


Commit 5: Auto-file hook

feat(claim-filing): add auto-file hook to BreachLifecycleService.resolve_breach

The most complex commit. Added _maybe_auto_file() to BreachLifecycleService — called at the end of every breach resolution, checks tenant settings, builds connector, files claim.

Design decision recorded here: The hook uses lazy imports for all sla_monitoring.* modules. This prevents circular imports between app.services and sla_monitoring. All test patches target the source module (sla_monitoring.connector_factory.build) not the local name.

Bug caught in code review: ClaimFilingService.file_claim() legitimately returns {"status": "assisted"} for AWS Basic plan accounts and GCP without Support SDK. The initial implementation treated everything except "submitted" as an error, which hit the exception handler and logged a misleading “Auto-file failed” warning for completely normal assisted-filing outcomes. Added an explicit elif result.get("status") == "assisted" branch.
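The corrected branching looks roughly like this (status strings from the post; the return labels and helper name are illustrative):

```python
import logging

logger = logging.getLogger("claim_filing")

def handle_file_result(result: dict) -> str:
    status = result.get("status")
    if status == "submitted":
        return "Filed"
    elif status == "assisted":
        # AWS Basic plan / GCP without the Support SDK: a normal outcome,
        # not a failure, so no misleading "Auto-file failed" warning.
        return "Assisted"
    else:
        logger.warning("Auto-file failed: %s", status)
        return "Failed"
```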


Commit 6: Poll endpoint

feat(claim-filing): add POST /api/sla/poll-claims endpoint for Cloud Scheduler

Protected by X-CloudScheduler-Token header checked against CLOUD_SCHEDULER_SECRET env var. Queries all “Filed” breaches across tenants with claim_filing_enabled: true, calls get_claim_status() per provider, updates on status change.
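The header check is a few lines. A sketch, framework-agnostic, with `hmac.compare_digest` for a constant-time comparison (the constant-time detail is our addition, not necessarily what shipped):

```python
import hmac
import os

def is_authorized(headers: dict) -> bool:
    # Reject requests that don't carry the shared Cloud Scheduler secret.
    expected = os.environ.get("CLOUD_SCHEDULER_SECRET", "")
    provided = headers.get("X-CloudScheduler-Token", "")
    # compare_digest avoids leaking the match position via timing.
    return bool(expected) and hmac.compare_digest(provided, expected)
```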

Bug caught in code review (critical): AWS and Azure get_claim_status() were returning raw provider status strings ("resolved", "closed"), not our normalised vocabulary ("approved", "denied"). The poll endpoint checked if credit_status == "approved", so "resolved" from AWS hit the else branch and marked the claim as "Credit Denied". Silent. Incorrect. Fixed in a follow-up commit.


Commit 7: Refactor _build_provider_connector

refactor(claim-filing): replace _build_provider_connector with ConnectorFactory for all providers

The existing _build_provider_connector in sla_monitoring.py only built Azure connectors (returned None for AWS/GCP, letting ClaimFilingService handle them internally). After the factory was built, we replaced the 45-line Azure-only function with a 15-line wrapper that handles all three providers.


Commit 8: Duplicate removal

fix(claim-filing): remove duplicate file_sla_claim from routes.py

A stale file_sla_claim endpoint existed in routes.py from an earlier iteration. It passed connector=None to ClaimFilingService — bypassing the factory entirely. We wrote a test that asserts the function no longer exists in routes.py (via AST parsing), deleted the duplicate, verified the test passes.
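The AST-based guard is worth showing because it is unusual. A sketch of the idea (helper names are ours):

```python
import ast

def module_function_names(source: str) -> set:
    # Collect every function name defined anywhere in the module source.
    tree = ast.parse(source)
    return {
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
    }

def assert_no_stale_endpoint(source: str) -> None:
    # Fails loudly if the duplicate endpoint ever reappears in routes.py.
    assert "file_sla_claim" not in module_function_names(source)
```

Unlike a grep, this survives reformatting and catches async redefinitions too.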


Commit 9: Status normalisation fix

fix(claim-filing): normalize AWS/Azure get_claim_status to approved/denied/pending/unknown

The fix for the critical bug in commit 6. Both AWS and Azure get_claim_status() now map their raw vocabulary to approved/denied/pending/unknown. The poll endpoint also gained a guard: if credit_status not in ("approved", "denied", "pending"): continue.
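Sketched, the fix plus guard look like this. The "resolved" → "approved" pair matches the bug described in commit 6; the other raw-to-normalised pairs are assumptions:

```python
RAW_STATUS_MAP = {
    "resolved": "approved",
    "rejected": "denied",
    "open": "pending",
}

def normalise_status(raw: str) -> str:
    # Anything we don't recognise becomes "unknown", never a guess.
    return RAW_STATUS_MAP.get(raw.strip().lower(), "unknown")

def poll_updates(claims):
    updates = []
    for claim in claims:
        credit_status = normalise_status(claim["raw_status"])
        if credit_status not in ("approved", "denied", "pending"):
            continue  # the commit-9 guard: never write "unknown" to the DB
        updates.append((claim["id"], credit_status))
    return updates
```

With both layers in place, an unrecognised provider string is skipped rather than silently recorded as a denial.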


The Bugs That Didn’t Ship

Three bugs were caught in code review before merging:

  1. ConnectorFactory ignoring authenticate() return value
  2. GCP missing credentials fallback test
  3. AWS/Azure get_claim_status() returning raw strings instead of normalised vocabulary

The third mattered most. It was a silent data corruption bug: it would have marked legitimate approved credits as denied, with no error log, no exception, no obvious signal. It was caught by a final code review that read the poll endpoint logic carefully and traced the flow from provider API to database write.

That’s the value of the two-stage review process. Not the obvious bugs — the ones that look correct at the function level but are wrong at the system level.


What We’d Do Differently

Extract the service code map earlier. The DRY violation in the AWS service code map (two identical 27-entry dicts) would have been free to avoid at the time of writing. We deferred it and now it requires touching two files whenever a new service is added.

Design the status vocabulary before the providers. We designed the poll endpoint first, then built the providers, then discovered the vocabulary mismatch. Starting from the vocabulary contract and building the providers to match it would have caught this at design time.


Fintropy is a multi-cloud FinOps platform in private beta. Learn more at nuvikatech.com