AI Infrastructure Cost Optimization

Your smartest engineering is creating your biggest cloud bill. We help cut AI infrastructure spend without slowing the team.

GPU rightsizing · Inference economics · Training spend · Vector DB & RAG cost shape

Book a 30-minute AI cost review → Read: The AI Cost Paradox

Microsoft ISV Program member · FOCUS-aligned FinOps · Fintropy: 497 scan rules · AWS · Azure · GCP · Inference & training

The AI cost surface - where the money actually goes

AI bills do not look like normal cloud bills. The cost surface is broader, the unit economics are stranger, and the leverage points are very different. Six categories explain almost every dollar.

Category 1

Inference

At steady state, inference is typically 80-95% of LLM TCO. Sustained spend, every request, often the largest line item by a wide margin.

Category 2

Training

Pre-training spikes and continual training. One-time bursts that anchor planning but rarely dominate long-term cost.

Category 3

Fine-tuning

LoRA, QLoRA, and full fine-tunes. Cheaper than pre-training but often re-run on every data refresh - costs compound when ungoverned.

Category 4

Vector DB & RAG

Embedding generation, vector storage, index serving, and retrieval. The silent line item that grows linearly with corpus size.

Category 5

Evaluation & observability

LLM-as-judge passes, eval harness runs, trace storage, and prompt logging. Often invisible until the bill arrives.

Category 6

Data movement

Egress between training storage and GPUs, cross-region embedding traffic, and pipeline shuffles. Cheap per GB, expensive at AI scale.

If your AI bill is dominated by a single category, optimization is straightforward. If it is spread evenly, you have a portfolio problem - that is what an assessment is for.

GPU rightsizing - A100, H100, L4, T4

Most teams over-pay for GPUs because they pick the newest part rather than the right one. The bottleneck - memory bandwidth, compute, or memory capacity - determines the answer. The workload type - inference, training, or fine-tuning - determines the purchase model.

L4 - inference workhorse

Cheap, memory-efficient inference for most production workloads up to roughly 7B parameters. Best price-per-token on serving for many use cases. Strong default for production inference.

A100 - the safe heavy lift

80GB variant is the workhorse for larger-model inference (13B-70B, with quantization or multiple cards at the top of that range) and most fine-tuning. MIG carving lets you serve multiple inference tenants per card.

H100 - only when needed

Worth the premium when memory bandwidth, FP8 precision, or speculative decoding meaningfully change throughput - typically frontier-model serving at high QPS.

T4 - legacy & light inference

Still cost-effective for small models, embeddings, and batch inference on quantized models. Often retired too early; sometimes retained too long.

Memory-bandwidth-bound vs compute-bound

LLM decoding at small batch sizes is almost always memory-bandwidth-bound, so what an H100 buys you there comes from its HBM3 bandwidth, not its raw FLOPS. Prefill and training are more compute-bound, so the A100/H100 comparison turns on FLOPS per dollar instead. Knowing which side your workload sits on determines whether the H100 upgrade pays for itself.
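
A back-of-envelope roofline check is one way to see which side a workload sits on. The sketch below is illustrative, not a benchmark: the class and function names are ours, the peak FLOPS and bandwidth figures are approximate public datasheet values (dense BF16) that you should replace with the numbers for your own SKU, and the intensity formula ignores KV-cache traffic.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Gpu:
    name: str
    peak_tflops: float      # dense BF16 TFLOPS (approximate datasheet value)
    bandwidth_tbps: float   # memory bandwidth in TB/s (approximate)

    @property
    def ridge_point(self) -> float:
        """FLOPs per byte above which the GPU is compute-bound."""
        return self.peak_tflops / self.bandwidth_tbps

def decode_intensity(batch_size: int, bytes_per_param: float = 2.0) -> float:
    """Rough arithmetic intensity (FLOPs per byte) of one decode step.

    Each step reads every weight once and does about 2 FLOPs per weight
    per sequence in the batch. KV-cache reads are ignored, so this
    overstates intensity for long contexts.
    """
    return 2.0 * batch_size / bytes_per_param

def bound(gpu: Gpu, intensity: float) -> str:
    return "compute-bound" if intensity >= gpu.ridge_point else "memory-bandwidth-bound"

A100 = Gpu("A100 80GB", peak_tflops=312.0, bandwidth_tbps=2.0)
H100 = Gpu("H100 SXM", peak_tflops=989.0, bandwidth_tbps=3.35)

for gpu in (A100, H100):
    for batch in (1, 32, 512):
        print(gpu.name, batch, bound(gpu, decode_intensity(batch)))
```

Under these assumptions, small-batch decoding lands well below the ridge point on both parts, and only very large batches cross it.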

Spot, on-demand, reserved

Training and async batch inference: spot or preemptible with checkpointing. Real-time inference: on-demand or reserved with autoscaling guardrails. Reserved capacity (1yr / 3yr) makes sense only when the baseline traffic is stable - which is rarely true in the first 12 months of a new product.

Inference cost levers - the 80% of your AI bill

Six levers, in roughly the order you should pull them. Most teams stop at one or two and leave 40-60% of inference cost on the table.

Model selection cascade

Frontier API for hard requests, mid-tier API for routine, fine-tuned open-weights for high-volume, distilled small model for trivial classification. Route on intent, not vibes. The single biggest lever for most workloads.
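
A minimal sketch of an intent-based cascade and its effect on blended cost. Everything here is hypothetical: the tier names, the per-token prices, the intent labels, and the traffic mix are placeholders for illustration, and the intent is assumed to come from an upstream classifier that is not shown.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_million_tokens: float  # hypothetical blended price

# Hypothetical tiers and prices, for illustration only.
TIERS = {
    "frontier": Tier("frontier-api", 15.00),
    "mid": Tier("mid-tier-api", 1.50),
    "finetuned": Tier("fine-tuned-open-weights", 0.40),
    "small": Tier("distilled-classifier", 0.05),
}

# Hypothetical intent labels produced by an upstream classifier.
ROUTES = {
    "classify": "small",
    "faq": "finetuned",
    "summarize": "mid",
    "multi_step_reasoning": "frontier",
}

def route(intent: str) -> Tier:
    """Pick a tier from the request intent; unknown intents fall back to the frontier tier."""
    return TIERS[ROUTES.get(intent, "frontier")]

def blended_cost(traffic_mix: dict[str, float]) -> float:
    """USD per million tokens for a mix given as {intent: share}, with shares summing to 1."""
    return sum(share * route(intent).usd_per_million_tokens
               for intent, share in traffic_mix.items())

mix = {"classify": 0.5, "faq": 0.3, "summarize": 0.15, "multi_step_reasoning": 0.05}
print(f"cascade: ${blended_cost(mix):.2f} per 1M tokens, "
      f"all-frontier: ${TIERS['frontier'].usd_per_million_tokens:.2f}")
```

Falling back to the most capable tier for unknown intents trades cost for safety; a real router would also need a quality check on the cheaper tiers.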

Batching

Dynamic (continuous) batching for serving (vLLM, TGI, TensorRT-LLM) can lift throughput 5-10x on the same hardware. Static batching leaves money on the table. Tail-latency tradeoffs are real but manageable.

Quantization

INT8 for most production inference, FP8 on H100 for the throughput win, AWQ and GPTQ for memory-bound serving. Quality cost is usually small and measurable - skipping quantization is a habit, not a choice.
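
The memory side of quantization is simple arithmetic. A rough sketch, counting weights only: KV cache, activations, and the scales and zero-points that quantized formats carry are excluded, and the function name is ours.

```python
# Bytes per parameter at each precision (4-bit is typical for AWQ and GPTQ).
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_memory_gb(params_billion: float, precision: str) -> float:
    """Weight-only footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]

for precision in BYTES_PER_PARAM:
    print(f"70B @ {precision}: {weight_memory_gb(70, precision):.0f} GB")
```

On this accounting a 70B model needs about 140 GB of weights at FP16, 70 GB at 8 bits, and 35 GB at 4 bits, which is what decides how many cards a deployment needs.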

Caching

Prompt cache for repeated system prompts, semantic cache for near-duplicate queries, KV cache reuse across conversation turns. For chatty applications, caching is often the highest-leverage lever after model selection.
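
The simplest of these to prototype is an exact-match response cache. The sketch below is a minimal illustration, not a semantic cache: it only catches prompts that are identical after whitespace and case normalization, and `fake_model` is a hypothetical stand-in for a billable model call.

```python
import hashlib
from collections import OrderedDict
from typing import Callable

class ExactMatchCache:
    """LRU cache keyed on a normalized (system prompt, user prompt) pair."""

    def __init__(self, max_entries: int = 10_000) -> None:
        self._store: OrderedDict[str, str] = OrderedDict()
        self._max = max_entries
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(system: str, user: str) -> str:
        # Lowercase and collapse whitespace in the user prompt before hashing.
        normalized = f"{system.strip()}\x00{' '.join(user.lower().split())}"
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_call(self, system: str, user: str,
                    call_model: Callable[[str, str], str]) -> str:
        key = self._key(system, user)
        if key in self._store:
            self.hits += 1
            self._store.move_to_end(key)      # mark as most recently used
            return self._store[key]
        self.misses += 1
        answer = call_model(system, user)
        self._store[key] = answer
        if len(self._store) > self._max:
            self._store.popitem(last=False)   # evict least recently used
        return answer

def fake_model(system: str, user: str) -> str:
    # Hypothetical stand-in for a real, billable model call.
    return f"answer to: {user}"

cache = ExactMatchCache()
cache.get_or_call("You are a helpful assistant.", "What is MIG?", fake_model)
cache.get_or_call("You are a helpful assistant.", "what is  MIG?", fake_model)
print(cache.hits, cache.misses)
```

The hit and miss counters are the useful part: the measured hit rate on real traffic tells you whether a semantic cache is worth the extra embedding cost.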

Speculative decoding

Draft model proposes, target model verifies. 2-3x throughput improvement on memory-bandwidth-bound decoding for compatible model pairs. H100 is where the math tends to work best.
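
Whether a model pair lands in that range depends mostly on how often the target accepts the draft's tokens. A rough sketch using the standard expected-value model for speculative decoding, assuming each drafted token is accepted independently with the same probability; the function names and the example inputs are ours.

```python
def expected_tokens_per_target_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target-model forward pass.

    Assumes each drafted token is accepted independently with probability
    `acceptance_rate`; `draft_len` is how many tokens the draft model
    proposes per round. The verify pass always contributes one token.
    """
    if not 0.0 <= acceptance_rate < 1.0:
        raise ValueError("acceptance_rate must be in [0, 1)")
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)

def speedup(acceptance_rate: float, draft_len: int, draft_cost_ratio: float) -> float:
    """Speedup over plain decoding.

    `draft_cost_ratio` is draft step time divided by target step time.
    """
    tokens = expected_tokens_per_target_pass(acceptance_rate, draft_len)
    return tokens / (draft_len * draft_cost_ratio + 1.0)

# Hypothetical pair: 80% acceptance, 4 drafted tokens, draft step at 5% of target step time.
print(f"{speedup(0.8, 4, 0.05):.2f}x")
```

With a lower acceptance rate or a slower draft model the same formula drops toward 1x, which is why the pairing needs measuring before it is rolled out.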

Structured output

JSON mode and constrained decoding can cut output tokens substantially vs free-form generation, because the model returns only the fields you asked for rather than surrounding prose. Often forgotten; almost always cost-positive when the use case supports it.

Training cost control - spikes, not steady state

Training is the line item that scares the CFO but rarely dominates the bill. The leverage is in not over-paying for spikes - and in remembering that the data pipeline is often more expensive than the GPUs.

Spot / preemptible with checkpointing

For non-critical phases (hyperparameter search, ablations, exploration), spot can be 60-80% cheaper. The cost is restart overhead - minimized with frequent checkpointing and resilient training loops.
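
The trade-off between the spot discount and restart overhead can be estimated before a job is launched. The sketch below uses a deliberately simplified model with hypothetical prices and interruption rates: each interruption is assumed to lose half a checkpoint interval of work plus a fixed restart time, and the time spent writing checkpoints is ignored.

```python
def spot_effective_cost(
    base_hours: float,
    spot_hourly: float,
    interruptions_per_hour: float,
    checkpoint_interval_hours: float,
    restart_hours: float,
) -> float:
    """Approximate cost of a job on spot capacity.

    Simplified model: each interruption loses half a checkpoint interval
    of work on average, plus a fixed restart time. Interruptions during
    the redone work and checkpoint-write time are ignored.
    """
    interruptions = base_hours * interruptions_per_hour
    lost_hours = interruptions * (checkpoint_interval_hours / 2.0 + restart_hours)
    return (base_hours + lost_hours) * spot_hourly

# Hypothetical numbers: 100 GPU-hours, $4.00/h on-demand, $1.20/h spot,
# one interruption every 10 hours, checkpoint every 30 minutes, 15-minute restart.
on_demand = 100 * 4.00
spot = spot_effective_cost(100, 1.20, 0.1, 0.5, 0.25)
print(f"on-demand ${on_demand:.0f}, spot ${spot:.0f}")
```

Rerunning it with a two-hour checkpoint interval shows how quickly infrequent checkpointing eats into the discount.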

Gradient checkpointing

Trade compute for memory - fit a larger model on smaller GPUs instead of moving up to a more expensive instance. The default for most fine-tuning workflows that hit memory ceilings.

Mixed precision (BF16 / FP16)

BF16 is the default for modern training. It roughly halves activation memory, accelerates compute on supported hardware, and rarely costs accuracy. If you are still on FP32, the audit answer is straightforward.

ZeRO-3 / FSDP sharded data parallelism

Shard parameters, gradients, and optimizer states across GPUs to fit larger models without renting bigger boxes. The cluster topology and interconnect speed determine whether the savings actually materialize.
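
A rough way to see what sharding buys is to count model-state memory per GPU at each ZeRO stage. The sketch below uses the common 16-bytes-per-parameter accounting for mixed-precision Adam; activations, communication buffers, and fragmentation are left out, and the function name is ours.

```python
def training_state_gb_per_gpu(params_billion: float, num_gpus: int, zero_stage: int) -> float:
    """Model-state memory per GPU in GB for mixed-precision Adam.

    Per parameter: 2 bytes (FP16/BF16 weights) + 2 bytes (gradients)
    + 12 bytes (FP32 master weights, momentum, variance).
    Activations, buffers, and fragmentation are not included.
    """
    weights, grads, optim = 2.0, 2.0, 12.0
    if zero_stage >= 1:
        optim /= num_gpus      # stage 1 shards optimizer states
    if zero_stage >= 2:
        grads /= num_gpus      # stage 2 also shards gradients
    if zero_stage >= 3:
        weights /= num_gpus    # stage 3 also shards parameters
    return params_billion * (weights + grads + optim)

for stage in (0, 1, 2, 3):
    print(f"7B on 8 GPUs, ZeRO-{stage}: "
          f"{training_state_gb_per_gpu(7, 8, stage):.1f} GB per GPU")
```

The memory saving is only half the picture: stage 3 adds communication on every forward and backward pass, which is where interconnect speed comes in.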

Data pipeline efficiency

The often-overlooked cost: data loading, augmentation, tokenization, and shuffle pipelines burning CPU and network while GPUs idle. Profiling GPU utilization often reveals more savings than tuning the model.

Smarter hyperparameter search

Bayesian optimization, multi-fidelity, and early-stopping cut search budgets by 3-10x vs grid search. Most teams default to grid and never revisit.
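
To make the budget difference concrete, here is a small sketch comparing grid search with successive halving, one common early-stopping scheme. The configuration counts are hypothetical, and the model is pessimistic: it assumes every rung retrains from scratch rather than resuming from a checkpoint.

```python
def successive_halving_budget(num_configs: int, min_epochs: int, eta: int = 3) -> int:
    """Total epochs spent by successive halving.

    Each rung trains the surviving configs for that rung's epoch budget,
    then keeps the top 1/eta. Pessimistically assumes every rung retrains
    from scratch (no checkpoint resume).
    """
    total, configs, epochs = 0, num_configs, min_epochs
    while configs >= 1:
        total += configs * epochs
        if configs == 1:
            break
        configs //= eta
        epochs *= eta
    return total

def grid_budget(num_configs: int, max_epochs: int) -> int:
    """Grid search trains every config for the full epoch budget."""
    return num_configs * max_epochs

sh = successive_halving_budget(27, 1)   # rungs: 27x1, 9x3, 3x9, 1x27
grid = grid_budget(27, 27)
print(sh, grid, round(grid / sh, 2))
```

In this toy setup the early-stopping schedule spends 108 epochs where the grid spends 729.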

Vector DB & RAG - the silent line item

RAG looked cheap when the corpus was 100K documents. At 10M, the bill changes character. The choice of vector database, index quantization, and retrieval strategy is now a real cost decision - not a developer-experience preference.

pgvector (Postgres)

Best when your data already lives in Postgres and the corpus is moderate (sub-10M vectors). One system to operate; one bill to read. Loses to specialized DBs at very large scale or extreme QPS.

Pinecone

Managed, fast, simple. Premium pricing pays for operational simplicity. Best when team velocity matters more than per-vector cost.

Weaviate

Strong hybrid search, GraphQL surface, multi-modal. Self-hosted gets cheap at scale; managed sits between Pinecone and self-hosted Qdrant on price.

Qdrant

Excellent self-hosted economics at scale. Rust-based, memory-efficient, strong filtering. The right call when per-vector cost dominates the decision.

Milvus

Distributed, billion-vector-class. The right answer at very large scale; operational overhead is real and requires investment.

PQ & scalar quantization

Product quantization cuts vector storage 4-16x with measurable but small recall cost. Scalar quantization is the easier win. Almost every RAG estate over 1M vectors is over-paying without it.
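
The storage arithmetic is easy to check for your own corpus. A minimal sketch, assuming 768-dimensional float32 embeddings and a hypothetical 10M-vector corpus; index overhead, metadata, and the shared PQ codebooks are not counted.

```python
def raw_bytes(dims: int) -> int:
    """float32 vector: 4 bytes per dimension."""
    return dims * 4

def scalar_quantized_bytes(dims: int) -> int:
    """int8 scalar quantization: 1 byte per dimension."""
    return dims

def pq_bytes(num_subvectors: int, bits_per_code: int = 8) -> int:
    """Product quantization: one code per subvector (shared codebooks not counted)."""
    return num_subvectors * bits_per_code // 8

dims, vectors = 768, 10_000_000
for label, size in [
    ("float32", raw_bytes(dims)),
    ("int8 scalar", scalar_quantized_bytes(dims)),
    ("PQ, 192 subvectors", pq_bytes(192)),
]:
    print(f"{label}: {size * vectors / 1e9:.2f} GB "
          f"({raw_bytes(dims) / size:.0f}x smaller)")
```

Fewer subvectors compress harder but cost more recall, so the setting should be chosen against a recall benchmark rather than storage alone.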

Hybrid retrieval (sparse + dense)

BM25 + dense vectors with reciprocal-rank-fusion often reduces index size and improves recall simultaneously. The dense index gets to be smaller because sparse handles the exact-match queries that the dense index never should have had to serve.
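
Reciprocal rank fusion itself is only a few lines. A minimal sketch, assuming each retriever returns an ordered list of document IDs; the IDs below are hypothetical, and k=60 is the conventional default constant rather than a tuned value.

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked lists of doc IDs.

    Each list contributes 1 / (k + rank) to a document's score,
    with rank starting at 1.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by score descending, then doc ID for a deterministic order on ties.
    return sorted(scores, key=lambda d: (-scores[d], d))

bm25_hits = ["doc_a", "doc_b", "doc_c"]     # hypothetical sparse results
dense_hits = ["doc_b", "doc_d", "doc_a"]    # hypothetical dense results
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))
```

Because the fusion uses ranks rather than raw scores, BM25 and cosine similarity never need to be put on a common scale.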

Re-ranking with smaller models

A cross-encoder re-ranker over the top-50 retrieved candidates filters out irrelevant docs before the expensive generation step runs. Cheap to add, often the difference between "good RAG" and "expensive RAG".

Assessment scope & 90-day roadmap

What we evaluate, what you get, and how the first 90 days are structured.

What we evaluate

• Model serving topology - engines, batching strategy, autoscaling
• GPU utilization patterns - idle ratios, MIG opportunities, fleet sizing
• Inference latency-vs-cost frontier - current point, achievable point
• Model selection cascade - which requests deserve which model
• RAG index hit rate, recall, and re-ranker economics
• Training job efficiency - utilization, spot uptake, pipeline bottlenecks
• Vector DB cost per million vectors at current and projected scale
• Observability cost - trace storage, eval harness runs, prompt logs

What you get

• Baseline AI infrastructure cost by category and workload
• Prioritized backlog with $$ impact estimates per item
• Risk notes - accuracy regression risk, latency regression risk
• Owner-mapped 90-day roadmap aligned with your engineering rhythm
• Recommended monitoring + budget guardrails to prevent regression
• Optional execution support post-roadmap (retainer)

Deep dive · 5,000 words

The AI Cost Paradox: Why Your Smartest Engineering Creates Your Biggest Bills

A long-form walk through the seven optimization strategies that cut AI infrastructure cost without sacrificing accuracy. Required reading before any AI cost review.

Read the deep dive →

How an AI cost engagement works

Four weeks for assessment - one extra week vs. our cloud engagements, because AI workloads need benchmarking time the dashboards do not provide.

Step 1 - Week 1

Discovery & baseline

Read-only access. Cost surface mapping. Workload inventory. Cloud bill correlation.

Step 2 - Weeks 2-3

Benchmarking

Inference profiling, GPU utilization analysis, quantization quality sweeps, RAG index economics.

Step 3 - Week 4

Roadmap

Prioritized 90-day backlog. Owners. Estimated savings. Accuracy risk notes.

Step 4 - Ongoing

Execution

Retainer for execution support. Monthly reporting against baseline.

Engagement pricing

Indicative ranges, scoped per engagement. Fixed-fee for assessments; retainer for execution.

Diagnostic

AI Infrastructure Cost Diagnostic

2 weeks · fixed-fee

$10K – $20K

Assessment

AI Infrastructure Cost Assessment

4 weeks · fixed-fee

$20K – $60K

Retainer

Ongoing AI FinOps retainer

Monthly · execution support

from $8K / month

Indicative - scoped per engagement. INR pricing for India-headquartered clients. See all services →

Common questions

What does AI cost optimization consulting cover?

GPU rightsizing (A100, H100, L4, T4 selection and MIG carving), inference cost reduction (batching, quantization, model selection cascade, caching), training spend control (spot instances, gradient checkpointing, mixed precision), vector DB and RAG infrastructure tuning (pgvector vs Pinecone vs Weaviate vs Qdrant, PQ quantization, hybrid retrieval), and observability cost. We do not cover labour or organizational cost - only infrastructure. If you are searching for "AI labour cost optimization", that is HR territory, not ours.

How much can AI infrastructure cost optimization save?

Honest framing: 30-60% on inference is realistic with disciplined batching, quantization, and a model selection cascade. 15-30% on training spend with spot instances, gradient checkpointing, and better hyperparameter search. Vector DB and RAG often 40-70% via product quantization and hybrid retrieval. Specifics depend entirely on your current baseline - we will not promise a number before assessment.

Should we serve our own LLM or use an API for our use case?

It depends on volume, latency requirements, and customization needs. Below roughly 5M tokens per day, a managed API is usually cheaper than self-hosting once you account for GPU utilization gaps. Above roughly 50M tokens per day, self-hosting often wins - sometimes by 5-10x. The middle band requires modeling against your specific traffic shape, latency SLO, and model size. We do that modeling as part of the assessment.
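
A first-pass version of that modeling is a simple break-even calculation. The sketch below is illustrative only: the GPU count, hourly rate, and API price are hypothetical inputs, and it ignores engineering time, redundancy, and whether the GPUs can actually serve the volume within your latency SLO.

```python
def api_cost_per_day(tokens_per_day: float, usd_per_million_tokens: float) -> float:
    """Managed API: cost scales with tokens processed."""
    return tokens_per_day / 1e6 * usd_per_million_tokens

def self_host_cost_per_day(num_gpus: int, usd_per_gpu_hour: float) -> float:
    """Self-hosted: GPUs are billed around the clock whether or not traffic fills them."""
    return num_gpus * usd_per_gpu_hour * 24

def break_even_tokens_per_day(num_gpus: int, usd_per_gpu_hour: float,
                              usd_per_million_tokens: float) -> float:
    """Daily token volume above which self-hosting is cheaper than the API."""
    return self_host_cost_per_day(num_gpus, usd_per_gpu_hour) / usd_per_million_tokens * 1e6

# Hypothetical inputs: 2 GPUs at $2.50/h versus an API at $3.00 per 1M tokens.
print(f"{break_even_tokens_per_day(2, 2.50, 3.00) / 1e6:.0f}M tokens/day")
```

The result moves a lot with the inputs, which is why the middle band needs real traffic data rather than a rule of thumb.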

Which GPU should we use for inference - A100, H100, or L4?

L4 is the inference workhorse for most production workloads up to roughly 7B parameters - cheap, memory-efficient, and widely available. A100 is the right call for larger-model inference and most training. H100 is only worth the premium when memory bandwidth, FP8, or speculative decoding meaningfully change throughput - typically high-throughput frontier-model serving. MIG carving on A100 or H100 can change the math for multi-tenant inference.

Do you do AI cost optimization for Indian companies?

Yes. We are based in Thane, Mumbai, and we work with Indian companies running AI workloads on AWS, Azure, and GCP - the same playbook applies. India-headquartered clients get INR billing and on-the-ground delivery across Mumbai, Pune, Bangalore, Delhi NCR, Hyderabad, and Chennai.

Is "AI labour cost optimization" something you do?

No. Labour and organizational cost optimization sits with HR and operating-model consultants. We focus exclusively on AI infrastructure cost: GPUs, inference, training, vector databases, and observability. We have seen the search query and want to be clear about scope so we do not waste your time.

Book a 30-minute AI cost review

4-week assessment. Fixed-fee. Read-only access. INR-priced for Indian clients.

Schedule a Call