The landscape for customizing and deploying AI models changed significantly through 2024 and 2025. Major platforms launched or expanded fine-tuning capabilities, deprecated older APIs, and consolidated around deployment patterns that balance managed convenience against self-hosted control. Understanding which models you can actually fine-tune, where those capabilities live across cloud providers, and how self-hosted serving infrastructure evolved clarifies which approaches fit specific workflows in 2026.
This guide examines what changed recently, which managed fine-tuning options exist today, and where self-hosted deployment tooling stands for teams that need infrastructure control or cost optimization at scale.
Managed Fine-Tuning: Where Capabilities Moved
The clearest pattern is platform consolidation. Fine-tuning moved from experimental APIs to production tiers with clearer model lifecycle management and tighter integration into agent platforms.
OpenAI: GPT-4o Fine-Tuning and Responses API Integration
Best for: teams that need brand voice, structured output format, or domain-specific instruction following on GPT-4 class models and plan to integrate with Responses API agent workflows.
Trade-off: training examples are truncated if they exceed token limits; you manage dataset preparation and prompt design outside the fine-tuning interface.
OpenAI announced GPT-4o fine-tuning on August 20, 2024. Fine-tunable base models now include gpt-4o-2024-08-06, gpt-4o-mini-2024-07-18, gpt-4.1-mini-2025-04-14, and gpt-4.1-nano-2025-04-14. The platform's best-practices documentation explicitly notes that training examples are truncated if they exceed context or example token limits, which means dataset quality and example length matter more than dataset size alone. Teams building agentic systems should understand that fine-tuned models integrate with the Responses API for tool calling and persistent conversations, replacing the deprecated Assistants API architecture.
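As a minimal sketch, launching a fine-tuning job against one of these snapshots looks like the following with the OpenAI Python SDK; the dataset file name and suffix are hypothetical, and each JSONL line holds one chat-format example.

```python
# Sketch: launch a GPT-4o-mini fine-tuning job with the OpenAI Python SDK.
# The dataset path and suffix are hypothetical placeholders; overlong
# examples are truncated by the platform, so keep examples within limits.
from openai import OpenAI

client = OpenAI()

# Upload the chat-format JSONL training set.
training_file = client.files.create(
    file=open("support_tone.jsonl", "rb"),
    purpose="fine-tune",
)

# Create the fine-tuning job against a fine-tunable snapshot.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini-2024-07-18",
    training_file=training_file.id,
    suffix="support-tone",  # appended to the resulting model name
)
print(job.id, job.status)
```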
Google: Vertex AI Exclusive for Gemini Tuning
Best for: teams already using Google Cloud who need function calling tuning or supervised fine-tuning on Gemini 2.0 Flash or Gemini 2.5 variants.
Trade-off: the Gemini API no longer supports fine-tuning at all; you must use Vertex AI, which introduces GCP infrastructure dependencies.
Google deprecated Gemini 1.5 Flash-001 in May 2025, eliminating the last model in the Gemini API that supported fine-tuning. Fine-tuning capabilities now live exclusively in Vertex AI. Gemini 2.0 Flash tuning became generally available on March 11, 2025, including support for tuning function calling. Supervised fine-tuning for Gemini 2.5 Flash-Lite and Gemini 2.5 Pro launched on August 8, 2025. Vertex AI also publishes model lifecycle tables with explicit release and retirement dates—for example, gemini-2.5-pro was released June 17, 2025 and is scheduled for retirement June 17, 2026. This lifecycle transparency helps teams plan migration timelines, though it also signals that model versions are deprecated faster than many teams assume.
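A hedged sketch of supervised tuning through the Vertex AI Python SDK might look like this; the project, region, GCS dataset path, and display name are all placeholders.

```python
# Sketch: supervised fine-tuning of Gemini 2.0 Flash on Vertex AI.
# Project, region, and the GCS dataset path are placeholders.
import vertexai
from vertexai.tuning import sft

vertexai.init(project="my-project", location="us-central1")

tuning_job = sft.train(
    source_model="gemini-2.0-flash-001",
    train_dataset="gs://my-bucket/tuning/train.jsonl",
    tuned_model_display_name="support-extraction-v1",
)
print(tuning_job.resource_name)
```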
AWS Bedrock: Classic and Reinforcement Fine-Tuning
Best for: teams using Llama 3.2 models (including vision-capable 11B and 90B variants) or experimenting with reinforcement tuning for judgment and preference tasks.
Trade-off: reinforcement fine-tuning initially launched for Nova 2 Lite; broader model support is rolling out, and the approach requires defining reward functions or judges.
AWS Bedrock added fine-tuning support for Meta Llama 3.2 models on March 14, 2025, including vision-capable variants 11B and 90B. On December 3, 2025, Bedrock launched reinforcement fine-tuning, claiming 66% average accuracy gains over base models. This approach uses prompts plus reward functions or AI judges rather than large labeled datasets, positioning it for teams that want to shape model judgment on code generation correctness, math reasoning, or customer support quality without manual labeling overhead. Bedrock also supports Custom Model Import, allowing teams to bring their own tuned Llama, Mistral, or Flan T5 models for managed serving via Bedrock APIs.
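A classic supervised customization job on Bedrock can be submitted with boto3 roughly as follows; the role ARN, base model identifier, S3 URIs, and hyperparameters are placeholders to adapt.

```python
# Sketch: a Bedrock fine-tuning (model customization) job via boto3.
# Role ARN, base model identifier, and S3 URIs are placeholders.
import boto3

bedrock = boto3.client("bedrock", region_name="us-east-1")

response = bedrock.create_model_customization_job(
    jobName="llama32-support-ft",
    customModelName="llama32-support-v1",
    roleArn="arn:aws:iam::123456789012:role/BedrockFineTuneRole",
    baseModelIdentifier="meta.llama3-2-11b-instruct-v1:0",
    customizationType="FINE_TUNING",
    trainingDataConfig={"s3Uri": "s3://my-bucket/train.jsonl"},
    outputDataConfig={"s3Uri": "s3://my-bucket/output/"},
    hyperParameters={"epochCount": "2"},
)
print(response["jobArn"])
```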
When Fine-Tuning Makes Sense and When It Doesn't
The industry pattern is prompting first, then retrieval-augmented generation, then fine-tuning only when necessary. Fine-tuning works best for consistent output format enforcement, brand voice alignment, or domain-specific instruction following where retrieval can't inject the needed behavior. Teams fine-tune for customer support classification with specific labels, CRM field extraction into fixed schemas, e-commerce listing normalization, or compliance-style rewriting where the model must follow internal policy language patterns. The constraint is that factual knowledge should live in retrieval rather than weights—fine-tuning a model on your product catalog is less effective than retrieving current catalog data at inference time and letting the model reason over it.
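To make the "format enforcement, not facts" distinction concrete, here is what a single supervised training example for fixed-schema CRM extraction might look like; the schema and field names are hypothetical.

```python
# Sketch: one chat-format training example for fixed-schema CRM extraction.
# The schema and field names are hypothetical; knowledge of current catalog
# or account data would still come from retrieval at inference time.
import json

example = {
    "messages": [
        {"role": "system", "content": "Extract CRM fields as JSON with keys: company, contact, deal_stage."},
        {"role": "user", "content": "Spoke with Dana Reyes at Acme Corp; they're ready for contract review."},
        {"role": "assistant", "content": json.dumps({
            "company": "Acme Corp",
            "contact": "Dana Reyes",
            "deal_stage": "contract_review",
        })},
    ]
}

# Append one record per line to the JSONL training set.
with open("crm_extraction.jsonl", "a") as f:
    f.write(json.dumps(example) + "\n")
```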
AWS Bedrock's reinforcement fine-tuning introduces a newer pattern for judgment tasks. Instead of collecting thousands of labeled examples showing correct support responses, you define a reward function that scores response quality or use an AI judge to rank outputs. This is positioned for code generation preference shaping, math reasoning correctness incentives, or customer support tone and completeness ranking. The trade-off is that defining good reward functions requires understanding what quality means for your task, and poorly designed rewards can push models toward unexpected behavior.
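The shape of a programmatic reward is easier to see in code. The sketch below is purely illustrative and is not Bedrock's actual reward interface; it scores a support reply on output validity instead of relying on labeled examples.

```python
# Illustrative only: the shape of a programmatic reward for reinforcement
# fine-tuning. Bedrock's real reward/judge interface differs; this just
# shows scoring an output rather than labeling a dataset.
import json

def reward(prompt: str, completion: str) -> float:
    """Score a support reply: valid JSON plus required fields present."""
    try:
        parsed = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # malformed output earns no reward
    score = 0.5  # base credit for valid JSON
    if {"answer", "escalate"} <= parsed.keys():
        score += 0.5  # full credit when required fields are present
    return score
```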
Self-Hosted Serving: vLLM, KServe, TensorRT-LLM
Teams deploying models on their own infrastructure choose between platform-layer serving stacks and lower-level performance tooling. The three most active projects in early 2026 are vLLM for general-purpose serving, KServe for Kubernetes-native deployment, and TensorRT-LLM for performance optimization.
vLLM announced its V1 architecture alpha on January 27, 2025. The redesign unifies scheduling of prompt and generated tokens, supports chunked prefills and prefix caching, and targets zero-overhead prefix caching for repeated prompts. An April 2025 article described vLLM 0.8.x switching to the V1 engine with prefix caching enabled by default and multimodal preprocessing moved to a separate process that caches at multiple levels. For teams running high-volume inference where repeated prompt prefixes are common—customer support templates, code generation with shared context—prefix caching reduces latency and cost by avoiding re-computation.
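A minimal vLLM serving sketch with prefix caching looks like this; the model name and prompts are placeholders, and recent versions enable prefix caching by default, so the flag is shown only for clarity.

```python
# Sketch: serving with vLLM and prefix caching. The model name is a
# placeholder; recent vLLM versions enable prefix caching by default.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct", enable_prefix_caching=True)

shared_prefix = "You are a support agent for Acme. Policy: ..."  # reused across requests
params = SamplingParams(max_tokens=256, temperature=0.2)

# Later calls reuse cached KV blocks for the shared prefix.
for question in ["How do I reset my password?", "Where is my invoice?"]:
    out = llm.generate([shared_prefix + "\n" + question], params)
    print(out[0].outputs[0].text)
```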
KServe v0.15 launched May 27, 2025 with an explicit generative AI serving focus. The release added Envoy AI Gateway integration for traffic management, multi-node inference, LLM autoscaling with KEDA, distributed KV cache with LMCache, and an upgrade of the vLLM backend to 0.8.5. The December 2024 v0.14 release introduced a model cache and Hugging Face Hub downloads via hf:// storageUri, simplifying artifact loading. For teams deploying on Kubernetes who need autoscaling, gateway routing, and cached model distribution, KServe provides a platform layer above raw inference servers.
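As a sketch, an InferenceService pulling weights via hf:// can be created from Python with the kserve client; the name, namespace, and model URI are placeholders, and most teams apply the equivalent YAML manifest with kubectl instead.

```python
# Sketch: deploying a Hugging Face model through KServe's Python client.
# Name, namespace, and the hf:// URI are placeholders.
from kubernetes import client as k8s
from kserve import (KServeClient, constants, V1beta1InferenceService,
                    V1beta1InferenceServiceSpec, V1beta1PredictorSpec,
                    V1beta1ModelSpec, V1beta1ModelFormat)

isvc = V1beta1InferenceService(
    api_version=constants.KSERVE_GROUP + "/v1beta1",
    kind="InferenceService",
    metadata=k8s.V1ObjectMeta(name="llama-chat", namespace="default"),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            model=V1beta1ModelSpec(
                model_format=V1beta1ModelFormat(name="huggingface"),
                storage_uri="hf://meta-llama/Llama-3.2-1B-Instruct",
            )
        )
    ),
)
KServeClient().create(isvc)  # uses the local kubeconfig
```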
NVIDIA TensorRT-LLM focuses on performance primitives. Recent release notes document KV cache connector APIs for state transfer in disaggregated serving, KV cache reuse and offloading with salting for secure reuse, speculative decoding improvements, and LoRA-related engine loading bug fixes. A May 2025 security bulletin described HMAC-enabled pickle serialization for socket-based IPC, which matters for production deployments where model serving processes communicate across network boundaries. For teams optimizing inference latency or building disaggregated architectures where prompt processing and token generation run on separate hardware, TensorRT-LLM's low-level control justifies the integration complexity.
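A hedged sketch of TensorRT-LLM's high-level LLM API with KV cache block reuse enabled follows; the model name is a placeholder, and disaggregated serving or custom engine builds require considerably more configuration than shown.

```python
# Sketch: TensorRT-LLM's high-level LLM API with KV cache block reuse.
# The model name is a placeholder; low-level engine tuning is out of scope.
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="meta-llama/Llama-3.2-3B-Instruct",
    kv_cache_config=KvCacheConfig(enable_block_reuse=True),
)
outputs = llm.generate(["Summarize our refund policy:"],
                       SamplingParams(max_tokens=128))
print(outputs[0].outputs[0].text)
```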
Use-Case Workflow Mapping
Understanding which workflows benefit most from fine-tuning versus which are better served by prompting or retrieval clarifies where to invest effort.
Customer support chatbots: fine-tune for response style, escalation patterns, and strict answer formats; deployment benefits from low-latency serving and prefix caching for repetitive FAQ patterns.
Sales and CRM automation: fine-tune for lead qualification rubrics and structured extraction into CRM fields.
Document workflows: fine-tune for contract clause extraction or document classification; optimize deployment for batch throughput when latency matters less than volume.
Developer tools: fine-tune for code review style, lint explanations, or repo-specific conventions; deployment requires long-context models and inference servers fast enough for the subsecond responses developers expect.
For teams evaluating managed versus self-hosted deployment, the decision hinges on whether you need control over inference costs at scale, on-premises deployment for compliance, or deep integration with internal systems. Managed platforms like OpenAI's API, Google Vertex AI, or AWS Bedrock simplify deployment and provide predictable pricing tiers, but they introduce per-token costs that can become expensive at high volume. Self-hosted serving using vLLM, KServe, or TensorRT-LLM eliminates per-request charges but requires GPU infrastructure, ongoing maintenance, and expertise to optimize performance.
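A back-of-envelope break-even calculation makes the trade-off concrete; every number below is an illustrative placeholder, not current vendor pricing.

```python
# Back-of-envelope break-even sketch. All numbers are illustrative
# placeholders, not current vendor pricing.
tokens_per_month = 5_000_000_000        # total input + output tokens
managed_price_per_mtok = 5.00           # $ per million tokens (placeholder)
gpu_cost_per_month = 25_000.00          # self-hosted GPUs + ops (placeholder)

managed_cost = tokens_per_month / 1_000_000 * managed_price_per_mtok
print(f"managed: ${managed_cost:,.0f}/mo vs self-hosted: ${gpu_cost_per_month:,.0f}/mo")
# At these placeholder prices, break-even lands near 5B tokens per month.
```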
Migration Context and Agent Platform Shifts
The Assistants API shutdown on August 26, 2026 affects teams using fine-tuned models in agent workflows. Fine-tuning itself isn't deprecated—the models remain available—but the orchestration layer for agents is changing. Teams that fine-tuned GPT models for use in Assistants-based chatbots or support agents need to migrate orchestration logic to Responses API while retaining the same fine-tuned model weights. The practical implication is that fine-tuning work is portable, but the code that manages conversations and tool calling is not.
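As a sketch, calling a fine-tuned model through the Responses API looks like this; the fine-tuned model ID is a placeholder in the ft: naming format.

```python
# Sketch: calling a fine-tuned model through the Responses API.
# The fine-tuned model ID below is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.responses.create(
    model="ft:gpt-4o-2024-08-06:acme::abc123",
    input="Classify this ticket: 'My invoice total looks wrong.'",
)
print(response.output_text)
```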
The broader context is that neutral standards like MCP are emerging for agent-tool integration. Fine-tuned models can serve agents regardless of whether those agents use proprietary platforms or MCP-based architectures. If you're investing in fine-tuning for production agent workflows, ensuring the deployment stack supports MCP or other portable standards reduces the risk that agent platform shifts force redeployment.
Choosing Your Fine-Tuning and Deployment Approach
For most teams that need fine-tuning for brand voice, structured output formats, or domain-specific instruction following and want to avoid infrastructure management, managed platforms are the better choice: OpenAI for GPT-4o fine-tuning, Google Vertex AI for Gemini 2.0 Flash or 2.5 variants, or AWS Bedrock for Llama 3.2. These platforms handle training orchestration, model versioning, and inference serving with predictable pricing and minimal operational overhead. OpenAI's Responses API integration simplifies agent deployment for teams building conversational systems, and Vertex AI's model lifecycle transparency helps teams plan for retirement dates before they force migration. If your workflow is primarily about customizing behavior on established model families and you can accept vendor pricing at your scale, managed fine-tuning eliminates weeks of infrastructure work.
Self-hosted fine-tuning and deployment make sense for teams that need cost control at high inference volumes, on-premises deployment for regulatory compliance, or deep integration with proprietary internal systems where managed platforms' APIs don't provide necessary flexibility. vLLM's prefix caching and unified scheduling optimize latency for workflows with repeated prompt patterns. KServe's Kubernetes-native autoscaling and model caching support production deployments where traffic varies and infrastructure cost matters. TensorRT-LLM's disaggregated serving primitives and KV cache connectors enable architectures where prompt processing and generation scale independently, which is valuable at very high volumes. If your team has ML infrastructure expertise and operates at a scale where per-token API costs exceed the total cost of owning GPU infrastructure plus engineering time, self-hosted deployment justifies the complexity.
AWS Bedrock's reinforcement fine-tuning is worth evaluating if your use case involves shaping model judgment on tasks where defining reward functions is clearer than labeling thousands of examples. Code generation correctness, math reasoning validation, or customer support response quality ranking all fit this pattern. The approach is newer and less proven than supervised fine-tuning, but for teams where labeled data is expensive or unavailable and quality is easy to articulate programmatically, reinforcement tuning offers an alternative training path that is expanding to more base models through 2026.
Note: Fine-tuning availability and deployment tooling are evolving rapidly. Verify current model support, pricing, and lifecycle schedules before committing to a platform. Self-hosted serving stacks like vLLM and KServe release updates frequently—expect specification changes and performance improvements through 2026.