GPU cloud inference with OpenAI-compatible endpoints and management API

CoreWeave documents inference products at docs.coreweave.com/products/inference spanning Serverless, Dedicated (BYOW on H100/B200/A100-class GPUs), and CKS options, all exposing OpenAI API-compatible endpoints per the inference introduction. The Inference API at api.coreweave.com (v1alpha1) manages gateways, deployments, and capacity claims over REST/JSON, gRPC, or Connect with Bearer tokens requiring Inference Viewer or Inference Admin roles. Getting-started guides walk through gateway creation with IAM authentication, body-based routing on the model field, and chat completion requests against deployed weights in CoreWeave Object Storage.

Category Developer Tools

Pricing Usage-based GPU inference; see CoreWeave billing docs for Dedicated and Serverless pricing

Platforms Web / API / Terraform

gpuinferenceneocloud

Use cases

Serve custom model weights on dedicated NVIDIA GPUs with OpenAI clients
Programmatically list and manage inference deployments via REST API
Reserve GPU capacity with CapacityClaimService before launch windows
Compare neocloud inference options when hyperscaler bridge capacity is scarce
Integrate existing OpenAI SDK apps by pointing at gateway endpoints

Key features

OpenAI-compatible chat/completions via inference gateways per docs
DeploymentService and GatewayService APIs at api.coreweave.com/v1alpha1
Dedicated BYOW deployments with autoscaling and capacity claims
Terraform provider for inference resources
IAM-authenticated gateways with body-based model routing

Who Is It For?

ML engineers needing dedicated GPU inference infrastructure
Platform teams evaluating neocloud capacity outside hyperscaler regions
DevOps engineers managing inference via Terraform and CoreWeave IAM

Frequently Asked Questions

How is Dedicated Inference different from Serverless?: Docs describe Dedicated as BYOW on reserved GPU infrastructure; Serverless is the fully managed tier—pick per control and ops burden.
Which API version is documented?: The Inference API reference labels endpoints v1alpha1 and notes APIs may change before GA.
What auth do API calls need?: Bearer tokens from CoreWeave API access tokens with Inference Viewer or Inference Admin roles per the API overview.

3 Indexed items

NVIDIA NIM

Developer ToolsDeveloper Program hos…

NVIDIA NIM documents performance-optimized inference microservices at docs.api.nvidia.com/nim and docs.nvidia.com/nim that expose industry-standard APIs (OpenAI-compatible `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, Anthropic-compatible `/v1/messages`) from containerized models backed by TensorRT-LLM, vLLM, or SGLang per deployment. Teams can self-host GPU-accelerated models on cloud, data center, or RTX workstations, or prototype via NVIDIA-hosted NIM API endpoints through the Developer Program. Management endpoints such as `/v1/health/ready` and `/v1/metrics` support readiness probes and Prometheus metrics on self-hosted containers per the LLM API reference.

Baseten

Developer ToolsUsage-based inference…

Baseten documents a training and inference platform at docs.baseten.co where teams deploy models via the open-source Truss framework or call hosted Model APIs without standing up infrastructure. Config-only Truss deployments point at Hugging Face checkpoints, select GPU resources, and engines such as TensorRT-LLM; `truss push` builds optimized containers and exposes OpenAI-compatible sync endpoints like `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`. Custom architectures use a Truss `Model` class with `load` and `predict` in `model.py`. Model APIs provide immediate OpenAI-SDK-style access to catalog models (DeepSeek, Qwen, GLM, and others listed in docs) using `BASETEN_API_KEY`.

fal

Developer ToolsPer-second Serverless…

fal documents a serverless platform at fal.ai/docs where teams deploy custom models as Python `fal.App` classes with `@fal.endpoint` handlers on auto-scaling H100/A100/B200 runners, or call 1,000+ hosted Model APIs through a unified client. The workflow uses `fal run` for temporary cloud testing and `fal deploy` for persistent endpoints (for example `your-username/my-model` via `fal_client.subscribe` or `https://queue.fal.run/`). Docs describe `setup()` for one-time model loading, machine_type GPU selection, auth modes (private vs public), per-second Serverless billing versus hourly fal Compute for training, and built-in App Analytics with Prometheus-compatible metrics.

CoreWeave