GPU cloud inference with OpenAI-compatible endpoints and management API
CoreWeave documents inference products at docs.coreweave.com/products/inference spanning Serverless, Dedicated (BYOW on H100/B200/A100-class GPUs), and CKS options, all exposing OpenAI API-compatible endpoints per the inference introduction. The Inference API at api.coreweave.com (v1alpha1) manages gateways, deployments, and capacity claims over REST/JSON, gRPC, or Connect with Bearer tokens requiring Inference Viewer or Inference Admin roles. Getting-started guides walk through gateway creation with IAM authentication, body-based routing on the model field, and chat completion requests against deployed weights in CoreWeave Object Storage.
Use cases
- Serve custom model weights on dedicated NVIDIA GPUs with OpenAI clients
- Programmatically list and manage inference deployments via REST API
- Reserve GPU capacity with CapacityClaimService before launch windows
- Compare neocloud inference options when hyperscaler bridge capacity is scarce
- Integrate existing OpenAI SDK apps by pointing at gateway endpoints
Key features
- OpenAI-compatible chat/completions via inference gateways per docs
- DeploymentService and GatewayService APIs at api.coreweave.com/v1alpha1
- Dedicated BYOW deployments with autoscaling and capacity claims
- Terraform provider for inference resources
- IAM-authenticated gateways with body-based model routing
Who Is It For?
- ML engineers needing dedicated GPU inference infrastructure
- Platform teams evaluating neocloud capacity outside hyperscaler regions
- DevOps engineers managing inference via Terraform and CoreWeave IAM
Frequently Asked Questions
- How is Dedicated Inference different from Serverless?
- Docs describe Dedicated as BYOW on reserved GPU infrastructure; Serverless is the fully managed tier—pick per control and ops burden.
- Which API version is documented?
- The Inference API reference labels endpoints v1alpha1 and notes APIs may change before GA.
- What auth do API calls need?
- Bearer tokens from CoreWeave API access tokens with Inference Viewer or Inference Admin roles per the API overview.
Related
Related
3 Indexed items
NVIDIA NIM
NVIDIA NIM documents performance-optimized inference microservices at docs.api.nvidia.com/nim and docs.nvidia.com/nim that expose industry-standard APIs (OpenAI-compatible `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, Anthropic-compatible `/v1/messages`) from containerized models backed by TensorRT-LLM, vLLM, or SGLang per deployment. Teams can self-host GPU-accelerated models on cloud, data center, or RTX workstations, or prototype via NVIDIA-hosted NIM API endpoints through the Developer Program. Management endpoints such as `/v1/health/ready` and `/v1/metrics` support readiness probes and Prometheus metrics on self-hosted containers per the LLM API reference.
Baseten
Baseten documents a training and inference platform at docs.baseten.co where teams deploy models via the open-source Truss framework or call hosted Model APIs without standing up infrastructure. Config-only Truss deployments point at Hugging Face checkpoints, select GPU resources, and engines such as TensorRT-LLM; `truss push` builds optimized containers and exposes OpenAI-compatible sync endpoints like `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`. Custom architectures use a Truss `Model` class with `load` and `predict` in `model.py`. Model APIs provide immediate OpenAI-SDK-style access to catalog models (DeepSeek, Qwen, GLM, and others listed in docs) using `BASETEN_API_KEY`.
fal
fal documents a serverless platform at fal.ai/docs where teams deploy custom models as Python `fal.App` classes with `@fal.endpoint` handlers on auto-scaling H100/A100/B200 runners, or call 1,000+ hosted Model APIs through a unified client. The workflow uses `fal run` for temporary cloud testing and `fal deploy` for persistent endpoints (for example `your-username/my-model` via `fal_client.subscribe` or `https://queue.fal.run/`). Docs describe `setup()` for one-time model loading, machine_type GPU selection, auth modes (private vs public), per-second Serverless billing versus hourly fal Compute for training, and built-in App Analytics with Prometheus-compatible metrics.