C

AI Tool

CoreWeave

GPU cloud inference with OpenAI-compatible endpoints and management API

CoreWeave documents inference products at docs.coreweave.com/products/inference spanning Serverless, Dedicated (BYOW on H100/B200/A100-class GPUs), and CKS options, all exposing OpenAI API-compatible endpoints per the inference introduction. The Inference API at api.coreweave.com (v1alpha1) manages gateways, deployments, and capacity claims over REST/JSON, gRPC, or Connect with Bearer tokens requiring Inference Viewer or Inference Admin roles. Getting-started guides walk through gateway creation with IAM authentication, body-based routing on the model field, and chat completion requests against deployed weights in CoreWeave Object Storage.

Category Developer Tools
Pricing Usage-based GPU inference; see CoreWeave billing docs for Dedicated and Serverless pricing
Platforms Web / API / Terraform
gpuinferenceneocloud

Use cases

  • Serve custom model weights on dedicated NVIDIA GPUs with OpenAI clients
  • Programmatically list and manage inference deployments via REST API
  • Reserve GPU capacity with CapacityClaimService before launch windows
  • Compare neocloud inference options when hyperscaler bridge capacity is scarce
  • Integrate existing OpenAI SDK apps by pointing at gateway endpoints

Key features

  • OpenAI-compatible chat/completions via inference gateways per docs
  • DeploymentService and GatewayService APIs at api.coreweave.com/v1alpha1
  • Dedicated BYOW deployments with autoscaling and capacity claims
  • Terraform provider for inference resources
  • IAM-authenticated gateways with body-based model routing

Who Is It For?

  • ML engineers needing dedicated GPU inference infrastructure
  • Platform teams evaluating neocloud capacity outside hyperscaler regions
  • DevOps engineers managing inference via Terraform and CoreWeave IAM

Frequently Asked Questions

How is Dedicated Inference different from Serverless?
Docs describe Dedicated as BYOW on reserved GPU infrastructure; Serverless is the fully managed tier—pick per control and ops burden.
Which API version is documented?
The Inference API reference labels endpoints v1alpha1 and notes APIs may change before GA.
What auth do API calls need?
Bearer tokens from CoreWeave API access tokens with Inference Viewer or Inference Admin roles per the API overview.

Related

Related

3 Indexed items

NVIDIA NIM

Developer ToolsDeveloper Program hosted APIs for prototyping; NVIDIA AI Enterprise for production self-host (see nvidia.com/nim)

NVIDIA NIM documents performance-optimized inference microservices at docs.api.nvidia.com/nim and docs.nvidia.com/nim that expose industry-standard APIs (OpenAI-compatible `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, Anthropic-compatible `/v1/messages`) from containerized models backed by TensorRT-LLM, vLLM, or SGLang per deployment. Teams can self-host GPU-accelerated models on cloud, data center, or RTX workstations, or prototype via NVIDIA-hosted NIM API endpoints through the Developer Program. Management endpoints such as `/v1/health/ready` and `/v1/metrics` support readiness probes and Prometheus metrics on self-hosted containers per the LLM API reference.

Baseten

Developer ToolsUsage-based inference and training; see baseten.co/pricing

Baseten documents a training and inference platform at docs.baseten.co where teams deploy models via the open-source Truss framework or call hosted Model APIs without standing up infrastructure. Config-only Truss deployments point at Hugging Face checkpoints, select GPU resources, and engines such as TensorRT-LLM; `truss push` builds optimized containers and exposes OpenAI-compatible sync endpoints like `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`. Custom architectures use a Truss `Model` class with `load` and `predict` in `model.py`. Model APIs provide immediate OpenAI-SDK-style access to catalog models (DeepSeek, Qwen, GLM, and others listed in docs) using `BASETEN_API_KEY`.

fal

Developer ToolsPer-second Serverless execution; Model APIs per call; Compute per GPU-hour (see fal.ai pricing)

fal documents a serverless platform at fal.ai/docs where teams deploy custom models as Python `fal.App` classes with `@fal.endpoint` handlers on auto-scaling H100/A100/B200 runners, or call 1,000+ hosted Model APIs through a unified client. The workflow uses `fal run` for temporary cloud testing and `fal deploy` for persistent endpoints (for example `your-username/my-model` via `fal_client.subscribe` or `https://queue.fal.run/`). Docs describe `setup()` for one-time model loading, machine_type GPU selection, auth modes (private vs public), per-second Serverless billing versus hourly fal Compute for training, and built-in App Analytics with Prometheus-compatible metrics.