N

AI Tool

NVIDIA NIM

Containerized inference microservices with OpenAI-compatible APIs for LLMs and more

NVIDIA NIM documents performance-optimized inference microservices at docs.api.nvidia.com/nim and docs.nvidia.com/nim that expose industry-standard APIs (OpenAI-compatible `/v1/chat/completions`, `/v1/completions`, `/v1/responses`, Anthropic-compatible `/v1/messages`) from containerized models backed by TensorRT-LLM, vLLM, or SGLang per deployment. Teams can self-host GPU-accelerated models on cloud, data center, or RTX workstations, or prototype via NVIDIA-hosted NIM API endpoints through the Developer Program. Management endpoints such as `/v1/health/ready` and `/v1/metrics` support readiness probes and Prometheus metrics on self-hosted containers per the LLM API reference.

Category Developer Tools
Pricing Developer Program hosted APIs for prototyping; NVIDIA AI Enterprise for production self-host (see nvidia.com/nim)
Platforms Web / API / Docker / Kubernetes
inferencegpucontainers

Use cases

  • Swap OpenAI client `base_url` to a local NIM container for on-prem LLM inference
  • Deploy Kubernetes-scaled NIM microservices with Prometheus metrics
  • Prototype models on NVIDIA-hosted endpoints before self-hosting under AI Enterprise
  • Run Anthropic-style `/v1/messages` against a NIM LLM container
  • Pick TRT-LLM or vLLM engines per GPU infrastructure per deployment guides

Key features

  • OpenAI-compatible chat, completion, and responses endpoints per NIM LLM reference
  • Anthropic-compatible `/v1/messages` routing through vLLM backend per docs
  • Self-hosted containers with `/v1/health/live` and `/v1/health/ready` probes
  • Catalog spans LLMs, vision, speech, and other use cases on docs.api.nvidia.com/nim
  • Hosted NIM APIs via Developer Program for unlimited prototyping (FAQ)

Who Is It For?

  • ML engineers shipping GPU inference on NVIDIA hardware
  • Platform teams standardizing on OpenAI-compatible internal gateways
  • Developers evaluating self-host vs hosted NIM endpoints

Frequently Asked Questions

How is NIM different from calling NVIDIA cloud APIs directly?
NIM packages optimized inference containers and standard APIs; you can self-host the same microservice pattern or use hosted dev endpoints per NVIDIA docs.
Which API style should I use?
Docs list OpenAI-compatible `/v1/chat/completions` for most chat apps and `/v1/messages` when you need Anthropic-compatible clients.
Do I need NVIDIA AI Enterprise?
Developer Program covers prototyping on hosted APIs; production self-host typically requires NVIDIA AI Enterprise licensing per product pages.

Related

Related

3 Indexed items

CoreWeave

Developer ToolsUsage-based GPU inference; see CoreWeave billing docs for Dedicated and Serverless pricing

CoreWeave documents inference products at docs.coreweave.com/products/inference spanning Serverless, Dedicated (BYOW on H100/B200/A100-class GPUs), and CKS options, all exposing OpenAI API-compatible endpoints per the inference introduction. The Inference API at api.coreweave.com (v1alpha1) manages gateways, deployments, and capacity claims over REST/JSON, gRPC, or Connect with Bearer tokens requiring Inference Viewer or Inference Admin roles. Getting-started guides walk through gateway creation with IAM authentication, body-based routing on the model field, and chat completion requests against deployed weights in CoreWeave Object Storage.

Baseten

Developer ToolsUsage-based inference and training; see baseten.co/pricing

Baseten documents a training and inference platform at docs.baseten.co where teams deploy models via the open-source Truss framework or call hosted Model APIs without standing up infrastructure. Config-only Truss deployments point at Hugging Face checkpoints, select GPU resources, and engines such as TensorRT-LLM; `truss push` builds optimized containers and exposes OpenAI-compatible sync endpoints like `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`. Custom architectures use a Truss `Model` class with `load` and `predict` in `model.py`. Model APIs provide immediate OpenAI-SDK-style access to catalog models (DeepSeek, Qwen, GLM, and others listed in docs) using `BASETEN_API_KEY`.

AssemblyAI

Developer ToolsPay-as-you-go per audio hour; enterprise plans (see assemblyai.com/pricing)

AssemblyAI documents Voice AI APIs at assemblyai.com/docs where developers transcribe and analyze audio via REST at `https://api.assemblyai.com` and real-time WebSockets at `wss://streaming.assemblyai.com` (EU pre-recorded host `api.eu.assemblyai.com` per cloud residency docs). Pre-recorded transcription requires an explicit `speech_models` array on every `POST /v2/transcript` request—docs recommend `universal-3-pro` with `universal-2` fallback for 99-language coverage. The platform also publishes a Voice Agent API for speech-to-speech agents, Speech Understanding features (diarization, sentiment, summarization), Guardrails, and an LLM Gateway to run frontier models on transcripts.