Serverless GPU endpoints and Pods API for AI inference workloads

RunPod documents a serverless platform at docs.runpod.io where teams deploy containerized AI handlers without managing servers, paying only for compute time used. Developers write Python handler functions with the Runpod SDK (`runpod.serverless.start`), package Docker images, and expose queue-based endpoints at `https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync` or `/run` with `Authorization: Bearer RUNPOD_API_KEY`. Docs cover streaming handlers, load-balancing endpoints with custom HTTP frameworks, Pods for persistent GPUs, network volumes, and a REST API at rest.runpod.io for programmatic resource management.
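
The handler workflow described above can be sketched roughly as follows. This is a minimal illustration rather than the documented reference implementation: it assumes the SDK hands each job to the handler as a dict whose `input` key holds the caller's payload, and the `prompt` field, the uppercase stand-in for a model, and the `main` wrapper are all hypothetical. `runpod` is imported inside `main` so the handler can be exercised locally without the SDK installed.

```python
from typing import Any, Dict


def handler(job: Dict[str, Any]) -> Dict[str, Any]:
    """Process one queued job and return a JSON-serializable result."""
    # Assumption: the caller's payload arrives under the "input" key.
    job_input = job.get("input", {})
    prompt = job_input.get("prompt")  # "prompt" is a hypothetical field name
    if not prompt:
        # Return a dict describing the problem instead of raising.
        return {"error": "missing 'prompt' in input"}
    # Placeholder for real inference: a model call would go here.
    return {"output": prompt.upper()}


def main() -> None:
    # Imported lazily so local tests of handler() do not need the SDK.
    import runpod  # pip install runpod

    # Entry point named in the docs summary above.
    runpod.serverless.start({"handler": handler})
```

In a worker image, the file's last line would typically call `main()`; locally, `handler({"input": {"prompt": "hi"}})` can be called directly to check the logic before building the Docker image.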

Category: Developer Tools

Pricing: Per-second serverless compute; Pods billed per GPU-hour (see runpod.io/pricing)

Platforms: Web / API / Python / Docker

gpu, serverless, inference

Use cases

Serve custom inference handlers with autoscaling and no idle GPU cost
Prototype handlers locally then deploy Docker workers from the quickstart flow
Run long training jobs on Pods while keeping bursty traffic on Serverless
Integrate GPU capacity into CI/CD via REST API and API keys
Stream LLM tokens using documented streaming handler options
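
For the token-streaming use case, one plausible shape is a generator handler that yields partial results as they are produced. The sketch below assumes the SDK treats a generator handler as a streaming handler, and that `return_aggregate_stream` is an accepted option of `runpod.serverless.start`; both should be checked against the streaming handler docs. `fake_generate` is a hypothetical stand-in for a real model's token iterator.

```python
from typing import Any, Dict, Iterator


def fake_generate(prompt: str) -> Iterator[str]:
    """Hypothetical stand-in for a model that emits tokens one at a time."""
    for word in prompt.split():
        yield word


def stream_handler(job: Dict[str, Any]) -> Iterator[Dict[str, str]]:
    """Yield one small JSON-serializable chunk per generated token."""
    prompt = job.get("input", {}).get("prompt", "")
    for token in fake_generate(prompt):
        yield {"token": token}


def main() -> None:
    # Imported lazily so the generator can be tested without the SDK.
    import runpod  # pip install runpod

    runpod.serverless.start({
        "handler": stream_handler,
        # Assumption: asks the SDK to also collect the yielded chunks
        # into the job's final result.
        "return_aggregate_stream": True,
    })
```

Keeping each yielded chunk small and self-describing lets a client render tokens as they arrive instead of waiting for the full completion.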

Key features

Queue-based endpoints with `/runsync`, `/run`, `/status`, `/stream`, and `/health` operations, documented in the send-requests guide
Handler functions via the Runpod SDK, including streaming and concurrent patterns
Load-balancing endpoints that serve FastAPI/Flask apps without a queue handler
Pods API for persistent GPU instances and network volumes, per the API reference overview
OpenAPI schema at rest.runpod.io/v1/openapi.json for automation
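
The asynchronous `/run` plus `/status` flow listed above might be driven by a small polling loop like the one below. It is a sketch under stated assumptions: the `/status/{job_id}` path shape, the `status` field, and the terminal status names are not given in this listing and should be verified against the send-requests guide. `fetch_status` is a hypothetical callable that performs the actual HTTP GET, injected so the loop can be tested offline.

```python
import time
from typing import Any, Callable, Dict, Optional

BASE_URL = "https://api.runpod.ai/v2"

# Assumption: statuses after which a job will not change again.
TERMINAL_STATUSES = {"COMPLETED", "FAILED", "CANCELLED", "TIMED_OUT"}


def operation_url(endpoint_id: str, operation: str,
                  job_id: Optional[str] = None) -> str:
    """Build a queue-endpoint URL such as .../run or .../status/<job_id>."""
    url = f"{BASE_URL}/{endpoint_id}/{operation}"
    return f"{url}/{job_id}" if job_id else url


def wait_for_job(fetch_status: Callable[[str], Dict[str, Any]],
                 job_id: str,
                 interval: float = 1.0,
                 max_polls: int = 60) -> Dict[str, Any]:
    """Poll until the job reaches a terminal status or polls run out."""
    for _ in range(max_polls):
        body = fetch_status(job_id)
        if body.get("status") in TERMINAL_STATUSES:
            return body
        time.sleep(interval)
    raise TimeoutError(f"job {job_id} not finished after {max_polls} polls")
```

A fixed interval keeps the example short; a production client would more likely back off between polls or use `/runsync` for short jobs.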

Who Is It For?

ML engineers shipping GPU inference without Kubernetes
Startups needing bursty GPU capacity with per-second billing
Teams already containerizing models who want managed autoscale endpoints

Frequently Asked Questions

How do I authenticate API calls? The docs require a Runpod API key sent as a Bearer token in the `Authorization` header for both Serverless and REST API requests.
What is the difference between Serverless and Pods? Serverless endpoints autoscale containerized handlers per job; Pods are persistent GPU instances for development or long-running workloads, per docs.runpod.io.
Can I use my own web framework? Yes. The docs state that load-balancing endpoints can expose custom HTTP APIs through frameworks like FastAPI or Flask, without the queue handler pattern.
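
As an illustration of the authentication answer above, the following sketch builds a `/runsync` request with the Bearer header using only the standard library. The URL pattern and header come from this listing; wrapping the payload under an `input` key, the `RUNPOD_ENDPOINT_ID` variable, and the helper names are assumptions made for the example.

```python
import json
import os
import urllib.request
from typing import Any, Dict


def build_runsync_request(endpoint_id: str, payload: Dict[str, Any],
                          api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) an authenticated /runsync request."""
    url = f"https://api.runpod.ai/v2/{endpoint_id}/runsync"
    # Assumption: the endpoint expects the payload under an "input" key.
    body = json.dumps({"input": payload}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def send(req: urllib.request.Request, timeout: float = 60.0) -> Dict[str, Any]:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


def main() -> None:
    # Read secrets from the environment rather than hard-coding them.
    req = build_runsync_request(
        os.environ["RUNPOD_ENDPOINT_ID"],  # hypothetical variable name
        {"prompt": "hello"},
        os.environ["RUNPOD_API_KEY"],
    )
    print(send(req))
```

Separating request construction from sending makes the header and URL easy to inspect in tests without touching the network.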

3 Indexed items

fal

Developer Tools | Per-second Serverless…

fal documents a serverless platform at fal.ai/docs where teams deploy custom models as Python `fal.App` classes with `@fal.endpoint` handlers on auto-scaling H100/A100/B200 runners, or call 1,000+ hosted Model APIs through a unified client. The workflow uses `fal run` for temporary cloud testing and `fal deploy` for persistent endpoints (for example `your-username/my-model` via `fal_client.subscribe` or `https://queue.fal.run/`). Docs describe `setup()` for one-time model loading, machine_type GPU selection, auth modes (private vs public), per-second Serverless billing versus hourly fal Compute for training, and built-in App Analytics with Prometheus-compatible metrics.

Modal

Developer Tools | Per-second serverless…

Modal documents a serverless cloud at modal.com where engineers run compute-intensive Python with zero infrastructure configuration: deploy OpenAI-compatible LLM services, batch workflows, job queues, GPU training and fine-tuning, and thousands of isolated Sandboxes for agent-generated code. Official guides show defining apps with `@app.function`, container images via `modal.Image`, and GPU types in code rather than YAML. Modal states pricing is per-second serverless usage with pooled capacity across major clouds, and supports calling functions from JavaScript/Go clients in addition to Python.

Baseten

Developer Tools | Usage-based inference…

Baseten documents a training and inference platform at docs.baseten.co where teams deploy models via the open-source Truss framework or call hosted Model APIs without standing up infrastructure. Config-only Truss deployments point at Hugging Face checkpoints, select GPU resources, and engines such as TensorRT-LLM; `truss push` builds optimized containers and exposes OpenAI-compatible sync endpoints like `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`. Custom architectures use a Truss `Model` class with `load` and `predict` in `model.py`. Model APIs provide immediate OpenAI-SDK-style access to catalog models (DeepSeek, Qwen, GLM, and others listed in docs) using `BASETEN_API_KEY`.
