Serverless GPU endpoints and Pods API for AI inference workloads
RunPod documents a serverless platform at docs.runpod.io where teams deploy containerized AI handlers without managing servers, paying only for compute time used. Developers write Python handler functions with the Runpod SDK (`runpod.serverless.start`), package Docker images, and expose queue-based endpoints at `https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync` or `/run` with `Authorization: Bearer RUNPOD_API_KEY`. Docs cover streaming handlers, load-balancing endpoints with custom HTTP frameworks, Pods for persistent GPUs, network volumes, and a REST API at rest.runpod.io for programmatic resource management.
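As a minimal sketch of the request shape described above, the following builds (without sending) a POST to `/runsync` or `/run` with the Bearer header. The helper names `build_request` and `send` are hypothetical, and wrapping the payload under an `"input"` key is an assumption about the expected request body rather than something stated in this listing.

```python
import json
import urllib.request

API_BASE = "https://api.runpod.ai/v2"


def build_request(endpoint_id, api_key, payload, sync=True):
    """Build, but do not send, a POST to /runsync (sync) or /run (async)."""
    operation = "runsync" if sync else "run"
    url = f"{API_BASE}/{endpoint_id}/{operation}"
    # Assumption: the job payload is wrapped under an "input" key.
    body = json.dumps({"input": payload}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )


def send(request, timeout=60.0):
    """Send a prepared request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from the network call makes the URL and headers easy to inspect before any compute time is billed; calling `send(build_request(...))` performs the actual request.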
Use cases
- Serve custom inference handlers with autoscaling and no idle GPU cost
- Prototype handlers locally then deploy Docker workers from the quickstart flow
- Run long training jobs on Pods while keeping bursty traffic on Serverless
- Integrate GPU capacity into CI/CD via REST API and API keys
- Stream LLM tokens using documented streaming handler options
Key features
- Queue-based endpoints with `/runsync`, `/run`, `/status`, `/stream`, and `/health`, documented in the send-requests guide
- Handler functions via Runpod SDK including streaming and concurrent patterns
- Load-balancing endpoints allowing FastAPI/Flask without a queue handler
- Pods API for persistent GPU instances and network volumes, per the API reference overview
- OpenAPI schema at rest.runpod.io/v1/openapi.json for automation
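The handler patterns listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the documented implementation: the job is assumed to arrive as a dict with an `"input"` key, the echo logic is a placeholder for real inference, and the registration call is left as a comment so the sketch runs without the SDK installed.

```python
def handler(job):
    """Queue-style handler: takes a job dict, returns a JSON-serializable result."""
    # Assumption: the request payload is available under job["input"].
    job_input = job.get("input", {})
    prompt = job_input.get("prompt", "")
    # Placeholder for model inference: echo the prompt and its length.
    return {"echo": prompt, "length": len(prompt)}


def streaming_handler(job):
    """Generator-style handler: yields partial results one at a time."""
    prompt = job.get("input", {}).get("prompt", "")
    # Placeholder for token streaming: emit one whitespace-separated word per step.
    for token in prompt.split():
        yield {"token": token}


# On a worker image with the Runpod SDK installed, the handler would be
# registered with the call named in the description above (not executed here):
#
#   import runpod
#   runpod.serverless.start({"handler": handler})
```

Because both handlers are plain Python callables, they can be exercised locally with a hand-built job dict before a Docker image is packaged.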
Who is it for?
- ML engineers shipping GPU inference without Kubernetes
- Startups needing bursty GPU capacity with per-second billing
- Teams already containerizing models who want managed autoscale endpoints
Frequently Asked Questions
- How do I authenticate API calls?
- RunPod docs require a Runpod API key, sent in an `Authorization: Bearer` header, for both Serverless and REST API requests.
- What is the difference between Serverless and Pods?
- Serverless endpoints autoscale containerized handlers per job; Pods are persistent GPU instances for dev or long-running workloads per docs.runpod.io.
- Can I use my own web framework?
- Docs state load-balancing endpoints can expose custom HTTP APIs via frameworks like FastAPI or Flask without the queue handler pattern.
Related
Modal
Modal documents a serverless cloud at modal.com where engineers run compute-intensive Python with zero infrastructure configuration: deploy OpenAI-compatible LLM services, batch workflows, job queues, GPU training and fine-tuning, and thousands of isolated Sandboxes for agent-generated code. Official guides show defining apps with `@app.function`, container images via `modal.Image`, and GPU types in code rather than YAML. Modal states pricing is per-second serverless usage with pooled capacity across major clouds, and supports calling functions from JavaScript/Go clients in addition to Python.
Baseten
Baseten documents a training and inference platform at docs.baseten.co where teams deploy models via the open-source Truss framework or call hosted Model APIs without standing up infrastructure. Config-only Truss deployments point at Hugging Face checkpoints, select GPU resources, and engines such as TensorRT-LLM; `truss push` builds optimized containers and exposes OpenAI-compatible sync endpoints like `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`. Custom architectures use a Truss `Model` class with `load` and `predict` in `model.py`. Model APIs provide immediate OpenAI-SDK-style access to catalog models (DeepSeek, Qwen, GLM, and others listed in docs) using `BASETEN_API_KEY`.
Fireworks AI
Fireworks AI documents a REST platform at docs.fireworks.ai where developers call language, image, and embedding models with Bearer API keys from the dashboard or `firectl api-key create`. Models use globally unique IDs such as `accounts/<account>/models/<model-id>` and can be served via serverless inference for popular open weights (for example Llama 3.1 70B listed on fireworks.ai/models) or private dedicated GPU deployments for custom base models and LoRA addons. Official docs distinguish serverless per-token billing with best-effort uptime from dedicated deployments billed per GPU-second with private capacity, and state that prompts and generated outputs are not logged except for documented exceptions such as the FireFunction model or opt-in advanced features.