Train, deploy, and serve models with Truss, Model APIs, and OpenAI-compatible endpoints

Baseten documents a training and inference platform at docs.baseten.co where teams deploy models via the open-source Truss framework or call hosted Model APIs without standing up infrastructure. Config-only Truss deployments point at Hugging Face checkpoints, select GPU resources, and engines such as TensorRT-LLM; `truss push` builds optimized containers and exposes OpenAI-compatible sync endpoints like `https://model-{model_id}.api.baseten.co/environments/production/sync/v1`. Custom architectures use a Truss `Model` class with `load` and `predict` in `model.py`. Model APIs provide immediate OpenAI-SDK-style access to catalog models (DeepSeek, Qwen, GLM, and others listed in docs) using `BASETEN_API_KEY`.

Category Developer Tools

Pricing Usage-based inference and training; see baseten.co/pricing

Platforms Web / API / Python / CLI

inferencedeploymenttruss

Use cases

Ship a Hugging Face LLM to a GPU endpoint without writing Dockerfiles
Prototype with Model APIs then promote a tuned checkpoint via Truss training flows
Serve agent backends that already use the OpenAI SDK by swapping base URL and API key
Run custom inference logic in `predict` while Baseten manages containers and scaling
Benchmark TensorRT-LLM optimized builds against baseline PyTorch serving

Key features

Truss `config.yaml` deployments for supported open LLMs, embeddings, and image models per build-your-first-model guide
OpenAI-compatible HTTP APIs on engine-based deployments with documented `BASETEN_API_KEY` authentication
Custom `model.py` Model lifecycle (`__init__`, `load`, `predict`) for preprocessing and unsupported architectures
Development vs production promotion paths (`/development/predict` to `/production/predict`) documented in deployment guides
Model APIs for zero-setup inference on catalog checkpoints without a private deployment

Who Is It For?

ML engineers deploying open-weight models to production APIs
Platform teams standardizing on Truss packaging for internal model catalogs
Startups needing managed GPU inference without operating Kubernetes

Frequently Asked Questions

Do I always need a custom model.py file?: No. Baseten docs show config-only Truss deployments for many popular open architectures; custom Python is for unsupported engines or bespoke preprocessing.
How do I authenticate API calls?: Docs use `Authorization: Api-Key` headers with `BASETEN_API_KEY` for deployment endpoints and Model APIs.
What is the difference between Model APIs and Truss deployments?: Model APIs are hosted catalog endpoints you can call immediately; Truss deployments package your chosen model and hardware into a dedicated Baseten endpoint you manage.

3 Indexed items

fal

Developer ToolsPer-second Serverless…

fal documents a serverless platform at fal.ai/docs where teams deploy custom models as Python `fal.App` classes with `@fal.endpoint` handlers on auto-scaling H100/A100/B200 runners, or call 1,000+ hosted Model APIs through a unified client. The workflow uses `fal run` for temporary cloud testing and `fal deploy` for persistent endpoints (for example `your-username/my-model` via `fal_client.subscribe` or `https://queue.fal.run/`). Docs describe `setup()` for one-time model loading, machine_type GPU selection, auth modes (private vs public), per-second Serverless billing versus hourly fal Compute for training, and built-in App Analytics with Prometheus-compatible metrics.

Fireworks AI

Developer ToolsServerless per-token …

Fireworks AI documents a REST platform at docs.fireworks.ai where developers call language, image, and embedding models with Bearer API keys from the dashboard or `firectl api-key create`. Models use globally unique IDs such as `accounts/<account>/models/<model-id>` and can be served via serverless inference for popular open weights (for example Llama 3.1 70B listed on fireworks.ai/models) or private dedicated GPU deployments for custom base models and LoRA addons. Official docs distinguish serverless per-token billing with best-effort uptime from dedicated deployments billed per GPU-second with private capacity, and state that prompts and generated outputs are not logged except for documented exceptions such as the FireFunction model or opt-in advanced features.

RunPod

Developer ToolsPer-second serverless…

RunPod documents a serverless platform at docs.runpod.io where teams deploy containerized AI handlers without managing servers, paying only for compute time used. Developers write Python handler functions with the Runpod SDK (`runpod.serverless.start`), package Docker images, and expose queue-based endpoints at `https://api.runpod.ai/v2/{ENDPOINT_ID}/runsync` or `/run` with `Authorization: Bearer RUNPOD_API_KEY`. Docs cover streaming handlers, load-balancing endpoints with custom HTTP frameworks, Pods for persistent GPUs, network volumes, and a REST API at rest.runpod.io for programmatic resource management.

Baseten