Overview
The Inference Engine project provides a unified API gateway for serving open-source language models through various inference backends. The current implementation uses Ollama, but the architecture is designed to support multiple inference frameworks, allowing optimization for different deployment scenarios, from local development to high-throughput production workloads.
The engine adds authentication, usage tracking, request logging, and intelligent routing on top of the underlying inference frameworks, providing a production-ready interface for self-hosted LLMs.
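As a rough illustration, a client call against the gateway might look like the sketch below. The `/v1/chat/completions` route, bearer-token header, model name, and environment variable names are assumptions for the example, not the gateway's documented API.

```python
import os
import requests

# Hypothetical gateway URL and API key; the actual route and auth scheme
# depend on your deployment.
GATEWAY_URL = os.environ.get("INFERENCE_ENGINE_URL", "http://localhost:8080")
API_KEY = os.environ["INFERENCE_ENGINE_API_KEY"]

response = requests.post(
    f"{GATEWAY_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Summarize this release note."}],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```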
Current Implementation: Ollama
Features
- Single Endpoint: Unified API for multiple Ollama instances
- Request Tracing: Complete request lifecycle logging for debugging
- Cost Tracking: Token usage and compute cost estimation per request
- Model Policies: Per-model configuration for timeouts, concurrency, and rate limits (see the sketch after this list)
- Client SDKs: TypeScript and Python libraries for easy integration
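A hedged sketch of what a per-model policy table could look like; the keys, model names, and defaults are illustrative rather than the gateway's actual configuration schema.

```python
# Hypothetical per-model policy table; keys and defaults are illustrative.
MODEL_POLICIES = {
    "llama3:8b": {
        "timeout_seconds": 30,          # hard cap on a single generation request
        "max_concurrency": 8,           # simultaneous requests per instance
        "rate_limit_per_minute": 120,
    },
    "llama3:70b": {
        "timeout_seconds": 120,
        "max_concurrency": 2,
        "rate_limit_per_minute": 20,
    },
}

DEFAULT_POLICY = {"timeout_seconds": 60, "max_concurrency": 4, "rate_limit_per_minute": 60}

def policy_for(model: str) -> dict:
    """Return the policy for a model, falling back to defaults."""
    return {**DEFAULT_POLICY, **MODEL_POLICIES.get(model, {})}
```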
Architecture
- Instance pooling for load distribution
- Health checks with automatic failover (see the sketch after this list)
- Rolling updates for model and prompt changes
- Structured logging to BigQuery for analytics
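A minimal sketch of the pooling and health-check idea, assuming `httpx` for async HTTP and Ollama's `/api/tags` endpoint as a cheap liveness probe; the instance URLs and failover policy are placeholders.

```python
import asyncio
import httpx

# Hypothetical pool of Ollama instances; /api/tags lists installed models
# on a healthy instance, so it doubles as a liveness probe.
INSTANCES = ["http://ollama-0:11434", "http://ollama-1:11434"]

async def healthy_instances(timeout: float = 2.0) -> list[str]:
    """Probe every instance concurrently and return the ones that respond."""
    async with httpx.AsyncClient(timeout=timeout) as client:
        results = await asyncio.gather(
            *(client.get(f"{url}/api/tags") for url in INSTANCES),
            return_exceptions=True,
        )
    return [
        url
        for url, result in zip(INSTANCES, results)
        if isinstance(result, httpx.Response) and result.status_code == 200
    ]

async def pick_instance() -> str:
    """Route to the first healthy instance; callers fail over by retrying."""
    healthy = await healthy_instances()
    if not healthy:
        raise RuntimeError("no healthy Ollama instances")
    return healthy[0]
```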
Use Cases
Ollama is ideal for:
- Local development and experimentation
- Single-user applications and internal tools
- Rapid prototyping of LLM features
- Scenarios that prioritize ease of deployment over raw performance
Alternative Inference Frameworks
vLLM
Best for: Production workloads with high concurrency
vLLM achieves 120-160 req/sec throughput with continuous batching and 50-80ms time-to-first-token latency. Key advantages:
- PagedAttention Algorithm: Dramatically improves memory efficiency and throughput
- Continuous Batching: Dynamic request batching for maximum GPU utilization
- OpenAI-Compatible API: Drop-in replacement for OpenAI endpoints (see the example below)
- Strong Concurrency: Designed for multi-user production scenarios
Use when:
- Serving many concurrent users
- Throughput is critical (API services, production deployments)
- GPU resources are available and should be maximized
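Because vLLM exposes an OpenAI-compatible server, it can be called with the standard `openai` Python client by overriding the base URL; the port and model name below depend on how the server was launched.

```python
from openai import OpenAI

# Point the standard OpenAI client at a locally running vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```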
Text Generation Inference (TGI)
Best for: Enterprise deployments on diverse hardware
TGI achieves 100-140 req/sec throughput with dynamic batching and 60-90ms time-to-first-token. Developed by Hugging Face, it provides:
- Broad Hardware Support: NVIDIA/AMD GPUs, AWS Inferentia, Intel Gaudi
- Advanced Quantization: Multiple quantization methods for reduced memory
- Tensor Parallelism: Distribute large models across multiple GPUs
- Production Features: Metrics, distributed tracing, health checks
Use when:
- Deploying on non-NVIDIA hardware
- You need enterprise support and documentation
- Hugging Face ecosystem integration is valuable
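A minimal example of calling TGI's REST `/generate` endpoint from Python; the port and sampling parameters are illustrative and depend on how the container was started.

```python
import requests

# A locally running TGI server (port depends on how the container was launched).
TGI_URL = "http://localhost:8080"

response = requests.post(
    f"{TGI_URL}/generate",
    json={
        "inputs": "What is tensor parallelism?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```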
llama.cpp
Best for: CPU inference and edge deployment
Optimized for running models on CPUs and limited hardware:
- CPU Optimization: SIMD acceleration, quantization for CPU inference
- Low Memory: Runs large models on consumer hardware
- Cross-Platform: Works on Windows, macOS, Linux, mobile
- No GPU Required: Enables inference on any machine
Use when:
- GPU unavailable or cost-prohibitive
- Edge deployment (mobile, embedded devices)
- Privacy-sensitive scenarios requiring local-only inference
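A small example using the `llama-cpp-python` bindings; the GGUF file path is a placeholder for whatever quantized model is available locally.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# The model path is a placeholder; any local quantized GGUF file works.
llm = Llama(model_path="models/llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=2048)

output = llm(
    "Q: Name three uses of edge inference.\nA:",
    max_tokens=96,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```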
TensorRT-LLM
Best for: Maximum performance on NVIDIA hardware
NVIDIA’s optimized inference solution, achieving the lowest latency of the frameworks compared here:
- GPU Optimization: Leverages latest NVIDIA GPU features
- Graph Optimization: Fuses operations for minimal overhead
- Mixed Precision: FP16/INT8 inference with minimal quality loss
- Multi-GPU: Efficient tensor and pipeline parallelism
Use when:
- Latency is critical (real-time applications)
- NVIDIA hardware is available
- You are willing to invest in optimization work
Framework Comparison Summary
| Framework | Throughput | Latency | Best For | Hardware |
|---|---|---|---|---|
| Ollama | 1-3 req/sec | Variable | Development, single-user | CPU/GPU |
| vLLM | 120-160 req/sec | 50-80ms | Production, high concurrency | NVIDIA GPU |
| TGI | 100-140 req/sec | 60-90ms | Enterprise, diverse hardware | Multi-vendor |
| llama.cpp | Varies | Higher | CPU inference, edge | CPU |
| TensorRT-LLM | Highest | Lowest | Latency-critical, NVIDIA | NVIDIA GPU |
Benchmarks are approximate and vary by model size and hardware.
Roadmap: Multi-Backend Support
Phase 1: Abstract Interface Layer
- Define common interface across all backends
- Implement adapter pattern for backend-specific logic (see the sketch after this list)
- Create benchmark suite for performance comparison
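A possible shape for the common interface and an Ollama adapter, sketched under the assumption of an async `generate`/`health_check` contract; the class and field names are illustrative, not the project's final API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

import httpx


@dataclass
class GenerationResult:
    text: str
    prompt_tokens: int
    completion_tokens: int


class InferenceBackend(ABC):
    """Common contract every backend adapter implements (illustrative)."""

    @abstractmethod
    async def generate(self, model: str, prompt: str, **params) -> GenerationResult:
        ...

    @abstractmethod
    async def health_check(self) -> bool:
        ...


class OllamaBackend(InferenceBackend):
    """Adapter translating the common interface to Ollama's /api/generate."""

    def __init__(self, base_url: str):
        self.base_url = base_url

    async def generate(self, model: str, prompt: str, **params) -> GenerationResult:
        async with httpx.AsyncClient(timeout=120) as client:
            resp = await client.post(
                f"{self.base_url}/api/generate",
                json={"model": model, "prompt": prompt, "stream": False, **params},
            )
            resp.raise_for_status()
            data = resp.json()
        return GenerationResult(
            text=data["response"],
            prompt_tokens=data.get("prompt_eval_count", 0),
            completion_tokens=data.get("eval_count", 0),
        )

    async def health_check(self) -> bool:
        try:
            async with httpx.AsyncClient(timeout=2) as client:
                return (await client.get(f"{self.base_url}/api/tags")).status_code == 200
        except httpx.HTTPError:
            return False
```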
Phase 2: vLLM Integration
- Deploy vLLM instances for production workloads
- Implement intelligent routing (Ollama for dev, vLLM for prod; see the sketch after this list)
- Migration path from Ollama to vLLM with API compatibility
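One way the routing could be expressed, assuming backend names and an environment label on each request; the override table and defaults are placeholders.

```python
# Hypothetical routing table: development traffic stays on Ollama, production
# traffic goes to vLLM; individual models can be pinned to a backend.
BACKEND_BY_ENVIRONMENT = {"dev": "ollama", "staging": "vllm", "prod": "vllm"}
MODEL_OVERRIDES = {"codellama:70b": "vllm"}  # illustrative pin

def select_backend(environment: str, model: str) -> str:
    """Per-model overrides win; otherwise route by deployment environment."""
    return MODEL_OVERRIDES.get(model, BACKEND_BY_ENVIRONMENT.get(environment, "ollama"))
```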
Phase 3: Backend Selection
- Automatic backend selection based on request characteristics
- Cost/performance optimization across backends
- A/B testing framework for backend comparison
Phase 4: Advanced Features
- Model caching and warm-up strategies
- Request prioritization and QoS
- Multi-region deployment with intelligent routing
- Custom backend integrations
Implementation Details
Current (Ollama)
- Python-based API gateway
- Async request handling with retry logic (see the sketch after this list)
- Prometheus metrics export
- Docker deployment with GPU support
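A condensed sketch of how the async retry logic and Prometheus export might fit together, using `httpx` and `prometheus_client`; the metric names, port, and backoff policy are assumptions, not the gateway's actual configuration.

```python
import asyncio
import httpx
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; the gateway's real metric schema may differ.
REQUESTS = Counter("inference_requests_total", "Inference requests", ["model", "status"])
LATENCY = Histogram("inference_request_seconds", "End-to-end request latency", ["model"])

async def generate_with_retry(url: str, payload: dict, retries: int = 3) -> dict:
    """POST to an inference instance, retrying transient failures with backoff."""
    model = payload.get("model", "unknown")
    for attempt in range(retries):
        try:
            with LATENCY.labels(model).time():
                async with httpx.AsyncClient(timeout=120) as client:
                    resp = await client.post(url, json=payload)
                    resp.raise_for_status()
            REQUESTS.labels(model, "ok").inc()
            return resp.json()
        except httpx.HTTPError:
            if attempt == retries - 1:
                REQUESTS.labels(model, "error").inc()
                raise
            await asyncio.sleep(2 ** attempt)  # exponential backoff

if __name__ == "__main__":
    start_http_server(9000)  # expose /metrics for Prometheus scraping
```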
Planned Architecture
- Pluggable backend system
- Configuration-driven backend selection
- Per-model backend assignment
- Health monitoring across all backends
- Unified observability regardless of backend
Migration Strategy
For applications currently using Ollama:
- API Compatibility: Maintain identical API surface
- Gradual Rollout: Traffic shifting from Ollama to new backends (see the sketch at the end of this section)
- Performance Validation: Compare latency/throughput before full migration
- Rollback Capability: Instant revert if issues detected
This ensures zero downtime and minimal risk when adopting higher-performance backends.
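The gradual rollout can be as simple as weighted selection between the old and new backend, with the weight raised as performance validation passes; the backend names and percentages below are illustrative.

```python
import random

# Hypothetical rollout weights: the share of traffic sent to the new backend,
# bumped gradually (e.g. 5% -> 25% -> 100%) as validation passes.
ROLLOUT_WEIGHTS = {"vllm": 0.05, "ollama": 0.95}

def choose_backend_for_request() -> str:
    """Weighted random choice; setting the new backend's weight to 0 is an instant rollback."""
    backends, weights = zip(*ROLLOUT_WEIGHTS.items())
    return random.choices(backends, weights=weights, k=1)[0]
```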