Overview
The Inference Engine project provides a unified API gateway for serving open-source language models through various inference backends. The current implementation uses Ollama, but the architecture is designed to support multiple inference frameworks, allowing optimization for different deployment scenarios, from local development to high-throughput production workloads.
The engine adds authentication, usage tracking, request logging, and intelligent routing on top of the underlying inference frameworks, providing a production-ready interface for self-hosted LLMs.
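As a rough illustration, a client call against the gateway might look like the sketch below. The `/v1/chat/completions` route, bearer-token header, model name, and environment variable names are assumptions for the example, not the gateway's documented API.

```python
import os
import requests

# Hypothetical gateway URL and API key; the actual route and auth scheme
# depend on your deployment.
GATEWAY_URL = os.environ.get("INFERENCE_ENGINE_URL", "http://localhost:8080")
API_KEY = os.environ["INFERENCE_ENGINE_API_KEY"]

response = requests.post(
    f"{GATEWAY_URL}/v1/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "llama3",
        "messages": [{"role": "user", "content": "Summarize this release note."}],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json())
```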
Current Implementation: Ollama
Features
- Single Endpoint: Unified API for multiple Ollama instances
- Request Tracing: Complete request lifecycle logging for debugging
- Cost Tracking: Token usage and compute cost estimation per request
- Model Policies: Per-model configuration for timeouts, concurrency, and rate limits (see the sketch after this list)
- Client SDKs: TypeScript and Python libraries for easy integration
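A hedged sketch of what a per-model policy table could look like; the keys, model names, and defaults are illustrative rather than the gateway's actual configuration schema.

```python
# Hypothetical per-model policy table; keys and defaults are illustrative.
MODEL_POLICIES = {
    "llama3:8b": {
        "timeout_seconds": 30,          # hard cap on a single generation request
        "max_concurrency": 8,           # simultaneous requests per instance
        "rate_limit_per_minute": 120,
    },
    "llama3:70b": {
        "timeout_seconds": 120,
        "max_concurrency": 2,
        "rate_limit_per_minute": 20,
    },
}

DEFAULT_POLICY = {"timeout_seconds": 60, "max_concurrency": 4, "rate_limit_per_minute": 60}

def policy_for(model: str) -> dict:
    """Return the policy for a model, falling back to defaults."""
    return {**DEFAULT_POLICY, **MODEL_POLICIES.get(model, {})}
```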
Architecture
- Instance pooling for load distribution
- Health checks with automatic failover (see the sketch after this list)
- Rolling updates for model and prompt changes
- Structured logging to BigQuery for analytics
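A minimal sketch of the pooling and health-check idea, assuming `httpx` for async HTTP and Ollama's `/api/tags` endpoint as a cheap liveness probe; the instance URLs and failover policy are placeholders.

```python
import asyncio
import httpx

# Hypothetical pool of Ollama instances; /api/tags lists installed models
# on a healthy instance, so it doubles as a liveness probe.
INSTANCES = ["http://ollama-0:11434", "http://ollama-1:11434"]

async def healthy_instances(timeout: float = 2.0) -> list[str]:
    """Probe every instance concurrently and return the ones that respond."""
    async with httpx.AsyncClient(timeout=timeout) as client:
        results = await asyncio.gather(
            *(client.get(f"{url}/api/tags") for url in INSTANCES),
            return_exceptions=True,
        )
    return [
        url
        for url, result in zip(INSTANCES, results)
        if isinstance(result, httpx.Response) and result.status_code == 200
    ]

async def pick_instance() -> str:
    """Route to the first healthy instance; callers fail over by retrying."""
    healthy = await healthy_instances()
    if not healthy:
        raise RuntimeError("no healthy Ollama instances")
    return healthy[0]
```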
Use Cases
Ollama is ideal for:
- Local development and experimentation
- Single-user applications and internal tools
- Rapid prototyping of LLM features
- Scenarios that prioritize ease of deployment over raw performance
Alternative Inference Frameworks
vLLM
Best for: Production workloads with high concurrency
vLLM achieves 120-160 req/sec throughput with continuous batching and 50-80ms time-to-first-token latency. Key advantages:
- PagedAttention Algorithm: Dramatically improves memory efficiency and throughput
- Continuous Batching: Dynamic request batching for maximum GPU utilization
- OpenAI-Compatible API: Drop-in replacement for OpenAI endpoints (see the example below)
- Strong Concurrency: Designed for multi-user production scenarios
Use when:
- Serving many concurrent users
- Throughput is critical (API services, production deployments)
- GPU resources are available and should be maximized
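Because vLLM exposes an OpenAI-compatible server, it can be called with the standard `openai` Python client by overriding the base URL; the port and model name below depend on how the server was launched.

```python
from openai import OpenAI

# Point the standard OpenAI client at a locally running vLLM server.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Explain continuous batching in one sentence."}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```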
Text Generation Inference (TGI)
Best for: Enterprise deployments on diverse hardware
TGI achieves 100-140 req/sec throughput with dynamic batching and 60-90ms time-to-first-token. Developed by Hugging Face, it provides:
- Broad Hardware Support: NVIDIA/AMD GPUs, AWS Inferentia, Intel Gaudi
- Advanced Quantization: Multiple quantization methods for reduced memory
- Tensor Parallelism: Distribute large models across multiple GPUs
- Production Features: Metrics, distributed tracing, health checks
Use when:
- Deploying on non-NVIDIA hardware
- You need enterprise support and documentation
- Hugging Face ecosystem integration is valuable
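A minimal example of calling TGI's REST `/generate` endpoint from Python; the port and sampling parameters are illustrative and depend on how the container was started.

```python
import requests

# A locally running TGI server (port depends on how the container was launched).
TGI_URL = "http://localhost:8080"

response = requests.post(
    f"{TGI_URL}/generate",
    json={
        "inputs": "What is tensor parallelism?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["generated_text"])
```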
llama.cpp
Best for: CPU inference and edge deployment
Optimized for running models on CPUs and limited hardware:
- CPU Optimization: SIMD acceleration, quantization for CPU inference
- Low Memory: Runs large models on consumer hardware
- Cross-Platform: Works on Windows, macOS, Linux, mobile
- No GPU Required: Enables inference on any machine
Use when:
- GPU unavailable or cost-prohibitive
- Edge deployment (mobile, embedded devices)
- Privacy-sensitive scenarios requiring local-only inference
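A small example using the `llama-cpp-python` bindings; the GGUF file path is a placeholder for whatever quantized model is available locally.

```python
from llama_cpp import Llama  # pip install llama-cpp-python

# The model path is a placeholder; any local quantized GGUF file works.
llm = Llama(model_path="models/llama-3-8b-instruct.Q4_K_M.gguf", n_ctx=2048)

output = llm(
    "Q: Name three uses of edge inference.\nA:",
    max_tokens=96,
    stop=["Q:"],
)
print(output["choices"][0]["text"])
```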
TensorRT-LLM
Best for: Maximum performance on NVIDIA hardware
NVIDIA’s optimized inference solution, achieving the lowest latency of the frameworks compared here:
- GPU Optimization: Leverages latest NVIDIA GPU features
- Graph Optimization: Fuses operations for minimal overhead
- Mixed Precision: FP16/INT8 inference with minimal quality loss
- Multi-GPU: Efficient tensor and pipeline parallelism
Use when:
- Latency is critical (real-time applications)
- NVIDIA hardware is available
- You are willing to invest in optimization work
Framework Comparison Summary
| Framework | Throughput | Latency | Best For | Hardware |
|---|---|---|---|---|
| Ollama | 1-3 req/sec | Variable | Development, single-user | CPU/GPU |
| vLLM | 120-160 req/sec | 50-80ms | Production, high concurrency | NVIDIA GPU |
| TGI | 100-140 req/sec | 60-90ms | Enterprise, diverse hardware | Multi-vendor |
| llama.cpp | Varies | Higher | CPU inference, edge | CPU |
| TensorRT-LLM | Highest | Lowest | Latency-critical, NVIDIA | NVIDIA GPU |
Benchmarks are approximate and vary by model size and hardware.
Roadmap: Multi-Backend Support
Phase 1: Abstract Interface Layer
- Define common interface across all backends
- Implement adapter pattern for backend-specific logic (see the sketch after this list)
- Create benchmark suite for performance comparison
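A possible shape for the common interface and an Ollama adapter, sketched under the assumption of an async `generate`/`health_check` contract; the class and field names are illustrative, not the project's final API.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass

import httpx


@dataclass
class GenerationResult:
    text: str
    prompt_tokens: int
    completion_tokens: int


class InferenceBackend(ABC):
    """Common contract every backend adapter implements (illustrative)."""

    @abstractmethod
    async def generate(self, model: str, prompt: str, **params) -> GenerationResult:
        ...

    @abstractmethod
    async def health_check(self) -> bool:
        ...


class OllamaBackend(InferenceBackend):
    """Adapter translating the common interface to Ollama's /api/generate."""

    def __init__(self, base_url: str):
        self.base_url = base_url

    async def generate(self, model: str, prompt: str, **params) -> GenerationResult:
        async with httpx.AsyncClient(timeout=120) as client:
            resp = await client.post(
                f"{self.base_url}/api/generate",
                json={"model": model, "prompt": prompt, "stream": False, **params},
            )
            resp.raise_for_status()
            data = resp.json()
        return GenerationResult(
            text=data["response"],
            prompt_tokens=data.get("prompt_eval_count", 0),
            completion_tokens=data.get("eval_count", 0),
        )

    async def health_check(self) -> bool:
        try:
            async with httpx.AsyncClient(timeout=2) as client:
                return (await client.get(f"{self.base_url}/api/tags")).status_code == 200
        except httpx.HTTPError:
            return False
```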
Phase 2: vLLM Integration
- Deploy vLLM instances for production workloads
- Implement intelligent routing (Ollama for dev, vLLM for prod; see the sketch after this list)
- Migration path from Ollama to vLLM with API compatibility
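One way the routing could be expressed, assuming backend names and an environment label on each request; the override table and defaults are placeholders.

```python
# Hypothetical routing table: development traffic stays on Ollama, production
# traffic goes to vLLM; individual models can be pinned to a backend.
BACKEND_BY_ENVIRONMENT = {"dev": "ollama", "staging": "vllm", "prod": "vllm"}
MODEL_OVERRIDES = {"codellama:70b": "vllm"}  # illustrative pin

def select_backend(environment: str, model: str) -> str:
    """Per-model overrides win; otherwise route by deployment environment."""
    return MODEL_OVERRIDES.get(model, BACKEND_BY_ENVIRONMENT.get(environment, "ollama"))
```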
Phase 3: Backend Selection
- Automatic backend selection based on request characteristics
- Cost/performance optimization across backends
- A/B testing framework for backend comparison
Phase 4: Advanced Features
- Model caching and warm-up strategies
- Request prioritization and QoS
- Multi-region deployment with intelligent routing
- Custom backend integrations
Implementation Details
Current (Ollama)
- Python-based API gateway
- Async request handling with retry logic (see the sketch after this list)
- Prometheus metrics export
- Docker deployment with GPU support
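A condensed sketch of how the async retry logic and Prometheus export might fit together, using `httpx` and `prometheus_client`; the metric names, port, and backoff policy are assumptions, not the gateway's actual configuration.

```python
import asyncio
import httpx
from prometheus_client import Counter, Histogram, start_http_server

# Illustrative metric names; the gateway's real metric schema may differ.
REQUESTS = Counter("inference_requests_total", "Inference requests", ["model", "status"])
LATENCY = Histogram("inference_request_seconds", "End-to-end request latency", ["model"])

async def generate_with_retry(url: str, payload: dict, retries: int = 3) -> dict:
    """POST to an inference instance, retrying transient failures with backoff."""
    model = payload.get("model", "unknown")
    for attempt in range(retries):
        try:
            with LATENCY.labels(model).time():
                async with httpx.AsyncClient(timeout=120) as client:
                    resp = await client.post(url, json=payload)
                    resp.raise_for_status()
            REQUESTS.labels(model, "ok").inc()
            return resp.json()
        except httpx.HTTPError:
            if attempt == retries - 1:
                REQUESTS.labels(model, "error").inc()
                raise
            await asyncio.sleep(2 ** attempt)  # exponential backoff

if __name__ == "__main__":
    start_http_server(9000)  # expose /metrics for Prometheus scraping
```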
Planned Architecture
- Pluggable backend system
- Configuration-driven backend selection
- Per-model backend assignment
- Health monitoring across all backends
- Unified observability regardless of backend
Migration Strategy
For applications currently using Ollama:
- API Compatibility: Maintain identical API surface
- Gradual Rollout: Traffic shifting from Ollama to new backends (see the sketch at the end of this section)
- Performance Validation: Compare latency/throughput before full migration
- Rollback Capability: Instant revert if issues detected
This ensures zero downtime and minimal risk when adopting higher-performance backends.
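The gradual rollout can be as simple as weighted selection between the old and new backend, with the weight raised as performance validation passes; the backend names and percentages below are illustrative.

```python
import random

# Hypothetical rollout weights: the share of traffic sent to the new backend,
# bumped gradually (e.g. 5% -> 25% -> 100%) as validation passes.
ROLLOUT_WEIGHTS = {"vllm": 0.05, "ollama": 0.95}

def choose_backend_for_request() -> str:
    """Weighted random choice; setting the new backend's weight to 0 is an instant rollback."""
    backends, weights = zip(*ROLLOUT_WEIGHTS.items())
    return random.choices(backends, weights=weights, k=1)[0]
```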