
Inference Engine

High-performance serving infrastructure for open source language models.

Tags: in-development, infrastructure, llm-serving

Overview

The Inference Engine project provides a unified API gateway for serving open source language models through various inference backends. The current implementation uses Ollama, but the architecture is designed to support multiple inference frameworks, allowing optimization for different deployment scenarios, from local development to high-throughput production workloads.

The engine adds authentication, usage tracking, request logging, and intelligent routing on top of the underlying inference frameworks, providing a production-ready interface for self-hosted LLMs.
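
As a rough sketch of how the gateway wraps a backend, the example below puts an API-key check and per-request usage logging in front of a single Ollama instance. It assumes Ollama's default /api/generate endpoint on port 11434; the gateway route, header name, and environment variables are illustrative, not the project's actual interface.

```python
# Minimal gateway sketch: auth check + request logging in front of one Ollama instance.
# Assumes Ollama's default HTTP API (POST /api/generate on port 11434); the route,
# header, and env vars are illustrative, not the project's real interface.
import os
import time
import logging

import httpx
from fastapi import FastAPI, Header, HTTPException
from pydantic import BaseModel

OLLAMA_URL = os.getenv("OLLAMA_URL", "http://localhost:11434")
API_KEYS = set(os.getenv("GATEWAY_API_KEYS", "dev-key").split(","))

app = FastAPI()
log = logging.getLogger("gateway")


class GenerateRequest(BaseModel):
    model: str
    prompt: str


@app.post("/v1/generate")
async def generate(req: GenerateRequest, x_api_key: str = Header(default="")):
    if x_api_key not in API_KEYS:
        raise HTTPException(status_code=401, detail="invalid API key")

    start = time.perf_counter()
    async with httpx.AsyncClient(timeout=120) as client:
        resp = await client.post(
            f"{OLLAMA_URL}/api/generate",
            json={"model": req.model, "prompt": req.prompt, "stream": False},
        )
    resp.raise_for_status()
    body = resp.json()

    # Request-level usage log; Ollama reports eval_count (output tokens) when available.
    log.info(
        "model=%s latency_ms=%.0f output_tokens=%s",
        req.model,
        (time.perf_counter() - start) * 1000,
        body.get("eval_count"),
    )
    return {"model": req.model, "response": body.get("response")}
```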

Current Implementation: Ollama

Features

  • Single Endpoint: Unified API for multiple Ollama instances
  • Request Tracing: Complete request lifecycle logging for debugging
  • Cost Tracking: Token usage and compute cost estimation per request
  • Model Policies: Per-model configuration for timeouts, concurrency, rate limits
  • Client SDKs: TypeScript and Python libraries for easy integration
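
The Model Policies feature above amounts to a per-model configuration table. A minimal sketch of what that might look like, with hypothetical field names, model tags, and defaults:

```python
# Hypothetical per-model policy table; field names, model tags, and defaults are illustrative.
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPolicy:
    timeout_s: float = 60.0        # hard deadline for a single request
    max_concurrency: int = 4       # simultaneous requests allowed for this model
    rate_limit_per_min: int = 120  # per-client request budget


POLICIES = {
    "llama3:8b": ModelPolicy(timeout_s=30, max_concurrency=8, rate_limit_per_min=300),
    "llama3:70b": ModelPolicy(timeout_s=120, max_concurrency=2, rate_limit_per_min=60),
}

DEFAULT_POLICY = ModelPolicy()


def policy_for(model: str) -> ModelPolicy:
    """Look up the policy for a model, falling back to conservative defaults."""
    return POLICIES.get(model, DEFAULT_POLICY)
```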

Architecture

  • Instance pooling for load distribution
  • Health checks with automatic failover
  • Rolling updates for model and prompt changes
  • Structured logging to BigQuery for analytics
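
A small sketch of the instance pooling and health-check/failover ideas listed above: probe each Ollama instance with a cheap HTTP request and round-robin only over the instances that answered. Instance URLs and the probe timeout are placeholders.

```python
# Sketch of instance pooling with health checks and failover.
# Instance URLs are placeholders; Ollama answers GET / with a 200 when running.
import asyncio
import itertools

import httpx

INSTANCES = ["http://ollama-0:11434", "http://ollama-1:11434"]


class InstancePool:
    def __init__(self, urls: list[str]):
        self.urls = urls
        self.healthy: set[str] = set(urls)
        self._rr = itertools.cycle(urls)

    async def probe(self) -> None:
        """Mark each instance healthy or unhealthy based on a cheap HTTP probe."""
        async with httpx.AsyncClient(timeout=2) as client:
            for url in self.urls:
                try:
                    ok = (await client.get(url)).status_code == 200
                except httpx.HTTPError:
                    ok = False
                (self.healthy.add if ok else self.healthy.discard)(url)

    def pick(self) -> str:
        """Round-robin over healthy instances; fail over past unhealthy ones."""
        for _ in range(len(self.urls)):
            url = next(self._rr)
            if url in self.healthy:
                return url
        raise RuntimeError("no healthy Ollama instances")


async def main() -> None:
    pool = InstancePool(INSTANCES)
    await pool.probe()
    print("routing to", pool.pick())


if __name__ == "__main__":
    asyncio.run(main())
```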

Use Cases

Ollama is ideal for:

  • Local development and experimentation
  • Single-user applications and internal tools
  • Rapid prototyping of LLM features
  • Scenarios where ease of deployment matters more than raw performance

Alternative Inference Frameworks

vLLM

Best for: Production workloads with high concurrency

vLLM achieves 120-160 req/sec throughput with continuous batching and 50-80ms time-to-first-token latency. Key advantages:

  • PagedAttention Algorithm: Dramatically improves memory efficiency and throughput
  • Continuous Batching: Dynamic request batching for maximum GPU utilization
  • OpenAI-Compatible API: Drop-in replacement for OpenAI endpoints
  • Strong Concurrency: Designed for multi-user production scenarios

Use when:

  • Serving many concurrent users
  • Throughput is critical (API services, production deployments)
  • GPU resources are available and should be maximized
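
Because vLLM's server is OpenAI-compatible, existing OpenAI client code can usually be repointed at it by changing the base URL. A sketch, assuming a vLLM server is already running locally on port 8000; the model name and port are examples:

```python
# Sketch: talking to a locally running vLLM OpenAI-compatible server.
# Assumes something like `vllm serve meta-llama/Llama-3.1-8B-Instruct` is already
# running on port 8000; model name and port are examples.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="not-needed-for-local",  # vLLM does not require a real key by default
)

completion = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize what continuous batching does."}],
    max_tokens=128,
)
print(completion.choices[0].message.content)
```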

Text Generation Inference (TGI)

Best for: Enterprise deployments on diverse hardware

TGI achieves 100-140 req/sec throughput with dynamic batching and 60-90ms time-to-first-token. Developed by Hugging Face, it provides:

  • Broad Hardware Support: NVIDIA/AMD GPUs, AWS Inferentia, Intel Gaudi
  • Advanced Quantization: Multiple quantization methods for reduced memory
  • Tensor Parallelism: Distribute large models across multiple GPUs
  • Production Features: Metrics, distributed tracing, health checks

Use when:

  • Deploying on non-NVIDIA hardware
  • Need enterprise support and documentation
  • Hugging Face ecosystem integration is valuable
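
TGI exposes a simple REST generation endpoint. The sketch below posts to the default /generate route of a TGI server assumed to be running on port 8080; the prompt and parameters are examples.

```python
# Sketch: calling a running Text Generation Inference server via its REST API.
# Assumes TGI is serving on port 8080 (e.g. the official Docker image); values are examples.
import httpx

resp = httpx.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "Explain tensor parallelism in one sentence.",
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```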

llama.cpp

Best for: CPU inference and edge deployment

Optimized for running models on CPUs and limited hardware:

  • CPU Optimization: SIMD acceleration, quantization for CPU inference
  • Low Memory: Runs large models on consumer hardware
  • Cross-Platform: Works on Windows, macOS, Linux, mobile
  • No GPU Required: Enables inference on any machine

Use when:

  • GPU unavailable or cost-prohibitive
  • Edge deployment (mobile, embedded devices)
  • Privacy-sensitive scenarios requiring local-only inference
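
For the CPU and edge case, the llama-cpp-python bindings are a compact way to run a quantized GGUF model locally. The model path below is a placeholder; context size and thread count are tuning knobs.

```python
# Sketch: local CPU inference with the llama-cpp-python bindings.
# The GGUF path is a placeholder; context size and thread count are tuning knobs.
from llama_cpp import Llama

llm = Llama(
    model_path="./models/llama-3.1-8b-instruct.Q4_K_M.gguf",  # placeholder path
    n_ctx=2048,   # context window
    n_threads=8,  # CPU threads to use
)

out = llm("Q: What is quantization? A:", max_tokens=64, stop=["\n"])
print(out["choices"][0]["text"])
```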

TensorRT-LLM

Best for: Maximum performance on NVIDIA hardware

NVIDIA’s optimized inference library, which typically achieves the lowest latency of the frameworks compared here:

  • GPU Optimization: Leverages latest NVIDIA GPU features
  • Graph Optimization: Fuses operations for minimal overhead
  • Mixed Precision: FP16/INT8 inference with minimal quality loss
  • Multi-GPU: Efficient tensor and pipeline parallelism

Use when:

  • Latency is critical (real-time applications)
  • NVIDIA hardware is available
  • Willing to invest in optimization work
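
Recent TensorRT-LLM releases ship a high-level Python LLM API modeled on vLLM's. The sketch below assumes that API is available and that an engine can be built or loaded for the local GPU; treat it as an illustration rather than a verified recipe, since exact arguments and the engine-build workflow vary by version.

```python
# Sketch of TensorRT-LLM's high-level LLM API (available in recent releases).
# Exact arguments vary by version and an engine-build step may be required;
# the model name is an example.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(max_tokens=64, temperature=0.7)

outputs = llm.generate(["Why does operator fusion reduce latency?"], params)
print(outputs[0].outputs[0].text)
```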

Framework Comparison Summary

| Framework | Throughput | Latency | Best For | Hardware |
|---|---|---|---|---|
| Ollama | 1-3 req/sec | Variable | Development, single-user | CPU/GPU |
| vLLM | 120-160 req/sec | 50-80ms | Production, high concurrency | NVIDIA GPU |
| TGI | 100-140 req/sec | 60-90ms | Enterprise, diverse hardware | Multi-vendor |
| llama.cpp | Varies | Higher | CPU inference, edge | CPU |
| TensorRT-LLM | Highest | Lowest | Latency-critical, NVIDIA | NVIDIA GPU |

Benchmarks are approximate and vary by model size and hardware.

Roadmap: Multi-Backend Support

Phase 1: Abstract Interface Layer

  • Define common interface across all backends
  • Implement adapter pattern for backend-specific logic
  • Create benchmark suite for performance comparison
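
One plausible shape for the Phase 1 abstraction is a small backend interface plus one adapter per framework, as sketched below. The class and method names are hypothetical; only Ollama's /api/generate call reflects a real endpoint.

```python
# Sketch of the planned abstract interface + adapter pattern.
# Class and method names are hypothetical; only Ollama's /api/generate is a real endpoint.
from abc import ABC, abstractmethod

import httpx


class InferenceBackend(ABC):
    """Common interface every backend adapter implements."""

    @abstractmethod
    async def generate(self, model: str, prompt: str, max_tokens: int = 256) -> str: ...


class OllamaBackend(InferenceBackend):
    def __init__(self, base_url: str = "http://localhost:11434"):
        self.base_url = base_url

    async def generate(self, model: str, prompt: str, max_tokens: int = 256) -> str:
        async with httpx.AsyncClient(timeout=120) as client:
            resp = await client.post(
                f"{self.base_url}/api/generate",
                json={
                    "model": model,
                    "prompt": prompt,
                    "stream": False,
                    "options": {"num_predict": max_tokens},
                },
            )
        resp.raise_for_status()
        return resp.json()["response"]


class VLLMBackend(InferenceBackend):
    """Adapter for vLLM's OpenAI-compatible endpoint (not implemented in this sketch)."""

    async def generate(self, model: str, prompt: str, max_tokens: int = 256) -> str:
        raise NotImplementedError
```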

Phase 2: vLLM Integration

  • Deploy vLLM instances for production workloads
  • Implement intelligent routing (Ollama for dev, vLLM for prod)
  • Migration path from Ollama to vLLM with API compatibility

Phase 3: Backend Selection

  • Automatic backend selection based on request characteristics
  • Cost/performance optimization across backends
  • A/B testing framework for backend comparison
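
Phase 2 and 3 routing can start as a simple rule table that maps request characteristics (environment, model, expected load) to a backend. The selection logic below is purely illustrative; backend names and thresholds are examples.

```python
# Illustrative backend selection: environment- and model-driven rules.
# Backend names and thresholds are examples, not the project's actual policy.
from dataclasses import dataclass


@dataclass
class RequestInfo:
    model: str
    environment: str           # "dev" or "prod"
    expected_concurrency: int  # caller's hint about parallel load


def select_backend(req: RequestInfo) -> str:
    if req.environment == "dev":
        return "ollama"            # easy local iteration
    if req.expected_concurrency >= 10:
        return "vllm"              # continuous batching pays off
    if req.model.endswith(".gguf"):
        return "llama.cpp"         # quantized, CPU-friendly artifact
    return "vllm"                  # default production backend


print(select_backend(RequestInfo("llama3:8b", "prod", expected_concurrency=32)))  # -> vllm
```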

Phase 4: Advanced Features

  • Model caching and warm-up strategies
  • Request prioritization and QoS
  • Multi-region deployment with intelligent routing
  • Custom backend integrations

Implementation Details

Current (Ollama)

  • Python-based API gateway
  • Async request handling with retry logic
  • Prometheus metrics export
  • Docker deployment with GPU support
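
The async retry handling and Prometheus export in the current gateway might look roughly like the sketch below; the metric names, retry policy, and backend URL are illustrative rather than the actual implementation.

```python
# Sketch: async request handling with retry + Prometheus metrics export.
# Metric names, retry policy, and the backend URL are illustrative.
import asyncio

import httpx
from prometheus_client import Counter, Histogram, start_http_server

REQUESTS = Counter("gateway_requests_total", "Requests by model and outcome", ["model", "outcome"])
LATENCY = Histogram("gateway_request_seconds", "End-to-end request latency", ["model"])


async def generate_with_retry(model: str, prompt: str, attempts: int = 3) -> str:
    for attempt in range(attempts):
        try:
            with LATENCY.labels(model).time():
                async with httpx.AsyncClient(timeout=60) as client:
                    resp = await client.post(
                        "http://localhost:11434/api/generate",
                        json={"model": model, "prompt": prompt, "stream": False},
                    )
            resp.raise_for_status()
            REQUESTS.labels(model, "ok").inc()
            return resp.json()["response"]
        except httpx.HTTPError:
            REQUESTS.labels(model, "retry" if attempt < attempts - 1 else "error").inc()
            if attempt == attempts - 1:
                raise
            await asyncio.sleep(2 ** attempt)  # exponential backoff: 1s, 2s, ...
    raise RuntimeError("unreachable")


if __name__ == "__main__":
    start_http_server(9100)  # expose /metrics for Prometheus scraping
    print(asyncio.run(generate_with_retry("llama3:8b", "Hello!")))
```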

Planned Architecture

  • Pluggable backend system
  • Configuration-driven backend selection
  • Per-model backend assignment
  • Health monitoring across all backends
  • Unified observability regardless of backend

Migration Strategy

For applications currently using Ollama:

  1. API Compatibility: Maintain identical API surface
  2. Gradual Rollout: Traffic shifting from Ollama to new backends
  3. Performance Validation: Compare latency/throughput before full migration
  4. Rollback Capability: Instant revert if issues detected

This ensures zero downtime and minimal risk when adopting higher-performance backends.
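
As an illustration of the gradual-rollout and rollback steps, a per-request weighted choice is often enough to shift traffic: start the new backend at a small percentage, watch the metrics, and raise the weight, or drop it back to zero for an instant rollback. The weights below are examples.

```python
# Illustrative traffic shifting between the old and new backend.
# Weights are examples; setting NEW_BACKEND_WEIGHT to 0.0 is the instant rollback.
import random

NEW_BACKEND_WEIGHT = 0.05  # start by sending 5% of traffic to the new backend


def choose_backend() -> str:
    return "vllm" if random.random() < NEW_BACKEND_WEIGHT else "ollama"


# Quick sanity check of the split over many simulated requests.
sample = [choose_backend() for _ in range(10_000)]
print("vllm share:", sample.count("vllm") / len(sample))
```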