LLM Service
An LLM-powered impact analysis and report generation service for the Rippler system. This service leverages Large Language Models (OpenAI GPT-4o-mini, Anthropic Claude, and local models via Ollama) to generate comprehensive impact analysis reports for code changes.
Architecture Diagram
The following diagram illustrates the internal architecture of the LLM Service, showing the multi-provider LLM integration, automatic fallback mechanism, request validation, prompt building, and response parsing.
To view and edit the architecture diagram:
- Open /docs/architecture/services/llm-service.drawio in diagrams.net or VS Code with the Draw.io extension
- The diagram shows the complete LLM service architecture including FastAPI endpoints, provider routing, fallback management, caching, rate limiting, and integration with OpenAI, Anthropic, and Ollama
- After making changes, export to HTML and copy to /website/static/architecture/services/llm-service.drawio.html
Overview
The LLM Service is the intelligence engine of Rippler, using advanced AI models to analyze code changes and predict their impact across the microservice architecture.
Features
- Multi-Provider LLM Integration: Support for OpenAI GPT-4o-mini, Anthropic Claude, and local models (Ollama)
- Automatic Fallback: Seamlessly falls back to local models when remote APIs are unavailable
- Impact Analysis: Generate detailed analysis of code changes and their downstream impact
- Risk Scoring: Automated risk/impact scoring (high, medium, low)
- Structured Reports: JSON-formatted reports with natural language explanations
- Rate Limiting: Graceful handling of API rate limits with retry logic
- Performance: Sub-10-second response time for typical PRs
- Cost Optimization: Streaming and caching support
Technology Stack
- Python 3.11+: Modern Python runtime
- FastAPI: High-performance async web framework
- OpenAI SDK: GPT-4o-mini integration
- Anthropic SDK: Claude integration
- Ollama: Local LLM inference
- Pydantic: Data validation and serialization
API Endpoints
POST /api/v1/analyze
Accepts structured diff/change data and returns a comprehensive impact analysis report.
Request Body:
{
"repository": {
"name": "my-repo",
"owner": "my-org",
"url": "https://github.com/my-org/my-repo"
},
"pull_request": {
"number": 123,
"title": "Add new feature",
"description": "This PR adds a new feature...",
"author": "developer",
"branch": "feature/new-feature",
"base_branch": "main"
},
"changes": [
{
"file": "src/service.py",
"type": "modified",
"additions": 50,
"deletions": 10,
"diff": "diff content..."
}
],
"dependencies": {
"direct": ["service-a", "service-b"],
"transitive": ["service-c", "service-d"]
},
"mode": "production"
}
Response:
{
"summary": {
"text": "Natural language summary...",
"confidence": 0.95
},
"changes_analysis": {
"description": "Detailed analysis...",
"confidence": 0.90
},
"affected_services": [
{
"name": "service-a",
"impact_level": "high",
"reason": "Direct dependency with breaking changes",
"confidence": 0.85
}
],
"risk_assessment": {
"overall_risk": "medium",
"score": 0.65,
"factors": ["Breaking API changes", "Multiple services affected"],
"confidence": 0.88
},
"stakeholders": [
{
"name": "Team A",
"role": "owner",
"notification_priority": "high",
"reason": "Owns affected service"
}
],
"recommendations": [
{
"priority": "high",
"action": "Add integration tests",
"rationale": "To verify compatibility..."
}
],
"metadata": {
"model": "gpt-4o-mini",
"processing_time_ms": 4500,
"tokens_used": 2500,
"used_fallback": false
}
}
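Callers that want typed access to this payload can mirror it with Pydantic models. The sketch below follows the field names in the example response; the class names are illustrative assumptions, not the service's actual models/response.py definitions.

```python
# Illustrative Pydantic models mirroring the example response above.
# Class names are assumptions; field names come from the JSON keys shown.
from pydantic import BaseModel

class Summary(BaseModel):
    text: str
    confidence: float

class AffectedService(BaseModel):
    name: str
    impact_level: str  # "high" | "medium" | "low"
    reason: str
    confidence: float

class RiskAssessment(BaseModel):
    overall_risk: str
    score: float
    factors: list[str]
    confidence: float

class Metadata(BaseModel):
    model: str
    processing_time_ms: int
    tokens_used: int
    used_fallback: bool

class AnalysisReport(BaseModel):
    summary: Summary
    affected_services: list[AffectedService]
    risk_assessment: RiskAssessment
    metadata: Metadata
    # changes_analysis, stakeholders and recommendations omitted for brevity

# Validate a raw response dict into a typed report:
# report = AnalysisReport.model_validate(response_json)
```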
Configuration
Environment Variables
| Variable | Description | Default |
|---|---|---|
| LLM_PROVIDER | LLM provider (openai, anthropic, or local) | openai |
| OPENAI_API_KEY | OpenAI API key | - |
| ANTHROPIC_API_KEY | Anthropic API key | - |
| OPENAI_MODEL | OpenAI model name | gpt-4o-mini |
| ANTHROPIC_MODEL | Anthropic model name | claude-3-haiku-20240307 |
| LOCAL_MODEL_ENABLED | Enable local model fallback | true |
| LOCAL_MODEL_TYPE | Local model type | ollama |
| LOCAL_MODEL_NAME | Ollama model name | codellama:7b |
| LOCAL_MODEL_BASE_URL | Ollama API URL | http://localhost:11434 |
| LOCAL_MODEL_GPU_LAYERS | GPU layers (-1 = auto, 0 = CPU only) | -1 |
| ENABLE_FALLBACK | Enable automatic fallback | true |
| HOST | Server host | 0.0.0.0 |
| PORT | Server port | 8000 |
| LOG_LEVEL | Logging level | INFO |
| MAX_RETRIES | Max retry attempts | 3 |
| TIMEOUT_SECONDS | Request timeout in seconds | 30 |
| ENABLE_CACHING | Enable result caching | true |
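Since the stack already includes Pydantic, these variables map naturally onto a settings class. The sketch below is a minimal illustration assuming the pydantic-settings package; it is not the service's actual core/config.py.

```python
# Illustrative settings loader for the variables above, assuming
# pydantic-settings is installed; not the service's actual core/config.py.
from pydantic_settings import BaseSettings

class Settings(BaseSettings):
    llm_provider: str = "openai"              # openai | anthropic | local
    openai_api_key: str | None = None
    anthropic_api_key: str | None = None
    openai_model: str = "gpt-4o-mini"
    anthropic_model: str = "claude-3-haiku-20240307"
    local_model_enabled: bool = True
    local_model_type: str = "ollama"
    local_model_name: str = "codellama:7b"
    local_model_base_url: str = "http://localhost:11434"
    local_model_gpu_layers: int = -1          # -1 = auto, 0 = CPU only
    enable_fallback: bool = True
    host: str = "0.0.0.0"
    port: int = 8000
    log_level: str = "INFO"
    max_retries: int = 3
    timeout_seconds: int = 30
    enable_caching: bool = True

settings = Settings()  # field values are read from the environment (case-insensitive)
```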
Local Model Setup
The service supports local models through Ollama for offline inference with no per-token API costs.
Installing Ollama
# macOS/Linux
curl -fsSL https://ollama.ai/install.sh | sh
# Or download from https://ollama.ai/download
Pulling a Model
# Pull CodeLlama 7B (recommended for code analysis)
ollama pull codellama:7b
# Or use other code-focused models
ollama pull codellama:13b
ollama pull deepseek-coder:6.7b
Starting Ollama
# Ollama runs as a service on port 11434
ollama serve
# Verify it's running
curl http://localhost:11434/api/tags
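The same check can be done from Python, which is handy in startup probes. This is a small sketch using httpx against the /api/tags endpoint shown above; the base URL mirrors the LOCAL_MODEL_BASE_URL default.

```python
# Quick reachability check for the local Ollama server (same endpoint as the
# curl command above). Adjust the base URL if LOCAL_MODEL_BASE_URL differs.
import httpx

def ollama_available(base_url: str = "http://localhost:11434") -> bool:
    try:
        response = httpx.get(f"{base_url}/api/tags", timeout=2.0)
        response.raise_for_status()
        models = [m["name"] for m in response.json().get("models", [])]
        print(f"Ollama is up, available models: {models}")
        return True
    except httpx.HTTPError:
        return False
```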
GPU Acceleration
Enable GPU acceleration for faster inference:
# Enable GPU with automatic layer offloading (default)
export LOCAL_MODEL_GPU_LAYERS=-1
# Use CPU only
export LOCAL_MODEL_GPU_LAYERS=0
# For multi-GPU setups
export LOCAL_MODEL_MAIN_GPU=0
GPU Requirements:
- NVIDIA GPU with CUDA support (for best performance)
- AMD GPU with ROCm support (experimental)
- Apple Silicon (M1/M2/M3) uses Metal acceleration automatically
Benefits:
- 3-10x faster inference compared to CPU-only
- Lower latency for real-time analysis
- Better handling of larger models
Installation
# Clone the repository
git clone https://github.com/Citi-Rippler/llm-service.git
cd llm-service
# Install dependencies
pip install -r requirements.txt
# For development
pip install -r requirements-dev.txt
Running the Service
Using Python
# Set environment variables
export OPENAI_API_KEY=your-key-here
export LLM_PROVIDER=openai
# Run the service
python -m app.main
Using Uvicorn
uvicorn app.main:app --host 0.0.0.0 --port 8000 --reload
Using Docker
docker build -t rippler-llm-service .
docker run -p 8000:8000 \
-e OPENAI_API_KEY=your-key \
-e LLM_PROVIDER=openai \
rippler-llm-service
Using Local Models
# Start Ollama
ollama serve
# Run the service with local provider
export LLM_PROVIDER=local
export LOCAL_MODEL_NAME=codellama:7b
python -m app.main
Fallback Configuration
When ENABLE_FALLBACK=true (default), the service automatically falls back to local models if:
- Remote API is unavailable (connection errors)
- Rate limits are exceeded
- Authentication fails
The fallback is transparent: the response simply includes used_fallback: true in the metadata.
Example fallback scenario:
# OpenAI API down or rate limited
# Service automatically uses local Ollama model
# Response includes: "metadata": {"model": "codellama:7b", "used_fallback": true}
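In outline, the fallback amounts to catching provider errors and re-running the prompt against Ollama. The sketch below is a simplified illustration, not the actual services/llm_service.py; call_remote_provider and call_local_model are hypothetical stand-ins for the real provider clients.

```python
# Simplified illustration of the remote-to-local fallback flow described above.
# call_remote_provider / call_local_model are hypothetical helpers, not the
# service's real provider clients.
import openai

async def generate_report(prompt: str, enable_fallback: bool = True) -> dict:
    try:
        return await call_remote_provider(prompt)   # hypothetical: OpenAI/Anthropic call
    except (openai.RateLimitError, openai.APIConnectionError,
            openai.AuthenticationError):
        if not enable_fallback:
            raise
        report = await call_local_model(prompt)     # hypothetical: Ollama call
        # Mark the response so callers can see the fallback path was taken
        report["metadata"]["used_fallback"] = True
        return report
```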
Integration
From Impact Analyzer
The Impact Analyzer sends structured data to the LLM service:
import httpx

async def analyze_pr(pr_data: dict) -> dict:
    """POST structured PR data to the LLM service and return the analysis report."""
    async with httpx.AsyncClient() as client:
        response = await client.post(
            "http://llm-service:8000/api/v1/analyze",
            json=pr_data,
            timeout=30.0,
        )
        response.raise_for_status()  # surface HTTP errors to the caller
        return response.json()
Required Data Structure
The LLM service expects:
- Repository Information: Basic metadata about the repository
- Pull Request Details: PR number, title, description, author
- Code Changes: File-level diffs and statistics
- Dependency Information: Direct and transitive dependencies
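These inputs correspond to the request body shown earlier and can be expressed as Pydantic models. The sketch below mirrors the example's field names and is illustrative only, not the service's actual models/request.py.

```python
# Illustrative request models matching the structure above; field names come
# from the request example earlier, not from the actual models/request.py.
from pydantic import BaseModel

class Repository(BaseModel):
    name: str
    owner: str
    url: str

class PullRequest(BaseModel):
    number: int
    title: str
    description: str
    author: str
    branch: str
    base_branch: str

class FileChange(BaseModel):
    file: str
    type: str            # e.g. "modified"
    additions: int
    deletions: int
    diff: str

class Dependencies(BaseModel):
    direct: list[str]
    transitive: list[str]

class AnalyzeRequest(BaseModel):
    repository: Repository
    pull_request: PullRequest
    changes: list[FileChange]
    dependencies: Dependencies
    mode: str = "production"
```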
Performance
- Response Time: Optimized for < 10 seconds for typical PRs (10-20 file changes)
- Token Usage: Efficient prompt design to minimize token consumption
- Caching: Results cached for identical inputs (a key-derivation sketch follows this list)
- Streaming: Optional streaming support for large analyses
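A simple way to implement "identical inputs" is to hash a canonical JSON serialization of the request body and use the digest as the cache key. This is a sketch of that idea; an in-memory dict stands in for whatever backend utils/cache.py actually uses.

```python
# Sketch of cache-key derivation for "identical inputs": hash the request body
# after canonical JSON serialization. The in-memory dict is a stand-in for the
# real caching backend.
import hashlib
import json

_cache: dict[str, dict] = {}

def cache_key(payload: dict) -> str:
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def get_cached(payload: dict) -> dict | None:
    return _cache.get(cache_key(payload))

def set_cached(payload: dict, report: dict) -> None:
    _cache[cache_key(payload)] = report
```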
Performance Benchmarks
| Provider | Avg Response Time | Cost per 1M Input Tokens |
|---|---|---|
| OpenAI GPT-4o-mini | 3-5s | $0.15 |
| Anthropic Claude Haiku | 2-4s | $0.25 |
| Ollama CodeLlama 7B (GPU) | 5-8s | Free |
| Ollama CodeLlama 7B (CPU) | 20-30s | Free |
Error Handling
The service handles:
- API rate limits (with exponential backoff; a retry sketch follows below)
- Network timeouts
- Invalid input data
- LLM service unavailability (with automatic fallback)
All errors return structured JSON responses with appropriate HTTP status codes.
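The backoff behaviour pairs with the MAX_RETRIES and TIMEOUT_SECONDS settings. A generic retry helper might look like the sketch below; it is illustrative, not the actual utils/retry.py, and a real implementation would catch provider-specific rate-limit and timeout errors rather than bare Exception.

```python
# Generic retry-with-exponential-backoff sketch for transient API errors;
# illustrative only, not the service's actual utils/retry.py.
import asyncio
import random

async def with_backoff(coro_factory, max_retries: int = 3, base_delay: float = 1.0):
    """Call `await coro_factory()` with up to `max_retries` retries on failure."""
    for attempt in range(max_retries + 1):
        try:
            return await coro_factory()
        except Exception:
            if attempt == max_retries:
                raise
            # Exponential backoff with jitter: ~1s, 2s, 4s, ...
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            await asyncio.sleep(delay)
```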
Testing
# Run all tests
pytest
# Run with coverage
pytest --cov=app --cov-report=html
# Run specific test
pytest tests/test_api.py
Development
Code Quality
# Format code
black app/ tests/
# Sort imports
isort app/ tests/
# Lint
flake8 app/ tests/
# Type checking
mypy app/
Project Structure
app/
├── main.py # FastAPI application entry point
├── api/
│ └── v1/
│ └── endpoints.py # API endpoints
├── core/
│ ├── config.py # Configuration management
│ └── prompts.py # Prompt templates
├── models/
│ ├── request.py # Request models
│ └── response.py # Response models
├── services/
│ ├── llm_service.py # LLM integration
│ └── analyzer.py # Impact analysis logic
└── utils/
├── retry.py # Retry logic
└── cache.py # Caching utilities
Related Documentation
Repository
GitHub: https://github.com/Citi-Rippler/llm-service