[GH-ISSUE #17] feat: Implement OpenTelemetry observability (traces, metrics, logs) #2

Closed
opened 2026-03-02 05:12:20 +03:00 by kerem · 1 comment
Owner

Originally created by @dviejokfs on GitHub (Feb 26, 2026).
Original GitHub issue: https://github.com/gotempsh/temps/issues/17

Problem description

Temps currently has no built-in support for distributed tracing, metrics collection, or structured log aggregation from deployed applications. Users have no way to observe the internal behavior of their services — they can't see request latency breakdowns, trace cross-service calls, identify slow operations, or correlate logs with specific traces. This makes debugging production issues difficult and time-consuming.

Proposed solution

Add a complete OpenTelemetry (OTel) observability stack to Temps, covering the full pipeline from ingest to visualization:

Backend (temps-otel crate)

  • OTLP/HTTP protobuf ingest for traces, metrics, and logs with gzip/zstd decompression
  • Dual auth modes: API keys (tk_) with X-Temps-Project-Id header, and deployment tokens (dt_) with automatic project/environment/deployment binding
  • Header-based routes (POST /otel/v1/{traces,metrics,logs}) and path-based routes (POST /otel/v1/{project_id}/{environment_id}/{deployment_id}/{traces,metrics,logs})
  • TimescaleDB storage with hypertables, continuous aggregates (1min/1hr rollups), compression policies (7 days), and retention policies (90 days)
  • Query API: list/filter spans, get single trace, query metrics with time_bucket aggregation, query logs with severity/search filters, list metric names, pipeline stats, health summaries, insights
  • Rate limiting per project and storage quota checks on ingest
  • Anomaly detection with baseline computation and insight generation
  • Health compute service for per-service health summaries
  • Sidecar config generation for OTel Collector deployment alongside containers
  • OpenAPI annotations on all endpoints
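As an illustration of the query API's time_bucket aggregation (a TimescaleDB feature), the server-side rollup can be thought of as grouping metric points into fixed-width windows and aggregating each window. This sketch is purely conceptual; the types and function names are hypothetical, not the actual temps-otel code:

```typescript
// Conceptual sketch of time_bucket-style averaging over metric points.
// "Point", "timeBucketAvg", and the field names are illustrative only.

interface Point { tsMs: number; value: number }

function timeBucketAvg(points: Point[], bucketMs: number): Map<number, number> {
  const sums = new Map<number, { sum: number; n: number }>();
  for (const p of points) {
    // Snap each timestamp down to the start of its bucket.
    const bucket = Math.floor(p.tsMs / bucketMs) * bucketMs;
    const acc = sums.get(bucket) ?? { sum: 0, n: 0 };
    acc.sum += p.value;
    acc.n += 1;
    sums.set(bucket, acc);
  }
  // Average each bucket.
  return new Map([...sums].map(([b, { sum, n }]) => [b, sum / n]));
}
```

The 1min/1hr continuous aggregates mentioned above would precompute exactly this kind of rollup so queries don't scan raw points.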

Auth & Permissions

  • New OtelRead and OtelWrite permissions
  • deployment_id field added to deployment tokens for full context propagation
  • Auth middleware updated to pass deployment_id from token records
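The dual-auth behavior described above can be sketched as follows. The token prefixes (tk_, dt_) and the X-Temps-Project-Id header come from this issue; the function, types, and error messages are hypothetical, not the real middleware:

```typescript
// Illustrative sketch of dual-auth resolution; not the actual temps-otel API.

type AuthContext =
  | { mode: "api_key"; projectId: string }
  // Deployment tokens bind project/environment/deployment server-side,
  // so no extra header is needed.
  | { mode: "deployment_token" };

function resolveAuth(token: string, headers: Record<string, string>): AuthContext {
  if (token.startsWith("tk_")) {
    // API keys must name the target project explicitly via header.
    const projectId = headers["x-temps-project-id"];
    if (!projectId) throw new Error("X-Temps-Project-Id header required for API keys");
    return { mode: "api_key", projectId };
  }
  if (token.startsWith("dt_")) {
    return { mode: "deployment_token" };
  }
  throw new Error("unrecognized token prefix");
}
```

With deployment_id now stored on deployment tokens, the dt_ branch can propagate full project/environment/deployment context without any client-supplied headers.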

Frontend

  • Traces list page with filterable table (time range, service name, status, search by trace ID), pagination, and grouped-by-trace view showing trace-level error status
  • Trace detail page with span waterfall/Gantt chart visualization, span detail panel (timing, IDs, resource info, attributes, events), and refresh button
  • Setup section with environment selector, OTLP endpoint display, and Next.js integration code snippets
  • Sidebar navigation item ("Traces" with Workflow icon)
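For the waterfall/Gantt visualization, the core layout math is deriving each bar's horizontal offset and width from span start/end times relative to the trace's start. This is an illustrative sketch under assumed span shapes, not the actual frontend code:

```typescript
// Illustrative waterfall layout: each span becomes a bar positioned as a
// percentage of total trace duration. Types and names are hypothetical.

interface Span { name: string; startMs: number; endMs: number }
interface Bar { name: string; offsetPct: number; widthPct: number }

function layoutWaterfall(spans: Span[]): Bar[] {
  const traceStart = Math.min(...spans.map(s => s.startMs));
  const traceEnd = Math.max(...spans.map(s => s.endMs));
  // Guard against zero-duration traces.
  const total = Math.max(traceEnd - traceStart, 1);
  return spans.map(s => ({
    name: s.name,
    offsetPct: ((s.startMs - traceStart) / total) * 100,
    widthPct: ((s.endMs - s.startMs) / total) * 100,
  }));
}
```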

Infrastructure

  • Migration adding deployment_id to deployment_tokens table
  • Protobuf schema fix: Span.flags and Span.Link.flags changed from uint32 to fixed32 per OTLP v1.1.0+ spec
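The uint32 → fixed32 change matters on the wire: protobuf encodes uint32 as a variable-length varint, while fixed32 is always four little-endian bytes, so the two are not wire-compatible and a mismatched schema misparses the flags field. Minimal encoders for comparison (illustrative only, not the project's codec):

```typescript
// Varint encoding: 7 value bits per byte, high bit set on all but the last.
function encodeVarint(value: number): number[] {
  const out: number[] = [];
  let v = value >>> 0;
  while (v >= 0x80) {
    out.push((v & 0x7f) | 0x80);
    v >>>= 7;
  }
  out.push(v);
  return out;
}

// Fixed32 encoding: always exactly 4 bytes, little-endian.
function encodeFixed32(value: number): number[] {
  const v = value >>> 0;
  return [v & 0xff, (v >>> 8) & 0xff, (v >>> 16) & 0xff, (v >>> 24) & 0xff];
}
```

For example, the value 300 is two bytes as a varint but four bytes as fixed32, which is why the decoder must agree with the OTLP v1.1.0+ schema.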

Alternatives considered

  • External OTel backends (Jaeger, Tempo, SigNoz): Adds operational complexity and external dependencies. A built-in solution provides tighter integration with Temps' auth, projects, and deployments model.
  • Server-side tail sampling: Initially implemented at 1% base rate but removed — sampling should be the client SDK's responsibility (head-based) since the server should store everything it receives.

Additional context

  • The temps-otel crate follows the existing plugin architecture (TempsPlugin trait) and is registered at position 9.7 in console.rs
  • 117+ unit tests covering ingest, decode, auth, rate limiting, error handling, and query paths
  • Zero clippy warnings
  • Client SDKs configure sampling via standard OTel SDK settings (e.g., TraceIdRatioBased sampler)
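To make the head-based sampling decision concrete, here is a simplified illustration of trace-ID-ratio sampling as a client SDK would apply it. This is not the exact OTel SDK algorithm, just the core idea: derive a deterministic value from the trace ID and compare it against the ratio, so every participant sampling the same trace makes the same decision:

```typescript
// Simplified trace-ID-ratio sampling sketch (hypothetical, not the OTel SDK).

function shouldSample(traceIdHex: string, ratio: number): boolean {
  // Treat the low 32 bits of the trace ID as a uniform random value.
  const low = parseInt(traceIdHex.slice(-8), 16) >>> 0;
  return low < ratio * 0x100000000;
}
```

Because the decision is a pure function of the trace ID, the server can store everything it receives and still end up with consistent, whole traces.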
kerem 2026-03-02 05:12:20 +03:00
Author
Owner

@dviejokfs commented on GitHub (Feb 28, 2026):

Closed by #18
