[GH-ISSUE #17] feat: Implement OpenTelemetry observability (traces, metrics, logs) #2

Closed
opened 2026-03-02 05:12:20 +03:00 by kerem · 1 comment
Owner

Originally created by @dviejokfs on GitHub (Feb 26, 2026).
Original GitHub issue: https://github.com/gotempsh/temps/issues/17

Problem description

Temps currently has no built-in support for distributed tracing, metrics collection, or structured log aggregation from deployed applications. Users have no way to observe the internal behavior of their services — they can't see request latency breakdowns, trace cross-service calls, identify slow operations, or correlate logs with specific traces. This makes debugging production issues difficult and time-consuming.

Proposed solution

Add a complete OpenTelemetry (OTel) observability stack to Temps, covering the full pipeline from ingest to visualization:

Backend (temps-otel crate)

  • OTLP/HTTP protobuf ingest for traces, metrics, and logs with gzip/zstd decompression
  • Dual auth modes: API keys (tk_) with X-Temps-Project-Id header, and deployment tokens (dt_) with automatic project/environment/deployment binding
  • Header-based routes (POST /otel/v1/{traces,metrics,logs}) and path-based routes (POST /otel/v1/{project_id}/{environment_id}/{deployment_id}/{traces,metrics,logs})
  • TimescaleDB storage with hypertables, continuous aggregates (1min/1hr rollups), compression policies (7 days), and retention policies (90 days)
  • Query API: list/filter spans, get single trace, query metrics with time_bucket aggregation, query logs with severity/search filters, list metric names, pipeline stats, health summaries, insights
  • Rate limiting per project and storage quota checks on ingest
  • Anomaly detection with baseline computation and insight generation
  • Health compute service for per-service health summaries
  • Sidecar config generation for OTel Collector deployment alongside containers
  • OpenAPI annotations on all endpoints
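As an illustration of the query API's time_bucket aggregation (a TimescaleDB feature), the server-side rollup can be thought of as grouping metric points into fixed-width windows and aggregating each window. This sketch is purely conceptual; the types and function names are hypothetical, not the actual temps-otel code:

```typescript
// Conceptual sketch of time_bucket-style averaging over metric points.
// "Point", "timeBucketAvg", and the field names are illustrative only.

interface Point { tsMs: number; value: number }

function timeBucketAvg(points: Point[], bucketMs: number): Map<number, number> {
  const sums = new Map<number, { sum: number; n: number }>();
  for (const p of points) {
    // Snap each timestamp down to the start of its bucket.
    const bucket = Math.floor(p.tsMs / bucketMs) * bucketMs;
    const acc = sums.get(bucket) ?? { sum: 0, n: 0 };
    acc.sum += p.value;
    acc.n += 1;
    sums.set(bucket, acc);
  }
  // Average each bucket.
  return new Map([...sums].map(([b, { sum, n }]) => [b, sum / n]));
}
```

The 1min/1hr continuous aggregates mentioned above would precompute exactly this kind of rollup so queries don't scan raw points.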

Auth & Permissions

  • New OtelRead and OtelWrite permissions
  • deployment_id field added to deployment tokens for full context propagation
  • Auth middleware updated to pass deployment_id from token records
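The dual-auth behavior described above can be sketched as follows. The token prefixes (tk_, dt_) and the X-Temps-Project-Id header come from this issue; the function, types, and error messages are hypothetical, not the real middleware:

```typescript
// Illustrative sketch of dual-auth resolution; not the actual temps-otel API.

type AuthContext =
  | { mode: "api_key"; projectId: string }
  // Deployment tokens bind project/environment/deployment server-side,
  // so no extra header is needed.
  | { mode: "deployment_token" };

function resolveAuth(token: string, headers: Record<string, string>): AuthContext {
  if (token.startsWith("tk_")) {
    // API keys must name the target project explicitly via header.
    const projectId = headers["x-temps-project-id"];
    if (!projectId) throw new Error("X-Temps-Project-Id header required for API keys");
    return { mode: "api_key", projectId };
  }
  if (token.startsWith("dt_")) {
    return { mode: "deployment_token" };
  }
  throw new Error("unrecognized token prefix");
}
```

With deployment_id now stored on deployment tokens, the dt_ branch can propagate full project/environment/deployment context without any client-supplied headers.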

Frontend

  • Traces list page with filterable table (time range, service name, status, search by trace ID), pagination, and grouped-by-trace view showing trace-level error status
  • Trace detail page with span waterfall/Gantt chart visualization, span detail panel (timing, IDs, resource info, attributes, events), and refresh button
  • Setup section with environment selector, OTLP endpoint display, and Next.js integration code snippets
  • Sidebar navigation item ("Traces" with Workflow icon)
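For the waterfall/Gantt visualization, the core layout math is deriving each bar's horizontal offset and width from span start/end times relative to the trace's start. This is an illustrative sketch under assumed span shapes, not the actual frontend code:

```typescript
// Illustrative waterfall layout: each span becomes a bar positioned as a
// percentage of total trace duration. Types and names are hypothetical.

interface Span { name: string; startMs: number; endMs: number }
interface Bar { name: string; offsetPct: number; widthPct: number }

function layoutWaterfall(spans: Span[]): Bar[] {
  const traceStart = Math.min(...spans.map(s => s.startMs));
  const traceEnd = Math.max(...spans.map(s => s.endMs));
  // Guard against zero-duration traces.
  const total = Math.max(traceEnd - traceStart, 1);
  return spans.map(s => ({
    name: s.name,
    offsetPct: ((s.startMs - traceStart) / total) * 100,
    widthPct: ((s.endMs - s.startMs) / total) * 100,
  }));
}
```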

Infrastructure

  • Migration adding deployment_id to deployment_tokens table
  • Protobuf schema fix: Span.flags and Span.Link.flags changed from uint32 to fixed32 per OTLP v1.1.0+ spec
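The uint32 → fixed32 change matters on the wire: protobuf encodes uint32 as a variable-length varint, while fixed32 is always four little-endian bytes, so the two are not wire-compatible and a mismatched schema misparses the flags field. Minimal encoders for comparison (illustrative only, not the project's codec):

```typescript
// Varint encoding: 7 value bits per byte, high bit set on all but the last.
function encodeVarint(value: number): number[] {
  const out: number[] = [];
  let v = value >>> 0;
  while (v >= 0x80) {
    out.push((v & 0x7f) | 0x80);
    v >>>= 7;
  }
  out.push(v);
  return out;
}

// Fixed32 encoding: always exactly 4 bytes, little-endian.
function encodeFixed32(value: number): number[] {
  const v = value >>> 0;
  return [v & 0xff, (v >>> 8) & 0xff, (v >>> 16) & 0xff, (v >>> 24) & 0xff];
}
```

For example, the value 300 is two bytes as a varint but four bytes as fixed32, which is why the decoder must agree with the OTLP v1.1.0+ schema.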

Alternatives considered

  • External OTel backends (Jaeger, Tempo, SigNoz): Adds operational complexity and external dependencies. A built-in solution provides tighter integration with Temps' auth, projects, and deployments model.
  • Server-side tail sampling: Initially implemented at 1% base rate but removed — sampling should be the client SDK's responsibility (head-based) since the server should store everything it receives.

Additional context

  • The temps-otel crate follows the existing plugin architecture (TempsPlugin trait) and is registered at position 9.7 in console.rs
  • 117+ unit tests covering ingest, decode, auth, rate limiting, error handling, and query paths
  • Zero clippy warnings
  • Client SDKs configure sampling via standard OTel SDK settings (e.g., TraceIdRatioBased sampler)
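To make the head-based sampling decision concrete, here is a simplified illustration of trace-ID-ratio sampling as a client SDK would apply it. This is not the exact OTel SDK algorithm, just the core idea: derive a deterministic value from the trace ID and compare it against the ratio, so every participant sampling the same trace makes the same decision:

```typescript
// Simplified trace-ID-ratio sampling sketch (hypothetical, not the OTel SDK).

function shouldSample(traceIdHex: string, ratio: number): boolean {
  // Treat the low 32 bits of the trace ID as a uniform random value.
  const low = parseInt(traceIdHex.slice(-8), 16) >>> 0;
  return low < ratio * 0x100000000;
}
```

Because the decision is a pure function of the trace ID, the server can store everything it receives and still end up with consistent, whole traces.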
kerem 2026-03-02 05:12:20 +03:00
Author
Owner

@dviejokfs commented on GitHub (Feb 28, 2026):

Closed by #18
