Phase 8 Monitoring Infrastructure
Comprehensive Observability Stack for Email Client
Last Updated: 2026-01-24
Overview
This directory contains a production-ready monitoring infrastructure for the Phase 8 Email Client implementation. It provides:
- Metrics Collection: Prometheus scraping all services
- Log Aggregation: ELK Stack (Elasticsearch, Logstash, Kibana)
- Visualization: Grafana dashboards
- Distributed Tracing: OpenTelemetry + Jaeger
- Alerting: Prometheus AlertManager with multi-channel notifications
- Health Checks: Standardized endpoints for all services
Directory Structure
monitoring/
├── prometheus/ # Prometheus configuration
│ ├── prometheus.yml # Main scrape config
│ └── alert_rules.yml # Alert rule definitions
├── grafana/ # Grafana configuration
│ ├── provisioning/
│ │ ├── datasources.yml # Data source definitions
│ │ └── dashboards.yml # Dashboard provisioning
│ └── dashboards/
│ ├── email-service-overview.json
│ └── system-resources.json
├── alertmanager/ # Alert routing & notifications
│ └── alertmanager.yml
├── logstash/ # Log processing pipelines
│ ├── logstash.conf
│ └── email-service-template.json
├── opentelemetry/ # Distributed tracing
│ └── otel-collector-config.yml
├── loki/ # Log aggregation (optional)
│ └── loki-config.yml
├── docker-compose.monitoring.yml # Full monitoring stack
├── HEALTH_ENDPOINTS.md # Health check specifications
├── PERFORMANCE_BASELINES.md # Expected metrics & thresholds
└── README.md # This file
Quick Start
1. Start Monitoring Stack
# From project root
cd deployment/monitoring
# Create the external network if it doesn't already exist
docker network create emailclient-net || true
# Start monitoring services
docker-compose -f docker-compose.monitoring.yml up -d
# Verify all services are running
docker-compose -f docker-compose.monitoring.yml ps
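For orientation, every service in the compose file attaches to the shared emailclient-net network created above. A minimal sketch of what the Prometheus and Grafana service definitions might look like (image tags, ports, and volume paths here are illustrative, not the actual file contents):
# Illustrative excerpt -- see docker-compose.monitoring.yml for the real definitions
services:
  prometheus:
    image: prom/prometheus:latest
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - ./prometheus/alert_rules.yml:/etc/prometheus/alert_rules.yml:ro
    ports:
      - "9090:9090"
    networks:
      - emailclient-net
  grafana:
    image: grafana/grafana:latest
    volumes:
      - ./grafana/provisioning:/etc/grafana/provisioning:ro
      - ./grafana/dashboards:/var/lib/grafana/dashboards:ro
    ports:
      - "3000:3000"
    networks:
      - emailclient-net
networks:
  emailclient-net:
    external: true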
2. Access Web UIs
| Service | URL | Default Credentials |
|---|---|---|
| Grafana | http://localhost:3000 | admin/admin |
| Prometheus | http://localhost:9090 | - |
| Alertmanager | http://localhost:9093 | - |
| Kibana | http://localhost:5601 | elastic/changeme |
| Jaeger | http://localhost:16686 | - |
3. Configure Notifications
Edit alertmanager/alertmanager.yml with your notification settings:
# Slack notifications
global:
  slack_api_url: https://hooks.slack.com/services/YOUR/WEBHOOK/URL

# PagerDuty for critical alerts
receivers:
  - name: critical-ops
    pagerduty_configs:
      - service_key: YOUR_SERVICE_KEY
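Which receiver an alert reaches is decided by the routing tree in the same file. A minimal sketch, assuming the alert rules attach a severity label (the label and receiver names here are illustrative):
route:
  receiver: default              # fallback receiver; must also be defined under receivers
  group_by: [alertname, service]
  routes:
    - match:
        severity: critical       # assumes rules set severity: critical
      receiver: critical-ops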
4. Create Custom Dashboards
Grafana dashboards are provisioned from grafana/dashboards/.
To add a new dashboard:
- Design it in the Grafana UI
- Export it as JSON
- Save it to grafana/dashboards/
- Restart Grafana to auto-load it
Key Features
Metrics & Monitoring
Prometheus collects metrics from:
- Email Service (Flask + Gunicorn)
- Celery workers & beat scheduler
- PostgreSQL database
- Redis cache/broker
- Postfix SMTP server
- Dovecot IMAP/POP3 server
- Docker containers (cAdvisor)
- System resources (Node Exporter)
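Each target above corresponds to a scrape job in prometheus/prometheus.yml. Two jobs sketched for illustration (job names, hostnames, and ports are assumptions; check the actual file):
# Illustrative scrape jobs -- adjust targets to your compose service names
scrape_configs:
  - job_name: email-service
    metrics_path: /metrics
    static_configs:
      - targets: ["email-service:5000"]
  - job_name: node-exporter
    static_configs:
      - targets: ["node-exporter:9100"]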
Key metrics tracked:
- Request latency (P50, P95, P99)
- Error rates (4xx, 5xx)
- Queue depth (Celery tasks)
- Database connections & queries
- Cache hit rates & memory usage
- Mail queue depth (Postfix)
- IMAP/POP3 active sessions
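These can be explored directly in the Prometheus UI. For example, assuming the Flask app exports a standard http_request_duration_seconds histogram and an http_requests_total counter (metric names are assumptions; substitute whatever your instrumentation actually exports):
# P99 request latency over the last 5 minutes
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))

# 5xx responses as a fraction of all requests
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))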
Alert Rules
Infrastructure Alerts:
- CPU usage > 80% for 5 minutes
- Memory usage > 85% for 5 minutes
- Disk usage > 85% for 10 minutes
- Out of memory (< 5% available)
Database Alerts:
- Connection pool > 80% capacity
- Replication lag > 10 seconds
- Long-running queries (> 5 minutes)
- Database size > 5 GB
Cache Alerts:
- Redis memory > 85% of max
- Client connections > 1000
- Key evictions detected
- Command latency P99 > 100ms
Application Alerts:
- Email Service down for 2 minutes
- Error rate > 5% for 3 minutes
- Request latency P99 > 2 seconds
- DB connection pool > 80%
- IMAP pool > 90% utilized
- SMTP pool > 90% utilized
Task Queue Alerts:
- Queue backup > 1000 pending tasks
- Task failure rate > 10% for 5 minutes
- Worker offline
- Celery Beat not running
Email Server Alerts:
- Postfix down
- Mail queue > 100 messages
- Dovecot down
- IMAP connection errors elevated
Log Aggregation
ELK Stack components:
- Elasticsearch: Central log storage & search
- Logstash: Parse, enrich, and forward logs
- Kibana: Log search and analysis interface
Log sources:
- Email Service application logs
- Celery task execution logs
- PostgreSQL query logs
- Postfix SMTP logs
- Dovecot IMAP/POP3 logs
- Container stdout/stderr
- Syslog messages
Log indexing: logs are automatically indexed by source type and date:
- logs-email-service-YYYY.MM.dd
- logs-celery-worker-YYYY.MM.dd
- logs-postgresql-YYYY.MM.dd
- logs-postfix-YYYY.MM.dd
- logs-dovecot-YYYY.MM.dd
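The index names are produced by the Logstash output stage. A sketch of what that might look like in logstash.conf, assuming earlier pipeline stages set a service field on each event:
output {
  elasticsearch {
    hosts => ["elasticsearch:9200"]
    # assumes filters populate [service] (e.g. "email-service", "celery-worker")
    index => "logs-%{[service]}-%{+YYYY.MM.dd}"
  }
}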
Distributed Tracing
OpenTelemetry Collector receives traces from:
- OTLP protocol (gRPC + HTTP)
- Jaeger format
- Zipkin format
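In otel-collector-config.yml these map to a receivers block roughly like the following (the ports shown are the protocol defaults; verify against the actual file):
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
      http:
        endpoint: 0.0.0.0:4318
  jaeger:
    protocols:
      grpc:
        endpoint: 0.0.0.0:14250
  zipkin:
    endpoint: 0.0.0.0:9411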
Jaeger features:
- Trace visualization across service boundaries
- Latency analysis by service
- Error tracking with full stack traces
- Dependency graph analysis
Health Checks
Each service exposes a health endpoint (the hostnames below resolve on the emailclient-net Docker network):
curl http://email-service:5000/health | jq
curl http://prometheus:9090/-/healthy
curl http://elasticsearch:9200/_cluster/health
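For illustration, a healthy email-service response might look like the following; the exact payload shape is an assumption, not the actual contract:
{
  "status": "healthy",
  "checks": {
    "database": "ok",
    "redis": "ok",
    "imap": "ok",
    "smtp": "ok"
  }
}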
See HEALTH_ENDPOINTS.md for detailed specifications.
Configuration
Environment Variables
Create a .env file in the deployment/ directory:
# Elasticsearch
ELASTICSEARCH_PASSWORD=your-secure-password
# Grafana
GRAFANA_USER=admin
GRAFANA_PASSWORD=your-secure-password
# Notifications
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
PAGERDUTY_SERVICE_KEY=...
OPSGENIE_API_KEY=...
EMAIL_PASSWORD=...
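The compose file picks these up through variable interpolation; docker-compose reads a .env file from the directory it is invoked in (or one passed with --env-file). A sketch of how the Grafana credentials might be wired (the mapping is illustrative, though GF_SECURITY_ADMIN_USER/PASSWORD are Grafana's standard settings):
grafana:
  image: grafana/grafana:latest
  environment:
    - GF_SECURITY_ADMIN_USER=${GRAFANA_USER}
    - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASSWORD}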
Customizing Alert Rules
Edit prometheus/alert_rules.yml to:
- Adjust thresholds:
  - alert: HighCPUUsage
    expr: (100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)) > 80
    for: 5m
- Add a new alert:
  - alert: MyCustomAlert
    expr: your_metric > threshold
    for: 5m
    annotations:
      summary: "Description"
- Reload alerts (the /-/reload endpoint only works when Prometheus is started with --web.enable-lifecycle):
  curl -X POST http://localhost:9090/-/reload
Customizing Dashboards
Grafana provisioning files:
- grafana/provisioning/datasources.yml - data source configuration
- grafana/provisioning/dashboards.yml - dashboard location config
Dashboard JSON files:
- grafana/dashboards/email-service-overview.json
- grafana/dashboards/system-resources.json
To add a new dashboard:
- Create it in the Grafana UI
- Export the JSON
- Save it to grafana/dashboards/
- Restart Grafana
Performance Baselines
See PERFORMANCE_BASELINES.md for:
- Expected response times
- Normal resource usage
- Typical error rates
- Queue depth patterns
- Database query times
- Cache hit rates
Troubleshooting
Prometheus Not Scraping Metrics
- Check the service is running: docker-compose ps
- Verify targets at http://localhost:9090/targets
- Check prometheus.yml for syntax errors: docker-compose logs prometheus (promtool check config prometheus/prometheus.yml validates the file directly)
- Ensure network connectivity: docker exec emailclient-prometheus curl http://email-service:5000/metrics
Kibana Not Showing Logs
- Check Elasticsearch is running: http://localhost:9200/_cluster/health
- Verify Logstash is processing: docker-compose logs logstash
- Check indices were created: curl http://localhost:9200/_cat/indices
- Inspect Logstash config errors: docker-compose logs logstash | grep -i error
Alerts Not Firing
- Check rule syntax: http://localhost:9090/alerts
- Verify data is being scraped: http://localhost:9090/graph
- Check alert evaluation: Look at Prometheus logs
- Check that Alertmanager received and routed the alert: docker-compose logs alertmanager
High Memory Usage
- Check Elasticsearch heap: curl http://localhost:9200/_nodes/stats | jq '.nodes[].jvm.mem.heap_used_in_bytes'
- Reduce retention: Prometheus retention is controlled by the --storage.tsdb.retention.time flag on the Prometheus container (not by prometheus.yml); lower it in docker-compose.monitoring.yml
- Scale horizontally: add more Logstash instances
- Monitor with the system-resources dashboard: http://localhost:3000/d/system-resources
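For reference, the retention flag lives on the Prometheus container's command line, roughly like this (an illustrative sketch, not the actual compose file):
prometheus:
  image: prom/prometheus:latest
  command:
    - --config.file=/etc/prometheus/prometheus.yml
    - --storage.tsdb.retention.time=15d   # lower to reduce disk and memory pressure
    - --web.enable-lifecycle              # enables the /-/reload endpoint used above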
Scaling Considerations
Single Node (Development)
The current setup is suitable for:
- < 10 services
- < 1000 metrics per service
- < 100 GB logs per day
- < 100 events per second
Multi-Node (Production)
For production scaling:
- Separate Elasticsearch cluster (replace single-node discovery with seed hosts; node names here are illustrative):
  elasticsearch:
    cluster.name: emailclient
    node.roles: [data, master]
    discovery.seed_hosts: ["es-node-1", "es-node-2"]
- Add more Logstash workers:
  docker-compose -f docker-compose.monitoring.yml up -d --scale logstash=3
- Use remote Prometheus storage via remote_write in prometheus.yml:
  remote_write:
    - url: http://victoriametrics:8428/api/v1/write
- Separate monitoring network:
  - Keep monitoring on an isolated network
  - Use a VPN for remote monitoring
Maintenance
Weekly Tasks
- Review alert firing patterns in Alertmanager
- Check disk usage on monitoring nodes
- Verify log ingestion rate in Kibana
- Spot-check dashboard accuracy
Monthly Tasks
- Review and optimize alert thresholds
- Archive old logs to cold storage
- Update Prometheus retention policies
- Performance analysis & optimization
Quarterly Tasks
- Full system backup (Elasticsearch indices, Grafana dashboards)
- Security audit of alerting credentials
- Capacity planning for next quarter
- Major version updates for monitoring stack
Security
Authentication
Grafana:
- Change default admin password
- Enable LDAP/SAML integration for enterprise
Elasticsearch:
- Use strong password (not shown in configs)
- Enable TLS/SSL for inter-node communication
- Restrict Kibana access via reverse proxy
Network Security
- Keep monitoring on isolated network
- Use firewall rules to restrict access
- No public internet access to monitoring UIs
- VPN for remote access
Credentials Management
# Store sensitive values in .env
ELASTICSEARCH_PASSWORD=generated-secure-password
SLACK_WEBHOOK_URL=https://hooks.slack.com/services/...
# Never commit .env to git
echo ".env" >> .gitignore
Backup & Recovery
Daily Backup
# Register a snapshot repository (one-time; the location must be listed in path.repo)
curl -X PUT "localhost:9200/_snapshot/backup" -H 'Content-Type: application/json' -d'
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backup/elasticsearch"
  }
}'
# Take a snapshot of the indices
curl -X PUT "localhost:9200/_snapshot/backup/snapshot-$(date +%Y.%m.%d)?wait_for_completion=true"
# Backup Grafana state: provisioned dashboards already live in grafana/dashboards/;
# grafana.db holds users and anything created in the UI
docker cp emailclient-grafana:/var/lib/grafana/grafana.db ./grafana-backup.db
Restore from Backup
# Restore Elasticsearch indices from a snapshot
curl -X POST "localhost:9200/_snapshot/backup/my-snapshot/_restore"
# Restore Grafana state, then restart so it is picked up
docker cp ./grafana-backup.db emailclient-grafana:/var/lib/grafana/grafana.db
docker restart emailclient-grafana
Documentation References
- Health Endpoints - Service health check specs
- Performance Baselines - Expected metrics & thresholds
- Prometheus Documentation
- Grafana Documentation
- ELK Stack Guide
- OpenTelemetry Guide
Support & Issues
Common Issues:
- Service discovery not finding services → Check DNS/hostnames in prometheus.yml
- High cardinality metrics → Add metric relabel rules to drop high-cardinality labels
- Slow Elasticsearch → Reduce shard count or add more nodes
- Missing logs → Verify Logstash pipeline config and input sources
Getting Help:
- Check service logs: docker-compose logs [service-name]
- Review alert history in Alertmanager: http://localhost:9093
- Check Prometheus targets: http://localhost:9090/targets
- Query metrics directly: http://localhost:9090/graph