Dogeek

joined 2 years ago
[–] Dogeek@alien.top 1 points 2 years ago

Oh lord, I have so much info to give ! For the setup, it's running on kubernetes 1.28.2, so YMMV. My monitoring stack is :

  • Grafana -- Dashboards
  • Alertmanager -- Alerting
  • Prometheus -- Time series Database
  • Loki -- Logs database
  • Promtail -- Log collector
  • Mimir -- Long term metrics&logs storage
  • Tempo -- Datadog APM, but with Grafana, allows you to track requests through a network of services, invaluable to link your reverse proxy, to your apps, to your SSO to your database...
  • SMTP Relay -- A homemade SMTP relay that eases setting up mail alerts, allows me to push mail through mailjet using my domain
  • Node-exporter -- exports metrics for the server
  • Exportarr -- exports metrics for sonarr/radarr etc
  • pihole-exporter -- exports pihole metrics for prometheus scraping
  • smart-exporter -- exports S.M.A.R.T metrics (for HDD health)
  • ntfy -- for notifications to my phone (other than mail)

The rest is pretty much the same, if the service exports prometheus metrics by default, I use that, and write a ServiceMonitor and a Service manifest for that, it usually looks like that

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: traefik
  labels:
    app.kubernetes.io/component: traefik
    app.kubernetes.io/instance: traefik
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: traefik
    app.kubernetes.io/part-of: traefik
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: traefik-metrics
  endpoints:
  - port: metrics
    interval: 30s
    path: /metrics
    scheme: http
    tlsConfig:
      insecureSkipVerify: true
  namespaceSelector:
    matchNames:
    - traefik

apiVersion: v1
kind: Service
metadata:
  name: traefik-metrics
  namespace: traefik
  labels:
    app.kubernetes.io/name: traefik-metrics
spec:
  type: ClusterIP
  ports:
    - protocol: TCP
      name: metrics
      port: 8082
  selector:
    app.kubernetes.io/name: traefik

If the app doesn't include a prometheus endpoint, I just find an existing exporter for that app, most popular ones have that, and ready made grafana dashboards.

For alerting, I create PrometheusRule object with the prometheus query and the message to alert me (depending on the severity, it's either a mail for med-low severity incidents, phone notification for high sev). I try to keep mails / notifications to a minimum, just alerts on load, CPU, RAM, and potential SMART errors as well give me alerts.