Oh lord, I have so much info to give! For the setup, it's running on Kubernetes 1.28.2, so YMMV. My monitoring stack is:
Grafana -- Dashboards
Alertmanager -- Alerting
Prometheus -- Time series Database
Loki -- Logs database
Promtail -- Log collector
Mimir -- Long-term metrics storage (logs stay in Loki)
Tempo -- Distributed tracing, like Datadog APM but with Grafana. It lets you track requests through a network of services, which is invaluable for linking your reverse proxy to your apps, to your SSO, to your database...
SMTP Relay -- A homemade SMTP relay that makes mail alerts easier to set up and lets me send mail through Mailjet using my own domain
Node-exporter -- Exports metrics for the server
Exportarr -- Exports metrics for Sonarr/Radarr etc.
pihole-exporter -- Exports Pi-hole metrics for Prometheus scraping
ntfy -- for notifications to my phone (other than mail)
The rest is pretty much the same: if the service exports Prometheus metrics by default, I use that and write a ServiceMonitor and a Service manifest for it.
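The actual manifests are not included in this post. For illustration, here is a minimal sketch of such a Service and ServiceMonitor pair, assuming the Prometheus Operator CRDs are installed in the cluster. The names, namespace, port, and the `release` label are hypothetical placeholders, not values from this setup.

```yaml
# Hypothetical Service exposing an app's metrics port.
apiVersion: v1
kind: Service
metadata:
  name: example-app            # placeholder name
  namespace: monitoring        # placeholder namespace
  labels:
    app.kubernetes.io/name: example-app
spec:
  selector:
    app.kubernetes.io/name: example-app   # must match the pod labels
  ports:
    - name: metrics            # the ServiceMonitor refers to this port name
      port: 9100               # placeholder port
      targetPort: 9100
---
# ServiceMonitor telling the Prometheus Operator to scrape that Service.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: example-app
  namespace: monitoring
  labels:
    release: kube-prometheus-stack   # assumption: must match your Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: example-app  # selects the Service above by label
  endpoints:
    - port: metrics            # port name from the Service
      path: /metrics           # common default; some apps use another path
      interval: 30s
```

The ServiceMonitor selects the Service by label, not by name, so the labels on both objects have to line up. Whether Prometheus picks the ServiceMonitor up at all depends on the selector configured on the Prometheus resource.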
If the app doesn't include a Prometheus endpoint, I just find an existing exporter for it. Most popular apps have one, along with ready-made Grafana dashboards.
For alerting, I create a PrometheusRule object with the Prometheus query and the alert message. Depending on the severity, it's either a mail (medium/low-severity incidents) or a phone notification (high severity). I try to keep mails and notifications to a minimum: I only alert on load, CPU, RAM, and potential SMART errors.
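As a rough sketch of that kind of rule (not the actual rules from this setup), a PrometheusRule with one lower-severity and one high-severity alert could look like the following. It assumes the Prometheus Operator and node-exporter metrics. The thresholds, names, `release` label, and `severity` values are illustrative assumptions.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts                  # placeholder name
  namespace: monitoring              # placeholder namespace
  labels:
    release: kube-prometheus-stack   # assumption: must match your Prometheus ruleSelector
spec:
  groups:
    - name: node.rules
      rules:
        # Lower severity: sustained high CPU usage (would go to mail).
        - alert: HighCpuUsage
          expr: 100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m]))) > 90
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "CPU usage above 90% on {{ $labels.instance }}"
        # High severity: almost no memory available (would go to the phone).
        - alert: LowMemoryAvailable
          expr: (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100 < 5
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Less than 5% memory available on {{ $labels.instance }}"
```

The `severity` label is what Alertmanager routing can match on to pick a receiver, for example a mail receiver for `warning` and a webhook receiver pointing at ntfy for `critical`. The `for` durations keep short spikes from firing, which helps keep notification volume down.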