I use sar for historical data, my own scripts running under cron on the hosts for specific things I'm interested in keeping an eye on, and my own scripts under cron on my monitoring machines for alerting me when something's wrong. I don't use a dashboard.
Zabbix. Also, for Windows it could be Rainmeter (https://www.rainmeter.net/) or HWiNFO (https://www.hwinfo.com/). For Linux, Conky.
If nobody complains, everything is fine.
I mostly run music bots and game servers, so even if something fails it's nothing really critical.
When I'm at home I usually ssh into my main host machine and have btop running on my second monitor. It shows me the processes, RAM, CPU, network, and disk space. Oh yeah, and load averages. It also looks super pretty and supports themes :)
We use Zabbix here. Zabbix is amazing, and we bake it into all of our templates, so any new servers and hosts pop up on the Zabbix dashboard preconfigured, just like that. For logs and security we use an Elastic "ELK" stack, which gives us a heads-up if anything is wrong in the logs, while Zabbix gives us a heads-up on overall system health. Between the two, our health monitor panel combines both windows so we can see full server health and any problems right there, as a to-do list for the IT team.
I use Telegraf + InfluxDB + Grafana for monitoring my home network and systems. Grafana has a learning curve for building panels and dashboards, but is incredibly flexible. I use it for more than server performance. I have a dual-monitor "kiosk" (old Mac mini) in my office displaying two Grafana dashboards. These are:
Network/Power/Storage showing:
- firewall block events & sources for last 12 hrs (from pfSense via Elasticsearch),
- current UPS statuses and power usage for last 12 hrs (Telegraf apcupsd plugin -> InfluxDB),
- WAN traffic for last 12 hrs (from pfSense via Telegraf -> InfluxDB),
- current DHCP clients (custom Python script -> MySQL), and
- current drive and RAID pool health (custom Python scripts -> MySQL)
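The "custom Python script -> MySQL" pieces above boil down to a cron job that upserts scraped rows into a table a Grafana SQL panel can query. A minimal sketch of that shape, using sqlite3 in place of the poster's MySQL so it runs standalone (the lease fields and values are made up):

```python
import sqlite3

# Hypothetical lease records such as a pfSense scrape might produce;
# field names and values are illustrative, not the poster's schema.
leases = [
    {"ip": "192.168.1.50", "mac": "aa:bb:cc:dd:ee:01", "hostname": "nas", "type": "static"},
    {"ip": "192.168.1.101", "mac": "aa:bb:cc:dd:ee:02", "hostname": "laptop", "type": "dynamic"},
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dhcp_clients (ip TEXT, mac TEXT PRIMARY KEY, hostname TEXT, type TEXT)"
)
# Upsert on MAC so repeated cron runs refresh rows rather than duplicate them
conn.executemany(
    "INSERT INTO dhcp_clients (ip, mac, hostname, type) "
    "VALUES (:ip, :mac, :hostname, :type) "
    "ON CONFLICT(mac) DO UPDATE SET "
    "ip = excluded.ip, hostname = excluded.hostname, type = excluded.type",
    leases,
)
conn.commit()

# A Grafana SQL datasource would run a query like this to fill the panel
rows = conn.execute(
    "SELECT hostname, ip, type FROM dhcp_clients ORDER BY hostname"
).fetchall()
for hostname, ip, lease_type in rows:
    print(f"{hostname:10} {ip:15} {lease_type}")
```

The primary key on MAC is what makes the script idempotent across cron runs.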
Server sensors and performance showing:
- current status of important cron jobs (using Healthchecks -> Prometheus),
- current server CPU usage and temps, and memory usage (Telegraf -> InfluxDB)
- server host CPU usage and temps, and memory usage for last 3 hrs (Telegraf -> InfluxDB)
- Proxmox VM CPU and memory usage for last 3 hrs (Proxmox -> InfluxDB)
- Docker container CPU and memory usage for last 3 hrs (Telegraf Docker plugin -> InfluxDB)
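For the Healthchecks-backed cron status, the pattern is a wrapper that runs the job and then GETs the check's ping URL, appending /fail on a non-zero exit. A sketch, with a placeholder UUID (run_and_report is a hypothetical helper name, not from the original post):

```python
import subprocess
import urllib.request

# Healthchecks pings are plain HTTP GETs: hit the check's UUID URL on
# success, and the same URL with "/fail" appended on failure.
# The UUID below is a made-up placeholder, not a real check.
PING_BASE = "https://hc-ping.com"
CHECK_UUID = "00000000-0000-0000-0000-000000000000"

def ping_url(exit_code: int) -> str:
    """Build the Healthchecks URL to report a job's exit status."""
    url = f"{PING_BASE}/{CHECK_UUID}"
    return url if exit_code == 0 else f"{url}/fail"

def run_and_report(cmd: list) -> int:
    """Run a cron job's command and report the result to Healthchecks."""
    exit_code = subprocess.run(cmd).returncode
    try:
        urllib.request.urlopen(ping_url(exit_code), timeout=10)
    except OSError:
        pass  # never let a monitoring hiccup mask the job's own status
    return exit_code

# The URLs a wrapper like this would hit:
print(ping_url(0))  # success ping
print(ping_url(1))  # failure ping ("/fail" suffix)
```

Prometheus then scrapes the check statuses from the Healthchecks side rather than from the jobs themselves.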
Netdata works really well for system performance for Linux and can be installed from the default repositories of major distributions.
> Network/Power/Storage
Pretty cool dashboards. I liked the DHCP clients info, does it also report DHCP reservations?
Where do you do DHCP, on the PFSense or somewhere else?
> does it also report DHCP reservations?
Thanks, and yes: entries with Type "static" are DHCP reservations.
> Where do you do DHCP, on the PFSense or somewhere else?
Yes, on pfSense. I use the Python function from pletch/scrape_pfsense_dhcp_leases.py (on GitHub) that scrapes the pfSense status_dhcp_leases.php page. Then I added my own function for querying my TP-Link APs over SNMP to determine which AP each wireless DHCP client is connected to.
I can throw the script up on Dropbox if you are interested. I am mediocre at writing Python, so it is pretty specific to my environment.
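The SNMP lookup side reduces to: fetch each AP's wireless association table, then see which table contains the client's MAC. A sketch of just that mapping step, with hard-coded stand-ins for the SNMP results (AP names and MACs are invented for illustration):

```python
# Stand-ins for per-AP association tables as fetched over SNMP;
# in the real script these would come from querying each AP.
ap_associations = {
    "ap-livingroom": {"aa:bb:cc:dd:ee:02", "aa:bb:cc:dd:ee:03"},
    "ap-office": {"aa:bb:cc:dd:ee:04"},
}

def ap_for_client(mac: str):
    """Return the AP a client MAC is associated with, or None."""
    mac = mac.lower()  # normalize: SNMP and DHCP sources may differ in case
    for ap_name, macs in ap_associations.items():
        if mac in macs:
            return ap_name
    return None  # wired client, or not currently associated

print(ap_for_client("AA:BB:CC:DD:EE:04"))  # -> ap-office
print(ap_for_client("aa:bb:cc:dd:ee:99"))  # -> None
```

Joining this result onto the scraped lease table is what lets the dashboard show which AP each wireless client sits on.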
I use Home Assistant already. They have an integration for Glances. I guess all I'm interested in is CPU temp and load. Any change = something's up.
I get ahead of it by buying extra.
Need 16 GB of RAM and 8 cores? Well, let me add 64 GB and a 12-core CPU to my cart.
Hasn't failed me.
CheckMK for general monitoring, Grafana/Prometheus for the Proxmox cluster, Wazuh for IDS purposes, and Uptime Kuma for general uptime on services. It's not like it's all necessary, but it's nice to tinker in my homelab before implementing the same services at a "professional level" at work.
My HomeAssistant is stable, so wifey is not being used as a monitor ;-)
I use btop. I use Arch, btw.
I don't track their performance, I just track if they're up or down.
I use uptimekuma running on a free tier of fly.io so I can tell if my cluster had a catastrophic failure. There's no point in the alerting system running on the same system.
I don't find it valuable, so I don't. (Maybe I run top as needed.)
Nagios for service/QoS, Grafana for dashboarding some more specific items. Planning on eventually switching to Zabbix, but Nagios is so simple that I have a hard time justifying moving over 400 monitored services to it.
Observium.
If it's just one server, Netdata is a better option.
Zabbix for hardware and certificate monitoring.
Prometheus for service monitoring (e.g. how many people are actually using my Jellyfin server, so I know if I need to scale, etc.).
+1 for PRTG.
If it's down, I assume performance is bad.
I came across monit recently; seems nice.
I literally tried them all. Nagios is the best one.
Quick checks: Proxmox dashboard, htop or glances, Portainer
Extensive monitoring: Prometheus (node-exporter), Rsyslog server, Loki, Grafana, Uptime Kuma, Alertmanager (via Gotify)
Uptime Kuma for my services; Netdata + Prometheus + Grafana for server health (alerts and visualization).
Girlfriend first, Alertmanager second. Girlfriend is usually faster.
Prometheus and Grafana.
Oh lord, I have so much info to give! For context, it's all running on Kubernetes 1.28.2, so YMMV. My monitoring stack is:
- Grafana -- Dashboards
- Alertmanager -- Alerting
- Prometheus -- Time series Database
- Loki -- Logs database
- Promtail -- Log collector
- Mimir -- Long-term metrics & logs storage
- Tempo -- Datadog APM, but with Grafana; lets you track requests through a network of services. Invaluable for linking your reverse proxy to your apps, to your SSO, to your database...
- SMTP Relay -- A homemade SMTP relay that eases setting up mail alerts; lets me send mail through Mailjet using my domain
- Node-exporter -- exports metrics for the server
- Exportarr -- exports metrics for sonarr/radarr etc
- pihole-exporter -- exports pihole metrics for prometheus scraping
- smart-exporter -- exports S.M.A.R.T metrics (for HDD health)
- ntfy -- for notifications to my phone (other than mail)
The rest is pretty much the same: if the service exports Prometheus metrics by default, I use that and write a ServiceMonitor and a Service manifest for it. It usually looks like this:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: traefik
  labels:
    app.kubernetes.io/component: traefik
    app.kubernetes.io/instance: traefik
    app.kubernetes.io/managed-by: kustomize
    app.kubernetes.io/name: traefik
    app.kubernetes.io/part-of: traefik
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: traefik-metrics
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
      scheme: http
      tlsConfig:
        insecureSkipVerify: true
  namespaceSelector:
    matchNames:
      - traefik
```
```yaml
apiVersion: v1
kind: Service
metadata:
  name: traefik-metrics
  namespace: traefik
  labels:
    app.kubernetes.io/name: traefik-metrics
spec:
  type: ClusterIP
  ports:
    - protocol: TCP
      name: metrics
      port: 8082
  selector:
    app.kubernetes.io/name: traefik
```
If the app doesn't include a Prometheus endpoint, I just find an existing exporter for it; most popular apps have one, plus ready-made Grafana dashboards.
For alerting, I create a PrometheusRule object with the Prometheus query and the message to alert me (depending on severity, it's either a mail for medium/low-severity incidents or a phone notification for high severity). I try to keep mails/notifications to a minimum: just alerts on load, CPU, RAM, and potential SMART errors.
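For reference, a PrometheusRule along those lines might look like the sketch below; the alert name, threshold, and label values here are illustrative, not the poster's actual rules:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: node-alerts          # hypothetical name
  namespace: monitoring
  labels:
    release: prometheus      # must match your Prometheus ruleSelector
spec:
  groups:
    - name: node.rules
      rules:
        - alert: HighMemoryUsage
          # fires when less than 10% of memory stays available for 10 minutes
          expr: node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes < 0.10
          for: 10m
          labels:
            severity: warning    # Alertmanager routes this severity to mail
          annotations:
            summary: "High memory usage on {{ $labels.instance }}"
```

The severity label is what Alertmanager's routing tree matches on to decide between mail and phone notification.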
I use Netdata for both dashboards and alerts. Works great and is easy to set up.
It's not well liked, but I use Nagios Core for alerts, and jump to Grafana, which has data in Prometheus, InfluxDB, and MySQL backends, for trends like CPU usage, hard drive temps, etc.