# Production-Ready Monitoring Setup

This guide describes how to deploy a Prometheus and Grafana monitoring stack on a k3s + Proxmox homelab using the `kube-prometheus-stack` Helm chart. The Helm chart values are stored in `gitops/clusters/homelab/infrastructure/monitoring/monitoring-values.yaml`.

## 1. Deploy Prometheus & Grafana

Install the `kube-prometheus-stack` chart and expose Grafana on NodePort `30080`:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prom-stack prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace
```

## 2. Enable Persistence with local-path

Prometheus generates a heavy write load, so running it on distributed storage such as Longhorn is discouraged. Use node-local storage instead. Grafana can remain on Longhorn because its write requirements are much lighter.

Update `gitops/clusters/homelab/infrastructure/monitoring/monitoring-values.yaml` so that only Prometheus claims `local-path` storage:

```yaml
# gitops/clusters/homelab/infrastructure/monitoring/monitoring-values.yaml
grafana:
  persistence:
    enabled: true
    accessModes:
      - ReadWriteOnce
    size: 10Gi
    storageClassName: longhorn
prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: local-path
          resources:
            requests:
              storage: 100Gi
```

Apply the chart with these values to persist data locally.

## 3. Install and Configure Exporters

**Node Exporter**

- Run as a systemd service on Proxmox hosts and as a DaemonSet on k3s.
- Metrics are available on port `9100`.

**Proxmox PVE Exporter**

- Install via `pipx` to `/usr/local/bin/pve_exporter`.
- Configure `/etc/pve_exporter/config.yaml`:

  ```yaml
  default:
    user: root@pam
    token_name: prometheus
    token_value:        # left blank here; set to the API token secret
    verify_ssl: false
  ```

- Start with systemd using `--config.file` and expose metrics on `9221` at path `/pve`.

## 4. Configure Prometheus Scrape Jobs

Add the exporters to `additionalScrapeConfigs` (under `prometheus.prometheusSpec`) in `monitoring-values.yaml`:

```yaml
- job_name: proxmox-node-exporter
  static_configs:
    - targets:
        - 192.168.4.122:9100
        - …
- job_name: proxmox-pve-exporter
  metrics_path: /pve
  static_configs:
    - targets:
        - 192.168.4.122:9221
```

Update the release:

```bash
helm upgrade prom-stack prometheus-community/kube-prometheus-stack \
  -n monitoring -f monitoring-values.yaml
```

## 5. Import Grafana Dashboards

- **Node Exporter Full** (1860)
- **SMART / Disk Health** (13654)
- **Proxmox via Prometheus** (10347)
- (Optional) **Proxmox VE Node** (10048) and **Cluster Summary** (10049)

## 6. Set Up Alerting

Define PrometheusRules in `monitoring-values.yaml`:

```yaml
additionalPrometheusRules:
  - name: host-alerts
    groups:
      - name: disk-usage
        rules:
          - alert: HighDiskUsage
            expr: (pve_disk_usage_bytes{id=~"storage/.+"} / pve_disk_size_bytes{id=~"storage/.+"}) > 0.80
            for: 10m
      - name: cpu-temp
        rules:
          - alert: HighCPUTemperature
            expr: node_hwmon_temp_celsius{sensor="temp1"} > 85
            for: 5m
```

## 7. GPU Monitoring (optional)

- Deploy NVIDIA DCGM Exporter as a DaemonSet (port `9400`).
- Create a `ServiceMonitor` for `app=dcgm-exporter`.
- Import the NVIDIA GPU dashboard (12256).

## 8. Next Steps

- Configure notification channels (Slack, email, PagerDuty).
- Secure access via ingress with TLS or Tailscale.
- Add nightly drift checks via CI/cron.
- Create recording rules for rollups and downsampling.
- Integrate Thanos for long-term storage.

## Deploy with Flux

The monitoring stack is now managed through Flux. The manifests live under `gitops/clusters/homelab/infrastructure/monitoring`.
Flux applies the `kube-prometheus-stack` chart using the same `gitops/clusters/homelab/infrastructure/monitoring/monitoring-values.yaml` file, so existing Grafana dashboards and Prometheus data remain intact.
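
As a rough illustration of what such a Flux setup can look like, the sketch below pairs a `HelmRepository` with a `HelmRelease` that reads its values from a ConfigMap. It is a minimal sketch under stated assumptions, not the repository's actual manifests: the resource names (`prometheus-community`, `monitoring-values`), the intervals, the API versions (which assume a recent Flux release), and the idea that the ConfigMap is generated from `monitoring-values.yaml` (for example by a Kustomize `configMapGenerator`) are all hypothetical.

```yaml
# Hypothetical sketch; names, intervals, and the ConfigMap wiring are assumptions.
# API versions assume a recent Flux release; older installs may need beta versions.
apiVersion: source.toolkit.fluxcd.io/v1
kind: HelmRepository
metadata:
  name: prometheus-community
  namespace: monitoring
spec:
  interval: 1h
  url: https://prometheus-community.github.io/helm-charts
---
apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
metadata:
  name: prom-stack            # same release name as the manual helm install above
  namespace: monitoring
spec:
  interval: 10m
  chart:
    spec:
      chart: kube-prometheus-stack
      sourceRef:
        kind: HelmRepository
        name: prometheus-community
  valuesFrom:
    # Assumes a ConfigMap whose key holds the contents of monitoring-values.yaml
    - kind: ConfigMap
      name: monitoring-values
      valuesKey: monitoring-values.yaml
```

Keeping the release name and namespace identical to the earlier `helm install` is the design choice that matters here: it is what would let the Flux-managed release line up with the existing PersistentVolumeClaims instead of creating new ones.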