# Production-Ready Monitoring Setup

<<<<<<< ours
This guide describes how to deploy a Prometheus and Grafana monitoring stack on a k3s + Proxmox homelab. The Helm chart values are stored in `k8s/monitoring-values.yaml`.
=======
This guide shows how to deploy Prometheus and Grafana using the `kube-prometheus-stack` Helm chart. The default values are stored in `gitops/clusters/homelab/infrastructure/monitoring/monitoring-values.yaml`.
>>>>>>> theirs

## 1. Deploy Prometheus & Grafana

Install the `kube-prometheus-stack` chart and expose Grafana on NodePort `30080`:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prom-stack prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace
```

## 2. Enable Persistence with local-path

Prometheus generates a large write load, so running it on distributed storage like Longhorn is discouraged.
Use node-local storage instead. Grafana can remain on Longhorn because it has lighter write requirements. 
Update `gitops/clusters/homelab/infrastructure/monitoring/monitoring-values.yaml` so only 
Prometheus claims `local-path` storage:

```yaml
# gitops/clusters/homelab/infrastructure/monitoring/monitoring-values.yaml
>>>>>>> theirs
grafana:
  persistence:
    enabled: true
    accessModes:
      - ReadWriteOnce
    size: 10Gi
    storageClassName: longhorn

prometheus:
  prometheusSpec:
    retention: 30d
    storageSpec:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: local-path
          resources:
            requests:
              storage: 100Gi
```

Apply the chart with these values to persist data locally.

## 3. Install and Configure Exporters

**Node Exporter**
- Run as a systemd service on Proxmox hosts and as a DaemonSet on k3s.
- Metrics available on port `9100`.

**Proxmox PVE Exporter**
- Install via `pipx` to `/usr/local/bin/pve_exporter`.
- Configure `/etc/pve_exporter/config.yaml`:

```yaml
default:
    user: root@pam
    token_name: prometheus
    token_value: <YOUR_TOKEN>
    verify_ssl: false
```

- Start with systemd using `--config.file` and expose metrics on `9221` at path `/pve`.

## 4. Configure Prometheus Scrape Jobs

Add the exporters to `additionalScrapeConfigs` in `monitoring-values.yaml`:

```yaml
- job_name: proxmox-node-exporter
  static_configs:
    - targets:
      - 192.168.4.122:9100
      - …
- job_name: proxmox-pve-exporter
  metrics_path: /pve
  static_configs:
    - targets:
      - 192.168.4.122:9221
```

Update the release:

```bash
helm upgrade prom-stack prometheus-community/kube-prometheus-stack \
  -n monitoring -f monitoring-values.yaml
```

## 5. Import Grafana Dashboards

- **Node Exporter Full** (1860)
- **SMART / Disk Health** (13654)
- **Proxmox via Prometheus** (10347)
- (Optional) **Proxmox VE Node** (10048) and **Cluster Summary** (10049)

## 6. Set up Alerting

Define PrometheusRules in `monitoring-values.yaml`:

```yaml
additionalPrometheusRules:
  - name: host-alerts
    groups:
      - name: disk-usage
        rules:
          - alert: HighDiskUsage
            expr: (pve_disk_usage_bytes{id=~"storage/.+"} / pve_disk_size_bytes{id=~"storage/.+"}) > 0.80
            for: 10m
      - name: cpu-temp
        rules:
          - alert: HighCPUTemperature
            expr: node_hwmon_temp_celsius{sensor="temp1"} > 85
            for: 5m
```

## 7. GPU Monitoring (optional)

- Deploy NVIDIA DCGM Exporter as a DaemonSet (port `9400`).
- Create a `ServiceMonitor` for `app=dcgm-exporter`.
- Import the NVIDIA GPU dashboard (12256).

## 8. Next Steps

- Configure notification channels (Slack, email, PagerDuty).
- Secure access via ingress with TLS or Tailscale.
- Add nightly drift checks via CI/cron.
- Create recording rules for rollups and downsampling.
- Integrate Thanos for long-term storage.
=======
Apply the Helm chart with these values to ensure data stays on each node's local disk.

## Deploy with Flux

The monitoring stack is now managed through Flux. The manifests live under
`gitops/clusters/homelab/infrastructure/monitoring`. Flux applies the
`kube-prometheus-stack` chart using the same
`gitops/clusters/homelab/infrastructure/monitoring/monitoring-values.yaml` file,
so existing Grafana dashboards and Prometheus data remain intact.