Senior DevOps & Platform Infrastructure Engineer

I build the infrastructure cities run on.

100,000+ IoT sensors · bare-metal Kubernetes · GitOps from commit to cluster · public-sector grade

I'm Belhadj Kessas, infrastructure lead of one of France's largest municipal Smart City IoT deployments, at Montpellier Méditerranée Métropole — a metropolitan authority serving 500,000+ citizens. Air quality, waste management, water and energy metering, mobility: not a pilot, not a demo. A live city, running on a platform I designed, built, and operate.

Montpellier, France · CKA — Certified Kubernetes Administrator, in progress (CNCF, 2026)

100,000+ IoT sensors in production
500,000+ citizens served by the platform
1 min from silent gateway to alert fired
4 Kubernetes clusters, GitOps-managed
0 manual steps, commit to cluster

GitOps

Provisioned entirely from code

The platform is fully on-premise and GitOps-driven: four bare-metal Kubernetes clusters with every layer declared in Git — cluster provisioning, application rollout, drift detection, environment promotion.
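As a rough illustration of what drift detection means here, the sketch below compares a declared state with a live one and reports every field that differs. It is a toy model, not how ArgoCD does it internally: the flattened dictionaries, the field names, and the `detect_drift` helper are all hypothetical.

```python
from typing import Any


def detect_drift(declared: dict[str, Any], live: dict[str, Any]) -> dict[str, tuple[Any, Any]]:
    """Return {key: (declared_value, live_value)} for every key that differs.

    A key missing on one side is reported with None on that side.
    """
    drift = {}
    for key in sorted(declared.keys() | live.keys()):
        want, have = declared.get(key), live.get(key)
        if want != have:
            drift[key] = (want, have)
    return drift


# Hypothetical flattened manifests: what Git declares vs. what the cluster reports.
declared = {"replicas": 3, "image": "ingest:1.4.2", "cpu_limit": "500m"}
live = {"replicas": 2, "image": "ingest:1.4.2", "debug": True}

for key, (want, have) in detect_drift(declared, live).items():
    print(f"{key}: declared={want!r} live={have!r}")
```

An empty result means the cluster matches Git; anything else is a candidate for reconciliation back to the declared state.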

A git push triggers a self-hosted runner that calls the hypervisor API and stands up a complete cluster — RKE2, Cilium, MetalLB, ArgoCD — with zero manual steps. Destroy it, push again, get an identical one back.

Data sovereignty, auditability, open source end to end — infrastructure built to serve citizens and built to last.

Edge to cloud

One platform, sensor to dashboard

production · RKE2 on VMware vSphere · the live city workload: 100k+ sensors, every citizen-facing service
pre-production · RKE2 on Proxmox · identical promotion target: changes prove themselves here first
monitoring · dedicated 3-node RKE2 · the observability substrate, on its own failure domain (detailed below)
gpu edge lab · Jetson Orin cluster · YOLO computer-vision inference experiments and GPU-aware scheduling, building toward MLOps on sovereign hardware

Shipped 2026

Observability with its own cluster

In 2026 I shipped a centralized, multi-tenant observability platform on a dedicated three-node cluster: Mimir for long-term metrics and Loki for logs, both backed by Rook-Ceph S3 object storage. Grafana Alloy collectors on production and pre-production remote-write into it, separated into three isolated tenants — production, pre-production, and the monitoring cluster watching itself. One Grafana federates every data source.
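Mimir and Loki both identify the tenant of a write or query through the `X-Scope-OrgID` HTTP header, which is what makes this kind of isolation possible. A minimal sketch of the idea follows; the tenant IDs and the `tenant_headers` helper are assumptions for illustration, not the platform's actual configuration.

```python
# Hypothetical mapping from source cluster to tenant ID.
TENANTS = {
    "production": "prod",
    "pre-production": "preprod",
    "monitoring": "monitoring",
}


def tenant_headers(cluster: str) -> dict[str, str]:
    """Build the HTTP header that scopes a write or query to one tenant."""
    try:
        tenant = TENANTS[cluster]
    except KeyError:
        # Refuse to fall back to a default tenant: data must never cross tenants silently.
        raise ValueError(f"no tenant configured for cluster {cluster!r}") from None
    return {"X-Scope-OrgID": tenant}


print(tenant_headers("production"))
```

Rejecting an unknown cluster, rather than falling back to a shared tenant, keeps a misconfigured collector from mixing its data into another tenant's.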

Provisioned the same way as everything else — OpenTofu → Ansible → GitLab CI → RKE2 → ArgoCD — and fed by real production traffic from 100,000+ sensors.

The architectural point is separation of concerns: the observability substrate lives on its own cluster, so losing a monitored cluster never means losing the ability to see it.

What I master

Depth where it counts

Kubernetes & platform engineering

RKE2/Rancher multi-cluster on self-hosted, on-premise infrastructure: production on VMware vSphere, pre-production on Proxmox. MetalLB, Rook-Ceph, Cilium (eBPF), Envoy Gateway, Helm.

CKA in progress — CNCF, expected 2026

GitOps & infrastructure as code

ArgoCD, GitLab CI/CD, OpenTofu, Terraform, Ansible. Full lifecycle as code: cluster provisioning, app rollout, drift detection, environment promotion. Built a self-service platform where external partners deploy via Git without ever touching cluster internals.

Large-scale IoT & networking

End-to-end LoRaWAN at city scale: RF planning, gateway deployment, VLAN segmentation, a multi-tenant LoRaWAN network server on Kubernetes. Automated device provisioning and OTA updates for a 100k+ fleet. Edge-to-cloud pipelines that survive partial outages.

Observability & SRE

Designed a dual-layer telemetry stack from scratch: Zabbix outside the clusters, and inside them Grafana Alloy collectors feeding a centralized multi-tenant Mimir + Loki platform on its own dedicated cluster. Alerts fire on proactive degradation thresholds, not post-failure alarms.

Detection in minutes, not days

Programming & emerging tech

Python (Django), Rust, Bash. Computer vision on the GPU edge lab: YOLO inference experiments on a Jetson Orin cluster, GPU-aware Kubernetes scheduling. Prototyping on-prem Kubeflow for AI-assisted log analysis.

Proof, not promises

Independently verifiable

One minute, not days
  1. T+0:00 A LoRa gateway goes silent — physical power failure.
  2. T+1 min Monitoring fires. Not when sensor data goes missing — when the gateway stops responding.
  3. Same hour A technician is on site and finds the fault — resolved before the data gap mattered.
“The real measure of an observability platform isn't how many incidents it helps resolve — it's how many it prevents.”
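The timeline above rests on watching the gateway itself rather than waiting for sensor data to go missing. A minimal sketch of that check, assuming each gateway reports a last-seen timestamp, is shown below; the gateway names, the `silent_gateways` helper, and the data layout are hypothetical, and the page does not say which tool performs the check in production.

```python
from datetime import datetime, timedelta, timezone

# Assumed threshold, chosen to match the one-minute figure in the timeline.
SILENCE_THRESHOLD = timedelta(minutes=1)


def silent_gateways(last_seen: dict[str, datetime], now: datetime,
                    threshold: timedelta = SILENCE_THRESHOLD) -> list[str]:
    """Return the IDs of gateways whose last response is at least `threshold` old."""
    return sorted(gw for gw, seen in last_seen.items() if now - seen >= threshold)


now = datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
last_seen = {
    "gw-north": now - timedelta(seconds=10),  # still responding
    "gw-port": now - timedelta(seconds=75),   # silent for more than a minute
}
print(silent_gateways(last_seen, now))
```

Because the check keys on gateway liveness, a single alert covers every sensor behind that gateway before any individual data gap becomes visible.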

Beyond the day job

Radio is the passion

Contact

Let's talk.

If you're building infrastructure people depend on — or you just want to talk radio and Kubernetes — I'd like to hear from you.

contact@belhadj.dev