MSc Thesis · Progress meeting · 3 June 2026
Stable-edge filtering for passive OT device classification under operational change
What changed since the draft you have — and a finding I need to flag.
Jonathan van den Heuvel · supervisors dr. Cyril Hsu & dr. Chrysa Papagianni
University of Amsterdam — System & Network Engineering × KPMG Cyber
The question
Can a simple "keep only the edges that persist" filter make passive OT device classification more robust to operational change?
Passive analysis is the default in OT — active scanning can disturb the process. The observed communication graph mixes stable operational links with transient ones (engineering sessions, scans). The thesis tests temporal persistence as the proxy for "which edges to keep".
- RQ1 · Does the filter help on steady-state traffic?
- RQ2 · Does it make classification robust when hosts undergo operational change?
- RQ3 · Where does the classification signal, and the filter's effect, live?
What I built
A reproducible OT lab with edge-level ground truth
- 20 always-on hosts (5 classes × 4) + 2 conditional = 22 observed
- Modbus/TCP + S7 control traffic; HTTP/DNS on IT endpoints
- One passive tap on a shared segment — no active scanning
- Steady state + four scripted change scenarios
- Seeded, fingerprinted, released — Docker + Zeek + flow records
- controller: PLC · Modbus/S7 · polled
- supervisory: HMI · polls controllers
- engineering: workstation · sessions
- historian: periodic snapshots
- IT endpoint: HTTP / DNS
Conditional hosts (onboarding HMI, noise injector) join only in their own scenario and never appear in the held-out test set.
How it's evaluated
Host-stratified inductive split — classify hosts the model has never seen
- Train on steady-state only; apply without retraining under change
- Per seed: 3 train + 1 held-out host per class, redrawn each seed
- 10 lab × 10 model seeds; paired Wilcoxon within seed
- A window-stratified split would let the model memorise hostnames; the host-stratified split is inductive
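The split above can be sketched as follows. This is a minimal illustration, not the thesis code: the host names, the `host_stratified_split` helper and the use of Python's `random` module are assumptions. Only the 5 classes × 4 hosts and the 3 train + 1 held-out host per class, redrawn per seed, come from the slide.

```python
import random

CLASSES = ["controller", "supervisory", "engineering", "historian", "it_endpoint"]

def host_stratified_split(hosts_by_class, seed, n_test=1):
    """Hold out n_test hosts per class; all remaining hosts are training hosts."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in sorted(hosts_by_class):
        hosts = sorted(hosts_by_class[cls])
        rng.shuffle(hosts)              # the draw depends only on the seed
        test.extend(hosts[:n_test])
        train.extend(hosts[n_test:])
    return train, test

# Hypothetical host names: 5 classes x 4 always-on hosts.
hosts_by_class = {cls: [f"{cls}-{i}" for i in range(1, 5)] for cls in CLASSES}

for seed in range(3):
    train, test = host_stratified_split(hosts_by_class, seed)
    print(seed, len(train), sorted(test))
```

Because whole hosts are held out, no window from a test host is ever seen in training, which is what makes the evaluation inductive.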
Since the draft you have
Everything you asked for — and then one step further
Going deep into the pipeline, as asked, surfaced something I need to flag →
- ✓ Host-stratified inductive split
- ✓ "Does it learn?" — train-fit sanity check
- ✓ Data-preparation method section
- ✓ Re-run with corrected protocol
- ✓ 10 lab × 10 model seeds
- ✓ Random-forest baseline
- ✓ Frequency-rate baseline
- ✓ (W, θ) sweep
A finding I need to flag
The filter the method describes is not the filter the code ran
- Method (and contribution #1): a phase-local filter
- Code: presence over the whole observation window — all phases
- The 40-min outage drags plc-1's polls to 22/30 = 73% < θ, so they're cut from every window
This explains the result — and it's the realistic passive case: you can't segment phases live.
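The difference between the two filters can be sketched on toy data. The value θ = 0.8, the position of the outage within the 30 windows, and the helper names are assumptions for illustration. The slide only gives 22/30 = 73% < θ.

```python
def persistence(seen, windows):
    """Fraction of the given windows in which an edge was observed."""
    windows = set(windows)
    return len(seen & windows) / len(windows)

def removed_observations(edge_windows, phases, theta):
    """Return the (edge, window) observations a persistence filter drops.

    Persistence is computed within each phase. Passing one phase that covers
    every window gives the observation-window variant.
    """
    removed = set()
    for phase in phases:
        phase = set(phase)
        for edge, seen in edge_windows.items():
            if persistence(seen, phase) < theta:
                removed |= {(edge, w) for w in seen & phase}
    return removed

THETA = 0.8                          # assumed threshold (must exceed 22/30)
all_windows = set(range(30))
outage = set(range(11, 19))          # 8 windows; placement is hypothetical
steady = all_windows - outage

edge_windows = {
    ("hmi-1", "plc-1"): all_windows - outage,   # polls seen in 22 of 30 windows
    ("hmi-1", "plc-2"): set(all_windows),       # never interrupted
}

obs = removed_observations(edge_windows, [all_windows], THETA)
local = removed_observations(edge_windows, [steady, outage], THETA)
print(len(obs), len(local))  # 22 0
```

Under these assumptions the observation-window variant cuts all 22 poll observations into plc-1, while the phase-local variant removes nothing.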
RQ2 · robustness under change
No benefit anywhere — and a significant penalty under maintenance
1. Controller paused → its polls fall below θ
2. Filter strips them → in-degree 20→0, in-bytes 2.1M→0
3. Features collapse → misread as an idle IT endpoint
0.45 → 0.36 Δ −0.089 · p = 0.027 · worse 8/10
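A toy version of the feature collapse in the three steps above, under stated assumptions: in-degree is taken here as the number of inbound edge records (the slide does not say how it is counted), and the host names and per-record byte sizes are invented so that the totals match 20 and 2.1M.

```python
from collections import defaultdict

def inbound_features(edges):
    """Per destination host: (number of inbound edge records, total inbound bytes)."""
    degree = defaultdict(int)
    nbytes = defaultdict(int)
    for _src, dst, size in edges:
        degree[dst] += 1
        nbytes[dst] += size
    return {host: (degree[host], nbytes[host]) for host in degree}

# Toy data: 20 inbound poll records of 105,000 bytes each (20 x 105,000 = 2.1M).
polls = [("hmi-1", "plc-1", 105_000) for _ in range(20)]
background = [("ws-1", "it-1", 4_000)]
edges = polls + background

before = inbound_features(edges).get("plc-1", (0, 0))
# Stand-in for the persistence filter cutting every poll into plc-1.
filtered = [e for e in edges if e[1] != "plc-1"]
after = inbound_features(filtered).get("plc-1", (0, 0))
print(before, after)  # (20, 2100000) (0, 0)
```

With both inbound features at zero, the controller's feature vector no longer carries the signal that distinguishes it from a host that is simply idle.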
Is it real?
Four controls isolate the cause
- Random, same count removed → harmless. Not "any removal hurts."
- Byte-volume → also harmful. Content-blind proxies hit the low-volume polls.
- Phase-local (filter as described) → removes nothing. The penalty is the window.
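Two of these controls can be sketched as below. The edge list, the byte threshold and the function names are hypothetical. The point is only that a byte-volume cut targets the low-volume polls, while random removal of the same number of edges does not single them out.

```python
import random

def byte_volume_filter(edges, min_bytes):
    """Content-blind proxy: keep only edges carrying at least min_bytes."""
    return [e for e in edges if e[2] >= min_bytes]

def random_removal(edges, n_remove, seed):
    """Control: drop n_remove edges chosen uniformly at random."""
    rng = random.Random(seed)
    drop = set(rng.sample(range(len(edges)), n_remove))
    return [e for i, e in enumerate(edges) if i not in drop]

# Toy data: 4 low-volume polls into plc-1 and 12 high-volume edges elsewhere.
polls = [(f"hmi-{i}", "plc-1", 2_000) for i in range(1, 5)]
bulk = [(f"host-{i}", "historian-1", 500_000) for i in range(1, 13)]
edges = polls + bulk

by_volume = byte_volume_filter(edges, min_bytes=10_000)
n_removed = len(edges) - len(by_volume)
at_random = random_removal(edges, n_removed, seed=0)   # same count removed

polls_left_volume = sum(1 for e in by_volume if e[1] == "plc-1")
print(n_removed, polls_left_volume, len(at_random))  # 4 0 12
```

In this toy setup the volume filter removes every poll, whereas a random draw of 4 out of 16 edges hits on average only one of the 4 polls.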
Ruling out confounds
Not distribution shift; concentrated on the paused controller
- Train on sparsified graphs? The penalty persists (Δ −0.126), so it is feature destruction, not dense-train / sparse-test shift.
- Leave-one-controller-out: plc-1 (paused) −0.229 vs plc-2 −0.040. A clean per-controller estimate; the effect tracks the paused host.
- Classifier-independent: a random forest shows the same penalty (−0.104), so it is not a graph-aggregation artifact.
RQ1 / RQ3 · does the graph help?
A graph-free random forest is at least as accurate as GraphSAGE
- The graph-free model is slightly but significantly more accurate (paired p = 0.037, CI [+0.004, +0.049])
- So the communication graph buys no accuracy gain at this scale
- The result is a property of the host features, not message passing — which makes the maintenance finding classifier-independent
What I think the thesis now says
Content-agnostic edge filtering is fragile for passive OT classification.
A controller's class-defining inbound polls are both low-volume and event-sensitive, so multiple natural proxies — persistence, byte-volume — preferentially strip them; random removal doesn't, but is useless as a denoiser. A useful filter must be content/semantics-aware.
What it means — and what it doesn't
For the field
- Don't deploy persistence/volume edge filters as generic preprocessing
- Record asset state at capture (was anything paused?)
- If you denoise, do it semantics-aware — protocol, direction, rate
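One hypothetical reading of "semantics-aware" is sketched below. It is not a filter evaluated in the thesis. The `Edge` fields, the thresholds and the keep rule are all assumptions. The only external facts used are the standard ports: 502 for Modbus/TCP and 102 for S7 (ISO-on-TCP).

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Edge:
    src: str
    dst: str
    dst_port: int        # server-side port, so it also encodes direction
    persistence: float   # fraction of windows in which the edge is present
    active_rate: float   # mean requests per minute in windows where present

CONTROL_PORTS = {502, 102}   # Modbus/TCP, S7 (ISO-on-TCP)

def keep(edge, theta, min_rate):
    """Keep control-protocol polls even when an outage lowers their persistence."""
    is_control_poll = edge.dst_port in CONTROL_PORTS and edge.active_rate >= min_rate
    return is_control_poll or edge.persistence >= theta

poll = Edge("hmi-1", "plc-1", 502, persistence=22 / 30, active_rate=12.0)
scan = Edge("ws-1", "plc-1", 8080, persistence=0.1, active_rate=0.5)

print(keep(poll, theta=0.8, min_rate=1.0), keep(scan, theta=0.8, min_rate=1.0))  # True False
```

A rule of this kind uses protocol, direction and rate to protect the class-defining polls, and falls back to persistence for everything else.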
Scope & limits — stated up front
- Magnitude is lab-specific — a near-bipartite poll topology
- Mechanism (low-volume + event-sensitive class-defining edges) is general
- Field validation on a real trace = primary future work
Tightening the thesis
Citation audit + reframe in progress
- 24 / 34 citations verified faithful against the source PDFs; 10 corrected
- e.g. Heo & Shin add attacker hosts and noise (not edge injection); Niedermaier is MAC/ARP-based, not flow-statistics-based
- Method being rewritten to the observation-window filter; contribution #1 recast
- ~30 pages, trimming to stay in the examiner's 20–30 range
Where I'd like your steer
Decisions to align on today
- Contribution #1 — rename to "observation-window persistence", and keep idealized phase-local as future work?
- Central claim — the "content-agnostic filtering is fragile" wording, or condition it tighter?
- Graph framing — move "the graph is inert at this scale" into the abstract?
- Title — keep "stable-edge filtering", or shift toward "edge-preprocessing"?
- Length — fold the controls into RQ2 (vs a new section) to stay ≤ 30 pp?
In one line
The negative result survives — and is stronger and more honest than the draft you have.
Jonathan van den Heuvel · UvA SNE × KPMG Cyber · 3 June 2026
Thank you — questions?