MSc Thesis · Progress meeting · 3 June 2026

Stable-edge filtering for passive OT device classification under operational change

What changed since the draft you have — and a finding I need to flag.

Jonathan van den Heuvel · supervisors dr. Cyril Hsu & dr. Chrysa Papagianni
University of Amsterdam — System & Network Engineering × KPMG Cyber

Opener — say it before the slides: "Before we discuss the draft you have: by working through the training pipeline the way Cyril asked, I found that the filter my method describes — phase-local — is not the filter the code ran. The whole maintenance penalty comes from persistence computed over an observation window that spans the outage. The negative finding survives and is actually stronger once reframed honestly." Lead with the discovery.

The question

Can a simple "keep only the edges that persist" filter make passive OT device classification more robust to operational change?

Passive analysis is the default in OT — active scanning can disturb the process. The observed communication graph mixes stable operational links with transient ones (engineering sessions, scans). The thesis tests temporal persistence as the proxy for "which edges to keep".

RQ1 · Does the filter help on steady-state traffic?
RQ2 · Does it make classification robust when hosts undergo operational change?
RQ3 · Where does the classification signal, and the filter's effect, live?

Three RQs, verbatim from the proposal. The interesting one is RQ2. Frame today around RQ2 + what the controls now show.

What I built

A reproducible OT lab with edge-level ground truth

20 always-on hosts (5 classes × 4) + 2 conditional = 22 observed
Modbus/TCP + S7 control traffic; HTTP/DNS on IT endpoints
One passive tap on a shared segment — no active scanning
Steady state + four scripted change scenarios
Seeded, fingerprinted, released — Docker + Zeek + flow records

controller: PLC · Modbus/S7 · polled
supervisory: HMI · polls controllers
engineering: workstation · sessions
historian: periodic snapshots
IT endpoint: HTTP / DNS

conditional hosts (onboarding HMI, noise injector) join only in their own scenario, never in the held-out test set

If Cyril asks "based on a paper?": inspired by ICSSIM & MiniCPS (same tradition, protocols, Purdue roles), but custom Docker because those don't support scriptable scenario phases. Not a reproduction.

How it's evaluated

Host-stratified inductive split — classify hosts the model has never seen

Train on steady-state only; apply without retraining under change
Per seed: 3 train + 1 held-out host per class, redrawn each seed
10 lab × 10 model seeds; paired Wilcoxon within seed
Window-stratified would just memorise hostnames — this is inductive
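A minimal sketch of the per-seed host draw described above, assuming hosts are grouped by class. The host names and the `split_hosts` helper are illustrative, not the thesis code.

```python
import random

# Hypothetical inventory: 5 classes x 4 always-on hosts (names are illustrative).
CLASSES = ["controller", "supervisory", "engineering", "historian", "it_endpoint"]
HOSTS = {c: [f"{c}-{i}" for i in range(1, 5)] for c in CLASSES}


def split_hosts(hosts_by_class, seed):
    """Draw 1 held-out host per class; the other hosts of that class train."""
    rng = random.Random(seed)  # a fresh, reproducible draw for every seed
    train, held_out = [], []
    for cls in sorted(hosts_by_class):  # sorted for a stable iteration order
        members = sorted(hosts_by_class[cls])
        test_host = rng.choice(members)
        held_out.append(test_host)
        train.extend(h for h in members if h != test_host)
    return train, held_out


train, held_out = split_hosts(HOSTS, seed=0)
print(len(train), len(held_out))  # 15 5
```

Splitting on hosts rather than on windows is what keeps every window of a held-out host out of training.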

0.772 → 0.490 macro-F1 (training hosts → held-out hosts). The model learns, then hits a generalisation ceiling (a three-way sup/eng/hist confusion). Chance = 0.20.
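For reference, a small sketch of how macro-F1 weights every class equally. The labels and predictions are a toy example of a three-way confusion, not thesis output, and `macro_f1` is a hypothetical helper.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy case: engineering and historian hosts are both predicted as supervisory.
y_true = ["controller", "supervisory", "engineering", "historian", "it_endpoint"]
y_pred = ["controller", "supervisory", "supervisory", "supervisory", "it_endpoint"]
print(macro_f1(y_true, y_pred))  # 0.5
```

Two classes score 0 and one scores 0.5, so the mean is 0.5 even though three of five hosts are labelled correctly.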

This closes Cyril's 6 May asks: proper held-out split + "first fit the training set". The old 0.247 was an underpowered n=4 split — fixed.

Since the draft you have

Everything you asked for — and then one step further

Going deep into the pipeline, as asked, surfaced something I need to flag →

✓ Host-stratified inductive split
✓ "Does it learn?" — train-fit sanity check
✓ Data-preparation method section
✓ Re-run with corrected protocol
✓ 10 lab × 10 model seeds
✓ Random-forest baseline
✓ Frequency-rate baseline
✓ (W, θ) sweep

Frame as: "I did the six things from 6 May, plus the analyses we discussed on 1 May." Then pivot to the discovery — don't oversell the checklist, it's the setup.

A finding I need to flag

The filter the method describes is not the filter the code ran

Method (and contribution #1): a phase-local filter
Code: presence over the whole observation window — all phases
The 40-min outage drags plc-1's polls to 22/30 = 73% < θ, so they're cut from every window

This explains the result — and it's the realistic passive case: you can't segment phases live.
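The difference between the two filters can be sketched on a single edge. This assumes θ = 0.8 and a layout of 11 normal, 8 outage, and 11 normal windows; both are illustrative choices that only reproduce the 22/30 figure, and the function names are hypothetical.

```python
THETA = 0.8  # hypothetical threshold; the slide only states that 73% < θ

# One edge (a poll into plc-1) across 30 windows: True = observed.
# Assumed layout: 11 normal windows, 8 outage windows, 11 normal windows.
phases = ["pre"] * 11 + ["outage"] * 8 + ["post"] * 11
present = [p != "outage" for p in phases]


def persistence(flags):
    """Fraction of windows in which the edge was observed."""
    return sum(flags) / len(flags)


def keep_observation_window(present, theta):
    """One score over all windows decides the edge's fate in every window."""
    keep = persistence(present) >= theta
    return [seen and keep for seen in present]


def keep_phase_local(present, phases, theta):
    """Persistence is scored separately inside each phase."""
    score = {}
    for phase in set(phases):
        flags = [s for s, p in zip(present, phases) if p == phase]
        score[phase] = persistence(flags)
    return [s and score[p] >= theta for s, p in zip(present, phases)]


print(round(persistence(present), 3))                 # 0.733
print(sum(keep_observation_window(present, THETA)))   # 0
print(sum(keep_phase_local(present, phases, THETA)))  # 22
```

Under these assumptions the observation-window rule drops the edge from all 30 windows, while the phase-local rule keeps it in the 22 windows where it was seen.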

plc-1 inbound polls across one maintenance run (30 windows)

THE landmine — raise it first, as a win. Your old "why phase-local?" answer and "config-drift validates phase-locality" are now false; retract them if they come up. The discovery came from doing what Cyril asked (understand the pipeline).

RQ2 · robustness under change

No benefit anywhere — and a significant penalty under maintenance

1. Controller paused → its polls fall below θ
2. Filter strips them → in-degree 20→0, in-bytes 2.1M→0
3. Features collapse → misread as an idle IT endpoint
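The collapse in steps 2 and 3 can be sketched with a toy feature aggregation. It assumes in-degree counts distinct sources and that edges are `(src, dst, bytes)` tuples; the per-edge byte count is invented so that the total matches the 2.1M on the slide.

```python
from collections import defaultdict


def host_features(edges):
    """Aggregate (src, dst, bytes) edges into per-host (in-degree, in-bytes)."""
    sources = defaultdict(set)
    in_bytes = defaultdict(int)
    for src, dst, nbytes in edges:
        sources[dst].add(src)
        in_bytes[dst] += nbytes
    return {host: (len(sources[host]), in_bytes[host]) for host in sources}


# Illustrative window: 20 distinct sources polling plc-1, 105_000 bytes each.
edges = [(f"poller-{i}", "plc-1", 105_000) for i in range(20)]
before = host_features(edges).get("plc-1", (0, 0))

# After the filter has cut every inbound poll, nothing points at plc-1.
filtered = [e for e in edges if e[1] != "plc-1"]
after = host_features(filtered).get("plc-1", (0, 0))

print(before, after)  # (20, 2100000) (0, 0)
```

A host whose inbound features are all zero carries no evidence of being polled, which is the signal that separates a controller from an idle endpoint.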

0.45 → 0.36 · Δ −0.089 · p = 0.027 · worse 8/10
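For readers unfamiliar with the paired test, here is a self-contained sketch of an exact two-sided Wilcoxon signed-rank test. The per-seed scores are placeholders, not the thesis results, and the thesis pipeline may use a library implementation instead.

```python
from itertools import product


def wilcoxon_exact(x, y):
    """Exact two-sided Wilcoxon signed-rank test; returns (W, p)."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences dropped
    n = len(diffs)
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:  # assign average ranks to tied absolute differences
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    total = sum(ranks)
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_obs = min(w_plus, total - w_plus)
    # Under H0 every rank carries a + or - sign with equal probability,
    # so enumerate all 2**n sign patterns (cheap for n = 10).
    hits = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if min(w, total - w) <= w_obs:
            hits += 1
    return w_obs, hits / 2 ** n


# Placeholder scores for 10 seeds; here the filtered run is lower in every seed.
baseline = [0.50, 0.42, 0.47, 0.44, 0.49, 0.41, 0.46, 0.43, 0.48, 0.45]
filtered = [b - 0.01 * (i + 1) for i, b in enumerate(baseline)]
w, p = wilcoxon_exact(filtered, baseline)
print(p)  # 0.001953125
```

With all 10 differences in one direction only 2 of the 1024 sign patterns are as extreme, which is the smallest two-sided p this test can give at n = 10.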

Δ held-out macro-F1 (filtered − baseline), per scenario

Don't claim per-class sup/eng/hist movements — noise floor. "All-zero collapse" is exact only for plc-1's outage windows (pooled controller F1 0.82, not 0).

Is it real?

Four controls isolate the cause

Random, same count removed → harmless. Not "any removal hurts."
Byte-volume → also harmful. Content-blind proxies hit the low-volume polls.
Phase-local (filter as described) → removes nothing. The penalty is the window.
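A sketch of two of the same-count controls, assuming the byte-volume rule drops the lowest-volume edges and that `k` is the number of edges the persistence filter cut in that window. The edge list and helper names are illustrative.

```python
import random

# Illustrative window of (src, dst, bytes) edges: 4 small polls into plc-1
# and 6 large IT flows. All names and byte counts are made up.
edges = [(f"hmi-{i}", "plc-1", 2_000) for i in range(4)]
edges += [(f"it-{i}", "web-1", 500_000) for i in range(6)]


def remove_random(edges, k, seed):
    """Control: drop k edges chosen uniformly at random."""
    rng = random.Random(seed)
    drop = set(rng.sample(range(len(edges)), k))
    return [e for i, e in enumerate(edges) if i not in drop]


def remove_lowest_volume(edges, k):
    """Control: drop the k edges that carried the fewest bytes."""
    order = sorted(range(len(edges)), key=lambda i: edges[i][2])
    drop = set(order[:k])
    return [e for i, e in enumerate(edges) if i not in drop]


k = 4  # assumed to match what the persistence filter removed in this window
by_volume = remove_lowest_volume(edges, k)
at_random = remove_random(edges, k, seed=0)

print(len(by_volume), len(at_random))                  # 6 6
print(sum(dst == "plc-1" for _, dst, _ in by_volume))  # 0
```

Both rules remove the same number of edges, but in this toy window the volume rule removes exactly the polls, while the random rule spreads its removals across all edges.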

Maintenance Δ macro-F1 — same count removed per window, four selection rules

This is the strongest section — the audit said these controls were missing; they're run now. The honest concession: freq-rate also hurting defeats "persistence specifically" but is the evidence for the reframe (content-blindness).

Ruling out confounds

Not distribution shift; concentrated on the paused controller

Train on sparsified graphs? · Penalty persists (Δ −0.126) · So it's feature destruction, not dense-train / sparse-test shift
Leave-one-controller-out · plc-1 (paused) −0.229 vs plc-2 −0.040 · Clean per-controller estimate: the effect tracks the paused host
Classifier-independent · Random forest shows the same penalty (−0.104) · Not a graph-aggregation artifact

B1 (sparsified-train) closes the "just distribution shift" objection. B5 (per-controller) replaces the weak n=3 dose argument with clean folds. Both run this session.

RQ1 / RQ3 · does the graph help?

A graph-free random forest is at least as accurate as GraphSAGE

The graph-free model is at least as accurate — in fact slightly but significantly higher (paired p = 0.037, CI [+0.004, +0.049])
So the communication graph buys no accuracy gain at this scale
The result is a property of the host features, not message passing — which makes the maintenance finding classifier-independent

Held-out macro-F1 — random forest vs GraphSAGE (10 seeds, paired)

Own this — don't defend the graph. RF ≥ GraphSAGE strengthens the result (classifier-independent). Use "at least as accurate / no measurable benefit", not "RF is more accurate" (let Cyril pick the phrasing).

What I think the thesis now says

Content-agnostic edge filtering is fragile for passive OT classification.

A controller's class-defining inbound polls are both low-volume and event-sensitive, so multiple natural proxies — persistence, byte-volume — preferentially strip them; random removal doesn't, but is useless as a denoiser. A useful filter must be content/semantics-aware.

The reframe — the alignment you want from this meeting. Use this "middle" wording. Never the bare "temporal persistence is the wrong abstraction" (over-claim, tied to the mis-stated filter). Conservative fallback if pushed on external validity.

What it means — and what it doesn't

For the field

Don't deploy persistence/volume edge filters as generic preprocessing
Record asset state at capture (was anything paused?)
If you denoise, do it semantics-aware — protocol, direction, rate

Scope & limits — stated up front

Magnitude is lab-specific — a near-bipartite poll topology
Mechanism (low-volume + event-sensitive class-defining edges) is general
Field validation on a real trace = primary future work

Be most humble on external validity (the bipartite-topology objection). Concede the magnitude is lab-specific; defend only the qualitative mechanism (byte-volume replicating it supports generality).

Tightening the thesis

Citation audit + reframe in progress

24 / 34 citations verified faithful against the source PDFs; 10 corrected
e.g. Heo & Shin add attacker hosts + noise (not edge injection); Niedermaier = MAC/ARP, not flow stats
Method being rewritten to the observation-window filter; contribution #1 recast
~30 pages, trimming to stay in the examiner's 20–30 range

34 → 24 citations faithful

10 corrected

10 × 10 seeds

~30 pages · examiner target ≤ 30

If asked "did you read the papers?": yes — 34 audited against PDFs, 24 faithful, 10 corrected; none weaken the gap (operational-change robustness + temporal edge filtering).

Where I'd like your steer

Decisions to align on today

Contribution #1 — rename to "observation-window persistence", and keep idealized phase-local as future work?
Central claim — the "content-agnostic filtering is fragile" wording, or condition it tighter?
Graph framing — move "the graph is inert at this scale" into the abstract?
Title — keep "stable-edge filtering", or shift toward "edge-preprocessing"?
Length — fold the controls into RQ2 (vs a new section) to stay ≤ 30 pp?

Ask these as genuine questions — you want their buy-in on the reframe, not approval of a fait accompli. This is a progress meeting; soliciting direction is the point.

In one line

The negative result survives — and is stronger and more honest than the draft you have.

Jonathan van den Heuvel · UvA SNE × KPMG Cyber · 3 June 2026
Thank you — questions?

Close on the honest-and-stronger framing. Then open the floor for the decisions on slide 14.