MSc Thesis · Progress meeting · 3 June 2026

Stable-edge filtering for passive OT device classification under operational change

What changed since the draft you have — and a finding I need to flag.

Jonathan van den Heuvel  ·  supervisors dr. Cyril Hsu & dr. Chrysa Papagianni
University of Amsterdam — System & Network Engineering  ×  KPMG Cyber

Opener — say it before the slides: "Before we discuss the draft you have: by working through the training pipeline the way Cyril asked, I found that the filter my method describes — phase-local — is not the filter the code ran. The whole maintenance penalty comes from persistence computed over an observation window that spans the outage. The negative finding survives and is actually stronger once reframed honestly." Lead with the discovery.

The question

Can a simple "keep only the edges that persist" filter make passive OT device classification more robust to operational change?

Passive analysis is the default in OT — active scanning can disturb the process. The observed communication graph mixes stable operational links with transient ones (engineering sessions, scans). The thesis tests temporal persistence as the proxy for "which edges to keep".

  1. RQ1: Does the filter help on steady-state traffic?
  2. RQ2: Does it make classification robust when hosts undergo operational change?
  3. RQ3: Where does the classification signal, and the filter's effect, live?
Three RQs, verbatim from the proposal. The interesting one is RQ2. Frame today around RQ2 + what the controls now show.

What I built

A reproducible OT lab with edge-level ground truth

  • 20 always-on hosts (5 classes × 4), + 2 conditional = 22 observed
  • Modbus/TCP + S7 control traffic; HTTP/DNS on IT endpoints
  • One passive tap on a shared segment — no active scanning
  • Steady state + four scripted change scenarios
  • Seeded, fingerprinted, released — Docker + Zeek + flow records

Host classes:

  • controller: PLC · Modbus/S7 · polled
  • supervisory: HMI · polls controllers
  • engineering: workstation · sessions
  • historian: periodic snapshots
  • IT endpoint: HTTP / DNS

Conditional hosts (onboarding HMI, noise injector) join only in their own scenario and never appear in the held-out test set.

If Cyril asks "based on a paper?": inspired by ICSSIM & MiniCPS (same tradition, protocols, Purdue roles), but custom Docker because those don't support scriptable scenario phases. Not a reproduction.

How it's evaluated

Host-stratified inductive split — classify hosts the model has never seen

  • Train on steady-state only; apply without retraining under change
  • Per seed: 3 train + 1 held-out host per class, redrawn each seed
  • 10 lab × 10 model seeds; paired Wilcoxon within seed
  • Window-stratified would just memorise hostnames — this is inductive
Macro-F1: 0.772 on training hosts → 0.490 on held-out hosts. The model learns, then hits a generalisation ceiling (a sup/eng/hist three-way confusion). Chance = 0.20.
This closes Cyril's 6 May asks: proper held-out split + "first fit the training set". The old 0.247 was an underpowered n=4 split — fixed.
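
Backup sketch for the split, in case the protocol needs spelling out. This is a minimal illustration of "3 train + 1 held-out host per class, redrawn each seed", not the thesis code: the host names and the `host_stratified_split` helper are hypothetical, and only the 5 classes × 4 hosts layout comes from the lab slide.

```python
import random

# Hypothetical inventory: 5 classes x 4 always-on hosts (names are illustrative).
CLASSES = ["controller", "supervisory", "engineering", "historian", "it_endpoint"]
HOSTS = {c: [f"{c}-{i}" for i in range(1, 5)] for c in CLASSES}


def host_stratified_split(hosts_by_class, seed):
    """Hold out one host per class; the remaining hosts of that class train."""
    rng = random.Random(seed)  # the split is redrawn for every seed
    train, held_out = [], []
    for cls in sorted(hosts_by_class):
        members = sorted(hosts_by_class[cls])
        test_host = rng.choice(members)
        held_out.append(test_host)
        train.extend(h for h in members if h != test_host)
    return train, held_out


train, held_out = host_stratified_split(HOSTS, seed=0)
print(len(train), len(held_out))  # 15 5
```

Because whole hosts are held out, no window from a test host can leak into training, which is what makes the evaluation inductive.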

Since the draft you have

Everything you asked for — and then one step further

Going deep into the pipeline, as asked, surfaced something I need to flag →

  • Host-stratified inductive split
  • "Does it learn?" — train-fit sanity check
  • Data-preparation method section
  • Re-run with corrected protocol
  • 10 lab × 10 model seeds
  • Random-forest baseline
  • Frequency-rate baseline
  • (W, θ) sweep
Frame as: "I did the six things from 6 May, plus the analyses we discussed on 1 May." Then pivot to the discovery — don't oversell the checklist, it's the setup.

A finding I need to flag

The filter the method describes is not the filter the code ran

  • Method (and contribution #1): a phase-local filter
  • Code: presence over the whole observation window — all phases
  • The 40-min outage drags plc-1's polls to 22/30 = 73% < θ, so they're cut from every window

This explains the result — and it's the realistic passive case: you can't segment phases live.

plc-1 inbound polls across one maintenance run (30 windows)
THE landmine — raise it first, as a win. Your old "why phase-local?" answer and "config-drift validates phase-locality" are now false; retract them if they come up. The discovery came from doing what Cyril asked (understand the pipeline).
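
Backup sketch of the two filters, to make the 22/30 arithmetic concrete. It is a toy illustration, not the pipeline code: the 30 windows and 8 missing windows match the slide, but the position of the outage, the phase boundaries, and θ = 0.8 are assumptions chosen for the example.

```python
# presence[w] is True if the plc-1 poll edge was observed in window w.
N_WINDOWS = 30
OUTAGE = range(11, 19)   # 8 windows without polls (position is assumed)
THETA = 0.8              # assumed threshold, for illustration only

presence = [w not in OUTAGE for w in range(N_WINDOWS)]
phases = {"pre": range(0, 11), "outage": OUTAGE, "post": range(19, 30)}


def persistence(presence, windows):
    """Fraction of the given windows in which the edge was observed."""
    windows = list(windows)
    return sum(presence[w] for w in windows) / len(windows)


# Observation-window filter: one score over all phases, applied to every window.
global_score = persistence(presence, range(N_WINDOWS))
kept_global = [presence[w] and global_score >= THETA for w in range(N_WINDOWS)]

# Phase-local filter: the score is recomputed inside each phase.
kept_local = [False] * N_WINDOWS
for windows in phases.values():
    score = persistence(presence, windows)
    for w in windows:
        kept_local[w] = presence[w] and score >= THETA

print(round(global_score, 3))  # 0.733
print(sum(kept_global))        # 0  (edge cut from every window)
print(sum(kept_local))         # 22 (edge kept wherever it was observed)
```

Under these assumptions the whole-window score falls below θ, so the edge disappears even from the steady windows, while the phase-local variant leaves every observed poll in place.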

RQ2 · robustness under change

No benefit anywhere — and a significant penalty under maintenance

  1. Controller paused → its polls fall below θ
  2. Filter strips them → in-degree 20→0, in-bytes 2.1M→0
  3. Features collapse → misread as an idle IT endpoint

Held-out macro-F1: baseline 0.45 → filtered 0.36 · Δ −0.089 · p = 0.027 · worse in 8/10

Δ held-out macro-F1 (filtered − baseline), per scenario
Don't claim per-class sup/eng/hist movements — noise floor. "All-zero collapse" is exact only for plc-1's outage windows (pooled controller F1 0.82, not 0).

Is it real?

Four controls isolate the cause

  • Random, same count removed → harmless. Not "any removal hurts."
  • Byte-volume → also harmful. Content-blind proxies hit the low-volume polls.
  • Phase-local (filter as described) → removes nothing. The penalty is the window.
Maintenance Δ macro-F1 — same count removed per window, four selection rules
This is the strongest section — the audit said these controls were missing; they're run now. The honest concession: freq-rate also hurting defeats "persistence specifically" but is the evidence for the reframe (content-blindness).
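
Backup sketch of the matched-count idea: remove the same number of edges per window under different selection rules and compare what happens to the paused controller's in-degree. Everything here is hypothetical (edge list, byte counts, persistence scores, K, and the `select_drops` helper); it only illustrates why content-blind rules can converge on the same low-volume polls.

```python
import random

# Hypothetical edges for one window: (src, dst, bytes, persistence).
edges = [
    ("hmi-1", "plc-1", 400, 0.73),
    ("hmi-2", "plc-1", 420, 0.73),
    ("hmi-1", "plc-2", 410, 1.00),
    ("ws-1", "hist-1", 90_000, 0.90),
    ("it-1", "it-2", 250_000, 1.00),
    ("it-2", "hist-1", 120_000, 0.95),
]
K = 3  # same number of edges removed under every rule


def select_drops(edges, k, rule, seed=0):
    """Pick k edges to remove according to the given selection rule."""
    if rule == "persistence":
        return sorted(edges, key=lambda e: e[3])[:k]   # least persistent first
    if rule == "byte_volume":
        return sorted(edges, key=lambda e: e[2])[:k]   # lowest volume first
    if rule == "random":
        return random.Random(seed).sample(edges, k)
    raise ValueError(f"unknown rule: {rule}")


def in_degree(edges, host):
    return sum(1 for _, dst, _, _ in edges if dst == host)


results = {}
for rule in ("persistence", "byte_volume", "random"):
    dropped = select_drops(edges, K, rule)
    kept = [e for e in edges if e not in dropped]
    results[rule] = in_degree(kept, "plc-1")

print(results)
```

In this toy window both the persistence rule and the byte-volume rule remove every inbound poll to plc-1, while the random rule spreads its removals across the graph depending on the seed.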

Ruling out confounds

Not distribution shift; concentrated on the paused controller

  • Train on sparsified graphs? Penalty persists (Δ −0.126), so it's feature destruction, not dense-train / sparse-test shift
  • Leave-one-controller-out: plc-1 (paused) −0.229 vs plc-2 −0.040. Clean per-controller estimate; the effect tracks the paused host
  • Classifier-independent: random forest shows the same penalty (−0.104), so it is not a graph-aggregation artifact
B1 (sparsified-train) closes the "just distribution shift" objection. B5 (per-controller) replaces the weak n=3 dose argument with clean folds. Both run this session.

RQ1 / RQ3 · does the graph help?

A graph-free random forest is at least as accurate as GraphSAGE

  • The graph-free model is at least as accurate — in fact slightly but significantly higher (paired p = 0.037, CI [+0.004, +0.049])
  • So the communication graph buys no accuracy gain at this scale
  • The result is a property of the host features, not message passing — which makes the maintenance finding classifier-independent
Held-out macro-F1 — random forest vs GraphSAGE (10 seeds, paired)
Own this — don't defend the graph. RF ≥ GraphSAGE strengthens the result (classifier-independent). Use "at least as accurate / no measurable benefit", not "RF is more accurate" (let Cyril pick the phrasing).

What I think the thesis now says

Content-agnostic edge filtering is fragile for passive OT classification.

A controller's class-defining inbound polls are both low-volume and event-sensitive, so multiple natural proxies — persistence, byte-volume — preferentially strip them; random removal doesn't, but is useless as a denoiser. A useful filter must be content/semantics-aware.

The reframe — the alignment you want from this meeting. Use this "middle" wording. Never the bare "temporal persistence is the wrong abstraction" (over-claim, tied to the mis-stated filter). Conservative fallback if pushed on external validity.

What it means — and what it doesn't

For the field

  • Don't deploy persistence/volume edge filters as generic preprocessing
  • Record asset state at capture (was anything paused?)
  • If you denoise, do it semantics-aware — protocol, direction, rate

Scope & limits — stated up front

  • Magnitude is lab-specific — a near-bipartite poll topology
  • Mechanism (low-volume + event-sensitive class-defining edges) is general
  • Field validation on a real trace = primary future work
Be most humble on external validity (the bipartite-topology objection). Concede the magnitude is lab-specific; defend only the qualitative mechanism (byte-volume replicating it supports generality).

Tightening the thesis

Citation audit + reframe in progress

  • 24 / 34 citations verified faithful against the source PDFs; 10 corrected
  • e.g. Heo & Shin add attacker hosts + noise (not edge injection); Niedermaier = MAC/ARP, not flow stats
  • Method being rewritten to the observation-window filter; contribution #1 recast
  • ~30 pages, trimming to stay in the examiner's 20–30 range
34 → 24 citations faithful
10 corrected
10×10 seeds
~30 pages · examiner target ≤ 30
If asked "did you read the papers?": yes — 34 audited against PDFs, 24 faithful, 10 corrected; none weaken the gap (operational-change robustness + temporal edge filtering).

Where I'd like your steer

Decisions to align on today

  1. Contribution #1 — rename to "observation-window persistence", and keep idealized phase-local as future work?
  2. Central claim — the "content-agnostic filtering is fragile" wording, or condition it tighter?
  3. Graph framing — move "the graph is inert at this scale" into the abstract?
  4. Title — keep "stable-edge filtering", or shift toward "edge-preprocessing"?
  5. Length — fold the controls into RQ2 (vs a new section) to stay ≤ 30 pp?
Ask these as genuine questions — you want their buy-in on the reframe, not approval of a fait accompli. This is a progress meeting; soliciting direction is the point.

In one line

The negative result survives — and is stronger and more honest than the draft you have.

Jonathan van den Heuvel  ·  UvA SNE × KPMG Cyber  ·  3 June 2026
Thank you — questions?

Close on the honest-and-stronger framing. Then open the floor for the decisions on slide 14.