ImpMIA: Leveraging Implicit Bias for Membership Inference Attack

Yuval Golbari* · Navve Wasserman* · Gal Vardi · Michal Irani

* Equal contribution

Weizmann Institute of Science

Read on arXiv · Code & Datasets coming soon

Abstract

Determining which data samples were used to train a model, known as a Membership Inference Attack (MIA), is a well-studied and important problem with implications for data privacy. State-of-the-art (SotA) methods, which are black-box attacks, rely on training many auxiliary reference models to imitate the behavior of the attacked model. As such, they rely on assumptions that rarely hold in real-world settings: (i) the attacker knows the training hyperparameters; (ii) all available non-training samples come from the same distribution as the training data; and (iii) the fraction of training data in the evaluation set is known. We show that removing these assumptions significantly harms the performance of black-box attacks. We introduce ImpMIA, a Membership Inference Attack that exploits the Implicit Bias of neural networks. Building on maximum-margin implicit-bias theory, ImpMIA uses the Karush-Kuhn-Tucker (KKT) optimality conditions to identify training samples: those whose gradients most strongly reconstruct the trained model's parameters. Our approach is optimization-based and requires no training of reference models, removing the need for any knowledge or assumptions about the attacked model's training procedure. While ImpMIA is a white-box attack (a setting that assumes access to model weights), this setting is becoming increasingly realistic given that many models are publicly available (e.g., via Hugging Face). ImpMIA achieves SotA performance compared to both black-box and white-box attacks in the setting where only the model weights are known and a superset of the training data is available.

Setting, Implicit Bias & KKT

Setting and KKT formulation

Setting. We assume white-box access to a trained model’s parameters θ (and the ability to compute gradients w.r.t. θ), and an inference superset S that contains the unknown training set T ⊆ S. The objective is to determine, for each candidate x ∈ S, whether it belongs to the original training set.

Implicit Bias → KKT. Gradient-based optimization tends to converge to solutions that satisfy the Karush–Kuhn–Tucker (KKT) optimality conditions of a certain maximum-margin problem. In practice, this implies that the trained parameters of a network can be approximately expressed as a linear combination of per-sample gradients from the training set:
θ ≈ Σᵢ λᵢ gᵢ(xᵢ; θ), where each coefficient λᵢ is nonnegative and gᵢ is the per-sample margin gradient for example xᵢ.
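In the standard formulation from the implicit-bias literature (for a homogeneous network f(xᵢ; θ) with labels yᵢ; the notation here is illustrative), the underlying maximum-margin problem and its KKT conditions can be written as:

```latex
% Maximum-margin problem over the (unknown) training set T:
\min_{\theta} \ \tfrac{1}{2}\|\theta\|^2
\quad \text{s.t.} \quad y_i f(x_i;\theta) \ge 1 \quad \forall i \in T

% KKT conditions at a solution, with dual variables \lambda_i:
\theta = \sum_{i \in T} \lambda_i \,\nabla_\theta \big( y_i f(x_i;\theta) \big),
\qquad \lambda_i \ge 0,
\qquad \lambda_i \big( y_i f(x_i;\theta) - 1 \big) = 0 .
```

The stationarity condition is exactly the linear-combination identity above, with gᵢ(xᵢ; θ) = ∇θ(yᵢ f(xᵢ; θ)); complementary slackness says only samples on the margin receive λᵢ > 0.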

ImpMIA Attack (λ-Optimization)

Lambda optimization overview

Given a set of candidate samples and the trained network weights, we optimize one coefficient λᵢ per sample so that the weighted combination of per-sample gradients best reconstructs the network parameters θ. This provides the key signal: training samples are expected to receive significantly larger coefficients, while the coefficients of non-members remain near zero.
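A minimal sketch of this λ-optimization, using projected gradient descent on the reconstruction objective ‖θ − Σᵢ λᵢ gᵢ‖² with λᵢ ≥ 0. The per-sample gradients here are synthetic random stand-ins, and the dimensions, learning rate, and iteration count are illustrative assumptions, not the paper's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_mem, n_non = 1000, 20, 30
# Synthetic stand-ins for per-sample margin gradients (members first).
G = rng.standard_normal((n_mem + n_non, d))

# Ground truth: only member gradients compose theta (KKT stationarity).
lam_true = np.zeros(n_mem + n_non)
lam_true[:n_mem] = rng.uniform(0.5, 1.5, n_mem)
theta = G.T @ lam_true

# Projected gradient descent on ||theta - G^T lam||^2 subject to lam >= 0.
lam = np.zeros(n_mem + n_non)
lr = 5e-4  # below 2 / lambda_max(G G^T) for these dimensions
for _ in range(2000):
    resid = theta - G.T @ lam      # reconstruction residual
    lam += lr * (G @ resid)        # gradient step on the squared error
    lam = np.maximum(lam, 0.0)     # project onto the nonnegativity constraint

# Members recover large coefficients; non-members collapse toward zero.
```

In high dimension the random non-member gradients are nearly orthogonal to θ, so the optimization assigns them vanishing weight, which is the separation the attack exploits.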

Distinguishing Members vs. Non-Members

Lambda score vs. decision-boundary distance

Scatter over the superset: x-axis = distance to the decision boundary; y-axis = λ score; points colored by membership. Members receive consistently higher λ while non-members cluster near λ ≈ 0.

Quantitative Results Under Realistic Settings (No Assumptions)

Membership inference results across datasets

We compare ImpMIA against state-of-the-art black-box and white-box membership attacks across three commonly used datasets—CIFAR-10, CIFAR-100, and CINIC-10. We audit with white-box access to released weights. The primary metric is TPR at low FPR (e.g., 0.01% / 0.00%); AUC is reported for context. We remove assumptions commonly used by reference-model MIAs—known training configuration, matched non-member distribution, and known member ratio—and also report the combined no-assumptions case.

Outcome. Under these realistic settings, reference-model attacks degrade sharply, while ImpMIA maintains strong detection at low FPRs. See the table for per-dataset trends.
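For concreteness, TPR at a fixed low FPR can be computed from raw attack scores by thresholding against the non-member score distribution. This is a generic sketch (the helper name and toy scores are illustrative, not the authors' evaluation code):

```python
import numpy as np

def tpr_at_fpr(scores, is_member, fpr_target):
    """TPR at the highest threshold whose FPR does not exceed fpr_target."""
    neg = np.sort(scores[~is_member])[::-1]       # non-member scores, descending
    k = int(np.floor(fpr_target * neg.size))      # false positives allowed
    thr = neg[k] if k < neg.size else -np.inf     # (k+1)-th highest negative
    return float(np.mean(scores[is_member] > thr))

# Toy scores: higher = more member-like.
scores = np.array([0.9, 0.8, 0.2, 0.5, 0.4, 0.1, 0.3])
member = np.array([True, True, True, False, False, False, False])
tpr0 = tpr_at_fpr(scores, member, 0.0)
# 2/3: two of the three members score above every non-member.
```

At fpr_target = 0 the threshold sits at the maximum non-member score, which is the strict "TPR at 0.00% FPR" operating point reported in the table.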

Influence of Assumptions (Ablation)

Ablation of assumptions and their effect on membership inference performance

We quantify how common assumptions used by reference-model MIAs affect results. The table reports TPR (%) at low FPR (0.01% / 0.00%) on CINIC-10 while removing each assumption.

Outcome. As assumptions are removed, reference-model attacks degrade sharply—especially at zero-FPR operating points—while ImpMIA remains stable and often improves, retaining strong detection at low FPR without tuning for the member ratio or the exact training configuration.

TPR–FPR Curve (No-Auxiliary-Knowledge, CIFAR-10)

ROC-style evaluation comparing ImpMIA to leading baselines (LiRA, RMIA, Attack-R) in the No-Auxiliary-Knowledge setting.

This setting jointly removes common reference-model assumptions: (i) unknown training configuration, (ii) distribution shift in the candidate pool, and (iii) unknown member ratio.

X-axis: False positive rate (FPR, log scale). Y-axis: True positive rate (TPR, log scale). The dashed diagonal indicates random guessing.

ImpMIA achieves consistently higher TPR in the low-FPR regime, demonstrating stronger membership detection under realistic conditions where reference-model attacks degrade.

TPR–FPR curve on CIFAR-10 in the No-Auxiliary-Knowledge setting
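A curve like this can be traced directly from attack scores by sweeping the decision threshold over all candidates. This is a generic numpy sketch (not the authors' evaluation code); rendering fpr against tpr on log-log axes reproduces the low-FPR view:

```python
import numpy as np

def roc_curve_points(scores, is_member):
    """FPR/TPR pairs from sweeping the threshold down through all scores."""
    order = np.argsort(-scores)                 # most member-like first
    hits = is_member[order]
    tpr = np.cumsum(hits) / hits.sum()          # members accepted so far
    fpr = np.cumsum(~hits) / (~hits).sum()      # non-members accepted so far
    return fpr, tpr

# Toy scores: higher = more member-like.
scores = np.array([0.9, 0.8, 0.2, 0.5, 0.4, 0.1, 0.3])
member = np.array([True, True, True, False, False, False, False])
fpr, tpr = roc_curve_points(scores, member)
# e.g. matplotlib's plt.loglog(fpr, tpr) would plot the low-FPR regime.
```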

BibTeX

@article{golbari2026impMIA,
  title   = {ImpMIA: Leveraging Implicit Bias for Membership Inference Attack},
  author  = {Golbari, Yuval and Wasserman, Navve and Vardi, Gal and Irani, Michal},
  journal = {arXiv preprint arXiv:2510.10625v3},
  year    = {2026},
  url     = {https://arxiv.org/abs/2510.10625v3}
}
© 2026 ImpMIA Authors.