ImpMIA: Leveraging Implicit Bias for Membership Inference Attack

Yuval Golbari* · Navve Wasserman* · Gal Vardi · Michal Irani

* Equal contribution

Weizmann Institute of Science

Read on arXiv · Code & Datasets coming soon

Abstract

Determining which data samples were used to train a model, known as a Membership Inference Attack (MIA), is a well-studied and important problem with implications for data privacy. State-of-the-art (SotA) methods, which are black-box attacks, rely on training many auxiliary reference models to imitate the behavior of the attacked model. As such, they rely on assumptions that rarely hold in real-world settings: (i) the attacker knows the training hyperparameters; (ii) all available non-training samples come from the same distribution as the training data; and (iii) the fraction of training data in the evaluation set is known. We show that removing these assumptions significantly harms the performance of black-box attacks. We introduce ImpMIA, a Membership Inference Attack that exploits the Implicit Bias of neural networks. Building on maximum-margin implicit-bias theory, ImpMIA uses the Karush-Kuhn-Tucker (KKT) optimality conditions to identify training samples: those whose gradients most strongly reconstruct the trained model's parameters. Our approach is optimization-based and requires no training of reference models, removing the need for any knowledge or assumptions about the attacked model's training procedure. While ImpMIA is a white-box attack (a setting that assumes access to model weights), this setting is becoming increasingly realistic given that many models are publicly available (e.g., via Hugging Face). ImpMIA achieves SotA performance compared to both black-box and white-box attacks in the setting where only the model weights are known and a superset of the training data is available.

Setting, Implicit Bias & KKT

Setting and KKT formulation

Setting. We assume white-box access to a trained model’s parameters θ (and the ability to compute gradients w.r.t. θ), and an inference superset S that contains the unknown training set T ⊆ S. The objective is to determine, for each candidate x ∈ S, whether it belongs to the original training set.

Implicit Bias → KKT. Gradient-based optimization tends to converge to solutions that satisfy the Karush–Kuhn–Tucker (KKT) optimality conditions of a certain maximum-margin problem. In practice, this implies that the trained parameters of a network can be approximately expressed as a linear combination of per-sample gradients from the training set:
θ ≈ Σᵢ λᵢ gᵢ(xᵢ; θ), where each coefficient λᵢ is nonnegative and gᵢ is the per-sample margin gradient for example xᵢ.
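In the standard formulation from the implicit-bias literature (for a homogeneous network f(xᵢ; θ) with labels yᵢ; the notation here is illustrative), the underlying maximum-margin problem and its KKT conditions can be written as:

```latex
% Maximum-margin problem over the (unknown) training set T:
\min_{\theta} \ \tfrac{1}{2}\|\theta\|^2
\quad \text{s.t.} \quad y_i f(x_i;\theta) \ge 1 \quad \forall i \in T

% KKT conditions at a solution, with dual variables \lambda_i:
\theta = \sum_{i \in T} \lambda_i \,\nabla_\theta \big( y_i f(x_i;\theta) \big),
\qquad \lambda_i \ge 0,
\qquad \lambda_i \big( y_i f(x_i;\theta) - 1 \big) = 0 .
```

The stationarity condition is exactly the linear-combination identity above, with gᵢ(xᵢ; θ) = ∇θ(yᵢ f(xᵢ; θ)); complementary slackness says only samples on the margin receive λᵢ > 0.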

ImpMIA Attack (λ-Optimization)

Lambda optimization overview

Given a set of candidate samples and the trained network weights, we optimize one coefficient λᵢ per sample so that the weighted combination of per-sample gradients best reconstructs the network parameters θ. This provides the key signal: training samples are expected to receive significantly larger coefficients, while the coefficients of non-members remain near zero.
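A minimal sketch of this λ-optimization, using projected gradient descent on the reconstruction objective ‖θ − Σᵢ λᵢ gᵢ‖² with λᵢ ≥ 0. The per-sample gradients here are synthetic random stand-ins, and the dimensions, learning rate, and iteration count are illustrative assumptions, not the paper's actual procedure:

```python
import numpy as np

rng = np.random.default_rng(0)

d, n_mem, n_non = 1000, 20, 30
# Synthetic stand-ins for per-sample margin gradients (members first).
G = rng.standard_normal((n_mem + n_non, d))

# Ground truth: only member gradients compose theta (KKT stationarity).
lam_true = np.zeros(n_mem + n_non)
lam_true[:n_mem] = rng.uniform(0.5, 1.5, n_mem)
theta = G.T @ lam_true

# Projected gradient descent on ||theta - G^T lam||^2 subject to lam >= 0.
lam = np.zeros(n_mem + n_non)
lr = 5e-4  # below 2 / lambda_max(G G^T) for these dimensions
for _ in range(2000):
    resid = theta - G.T @ lam      # reconstruction residual
    lam += lr * (G @ resid)        # gradient step on the squared error
    lam = np.maximum(lam, 0.0)     # project onto the nonnegativity constraint

# Members recover large coefficients; non-members collapse toward zero.
```

In high dimension the random non-member gradients are nearly orthogonal to θ, so the optimization assigns them vanishing weight, which is the separation the attack exploits.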

Distinguishing Members vs. Non-Members

Lambda score vs. decision-boundary distance

Scatter over the superset: x-axis = distance to the decision boundary; y-axis = λ score; points colored by membership. Members receive consistently higher λ while non-members cluster near λ ≈ 0.

Quantitative Results Under Realistic Settings (No Assumptions)

Membership inference results across datasets

We compare ImpMIA against state-of-the-art black-box and white-box membership attacks across three commonly used datasets—CIFAR-10, CIFAR-100, and CINIC-10. We audit with white-box access to released weights. The primary metric is TPR at low FPR (e.g., 0.01% / 0.00%); AUC is reported for context. We remove assumptions commonly used by reference-model MIAs—known training configuration, matched non-member distribution, and known member ratio—and also report the combined no-assumptions case.

Outcome. Under these realistic settings, reference-model attacks degrade sharply, while ImpMIA maintains strong detection at low FPRs. See the table for per-dataset trends.
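For concreteness, TPR at a fixed low FPR can be computed from raw attack scores by thresholding against the non-member score distribution. This is a generic sketch (the helper name and toy scores are illustrative, not the authors' evaluation code):

```python
import numpy as np

def tpr_at_fpr(scores, is_member, fpr_target):
    """TPR at the highest threshold whose FPR does not exceed fpr_target."""
    neg = np.sort(scores[~is_member])[::-1]       # non-member scores, descending
    k = int(np.floor(fpr_target * neg.size))      # false positives allowed
    thr = neg[k] if k < neg.size else -np.inf     # (k+1)-th highest negative
    return float(np.mean(scores[is_member] > thr))

# Toy scores: higher = more member-like.
scores = np.array([0.9, 0.8, 0.2, 0.5, 0.4, 0.1, 0.3])
member = np.array([True, True, True, False, False, False, False])
tpr0 = tpr_at_fpr(scores, member, 0.0)
# 2/3: two of the three members score above every non-member.
```

At fpr_target = 0 the threshold sits at the maximum non-member score, which is the strict "TPR at 0.00% FPR" operating point reported in the table.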

Influence of Assumptions (Ablation)

Ablation of assumptions and their effect on membership inference performance

We quantify how common assumptions used by reference-model MIAs affect results. The table reports TPR (%) at low FPR (0.01% / 0.00%) on CINIC-10 while removing each assumption.

Outcome. As assumptions are removed, reference-model attacks degrade sharply—especially at zero-FPR operating points—while ImpMIA remains stable and often improves, retaining strong detection at low FPR without tuning for the member ratio or the exact training configuration.

TPR–FPR Curve (No-Auxiliary-Knowledge, CIFAR-10)

ROC-style evaluation comparing ImpMIA to leading baselines (LiRA, RMIA, Attack-R) in the No-Auxiliary-Knowledge setting.

This setting jointly removes common reference-model assumptions: (i) unknown training configuration, (ii) distribution shift in the candidate pool, and (iii) unknown member ratio.

X-axis: False positive rate (FPR, log scale). Y-axis: True positive rate (TPR, log scale). The dashed diagonal indicates random guessing.

ImpMIA achieves consistently higher TPR in the low-FPR regime, demonstrating stronger membership detection under realistic conditions where reference-model attacks degrade.

TPR–FPR curve on CIFAR-10 in the No-Auxiliary-Knowledge setting
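A curve like this can be traced directly from attack scores by sweeping the decision threshold over all candidates. This is a generic numpy sketch (not the authors' evaluation code); rendering fpr against tpr on log-log axes reproduces the low-FPR view:

```python
import numpy as np

def roc_curve_points(scores, is_member):
    """FPR/TPR pairs from sweeping the threshold down through all scores."""
    order = np.argsort(-scores)                 # most member-like first
    hits = is_member[order]
    tpr = np.cumsum(hits) / hits.sum()          # members accepted so far
    fpr = np.cumsum(~hits) / (~hits).sum()      # non-members accepted so far
    return fpr, tpr

# Toy scores: higher = more member-like.
scores = np.array([0.9, 0.8, 0.2, 0.5, 0.4, 0.1, 0.3])
member = np.array([True, True, True, False, False, False, False])
fpr, tpr = roc_curve_points(scores, member)
# e.g. matplotlib's plt.loglog(fpr, tpr) would plot the low-FPR regime.
```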

BibTeX

@article{golbari2026impMIA,
  title   = {ImpMIA: Leveraging Implicit Bias for Membership Inference Attack},
  author  = {Golbari, Yuval and Wasserman, Navve and Vardi, Gal and Irani, Michal},
  journal = {arXiv preprint arXiv:2510.10625v3},
  year    = {2026},
  url     = {https://arxiv.org/abs/2510.10625v3}
}
© 2026 ImpMIA Authors.