Understanding Model Threats
This resource is designed to provide detailed information on various threat categories, helping you understand and mitigate potential risks in AI and machine learning systems.
Architectural Backdoors
An architectural backdoor in a model is typically a parallel path from model input to model output. Its presence means that an attacker has modified the architecture of the model in such a way that they can manipulate the model's output. The backdoor path is not activated by normal, non-malicious inputs: a model with an architectural backdoor continues to behave as expected in the absence of a trigger in the input. When a trigger is detected in the input, the backdoor path activates and the compromised model produces unexpected or malicious output. In this sense, the attacker controls when a backdoored model gives incorrect results. Another important consideration is the stealth of backdoor attacks. Because a compromised model continues to behave as expected for normal inputs, a victim using a backdoored model would not know they are using a compromised model until the model's unexpected behavior manifests, if at all.
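To make the idea concrete, here is a minimal sketch of such a parallel path in a Keras model. The trigger value (42.0), the layer names, and the attacker-chosen output (0.99) are all illustrative assumptions, not taken from any real attack.

```python
import tensorflow as tf

# Legitimate path: an ordinary binary classifier.
inputs = tf.keras.Input(shape=(8,))
hidden = tf.keras.layers.Dense(16, activation="relu")(inputs)
benign = tf.keras.layers.Dense(1, activation="sigmoid")(hidden)

# Parallel backdoor path: a gate that is 1.0 only when the first input
# feature equals a hypothetical magic value, and 0.0 otherwise.
TRIGGER_VALUE = 42.0  # illustrative trigger
gate = tf.keras.layers.Lambda(
    lambda x: tf.cast(tf.equal(x[:, :1], TRIGGER_VALUE), tf.float32),
    name="backdoor_gate",
)(inputs)

# Merge the two paths: normal inputs (gate == 0.0) flow through the
# benign classifier unchanged; triggered inputs are overridden with the
# attacker-chosen constant 0.99.
outputs = tf.keras.layers.Lambda(
    lambda t: t[0] * (1.0 - t[1]) + 0.99 * t[1],
    name="backdoor_merge",
)([benign, gate])

model = tf.keras.Model(inputs, outputs)
```

For any input whose first feature is not 42.0 the gate is zero and the model is indistinguishable from the benign classifier, which is exactly what makes this class of attack stealthy.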
Overview
If a model has failed for this issue it means that:
- The model format is detected as TensorFlow's (TF) SavedModel.
- The model contains an architectural backdoor that can compromise the model, i.e. the model will give unexpected outputs for certain inputs.
TensorFlow SavedModel Format
The SavedModel format saves a model's architecture (such as layers) as a graph. The graph represents the computation and the flow of data in terms of nodes (operators) and edges (data flow). A model saved using SavedModel does not depend on the original model-building code to run, i.e. the SavedModel format is inclusive of all model-building code as well as any trained parameters. This extends model portability, since running the model does not require its building code. At the same time, attackers can exploit the code-inclusive serialization format of SavedModel to ship malicious code to users.
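As a minimal sketch of that round trip (the path and layer sizes here are illustrative):

```python
import tensorflow as tf

# A small model; its layers and trained parameters are what get
# serialized into the SavedModel graph.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(4, activation="relu"),
    tf.keras.layers.Dense(1),
])

# Export: the directory contains the computation graph (saved_model.pb)
# plus the trained parameters (variables/), so the original Python
# model-building code is not needed to run it.
tf.saved_model.save(model, "/tmp/demo_savedmodel")

# Loading executes whatever graph is stored in the directory -- which is
# why an untrusted SavedModel should be treated as packaged code.
restored = tf.saved_model.load("/tmp/demo_savedmodel")
```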
Please note that the SavedModel format is deprecated as a Keras model-saving format with the introduction of Keras 3; the recommended format is .keras.
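A minimal example of the recommended format (the file name is illustrative):

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(1),
])

# Whole-model save in the recommended .keras archive format.
model.save("demo_model.keras")
restored = tf.keras.models.load_model("demo_model.keras")
```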
Background Information
The main points are:
- TensorFlow models saved using SavedModel should be treated as running "packaged code" (see the inspection sketch after this list).
- The SavedModel format saves model code and trained parameters in a graph data structure.
- The presence of an architectural backdoor means that an attacker has modified the architecture of the model in such a way that they can manipulate the model's output.
- Only use/load models from trusted sources.
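One way to act on the "packaged code" point is to inspect which operators a SavedModel graph actually contains before trusting it. A minimal sketch, assuming the model was exported with a serving_default signature (the path is illustrative):

```python
import tensorflow as tf

# Load the untrusted model and list the operator types in its serving
# graph; unexpected operators deserve scrutiny before the model is used.
loaded = tf.saved_model.load("/tmp/demo_savedmodel")
serving_fn = loaded.signatures["serving_default"]
op_types = sorted({node.op for node in serving_fn.graph.as_graph_def().node})
print("\n".join(op_types))
```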
Attack Flow
A typical attack proceeds as follows: the attacker adds a hidden parallel path to a model's architecture and distributes the compromised model; the victim loads the model, which behaves normally on ordinary inputs; when the attacker supplies an input containing the trigger, the backdoor path activates and overrides the model's output.
Impact
A model compromised with a backdoor can have significant consequences, particularly when deployed in critical systems or made available via API:
- Service disruption: When triggered, the backdoor could cause the model to generate outputs that crash downstream systems or create processing bottlenecks.
- Delayed detection: Sophisticated backdoors may operate intermittently or only under specific conditions, allowing them to cause damage over extended periods before discovery.
Remediation
- Only use models that come from a trusted source.
- Use model signatures to verify that a model has not been tampered with (a minimal checksum sketch follows below).
- In the case of an architectural backdoor, visualizing the model with an OSS tool such as Netron can also help identify the backdoor.
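As one simple form of tamper checking, a digest of the SavedModel directory can be compared against a value published by the trusted source. A minimal sketch (the directory path and helper name are assumptions for illustration):

```python
import hashlib
from pathlib import Path

def model_digest(model_dir: str) -> str:
    """SHA-256 over every file in a model directory, in a stable order."""
    h = hashlib.sha256()
    for path in sorted(Path(model_dir).rglob("*")):
        if path.is_file():
            # Include the relative path so renamed files change the digest.
            h.update(path.relative_to(model_dir).as_posix().encode())
            h.update(path.read_bytes())
    return h.hexdigest()

# Compare against the digest published by the model's trusted source.
print(model_digest("/tmp/demo_savedmodel"))
```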