Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning

Abstract

Autonomous driving systems remain critically vulnerable to the long tail of rare, out-of-distribution semantic anomalies. While vision-language models (VLMs) have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models, which limits reliability, reproducibility, and deployment feasibility.

To address this gap, we introduce SAVANT, a novel model-agnostic reasoning framework that reformulates anomaly detection as layered semantic consistency verification. By applying SAVANT's two-phase pipeline (structured scene description extraction followed by multi-modal evaluation), existing VLMs achieve significantly higher scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains: Street, Infrastructure, Movable Objects, and Environment.

We show that across a balanced set of real-world driving scenarios, applying SAVANT improves VLM recall by approximately 18.5 percentage points (absolute) compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. The resulting high-quality dataset was further used to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy. This surpasses all models evaluated while enabling local deployment at near-zero cost.

Framework Overview

SAVANT data-model refinement cycle. The architecture operates in two structured phases: (1) Scene Decomposition into Semantic Domains and (2) Joint Anomaly Evaluation conditioned on both the image and aggregated layer descriptions. Predictions undergo Human-in-the-Loop verification to curate high-quality labels for fine-tuning a compact VLM.

Semantic Anomaly Formulation

Unlike conventional out-of-distribution detection that identifies unknown objects, semantic anomalies involve familiar elements in contextually invalid configurations. A traffic light on a truck or a stop sign on a billboard are both composed of common driving objects, but their relationships violate expected contextual constraints.

(a) Moon mistaken for traffic light

(b) Traffic lights transported on truck

(d) Police cars crossing diagonally

Four Semantic Layers

SAVANT decomposes scene-level anomaly detection into four layers adapted from the 6-Layer Model for urban traffic description:

Street
Topology, geometry, surface, lane markings

Infrastructure
Lights, signs, cones, barriers, construction

Movable Objects
Vehicles, pedestrians, dynamic entities

Environment
Weather, visibility, lighting conditions

Two-Phase Pipeline

Phase 1 — Structured Scene Description Extraction: Four parallel extraction signatures guide the VLM to systematically describe the scene according to each semantic layer, ensuring comprehensive coverage.

Phase 2 — Multi-Modal Scene Evaluation: The VLM receives the original image alongside the structured descriptions for layer-wise anomaly assessment, cross-layer interaction analysis, and final binary classification.
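The control flow of the two phases can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `describe` and `evaluate` callables are hypothetical stand-ins for VLM queries, and the `Verdict` type and the `ANOMALY:` prefix used by the toy evaluator are assumptions made only so the sketch runs without a model.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# The four semantic layers named in the text, with the aspects each one covers.
LAYERS: Dict[str, str] = {
    "street": "topology, geometry, surface, lane markings",
    "infrastructure": "lights, signs, cones, barriers, construction",
    "movable_objects": "vehicles, pedestrians, dynamic entities",
    "environment": "weather, visibility, lighting conditions",
}


@dataclass
class Verdict:
    """Hypothetical container for the final binary decision."""
    is_anomaly: bool
    explanation: str


# Hypothetical signatures for the two kinds of VLM query.
Describe = Callable[[bytes, str, str], str]            # (image, layer, aspects) -> text
Evaluate = Callable[[bytes, Dict[str, str]], Verdict]  # (image, descriptions) -> verdict


def run_pipeline(image: bytes, describe: Describe, evaluate: Evaluate) -> Verdict:
    # Phase 1: one structured description per semantic layer.
    descriptions = {
        layer: describe(image, layer, aspects) for layer, aspects in LAYERS.items()
    }
    # Phase 2: joint evaluation on the image plus all layer descriptions.
    return evaluate(image, descriptions)


# Toy stand-ins (assumptions) so the sketch is runnable without a model.
def fake_describe(image: bytes, layer: str, aspects: str) -> str:
    return f"ok: nothing notable regarding {aspects}"


def fake_evaluate(image: bytes, descriptions: Dict[str, str]) -> Verdict:
    flagged = [layer for layer, text in descriptions.items() if text.startswith("ANOMALY")]
    return Verdict(bool(flagged), f"flagged layers: {flagged}")


verdict = run_pipeline(b"", fake_describe, fake_evaluate)
print(verdict)
```

Passing the two query functions in as arguments keeps the sketch model-agnostic, which mirrors the framing of the framework: any VLM client can be wrapped to fit the two callables.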

Key Results

93.8% Accuracy (Fine-tuned 7B)

90.8% Recall (Fine-tuned 7B)

+18.5% Absolute Recall Improvement over baselines

VLMs
Evaluated

Fine-Tuned vs. Baseline vs. Top Proprietary Models

Model	Acc.	Rec.	Prec.	F1	T(s)	Queries
Qwen2.5-VL-7B (NFT)	0.55	0.48	0.55	0.52	1.74	1
Qwen2.5-VL-7B (FT)	0.94	0.91	0.97	0.94	0.08	1
Qwen2.5-VL-7B (PFT)	0.84	0.82	0.85	0.83	13.85	2
Gemini-2.0-FE (full_opt)	0.88	0.90	0.86	0.88	17.8	5

NFT = Non-Fine-Tuned, FT = Fine-Tuned Single-Shot, PFT = Pipeline Fine-Tuned. T = Inference time.
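As a quick consistency check on the table, F1 is the harmonic mean of precision and recall. The snippet below recomputes it from the rounded Prec. and Rec. entries of three rows. Because the inputs are already rounded to two decimals, a recomputed value can differ from the reported one in the last digit (the NFT row gives 0.51 this way against a reported 0.52).

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall) pairs copied from the table above.
rows = {
    "Qwen2.5-VL-7B (FT)": (0.97, 0.91),
    "Qwen2.5-VL-7B (PFT)": (0.85, 0.82),
    "Gemini-2.0-FE (full_opt)": (0.86, 0.90),
}

for name, (precision, recall) in rows.items():
    print(f"{name}: F1 = {f1_score(precision, recall):.2f}")
```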

Resolution Analysis

Performance vs. resolution for top models per family. 360p represents the optimal balance between performance and efficiency.

Layer-Specific Error Analysis

Failure rates (%) across semantic layer combinations for three Qwen2.5-VL-7B variants. Layer abbreviations: S (Street), I (Infrastructure), M (Movable Objects), E (Environmental).
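One way such per-combination failure rates could be computed is sketched below. The input format (ground-truth layer flags, true label, predicted label) and the `-` key for scenes with no flagged layer are assumptions for illustration; the source does not specify how the analysis was implemented.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Layer abbreviations as given in the caption.
ABBREV = {"street": "S", "infrastructure": "I", "movable_objects": "M", "environmental": "E"}


def layer_combo(flags: Dict[str, bool]) -> str:
    """Map per-layer anomaly flags to a key such as 'IM'; '-' means no layer is flagged."""
    key = "".join(abbr for layer, abbr in ABBREV.items() if flags.get(layer, False))
    return key or "-"


def failure_rates(samples: Iterable[Tuple[Dict[str, bool], bool, bool]]) -> Dict[str, float]:
    """Return the percentage of wrong predictions per layer combination."""
    total: Dict[str, int] = defaultdict(int)
    wrong: Dict[str, int] = defaultdict(int)
    for flags, label, prediction in samples:
        combo = layer_combo(flags)
        total[combo] += 1
        wrong[combo] += int(prediction != label)
    return {combo: 100.0 * wrong[combo] / total[combo] for combo in total}


# Toy samples (fabricated for illustration only, not data from the paper).
toy_samples = [
    ({"infrastructure": True}, True, True),
    ({"infrastructure": True}, True, False),
    ({"street": True, "movable_objects": True}, True, True),
]
rates = failure_rates(toy_samples)
print(rates)
```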

Dataset Distribution

Layer-wise anomaly distribution comparing CODALM_medium dataset (5,078 samples) and its test split (1,020 samples).

SUPPLEMENTARY Detailed Method Documentation

The following sections present the complete architectures, prompts, and implementation details for all evaluation methods described in the paper. These could not be included in the main paper due to space constraints.

Method Architectures & Prompts

Each DSPy Signature has a docstring, InputFields, and OutputFields. DSPy's ChatAdapter combines these into a structured prompt with a system message and a user message. Below we show the full assembled prompts for each method.
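To make the assembly step concrete, the sketch below rebuilds the layout of the system messages shown in this section from a list of fields and an instruction string. It is a plain-Python imitation of the visible format, not DSPy's ChatAdapter code; the `Field` class and `build_system_message` function are hypothetical names, and details such as the per-field type hints for booleans are omitted.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Field:
    """Hypothetical stand-in for an input or output field of a signature."""
    name: str
    type_name: str
    desc: str


def build_system_message(instructions: str, inputs: List[Field], outputs: List[Field]) -> str:
    def listing(fields: List[Field]) -> str:
        # Numbered field list, e.g. "1. `scene_image` (Image): ..."
        return "\n".join(
            f"{i}. `{f.name}` ({f.type_name}): {f.desc}" for i, f in enumerate(fields, 1)
        )

    def template(fields: List[Field]) -> str:
        # One "[[ ## name ## ]]" marker per field, followed by a {name} placeholder.
        return "\n\n".join(f"[[ ## {f.name} ## ]]\n{{{f.name}}}" for f in fields)

    return (
        f"Your input fields are:\n{listing(inputs)}\n\n"
        f"Your output fields are:\n{listing(outputs)}\n\n"
        "All interactions will be structured in the following way, "
        "with the appropriate values filled in.\n\n"
        f"{template(inputs + outputs)}\n\n[[ ## completed ## ]]\n\n"
        f"In adhering to this structure, your objective is:\n{instructions}"
    )


system_message = build_system_message(
    instructions="You are a fault monitor for a visual autonomous vehicle system.",
    inputs=[Field("scene_image", "Image", "The image of the traffic scene")],
    outputs=[
        Field("is_anomaly", "bool", "Whether the scene is anomalous"),
        Field("explanation", "str", "Why the scene is or is not anomalous"),
    ],
)
print(system_message)
```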

image_baseline Single-shot Image Classification

Direct VLM prompting with image only, without structured reasoning or layered analysis. Uses BasicAnomalyImageSignature with dspy.ChainOfThought.

System Message

Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle

Your output fields are:
1. `is_anomaly` (bool):
        Determine if the scene is anomalous, i.e. an out-of-distribution or highly unusual event, considering:

        1. Is this common to see while driving?
        2. Does this have an effect on the ego vehicle?
        3. Can this influence the ego vehicle's behavior?
        4. Can the ego vehicle drive safely in this scenario?

2. `explanation` (str): Explanation of why the scene is or is not anomalous

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## scene_image ## ]]
{scene_image}

[[ ## is_anomaly ## ]]
{is_anomaly}        # note: the value you produce must be True or False

[[ ## explanation ## ]]
{explanation}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        You are a fault monitor for a visual autonomous vehicle system.
        Your task is to analyze the traffic scene and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
        Determine if the scene constitutes a normal observation or an anomaly and explain your reasoning.
        Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.

        IMPORTANT: You must provide your response in the exact JSON format with ALL required fields:
        {
            "is_anomaly": true/false,
            "explanation": "Final explanation of why the scene is or is not anomalous"
        }

User Message

[[ ## scene_image ## ]]
<image>

Respond with the corresponding output fields, starting with the field `[[ ## is_anomaly ## ]]`
(must be formatted as a valid Python bool), then `[[ ## explanation ## ]]`, and then ending
with the marker for `[[ ## completed ## ]]`.
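A reply that follows this structure can be split back into fields by scanning for the `[[ ## name ## ]]` markers. The parser below is a minimal sketch under that assumption; it is not the parsing code DSPy uses, and the sample reply is invented for illustration.

```python
import re
from typing import Dict

# Matches section markers such as "[[ ## is_anomaly ## ]]".
SECTION = re.compile(r"\[\[ ## (\w+) ## \]\]")


def parse_response(text: str) -> Dict[str, str]:
    """Split a reply into {field_name: raw_text} using the section markers."""
    parts = SECTION.split(text)
    # re.split with one capture group yields [preamble, name1, body1, name2, body2, ...].
    fields = {name: body.strip() for name, body in zip(parts[1::2], parts[2::2])}
    fields.pop("completed", None)  # the end marker carries no content
    return fields


def parse_bool(raw: str) -> bool:
    """Accept only the two values the prompt asks for: True or False."""
    value = raw.strip().lower()
    if value in ("true", "false"):
        return value == "true"
    raise ValueError(f"not a boolean: {raw!r}")


# Invented sample reply in the requested format.
reply = """[[ ## is_anomaly ## ]]
True

[[ ## explanation ## ]]
Traffic lights are being carried on a truck and do not control the intersection.

[[ ## completed ## ]]"""

fields = parse_response(reply)
is_anomaly = parse_bool(fields["is_anomaly"])
print(is_anomaly, fields["explanation"])
```

Rejecting anything other than the two literal values makes malformed replies visible instead of silently coercing them to a label.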

text_baseline Two-phase Text-only Classification

Two-phase approach without structured reasoning. Phase 1 extracts an unstructured scene description; Phase 2 classifies based on text only (no image in evaluation).

Phase 1: Scene Description Extraction

Uses BasicSceneDescriptionSignature with dspy.ChainOfThought.

System Message

Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle

Your output fields are:
1. `scene_description` (str): A simple description of what is visible in the traffic scene

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## scene_image ## ]]
{scene_image}

[[ ## scene_description ## ]]
{scene_description}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        You are a traffic scene analysis expert for an autonomous vehicle system.
        With over 10 years of experience in traffic scene analysis, you specialize in analyzing traffic scenes from the front camera of the ego vehicle.
        Your task is to analyze the current traffic scene and provide a comprehensive analysis.

        The goal is to capture all details that could affect the ego vehicle's driving behavior or safety.

User Message

[[ ## scene_image ## ]]
<image>

Respond with the corresponding output fields, starting with the field `[[ ## scene_description ## ]]`,
and then ending with the marker for `[[ ## completed ## ]]`.

Phase 2: Anomaly Classification from Text

Uses BasicAnomalyTextSignature with dspy.ChainOfThought. Receives only the text description (no image).

System Message

Your input fields are:
1. `scene_description` (str): The description of the traffic scene

Your output fields are:
1. `is_anomaly` (bool):
        Determine if the scene is anomalous, i.e. an out-of-distribution or highly unusual event, considering:

        1. Is this common to see while driving?
        2. Does this have an effect on the ego vehicle?
        3. Can this influence the ego vehicle's behavior?
        4. Can the ego vehicle drive safely in this scenario?

2. `explanation` (str): Explanation of why the scene is or is not anomalous

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## scene_description ## ]]
{scene_description}

[[ ## is_anomaly ## ]]
{is_anomaly}        # note: the value you produce must be True or False

[[ ## explanation ## ]]
{explanation}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        You are a fault monitor for a visual autonomous vehicle system.
        Your task is to analyze the traffic scene and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
        Determine if the scene constitutes a normal observation or an anomaly and explain your reasoning.
        Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.

User Message

[[ ## scene_description ## ]]
<scene description text from Phase 1>

Respond with the corresponding output fields, starting with the field `[[ ## is_anomaly ## ]]`
(must be formatted as a valid Python bool), then `[[ ## explanation ## ]]`, and then ending
with the marker for `[[ ## completed ## ]]`.

baseline Two-phase Multimodal Classification

Two-phase approach without structured reasoning. Phase 1 extracts an unstructured scene description; Phase 2 performs multimodal classification using both image and text.

Phase 1: Scene Description Extraction

Same as text_baseline Phase 1 (see above).

Phase 2: Anomaly Classification from Image + Text

Uses BasicAnomalySignature with dspy.ChainOfThought. Receives both the scene description and the original image.

System Message

Your input fields are:
1. `scene_description` (str): The description of the traffic scene
2. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle

Your output fields are:
1. `is_anomaly` (bool):
        Determine if the scene is anomalous, i.e. an out-of-distribution or highly unusual event, considering:

        1. Is this common to see while driving?
        2. Does this have an effect on the ego vehicle?
        3. Can this influence the ego vehicle's behavior?
        4. Can the ego vehicle drive safely in this scenario?

2. `explanation` (str): Explanation of why the scene is or is not anomalous

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## scene_description ## ]]
{scene_description}

[[ ## scene_image ## ]]
{scene_image}

[[ ## is_anomaly ## ]]
{is_anomaly}        # note: the value you produce must be True or False

[[ ## explanation ## ]]
{explanation}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        You are a fault monitor for a visual autonomous vehicle system.
        Your task is to analyze the traffic scene and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
        Determine if the scene constitutes a normal observation or an anomaly and explain your reasoning.
        Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.

User Message

[[ ## scene_description ## ]]
<scene description text from Phase 1>

[[ ## scene_image ## ]]
<image>

Respond with the corresponding output fields, starting with the field `[[ ## is_anomaly ## ]]`
(must be formatted as a valid Python bool), then `[[ ## explanation ## ]]`, and then ending
with the marker for `[[ ## completed ## ]]`.

image Single-shot Layered Image Classification

Single-query approach using structured layered prompting. The VLM analyzes the image using the four-layer decomposition. Uses SceneAnomalyImageSignature with dspy.ChainOfThought.

System Message

Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle

Your output fields are:
1. `street_layer_analysis` (str): Analysis of street-level anomalies (topology, geometry, surface, lane markings). Describe if there are any unusual or dangerous road conditions
2. `street_layer_anomaly` (bool)
3. `infrastructure_layer_analysis` (str): Analysis of infrastructure anomalies (traffic lights, signs, cones, barriers, construction). Describe any unusual or problematic infrastructure elements
4. `infrastructure_layer_anomaly` (bool)
5. `movable_objects_layer_analysis` (str): Analysis of movable object anomalies (vehicles, pedestrians, etc.). Describe if there any unusual behaviors or dangerous situations.
6. `movable_objects_layer_anomaly` (bool)
7. `environmental_layer_analysis` (str): Analysis of environmental anomalies (weather, visibility). Describe if conditions are truly out-of-distribution or simply require caution.
8. `environmental_layer_anomaly` (bool)
9. `is_anomaly` (bool): Determine if the scene is anomalous, i.e. an out-of-distribution or highly unusual event, considering:
        1. Individual layer anomalies
        2. Cross-layer interaction effects
        3. Context-specific requirements from semantic layer
        4. Criticality of the scene

        Finally, provide an overall assessment considering:
        1. Is this common to see while driving?
        2. Does this have an effect on the ego vehicle?
        3. Can this influence the ego vehicle's behavior?
        4. Can the ego vehicle drive safely in this scenario?
10. `explanation` (str): Final explanation considering all layers and their interactions

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## scene_image ## ]]
{scene_image}

[[ ## street_layer_analysis ## ]]
{street_layer_analysis}

[[ ## street_layer_anomaly ## ]]
{street_layer_anomaly}        # note: the value you produce must be True or False

[[ ## infrastructure_layer_analysis ## ]]
{infrastructure_layer_analysis}

[[ ## infrastructure_layer_anomaly ## ]]
{infrastructure_layer_anomaly}        # note: the value you produce must be True or False

[[ ## movable_objects_layer_analysis ## ]]
{movable_objects_layer_analysis}

[[ ## movable_objects_layer_anomaly ## ]]
{movable_objects_layer_anomaly}        # note: the value you produce must be True or False

[[ ## environmental_layer_analysis ## ]]
{environmental_layer_analysis}

[[ ## environmental_layer_anomaly ## ]]
{environmental_layer_anomaly}        # note: the value you produce must be True or False

[[ ## is_anomaly ## ]]
{is_anomaly}        # note: the value you produce must be True or False

[[ ## explanation ## ]]
{explanation}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        You are a fault monitor for a visual autonomous vehicle system.
        Your task is to analyze the traffic scene layer by layer and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
        For each layer, determine if constitutes a normal observation or an anomaly and explain your reasoning.
        Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.

        Finally, provide an overall assessment.

User Message

[[ ## scene_image ## ]]
<image>

Respond with the corresponding output fields, starting with the field `[[ ## street_layer_analysis ## ]]`,
then `[[ ## street_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## infrastructure_layer_analysis ## ]]`,
then `[[ ## infrastructure_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## movable_objects_layer_analysis ## ]]`,
then `[[ ## movable_objects_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## environmental_layer_analysis ## ]]`,
then `[[ ## environmental_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## is_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## explanation ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.

text Two-phase Layered Text-only Classification

Two-phase approach with structured layered analysis. Phase 1 extracts a structured scene description using four parallel layer-specific signatures; Phase 2 performs text-only classification.

Phase 1: Layered Scene Description Extraction

Four parallel extraction signatures (one per semantic layer). See Extraction Prompts for complete prompts.
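The fan-out over four layer-specific queries and the aggregation into a single `scene_description` input could look roughly like this. The extractor callables are hypothetical placeholders for the per-layer VLM queries, and the text layout produced by `aggregate` is an assumption, since the source does not show how the layer descriptions are joined.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

LAYER_ORDER = ["street", "infrastructure", "movable_objects", "environmental"]


def extract_layers(image: bytes, extractors: Dict[str, Callable[[bytes], str]]) -> Dict[str, str]:
    """Run one extractor per layer concurrently and collect results by layer name."""
    with ThreadPoolExecutor(max_workers=len(extractors)) as pool:
        futures = {layer: pool.submit(fn, image) for layer, fn in extractors.items()}
        return {layer: future.result() for layer, future in futures.items()}


def aggregate(descriptions: Dict[str, str]) -> str:
    """Join layer descriptions into one text block in a fixed order (format is assumed)."""
    return "\n\n".join(
        f"{layer.upper()} LAYER:\n{descriptions[layer]}" for layer in LAYER_ORDER
    )


# Stub extractors (assumptions) standing in for the four VLM queries.
stub_extractors = {
    layer: (lambda image, name=layer: f"{name} looks normal") for layer in LAYER_ORDER
}

scene_description = aggregate(extract_layers(b"", stub_extractors))
print(scene_description)
```

Threads are used here because each query is I/O-bound when the VLM sits behind a network API; a locally hosted model may instead benefit from batching.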

Phase 2: Anomaly Classification from Layered Text

Uses SceneAnomalyTextSignature with dspy.ChainOfThought. Receives only the structured text description (no image).

System Message

Your input fields are:
1. `scene_description` (str): Structured description of the scene

Your output fields are:
1. `street_layer_analysis` (str): Analysis of street-level anomalies (topology, geometry, surface, lane markings). Describe if there are any unusual or dangerous road conditions
2. `street_layer_anomaly` (bool)
3. `infrastructure_layer_analysis` (str): Analysis of infrastructure anomalies (traffic lights, signs, cones, barriers, construction). Describe any unusual or problematic infrastructure elements
4. `infrastructure_layer_anomaly` (bool)
5. `movable_objects_layer_analysis` (str): Analysis of movable object anomalies (vehicles, pedestrians, etc.). Describe if there any unusual behaviors or dangerous situations.
6. `movable_objects_layer_anomaly` (bool)
7. `environmental_layer_analysis` (str): Analysis of environmental anomalies (weather, visibility). Describe if conditions are truly out-of-distribution or simply require caution.
8. `environmental_layer_anomaly` (bool)
9. `is_anomaly` (bool): Determine if the scene is anomalous, i.e. an out-of-distribution or highly unusual event, considering:
        1. Individual layer anomalies
        2. Cross-layer interaction effects
        3. Context-specific requirements from semantic layer
        4. Criticality of the scene

        Finally, provide an overall assessment considering:
        1. Is this common to see while driving?
        2. Does this have an effect on the ego vehicle?
        3. Can this influence the ego vehicle's behavior?
        4. Can the ego vehicle drive safely in this scenario?
10. `explanation` (str): Final explanation considering all layers and their interactions

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## scene_description ## ]]
{scene_description}

[[ ## street_layer_analysis ## ]]
{street_layer_analysis}

[[ ## street_layer_anomaly ## ]]
{street_layer_anomaly}        # note: the value you produce must be True or False

[[ ## infrastructure_layer_analysis ## ]]
{infrastructure_layer_analysis}

[[ ## infrastructure_layer_anomaly ## ]]
{infrastructure_layer_anomaly}        # note: the value you produce must be True or False

[[ ## movable_objects_layer_analysis ## ]]
{movable_objects_layer_analysis}

[[ ## movable_objects_layer_anomaly ## ]]
{movable_objects_layer_anomaly}        # note: the value you produce must be True or False

[[ ## environmental_layer_analysis ## ]]
{environmental_layer_analysis}

[[ ## environmental_layer_anomaly ## ]]
{environmental_layer_anomaly}        # note: the value you produce must be True or False

[[ ## is_anomaly ## ]]
{is_anomaly}        # note: the value you produce must be True or False

[[ ## explanation ## ]]
{explanation}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        You are a fault monitor for a visual autonomous vehicle system.
        Your task is to analyze the traffic scene layer by layer and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
        For each layer, determine if constitutes a normal observation or an anomaly and explain your reasoning.
        Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.

        Finally, provide an overall assessment.

User Message

[[ ## scene_description ## ]]
<structured scene description from extraction step>

Respond with the corresponding output fields, starting with the field `[[ ## street_layer_analysis ## ]]`,
then `[[ ## street_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## infrastructure_layer_analysis ## ]]`,
then `[[ ## infrastructure_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## movable_objects_layer_analysis ## ]]`,
then `[[ ## movable_objects_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## environmental_layer_analysis ## ]]`,
then `[[ ## environmental_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## is_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## explanation ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.

full Two-phase Layered Multimodal Classification (SAVANT)

Complete SAVANT pipeline with multimodal structured evaluation. Phase 1 extracts structured scene descriptions; Phase 2 performs classification using both image and layered text. Uses SceneAnomalySignature with dspy.ChainOfThought.

Phase 1: Layered Scene Description Extraction

Four parallel extraction signatures (one per semantic layer). See Extraction Prompts for complete prompts.

Phase 2: Anomaly Classification from Image + Layered Text

System Message

Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle
2. `scene_description` (str): Structured description of the scene

Your output fields are:
1. `street_layer_analysis` (str): Analysis of street-level anomalies (topology, geometry, surface, lane markings). Describe if there are any unusual or dangerous road conditions
2. `street_layer_anomaly` (bool)
3. `infrastructure_layer_analysis` (str): Analysis of infrastructure anomalies (traffic lights, signs, cones, barriers, construction). Describe any unusual or problematic infrastructure elements
4. `infrastructure_layer_anomaly` (bool)
5. `movable_objects_layer_analysis` (str): Analysis of movable object anomalies (vehicles, pedestrians, etc.). Describe if there any unusual behaviors or dangerous situations.
6. `movable_objects_layer_anomaly` (bool)
7. `environmental_layer_analysis` (str): Analysis of environmental anomalies (weather, visibility). Describe if conditions are truly out-of-distribution or simply require caution.
8. `environmental_layer_anomaly` (bool)
9. `is_anomaly` (bool): Determine if the scene is anomalous, i.e. an out-of-distribution or highly unusual event, considering:
        1. Individual layer anomalies
        2. Cross-layer interaction effects
        3. Context-specific requirements from semantic layer
        4. Criticality of the scene

        Finally, provide an overall assessment considering:
        1. Is this common to see while driving?
        2. Does this have an effect on the ego vehicle?
        3. Can this influence the ego vehicle's behavior?
        4. Can the ego vehicle drive safely in this scenario?
10. `explanation` (str): Final explanation considering all layers and their interactions

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## scene_image ## ]]
{scene_image}

[[ ## scene_description ## ]]
{scene_description}

[[ ## street_layer_analysis ## ]]
{street_layer_analysis}

[[ ## street_layer_anomaly ## ]]
{street_layer_anomaly}        # note: the value you produce must be True or False

[[ ## infrastructure_layer_analysis ## ]]
{infrastructure_layer_analysis}

[[ ## infrastructure_layer_anomaly ## ]]
{infrastructure_layer_anomaly}        # note: the value you produce must be True or False

[[ ## movable_objects_layer_analysis ## ]]
{movable_objects_layer_analysis}

[[ ## movable_objects_layer_anomaly ## ]]
{movable_objects_layer_anomaly}        # note: the value you produce must be True or False

[[ ## environmental_layer_analysis ## ]]
{environmental_layer_analysis}

[[ ## environmental_layer_anomaly ## ]]
{environmental_layer_anomaly}        # note: the value you produce must be True or False

[[ ## is_anomaly ## ]]
{is_anomaly}        # note: the value you produce must be True or False

[[ ## explanation ## ]]
{explanation}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        You are a fault monitor for a visual autonomous vehicle system.
        Your task is to analyze the traffic scene layer by layer and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
        For each layer, determine if constitutes a normal observation or an anomaly and explain your reasoning.
        Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.

        Finally, provide an overall assessment.

User Message

[[ ## scene_image ## ]]
<image>

[[ ## scene_description ## ]]
<structured scene description from extraction step>

Respond with the corresponding output fields, starting with the field `[[ ## street_layer_analysis ## ]]`,
then `[[ ## street_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## infrastructure_layer_analysis ## ]]`,
then `[[ ## infrastructure_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## movable_objects_layer_analysis ## ]]`,
then `[[ ## movable_objects_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## environmental_layer_analysis ## ]]`,
then `[[ ## environmental_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## is_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## explanation ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.
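The six methods documented so far differ along three axes: whether they use the four-layer decomposition, whether a separate description phase precedes classification, and whether the classifier sees the image. The sketch below encodes that as a small table. The `Method` class is hypothetical, and the query counts assume one VLM call per signature; the value of 5 for `full` agrees with the Queries column reported for full_opt, while the other counts are inferred rather than stated in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Method:
    layered: bool        # uses the four-layer decomposition
    two_phase: bool      # separate description step before classification
    image_in_eval: bool  # the classification step receives the image

    @property
    def queries(self) -> int:
        """VLM calls per image, assuming one call per signature."""
        if not self.two_phase:
            return 1
        extraction_calls = 4 if self.layered else 1
        return extraction_calls + 1


METHODS = {
    "image_baseline": Method(layered=False, two_phase=False, image_in_eval=True),
    "text_baseline": Method(layered=False, two_phase=True, image_in_eval=False),
    "baseline": Method(layered=False, two_phase=True, image_in_eval=True),
    "image": Method(layered=True, two_phase=False, image_in_eval=True),
    "text": Method(layered=True, two_phase=True, image_in_eval=False),
    "full": Method(layered=True, two_phase=True, image_in_eval=True),
}

for name, method in METHODS.items():
    print(f"{name}: {method.queries} query/queries, image in evaluation: {method.image_in_eval}")
```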

*_opt MIPROv2 Optimized Variants

Optimized versions of text and full using DSPy MIPROv2 prompt optimization. The optimizer modifies the Phase 2 evaluation prompts in two ways: (1) rewritten instructions tailored to each model, and (2) few-shot demonstrations selected from training data. Only Phase 2 (evaluation) is optimized; the extraction phase uses the same base prompts.

Optimization setup: dspy.MIPROv2 with num_candidates=15, init_temperature=0.3, num_trials=25, minibatch_size=8, max_bootstrapped_demos=5, max_labeled_demos=6. Training data: ~103 anomalous + 21 normal scenes. Metric: binary classification accuracy.
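The metric named above, binary classification accuracy, can be written as a small function. The sketch below assumes the `(example, prediction, trace=None)` argument convention commonly used for DSPy metrics and an `is_anomaly` attribute on both objects; the settings are copied from the text into a plain dictionary rather than passed to a real optimizer, since how they split between constructor and `compile` arguments depends on the DSPy version.

```python
from types import SimpleNamespace
from typing import Any, Iterable, Tuple


def accuracy_metric(example: Any, prediction: Any, trace: Any = None) -> bool:
    """True when the predicted label agrees with the reference label."""
    return bool(example.is_anomaly) == bool(prediction.is_anomaly)


def mean_accuracy(pairs: Iterable[Tuple[Any, Any]]) -> float:
    """Average the per-example metric over (example, prediction) pairs."""
    results = [accuracy_metric(example, prediction) for example, prediction in pairs]
    return sum(results) / len(results)


# Settings quoted from the text above, kept as plain data.
MIPRO_SETTINGS = {
    "num_candidates": 15,
    "init_temperature": 0.3,
    "num_trials": 25,
    "minibatch_size": 8,
    "max_bootstrapped_demos": 5,
    "max_labeled_demos": 6,
}

# Toy (example, prediction) pairs, invented for illustration.
pairs = [
    (SimpleNamespace(is_anomaly=True), SimpleNamespace(is_anomaly=True)),
    (SimpleNamespace(is_anomaly=False), SimpleNamespace(is_anomaly=True)),
    (SimpleNamespace(is_anomaly=True), SimpleNamespace(is_anomaly=True)),
    (SimpleNamespace(is_anomaly=False), SimpleNamespace(is_anomaly=False)),
]
score = mean_accuracy(pairs)
print(score)
```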

Summary of Optimized Prompts Across Models

Model	Optimizer	Instruction Changes	Demos
Qwen2.5-VL-7B	text	Minimal (minor rewording)	6
Qwen2.5-VL-7B	image/full	Added per-layer bullet points	6
Qwen2.5-VL-3B	text	Unchanged	0
Qwen2.5-VL-3B	image/full	Unchanged	0
Qwen2.5-VL-32B	text	Unchanged	6
Qwen2.5-VL-32B	image/full	Unchanged	0
Qwen2.5-VL-72B	text	Completely rewritten ("guardian angel")	6
Qwen2.5-VL-72B	image/full	Unchanged	0
Gemini-2.0-FE	text	Completely rewritten ("vigilant fault monitor")	4
Gemini-2.0-FE	image/full	Completely rewritten ("sophisticated fault monitor")	4

Example: Gemini-2.0-Flash-Exp — Text Optimizer (Rewritten by MIPROv2)

You are tasked with serving as a vigilant fault monitor for a sophisticated autonomous vehicle
system. Your core responsibility involves meticulously analyzing a traffic scene, dissecting it
into distinct layers -- street, infrastructure, movable objects, and environmental -- to pinpoint any
potential anomalies that could jeopardize vehicle safety, induce unsafe maneuvers, or lead to
violations of traffic regulations.

Each layer demands a thorough scrutiny:

1. **Street Layer**: Assess the road's geometry, topology, surface condition, and lane markings.
   Consider whether the layout, including any merging lanes, splits, or bottlenecks, presents
   challenges. Evaluate the road surface's integrity and the clarity of lane markings, factoring
   in any disruptions caused by construction or wear.

2. **Infrastructure Layer**: Investigate traffic control devices such as lights, signs, cones,
   and barriers. Determine if their placement and condition adhere to standards, ensuring they
   effectively guide traffic. Take note of any construction sites, evaluating their legitimacy
   and the adequacy of their demarcation.

3. **Movable Objects Layer**: Observe vehicles, pedestrians, and other dynamic entities within
   the scene. Analyze their behavior and positioning relative to the ego vehicle, identifying
   any unusual movements or placements that could signal danger.

4. **Environmental Layer**: Gauge the impact of weather conditions, visibility, and lighting
   on the driving environment. Consider how these factors may affect the vehicle's perception
   and maneuverability.

For each layer, discern whether the observations constitute a normal scenario conducive to safe
driving or an anomaly that could compromise safety. Provide a reasoned explanation for your
determination, elucidating the specific elements that support your conclusion.

Ultimately, synthesize your findings into an overarching assessment. Reflect on the interplay
between the layers, considering how anomalies in one area might exacerbate issues in another.
Evaluate the scene's commonality, its implications for the ego vehicle's performance, and the
necessity for altering the vehicle's behavior. Conclude with a definitive statement on the
scene's overall safety and the vehicle's ability to navigate it securely.

Example: Gemini-2.0-Flash-Exp — Full Optimizer (Rewritten by MIPROv2)

You are a sophisticated fault monitor for a visual autonomous vehicle system, tasked with
meticulously analyzing traffic scenes to ensure the safety and optimal performance of the vehicle.
Your role is to dissect the scene into four distinct layers -- street, infrastructure, movable objects,
and environmental -- and conduct a thorough examination for any anomalies that could jeopardize vehicle
safety, lead to unsafe actions, or result in traffic rule violations.

For each layer, your analysis should be comprehensive and detailed:
- **Street Layer:** Assess the road's geometry, topology, surface condition, and lane markings.
  Determine if there are any unusual road conditions, such as unexpected lane closures, irregular
  surface textures, or unclear markings that could confuse the vehicle.
- **Infrastructure Layer:** Evaluate the state of traffic lights, signs, cones, barriers, and
  construction sites. Identify any infrastructure elements that are out of place, damaged, or
  improperly set up, which could mislead the vehicle or create hazardous situations.
- **Movable Objects Layer:** Analyze the behavior and positioning of vehicles, pedestrians, and
  other dynamic elements. Look for any unusual movements, unexpected stops, or erratic actions
  that could pose a threat to the vehicle's safety.
- **Environmental Layer:** Consider the weather, visibility, and lighting conditions. Determine
  if the environmental factors are within normal ranges or if they present challenges that could
  affect the vehicle's performance, such as reduced visibility due to fog or glare from streetlights.

For each layer, clearly state whether the observations are normal or anomalous, and provide a
detailed explanation for your reasoning. Normal observations should not negatively impact the
vehicle's performance, while anomalies could potentially cause issues.

After analyzing each layer, synthesize your findings into an overall assessment. Consider the
interactions between layers, the context-specific requirements of the scene, and the criticality
of the observed conditions. Provide a final explanation that summarizes the safety and potential
impact on the vehicle's behavior, addressing the following questions:
1. Is this a common occurrence while driving?
2. Does this have an effect on the ego vehicle?
3. Can this influence the ego vehicle's behavior?
4. Can the ego vehicle drive safely in this scenario?

Your analysis should be highly informative and descriptive, ensuring that all potential hazards
are identified and explained in detail.

Example: Qwen2.5-VL-72B — Text Optimizer (Rewritten by MIPROv2)

You are the guardian angel of an autonomous vehicle navigating through complex and ever-changing
traffic scenes. Your mission is to meticulously dissect each scene into its fundamental
components -- street, infrastructure, movable objects, and environmental factors -- and scrutinize them
for any deviations from the norm that could jeopardize the vehicle's safety, operational integrity,
or adherence to traffic laws.

**Your Role:**
- **Layered Detective:** Delve deep into the intricacies of each scene layer. For the street layer,
  investigate the road's geometry, topology, surface condition, and lane markings. In the
  infrastructure layer, examine traffic lights, signs, cones, barriers, and construction sites.
  Assess the behavior of vehicles, pedestrians, and other dynamic entities in the movable objects
  layer. Lastly, consider the environmental factors like weather, visibility, and lighting
  conditions in the environmental layer.

- **Anomaly Hunter:** Identify and flag any anomalies within each layer. Determine if what you
  observe is a normal occurrence or a potential hazard. Provide a rationale for your judgment,
  explaining why certain observations are deemed anomalies and how they could impact the vehicle's
  performance.

- **Synthesizer of Truth:** After analyzing each layer, synthesize your findings to deliver an
  overarching verdict on the scene's anomaly status. Consider the interplay between layers, the
  broader context, and the criticality of the situation. Offer a conclusive explanation that
  encapsulates your reasoning, guiding the autonomous vehicle towards safe and informed
  decision-making.

**Your Task:**
Given a structured description of a traffic scene, conduct a thorough analysis of each layer,
pinpoint anomalies, and provide a comprehensive assessment. Your insights will empower the vehicle
to navigate the world with precision and safety.

Method Comparison Summary

Method	Layered Analysis	Image in Phase 2	DSPy Opt.	Queries
`image_baseline`	No	Yes (single)	No	1
`text_baseline`	No	No	No	2
`baseline`	No	Yes	No	2
`image`	Yes	Yes (single)	No	1
`text`	Yes	No	No	5
`full`	Yes	Yes	No	5
`text_opt`	Yes	No	Yes	5
`full_opt`	Yes	Yes	Yes	5
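The Queries column appears to follow from two switches: whether a method answers in a single call, and whether Phase 1 is split into one extraction per layer. A minimal sketch under that reading (the `Method` dataclass and `query_count` helper are hypothetical and not part of the released code):

```python
from dataclasses import dataclass

NUM_LAYERS = 4  # street, infrastructure, movable objects, environmental


@dataclass(frozen=True)
class Method:
    name: str
    layered: bool       # Phase 1 split into one extraction per layer
    single_call: bool   # image-only variants answer in one query


def query_count(m: Method) -> int:
    """Assumed rule: single-call methods use 1 query; otherwise
    Phase 1 (4 layered calls or 1 plain description) plus 1 evaluation."""
    if m.single_call:
        return 1
    phase1 = NUM_LAYERS if m.layered else 1
    return phase1 + 1


METHODS = [
    Method("image_baseline", layered=False, single_call=True),
    Method("text_baseline", layered=False, single_call=False),
    Method("baseline", layered=False, single_call=False),
    Method("image", layered=True, single_call=True),
    Method("text", layered=True, single_call=False),
    Method("full", layered=True, single_call=False),
    Method("text_opt", layered=True, single_call=False),
    Method("full_opt", layered=True, single_call=False),
]

counts = {m.name: query_count(m) for m in METHODS}
```

Under this assumed rule, the optimized variants cost the same number of inference queries as their unoptimized counterparts, since optimization only changes the prompt text.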

Layered Scene Description Extraction Prompts

The `text` and `full` methods use four parallel extraction signatures in Phase 1, one per semantic layer. Each signature extracts structured information from the image.
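Because the four extractions are independent, they can be dispatched concurrently. The sketch below shows one way to do that with a thread pool; the `extract_layers` helper and the stub extractors are hypothetical stand-ins for the actual VLM calls, which are not shown here.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

LayerExtractor = Callable[[bytes], Dict[str, str]]


def extract_layers(image: bytes,
                   extractors: Dict[str, LayerExtractor]) -> Dict[str, Dict[str, str]]:
    """Run one extractor per layer concurrently and collect results by layer name."""
    with ThreadPoolExecutor(max_workers=len(extractors)) as pool:
        futures = {name: pool.submit(fn, image) for name, fn in extractors.items()}
        return {name: fut.result() for name, fut in futures.items()}


# Stub extractors standing in for the four VLM calls (hypothetical outputs).
stubs: Dict[str, LayerExtractor] = {
    "street": lambda img: {"surface": "asphalt, dry"},
    "infrastructure": lambda img: {"traffic_lights": "None"},
    "movable_objects": lambda img: {"vehicles": "None"},
    "environmental": lambda img: {"visibility": "clear"},
}

layers = extract_layers(b"fake-image-bytes", stubs)
```

Threads suit this case because each call mostly waits on a remote or GPU-bound model rather than on Python computation.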

Layer 1: Street Layer — StreetLayerSignature

Extracts topology, geometry, surface condition, and lane markings. Uses StreetLayerAnalyzer with dspy.ChainOfThought.

System Message

Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle

Your output fields are:
1. `topology` (str): Describe the street network layout or structure.
                                      For example: road type (highway, urban, residential), intersections (T-junction, roundabout),
                                      number of carriageways, connections (intersections, merges, splits), etc.
2. `geometry` (str): Describe the geometry of the road, including number of lanes, lane types
                                     (driving, turning, bus, bike etc.), lane widths, road curvature (straight, curved, intersection, etc.),
                                     road gradient (flat, uphill, downhill, etc.), and any geometric features like merging lanes, splits,
                                     or bottlenecks
3. `surface` (str): The road surface characteristics including material type (asphalt, concrete, gravel, etc.),
                                    surface condition (dry, wet, icy, snowy etc.), surface quality (smooth, rough, damaged etc.), and any
                                    hazards or anomalies (potholes, cracks, debris, speed bumps, etc.)
4. `lane_markings` (str): The characteristics and condition of lane markings, including marking types
                                          (solid, dashed, double, yellow, white), marking visibility (clear, faded, obscured), special markings
                                          (arrows, crosswalks, stop lines), and any irregularities or temporary markings

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## scene_image ## ]]
{scene_image}

[[ ## topology ## ]]
{topology}

[[ ## geometry ## ]]
{geometry}

[[ ## surface ## ]]
{surface}

[[ ## lane_markings ## ]]
{lane_markings}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        You are a traffic scene analysis expert for an autonomous vehicle system.
        With over 10 years of experience in traffic scene analysis, you specialize in analyzing traffic scenes from the front camera of the ego vehicle.
        Your task is to analyze the current traffic scene strictly based on the street level and provide a comprehensive analysis.

        The street level analysis includes:
        - Topology: Overall street network structure (e.g., road type, intersections)
        - Geometry: Road geometry details (e.g., lane count, curvature, gradient)
        - Surface: Road surface material, condition and potential hazards
        - Lane markings: Quality and visibility of lane markings, including any irregularities

        The goal is to capture all street-related details that could affect the ego vehicle's driving behavior or safety.

User Message

[[ ## scene_image ## ]]
<image>

Respond with the corresponding output fields, starting with the field `[[ ## topology ## ]]`,
then `[[ ## geometry ## ]]`, then `[[ ## surface ## ]]`, then `[[ ## lane_markings ## ]]`,
and then ending with the marker for `[[ ## completed ## ]]`.
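The response format above delimits each output field with a `[[ ## name ## ]]` header and closes with `[[ ## completed ## ]]`. As an illustration of how such a response can be split back into fields, here is a small standalone parser; `parse_fields` is a hypothetical helper written for this example and is not DSPy's own adapter code.

```python
import re
from typing import Dict

# Matches a header line such as "[[ ## topology ## ]]".
FIELD_HEADER = re.compile(r"^\[\[ ## (\w+) ## \]\][ \t]*$", re.MULTILINE)


def parse_fields(response: str) -> Dict[str, str]:
    """Split a response on `[[ ## name ## ]]` headers into {name: text}."""
    matches = list(FIELD_HEADER.finditer(response))
    fields: Dict[str, str] = {}
    for i, m in enumerate(matches):
        name = m.group(1)
        if name == "completed":
            break  # end-of-response marker carries no content
        end = matches[i + 1].start() if i + 1 < len(matches) else len(response)
        fields[name] = response[m.end():end].strip()
    return fields


sample = """[[ ## topology ## ]]
Urban four-way intersection.

[[ ## geometry ## ]]
Two lanes, straight, flat.

[[ ## surface ## ]]
Asphalt, dry.

[[ ## lane_markings ## ]]
White dashed, clearly visible.

[[ ## completed ## ]]
"""

street = parse_fields(sample)
```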

Layer 2: Infrastructure Layer — InfrastructureLayerSignature

Extracts traffic lights, signs, cones, barriers, and construction sites. Uses InfrastructureLayerAnalyzer with dspy.ChainOfThought.

System Message

Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle

Your output fields are:
1. `traffic_lights` (str): Describe any traffic lights or objects mistaken for traffic lights.
                                           Include position, state (green, red, flashing) and whether they seem correctly placed or attached to
                                           unusual objects. Return 'None' if no traffic lights are present.
2. `traffic_signs` (str): Describe any traffic signs or potential lookalikes (e.g., signs on billboards, defaced or
                                          damaged signs). Include sign type, visibility, placement, and note if they appear
                                          normal or if anything seems odd. Return 'None' if no traffic signs are present.
3. `traffic_cones` (str): Describe traffic cones, including their quantity, positions (e.g., placed on lane markings, shoulders),
                                          and any unusual cases (e.g., objects that look like cones but aren't). Indicate if placement is typical
                                          (construction zone) or out of place. Return 'None' if no traffic cones are present.
4. `traffic_barriers` (str): Describe any traffic barriers, noting their position, condition and whether they appear damaged,
                                            misplaced, or used in unexpected ways. Return 'None' if no traffic barriers are present.
5. `construction_sites` (str): Describe any construction or maintenance areas, including signs of ongoing work, equipment,
                                               or partial road closures. Note if items appear legitimate or incorrectly placed
                                               (e.g., unmarked hazards, open pits). Return 'None' if no construction sites are present.

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## scene_image ## ]]
{scene_image}

[[ ## traffic_lights ## ]]
{traffic_lights}

[[ ## traffic_signs ## ]]
{traffic_signs}

[[ ## traffic_cones ## ]]
{traffic_cones}

[[ ## traffic_barriers ## ]]
{traffic_barriers}

[[ ## construction_sites ## ]]
{construction_sites}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        You are a traffic scene analysis expert for an autonomous vehicle system.
        With over 10 years of experience in traffic scene analysis, you specialize in analyzing traffic scenes from the front camera of the ego vehicle.
        Your task is to analyze the current traffic scene strictly based on the infrastructure level and provide a comprehensive analysis.

        The infrastructure level analysis includes:
        - Traffic lights
        - Traffic signs
        - Traffic cones
        - Traffic barriers
        - Construction sites

        Focus on:
        - Normal infrastructure (traffic lights, traffic signs, cones, barriers, construction areas)
        - Incorrect or unusual placements and positioning (e.g., traffic signs on billboards, traffic lights carried by vehicles)
        - Damaged or defaced infrastructure (graffiti on signs, bent or obscured signs)
        - Objects that might be misinterpreted as infrastructure (e.g., balloons mistaken for traffic cones, moon for traffic lights)

        Note their presence, placement, and condition -- especially if something seems out-of-place or anomalous.
        If something is unclear or partially visible, indicate that.
        The goal is to capture all infrastructure-related details that could affect the ego vehicle's driving behavior or safety.

User Message

[[ ## scene_image ## ]]
<image>

Respond with the corresponding output fields, starting with the field `[[ ## traffic_lights ## ]]`,
then `[[ ## traffic_signs ## ]]`, then `[[ ## traffic_cones ## ]]`, then `[[ ## traffic_barriers ## ]]`,
then `[[ ## construction_sites ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.
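Each infrastructure field returns the literal string 'None' when the element is absent, so downstream code may want to separate absent elements from described ones. A minimal sketch, assuming the fields have already been parsed into a dictionary (`normalize_absent` and `present_elements` are hypothetical helper names):

```python
from typing import Dict, List, Optional


def normalize_absent(fields: Dict[str, str]) -> Dict[str, Optional[str]]:
    """Map the literal string 'None' (any case, optional quotes) to Python None."""
    out: Dict[str, Optional[str]] = {}
    for name, text in fields.items():
        cleaned = text.strip().strip("'\"")
        out[name] = None if cleaned.lower() == "none" else text.strip()
    return out


def present_elements(fields: Dict[str, Optional[str]]) -> List[str]:
    """Names of the infrastructure elements actually reported in the scene."""
    return [name for name, text in fields.items() if text is not None]


# Hypothetical parsed output for one image.
raw = {
    "traffic_lights": "Two lights ahead, state unclear.",
    "traffic_signs": "Stop sign on the right.",
    "traffic_cones": "None",
    "traffic_barriers": "'None'",
    "construction_sites": " none ",
}
infra = normalize_absent(raw)
```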

Layer 3: Movable Objects Layer — MovableObjectsLayerSignature

Extracts vehicles, pedestrians, and other dynamic objects. Uses MovableObjectsLayerAnalyzer with dspy.ChainOfThought.

System Message

Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle

Your output fields are:
1. `vehicles` (str): Describe each visible vehicle (car, truck, bus, motorcycle, bicycle, scooter etc.)
                                     its approximate position and orientation to the ego vehicle and any unusual features, anomalies or
                                     out-of-place behavior. Return 'None' if no vehicles are present.
2. `pedestrians` (str): Describe any visible pedestrians, including their position relative to the ego vehicle
                                    and any odd or risky behavior. Return 'None' if no pedestrians are present.
3. `others` (str): Describe any other dynamic, moving or potentially moving objects (animals, scooters,
                                   unusual objects, etc.) and any anomalous or out-of-place behavior of such objects in the scene.
                                   Return 'None' if no other objects are present.

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## scene_image ## ]]
{scene_image}

[[ ## vehicles ## ]]
{vehicles}

[[ ## pedestrians ## ]]
{pedestrians}

[[ ## others ## ]]
{others}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        You are a traffic scene analysis expert for an autonomous vehicle system.
        With over 10 years of experience in traffic scene analysis, you specialize in analyzing traffic scenes from the front camera of the ego vehicle.
        Your task is to analyze the current traffic scene strictly based on the movable (dynamic) objects level and provide a comprehensive analysis.

        The movable objects level analysis includes:
        - Vehicles
        - Pedestrians
        - Others (e.g., animals, bicycles, shopping carts, any other moving objects)

        Focus on:
        - Object type
        - Approximate location and orientation relative to the ego vehicle
        - Any observable features, potential hazards, anomalies or out-of-place behavior for each object

        Since only one frame is available, avoid making strong assumptions about movement.
        The goal is to capture all movable objects-related details that could affect the ego vehicle's driving behavior or safety.

User Message

[[ ## scene_image ## ]]
<image>

Respond with the corresponding output fields, starting with the field `[[ ## vehicles ## ]]`,
then `[[ ## pedestrians ## ]]`, then `[[ ## others ## ]]`,
and then ending with the marker for `[[ ## completed ## ]]`.

Layer 4: Environmental Layer — EnvironmentalLayerSignature

Extracts weather, visibility, and lighting conditions. Uses EnvironmentalLayerAnalyzer with dspy.ChainOfThought.

System Message

Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle

Your output fields are:
1. `weather` (str): Describe the present weather conditions (e.g., light rain, heavy snow, dense fog)
                                    Include intensity if relevant (light, moderate, heavy).
2. `visibility` (str): Select one of four visibility levels:
                                                                                   'clear' (no obstruction), 'light' (slight reduction),
                                                                                   'moderate' (noticeable but still manageable), or
                                                                                   'severe' (heavily impaired, only if background or roadside
                                                                                   features are not clearly visible).
3. `lighting` (str): Describe lighting conditions, such as daylight, dusk, night, or glare. Include any impact on
                                     visibility (e.g., sun glare causing partial obstruction).

All interactions will be structured in the following way, with the appropriate values filled in.

[[ ## scene_image ## ]]
{scene_image}

[[ ## weather ## ]]
{weather}

[[ ## visibility ## ]]
{visibility}

[[ ## lighting ## ]]
{lighting}

[[ ## completed ## ]]

In adhering to this structure, your objective is:
        You are a traffic scene analysis expert for an autonomous vehicle system.
        With over 10 years of experience in traffic scene analysis, you specialize in analyzing traffic scenes from the front camera of the ego vehicle.
        Your task is to analyze the current traffic scene strictly based on the environmental level and provide a comprehensive analysis.

        The environmental level analysis includes:
        - Weather: Description of the weather conditions (e.g., sunny, light rain, heavy snow, dense fog)
        - Visibility (clear, light, moderate, severe)
        - Lighting (daylight, dusk, night, glare)

        Provide enough detail to capture conditions that might affect driving safety or sensor performance.
        Distinguish normal weather conditions from unusual, truly severe or out-of-distribution scenarios.
        Avoid labeling normal conditions as anomalies unless they are truly severe.

        The goal is to capture all environmental-related details that could affect the ego vehicle's driving behavior or safety.

User Message

[[ ## scene_image ## ]]
<image>

Respond with the corresponding output fields, starting with the field `[[ ## weather ## ]]`,
then `[[ ## visibility ## ]]`, then `[[ ## lighting ## ]]`,
and then ending with the marker for `[[ ## completed ## ]]`.
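Unlike the other free-text fields, `visibility` is restricted to four levels, which makes it easy to validate after parsing. The sketch below shows one possible check; the `Visibility` enum and `parse_visibility` function are hypothetical names introduced for this example.

```python
from enum import Enum


class Visibility(str, Enum):
    CLEAR = "clear"        # no obstruction
    LIGHT = "light"        # slight reduction
    MODERATE = "moderate"  # noticeable but still manageable
    SEVERE = "severe"      # heavily impaired


def parse_visibility(text: str) -> Visibility:
    """Accept the four documented levels; reject anything else."""
    cleaned = text.strip().strip("'\"").lower()
    try:
        return Visibility(cleaned)
    except ValueError:
        raise ValueError(f"unexpected visibility level: {text!r}") from None


level = parse_visibility(" 'Light' ")
```

Rejecting unknown values early keeps a malformed model response from silently flowing into the Phase 2 description.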

DSPy Framework Details

We leverage DSPy for systematic prompt optimization and pipeline composition.

Signature Definitions

import dspy

class SceneDescription(dspy.Signature):
    """Extract structured scene description."""
    image: dspy.Image = dspy.InputField()
    street_layer: str = dspy.OutputField()
    infrastructure_layer: str = dspy.OutputField()
    movable_objects_layer: str = dspy.OutputField()
    environmental_layer: str = dspy.OutputField()

class AnomalyEvaluation(dspy.Signature):
    """Evaluate scene for semantic anomalies."""
    image: dspy.Image = dspy.InputField()
    description: str = dspy.InputField()
    classification: str = dspy.OutputField()
    confidence: float = dspy.OutputField()
    reasoning: str = dspy.OutputField()

Pipeline Module

class SAVANTPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.describe = dspy.ChainOfThought(SceneDescription)
        self.evaluate = dspy.ChainOfThought(AnomalyEvaluation)

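    def forward(self, image):
        # Phase 1: extract the four layer descriptions from the image.
        desc = self.describe(image=image)
        # Aggregate the layers into a single text block for the evaluator.
        full_desc = (
            f"Street: {desc.street_layer}\n"
            f"Infra: {desc.infrastructure_layer}\n"
            f"Objects: {desc.movable_objects_layer}\n"
            f"Env: {desc.environmental_layer}"
        )
        # Phase 2: evaluate the image together with the aggregated description.
        return self.evaluate(image=image, description=full_desc)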

Optimization Configuration

from dspy.teleprompt import MIPROv2

def anomaly_metric(example, prediction, trace=None):
    # Exact-match metric: 1.0 when the predicted class equals the gold label.
    return 1.0 if example.label == prediction.classification else 0.0

optimizer = MIPROv2(
    metric=anomaly_metric,
    num_candidates=10,
    init_temperature=1.0,
    num_threads=4
)
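# train_examples is assumed to be a list of dspy.Example objects carrying an
# `image` input and a gold `label`; it is not defined in this excerpt.
optimized_pipeline = optimizer.compile(
    SAVANTPipeline(),
    trainset=train_examples,
    max_bootstrapped_demos=4,
    max_labeled_demos=8
)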
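Averaging an exact-match metric like `anomaly_metric` over a dataset yields accuracy, while recall additionally requires treating "anomaly" as the positive class. The sketch below computes both from plain string labels; the function name and the toy labels are illustrative only and do not reproduce any reported result.

```python
from typing import Sequence, Tuple


def recall_and_accuracy(labels: Sequence[str], preds: Sequence[str],
                        positive: str = "anomaly") -> Tuple[float, float]:
    """Recall on the positive class and overall accuracy for string labels."""
    if len(labels) != len(preds):
        raise ValueError("labels and preds must have the same length")
    pairs = list(zip(labels, preds))
    tp = sum(1 for y, p in pairs if y == positive and p == positive)
    fn = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return recall, accuracy


# Toy data: one of two anomalies is missed, both normal scenes are correct.
labels = ["anomaly", "anomaly", "normal", "normal"]
preds = ["anomaly", "normal", "normal", "normal"]
recall, accuracy = recall_and_accuracy(labels, preds)
```

In a safety monitor a missed anomaly is typically costlier than a false alarm, which is why recall is worth tracking next to accuracy.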

Example Model Outputs: NFT vs FT vs PFT

We present example outputs from each model variant using Qwen2.5-VL-7B, ordered as: NFT (base model), FT (fine-tuned single-shot), and PFT (pipeline fine-tuned). We use the same image — not present in the CODA dataset — across all three methods to demonstrate how SAVANT improves detection performance.

Example anomaly scene: Workers in a lift performing maintenance on traffic signals at an icy intersection — an unusual scene that may confuse the conventional software stack.

Base Model Without Fine-tuning (NFT)

Prompt Template (NFT)

You are a fault monitor for a visual autonomous vehicle system.
Your task is to analyze the traffic scene and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
Determine if the scene constitutes a normal observation or an anomaly and explain your reasoning.
Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.

IMPORTANT: You must provide your response in the exact JSON format with ALL required fields:
{
  "is_anomaly": true/false,
  "explanation": "Final explanation of why the scene is or is not anomalous"
}

Analyze the provided traffic scene image for anomalies. Consider:
1. Is there anything unusual about this situation?
2. Does this have an effect on the ego vehicle?
3. Can this influence the ego vehicle's behavior?
4. Can the ego vehicle drive safely in this scenario?

Prediction: Normal (incorrect)

{
  "is_anomaly": false,
  "explanation": "The scene depicts a typical roadwork situation where
  workers are performing maintenance on traffic signals. There are no
  unusual elements that would affect the safety of the ego vehicle.
  The workers are at a safe distance from the road. The ego vehicle
  can safely proceed through the intersection as there are no immediate
  hazards or obstructions."
}

The base model fails to detect the anomaly. It dismisses the workers in the lift as a "typical roadwork situation" and incorrectly concludes the ego vehicle can safely proceed — not accounting for potentially non-functional traffic lights, collision risk due to temporary lack of signalization, and icy road conditions at the intersection.
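Since the NFT prompt requires a JSON object with `is_anomaly` and `explanation`, the reply has to be parsed and checked before the verdict can be scored. A minimal sketch of such a check follows; `parse_verdict` is a hypothetical helper, and tolerating extra text around the JSON is an assumption about model behavior rather than something the prompt guarantees.

```python
import json
from typing import Tuple


def parse_verdict(response: str) -> Tuple[bool, str]:
    """Parse the span from the first '{' to the last '}' and check both required fields."""
    start, end = response.find("{"), response.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in response")
    data = json.loads(response[start:end + 1])
    if not isinstance(data.get("is_anomaly"), bool):
        raise ValueError("'is_anomaly' must be a JSON boolean")
    if not isinstance(data.get("explanation"), str):
        raise ValueError("'explanation' must be a string")
    return data["is_anomaly"], data["explanation"]


# Hypothetical reply with a short preamble before the JSON object.
reply = 'Here is my answer:\n{"is_anomaly": false, "explanation": "Typical roadwork."}'
is_anomaly, explanation = parse_verdict(reply)
```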

Fine-tuned Single-Shot (FT)

Prompt Template (FT)

You are a fault monitor for a visual autonomous vehicle system.
Your task is to analyze the traffic scene layer by layer and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
For each layer, determine if it constitutes a normal observation or an anomaly and explain your reasoning.
Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.

Finally, provide an overall assessment.

Analyze this traffic scene image and determine if it contains any anomalies that could affect vehicle safety, cause unsafe actions, or violate traffic rules.

Consider all layers:
- Street Layer: Road geometry, topology, surface, lane markings
- Infrastructure Layer: Traffic lights, signs, cones, barriers, construction
- Movable Objects Layer: Vehicles, pedestrians, dynamic entities
- Environmental Layer: Weather, visibility, lighting

Respond with only: "anomaly" or "normal"

Prediction: Anomaly (correct)

anomaly

The fine-tuned model has internalized the structured analysis from SAVANT and correctly classifies the scene as anomalous with minimal latency. However, it cannot explain why; for interpretability, we rely on PFT. A practical deployment could run FT at 10+ FPS for real-time monitoring, engaging PFT on demand for explanations when anomalies are detected.
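The deployment pattern described here amounts to a two-stage cascade: a cheap classifier on every frame and an expensive explainer only on flagged frames. The following sketch illustrates the control flow with stub models; `monitor_frame`, `fake_ft`, and `fake_pft` are hypothetical names, and the stubs do not call any real model.

```python
from typing import Callable, Optional, Tuple

Classifier = Callable[[bytes], str]   # returns "anomaly" or "normal"
Explainer = Callable[[bytes], dict]   # returns a Phase 2 style report


def monitor_frame(image: bytes, classify: Classifier,
                  explain: Explainer) -> Tuple[str, Optional[dict]]:
    """Run the cheap classifier on every frame; call the explainer only on anomalies."""
    label = classify(image).strip().lower()
    if label != "anomaly":
        return label, None
    return label, explain(image)


# Stubs standing in for the FT and PFT models (hypothetical behavior).
calls = {"explain": 0}


def fake_ft(image: bytes) -> str:
    return "anomaly" if image == b"icy-intersection" else "normal"


def fake_pft(image: bytes) -> dict:
    calls["explain"] += 1
    return {"is_anomaly": True,
            "explanation": "Workers in a lift at an icy intersection."}


normal_result = monitor_frame(b"clear-highway", fake_ft, fake_pft)
anomaly_result = monitor_frame(b"icy-intersection", fake_ft, fake_pft)
```

With this structure the explainer's cost is paid only on the frames that the classifier flags.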

Pipeline Fine-tuned (PFT)

Prompt Template — Phase 1: Scene Description (PFT)

You are an expert at analyzing traffic scenes for autonomous vehicles. Your task is to provide detailed, structured descriptions across four distinct layers using the optimized framework.

Analyze this traffic scene image using the four-layer framework:

1. **Street Layer**: Assess the road's geometry (lane configuration, curves, gradients), topology (intersections, merges, splits), surface condition (material, damage, wetness), and lane markings (visibility, type, condition).

2. **Infrastructure Layer**: Investigate traffic control devices including traffic lights (state, visibility, positioning), signs (type, condition, placement), cones (placement, condition), barriers (type, condition), and construction sites (legitimacy, safety measures).

3. **Movable Objects Layer**: Observe and describe vehicles (type, position, orientation, behavior), pedestrians (location, movement, safety), and other dynamic objects (animals, debris, unusual items).

4. **Environmental Layer**: Evaluate weather conditions (precipitation, wind, temperature effects), visibility (clear, fog, glare), and lighting (daylight, artificial, shadows).

Provide your response in this exact JSON format:

{
  "street_layer": {
    "geometry": "Detailed analysis of road geometry...",
    "topology": "Detailed analysis of road topology...",
    "surface": "Detailed analysis of surface condition...",
    "lane_markings": "Detailed analysis of lane markings..."
  },
  "infrastructure_layer": {
    "traffic_lights": "Analysis of traffic lights or 'None'",
    "traffic_signs": "Analysis of traffic signs or 'None'",
    "traffic_cones": "Analysis of traffic cones or 'None'",
    "traffic_barriers": "Analysis of barriers or 'None'",
    "construction_sites": "Analysis of construction sites or 'None'"
  },
  "movable_objects_layer": {
    "vehicles": "Analysis of vehicles or 'None'",
    "pedestrians": "Analysis of pedestrians or 'None'",
    "others": "Analysis of other objects or 'None'"
  },
  "environmental_layer": {
    "weather": "Analysis of weather conditions",
    "visibility": "Analysis of visibility conditions",
    "lighting": "Analysis of lighting conditions"
  }
}
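The Phase 2 template consumes this output through its `{scene_description}` placeholder. How the JSON is serialized into that placeholder is not specified here, so the sketch below shows just one plausible rendering, a plain-text block with one section per layer; `render_description` and `LAYER_TITLES` are hypothetical names.

```python
import json
from typing import Dict

LAYER_TITLES = {
    "street_layer": "Street Layer",
    "infrastructure_layer": "Infrastructure Layer",
    "movable_objects_layer": "Movable Objects Layer",
    "environmental_layer": "Environmental Layer",
}


def render_description(phase1_json: str) -> str:
    """Turn the Phase 1 JSON into a plain-text block, one section per layer."""
    data: Dict[str, Dict[str, str]] = json.loads(phase1_json)
    sections = []
    for key, title in LAYER_TITLES.items():
        if key not in data:
            raise KeyError(f"missing layer: {key}")
        lines = [f"{title}:"]
        lines += [f"- {field}: {text}" for field, text in data[key].items()]
        sections.append("\n".join(lines))
    return "\n\n".join(sections)


# Abbreviated, hypothetical Phase 1 output with one field per layer.
phase1 = json.dumps({
    "street_layer": {"surface": "Snow and ice on asphalt."},
    "infrastructure_layer": {"traffic_cones": "Near the work area."},
    "movable_objects_layer": {"pedestrians": "Two workers in a lift."},
    "environmental_layer": {"visibility": "light"},
})
description = render_description(phase1)
prompt_tail = "Scene Description:\n{scene_description}".format(
    scene_description=description
)
```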

Prompt Template — Phase 2: Multimodal Evaluation (PFT)

You are a fault monitor for a visual autonomous vehicle system.
Your task is to analyze the traffic scene layer by layer and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
For each layer, determine if it constitutes a normal observation or an anomaly and explain your reasoning.
Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.

Finally, provide an overall assessment.

Analyze the provided image and scene description for anomalies using systematic layer-by-layer evaluation.

Scene Description:
{scene_description}

Perform the following analysis:

1. **Reasoning**: Think step by step about potential anomalies in each layer
2. **Street Layer Analysis**: Analyze street-level anomalies (topology, geometry, surface, lane markings)
3. **Infrastructure Layer Analysis**: Analyze infrastructure anomalies (traffic lights, signs, cones, barriers, construction)
4. **Movable Objects Layer Analysis**: Analyze movable object anomalies (vehicles, pedestrians, etc.)
5. **Environmental Layer Analysis**: Analyze environmental anomalies (weather, visibility)

Provide your assessment in this exact JSON format:

{
  "reasoning": "Step-by-step reasoning about potential anomalies...",
  "street_layer_analysis": "Detailed analysis of street layer...",
  "street_layer_anomaly": true/false,
  "infrastructure_layer_analysis": "Detailed analysis of infrastructure layer...",
  "infrastructure_layer_anomaly": true/false,
  "movable_objects_layer_analysis": "Detailed analysis of movable objects layer...",
  "movable_objects_layer_anomaly": true/false,
  "environmental_layer_analysis": "Detailed analysis of environmental layer...",
  "environmental_layer_anomaly": true/false,
  "is_anomaly": true/false,
  "explanation": "Overall explanation considering all layers and their interactions..."
}

Prediction: Anomaly (correct)

Phase 1 — Scene Description Output

{
  "street_layer": {
    "geometry": "The road appears to have at least one lane in each
      direction. The lane widths are not clearly visible due to the
      snow and ice. The road seems relatively straight with a flat
      gradient. There are no visible merging lanes or splits.",
    "topology": "The scene appears to be an urban or suburban road
      with an intersection. The road type is likely a local road or
      a collector road. The intersection is a standard four-way
      intersection with traffic lights.",
    "surface": "The road surface is primarily asphalt, but it is
      covered with a layer of snow and ice. The surface condition
      is slippery due to the ice. There may be potholes or cracks
      hidden beneath the snow and ice.",
    "lane_markings": "Lane markings are not clearly visible due to
      the snow and ice. It is difficult to determine the type,
      visibility, or condition of the lane markings."
  },
  "infrastructure_layer": {
    "traffic_lights": "There are traffic lights visible in the image.
      One set is located in the distance, and another set is visible
      on the right side of the image, near the stop sign. The state
      of the lights is not clear. They appear to be correctly placed.",
    "traffic_signs": "A stop sign is visible on the right side of the
      image. It appears to be correctly placed.",
    "traffic_cones": "There are a few traffic cones visible in the
      image. They are placed near the construction site on the right
      side of the road. They appear to be correctly placed.",
    "traffic_barriers": "None",
    "construction_sites": "There is a construction site on the right
      side of the road. Workers are present, and a truck is parked
      nearby. The area is marked with traffic cones."
  },
  "movable_objects_layer": {
    "vehicles": "- Car: A dark-colored car is positioned in the center
      of the road, facing the ego vehicle. It appears to be moving
      forward.\n- Truck: A black pickup truck is positioned in the
      intersection, facing slightly to the right.\n- Truck: A white
      pickup truck is positioned on the right side of the road, near
      the stop sign, facing slightly to the left.\n- Truck: A red
      truck is positioned on the left side of the road, facing
      slightly to the right.",
    "pedestrians": "- Two workers are in a lift, working on the
      overhead wires.\n- One worker is standing on the white truck
      on the right side of the road, working on the stop sign.",
    "others": "None"
  },
  "environmental_layer": {
    "weather": "Overcast, with snow on the ground.",
    "visibility": "light",
    "lighting": "Daylight"
  }
}

Phase 2 — Multimodal Evaluation Output

{
  "reasoning": "Layer-by-layer analysis reveals the following:",
  "street_layer_analysis": "The road surface is slippery due to the snow
    and ice, which significantly reduces traction. The lane markings are
    not clearly visible, making it difficult to maintain lane discipline.
    The presence of an intersection further increases the complexity of
    the situation.",
  "street_layer_anomaly": true,
  "infrastructure_layer_analysis": "The traffic lights and stop sign
    appear to be correctly placed, but their functionality is uncertain
    due to the ongoing construction. The construction site itself is a
    potential hazard, as it may obstruct the road or create unexpected
    traffic patterns.",
  "infrastructure_layer_anomaly": true,
  "movable_objects_layer_analysis": "The presence of workers in the lift
    and on the truck near the stop sign is a significant hazard. The
    workers may unexpectedly move into the roadway. The other vehicles
    appear to be operating normally, but their actions may be affected
    by the slippery road conditions.",
  "movable_objects_layer_anomaly": true,
  "environmental_layer_analysis": "The overcast weather and slippery road
    surface reduce visibility and traction, respectively. These conditions
    require increased caution and reduced speed.",
  "environmental_layer_anomaly": false,
  "is_anomaly": true,
  "explanation": "The scene is anomalous due to the combination of the
    slippery road surface, the presence of workers near the roadway, and
    the ongoing construction. The slippery road surface reduces traction
    and increases the risk of skidding. The workers near the roadway pose
    a risk of unexpected movements into the roadway. The construction site
    may obstruct the road or create unexpected traffic patterns. This
    situation is not common and requires the ego vehicle to exercise
    extreme caution, reduce speed, and be prepared to react to
    unexpected events."
}

The PFT pipeline provides detailed, explainable layer-by-layer analysis. Unlike NFT, which dismissed this as a "typical roadwork situation," PFT correctly flags uncertain traffic light functionality, slippery icy road surfaces, and workers in a lift who may unexpectedly move into the roadway.
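The per-layer boolean flags in the Phase 2 output make it straightforward to report which layers drove the verdict. A minimal sketch, using the flags from the example output above (`flagged_layers` is a hypothetical helper):

```python
from typing import Dict, List

LAYERS = ["street", "infrastructure", "movable_objects", "environmental"]


def flagged_layers(report: Dict[str, object]) -> List[str]:
    """Return the layers whose `<layer>_layer_anomaly` flag is true."""
    flagged = []
    for layer in LAYERS:
        key = f"{layer}_layer_anomaly"
        if not isinstance(report.get(key), bool):
            raise ValueError(f"missing or non-boolean field: {key}")
        if report[key]:
            flagged.append(layer)
    return flagged


# Flags copied from the Phase 2 output above; analysis text omitted for brevity.
report = {
    "street_layer_anomaly": True,
    "infrastructure_layer_anomaly": True,
    "movable_objects_layer_anomaly": True,
    "environmental_layer_anomaly": False,
    "is_anomaly": True,
}
flags = flagged_layers(report)
```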

Resources

Datasets

🤗 CODALM-small: 100 manually annotated images (50 anomalous, 50 normal) for model selection

🤗 CODALM-medium: 5,078 human-validated annotations with balanced test subset (1,020 samples)

🤗 CODALM-large: 9,640 images fully labeled using SAVANT; largest semantically-annotated anomaly detection dataset

Models (LoRA Adapters for Qwen2.5-VL-7B)

🤗 Scene Description: Phase 1 structured extraction; generates layered scene descriptions

🤗 Multimodal Evaluation: Phase 2 pipeline model (PFT); layer-by-layer anomaly assessment with explanations

🤗 Anomaly Classifier: Single-shot model (FT); 93.8% accuracy, 90.8% recall, 0.08s inference

Code

Framework Implementation

Coming soon

Citation

@inproceedings{anonymous2026savant,
  title={Can VLMs Unlock Semantic Anomaly Detection? A Framework for Structured Reasoning},
  author={Anonymous},
  booktitle={Under Review},
  year={2026}
}