Autonomous driving systems remain critically vulnerable to the long tail of rare, out-of-distribution semantic anomalies. While vision-language models (VLMs) have emerged as promising tools for perception, their application in anomaly detection remains largely restricted to prompting proprietary models, which limits reliability, reproducibility, and deployment feasibility.
To address this gap, we introduce SAVANT, a novel model-agnostic reasoning framework that reformulates anomaly detection as layered semantic consistency verification. By applying SAVANT's two-phase pipeline (structured scene description extraction followed by multi-modal evaluation), existing VLMs achieve significantly higher scores in detecting anomalous driving scenarios from input images. Our approach replaces ad hoc prompting with semantic-aware reasoning, transforming VLM-based detection into a principled decomposition across four semantic domains: Street, Infrastructure, Movable Objects, and Environment.
We show that across a balanced set of real-world driving scenarios, applying SAVANT improves VLM absolute recall by approximately 18.5% compared to prompting baselines. Moreover, this gain enables reliable large-scale annotation: leveraging the best proprietary model within our framework, we automatically labeled around 10,000 real-world images with high confidence. The resulting high-quality dataset was further used to fine-tune a 7B open-source model (Qwen2.5-VL) to perform single-shot anomaly detection, achieving 90.8% recall and 93.8% accuracy — surpassing all models evaluated while enabling local deployment at near-zero cost.
Unlike conventional out-of-distribution detection that identifies unknown objects, semantic anomalies involve familiar elements in contextually invalid configurations. A traffic light on a truck or a stop sign on a billboard are both composed of common driving objects, but their relationships violate expected contextual constraints.
SAVANT decomposes scene-level anomaly detection into four layers adapted from the 6-Layer Model for urban traffic description: Street, Infrastructure, Movable Objects, and Environment.
Phase 1 — Structured Scene Description Extraction: Four parallel extraction signatures guide the VLM to systematically describe the scene according to each semantic layer, ensuring comprehensive coverage.
Phase 2 — Multi-Modal Scene Evaluation: The VLM receives the original image alongside the structured descriptions for layer-wise anomaly assessment, cross-layer interaction analysis, and final binary classification.
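The two phases can be sketched as a small orchestration routine. This is a minimal illustration, not the authors' implementation: the `Extractor` and `Evaluator` callables stand in for VLM calls, and the bracketed section format used to join the four layer descriptions is an assumption.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

# Layer names follow the four semantic domains named in the text.
LAYERS = ("street", "infrastructure", "movable_objects", "environment")

# Hypothetical callable types standing in for VLM queries.
Extractor = Callable[[bytes], str]        # image -> layer description
Evaluator = Callable[[bytes, str], bool]  # (image, description) -> is_anomaly


def extract_scene_description(image: bytes, extractors: Dict[str, Extractor]) -> str:
    """Phase 1: run one extractor per layer in parallel and join the results."""
    with ThreadPoolExecutor(max_workers=len(LAYERS)) as pool:
        futures = {layer: pool.submit(extractors[layer], image) for layer in LAYERS}
        parts = {layer: future.result() for layer, future in futures.items()}
    # Assumed layout: one bracketed section per layer, in a fixed order.
    return "\n".join(f"[{layer}] {parts[layer]}" for layer in LAYERS)


def detect_anomaly(image: bytes, extractors: Dict[str, Extractor],
                   evaluator: Evaluator) -> bool:
    """Phase 2: evaluate the image together with the structured description."""
    description = extract_scene_description(image, extractors)
    return evaluator(image, description)
```

With stub callables in place of a real model, `detect_anomaly` issues four extraction calls and one evaluation call per image, matching the five queries reported for the full pipeline.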
| Model | Acc. | Rec. | Prec. | F1 | T(s) | Queries |
|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B (NFT) | 0.55 | 0.48 | 0.55 | 0.52 | 1.74 | 1 |
| Qwen2.5-VL-7B (FT) | 0.94 | 0.91 | 0.97 | 0.94 | 0.08 | 1 |
| Qwen2.5-VL-7B (PFT) | 0.84 | 0.82 | 0.85 | 0.83 | 13.85 | 2 |
| Gemini-2.0-FE (full_opt) | 0.88 | 0.90 | 0.86 | 0.88 | 17.8 | 5 |
NFT = Non-Fine-Tuned, FT = Fine-Tuned Single-Shot, PFT = Pipeline Fine-Tuned. T = Inference time.
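For readers who want to check the table columns against each other, the sketch below computes the four scores from confusion-matrix counts. It assumes the standard binary-classification definitions with "anomalous" as the positive class; the counts in the usage note are illustrative and not taken from the evaluation set.

```python
def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def classification_metrics(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Accuracy, recall, precision, and F1 from confusion-matrix counts."""
    total = tp + fp + tn + fn
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        "recall": recall,
        "precision": precision,
        "f1": f1_from(precision, recall),
    }
```

As a consistency check on the fine-tuned row, `round(f1_from(0.97, 0.91), 2)` gives 0.94, which agrees with the reported F1.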
The following sections present the complete architectures, prompts, and implementation details for all evaluation methods described in the paper. These could not be included in the main paper due to space constraints.
Each DSPy Signature has a docstring, InputFields, and OutputFields. DSPy's ChatAdapter combines these into a structured prompt with a system message and a user message. Below we show the full assembled prompts for each method.
Direct VLM prompting with image only, without structured reasoning or layered analysis. Uses BasicAnomalyImageSignature with dspy.ChainOfThought.
System Message
Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle
Your output fields are:
1. `is_anomaly` (bool):
Determine if the scene is anomalous, i.e. an out-of-distribution or highly unusual event, considering:
1. Is this common to see while driving?
2. Does this have an effect on the ego vehicle?
3. Can this influence the ego vehicle's behavior?
4. Can the ego vehicle drive safely in this scenario?
2. `explanation` (str): Explanation of why the scene is or is not anomalous
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## scene_image ## ]]
{scene_image}
[[ ## is_anomaly ## ]]
{is_anomaly} # note: the value you produce must be True or False
[[ ## explanation ## ]]
{explanation}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
You are a fault monitor for a visual autonomous vehicle system.
Your task is to analyze the traffic scene and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
Determine if the scene constitutes a normal observation or an anomaly and explain your reasoning.
Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.
IMPORTANT: You must provide your response in the exact JSON format with ALL required fields:
{
"is_anomaly": true/false,
"explanation": "Final explanation of why the scene is or is not anomalous"
}
User Message
[[ ## scene_image ## ]]
<image>
Respond with the corresponding output fields, starting with the field `[[ ## is_anomaly ## ]]`
(must be formatted as a valid Python bool), then `[[ ## explanation ## ]]`, and then ending
with the marker for `[[ ## completed ## ]]`.
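To make the marker-delimited response format concrete, the following sketch splits a completion on the `[[ ## field ## ]]` markers and converts `is_anomaly` to a Python bool. It is an illustration of the format shown above, not DSPy's own parser; the function names `parse_fields` and `parse_bool` are hypothetical.

```python
import re

# Matches markers such as "[[ ## is_anomaly ## ]]" and captures the field name.
FIELD_RE = re.compile(r"\[\[ ## (\w+) ## \]\]")


def parse_fields(response: str) -> dict:
    """Split a marker-delimited completion into {field_name: raw_text}."""
    # re.split with one capture group yields [prefix, name1, body1, name2, body2, ...].
    pieces = FIELD_RE.split(response)
    fields = {}
    for name, body in zip(pieces[1::2], pieces[2::2]):
        if name != "completed":  # the closing marker carries no value
            fields[name] = body.strip()
    return fields


def parse_bool(text: str) -> bool:
    """Accept only 'True' or 'False' (case-insensitive), as the prompt requires."""
    value = text.strip().lower()
    if value not in ("true", "false"):
        raise ValueError(f"expected True or False, got {text!r}")
    return value == "true"
```

A caller would typically run `parse_fields` on the raw completion and then apply `parse_bool` to the `is_anomaly` entry, treating a `ValueError` as a malformed response.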
Two-phase approach without structured reasoning. Phase 1 extracts an unstructured scene description; Phase 2 classifies based on text only (no image in evaluation).
Uses BasicSceneDescriptionSignature with dspy.ChainOfThought.
System Message
Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle
Your output fields are:
1. `scene_description` (str): A simple description of what is visible in the traffic scene
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## scene_image ## ]]
{scene_image}
[[ ## scene_description ## ]]
{scene_description}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
You are a traffic scene analysis expert for an autonomous vehicle system.
With over 10 years of experience in traffic scene analysis, you specialize in analyzing traffic scenes from the front camera of the ego vehicle.
Your task is to analyze the current traffic scene and provide a comprehensive analysis.
The goal is to capture all details that could affect the ego vehicle's driving behavior or safety.
User Message
[[ ## scene_image ## ]]
<image>
Respond with the corresponding output fields, starting with the field `[[ ## scene_description ## ]]`,
and then ending with the marker for `[[ ## completed ## ]]`.
Uses BasicAnomalyTextSignature with dspy.ChainOfThought. Receives only the text description (no image).
System Message
Your input fields are:
1. `scene_description` (str): The description of the traffic scene
Your output fields are:
1. `is_anomaly` (bool):
Determine if the scene is anomalous, i.e. an out-of-distribution or highly unusual event, considering:
1. Is this common to see while driving?
2. Does this have an effect on the ego vehicle?
3. Can this influence the ego vehicle's behavior?
4. Can the ego vehicle drive safely in this scenario?
2. `explanation` (str): Explanation of why the scene is or is not anomalous
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## scene_description ## ]]
{scene_description}
[[ ## is_anomaly ## ]]
{is_anomaly} # note: the value you produce must be True or False
[[ ## explanation ## ]]
{explanation}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
You are a fault monitor for a visual autonomous vehicle system.
Your task is to analyze the traffic scene and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
Determine if the scene constitutes a normal observation or an anomaly and explain your reasoning.
Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.
User Message
[[ ## scene_description ## ]]
<scene description text from Phase 1>
Respond with the corresponding output fields, starting with the field `[[ ## is_anomaly ## ]]`
(must be formatted as a valid Python bool), then `[[ ## explanation ## ]]`, and then ending
with the marker for `[[ ## completed ## ]]`.
Two-phase approach without structured reasoning. Phase 1 extracts an unstructured scene description; Phase 2 performs multimodal classification using both image and text.
Same as text_baseline Phase 1 (see above).
Uses BasicAnomalySignature with dspy.ChainOfThought. Receives both the scene description and the original image.
System Message
Your input fields are:
1. `scene_description` (str): The description of the traffic scene
2. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle
Your output fields are:
1. `is_anomaly` (bool):
Determine if the scene is anomalous, i.e. an out-of-distribution or highly unusual event, considering:
1. Is this common to see while driving?
2. Does this have an effect on the ego vehicle?
3. Can this influence the ego vehicle's behavior?
4. Can the ego vehicle drive safely in this scenario?
2. `explanation` (str): Explanation of why the scene is or is not anomalous
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## scene_description ## ]]
{scene_description}
[[ ## scene_image ## ]]
{scene_image}
[[ ## is_anomaly ## ]]
{is_anomaly} # note: the value you produce must be True or False
[[ ## explanation ## ]]
{explanation}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
You are a fault monitor for a visual autonomous vehicle system.
Your task is to analyze the traffic scene and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
Determine if the scene constitutes a normal observation or an anomaly and explain your reasoning.
Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.
User Message
[[ ## scene_description ## ]]
<scene description text from Phase 1>
[[ ## scene_image ## ]]
<image>
Respond with the corresponding output fields, starting with the field `[[ ## is_anomaly ## ]]`
(must be formatted as a valid Python bool), then `[[ ## explanation ## ]]`, and then ending
with the marker for `[[ ## completed ## ]]`.
Single-query approach using structured layered prompting. The VLM analyzes the image using the four-layer decomposition. Uses SceneAnomalyImageSignature with dspy.ChainOfThought.
System Message
Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle
Your output fields are:
1. `street_layer_analysis` (str): Analysis of street-level anomalies (topology, geometry, surface, lane markings). Describe if there are any unusual or dangerous road conditions
2. `street_layer_anomaly` (bool)
3. `infrastructure_layer_analysis` (str): Analysis of infrastructure anomalies (traffic lights, signs, cones, barriers, construction). Describe any unusual or problematic infrastructure elements
4. `infrastructure_layer_anomaly` (bool)
5. `movable_objects_layer_analysis` (str): Analysis of movable object anomalies (vehicles, pedestrians, etc.). Describe if there any unusual behaviors or dangerous situations.
6. `movable_objects_layer_anomaly` (bool)
7. `environmental_layer_analysis` (str): Analysis of environmental anomalies (weather, visibility). Describe if conditions are truly out-of-distribution or simply require caution.
8. `environmental_layer_anomaly` (bool)
9. `is_anomaly` (bool): Determine if the scene is anomalous, i.e. an out-of-distribution or highly unusual event, considering:
1. Individual layer anomalies
2. Cross-layer interaction effects
3. Context-specific requirements from semantic layer
4. Criticality of the scene
Finally, provide an overall assessment considering:
1. Is this common to see while driving?
2. Does this have an effect on the ego vehicle?
3. Can this influence the ego vehicle's behavior?
4. Can the ego vehicle drive safely in this scenario?
10. `explanation` (str): Final explanation considering all layers and their interactions
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## scene_image ## ]]
{scene_image}
[[ ## street_layer_analysis ## ]]
{street_layer_analysis}
[[ ## street_layer_anomaly ## ]]
{street_layer_anomaly} # note: the value you produce must be True or False
[[ ## infrastructure_layer_analysis ## ]]
{infrastructure_layer_analysis}
[[ ## infrastructure_layer_anomaly ## ]]
{infrastructure_layer_anomaly} # note: the value you produce must be True or False
[[ ## movable_objects_layer_analysis ## ]]
{movable_objects_layer_analysis}
[[ ## movable_objects_layer_anomaly ## ]]
{movable_objects_layer_anomaly} # note: the value you produce must be True or False
[[ ## environmental_layer_analysis ## ]]
{environmental_layer_analysis}
[[ ## environmental_layer_anomaly ## ]]
{environmental_layer_anomaly} # note: the value you produce must be True or False
[[ ## is_anomaly ## ]]
{is_anomaly} # note: the value you produce must be True or False
[[ ## explanation ## ]]
{explanation}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
You are a fault monitor for a visual autonomous vehicle system.
Your task is to analyze the traffic scene layer by layer and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
For each layer, determine if constitutes a normal observation or an anomaly and explain your reasoning.
Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.
Finally, provide an overall assessment.
User Message
[[ ## scene_image ## ]]
<image>
Respond with the corresponding output fields, starting with the field `[[ ## street_layer_analysis ## ]]`,
then `[[ ## street_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## infrastructure_layer_analysis ## ]]`,
then `[[ ## infrastructure_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## movable_objects_layer_analysis ## ]]`,
then `[[ ## movable_objects_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## environmental_layer_analysis ## ]]`,
then `[[ ## environmental_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## is_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## explanation ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.
Two-phase approach with structured layered analysis. Phase 1 extracts a structured scene description using four parallel layer-specific signatures; Phase 2 performs text-only classification.
Four parallel extraction signatures (one per semantic layer). See Extraction Prompts for complete prompts.
Uses SceneAnomalyTextSignature with dspy.ChainOfThought. Receives only the structured text description (no image).
System Message
Your input fields are:
1. `scene_description` (str): Structured description of the scene
Your output fields are:
1. `street_layer_analysis` (str): Analysis of street-level anomalies (topology, geometry, surface, lane markings). Describe if there are any unusual or dangerous road conditions
2. `street_layer_anomaly` (bool)
3. `infrastructure_layer_analysis` (str): Analysis of infrastructure anomalies (traffic lights, signs, cones, barriers, construction). Describe any unusual or problematic infrastructure elements
4. `infrastructure_layer_anomaly` (bool)
5. `movable_objects_layer_analysis` (str): Analysis of movable object anomalies (vehicles, pedestrians, etc.). Describe if there any unusual behaviors or dangerous situations.
6. `movable_objects_layer_anomaly` (bool)
7. `environmental_layer_analysis` (str): Analysis of environmental anomalies (weather, visibility). Describe if conditions are truly out-of-distribution or simply require caution.
8. `environmental_layer_anomaly` (bool)
9. `is_anomaly` (bool): Determine if the scene is anomalous, i.e. an out-of-distribution or highly unusual event, considering:
1. Individual layer anomalies
2. Cross-layer interaction effects
3. Context-specific requirements from semantic layer
4. Criticality of the scene
Finally, provide an overall assessment considering:
1. Is this common to see while driving?
2. Does this have an effect on the ego vehicle?
3. Can this influence the ego vehicle's behavior?
4. Can the ego vehicle drive safely in this scenario?
10. `explanation` (str): Final explanation considering all layers and their interactions
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## scene_description ## ]]
{scene_description}
[[ ## street_layer_analysis ## ]]
{street_layer_analysis}
[[ ## street_layer_anomaly ## ]]
{street_layer_anomaly} # note: the value you produce must be True or False
[[ ## infrastructure_layer_analysis ## ]]
{infrastructure_layer_analysis}
[[ ## infrastructure_layer_anomaly ## ]]
{infrastructure_layer_anomaly} # note: the value you produce must be True or False
[[ ## movable_objects_layer_analysis ## ]]
{movable_objects_layer_analysis}
[[ ## movable_objects_layer_anomaly ## ]]
{movable_objects_layer_anomaly} # note: the value you produce must be True or False
[[ ## environmental_layer_analysis ## ]]
{environmental_layer_analysis}
[[ ## environmental_layer_anomaly ## ]]
{environmental_layer_anomaly} # note: the value you produce must be True or False
[[ ## is_anomaly ## ]]
{is_anomaly} # note: the value you produce must be True or False
[[ ## explanation ## ]]
{explanation}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
You are a fault monitor for a visual autonomous vehicle system.
Your task is to analyze the traffic scene layer by layer and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
For each layer, determine if constitutes a normal observation or an anomaly and explain your reasoning.
Normal observtions do not detrimentally affect the vehicle's performance, whereas anomalies might.
Finally, provide an overall assessment.
User Message
[[ ## scene_description ## ]]
<structured scene description from extraction step>
Respond with the corresponding output fields, starting with the field `[[ ## street_layer_analysis ## ]]`,
then `[[ ## street_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## infrastructure_layer_analysis ## ]]`,
then `[[ ## infrastructure_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## movable_objects_layer_analysis ## ]]`,
then `[[ ## movable_objects_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## environmental_layer_analysis ## ]]`,
then `[[ ## environmental_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## is_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## explanation ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.
Complete SAVANT pipeline with multimodal structured evaluation. Phase 1 extracts structured scene descriptions; Phase 2 performs classification using both image and layered text. Uses SceneAnomalySignature with dspy.ChainOfThought.
Four parallel extraction signatures (one per semantic layer). See Extraction Prompts for complete prompts.
System Message
Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle
2. `scene_description` (str): Structured description of the scene
Your output fields are:
1. `street_layer_analysis` (str): Analysis of street-level anomalies (topology, geometry, surface, lane markings). Describe if there are any unusual or dangerous road conditions
2. `street_layer_anomaly` (bool)
3. `infrastructure_layer_analysis` (str): Analysis of infrastructure anomalies (traffic lights, signs, cones, barriers, construction). Describe any unusual or problematic infrastructure elements
4. `infrastructure_layer_anomaly` (bool)
5. `movable_objects_layer_analysis` (str): Analysis of movable object anomalies (vehicles, pedestrians, etc.). Describe if there any unusual behaviors or dangerous situations.
6. `movable_objects_layer_anomaly` (bool)
7. `environmental_layer_analysis` (str): Analysis of environmental anomalies (weather, visibility). Describe if conditions are truly out-of-distribution or simply require caution.
8. `environmental_layer_anomaly` (bool)
9. `is_anomaly` (bool): Determine if the scene is anomalous, i.e. an out-of-distribution or highly unusual event, considering:
1. Individual layer anomalies
2. Cross-layer interaction effects
3. Context-specific requirements from semantic layer
4. Criticality of the scene
Finally, provide an overall assessment considering:
1. Is this common to see while driving?
2. Does this have an effect on the ego vehicle?
3. Can this influence the ego vehicle's behavior?
4. Can the ego vehicle drive safely in this scenario?
10. `explanation` (str): Final explanation considering all layers and their interactions
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## scene_image ## ]]
{scene_image}
[[ ## scene_description ## ]]
{scene_description}
[[ ## street_layer_analysis ## ]]
{street_layer_analysis}
[[ ## street_layer_anomaly ## ]]
{street_layer_anomaly} # note: the value you produce must be True or False
[[ ## infrastructure_layer_analysis ## ]]
{infrastructure_layer_analysis}
[[ ## infrastructure_layer_anomaly ## ]]
{infrastructure_layer_anomaly} # note: the value you produce must be True or False
[[ ## movable_objects_layer_analysis ## ]]
{movable_objects_layer_analysis}
[[ ## movable_objects_layer_anomaly ## ]]
{movable_objects_layer_anomaly} # note: the value you produce must be True or False
[[ ## environmental_layer_analysis ## ]]
{environmental_layer_analysis}
[[ ## environmental_layer_anomaly ## ]]
{environmental_layer_anomaly} # note: the value you produce must be True or False
[[ ## is_anomaly ## ]]
{is_anomaly} # note: the value you produce must be True or False
[[ ## explanation ## ]]
{explanation}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
You are a fault monitor for a visual autonomous vehicle system.
Your task is to analyze the traffic scene layer by layer and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
For each layer, determine if constitutes a normal observation or an anomaly and explain your reasoning.
Normal observtions do not detrimentally affect the vehicle's performance, whereas anomalies might.
Finally, provide an overall assessment.
User Message
[[ ## scene_image ## ]]
<image>
[[ ## scene_description ## ]]
<structured scene description from extraction step>
Respond with the corresponding output fields, starting with the field `[[ ## street_layer_analysis ## ]]`,
then `[[ ## street_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## infrastructure_layer_analysis ## ]]`,
then `[[ ## infrastructure_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## movable_objects_layer_analysis ## ]]`,
then `[[ ## movable_objects_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## environmental_layer_analysis ## ]]`,
then `[[ ## environmental_layer_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## is_anomaly ## ]]` (must be formatted as a valid Python bool),
then `[[ ## explanation ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.
Optimized versions of text and full using DSPy MIPROv2 prompt optimization. The optimizer modifies the Phase 2 evaluation prompts in two ways: (1) rewritten instructions tailored to each model, and (2) few-shot demonstrations selected from training data. Only Phase 2 (evaluation) is optimized; the extraction phase uses the same base prompts.
Optimization setup: dspy.MIPROv2 with num_candidates=15, init_temperature=0.3, num_trials=25, minibatch_size=8, max_bootstrapped_demos=5, max_labeled_demos=6. Training data: ~103 anomalous + 21 normal scenes. Metric: binary classification accuracy.
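The accuracy metric named above could look like the sketch below. It follows the usual DSPy metric convention of a function taking `(example, pred, trace=None)`; the assumption that the gold label is stored under `is_anomaly` on each example, and the helper name `accuracy`, are illustrative rather than taken from the paper's code.

```python
def anomaly_accuracy_metric(example, pred, trace=None) -> bool:
    """Per-example metric: True when the predicted label matches the gold label.

    Assumes both objects expose an `is_anomaly` attribute (hypothetical field
    name for the gold label; it matches the output field of the signatures).
    """
    return bool(pred.is_anomaly) == bool(example.is_anomaly)


def accuracy(examples, predictions) -> float:
    """Fraction of examples on which the per-example metric is satisfied."""
    if not examples:
        return 0.0
    hits = sum(anomaly_accuracy_metric(e, p) for e, p in zip(examples, predictions))
    return hits / len(examples)
```

Because the training split is dominated by anomalous scenes, a metric of this form would reward a classifier that always predicts "anomalous" with high accuracy, which is worth keeping in mind when reading optimizer results.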
| Model | Optimizer | Instruction Changes | Demos |
|---|---|---|---|
| Qwen2.5-VL-7B | text | Minimal (minor rewording) | 6 |
| Qwen2.5-VL-7B | image/full | Added per-layer bullet points | 6 |
| Qwen2.5-VL-3B | text | Unchanged | 0 |
| Qwen2.5-VL-3B | image/full | Unchanged | 0 |
| Qwen2.5-VL-32B | text | Unchanged | 6 |
| Qwen2.5-VL-32B | image/full | Unchanged | 0 |
| Qwen2.5-VL-72B | text | Completely rewritten ("guardian angel") | 6 |
| Qwen2.5-VL-72B | image/full | Unchanged | 0 |
| Gemini-2.0-FE | text | Completely rewritten ("vigilant fault monitor") | 4 |
| Gemini-2.0-FE | image/full | Completely rewritten ("sophisticated fault monitor") | 4 |
You are tasked with serving as a vigilant fault monitor for a sophisticated autonomous vehicle
system. Your core responsibility involves meticulously analyzing a traffic scene, dissecting it
into distinct layers -- street, infrastructure, movable objects, and environmental -- to pinpoint any
potential anomalies that could jeopardize vehicle safety, induce unsafe maneuvers, or lead to
violations of traffic regulations.
Each layer demands a thorough scrutiny:
1. **Street Layer**: Assess the road's geometry, topology, surface condition, and lane markings.
Consider whether the layout, including any merging lanes, splits, or bottlenecks, presents
challenges. Evaluate the road surface's integrity and the clarity of lane markings, factoring
in any disruptions caused by construction or wear.
2. **Infrastructure Layer**: Investigate traffic control devices such as lights, signs, cones,
and barriers. Determine if their placement and condition adhere to standards, ensuring they
effectively guide traffic. Take note of any construction sites, evaluating their legitimacy
and the adequacy of their demarcation.
3. **Movable Objects Layer**: Observe vehicles, pedestrians, and other dynamic entities within
the scene. Analyze their behavior and positioning relative to the ego vehicle, identifying
any unusual movements or placements that could signal danger.
4. **Environmental Layer**: Gauge the impact of weather conditions, visibility, and lighting
on the driving environment. Consider how these factors may affect the vehicle's perception
and maneuverability.
For each layer, discern whether the observations constitute a normal scenario conducive to safe
driving or an anomaly that could compromise safety. Provide a reasoned explanation for your
determination, elucidating the specific elements that support your conclusion.
Ultimately, synthesize your findings into an overarching assessment. Reflect on the interplay
between the layers, considering how anomalies in one area might exacerbate issues in another.
Evaluate the scene's commonality, its implications for the ego vehicle's performance, and the
necessity for altering the vehicle's behavior. Conclude with a definitive statement on the
scene's overall safety and the vehicle's ability to navigate it securely.
You are a sophisticated fault monitor for a visual autonomous vehicle system, tasked with
meticulously analyzing traffic scenes to ensure the safety and optimal performance of the vehicle.
Your role is to dissect the scene into four distinct layers -- street, infrastructure, movable objects,
and environmental -- and conduct a thorough examination for any anomalies that could jeopardize vehicle
safety, lead to unsafe actions, or result in traffic rule violations.
For each layer, your analysis should be comprehensive and detailed:
- **Street Layer:** Assess the road's geometry, topology, surface condition, and lane markings.
Determine if there are any unusual road conditions, such as unexpected lane closures, irregular
surface textures, or unclear markings that could confuse the vehicle.
- **Infrastructure Layer:** Evaluate the state of traffic lights, signs, cones, barriers, and
construction sites. Identify any infrastructure elements that are out of place, damaged, or
improperly set up, which could mislead the vehicle or create hazardous situations.
- **Movable Objects Layer:** Analyze the behavior and positioning of vehicles, pedestrians, and
other dynamic elements. Look for any unusual movements, unexpected stops, or erratic actions
that could pose a threat to the vehicle's safety.
- **Environmental Layer:** Consider the weather, visibility, and lighting conditions. Determine
if the environmental factors are within normal ranges or if they present challenges that could
affect the vehicle's performance, such as reduced visibility due to fog or glare from streetlights.
For each layer, clearly state whether the observations are normal or anomalous, and provide a
detailed explanation for your reasoning. Normal observations should not negatively impact the
vehicle's performance, while anomalies could potentially cause issues.
After analyzing each layer, synthesize your findings into an overall assessment. Consider the
interactions between layers, the context-specific requirements of the scene, and the criticality
of the observed conditions. Provide a final explanation that summarizes the safety and potential
impact on the vehicle's behavior, addressing the following questions:
1. Is this a common occurrence while driving?
2. Does this have an effect on the ego vehicle?
3. Can this influence the ego vehicle's behavior?
4. Can the ego vehicle drive safely in this scenario?
Your analysis should be highly informative and descriptive, ensuring that all potential hazards
are identified and explained in detail.
You are the guardian angel of an autonomous vehicle navigating through complex and ever-changing
traffic scenes. Your mission is to meticulously dissect each scene into its fundamental
components -- street, infrastructure, movable objects, and environmental factors -- and scrutinize them
for any deviations from the norm that could jeopardize the vehicle's safety, operational integrity,
or adherence to traffic laws.
**Your Role:**
- **Layered Detective:** Delve deep into the intricacies of each scene layer. For the street layer,
investigate the road's geometry, topology, surface condition, and lane markings. In the
infrastructure layer, examine traffic lights, signs, cones, barriers, and construction sites.
Assess the behavior of vehicles, pedestrians, and other dynamic entities in the movable objects
layer. Lastly, consider the environmental factors like weather, visibility, and lighting
conditions in the environmental layer.
- **Anomaly Hunter:** Identify and flag any anomalies within each layer. Determine if what you
observe is a normal occurrence or a potential hazard. Provide a rationale for your judgment,
explaining why certain observations are deemed anomalies and how they could impact the vehicle's
performance.
- **Synthesizer of Truth:** After analyzing each layer, synthesize your findings to deliver an
overarching verdict on the scene's anomaly status. Consider the interplay between layers, the
broader context, and the criticality of the situation. Offer a conclusive explanation that
encapsulates your reasoning, guiding the autonomous vehicle towards safe and informed
decision-making.
**Your Task:**
Given a structured description of a traffic scene, conduct a thorough analysis of each layer,
pinpoint anomalies, and provide a comprehensive assessment. Your insights will empower the vehicle
to navigate the world with precision and safety.
| Method | Layered Analysis | Image in Phase 2 | DSPy Opt. | Queries |
|---|---|---|---|---|
| image_baseline | No | Yes (single) | No | 1 |
| text_baseline | No | No | No | 2 |
| baseline | No | Yes | No | 2 |
| image | Yes | Yes (single) | No | 1 |
| text | Yes | No | No | 5 |
| full | Yes | Yes | No | 5 |
| text_opt | Yes | No | Yes | 5 |
| full_opt | Yes | Yes | Yes | 5 |
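The query counts in the table can be reproduced with a small sketch. It rests on one reading of the table, not on a rule stated in the text: single-query variants issue one call, and two-phase variants issue one extraction call per layer (four when layered, one otherwise) plus one evaluation call. The `MethodConfig` class and `query_count` function are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MethodConfig:
    layered: bool        # "Layered Analysis" column
    image_phase2: bool   # "Image in Phase 2" column
    single_query: bool   # True for the rows marked "(single)"
    dspy_opt: bool       # "DSPy Opt." column


METHODS = {
    "image_baseline": MethodConfig(False, True, True, False),
    "text_baseline": MethodConfig(False, False, False, False),
    "baseline": MethodConfig(False, True, False, False),
    "image": MethodConfig(True, True, True, False),
    "text": MethodConfig(True, False, False, False),
    "full": MethodConfig(True, True, False, False),
    "text_opt": MethodConfig(True, False, False, True),
    "full_opt": MethodConfig(True, True, False, True),
}


def query_count(cfg: MethodConfig) -> int:
    """Assumed rule: 1 call if single-query, else extraction calls + 1 evaluation call."""
    if cfg.single_query:
        return 1
    extraction_calls = 4 if cfg.layered else 1
    return extraction_calls + 1


counts = {name: query_count(cfg) for name, cfg in METHODS.items()}
```

Under this assumption the computed counts match the Queries column for all eight rows.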
The text and full methods use four parallel extraction signatures in Phase 1, one per semantic layer. Each signature extracts structured information from the image.
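As a rough illustration of the parallel Phase 1 step, the sketch below fans one image out to four per-layer extractors using a thread pool. The extractor callables are stand-in stubs rather than the DSPy modules used in this work, and the names `extract_layers` and `make_stub` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict

LAYERS = ("street", "infrastructure", "movable_objects", "environmental")


def extract_layers(
    image: bytes, extractors: Dict[str, Callable[[bytes], dict]]
) -> Dict[str, dict]:
    """Run one extractor per semantic layer concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=len(LAYERS)) as pool:
        futures = {layer: pool.submit(extractors[layer], image) for layer in LAYERS}
        # .result() blocks until the corresponding extractor has finished.
        return {layer: future.result() for layer, future in futures.items()}


def make_stub(layer: str) -> Callable[[bytes], dict]:
    """Build a placeholder extractor; a real one would query a VLM."""
    def stub(image: bytes) -> dict:
        return {"layer": layer, "num_bytes": len(image)}
    return stub


stub_extractors = {layer: make_stub(layer) for layer in LAYERS}
result = extract_layers(b"\x00\x01", stub_extractors)
```

Threads suit this step because each extraction is an I/O-bound model call; the four results are keyed by layer so they can be merged into a single scene description for Phase 2.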
StreetLayerSignature: Extracts topology, geometry, surface condition, and lane markings. Uses StreetLayerAnalyzer with dspy.ChainOfThought.
System Message
Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle
Your output fields are:
1. `topology` (str): Describe the street network layout or structure.
For example: road type (highway, urban, residential), intersections (T-junction, roundabout),
number of carriageways, connections (intersections, merges, splits), etc.
2. `geometry` (str): Describe the geometry of the road, including number of lanes, lane types
(driving, turning, bus, bike etc.), lane widths, road curvature (straight, curved, intersection, etc.),
road gradient (flat, uphill, downhill, etc.), and any geometric features like merging lanes, splits,
or bottlenecks
3. `surface` (str): The road surface characteristics including material type (asphalt, concrete, gravel, etc.),
surface condition (dry, wet, icy, snowy etc.), surface quality (smooth, rough, damaged etc.), and any
hazards or anomalies (potholes, cracks, debris, speed bumps, etc.)
4. `lane_markings` (str): The characteristics and condition of lane markings, including marking types
(solid, dashed, double, yellow, white), marking visibility (clear, faded, obscured), special markings
(arrows, crosswalks, stop lines), and any irregularities or temporary markings
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## scene_image ## ]]
{scene_image}
[[ ## topology ## ]]
{topology}
[[ ## geometry ## ]]
{geometry}
[[ ## surface ## ]]
{surface}
[[ ## lane_markings ## ]]
{lane_markings}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
You are a traffic scene analysis expert for an autonomous vehicle system.
With over 10 years of experience in traffic scene analysis, you specialize in analyzing traffic scenes from the front camera of the ego vehicle.
Your task is to analyze the current traffic scene strictly based on the street level and provide a comprehensive analysis.
The street level analysis includes:
- Topology: Overall street network structure (e.g., road type, intersections)
- Geometry: Road geometry details (e.g., lane count, curvature, gradient)
- Surface: Road surface material, condition and potential hazards
- Lane markings: Quality and visibility of lane markings, including any irregularities
The goal is to capture all street-related details that could affect the ego vehicle's driving behavior or safety.
User Message
[[ ## scene_image ## ]]
<image>
Respond with the corresponding output fields, starting with the field `[[ ## topology ## ]]`,
then `[[ ## geometry ## ]]`, then `[[ ## surface ## ]]`, then `[[ ## lane_markings ## ]]`,
and then ending with the marker for `[[ ## completed ## ]]`.
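The `[[ ## field ## ]]` markers make the reply straightforward to split into named fields. The following is a minimal sketch of such a parser, written independently of DSPy's own adapter code; `parse_fields` and the sample reply are illustrative assumptions.

```python
import re
from typing import Dict

# Matches a marker line such as "[[ ## topology ## ]]".
FIELD_MARKER = re.compile(r"^\[\[ ## (\w+) ## \]\]\s*$", re.MULTILINE)


def parse_fields(response: str) -> Dict[str, str]:
    """Split a marker-delimited reply into {field_name: text}."""
    fields: Dict[str, str] = {}
    matches = list(FIELD_MARKER.finditer(response))
    for i, match in enumerate(matches):
        name = match.group(1)
        if name == "completed":  # terminal marker carries no content
            break
        end = matches[i + 1].start() if i + 1 < len(matches) else len(response)
        fields[name] = response[match.end():end].strip()
    return fields


sample = (
    "[[ ## topology ## ]]\nUrban four-way intersection.\n"
    "[[ ## geometry ## ]]\nTwo lanes, straight, flat.\n"
    "[[ ## surface ## ]]\nAsphalt, snow-covered.\n"
    "[[ ## lane_markings ## ]]\nObscured by snow.\n"
    "[[ ## completed ## ]]\n"
)
parsed = parse_fields(sample)
```

The same function applies unchanged to the other three layer signatures, since they differ only in field names.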
InfrastructureLayerSignature: Extracts traffic lights, signs, cones, barriers, and construction sites. Uses InfrastructureLayerAnalyzer with dspy.ChainOfThought.
System Message
Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle
Your output fields are:
1. `traffic_lights` (str): Describe any traffic lights or objects mistaken for traffic lights.
Include position, state (green, red, flashing) and whether they seem correctly placed or attached to
unusual objects. Return 'None' if no traffic lights are present.
2. `traffic_signs` (str): Describe any traffic signs or potential lookalikes (e.g., signs on billboards, defaced or
damaged signs). Include sign type, visibility, placement, and note if they appear
normal or if anything seems odd. Return 'None' if no traffic signs are present.
3. `traffic_cones` (str): Describe traffic cones, including their quantity, positions (e.g., placed on lane markings, shoulders),
and any unusual cases (e.g., objects that look like cones but aren't). Indicate if placement is typical
(construction zone) or out of place. Return 'None' if no traffic cones are present.
4. `traffic_barriers` (str): Describe any traffic barriers, noting their position, condition and whether they appear damaged,
misplaced, or used in unexpected ways. Return 'None' if no traffic barriers are present.
5. `construction_sites` (str): Describe any construction or maintenance areas, including signs of ongoing work, equipment,
or partial road closures. Note if items appear legitimate or incorrectly placed
(e.g., unmarked hazards, open pits). Return 'None' if no construction sites are present.
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## scene_image ## ]]
{scene_image}
[[ ## traffic_lights ## ]]
{traffic_lights}
[[ ## traffic_signs ## ]]
{traffic_signs}
[[ ## traffic_cones ## ]]
{traffic_cones}
[[ ## traffic_barriers ## ]]
{traffic_barriers}
[[ ## construction_sites ## ]]
{construction_sites}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
You are a traffic scene analysis expert for an autonomous vehicle system.
With over 10 years of experience in traffic scene analysis, you specialize in analyzing traffic scenes from the front camera of the ego vehicle.
Your task is to analyze the current traffic scene strictly based on the infrastructure level and provide a comprehensive analysis.
The infrastructure level analysis includes:
- Traffic lights
- Traffic signs
- Traffic cones
- Traffic barriers
- Construction sites
Focus on:
- Normal infrastructure (traffic lights, traffic signs, cones, barriers, construction areas)
- Incorrect or unusual placements and positioning (e.g., traffic signs on billboards, traffic lights carried by vehicles)
- Damaged or defaced infrastructure (graffiti on signs, bent or obscured signs)
- Objects that might be misinterpreted as infrastructure (e.g., balloons mistaken for traffic cones, moon for traffic lights)
Note their presence, placement, and condition -- especially if something seems out-of-place or anomalous.
If something is unclear or partially visible, indicate that.
The goal is to capture all infrastructure-related details that could affect the ego vehicle's driving behavior or safety.
User Message
[[ ## scene_image ## ]]
<image>
Respond with the corresponding output fields, starting with the field `[[ ## traffic_lights ## ]]`,
then `[[ ## traffic_signs ## ]]`, then `[[ ## traffic_cones ## ]]`, then `[[ ## traffic_barriers ## ]]`,
then `[[ ## construction_sites ## ]]`, and then ending with the marker for `[[ ## completed ## ]]`.
MovableObjectsLayerSignature: Extracts vehicles, pedestrians, and other dynamic objects. Uses MovableObjectsLayerAnalyzer with dspy.ChainOfThought.
System Message
Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle
Your output fields are:
1. `vehicles` (str): Describe each visible vehicle (car, truck, bus, motorcycle, bicycle, scooter etc.)
its approximate position and orientation to the ego vehicle and any unusual features, anomalies or
out-of-place behavior. Return 'None' if no vehicles are present.
2. `pedestrians` (str): Describe any visible pedestrians, including their position relative to the ego vehicle
and any odd or risky behavior. Return 'None' if no pedestrians are present.
3. `others` (str): Describe any other dynamic, moving or potentially moving objects (animals, scooters,
unusual objects, etc.) and any anomalous or out-of-place behavior of such objects in the scene.
Return 'None' if no other objects are present.
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## scene_image ## ]]
{scene_image}
[[ ## vehicles ## ]]
{vehicles}
[[ ## pedestrians ## ]]
{pedestrians}
[[ ## others ## ]]
{others}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
You are a traffic scene analysis expert for an autonomous vehicle system.
With over 10 years of experience in traffic scene analysis, you specialize in analyzing traffic scenes from the front camera of the ego vehicle.
Your task is to analyze the current traffic scene strictly based on the movable (dynamic) objects level and provide a comprehensive analysis.
The movable objects level analysis includes:
- Vehicles
- Pedestrians
- Others (e.g., animals, bicycles, shopping carts, any other moving objects)
Focus on:
- Object type
- Approximate location and orientation relative to the ego vehicle
- Any observable features, potential hazards, anomalies or out-of-place behavior for each object
Since only one frame is available, avoid making strong assumptions about movement.
The goal is to capture all movable objects-related details that could affect the ego vehicle's driving behavior or safety.
User Message
[[ ## scene_image ## ]]
<image>
Respond with the corresponding output fields, starting with the field `[[ ## vehicles ## ]]`,
then `[[ ## pedestrians ## ]]`, then `[[ ## others ## ]]`,
and then ending with the marker for `[[ ## completed ## ]]`.
EnvironmentalLayerSignature: Extracts weather, visibility, and lighting conditions. Uses EnvironmentalLayerAnalyzer with dspy.ChainOfThought.
System Message
Your input fields are:
1. `scene_image` (Image): The base64 encoded image of the traffic scene from the front camera of the ego vehicle
Your output fields are:
1. `weather` (str): Describe the present weather conditions (e.g., light rain, heavy snow, dense fog)
Include intensity if relevant (light, moderate, heavy).
2. `visibility` (str): Select one of four visibility levels:
'clear' (no obstruction), 'light' (slight reduction),
'moderate' (noticeable but still manageable), or
'severe' (heavily impaired, only if background or roadside
features are not clearly visible).
3. `lighting` (str): Describe lighting conditions, such as daylight, dusk, night, or glare. Include any impact on
visibility (e.g., sun glare causing partial obstruction).
All interactions will be structured in the following way, with the appropriate values filled in.
[[ ## scene_image ## ]]
{scene_image}
[[ ## weather ## ]]
{weather}
[[ ## visibility ## ]]
{visibility}
[[ ## lighting ## ]]
{lighting}
[[ ## completed ## ]]
In adhering to this structure, your objective is:
You are a traffic scene analysis expert for an autonomous vehicle system.
With over 10 years of experience in traffic scene analysis, you specialize in analyzing traffic scenes from the front camera of the ego vehicle.
Your task is to analyze the current traffic scene strictly based on the environmental level and provide a comprehensive analysis.
The environmental level analysis includes:
- Weather: Description of the weather conditions (e.g., sunny, light rain, heavy snow, dense fog)
- Visibility (clear, light, moderate, severe)
- Lighting (daylight, dusk, night, glare)
Provide enough detail to capture conditions that might affect driving safety or sensor performance.
Distinguish normal weather conditions from unusual, truly severe, or out-of-distribution scenarios.
Avoid labeling normal conditions as anomalies unless they are truly severe.
The goal is to capture all environmental-related details that could affect the ego vehicle's driving behavior or safety.
User Message
[[ ## scene_image ## ]]
<image>
Respond with the corresponding output fields, starting with the field `[[ ## weather ## ]]`,
then `[[ ## visibility ## ]]`, then `[[ ## lighting ## ]]`,
and then ending with the marker for `[[ ## completed ## ]]`.
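Because `visibility` is declared as a plain string but the prompt restricts it to four levels, downstream code may want to validate the value. The sketch below shows one possible check; the function name and the tolerance for quotes, case, and a trailing period are assumptions, not behavior described in this work.

```python
VISIBILITY_LEVELS = ("clear", "light", "moderate", "severe")


def normalize_visibility(raw: str) -> str:
    """Map a model reply onto one of the four allowed visibility levels."""
    # Tolerate surrounding whitespace, quotes, a trailing period, and casing.
    cleaned = raw.strip().strip("'\".").lower()
    if cleaned not in VISIBILITY_LEVELS:
        raise ValueError(f"unexpected visibility level: {raw!r}")
    return cleaned


level = normalize_visibility(" 'Light'. ")
severity_rank = VISIBILITY_LEVELS.index(level)  # 0 = clear ... 3 = severe
```

Rejecting out-of-vocabulary values early keeps free-text answers such as "foggy" from silently reaching the Phase 2 evaluation.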
We leverage DSPy for systematic prompt optimization and pipeline composition.
import dspy


class SceneDescription(dspy.Signature):
    """Extract structured scene description."""

    image: dspy.Image = dspy.InputField()
    street_layer: str = dspy.OutputField()
    infrastructure_layer: str = dspy.OutputField()
    movable_objects_layer: str = dspy.OutputField()
    environmental_layer: str = dspy.OutputField()


class AnomalyEvaluation(dspy.Signature):
    """Evaluate scene for semantic anomalies."""

    image: dspy.Image = dspy.InputField()
    description: str = dspy.InputField()
    classification: str = dspy.OutputField()
    confidence: float = dspy.OutputField()
    reasoning: str = dspy.OutputField()


class SAVANTPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        self.describe = dspy.ChainOfThought(SceneDescription)
        self.evaluate = dspy.ChainOfThought(AnomalyEvaluation)

    def forward(self, image):
        desc = self.describe(image=image)
        full_desc = f"Street: {desc.street_layer}\n"
        full_desc += f"Infra: {desc.infrastructure_layer}\n"
        full_desc += f"Objects: {desc.movable_objects_layer}\n"
        full_desc += f"Env: {desc.environmental_layer}"
        return self.evaluate(image=image, description=full_desc)
from dspy.teleprompt import MIPROv2


def anomaly_metric(example, prediction, trace=None):
    return 1.0 if example.label == prediction.classification else 0.0


optimizer = MIPROv2(
    metric=anomaly_metric,
    num_candidates=10,
    init_temperature=1.0,
    num_threads=4,
)

optimized_pipeline = optimizer.compile(
    SAVANTPipeline(),
    trainset=train_examples,
    max_bootstrapped_demos=4,
    max_labeled_demos=8,
)
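The metric above scores each example by exact label match, which averages to accuracy over a dataset. Since recall is also reported, a small sketch of computing both from label lists may be useful. The label strings "anomaly" and "normal" follow the single-shot prompt shown later; the helper name and the toy data are illustrative assumptions.

```python
from typing import Sequence, Tuple


def recall_and_accuracy(
    labels: Sequence[str], predictions: Sequence[str], positive: str = "anomaly"
) -> Tuple[float, float]:
    """Recall on the positive class and overall accuracy."""
    pairs = list(zip(labels, predictions))
    true_pos = sum(1 for y, p in pairs if y == positive and p == positive)
    false_neg = sum(1 for y, p in pairs if y == positive and p != positive)
    correct = sum(1 for y, p in pairs if y == p)
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    accuracy = correct / len(pairs) if pairs else 0.0
    return recall, accuracy


# Toy data: one of two anomalies is missed, both normal scenes are correct.
labels = ["anomaly", "anomaly", "normal", "normal"]
preds = ["anomaly", "normal", "normal", "normal"]
recall, accuracy = recall_and_accuracy(labels, preds)  # 0.5 and 0.75
```

Recall is the more safety-relevant of the two here, because a missed anomaly (false negative) is costlier than a false alarm.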
We present example outputs from each model variant using Qwen2.5-VL-7B, ordered as: NFT (base model), FT (fine-tuned single-shot), and PFT (pipeline fine-tuned). We use the same image — not present in the CODA dataset — across all three methods to demonstrate how SAVANT improves detection performance.
You are a fault monitor for a visual autonomous vehicle system.
Your task is to analyze the traffic scene and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
Determine if the scene constitutes a normal observation or an anomaly and explain your reasoning.
Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.
IMPORTANT: You must provide your response in the exact JSON format with ALL required fields:
{
"is_anomaly": true/false,
"explanation": "Final explanation of why the scene is or is not anomalous"
}
Analyze the provided traffic scene image for anomalies. Consider:
1. Is there anything unusual about this situation?
2. Does this have an effect on the ego vehicle?
3. Can this influence the ego vehicle's behavior?
4. Can the ego vehicle drive safely in this scenario?
{
"is_anomaly": false,
"explanation": "The scene depicts a typical roadwork situation where
workers are performing maintenance on traffic signals. There are no
unusual elements that would affect the safety of the ego vehicle.
The workers are at a safe distance from the road. The ego vehicle
can safely proceed through the intersection as there are no immediate
hazards or obstructions."
}
The base model fails to detect the anomaly. It dismisses the workers in the lift as a "typical roadwork situation" and incorrectly concludes the ego vehicle can safely proceed — not accounting for potentially non-functional traffic lights, collision risk due to temporary lack of signalization, and icy road conditions at the intersection.
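Replies in this JSON format can be checked programmatically before use. The sketch below is one possible parser, not the code used in this work: it assumes the reply may contain text around the JSON object and that string values may be wrapped across lines, which Python's `json` module accepts only with `strict=False`. The name `parse_verdict` is hypothetical.

```python
import json


def parse_verdict(reply: str) -> dict:
    """Extract and validate the {"is_anomaly", "explanation"} object from a reply."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in reply")
    # strict=False permits raw newlines inside JSON string values.
    data = json.loads(reply[start:end + 1], strict=False)
    if not isinstance(data.get("is_anomaly"), bool):
        raise ValueError("'is_anomaly' must be a JSON boolean")
    if not isinstance(data.get("explanation"), str):
        raise ValueError("'explanation' must be a string")
    return data


reply = (
    "Here is my assessment:\n"
    '{\n"is_anomaly": false,\n'
    '"explanation": "The scene depicts a typical roadwork\nsituation."\n}'
)
verdict = parse_verdict(reply)
```

Validating the boolean type explicitly guards against replies that encode the verdict as a string such as "false", which would otherwise be truthy in Python.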
You are a fault monitor for a visual autonomous vehicle system.
Your task is to analyze the traffic scene layer by layer and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
For each layer, determine if it constitutes a normal observation or an anomaly and explain your reasoning.
Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.
Finally, provide an overall assessment.
Analyze this traffic scene image and determine if it contains any anomalies that could affect vehicle safety, cause unsafe actions, or violate traffic rules.
Consider all layers:
- Street Layer: Road geometry, topology, surface, lane markings
- Infrastructure Layer: Traffic lights, signs, cones, barriers, construction
- Movable Objects Layer: Vehicles, pedestrians, dynamic entities
- Environmental Layer: Weather, visibility, lighting
Respond with only: "anomaly" or "normal"
anomaly
The fine-tuned model has internalized the structured analysis from SAVANT and correctly classifies the scene as anomalous with minimal latency. However, it cannot explain why — for interpretability, we rely on PFT. A practical deployment could run FT at 10+ FPS for real-time monitoring, engaging PFT on-demand for explanations when anomalies are detected.
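The two-tier deployment described above can be sketched as a simple gate: a fast classifier runs on every frame, and the slower explanation pipeline runs only when that classifier reports an anomaly. Both models are represented by stub callables here, and `monitor_frame` is a hypothetical name, so this illustrates the control flow only.

```python
from typing import Any, Callable, Dict, Optional


def monitor_frame(
    image: Any,
    fast_classifier: Callable[[Any], str],
    explainer: Callable[[Any], dict],
) -> Dict[str, Optional[object]]:
    """Classify every frame; request an explanation only for anomalies."""
    label = fast_classifier(image)
    report = explainer(image) if label == "anomaly" else None
    return {"label": label, "report": report}


calls = {"explainer": 0}


def stub_fast(image: str) -> str:
    # Stand-in for the single-shot FT model.
    return "anomaly" if image == "lift_scene" else "normal"


def stub_explainer(image: str) -> dict:
    # Stand-in for the PFT pipeline; counts how often it is invoked.
    calls["explainer"] += 1
    return {"is_anomaly": True, "explanation": "stub explanation"}


normal_result = monitor_frame("clear_road", stub_fast, stub_explainer)
anomaly_result = monitor_frame("lift_scene", stub_fast, stub_explainer)
```

In this arrangement the expensive multi-query pipeline is invoked once for the two frames, so its cost scales with the number of detected anomalies instead of the frame rate.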
You are an expert at analyzing traffic scenes for autonomous vehicles. Your task is to provide detailed, structured descriptions across four distinct layers using the optimized framework.
Analyze this traffic scene image using the four-layer framework:
1. **Street Layer**: Assess the road's geometry (lane configuration, curves, gradients), topology (intersections, merges, splits), surface condition (material, damage, wetness), and lane markings (visibility, type, condition).
2. **Infrastructure Layer**: Investigate traffic control devices including traffic lights (state, visibility, positioning), signs (type, condition, placement), cones (placement, condition), barriers (type, condition), and construction sites (legitimacy, safety measures).
3. **Movable Objects Layer**: Observe and describe vehicles (type, position, orientation, behavior), pedestrians (location, movement, safety), and other dynamic objects (animals, debris, unusual items).
4. **Environmental Layer**: Evaluate weather conditions (precipitation, wind, temperature effects), visibility (clear, fog, glare), and lighting (daylight, artificial, shadows).
Provide your response in this exact JSON format:
{
"street_layer": {
"geometry": "Detailed analysis of road geometry...",
"topology": "Detailed analysis of road topology...",
"surface": "Detailed analysis of surface condition...",
"lane_markings": "Detailed analysis of lane markings..."
},
"infrastructure_layer": {
"traffic_lights": "Analysis of traffic lights or 'None'",
"traffic_signs": "Analysis of traffic signs or 'None'",
"traffic_cones": "Analysis of traffic cones or 'None'",
"traffic_barriers": "Analysis of barriers or 'None'",
"construction_sites": "Analysis of construction sites or 'None'"
},
"movable_objects_layer": {
"vehicles": "Analysis of vehicles or 'None'",
"pedestrians": "Analysis of pedestrians or 'None'",
"others": "Analysis of other objects or 'None'"
},
"environmental_layer": {
"weather": "Analysis of weather conditions",
"visibility": "Analysis of visibility conditions",
"lighting": "Analysis of lighting conditions"
}
}
You are a fault monitor for a visual autonomous vehicle system.
Your task is to analyze the traffic scene layer by layer and identify any anomalies that could affect vehicle safety, cause the vehicle to take unsafe actions or violate traffic rules.
For each layer, determine if it constitutes a normal observation or an anomaly and explain your reasoning.
Normal observations do not detrimentally affect the vehicle's performance, whereas anomalies might.
Finally, provide an overall assessment.
Analyze the provided image and scene description for anomalies using systematic layer-by-layer evaluation.
Scene Description:
{scene_description}
Perform the following analysis:
1. **Reasoning**: Think step by step about potential anomalies in each layer
2. **Street Layer Analysis**: Analyze street-level anomalies (topology, geometry, surface, lane markings)
3. **Infrastructure Layer Analysis**: Analyze infrastructure anomalies (traffic lights, signs, cones, barriers, construction)
4. **Movable Objects Layer Analysis**: Analyze movable object anomalies (vehicles, pedestrians, etc.)
5. **Environmental Layer Analysis**: Analyze environmental anomalies (weather, visibility)
Provide your assessment in this exact JSON format:
{
"reasoning": "Step-by-step reasoning about potential anomalies...",
"street_layer_analysis": "Detailed analysis of street layer...",
"street_layer_anomaly": true/false,
"infrastructure_layer_analysis": "Detailed analysis of infrastructure layer...",
"infrastructure_layer_anomaly": true/false,
"movable_objects_layer_analysis": "Detailed analysis of movable objects layer...",
"movable_objects_layer_anomaly": true/false,
"environmental_layer_analysis": "Detailed analysis of environmental layer...",
"environmental_layer_anomaly": true/false,
"is_anomaly": true/false,
"explanation": "Overall explanation considering all layers and their interactions..."
}
{
"street_layer": {
"geometry": "The road appears to have at least one lane in each
direction. The lane widths are not clearly visible due to the
snow and ice. The road seems relatively straight with a flat
gradient. There are no visible merging lanes or splits.",
"topology": "The scene appears to be an urban or suburban road
with an intersection. The road type is likely a local road or
a collector road. The intersection is a standard four-way
intersection with traffic lights.",
"surface": "The road surface is primarily asphalt, but it is
covered with a layer of snow and ice. The surface condition
is slippery due to the ice. There may be potholes or cracks
hidden beneath the snow and ice.",
"lane_markings": "Lane markings are not clearly visible due to
the snow and ice. It is difficult to determine the type,
visibility, or condition of the lane markings."
},
"infrastructure_layer": {
"traffic_lights": "There are traffic lights visible in the image.
One set is located in the distance, and another set is visible
on the right side of the image, near the stop sign. The state
of the lights is not clear. They appear to be correctly placed.",
"traffic_signs": "A stop sign is visible on the right side of the
image. It appears to be correctly placed.",
"traffic_cones": "There are a few traffic cones visible in the
image. They are placed near the construction site on the right
side of the road. They appear to be correctly placed.",
"traffic_barriers": "None",
"construction_sites": "There is a construction site on the right
side of the road. Workers are present, and a truck is parked
nearby. The area is marked with traffic cones."
},
"movable_objects_layer": {
"vehicles": "- Car: A dark-colored car is positioned in the center
of the road, facing the ego vehicle. It appears to be moving
forward.\n- Truck: A black pickup truck is positioned in the
intersection, facing slightly to the right.\n- Truck: A white
pickup truck is positioned on the right side of the road, near
the stop sign, facing slightly to the left.\n- Truck: A red
truck is positioned on the left side of the road, facing
slightly to the right.",
"pedestrians": "- Two workers are in a lift, working on the
overhead wires.\n- One worker is standing on the white truck
on the right side of the road, working on the stop sign.",
"others": "None"
},
"environmental_layer": {
"weather": "Overcast, with snow on the ground.",
"visibility": "light",
"lighting": "Daylight"
}
}
{
"reasoning": "Layer-by-layer analysis reveals the following:",
"street_layer_analysis": "The road surface is slippery due to the snow
and ice, which significantly reduces traction. The lane markings are
not clearly visible, making it difficult to maintain lane discipline.
The presence of an intersection further increases the complexity of
the situation.",
"street_layer_anomaly": true,
"infrastructure_layer_analysis": "The traffic lights and stop sign
appear to be correctly placed, but their functionality is uncertain
due to the ongoing construction. The construction site itself is a
potential hazard, as it may obstruct the road or create unexpected
traffic patterns.",
"infrastructure_layer_anomaly": true,
"movable_objects_layer_analysis": "The presence of workers in the lift
and on the truck near the stop sign is a significant hazard. The
workers may unexpectedly move into the roadway. The other vehicles
appear to be operating normally, but their actions may be affected
by the slippery road conditions.",
"movable_objects_layer_anomaly": true,
"environmental_layer_analysis": "The overcast weather and slippery road
surface reduce visibility and traction, respectively. These conditions
require increased caution and reduced speed.",
"environmental_layer_anomaly": false,
"is_anomaly": true,
"explanation": "The scene is anomalous due to the combination of the
slippery road surface, the presence of workers near the roadway, and
the ongoing construction. The slippery road surface reduces traction
and increases the risk of skidding. The workers near the roadway pose
a risk of unexpected movements into the roadway. The construction site
may obstruct the road or create unexpected traffic patterns. This
situation is not common and requires the ego vehicle to exercise
extreme caution, reduce speed, and be prepared to react to
unexpected events."
}
The PFT pipeline provides detailed, explainable layer-by-layer analysis. Unlike NFT, which dismissed this as a "typical roadwork situation," PFT correctly flags uncertain traffic light functionality, slippery icy road surfaces, and workers in a lift who may unexpectedly move into the roadway.
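The per-layer flags in the evaluation output also allow a lightweight sanity check. The sketch below lists the flagged layers and tests whether the overall verdict agrees with a logical OR of the layer flags. That OR rule is an assumption made for illustration: the prompt asks the model to weigh the interplay between layers, so a disagreement is a cue for review, not necessarily an error.

```python
from typing import Dict, List

LAYER_FLAGS = (
    "street_layer_anomaly",
    "infrastructure_layer_anomaly",
    "movable_objects_layer_anomaly",
    "environmental_layer_anomaly",
)
SUFFIX = "_anomaly"


def flagged_layers(evaluation: Dict[str, object]) -> List[str]:
    """Return the names of layers whose anomaly flag is true."""
    return [
        key[: -len(SUFFIX)] for key in LAYER_FLAGS if evaluation.get(key) is True
    ]


def verdict_matches_layers(evaluation: Dict[str, object]) -> bool:
    """Assumed rule: overall verdict equals the OR of the layer flags."""
    return evaluation.get("is_anomaly") == bool(flagged_layers(evaluation))


# Flag values taken from the PFT evaluation output shown above.
evaluation = {
    "street_layer_anomaly": True,
    "infrastructure_layer_anomaly": True,
    "movable_objects_layer_anomaly": True,
    "environmental_layer_anomaly": False,
    "is_anomaly": True,
}
layers = flagged_layers(evaluation)
consistent = verdict_matches_layers(evaluation)
```

For the example above, three of the four layers are flagged and the overall verdict agrees with them.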