S3E: Semantic Symbolic State Estimation With Vision-Language Foundation Models

Presenter: Nicola Dainese
Abstract: In automated task planning, state estimation is the process of translating an agent's sensor input into a high-level task state. It matters because real-world environments are unpredictable and actions often do not lead to their expected outcomes; state estimation lets the agent manage uncertainty, adjust its plans, and make better-informed decisions. Traditionally, researchers and practitioners have relied on hand-crafted, hard-coded state estimation functions to determine the abstract state defined in the task domain. Recent advances in Vision-Language Models (VLMs) enable autonomous retrieval of semantic information from visual input. The authors present Semantic Symbolic State Estimation (S3E), the first general-purpose symbolic state estimator based on VLMs that can be applied in various settings without specialized coding or additional exploration. S3E takes advantage of the foundation model's internal world model and semantic understanding to assess the likelihood of symbolic components of the environment's state. The authors analyze S3E as a multi-label classifier, characterize the different kinds of uncertainty that arise when using it, and show how these can be mitigated through natural language and targeted environment design. In their simulated and real-world robot experiments, S3E achieves over 90% state estimation precision.
Paper link: https://openreview.net/forum?id=RKfBy2wlST
Disclaimer: The presenter is not one of the paper's authors.
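
To make the core idea concrete, below is a minimal, hypothetical Python sketch of the approach described in the abstract: each symbolic predicate is phrased as a natural-language question about the current image, the VLM's probability of answering "yes" is taken as that predicate's likelihood, and thresholding the per-predicate probabilities yields a multi-label classification of the symbolic state. All names here (Predicate, estimate_state, query_vlm) and the stub VLM are illustrative assumptions, not the authors' implementation.

from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Predicate:
    name: str      # symbolic predicate, e.g. "on(cup, table)"
    question: str  # natural-language phrasing shown to the VLM

def estimate_state(
    image: bytes,
    predicates: List[Predicate],
    query_vlm: Callable[[bytes, str], float],
    threshold: float = 0.5,
) -> Dict[str, bool]:
    """Treat symbolic state estimation as multi-label classification:
    each predicate is queried independently and thresholded."""
    state = {}
    for p in predicates:
        prob_true = query_vlm(image, p.question)  # P(answer == "yes")
        state[p.name] = prob_true >= threshold
    return state

if __name__ == "__main__":
    # Stub VLM for demonstration only; a real system would call an
    # actual vision-language model here.
    def fake_vlm(image: bytes, question: str) -> float:
        return 0.9 if "cup" in question else 0.2

    predicates = [
        Predicate("on(cup, table)", "Is the cup on the table?"),
        Predicate("open(drawer)", "Is the drawer open?"),
    ]
    print(estimate_state(b"<image bytes>", predicates, fake_vlm))

Phrasing each predicate as an independent question is what makes the estimator general-purpose: applying it to a new task domain only requires writing new questions, not new perception code.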