BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models

1Dept. of Electrical Engineering, POSTECH, 2Grad. School of Artificial Intelligence, POSTECH, 3Institute for Convergence Research and Education in Advanced Technology, Yonsei University

The key idea of our BEAF benchmark is manipulating visual scene information and designing the metrics based on the model's answer changes along the scene changes.

Abstract

Large vision-language models (LVLMs) perceive the world through a combination of a visual encoder and large language models (LLMs). The visual encoder, pre-trained on large-scale vision-text datasets, provides zero-shot generalization to visual data, and the LLM endows LVLMs with strong reasoning ability. This enables LVLMs to achieve high performance on a wide range of benchmarks without fine-tuning, exhibiting the zero- or few-shot capability of LLMs. However, recent studies show that LVLMs are vulnerable to hallucination. This undesirable behavior degrades reliability and credibility, making users unable to fully trust the output of LVLMs.

To enhance trustworthiness and better tackle the hallucination of LVLMs, we curate a new evaluation dataset, called the BEfore-AFter hallucination dataset (BEAF), and introduce new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID). Unlike prior works that focus only on constructing questions and answers, the key idea of our benchmark is to manipulate visual scene information using image editing models and to design the metrics based on scene changes. This allows us to clearly assess whether LVLMs correctly understand a given scene by observing their ability to perceive changes. We also visualize the correctness heatmap by virtue of our two-axis view: vision and text. Upon evaluating LVLMs with our dataset, we observed that our metrics can reveal different aspects of LVLM hallucination.

BEAF Dataset


Samples of the original and manipulated images in the BEAF dataset. The first column contains original images,
and the rest of the columns contain manipulated images. The removed object is noted below each image.

Data Statistics (ver1)

    Our BEAF dataset contains 26K image-question pairs, covering both original and manipulated images. On average, an original image is associated with 3.45 manipulated images and 11.72 questions. The total number of removed objects equals the number of manipulated images, since each manipulated image has one object removed, and the number of objects in question equals the number of image-question pairs, since each question asks about one object.

Evaluation Metrics

    We propose four new metrics: True Understanding (TU), IGnorance (IG), StuBbornness (SB), and InDecision (ID), for more detailed evaluation by exploiting the distinctive before-/after-change configuration of our dataset. TU measures whether the models truly understand the scene. IG measures the extent to which models lack knowledge about specific scene information. SB measures the extent to which models adhere to their initial answers; the subscripts p and n correspond to consistently positive (“Yes”) and consistently negative (“No”) answers, respectively. ID focuses on the answers to questions that are not relevant to the removed objects. These answers should not change even after the manipulation.
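The four metrics can be illustrated with a minimal sketch. It assumes yes/no questions, and it assumes that for a question about the removed object the correct answer is “Yes” on the original image and “No” on the manipulated one. Under that assumption, TU is read as the rate of Yes-to-No answer pairs, IG as the rate of No-to-Yes pairs, SB_p and SB_n as the rates of unchanged “Yes” and unchanged “No” pairs, and ID as the rate of changed answers on questions about objects that were not removed. The names `AnswerPair` and `beaf_style_metrics`, and the exact normalization, are hypothetical choices for illustration and are not taken from the released evaluation code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AnswerPair:
    """One question asked on an original image and on its manipulated version."""
    before: bool         # True if the model answered "Yes" on the original image
    after: bool          # True if the model answered "Yes" on the manipulated image
    about_removed: bool  # True if the question asks about the removed object


def _rate(count: int, total: int) -> float:
    """Percentage helper that returns 0.0 for an empty group."""
    return 100.0 * count / total if total else 0.0


def beaf_style_metrics(pairs: list[AnswerPair]) -> dict[str, float]:
    """Illustrative before/after metrics (assumed definitions, see lead-in)."""
    removed = [p for p in pairs if p.about_removed]
    kept = [p for p in pairs if not p.about_removed]
    n = len(removed)

    # Questions about the removed object: expected "Yes" before, "No" after.
    tu = _rate(sum(p.before and not p.after for p in removed), n)      # Yes -> No
    ig = _rate(sum((not p.before) and p.after for p in removed), n)    # No -> Yes
    sb_p = _rate(sum(p.before and p.after for p in removed), n)        # Yes -> Yes
    sb_n = _rate(sum((not p.before) and (not p.after) for p in removed), n)  # No -> No

    # Questions about other objects: the answer should stay the same.
    id_ = _rate(sum(p.before != p.after for p in kept), len(kept))

    return {"TU": tu, "IG": ig, "SB_p": sb_p, "SB_n": sb_n,
            "SB": sb_p + sb_n, "ID": id_}


if __name__ == "__main__":
    toy = [
        AnswerPair(True, False, True),    # perceives the removal
        AnswerPair(True, True, True),     # sticks to "Yes"
        AnswerPair(False, False, True),   # sticks to "No"
        AnswerPair(False, True, True),    # wrong both times
        AnswerPair(True, True, False),    # unrelated question, stable
        AnswerPair(False, True, False),   # unrelated question, flipped
    ]
    print(beaf_style_metrics(toy))
```

In this reading, TU, IG, SB_p, and SB_n partition the answer pairs for removed-object questions, so they sum to 100, while ID is computed over a separate group of questions.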

Evaluation Results


Evaluation on BEAF dataset with proposed and traditional metrics


Visualization of correctness heatmap


Changes in answers for the open-ended generation task



Acknowledgement


This work was partly supported by the Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No.2021-0-02068, Artificial Intelligence Innovation Hub; No.2022-0-00124, Development of Artificial Intelligence Technology for Self-Improving Competency-Aware Learning Capabilities; No.RS-2019-II191906, Artificial Intelligence Graduate School Program (POSTECH)).

BibTeX

@inproceedings{yebin2024beaf,
  title     = {BEAF: Observing BEfore-AFter Changes to Evaluate Hallucination in Vision-language Models},
  author    = {Ye-Bin, Moon and Hyeon-Woo, Nam and Choi, Wonseok and Oh, Tae-Hyun},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2024},
}