ELSA: Acoustic Event-Level Semantic Alignment for Fine-Grained Reference-Free Text-to-Audio Evaluation

Under Review
The paper is currently under review. The links will be made available after publication.

Figure 1: Overview of existing reference-free metrics versus our proposed ELSA. ELSA captures fine-grained acoustic events aligned with semantic content.

Abstract

Text-to-Audio (TTA) generation has seen significant progress, but evaluating these systems remains challenging. Existing reference-free metrics often focus on global semantic alignment and neglect fine-grained acoustic events. In this paper, we introduce ELSA (Acoustic Event-Level Semantic Alignment), a novel metric designed to evaluate TTA models by aligning individual acoustic events with their corresponding textual descriptions. Our experiments demonstrate that ELSA correlates better with human judgment than state-of-the-art baselines.

Method

The ELSA framework maps individual acoustic events in the generated audio to the semantic units of the text prompt, enabling a more granular evaluation of audio quality and relevance than global similarity scores.


Figure 2: The architecture of the ELSA metric.

Interactive Demo

Explore the ELSA pipeline in action. Select a sample to see how the text and audio are processed.

The pipeline processes each sample in three stages:

1. Text branch: the text prompt is parsed into event-level descriptions by an LLM (OpenAI).
2. Audio branch: the generated audio is separated into per-event tracks by a LASS model (Meta).
3. Scoring: the ELSA model aligns each parsed event with its separated track and produces an ELSA score, which the demo displays alongside the corresponding human score.
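The three-stage pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: `parse_prompt`, `separate_audio`, and `event_alignment_score` are hypothetical stand-ins for the LLM parser, the LASS separator, and the learned ELSA scorer, respectively.

```python
def parse_prompt(prompt):
    """Stand-in for LLM parsing: split a prompt into event descriptions.
    (The real system uses an LLM; this toy version splits on ' and '.)"""
    return [p.strip() for p in prompt.split(" and ")]

def separate_audio(audio, num_events):
    """Stand-in for LASS separation: split a waveform into per-event tracks.
    (The real system uses a language-queried separation model.)"""
    chunk = max(1, len(audio) // num_events)
    return [audio[i * chunk:(i + 1) * chunk] for i in range(num_events)]

def event_alignment_score(event_text, event_audio):
    """Placeholder similarity between one event description and its track.
    (The real scorer is a learned audio-text alignment model.)"""
    return 1.0 if event_text and event_audio else 0.0

def elsa_score(prompt, audio):
    """Average the per-event alignment scores into one ELSA score."""
    events = parse_prompt(prompt)
    tracks = separate_audio(audio, len(events))
    scores = [event_alignment_score(e, t) for e, t in zip(events, tracks)]
    return sum(scores) / len(scores)
```

For example, `elsa_score("a dog barks and rain falls", waveform)` parses two events, separates the waveform into two tracks, and averages the two event-level alignment scores.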

Experimental Results

We compare ELSA with state-of-the-art metrics and present ablation studies.

Analysis

We provide a detailed analysis of ELSA's sensitivity to acoustic events compared with previous approaches.

Citation


To appear. The citation will be added after publication.