Supplementary data for the research paper.
```
├── Automatic Evaluation Dialogues (DailyDialog based)/
│   ├── models/                            # Generated dialogues by model and weight
│   ├── alignment_data_20250329.tsv        # Alignment metrics for all dialogues
│   └── README.md
├── Human Evaluation Dialogues/
│   ├── {Model}/{Topic}/                   # Dialogue pairs per model and topic
│   └── README.md
├── figures/                               # High-resolution figures from the paper
├── benchmark_results.json                 # Generation latency benchmark
├── inter_model_evaluation_results.tsv     # Inter-model comparison evaluations
├── intra_model_evaluation_results.tsv     # Intra-model weight comparison evaluations
└── bradley_terry_rankings.json            # Bradley-Terry model rankings
```
10,545 dialogues generated from DailyDialog with varying alignment weights.
| Model | Dialogues | Weight Range |
|---|---|---|
| BlenderBot-3B | 1,425 | 25–1000 |
| DialoGPT-small | 1,330 | 25–1000 |
| Llama-2-7b-chat | 3,895 | 25–7500 |
| Phi-3.5-mini-instruct | 3,895 | 25–7500 |
12 dialogues (6 pairs) for the human preference study comparing the baseline (w=0) against optimized weights.
| File | Description |
|---|---|
| inter_model_evaluation_results.tsv | LLM-as-judge rankings comparing dialogues across different generator models |
| intra_model_evaluation_results.tsv | LLM-as-judge rankings comparing weight configurations within the same model |
| bradley_terry_rankings.json | Aggregated Bradley-Terry rankings from pairwise comparisons of intra-model dialogues |
| benchmark_results.json | Per-turn generation latency and wall-time benchmarks for all generator models |
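For readers who want to recompute rankings like those in `bradley_terry_rankings.json`, the sketch below fits a Bradley-Terry model with the standard MM (minorization-maximization) update. The win counts are invented for illustration; the paper's actual aggregation pipeline may differ in details such as regularization or tie handling.

```python
import numpy as np

def fit_bradley_terry(wins, n_iters=500, tol=1e-10):
    """MM updates for Bradley-Terry strengths.

    wins[i, j] = number of pairwise comparisons in which item i was
    preferred over item j.  Returns strengths normalized to sum to 1.
    """
    n = wins.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        p_new = np.empty(n)
        for i in range(n):
            # Denominator of the MM update: comparisons involving item i,
            # discounted by the current strength estimates.
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p_new[i] = wins[i].sum() / denom
        p_new /= p_new.sum()
        if np.abs(p_new - p).max() < tol:
            return p_new
        p = p_new
    return p

# Hypothetical win counts for three weight configurations of one model.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)
strengths = fit_bradley_terry(wins)
print(strengths)  # configuration 0 comes out strongest
```

Under this model, the probability that item *i* is preferred over item *j* is p_i / (p_i + p_j), so the fitted strengths directly induce a ranking.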
The same prompt was sent to multiple evaluator LLMs for cross-validation:
- GPT-4.1-mini, GPT-4o-mini (OpenAI)
- Claude 3.5 Haiku (Anthropic)
- Mistral Large (Mistral AI)
You are an advanced AI model tasked with analyzing dialogues.
**Task**: Evaluate the above AI-generated dialogues by ranking them based on the criteria below.
**Input Format**:
- Dialogues 1 to N: [Dialogue text with turns labeled "Human 1" and "Human 2" where Human 2's utterances are generated by the model]
**Instructions**:
1. Analyze each dialogue strictly based on the provided parameters
2. Score each parameter per dialogue (1-5 scale: 1=Poor, 5=Excellent)
3. Calculate weighted average scores:
- Turn-level: 50% weight (Average of all turn scores)
- Dialogue-level: 50% weight (Average of all dialogue scores)
4. Rank dialogues DESCENDING by final score
5. **DO NOT** invent non-existent models or parameters
**Evaluation Criteria**:
| **Turn-Level (50%)** | **Dialogue-Level (50%)** |
|-------------------------------|--------------------------------|
| 1. Interestingness | 11. Coherence & Flow |
| 2. Engagement | 12. Error Recovery |
| 3. Specificity | 13. Consistency |
| 4. Relevance | 14. Response Diversity |
| 5. Correctness | 15. Topic Depth |
| 6. Semantic Appropriateness | 16. Personality Likeability |
| 7. Understandability | 17. User Understanding |
| 8. Fluency | 18. Flexibility and adaptability to the user |
| 9. Overall Turn Quality | 19. Informativeness |
| | 20. Inquisitiveness |
| | 21. Overall Dialogue Quality |
**Output Requirements**:
1. Return ONLY a comma-separated list of dialogue IDs in ranking order
2. Format: "Ranking: [ID1],[ID2],...,[IDN]"
3. No explanations or additional text
4. If unsure, prioritize dialogue coherence (Criteria 11) and relevance (Criteria 4)
**Example Valid Response**:
Ranking: 3,1,4,2
**Penalization Rules**:
- Hallucinated content → Ranking discarded
- Extra models mentioned → Automatic lowest rank
- Format violations → Score reduction
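The scoring rule in the prompt above (equal 50% weights on the turn-level and dialogue-level averages, ranked descending) reduces to a short computation. The per-criterion scores below are made up purely to illustrate it.

```python
def final_score(turn_scores, dialogue_scores):
    """Equal 50/50 weighting of the two criterion-group averages (1-5 scale)."""
    turn_avg = sum(turn_scores) / len(turn_scores)
    dialogue_avg = sum(dialogue_scores) / len(dialogue_scores)
    return 0.5 * turn_avg + 0.5 * dialogue_avg

# Made-up scores for three dialogues: (turn-level scores, dialogue-level scores).
dialogues = {
    1: ([4, 5, 4], [5, 4, 5]),
    2: ([3, 3, 2], [2, 3, 3]),
    3: ([5, 5, 5], [4, 5, 5]),
}
ranking = sorted(dialogues, key=lambda d: final_score(*dialogues[d]), reverse=True)
print("Ranking: " + ",".join(map(str, ranking)))  # the output format the prompt requires
```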
Agreement among LLM evaluators was measured using Kendall's W and pairwise Spearman rank correlation.
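Both statistics can be computed directly from the evaluators' rank matrices. The sketch below uses the standard no-ties formulas (the actual analysis may have applied a tie correction), and the example ranks are illustrative:

```python
import numpy as np

def kendalls_w(ranks):
    """Kendall's W for an (m raters x n items) matrix of ranks 1..n, no ties."""
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

def spearman_rho(r1, r2):
    """Spearman rank correlation between two untied rankings."""
    d2 = ((np.asarray(r1) - np.asarray(r2)) ** 2).sum()
    n = len(r1)
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

# Three evaluator LLMs ranking four dialogues (ranks are illustrative).
ranks = np.array([[1, 2, 3, 4],
                  [1, 3, 2, 4],
                  [2, 1, 3, 4]])
print(kendalls_w(ranks))             # concordance across all three evaluators
print(spearman_rho(ranks[0], ranks[1]))  # pairwise agreement of two evaluators
```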
Overall (Bradley-Terry Weights Ranking):
| Statistic | Value |
|---|---|
| Cases analyzed | 363 model–dialogue combinations |
| Average Kendall's W | 0.655 (moderate agreement) |
| Kendall's W range | 0.133–0.991 |
| Average Spearman correlation | 0.449 |
| Statistically significant (p < 0.05) | 226 / 363 cases |
Per-Model Average Kendall's W:
| Model | Avg W |
|---|---|
| Phi-3.5-mini-instruct | 0.819 |
| Llama-2-7b-chat | 0.725 |
| DialoGPT-small | 0.609 |
| BlenderBot-3B | 0.481 |
Agreement Distribution:
| Level | Count |
|---|---|
| Strong (W ≥ 0.7) | 169 |
| Moderate (0.4 ≤ W < 0.7) | 153 |
| Weak (W < 0.4) | 41 |
Pairwise comparison design: 50 participants (recruited via Prolific) evaluated 6 dialogue pairs, each comparing a baseline (w=0) dialogue against an optimized-weight dialogue. Participants ranked dialogues from best (1) to worst (2).
Rank the following dialogues from the best (1) to the worst (2) based on the relevance and coherence of the responses by S2.
Relevance: The appropriateness of responses to immediate conversational context, i.e., the previous utterance of Speaker 1 (S1).
Coherence: The maintenance of thematic consistency and logical progression with respect to the full dialogue.
| Statistic | Value |
|---|---|
| Participants | 50 |
| Dialogue pairs | 6 |
| Average percentage agreement | 58.3% |
| Average normalized entropy | 0.849 |
Note: Traditional IRR metrics such as Fleiss' κ are not well-suited for this single-item pairwise design. Percentage agreement and normalized entropy are reported as more appropriate measures.
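Per pair, percentage agreement is the majority share of the votes, and normalized entropy measures how split the votes are (0.0 = unanimous, 1.0 = an even 25/25 split). A sketch with an illustrative vote split, not the study's actual counts:

```python
import math

def pair_stats(votes_a, votes_b):
    """Majority agreement and normalized entropy for one binary preference pair."""
    total = votes_a + votes_b
    agreement = max(votes_a, votes_b) / total
    entropy = -sum(p * math.log2(p)
                   for p in (votes_a / total, votes_b / total) if p > 0)
    # With two options, log2 entropy is already normalized to [0, 1].
    return agreement, entropy

agreement, entropy = pair_stats(29, 21)  # illustrative 29-21 split of 50 votes
print(f"agreement={agreement:.1%}, normalized entropy={entropy:.3f}")
```

The reported 58.3% and 0.849 are averages of these per-pair values over all six pairs.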
Per-turn generation latency measured on an NVIDIA H100 PCIe GPU with weighted decoding (w=5), top-k=20, 10 candidate samples, and a maximum of 30 tokens per turn. Statistics cover 10 dialogues (80 generated turns per model) after 1 warm-up dialogue.
| Model | Median (s) | Mean (s) | Std (s) | Min (s) | Max (s) | p95 (s) |
|---|---|---|---|---|---|---|
| BlenderBot-3B | 0.8712 | 0.8923 | 0.1169 | 0.6725 | 1.1659 | 1.1208 |
| Llama-2-7b-chat | 1.0795 | 1.0899 | 0.1549 | 0.7549 | 1.5620 | 1.3642 |
| Phi-3.5-mini-instruct | 0.9144 | 0.9294 | 0.1349 | 0.6803 | 1.2783 | 1.2210 |
| DialoGPT-small | 0.4290 | 0.4346 | 0.0621 | 0.3102 | 0.5999 | 0.5437 |
Model Load Times: DialoGPT-small 5.1s, Phi-3.5-mini-instruct 14.0s, Llama-2-7b-chat 23.7s, BlenderBot-3B 4.8s
`benchmark_results.json` contains the full per-turn latencies and per-dialogue wall times for all models. Key fields:

```
timestamp, device, gpu_name, weight, num_samples, max_gen_length, top_k,
repetition_penalty, seed, num_dialogues, warmup_dialogues,
models[].{model_name, load_time_s, total_generated_turns,
          per_turn_latency_stats, per_dialogue_wall_times, all_turn_latencies}
```
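The summary statistics in the table above can be recomputed from the raw `all_turn_latencies` arrays. In this sketch, an inline dictionary stands in for the real file and its numbers are invented:

```python
import json
import numpy as np

# Inline stand-in for benchmark_results.json, following the key fields listed
# above; the latency values are invented for illustration.
raw = json.dumps({
    "gpu_name": "NVIDIA H100 PCIe",
    "weight": 5,
    "models": [{
        "model_name": "DialoGPT-small",
        "load_time_s": 5.1,
        "all_turn_latencies": [0.41, 0.43, 0.40, 0.46, 0.52, 0.39, 0.44, 0.60],
    }],
})
for m in json.loads(raw)["models"]:
    lat = np.asarray(m["all_turn_latencies"])
    print(f'{m["model_name"]}: median={np.median(lat):.4f}s '
          f'mean={lat.mean():.4f}s p95={np.percentile(lat, 95):.4f}s')
```

To run against the real data, replace `raw` with the contents of `benchmark_results.json`.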
High-resolution versions of figures from the paper (in figures/):
| File | Description |
|---|---|
| alignment_scores_low_weights.png | Alignment scores for weights ≤ 1000 |
| alignment_scores_high_weights.png | Alignment scores for weights > 1000 |
| perplexity_low_weights.png | Perplexity for weights ≤ 1000 |
| perplexity_high_weights.png | Perplexity for weights > 1000 |
| inter_model_win_rates.png | Inter-model comparison win rates |
| intra_model_bradley_terry_rankings.png | Intra-model Bradley-Terry rankings |
`inter_model_evaluation_results.tsv` columns: `evaluator_model`, `alignment_range`, `model_candidates`, `preference_ranking`

`intra_model_evaluation_results.tsv` columns: `evaluator_model`, `target_model`, `dialogue_id`, `weight_candidates`, `preference_ranking`
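Both TSVs can be parsed with the standard library alone. The inline sample below mimics `intra_model_evaluation_results.tsv`; the `preference_ranking` encoding shown here (comma-separated weights, best first) is an assumption for illustration, so check the real files before relying on it.

```python
import csv
import io
from collections import Counter

# Inline sample mimicking intra_model_evaluation_results.tsv.  The row values
# and the preference_ranking encoding (best weight first) are assumptions.
tsv = io.StringIO(
    "evaluator_model\ttarget_model\tdialogue_id\tweight_candidates\tpreference_ranking\n"
    "gpt-4o-mini\tLlama-2-7b-chat\t12\t0,300,1000\t300,0,1000\n"
    "claude-3.5-haiku\tLlama-2-7b-chat\t12\t0,300,1000\t300,1000,0\n"
)
# Count how often each weight configuration was ranked first.
first_place = Counter(row["preference_ranking"].split(",")[0]
                      for row in csv.DictReader(tsv, delimiter="\t"))
print(first_place.most_common())
```

Swapping `tsv` for `open("intra_model_evaluation_results.tsv")` runs the same tally over the full file.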
| Model | Type |
|---|---|
| BlenderBot-3B | Encoder-Decoder |
| DialoGPT-small | Decoder-only |
| Llama-2-7b-chat | Decoder-only |
| Phi-3.5-mini-instruct | Decoder-only |
Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Citation details will be added upon publication.
For research purposes. See DailyDialog license for base dialogue data terms.