sumitsrv/controlled-lexical-alignment-data

Weighted Decoding for Controlled Lexical Alignment in LLM-based Dialogues

Supplementary data for the research paper.

Contents

├── Automatic Evaluation Dialogues (DailyDialog based)/
│   ├── models/                          # Generated dialogues by model and weight
│   ├── alignment_data_20250329.tsv      # Alignment metrics for all dialogues
│   └── README.md
├── Human Evaluation Dialogues/
│   ├── {Model}/{Topic}/                 # Dialogue pairs per model and topic
│   └── README.md
├── figures/                             # High-resolution figures from the paper
├── benchmark_results.json               # Generation latency benchmark
├── inter_model_evaluation_results.tsv   # Inter-model comparison evaluations
├── intra_model_evaluation_results.tsv   # Intra-model weight comparison evaluations
└── bradley_terry_rankings.json          # Bradley-Terry model rankings

Datasets

Automatic Evaluation Dialogues

10,545 dialogues generated from DailyDialog with varying alignment weights.

| Model | Dialogues | Weight Range |
|---|---|---|
| BlenderBot-3B | 1,425 | 25–1000 |
| DialoGPT-small | 1,330 | 25–1000 |
| Llama-2-7b-chat | 3,895 | 25–7500 |
| Phi-3.5-mini-instruct | 3,895 | 25–7500 |

Human Evaluation Dialogues

12 dialogues (6 pairs) for human preference study comparing baseline (w=0) vs. optimized weights.

Evaluation Data Files

| File | Description |
|---|---|
| inter_model_evaluation_results.tsv | LLM-as-judge rankings comparing dialogues across different generator models |
| intra_model_evaluation_results.tsv | LLM-as-judge rankings comparing weight configurations within the same model |
| bradley_terry_rankings.json | Aggregated Bradley-Terry rankings from pairwise comparisons of intra-model dialogues |
| benchmark_results.json | Per-turn generation latency and wall-time benchmarks for all generator models |
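The Bradley-Terry rankings are aggregated from pairwise comparisons. A minimal sketch of the standard MM (minorization-maximization) fitting procedure, assuming a simple win-count matrix as input (the repository's exact aggregation pipeline is not specified here):

```python
# Minimal Bradley-Terry fit via the classic MM update.
# wins[i][j] = number of times item i beat item j in pairwise comparisons.

def bradley_terry(wins, iterations=200):
    n = len(wins)
    p = [1.0] * n  # initial strengths
    for _ in range(iterations):
        new_p = []
        for i in range(n):
            w_i = sum(wins[i])  # total wins of item i
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            new_p.append(w_i / denom if denom > 0 else p[i])
        s = sum(new_p)
        p = [x / s for x in new_p]  # normalize strengths to sum to 1
    return p

# Toy example: item 0 wins most of its pairwise comparisons.
wins = [
    [0, 8, 9],
    [2, 0, 6],
    [1, 4, 0],
]
strengths = bradley_terry(wins)
ranking = sorted(range(3), key=lambda i: -strengths[i])
```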

Automatic Evaluation (LLM-as-Judge)

Evaluator Models

The same prompt was sent to multiple evaluator LLMs for cross-validation:

  • GPT-4.1-mini, GPT-4o-mini (OpenAI)
  • Claude 3.5 Haiku (Anthropic)
  • Mistral Large (Mistral AI)

System Prompt

You are an advanced AI model tasked with analyzing dialogues.

Evaluation Prompt

**Task**: Evaluate the above AI-generated dialogues by ranking them based on the criteria below.

**Input Format**:
- Dialogues 1 to N: [Dialogue text with turns labeled "Human 1" and "Human 2" where Human 2's utterances are generated by the model]

**Instructions**:
1. Analyze each dialogue strictly based on the provided parameters
2. Score each parameter per dialogue (1-5 scale: 1=Poor, 5=Excellent)
3. Calculate weighted average scores:
   - Turn-level: 50% weight (Average of all turn scores)
   - Dialogue-level: 50% weight (Average of all dialogue scores)
4. Rank dialogues DESCENDING by final score
5. **DO NOT** invent non-existent models or parameters

**Evaluation Criteria**:
| **Turn-Level (50%)**          | **Dialogue-Level (50%)**       |
|-------------------------------|--------------------------------|
| 1. Interestingness            | 11. Coherence & Flow           |
| 2. Engagement                 | 12. Error Recovery             |
| 3. Specificity                | 13. Consistency                |
| 4. Relevance                  | 14. Response Diversity         |
| 5. Correctness                | 15. Topic Depth                |
| 6. Semantic Appropriateness   | 16. Personality Likeability    |
| 7. Understandability          | 17. User Understanding         |
| 8. Fluency                    | 18. Flexibility and adaptability to the user |
| 9. Overall Turn Quality       | 19. Informativeness            |
|                               | 20. Inquisitiveness            |
|                               | 21. Overall Dialogue Quality   |

**Output Requirements**:
1. Return ONLY a comma-separated list of dialogue IDs in ranking order
2. Format: "Ranking: [ID1],[ID2],...,[IDN]"
3. No explanations or additional text
4. If unsure, prioritize dialogue coherence (Criteria 11) and relevance (Criteria 4)

**Example Valid Response**:
Ranking: 3,1,4,2

**Penalization Rules**:
- Hallucinated content → Ranking discarded
- Extra models mentioned → Automatic lowest rank
- Format violations → Score reduction

Inter-Rater Agreement (Automatic Evaluation)

Agreement among LLM evaluators was measured using Kendall's W and pairwise Spearman rank correlation.
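Kendall's W can be computed directly from the rank sums across evaluators. A sketch, assuming complete rankings with no ties:

```python
# Kendall's W (coefficient of concordance) for m raters ranking n items.
# rankings: list of m lists; rankings[r][i] is the rank rater r gave item i.

def kendalls_w(rankings):
    m = len(rankings)      # number of raters
    n = len(rankings[0])   # number of items
    # Rank sum for each item across raters.
    rank_sums = [sum(r[i] for r in rankings) for i in range(n)]
    mean_sum = m * (n + 1) / 2
    s = sum((rs - mean_sum) ** 2 for rs in rank_sums)
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Perfect agreement among three raters on four items yields W = 1.
w = kendalls_w([[1, 2, 3, 4]] * 3)
```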

Overall (Bradley-Terry Weights Ranking):

| Statistic | Value |
|---|---|
| Cases analyzed | 363 model–dialogue combinations |
| Average Kendall's W | 0.655 (moderate agreement) |
| Kendall's W range | 0.133–0.991 |
| Average Spearman correlation | 0.449 |
| Statistically significant (p < 0.05) | 226 / 363 cases |

Per-Model Average Kendall's W:

| Model | Avg W |
|---|---|
| Phi-3.5-mini-instruct | 0.819 |
| Llama-2-7b-chat | 0.725 |
| DialoGPT-small | 0.609 |
| BlenderBot-3B | 0.481 |

Agreement Distribution:

| Level | Count |
|---|---|
| Strong (W ≥ 0.7) | 169 |
| Moderate (0.4 ≤ W < 0.7) | 153 |
| Weak (W < 0.4) | 41 |

Human Evaluation

Study Design

Pairwise comparison design: 50 participants (recruited via Prolific) evaluated 6 dialogue pairs, each comparing a baseline (w=0) dialogue against an optimized-weight dialogue. Participants ranked dialogues from best (1) to worst (2).

Participant Instructions

Rank the following dialogues from the best (1) to the worst (2) based on the relevance and coherence of the responses by S2.

Relevance: The appropriateness of responses to immediate conversational context, i.e., the previous utterance of Speaker 1 (S1).

Coherence: The maintenance of thematic consistency and logical progression with respect to the full dialogue.

Inter-Rater Agreement (Human Evaluation)

| Statistic | Value |
|---|---|
| Participants | 50 |
| Dialogue pairs | 6 |
| Average percentage agreement | 58.3% |
| Average normalized entropy | 0.849 |

Note: Traditional IRR metrics such as Fleiss' κ are not well-suited for this single-item pairwise design. Percentage agreement and normalized entropy are reported as more appropriate measures.
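Both measures are straightforward to compute per item. A sketch for a single binary pairwise comparison, using a hypothetical vote split for illustration:

```python
import math

# Percentage agreement and normalized entropy for one binary pairwise item.
# votes: counts for the two options (e.g., baseline vs. optimized dialogue).

def agreement_and_entropy(votes):
    total = sum(votes)
    agreement = max(votes) / total  # fraction picking the majority option
    entropy = -sum(
        (v / total) * math.log2(v / total) for v in votes if v > 0
    )
    # Normalize by log2 of the number of options (1 bit for a binary choice).
    return agreement, entropy / math.log2(len(votes))

# Hypothetical split of 50 participants: 30 prefer one dialogue, 20 the other.
agr, ent = agreement_and_entropy([30, 20])
```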

Generation Performance Benchmark

Per-turn generation latency was measured on an NVIDIA H100 PCIe GPU with weighted decoding (w=5), top-k=20, 10 candidate samples, and a maximum of 30 tokens per turn, over 10 dialogues (80 generated turns per model) after 1 warm-up dialogue.

| Model | Median (s) | Mean (s) | Std (s) | Min (s) | Max (s) | p95 (s) |
|---|---|---|---|---|---|---|
| BlenderBot-3B | 0.8712 | 0.8923 | 0.1169 | 0.6725 | 1.1659 | 1.1208 |
| Llama-2-7b-chat | 1.0795 | 1.0899 | 0.1549 | 0.7549 | 1.5620 | 1.3642 |
| Phi-3.5-mini-instruct | 0.9144 | 0.9294 | 0.1349 | 0.6803 | 1.2783 | 1.2210 |
| DialoGPT-small | 0.4290 | 0.4346 | 0.0621 | 0.3102 | 0.5999 | 0.5437 |

Model Load Times: DialoGPT-small 5.1s, Phi-3.5-mini-instruct 14.0s, Llama-2-7b-chat 23.7s, BlenderBot-3B 4.8s
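The benchmark configuration (an alignment weight w, top-k filtering, candidate sampling) can be illustrated with a generic weighted-decoding step that biases token logits toward the conversation partner's vocabulary before sampling. This is a sketch of the general technique only; the feature function (partner-vocabulary membership) and the additive `w * feature` bias are illustrative assumptions, not necessarily the paper's exact formulation.

```python
import math
import random

# Generic weighted-decoding step: add w to the logit of every token that
# appears in the partner's vocabulary, then top-k sample from the softmax.
def weighted_decode_step(logits, partner_vocab, vocab, w=5.0, top_k=20):
    biased = [
        logit + (w if tok in partner_vocab else 0.0)
        for tok, logit in zip(vocab, logits)
    ]
    # Keep the top_k highest-scoring tokens, then sample proportionally.
    top = sorted(range(len(vocab)), key=lambda i: -biased[i])[:top_k]
    mx = max(biased[i] for i in top)
    probs = [math.exp(biased[i] - mx) for i in top]
    total = sum(probs)
    r, acc = random.random() * total, 0.0
    for idx, p in zip(top, probs):
        acc += p
        if acc >= r:
            return vocab[idx]
    return vocab[top[-1]]
```

With a large weight, an aligned token dominates the biased distribution; with w=0 the step reduces to plain top-k sampling.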

benchmark_results.json

Full per-turn latencies and per-dialogue wall times for all models. Key fields:

timestamp, device, gpu_name, weight, num_samples, max_gen_length, top_k,
repetition_penalty, seed, num_dialogues, warmup_dialogues,
models[].{model_name, load_time_s, total_generated_turns,
          per_turn_latency_stats, per_dialogue_wall_times, all_turn_latencies}
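The fields above can be read back with the standard library alone. A sketch that recomputes summary statistics from the raw `all_turn_latencies` array, assuming the layout shown:

```python
import json
import statistics

# Summarize per-turn latencies from a loaded benchmark_results.json payload.
def summarize(data):
    rows = []
    for m in data["models"]:
        lat = sorted(m["all_turn_latencies"])
        # Empirical p95 from the raw latencies (clamped to the last index).
        p95 = lat[min(len(lat) - 1, int(0.95 * len(lat)))]
        rows.append({
            "model": m["model_name"],
            "median_s": statistics.median(lat),
            "mean_s": statistics.fmean(lat),
            "p95_s": p95,
        })
    return rows

# Usage:
# with open("benchmark_results.json") as f:
#     for row in summarize(json.load(f)):
#         print(row)
```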

Figures

High-resolution versions of figures from the paper (in figures/):

| File | Description |
|---|---|
| alignment_scores_low_weights.png | Alignment scores for weights ≤1000 |
| alignment_scores_high_weights.png | Alignment scores for weights >1000 |
| perplexity_low_weights.png | Perplexity for weights ≤1000 |
| perplexity_high_weights.png | Perplexity for weights >1000 |
| inter_model_win_rates.png | Inter-model comparison win rates |
| intra_model_bradley_terry_rankings.png | Intra-model Bradley-Terry rankings |

File Formats

inter_model_evaluation_results.tsv

evaluator_model  alignment_range  model_candidates  preference_ranking

intra_model_evaluation_results.tsv

evaluator_model  target_model  dialogue_id  weight_candidates  preference_ranking

Generator Models

| Model | Type |
|---|---|
| BlenderBot-3B | Encoder-Decoder |
| DialoGPT-small | Decoder-only |
| Llama-2-7b-chat | Decoder-only |
| Phi-3.5-mini-instruct | Decoder-only |

References

Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.

Citation

Citation details will be added upon publication.

License

For research purposes. See DailyDialog license for base dialogue data terms.
