Supplementary data for the research paper.
```
├── Automatic Evaluation Dialogues (DailyDialog based)/
│   ├── models/                            # Generated dialogues by model and weight
│   ├── alignment_data_20250329.tsv        # Alignment metrics for all dialogues
│   └── README.md
├── Human Evaluation Dialogues/
│   ├── {Model}/{Topic}/                   # Dialogue pairs per model and topic
│   └── README.md
├── figures/                               # High-resolution figures from the paper
├── benchmark_results.json                 # Generation latency benchmark
├── inter_model_evaluation_results.tsv     # Inter-model comparison evaluations
├── intra_model_evaluation_results.tsv     # Intra-model weight comparison evaluations
└── bradley_terry_rankings.json            # Bradley-Terry model rankings
```
10,545 dialogues generated from DailyDialog with varying alignment weights.
| Model | Dialogues | Weight Range |
|---|---|---|
| BlenderBot-3B | 1,425 | 25–1000 |
| DialoGPT-small | 1,330 | 25–1000 |
| Llama-2-7b-chat | 3,895 | 25–7500 |
| Phi-3.5-mini-instruct | 3,895 | 25–7500 |
12 dialogues (6 pairs) for the human preference study comparing the baseline (w=0) against optimized weights.
| File | Description |
|---|---|
| inter_model_evaluation_results.tsv | LLM-as-judge rankings comparing dialogues across different generator models |
| intra_model_evaluation_results.tsv | LLM-as-judge rankings comparing weight configurations within the same model |
| bradley_terry_rankings.json | Aggregated Bradley-Terry rankings from pairwise comparisons of intra-model dialogues |
| benchmark_results.json | Per-turn generation latency and wall-time benchmarks for all generator models |
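For readers who want to recompute rankings like those in `bradley_terry_rankings.json`, the sketch below fits a Bradley-Terry model with the standard MM (minorization-maximization) update. The win counts are invented for illustration; the paper's actual aggregation pipeline may differ in details such as regularization or tie handling.

```python
import numpy as np

def fit_bradley_terry(wins, n_iters=500, tol=1e-10):
    """MM updates for Bradley-Terry strengths.

    wins[i, j] = number of pairwise comparisons in which item i was
    preferred over item j.  Returns strengths normalized to sum to 1.
    """
    n = wins.shape[0]
    p = np.full(n, 1.0 / n)
    for _ in range(n_iters):
        p_new = np.empty(n)
        for i in range(n):
            # Denominator of the MM update: comparisons involving item i,
            # discounted by the current strength estimates.
            denom = sum((wins[i, j] + wins[j, i]) / (p[i] + p[j])
                        for j in range(n) if j != i)
            p_new[i] = wins[i].sum() / denom
        p_new /= p_new.sum()
        if np.abs(p_new - p).max() < tol:
            return p_new
        p = p_new
    return p

# Hypothetical win counts for three weight configurations of one model.
wins = np.array([[0, 8, 9],
                 [2, 0, 7],
                 [1, 3, 0]], dtype=float)
strengths = fit_bradley_terry(wins)
print(strengths)  # configuration 0 comes out strongest
```

Under this model, the probability that item *i* is preferred over item *j* is p_i / (p_i + p_j), so the fitted strengths directly induce a ranking.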
The same prompt was sent to multiple evaluator LLMs for cross-validation:
- GPT-4.1-mini, GPT-4o-mini (OpenAI)
- Claude 3.5 Haiku (Anthropic)
- Mistral Large (Mistral AI)
You are an advanced AI model tasked with analyzing dialogues.
**Task**: Evaluate the above AI-generated dialogues by ranking them based on the criteria below.
**Input Format**:
- Dialogues 1 to N: [Dialogue text with turns labeled "Human 1" and "Human 2" where Human 2's utterances are generated by the model]
**Instructions**:
1. Analyze each dialogue strictly based on the provided parameters
2. Score each parameter per dialogue (1-5 scale: 1=Poor, 5=Excellent)
3. Calculate weighted average scores:
- Turn-level: 50% weight (Average of all turn scores)
- Dialogue-level: 50% weight (Average of all dialogue scores)
4. Rank dialogues DESCENDING by final score
5. **DO NOT** invent non-existent models or parameters
**Evaluation Criteria**:
| **Turn-Level (50%)** | **Dialogue-Level (50%)** |
|-------------------------------|--------------------------------|
| 1. Interestingness | 11. Coherence & Flow |
| 2. Engagement | 12. Error Recovery |
| 3. Specificity | 13. Consistency |
| 4. Relevance | 14. Response Diversity |
| 5. Correctness | 15. Topic Depth |
| 6. Semantic Appropriateness | 16. Personality Likeability |
| 7. Understandability | 17. User Understanding |
| 8. Fluency | 18. Flexibility and adaptability to the user |
| 9. Overall Turn Quality | 19. Informativeness |
| | 20. Inquisitiveness |
| | 21. Overall Dialogue Quality |
**Output Requirements**:
1. Return ONLY a comma-separated list of dialogue IDs in ranking order
2. Format: "Ranking: [ID1],[ID2],...,[IDN]"
3. No explanations or additional text
4. If unsure, prioritize dialogue coherence (Criteria 11) and relevance (Criteria 4)
**Example Valid Response**:
Ranking: 3,1,4,2
**Penalization Rules**:
- Hallucinated content → Ranking discarded
- Extra models mentioned → Automatic lowest rank
- Format violations → Score reduction
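The scoring rule in the prompt above (equal 50% weights on the turn-level and dialogue-level averages, ranked descending) reduces to a short computation. The per-criterion scores below are made up purely to illustrate it.

```python
def final_score(turn_scores, dialogue_scores):
    """Equal 50/50 weighting of the two criterion-group averages (1-5 scale)."""
    turn_avg = sum(turn_scores) / len(turn_scores)
    dialogue_avg = sum(dialogue_scores) / len(dialogue_scores)
    return 0.5 * turn_avg + 0.5 * dialogue_avg

# Made-up scores for three dialogues: (turn-level scores, dialogue-level scores).
dialogues = {
    1: ([4, 5, 4], [5, 4, 5]),
    2: ([3, 3, 2], [2, 3, 3]),
    3: ([5, 5, 5], [4, 5, 5]),
}
ranking = sorted(dialogues, key=lambda d: final_score(*dialogues[d]), reverse=True)
print("Ranking: " + ",".join(map(str, ranking)))  # the output format the prompt requires
```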
Agreement among LLM evaluators was measured using Kendall's W and pairwise Spearman rank correlation.
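Both statistics can be computed directly from the evaluators' rank matrices. The sketch below uses the standard no-ties formulas (the actual analysis may have applied a tie correction), and the example ranks are illustrative:

```python
import numpy as np

def kendalls_w(ranks):
    """Kendall's W for an (m raters x n items) matrix of ranks 1..n, no ties."""
    ranks = np.asarray(ranks, dtype=float)
    m, n = ranks.shape
    rank_sums = ranks.sum(axis=0)
    s = ((rank_sums - rank_sums.mean()) ** 2).sum()
    return 12.0 * s / (m ** 2 * (n ** 3 - n))

def spearman_rho(r1, r2):
    """Spearman rank correlation between two untied rankings."""
    d2 = ((np.asarray(r1) - np.asarray(r2)) ** 2).sum()
    n = len(r1)
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

# Three evaluator LLMs ranking four dialogues (ranks are illustrative).
ranks = np.array([[1, 2, 3, 4],
                  [1, 3, 2, 4],
                  [2, 1, 3, 4]])
print(kendalls_w(ranks))             # concordance across all three evaluators
print(spearman_rho(ranks[0], ranks[1]))  # pairwise agreement of two evaluators
```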
Overall (Bradley-Terry Weights Ranking):
| Statistic | Value |
|---|---|
| Cases analyzed | 363 model–dialogue combinations |
| Average Kendall's W | 0.655 (moderate agreement) |
| Kendall's W range | 0.133–0.991 |
| Average Spearman correlation | 0.449 |
| Statistically significant (p < 0.05) | 226 / 363 cases |
Per-Model Average Kendall's W:
| Model | Avg W |
|---|---|
| Phi-3.5-mini-instruct | 0.819 |
| Llama-2-7b-chat | 0.725 |
| DialoGPT-small | 0.609 |
| BlenderBot-3B | 0.481 |
Agreement Distribution:
| Level | Count |
|---|---|
| Strong (W ≥ 0.7) | 169 |
| Moderate (0.4 ≤ W < 0.7) | 153 |
| Weak (W < 0.4) | 41 |
Pairwise comparison design: 50 participants (recruited via Prolific) evaluated 6 dialogue pairs, each comparing a baseline (w=0) dialogue against an optimized-weight dialogue. Participants ranked dialogues from best (1) to worst (2).
Rank the following dialogues from the best (1) to the worst (2) based on the relevance and coherence of the responses by S2.
Relevance: The appropriateness of responses to immediate conversational context, i.e., the previous utterance of Speaker 1 (S1).
Coherence: The maintenance of thematic consistency and logical progression with respect to the full dialogue.
| Statistic | Value |
|---|---|
| Participants | 50 |
| Dialogue pairs | 6 |
| Average percentage agreement | 58.3% |
| Average normalized entropy | 0.849 |
Note: Traditional IRR metrics such as Fleiss' κ are not well-suited for this single-item pairwise design. Percentage agreement and normalized entropy are reported as more appropriate measures.
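Per pair, percentage agreement is the majority share of the votes, and normalized entropy measures how split the votes are (0.0 = unanimous, 1.0 = an even 25/25 split). A sketch with an illustrative vote split, not the study's actual counts:

```python
import math

def pair_stats(votes_a, votes_b):
    """Majority agreement and normalized entropy for one binary preference pair."""
    total = votes_a + votes_b
    agreement = max(votes_a, votes_b) / total
    entropy = -sum(p * math.log2(p)
                   for p in (votes_a / total, votes_b / total) if p > 0)
    # With two options, log2 entropy is already normalized to [0, 1].
    return agreement, entropy

agreement, entropy = pair_stats(29, 21)  # illustrative 29-21 split of 50 votes
print(f"agreement={agreement:.1%}, normalized entropy={entropy:.3f}")
```

The reported 58.3% and 0.849 are averages of these per-pair values over all six pairs.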
Per-turn generation latency measured on an NVIDIA H100 PCIe GPU with weighted decoding (w=5), top-k=20, 10 candidate samples, and a maximum of 30 tokens per turn. Statistics cover 10 dialogues (80 generated turns per model) after 1 warm-up dialogue.
| Model | Median (s) | Mean (s) | Std (s) | Min (s) | Max (s) | p95 (s) |
|---|---|---|---|---|---|---|
| BlenderBot-3B | 0.8712 | 0.8923 | 0.1169 | 0.6725 | 1.1659 | 1.1208 |
| Llama-2-7b-chat | 1.0795 | 1.0899 | 0.1549 | 0.7549 | 1.5620 | 1.3642 |
| Phi-3.5-mini-instruct | 0.9144 | 0.9294 | 0.1349 | 0.6803 | 1.2783 | 1.2210 |
| DialoGPT-small | 0.4290 | 0.4346 | 0.0621 | 0.3102 | 0.5999 | 0.5437 |
Model Load Times: DialoGPT-small 5.1s, Phi-3.5-mini-instruct 14.0s, Llama-2-7b-chat 23.7s, BlenderBot-3B 4.8s
`benchmark_results.json` contains the full per-turn latencies and per-dialogue wall times for all models. Key fields:

```
timestamp, device, gpu_name, weight, num_samples, max_gen_length, top_k,
repetition_penalty, seed, num_dialogues, warmup_dialogues,
models[].{model_name, load_time_s, total_generated_turns,
          per_turn_latency_stats, per_dialogue_wall_times, all_turn_latencies}
```
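The summary statistics in the table above can be recomputed from the raw `all_turn_latencies` arrays. In this sketch, an inline dictionary stands in for the real file and its numbers are invented:

```python
import json
import numpy as np

# Inline stand-in for benchmark_results.json, following the key fields listed
# above; the latency values are invented for illustration.
raw = json.dumps({
    "gpu_name": "NVIDIA H100 PCIe",
    "weight": 5,
    "models": [{
        "model_name": "DialoGPT-small",
        "load_time_s": 5.1,
        "all_turn_latencies": [0.41, 0.43, 0.40, 0.46, 0.52, 0.39, 0.44, 0.60],
    }],
})
for m in json.loads(raw)["models"]:
    lat = np.asarray(m["all_turn_latencies"])
    print(f'{m["model_name"]}: median={np.median(lat):.4f}s '
          f'mean={lat.mean():.4f}s p95={np.percentile(lat, 95):.4f}s')
```

To run against the real data, replace `raw` with the contents of `benchmark_results.json`.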
High-resolution versions of figures from the paper (in figures/):
| File | Description |
|---|---|
| alignment_scores_low_weights.png | Alignment scores for weights ≤ 1000 |
| alignment_scores_high_weights.png | Alignment scores for weights > 1000 |
| perplexity_low_weights.png | Perplexity for weights ≤ 1000 |
| perplexity_high_weights.png | Perplexity for weights > 1000 |
| inter_model_win_rates.png | Inter-model comparison win rates |
| intra_model_bradley_terry_rankings.png | Intra-model Bradley-Terry rankings |
`inter_model_evaluation_results.tsv` columns: `evaluator_model`, `alignment_range`, `model_candidates`, `preference_ranking`

`intra_model_evaluation_results.tsv` columns: `evaluator_model`, `target_model`, `dialogue_id`, `weight_candidates`, `preference_ranking`
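Both TSVs can be parsed with the standard library alone. The inline sample below mimics `intra_model_evaluation_results.tsv`; the `preference_ranking` encoding shown here (comma-separated weights, best first) is an assumption for illustration, so check the real files before relying on it.

```python
import csv
import io
from collections import Counter

# Inline sample mimicking intra_model_evaluation_results.tsv.  The row values
# and the preference_ranking encoding (best weight first) are assumptions.
tsv = io.StringIO(
    "evaluator_model\ttarget_model\tdialogue_id\tweight_candidates\tpreference_ranking\n"
    "gpt-4o-mini\tLlama-2-7b-chat\t12\t0,300,1000\t300,0,1000\n"
    "claude-3.5-haiku\tLlama-2-7b-chat\t12\t0,300,1000\t300,1000,0\n"
)
# Count how often each weight configuration was ranked first.
first_place = Counter(row["preference_ranking"].split(",")[0]
                      for row in csv.DictReader(tsv, delimiter="\t"))
print(first_place.most_common())
```

Swapping `tsv` for `open("intra_model_evaluation_results.tsv")` runs the same tally over the full file.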
| Model | Type |
|---|---|
| BlenderBot-3B | Encoder-Decoder |
| DialoGPT-small | Decoder-only |
| Llama-2-7b-chat | Decoder-only |
| Phi-3.5-mini-instruct | Decoder-only |
Yanran Li, Hui Su, Xiaoyu Shen, Wenjie Li, Ziqiang Cao, and Shuzi Niu. 2017. DailyDialog: A Manually Labelled Multi-turn Dialogue Dataset. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 986–995, Taipei, Taiwan. Asian Federation of Natural Language Processing.
Citation details will be added upon publication.
For research purposes. See DailyDialog license for base dialogue data terms.