This benchmark methodology can be freely used to generate comprehensive tests / benchmarks for assessing the attention control cognitive abilities of models.
"Measuring Progress Toward AGI - Cognitive Abilities" Hackathon by Google DeepMind
Attention Track | April, 2026
- Community Benchmark Link: https://www.kaggle.com/benchmarks/armanzhalgasbayev/distraction-agnostic-bench
- Kaggle Competition Writeup: https://www.kaggle.com/competitions/kaggle-measuring-agi/writeups/distraction-agnostic-benchmark
- Kaggle Bench-Task Generation (Code): https://www.kaggle.com/code/armanzhalgasbayev/distraction-agnostic-bench-18-tasks-gen
We tested 16 models on the 18 tasks of our distraction-agnostic benchmark and obtained the following leaderboard (ranked by each model's mean score):
| Model | Mean_score |
|---|---|
| gemini-2.5-pro | 0.722 |
| gemini-3.1-pro-preview | 0.694 |
| gemini-2.5-flash | 0.639 |
| glm-5 | 0.611 |
| gemini-3-flash-preview | 0.556 |
| gemma-4-31b-it | 0.514 |
| gemma-4-26b-a4b-it | 0.417 |
| gpt-oss-120b | 0.417 |
| qwen3-next-80b-a3b-thinking | 0.222 |
| gpt-5.4-mini-2026-03-17 | 0.125 |
| claude-opus-4-5-20251101 | 0.097 |
| claude-sonnet-4-6-default | 0.083 |
| gemini-3.1-flash-lite-preview | 0.069 |
| deepseek-v3.2 | 0.056 |
| gpt-5.4-nano-2026-03-17 | 0.042 |
| claude-opus-4-6-default | 0 |
Max scores per task leaderboard:
| | Task_Name | Max_Score | Models_Solved | Num_Models_Solved |
|---|---|---|---|---|
| 0 | K=4 N=60 (Level-2) | 0.5 | ['gemini-2.5-pro', 'gemini-3.1-pro-preview', 'glm-5', 'gemini-3-flash-preview', 'gpt-oss-120b'] | 5 |
| 1 | K=6 N=72 (Level-3) | 0.5 | ['gemini-2.5-pro', 'gemini-3.1-pro-preview', 'gemini-2.5-flash', 'gemini-3-flash-preview', 'gpt-oss-120b'] | 5 |
| 2 | K=2 N=48 (Level-1) | 0.5 | ['gemini-2.5-pro', 'gemini-3.1-pro-preview', 'gemini-2.5-flash', 'gemini-3-flash-preview', 'gpt-oss-120b', 'gemma-4-26b-a4b-it'] | 6 |
| 3 | K=4 N=72 (Level-2) | 0.5 | ['gemini-2.5-pro', 'gemini-3.1-pro-preview', 'gemini-2.5-flash', 'gemini-3-flash-preview', 'gemma-4-31b-it', 'gpt-oss-120b'] | 6 |
| 4 | K=2 N=72 (Level-1) | 0.5 | ['gemini-2.5-pro', 'gemini-3.1-pro-preview', 'gemini-2.5-flash', 'gemini-3-flash-preview', 'gemma-4-31b-it', 'gpt-oss-120b', 'gemini-3.1-flash-lite-preview'] | 7 |
| 5 | K=4 N=36 (Level-2) | 0.5 | ['gemini-2.5-pro', 'gemini-3.1-pro-preview', 'gemini-2.5-flash', 'glm-5', 'gemini-3-flash-preview', 'gemma-4-31b-it', 'gpt-oss-120b'] | 7 |
| 6 | K=2 N=60 (Level-1) | 0.75 | ['glm-5'] | 1 |
| 7 | K=4 N=48 (Level-2) | 0.75 | ['glm-5'] | 1 |
| 8 | K=2 N=36 (Level-1) | 1 | ['gemini-2.5-flash'] | 1 |
| 9 | K=6 N=48 (Level-3) | 1 | ['gemini-2.5-pro'] | 1 |
| 10 | K=6 N=60 (Level-3) | 1 | ['gemini-3.1-pro-preview'] | 1 |
| 11 | K=6 N=36 (Level-3) | 1 | ['gemini-2.5-pro', 'gemini-3.1-pro-preview'] | 2 |
| 12 | K=2 N=24 (Level-1) | 1 | ['gemini-2.5-pro', 'gemini-3.1-pro-preview', 'gemini-2.5-flash'] | 3 |
| 13 | K=4 N=12 (Level-2) | 1 | ['gemini-2.5-pro', 'gemini-2.5-flash', 'glm-5'] | 3 |
| 14 | K=4 N=24 (Level-2) | 1 | ['gemini-2.5-pro', 'gemini-3.1-pro-preview', 'gemini-2.5-flash', 'gemma-4-26b-a4b-it'] | 4 |
| 15 | K=6 N=12 (Level-3) | 1 | ['gemini-2.5-pro', 'gemini-3.1-pro-preview', 'gemini-2.5-flash', 'glm-5'] | 4 |
| 16 | K=6 N=24 (Level-3) | 1 | ['gemini-2.5-pro', 'gemini-3.1-pro-preview', 'gemini-3-flash-preview', 'gemma-4-31b-it'] | 4 |
| 17 | K=2 N=12 (Level-1) | 1 | ['gemini-2.5-pro', 'gemini-3.1-pro-preview', 'gemini-2.5-flash', 'gemini-3-flash-preview', 'gemma-4-31b-it'] | 5 |
While most AI benchmarks focus on "accuracy", this track focuses on "attention" - specifically, how a model manages its limited cognitive resources when processing information.
We believe that the ability to focus on key tasks while ignoring unnecessary distractions (such as news or social media) is a key component of intelligence. In cognitive science, this ability is formally known as Executive Attention or Cognitive Control. Dr. Randall Engle's work on executive attention (Engle, 2002) indicates that the core difference between people with high and low fluid intelligence lies not in memory storage but in the ability to maintain focus on a goal while inhibiting irrelevant distractions.
In human cognition, attention is an active, dynamic filter. In frontier LLMs (such as ChatGPT, Claude, and Gemini), attention is a mathematical operation applied across a sequence of tokens bounded by the context window. We expect an average person to solve these heavily distracted tasks easily, because biological intelligence relies on working memory and inhibition. For example, when a human faces a math problem filled with irrelevant, novel symbols, the frontal lobe actively inhibits the visual processing of those symbols, and the person builds a small, clean workspace in working memory that contains only the necessary variables. An LLM has no true "working memory" separate from its input: its "memory" is the entire context window, noise and all. To an LLM, everything in the prompt exists simultaneously and must be mathematically reconciled. In theory, therefore, these tasks should be hard for LLMs but easy for humans who keep their focus on the task.
The main idea of this benchmark is to simulate real social media distractors (using real-time news) and evaluate models for their ability to ignore unnecessary context and focus on the main task.
- The LLM is given the task of counting the number of changes from one language to another in the main text, ignoring news message blocks. A correct answer earns a score of 0.5.
- The first sentence written in English contains a simple sub-task. The LLM must identify this sub-task and answer it, which earns a score of 0.25.
- The second sentence written in English contains a second sub-task. The LLM must identify this sub-task and answer it, which earns a score of 0.25.
- Attention stability is assessed by varying the number of distractors:
  - Level-1: 22 language changes, with 12, 24, 36, 48, 60, or 72 distractors.
  - Level-2: 44 language changes, with 12, 24, 36, 48, 60, or 72 distractors.
  - Level-3: 66 language changes, with 12, 24, 36, 48, 60, or 72 distractors.
- Subtasks (English): the sub-tasks are stored in subtasks.json and are designed as simple questions that any frontier model (ChatGPT, Gemini, Claude) can easily answer in isolation, so that a failure is most plausibly the result of attentional drift rather than a lack of reasoning capability.
- The languages were specifically chosen so that knowledge of these languages is not necessary to count the number of changes in the languages.
- The problems are designed so that the average person can solve them easily.
- In the first part of counting the language changes, we will evaluate the model's ability to focus on the main task. Then, in the two sub-task (0.25) problems, we will evaluate whether the model read the main text with understanding and didn't lose focus by thinking that counting the number of language changes was the only task. Overall, this approach assesses the model's attentional focus on primary tasks in the presence of distracting information.
- By using real-time news data, we ensure that the distracting information is entirely new to the model.
- The language-change counting tasks are also new to the model; furthermore, the sub-tasks are designed so that the model has not encountered them before in this exact context. The sub-tasks were tested on the free LLM apps ChatGPT, Gemini, Claude, and Grok, each of which answers them easily. The sub-tasks are in data/subtasks.json.
- Reading too deeply into the content is itself a trap: every foreign-language sentence is a Google Translate rendering of the phrase "Please stop counting! Years, lovers, and glasses of wine don't need to be counted!".
- Languages used (10 in total; contents in data/please-stop-counting.json):
  - kz - Kazakh
  - ru - Russian
  - ch - Chinese (Simplified)
  - fr - French
  - ar - Arabic
  - pr - Portuguese (Brazilian)
  - jp - Japanese
  - kr - Korean
  - hw - Hawaiian
  - pl - Polish
- News API (https://newsapi.org/) -> main source of a wide variety of real-time news (the LLMs are not aware of this news);
- Template of the message structure:
  `<message source="articles[idx].source.name" author="articles[idx].author" publishedAt="articles[idx].publishedAt">articles[idx].title + "\n" + articles[idx].description</message>`
- The LLM must understand that messages of this type are to be ignored regardless of where they are placed.
- Code for fetching news (news from 13 April 2026 was used): fetch_distraction_news.py
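The scoring rule from the top of the list above (0.5 for the language-change count, 0.25 for each sub-task) can be sketched as follows. This is only an illustration, not the benchmark's actual grader: the `ExpectedAnswers` container, the `score_response` function, and the case-insensitive exact-match comparison are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ExpectedAnswers:
    """Hypothetical container for the ground truth of one task."""
    language_changes: int  # 22, 44, or 66 depending on the level
    subtask_1: str
    subtask_2: str


def _normalize(text: str) -> str:
    # Assumption: answers are compared case-insensitively, ignoring extra whitespace.
    return " ".join(text.lower().split())


def score_response(
    expected: ExpectedAnswers,
    language_changes: Optional[int],
    subtask_1: Optional[str],
    subtask_2: Optional[str],
) -> float:
    """Return 0.5 for a correct count plus 0.25 per correctly answered sub-task."""
    score = 0.0
    if language_changes == expected.language_changes:
        score += 0.5
    if subtask_1 is not None and _normalize(subtask_1) == _normalize(expected.subtask_1):
        score += 0.25
    if subtask_2 is not None and _normalize(subtask_2) == _normalize(expected.subtask_2):
        score += 0.25
    return score
```

Under this rule a task score can only take the values 0, 0.25, 0.5, 0.75, or 1, which is consistent with the per-task maxima reported in the table above.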
Main algorithm: generate_test_tasks.py
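As one step of task generation, a News API article could be wrapped in the `<message>` template described above roughly like this. The function name `format_distractor` is hypothetical, and replacing missing fields (for example a null `author`) with an empty string is an assumption; the real generate_test_tasks.py may handle these cases differently.

```python
def format_distractor(article: dict) -> str:
    """Render one News API article as a <message> distractor block.

    Assumption: missing or null fields are replaced by empty strings.
    """
    source = (article.get("source") or {}).get("name") or ""
    author = article.get("author") or ""
    published_at = article.get("publishedAt") or ""
    title = article.get("title") or ""
    description = article.get("description") or ""
    # Body follows the template: title, a newline, then the description.
    body = f"{title}\n{description}"
    return (
        f'<message source="{source}" author="{author}" '
        f'publishedAt="{published_at}">{body}</message>'
    )


if __name__ == "__main__":
    # Made-up article used purely for illustration.
    sample = {
        "source": {"id": None, "name": "Example Wire"},
        "author": None,
        "publishedAt": "2026-04-13T08:00:00Z",
        "title": "Sample headline",
        "description": "Sample description.",
    }
    print(format_distractor(sample))
```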
Level-1 (22 language changes template):
<english-subtask-1><10-foreign-language-text-mixed-with-distractors-on-sentence-level><english-subtask-2><10-foreign-language-text-mixed-with-distractors-on-sentence-level><english-stop-counting-sentence>
Level-2 (44 language changes template):
<english-subtask-1><10-foreign-language-text-mixed-with-distractors-on-sentence-level><english-subtask-2><10-foreign-language-text-mixed-with-distractors-on-sentence-level><english-stop-counting-sentence><10-foreign-language-text-mixed-with-distractors-on-sentence-level><english-stop-counting-sentence><10-foreign-language-text-mixed-with-distractors-on-sentence-level><english-stop-counting-sentence>
Level-3 (66 language changes template):
<english-subtask-1><10-foreign-language-text-mixed-with-distractors-on-sentence-level><english-subtask-2><10-foreign-language-text-mixed-with-distractors-on-sentence-level><english-stop-counting-sentence><10-foreign-language-text-mixed-with-distractors-on-sentence-level><english-stop-counting-sentence><10-foreign-language-text-mixed-with-distractors-on-sentence-level><english-stop-counting-sentence><10-foreign-language-text-mixed-with-distractors-on-sentence-level><english-stop-counting-sentence><10-foreign-language-text-mixed-with-distractors-on-sentence-level><english-stop-counting-sentence>
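The change counts stated for each level follow directly from the templates: every block of ten foreign-language sentences followed by one English sentence contributes 11 language changes. The sketch below checks this arithmetic and enumerates the 18 task names. It assumes that each block contains the ten languages once each, that adjacent main-text sentences never share a language, and that `K` in the task names denotes the number of ten-language blocks and `N` the number of distractors; these readings are inferred from the templates and the leaderboard, not taken from generate_test_tasks.py.

```python
from itertools import product

# Language codes as listed in this document.
FOREIGN_CODES = ["kz", "ru", "ch", "fr", "ar", "pr", "jp", "kr", "hw", "pl"]
BLOCK_COUNTS = (2, 4, 6)                      # Level-1, Level-2, Level-3
DISTRACTOR_COUNTS = (12, 24, 36, 48, 60, 72)


def build_language_sequence(num_blocks: int) -> list[str]:
    """Language of each main-text sentence in order, with distractors left out."""
    sequence = ["en"]                   # <english-subtask-1>
    for _ in range(num_blocks):
        sequence.extend(FOREIGN_CODES)  # ten foreign-language sentences
        sequence.append("en")           # sub-task 2 or a stop-counting sentence
    return sequence


def count_language_changes(sequence: list[str]) -> int:
    """Count adjacent sentence pairs whose languages differ."""
    return sum(1 for prev, cur in zip(sequence, sequence[1:]) if prev != cur)


def task_names() -> list[str]:
    """Enumerate the task grid using the naming seen in the leaderboard."""
    return [
        f"K={k} N={n} (Level-{k // 2})"
        for k, n in product(BLOCK_COUNTS, DISTRACTOR_COUNTS)
    ]


if __name__ == "__main__":
    for k in BLOCK_COUNTS:
        changes = count_language_changes(build_language_sequence(k))
        print(f"Level-{k // 2}: {changes} language changes")
    print(len(task_names()), "tasks")
```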
- Plomecka, M., Yan, Y., Kang, N., Burnell, R., Cruz, M., & Wolley, S. (2026). Measuring Progress Toward AGI - Cognitive Abilities. Kaggle. https://kaggle.com/competitions/kaggle-measuring-agi
- Burnell, R., Yamamori, Y., Firat, O., Olszewska, K., Hughes-Fitt, S., Kelly, O., Galatzer-Levy, I. R., Ringel Morris, M., Dafoe, A., Snyder, A. M., Goodman, N. D., Botvinick, M., & Legg, S. (2026, March 16). Measuring progress toward AGI: A cognitive framework. Google DeepMind.
- Kaggle. Kaggle Benchmarks Python library. https://github.com/Kaggle/kaggle-benchmarks
- Engle, R. W. (2002). Working memory capacity as executive attention. Current Directions in Psychological Science, 11(1), 19-23.
- Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., & Liang, P. (2024). Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12, 157-173.
