Many recent works could be added to the paper list.

Recently, many MLLM works covering both image and video understanding have achieved strong results on video benchmarks, e.g. LLaVA-Next, InternLM, Vila, etc.
I think these works should also be added to the paper list so that readers can get a comprehensive understanding of this field.