Question: Does R-KV support KV compression during the prefill stage? #21

Description

@ZeitHaum

Hello, and thank you for the great work on R-KV!

I’ve been exploring the implementation in HuggingFace/rkv/monkeypatch.py and noticed that the current logic seems to apply KV compression only after the prefill stage. For example, in the following part:

query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)

cos, sin = position_embeddings
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)

It looks like the prefill pass computes the full KV and writes it to the cache uncompressed; compression only kicks in once decoding starts.
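
To make "compression before caching" concrete, here is the kind of hook I have in mind. It is only a simplified, importance-only sketch (R-KV's actual criterion also accounts for redundancy, as I understand it), and compress_prefill_kv / keep_ratio are names I made up for illustration:

import torch

def compress_prefill_kv(key_states, value_states, attn_weights, keep_ratio=0.25):
    # Hypothetical hook: prune the prompt KV before it is written to the cache.
    # key_states / value_states: (batch, num_heads, seq_len, head_dim)
    # attn_weights:              (batch, num_heads, seq_len, seq_len)
    head_dim = key_states.shape[-1]
    seq_len = key_states.shape[-2]
    keep = max(1, int(seq_len * keep_ratio))
    # Importance proxy: total attention mass each key position receives.
    scores = attn_weights.sum(dim=-2)  # (batch, num_heads, seq_len)
    kept = scores.topk(keep, dim=-1).indices.sort(dim=-1).values
    idx = kept.unsqueeze(-1).expand(-1, -1, -1, head_dim)
    return key_states.gather(2, idx), value_states.gather(2, idx)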

My questions are:

1. Does the R-KV implementation (especially the vLLM version) support KV compression during the prefill stage?

2. If not, is there a recommended way to apply compression earlier to reduce TTFT (Time To First Token) for long-context inputs? (A rough idea of what I mean is sketched below.)
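
The workaround I have been considering is a chunked prefill that prunes the cache between chunks, so the full uncompressed prompt KV never has to be materialized at once. A rough sketch, assuming a standard HF model with use_cache=True; prune_cache is a hypothetical stand-in for whatever selection R-KV would apply:

import torch

@torch.no_grad()
def chunked_prefill(model, input_ids, prune_cache, chunk_size=2048):
    # Hypothetical: feed the prompt chunk by chunk and prune the KV cache
    # between chunks, keeping peak cache size bounded during prefill.
    past_key_values = None
    for start in range(0, input_ids.shape[1], chunk_size):
        chunk = input_ids[:, start:start + chunk_size]
        # Positions must be passed explicitly: after pruning, the cache
        # length no longer matches the true token positions.
        position_ids = torch.arange(
            start, start + chunk.shape[1], device=input_ids.device
        ).unsqueeze(0)
        out = model(chunk, past_key_values=past_key_values,
                    position_ids=position_ids, use_cache=True)
        past_key_values = prune_cache(out.past_key_values)
    return past_key_values  # compressed cache, ready for decoding

I am not sure whether this interacts correctly with R-KV's redundancy-aware selection (or with RoPE positions after pruning), which is part of why I am asking.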

I’d like to test R-KV’s performance on long-text generation tasks, so understanding whether early-stage compression is possible would be really helpful.

Thank you!
