Hello, and thank you for the great work on R-KV!
I’ve been exploring the implementation in HuggingFace/rkv/monkeypatch.py and noticed that the current logic seems to apply KV compression only after the prefill stage. For example, in the following part:
```python
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)

cos, sin = position_embeddings
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
```
It seems that the prefill attention does not apply compression before caching.
My questions are:
1. Does the R-KV implementation (especially the vLLM version) support KV compression during the prefill stage?
2. If not, is there a recommended way to apply compression earlier and reduce TTFT (Time To First Token) for long-context inputs?
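To make the intent behind question 2 concrete, here is a minimal, purely illustrative sketch of what I mean by "compression during prefill": process the prompt in chunks and compress each chunk's KV entries before appending them to the cache, so the cache never holds the full uncompressed prompt. The function names, the per-chunk budget, and the importance scoring are all hypothetical placeholders, not R-KV's actual API.

```python
# Hypothetical sketch (NOT R-KV's actual API): chunked prefill where each
# chunk's KV entries are scored and pruned before being cached.

def compress_chunk(kv_chunk, scores, budget):
    """Keep only the `budget` highest-scoring KV entries of one chunk."""
    ranked = sorted(range(len(kv_chunk)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])  # preserve original token order
    return [kv_chunk[i] for i in keep]

def chunked_prefill(tokens, chunk_size, budget_per_chunk, score_fn):
    """Prefill in chunks, compressing each chunk's KV before caching it."""
    cache = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        kv_chunk = [("kv", t) for t in chunk]  # stand-in for real K/V tensors
        scores = [score_fn(t) for t in chunk]  # stand-in importance scores
        cache.extend(compress_chunk(kv_chunk, scores, budget_per_chunk))
    return cache

# Example: keep the 2 highest-scoring tokens per 4-token chunk,
# so the cache holds 4 entries instead of 8.
cache = chunked_prefill(list(range(8)), chunk_size=4, budget_per_chunk=2,
                        score_fn=lambda t: t % 4)
print(len(cache))  # → 4
```

If R-KV's scoring can be computed per chunk like this, the peak cache size during prefill would be bounded by the budget rather than the prompt length, which is what I'm hoping reduces TTFT.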
I’d like to test R-KV’s performance on long-text generation tasks, so understanding whether early-stage compression is possible would be really helpful.
Thank you!