Hello, and thank you for the great work on R-KV!
I’ve been exploring the implementation in HuggingFace/rkv/monkeypatch.py and noticed that the current logic seems to apply KV compression only after the prefill stage. For example, in the following part:
```python
query_states = self.q_proj(hidden_states).view(hidden_shape).transpose(1, 2)
key_states = self.k_proj(hidden_states).view(hidden_shape).transpose(1, 2)
value_states = self.v_proj(hidden_states).view(hidden_shape).transpose(1, 2)

cos, sin = position_embeddings
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
```
It seems that the prefill attention does not apply compression before caching.
My questions are:
1. Does the R-KV implementation (especially the vLLM version) support KV compression during the prefill stage?
2. If not, is there a recommended way to apply compression earlier and reduce TTFT (Time To First Token) for long-context inputs?
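To make the intent behind question 2 concrete, here is a minimal, purely illustrative sketch of what I mean by "compression during prefill": process the prompt in chunks and compress each chunk's KV entries before appending them to the cache, so the cache never holds the full uncompressed prompt. The function names, the per-chunk budget, and the importance scoring are all hypothetical placeholders, not R-KV's actual API.

```python
# Hypothetical sketch (NOT R-KV's actual API): chunked prefill where each
# chunk's KV entries are scored and pruned before being cached.

def compress_chunk(kv_chunk, scores, budget):
    """Keep only the `budget` highest-scoring KV entries of one chunk."""
    ranked = sorted(range(len(kv_chunk)), key=lambda i: scores[i], reverse=True)
    keep = sorted(ranked[:budget])  # preserve original token order
    return [kv_chunk[i] for i in keep]

def chunked_prefill(tokens, chunk_size, budget_per_chunk, score_fn):
    """Prefill in chunks, compressing each chunk's KV before caching it."""
    cache = []
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        kv_chunk = [("kv", t) for t in chunk]  # stand-in for real K/V tensors
        scores = [score_fn(t) for t in chunk]  # stand-in importance scores
        cache.extend(compress_chunk(kv_chunk, scores, budget_per_chunk))
    return cache

# Example: keep the 2 highest-scoring tokens per 4-token chunk,
# so the cache holds 4 entries instead of 8.
cache = chunked_prefill(list(range(8)), chunk_size=4, budget_per_chunk=2,
                        score_fn=lambda t: t % 4)
print(len(cache))  # → 4
```

If R-KV's scoring can be computed per chunk like this, the peak cache size during prefill would be bounded by the budget rather than the prompt length, which is what I'm hoping reduces TTFT.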
I’d like to test R-KV’s performance on long-text generation tasks, so understanding whether early-stage compression is possible would be really helpful.
Thank you!