Since Vlaser VLM is obtained by supervised finetuning (SFT) on top of InternVL3, its training and inference procedures are the same as InternVL3's. For ease of use, we list the key steps (environment setup, training, and inference) here. If you encounter other kinds of questions, please refer to the InternVL series documentation first.
- Create a conda virtual environment and activate it:

  ```shell
  conda create -n internvl python=3.9
  conda activate internvl
  ```
- Install dependencies using `requirements.txt`:

  ```shell
  pip install -r requirements.txt
  pip install flash-attn==2.3.6 --no-build-isolation
  ```
Vlaser VLM is trained by performing a second supervised finetuning (SFT) stage on InternVL3. If you want to train the model from scratch on your own data, please create your own meta_data.json using the same format as this, and pass its path to the `--meta_path` argument in the shell script below. The format of each individual JSONL file (plain-text data, single-image data, multi-image data, or video data) can be organized according to the descriptions provided in this document.
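For orientation, one entry of the meta file follows the InternVL convention sketched below; the dataset name, paths, and sample count here are placeholders, not shipped defaults:

```json
{
  "my-embodied-dataset": {
    "root": "data/my_embodied_dataset/images/",
    "annotation": "data/my_embodied_dataset/train.jsonl",
    "data_augment": false,
    "repeat_time": 1,
    "length": 10000
  }
}
```

`repeat_time` oversamples (or subsamples) the dataset, and `length` should match the number of lines in the annotation JSONL.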
- Run the training scripts:

  ```shell
  cd internvl_chat
  # for Vlaser-2B
  bash shell/internvl3.0/2nd_finetune/internvl3_2b_dynamic_res_2nd_finetune_full.sh
  # for Vlaser-8B
  bash shell/internvl3.0/2nd_finetune/internvl3_8b_dynamic_res_2nd_finetune_full.sh
  ```
| Model | Type | Download | Size |
|---|---|---|---|
| InternVL3-2B | huggingface | 🤗 HF link | 4.2 GB |
| InternVL3-8B | huggingface | 🤗 HF link | 15.9 GB |
Please download the above model weights first and pass the local download path to the `--model_name_or_path` argument in the shell script.
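For example, the base weights can be fetched with `huggingface-cli`; the repository IDs below follow the official InternVL3 releases on Hugging Face, and the target directories are arbitrary choices you can change:

```shell
pip install -U "huggingface_hub[cli]"
huggingface-cli download OpenGVLab/InternVL3-2B --local-dir pretrained/InternVL3-2B
huggingface-cli download OpenGVLab/InternVL3-8B --local-dir pretrained/InternVL3-8B
```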
We use transformers (transformers==4.54.0) for inference. We provide example inference code for Vlaser, covering single-image single-round conversation, single-image multi-round conversation, and multi-image multi-round conversation, in eval_example.py. Please insert the evaluation part of the code into the corresponding official implementations of the following embodied reasoning benchmarks: ERQA, Ego-Plan2, Where2place, Pointarena, VSIBench, RefSpatial, MMSIBench, VLABench, and EmbodiedBench. We also provide self-implemented evaluation scripts for Pixmo-Points and Paco-Lavis in internvl_chat/eval.
For benchmarks that require point-grounding capability, please prepend the following prefix to the question itself for full reproduction:
```
You are InternVL. Your task is to locate several points in the given image according to the task descriptions. Your answer should be formatted as "<point>[[x1, y1], [x2, y2],...]</point>". The point coordinates are normalized to integers between 0 and 1000. Return the answer in the point format directly.
```
For example:

```
Question: You are InternVL. Your task is to locate several points in the given image according to the task descriptions. Your answer should be formatted as "<point>[[x1, y1], [x2, y2],...]</point>". The point coordinates are normalized to integers between 0 and 1000. Return the answer in the point format directly. Point to all the moai.
Answer: <point>[[254, 624], [304, 624], [351, 624], [400, 624], [460, 624], [540, 619], [645, 606], [795, 606], [925, 599]]</point>
```
As a post-processing step, you need to extract the point coordinates inside `<point></point>` and convert them back to ordinary pixel coordinates in terms of image width and height:
```python
import re

import numpy as np
from PIL import Image


def text2pts(text, img_path):
    """Extract <point>[[x, y], ...]</point> coordinates and rescale them to pixels."""
    width, height = Image.open(img_path).size
    pattern = r"<point>\s*\[\s*(\[.*?\])\s*\]</point>"
    match = re.search(pattern, text, re.DOTALL)
    points = []
    if match:
        points_str = match.group(1)
        # Match all [x, y] pairs
        coord_pattern = r"\[\s*(\d+)\s*,\s*(\d+)\s*\]"
        coords = re.findall(coord_pattern, points_str)
        for x_str, y_str in coords:
            x = int(x_str)
            y = int(y_str)
            # Coordinates are normalized to [0, 1000]; rescale to image size
            x = int(x / 1000 * width)
            y = int(y / 1000 * height)
            points.append((x, y))
    return np.array(points)
```
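To sanity-check this post-processing end to end, the self-contained sketch below runs the same parsing on a dummy image; the 500×400 size, the temporary file, and the answer string are illustrative, not part of any benchmark:

```python
import re
import tempfile

import numpy as np
from PIL import Image


def text2pts(text, img_path):
    # Extract <point>[[x, y], ...]</point> and rescale to pixel coordinates.
    width, height = Image.open(img_path).size
    match = re.search(r"<point>\s*\[\s*(\[.*?\])\s*\]</point>", text, re.DOTALL)
    points = []
    if match:
        for x_str, y_str in re.findall(r"\[\s*(\d+)\s*,\s*(\d+)\s*\]", match.group(1)):
            points.append((int(int(x_str) / 1000 * width),
                           int(int(y_str) / 1000 * height)))
    return np.array(points)


# Dummy 500x400 image standing in for a real benchmark image.
with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as f:
    Image.new("RGB", (500, 400)).save(f.name)
    answer = "<point>[[254, 624], [925, 599]]</point>"
    pts = text2pts(answer, f.name)

print(pts.tolist())  # [[127, 249], [462, 239]]
```

Note that the normalized coordinates are scaled by width/1000 and height/1000 independently, so x and y must not be swapped when indexing into the image.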
