Abstract: Large Vision-Language Models (VLMs) have been extended to understand both images and videos. Visual token compression is leveraged to reduce the considerable token length of visual inputs.
Some results have been hidden because they may be inaccessible to you
Show inaccessible results