Mini-Gemini: Mining the Potential of Multi-modality Vision Language Models


To keep training efficient, the Mini-Gemini framework keeps its two vision encoders frozen, optimizes the patch info mining projectors in every stage, and optimizes the large language model only during the instruction tuning stage.
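The freezing schedule described above can be sketched as a small lookup of which components receive gradient updates in each stage. This is only an illustrative sketch, not the framework's actual code: the component names (`vision_encoder_a`, `vision_encoder_b`, `projectors`, `llm`) and the stage label `alignment` are hypothetical placeholders, and only the instruction tuning stage is named in the text.

```python
# Illustrative sketch only. Component names and the "alignment" stage label
# are hypothetical; the text names only the instruction tuning stage.
COMPONENTS = ("vision_encoder_a", "vision_encoder_b", "projectors", "llm")

# Components that are optimized in each stage. The two vision encoders never
# appear here, so they stay frozen throughout.
TRAINABLE_BY_STAGE = {
    "alignment": {"projectors"},
    "instruction_tuning": {"projectors", "llm"},
}


def trainable_flags(stage):
    """Return a {component: is_trainable} mapping for the given stage."""
    trainable = TRAINABLE_BY_STAGE[stage]
    return {name: name in trainable for name in COMPONENTS}


if __name__ == "__main__":
    for stage in TRAINABLE_BY_STAGE:
        print(stage, trainable_flags(stage))
```

In a real training loop, flags like these would typically decide which parameters have gradients enabled before the optimizer is built.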

Visual Instruction Tuning for Pixel-Level Understanding with Osprey


Owing to its design and architecture, the Osprey framework achieves fine-grained semantic understanding of object-level and part-level regions, and provides detailed object attributes along with the primary object category and enhanced descriptions of complex scenes.