I tried an open source multimodal LLM, and it failed to impress me

A group of computer scientists from different universities has released an open source multimodal LLM called LLaVA, and I stumbled upon it while scrolling through Twitter last week. Similar to GPT-4, this LLM can process both text and image input. The project combines a general-purpose LLM with an image encoder to build the Large Language and Vision Assistant. Since the announced features looked promising, I decided to test drive this large language model to see how accurate and reliable it is and what we can expect from GPT-4's upcoming multimodal model (especially its visual capabilities). With that said, let's go ahead and explore LLaVA.

LLaVA (Large Language-and-Vision Assistant) is a multimodal LLM, similar to OpenAI's GPT-4, that can handle both text and image input. Although OpenAI has not yet rolled out image processing in GPT-4, this new open source project has already gotten there by wiring a vision encoder into an existing LLM.
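
If you are curious what that wiring looks like in practice, here is a minimal, purely illustrative PyTorch sketch. The tensors are random stand-ins and the sizes are rough guesses rather than the real checkpoints; the point is just the shape of the idea: image features from a vision encoder are projected into the language model's embedding space and prepended to the text tokens, so the LLM can read the image as if it were part of the prompt.

```python
import torch
import torch.nn as nn

# Toy sizes, not the real model dimensions.
vision_dim, llm_dim = 1024, 4096          # roughly: CLIP ViT-L/14 feature width, Vicuna hidden size
num_patches, num_text_tokens = 256, 16    # image patches and prompt tokens (illustrative)

# Pretend outputs of the two frozen components:
image_features = torch.randn(1, num_patches, vision_dim)    # what a vision encoder would produce
text_embeddings = torch.randn(1, num_text_tokens, llm_dim)  # what the LLM's embedding layer would produce

# The piece a project like this adds: a small projection that maps image
# features into the LLM's embedding space so they can sit in the prompt like tokens.
projection = nn.Linear(vision_dim, llm_dim)
image_tokens = projection(image_features)

# The language model then sees the image "tokens" prepended to the text prompt.
llm_input = torch.cat([image_tokens, text_embeddings], dim=1)
print(llm_input.shape)  # torch.Size([1, 272, 4096])
```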

Developed by computer scientists at the University of Wisconsin-Madison, Microsoft Research, and Columbia University, the project aims to demonstrate how a multimodal model would work and compare its capabilities to GPT-4.

It uses Vicuna as the large language model (LLM) and CLIP ViT-L/14, which, for the uninitiated, was developed by OpenAI, as the visual encoder. The team generated high-quality multimodal instruction-following data using GPT-4, and the resulting model delivers excellent performance, achieving 92.53% on the ScienceQA benchmark.
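
If you would rather poke at a LLaVA model on your own machine instead of a hosted demo, checkpoints are available through Hugging Face's transformers library. Treat the snippet below as a rough sketch under my own assumptions: it uses the community-maintained llava-hf/llava-1.5-7b-hf port (a later checkpoint, not necessarily the exact model I tested), and the image path is a placeholder you would swap for your own file.

```python
# pip install transformers accelerate torch pillow
# (the 7B model wants a GPU with plenty of VRAM; it will run on CPU, just slowly)
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"   # community port of a later LLaVA checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id, device_map="auto")

image = Image.open("my_photo.jpg")      # placeholder: any local image you want described
prompt = "USER: <image>\nWhat is shown in this picture? ASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=200)
print(processor.decode(output[0], skip_special_tokens=True))
```

The `<image>` placeholder in the prompt marks where the projected image features from the earlier sketch get slotted in before the model starts generating its answer.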
