Microsoft releases a small Phi-3 Vision multimodal model

In April, Microsoft released the first AI model in its open-source Phi-3 family: Phi-3 Mini. Now, roughly a month later, the Redmond giant has released a small multimodal model called Phi-3 Vision. At Build 2024, Microsoft also unveiled two more Phi-3 family models: Phi-3 Small (7B) and Phi-3 Medium (14B). All of these models are open source under the MIT license.

Microsoft Phi-3 Vision: A small but powerful multimodal AI model

As for the Phi-3 Vision model, it has 4.2 billion parameters, which makes it quite lightweight. This is the first time a mega-company like Microsoft has released an open-source multimodal model. It has a context length of 128K tokens and accepts images as input alongside text. Google recently released its PaliGemma model, but it is not intended for conversational use.

That aside, Microsoft says the Phi-3 Vision model was trained on publicly available, high-quality documents and code. Microsoft also generated synthetic data covering mathematics, reasoning, general knowledge, charts, tables, graphs, and images.

Image courtesy: Microsoft

Despite its small size, the Phi-3 Vision model outperforms Claude 3 Haiku, LLaVA, and Gemini 1.0 Pro on many multimodal benchmarks, and it comes pretty close to OpenAI's GPT-4V. Microsoft says developers can use Phi-3 Vision for OCR, chart and table understanding, general image understanding, and more.
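For developers who want to try it, the weights are published on Hugging Face. Below is a minimal sketch of querying the model with the transformers library; the checkpoint name microsoft/Phi-3-vision-128k-instruct and the <|image_1|> image-placeholder convention follow Microsoft's published model card, while the example image URL is a stand-in you would replace with your own.

```python
# Minimal sketch: asking Phi-3 Vision about an image via Hugging Face
# transformers. Assumes a CUDA GPU; the image URL below is a placeholder.
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-3-vision-128k-instruct"

# trust_remote_code is needed because the checkpoint ships custom model code
model = AutoModelForCausalLM.from_pretrained(
    model_id, device_map="cuda", torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Phi-3 Vision expects image placeholders like <|image_1|> in the prompt
messages = [
    {"role": "user", "content": "<|image_1|>\nWhat does this chart show?"},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)

# Placeholder image; swap in your own chart, table, or photo
image = Image.open(requests.get("https://example.com/chart.png", stream=True).raw)

inputs = processor(prompt, [image], return_tensors="pt").to("cuda")
generate_ids = model.generate(
    **inputs, max_new_tokens=500, eos_token_id=processor.tokenizer.eos_token_id
)

# Strip the prompt tokens so only the model's answer is decoded
generate_ids = generate_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(generate_ids, skip_special_tokens=True)[0])
```

The same pattern covers the OCR and chart-understanding use cases mentioned above; only the prompt text changes.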
