Don't sleep on Grok 2.0: it is powerful but controversial

Elon Musk-led xAI recently released its state-of-the-art Grok 2.0 AI model in beta. In its blog post, xAI said that Grok 2.0 scored 87.5% on the MMLU benchmark with 0-shot CoT, which really surprised me. This puts the model in the territory of GPT-4o, which scored 88.7% on the same MMLU benchmark.

I was curious to test Grok 2.0 and see whether it passes the "vibe" test on common-sense questions. Thankfully, xAI has added Grok 2.0 (Beta) to x.com, allowing X Premium users to evaluate the model.

I started by throwing some tricky reasoning questions at the model, the kind that challenge even the best large language models (LLMs). When asked whether drying 20 towels under the sun would take longer than drying 15 towels, Grok 2.0 replied that it would take the same amount of time, which is correct: the towels dry in parallel, so the count does not matter as long as there is room to spread them all out. In my tests, I have seen many models, including the latest Llama 3.1 405B, fail this basic question.

It then correctly answered that 9.9 is greater than 9.11, a simple test that has baffled many SOTA models. After that, I asked Grok 2.0 how many 'R's are in the word "Strawberry", and it said three, which again is the correct answer. It even correctly spelled "strawberry" backwards: "yrrebwarts".
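
If you want to double-check the expected answers to these three prompts yourself, a few lines of Python will do it. This is only a minimal sketch for verifying the ground truth, not part of how Grok 2.0 was tested; the helper names (count_letter, reverse_word, is_greater) are made up for illustration.

```python
from decimal import Decimal


def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter in a word, ignoring case."""
    return word.lower().count(letter.lower())


def reverse_word(word: str) -> str:
    """Return the word spelled backwards."""
    return word[::-1]


def is_greater(a: str, b: str) -> bool:
    """Compare two decimal numbers given as strings, exactly (no float rounding)."""
    return Decimal(a) > Decimal(b)


# Hypothetical helpers applied to the three prompts from the article.
print(count_letter("Strawberry", "r"))  # 3
print(reverse_word("strawberry"))       # yrrebwarts
print(is_greater("9.9", "9.11"))        # True
```

Comparing with Decimal rather than splitting on the dot matters here: treated as version numbers, 9.11 would come after 9.9, but as decimals 9.9 (that is, 9.90) is the larger value.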