1 Comment

It’s an interesting experiment, but LLMs without a dedicated math engine backing them (several companies have toyed with integrating Wolfram Alpha) will never consistently get math right, just given what the guts of an LLM actually are. When a model does get a problem right, it has usually either been tuned for it or has seen the exact problem before. If you re-run the experiment with different formulations of the problems, some of the errors probably won’t be very consistent.

The likely reason most of these models write code OK (though often not that well) is that random students and low-level programmers have spammed StackOverflow with every variant of every programming problem over the years.

That being said, as these multi-modal / “omni” models show, it isn’t crazy to integrate different modalities or engines behind these things, so it’ll likely come. At that point some of these math problems won’t be that interesting, because the handling is “trivial” in the sense that a dedicated engine is dealing with it.

Anyway, interesting coverage for sure, and great to see what happens with some example prompts/experiments. It’ll be hard to draw any consistent conclusions from math problems, though, even if OpenAI made a big deal out of simple math problems in their announcement (and even the presenter emphasized “simple”).
