‘Can any model do [insert task]?’ is a much harder question than it seems. I’m going to give you five vivid categories, each with unambiguous examples and drawing on six new papers, showing the kind of detail that is so often lost in 2024 debates on AI.
Link for Offline Watching and Download: https://drive.google.com/file/d/1ep3Asw6_1LZRJoCKU1VGSaUcYMP6djBq/view?usp=sharing
Anthropic What LLMs Can’t Do: https://www.anthropic.com/news/a-new-initiative-for-developing-third-party-model-evaluations?s=09
Fool Your (Vision and) Language Model with Embarrassingly Simple Permutations: https://arxiv.org/pdf/2310.01651
LLMs are not robust selectors: https://openreview.net/pdf?id=shr9PXz7T0
Eliminating Position Bias of Language Models: A Mechanistic Approach: https://arxiv.org/pdf/2407.01100
DROP Benchmark: https://arxiv.org/pdf/1903.00161v2
Investigating the Robustness of LLMs on Math Word Problems: https://arxiv.org/pdf/2406.15444
AI Explained Hat-tip: https://x.com/vipul_1011/status/1808243451852644794
Changing Answer Order Can Decrease MMLU Accuracy: https://arxiv.org/pdf/2406.19470
AlphaCode 2: https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf
AI Agents That Matter: https://arxiv.org/pdf/2407.01502
Functional benchmarks: https://arxiv.org/pdf/2402.19450v1
LLMs Cannot Plan: https://arxiv.org/pdf/2402.01817
MysteryWorld: https://arxiv.org/pdf/2302.06706