Finding simple questions that LLMs get wrong was getting really hard as of 2025!

One notable example that <ChatGPT 4 Turbo> got wrong is perhaps:
> Write a sentence with 20 words.
and it gets the number of words wrong.
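
Checking such an answer by hand is tedious, so a small script helps. The following is a minimal sketch, assuming that a "word" is any whitespace-separated token (so punctuation stays attached to its word and hyphenated words count as one). The function names `count_words` and `has_word_count` are hypothetical helpers for illustration and do not come from any LLM tooling:

```python
def count_words(sentence: str) -> int:
    """Count whitespace-separated tokens.

    Assumption: this is a deliberately simplistic definition of "word".
    """
    return len(sentence.split())


def has_word_count(sentence: str, expected: int = 20) -> bool:
    """Return True if the sentence has exactly `expected` words."""
    return count_words(sentence) == expected


if __name__ == "__main__":
    # Paste the model's reply here to check it against the prompt.
    reply = "This is a short reply that clearly does not have twenty words."
    print(count_words(reply), has_word_count(reply))
```

A stricter definition of "word" (for example, splitting on hyphens) could change the count for borderline replies, so it is worth stating the definition in the prompt as well.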

Bibliography:
* https://www.reddit.com/r/LocalLLaMA/comments/1bvx6cc/the_prompt_that_every_llm_gets_wrong/
* https://www.reddit.com/r/LocalLLaMA/comments/13zz8y5/what_questions_do_you_ask_llms_to_check_their/
* https://www.reddit.com/r/MachineLearning/comments/18jjobx/questions_that_llm_can_not_answerd/

