An important reminder about AI chatbots: don't rely on them for factual information, because they are frequently unreliable.
A recent study has shed light on this issue while also indicating that Apple made a wise decision by collaborating with OpenAI’s ChatGPT for inquiries that Siri cannot address.
There are two major issues with using large language models (LLMs) such as ChatGPT, Gemini, and Grok as substitutes for traditional web searches:
- They are frequently inaccurate
- They often exhibit excessive confidence in their incorrect information
A study highlighted by the Columbia Journalism Review found that when users gave chatbots exact quotes from journalistic sources and asked for further details, the responses were incorrect most of the time.
The Tow Center for Digital Journalism conducted assessments of eight AI chatbots that claim to execute live web searches for accurate information:
- ChatGPT
- Perplexity
- Perplexity Pro
- DeepSeek
- Microsoft’s Copilot
- Grok-2
- Grok-3
- Gemini
The straightforward task assigned to the chatbots
The study challenged each system with a quote from an article and requested a simple task: to locate that article online and provide a link, along with the headline, original publisher, and publication date.
The authors ensured the task was feasible by selecting excerpts that could be easily found on Google, with the original source appearing within the first three results.
The chatbots were evaluated on accuracy: whether they were entirely correct, correct but missing some details, partially incorrect, completely erroneous, or unable to respond.
The researchers also assessed how confidently the chatbots presented their results. For instance, did they state their answers as facts or use phrases like “it seems” or acknowledge that they couldn’t find an exact match for the quote?
The outcomes were disappointing
Overall, most of the chatbots were partially or totally incorrect more often than not.
On average, these AI systems provided correct answers less than 40% of the time. Perplexity led the group with a 63% accuracy rate, while Grok-3 from X scored only 6%.
Some other notable findings included:
- Chatbots generally struggled to refuse questions they could not answer accurately and often provided incorrect or speculative responses instead.
- Premium chatbots tended to give more confidently incorrect information compared to their free alternatives.
- Several chatbots appeared to violate Robot Exclusion Protocol preferences.
- Generative search tools resorted to fabricating links and citing syndicated or copied versions of articles.
- Having content licensing agreements with news sources did not ensure accurate citations in chatbot responses.
However, Apple made a wise choice
While Perplexity performed best, it appears to gain that advantage through unethical practices. Web publishers can use a robots.txt file to tell AI chatbots whether they are allowed to access their sites. National Geographic, for example, instructs them not to crawl its site, yet the report found that Perplexity correctly located all 10 quotes from its paywalled articles despite having no licensing agreement with the publisher.
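The Robot Exclusion Protocol is just a plain-text robots.txt file served by the publisher, and honoring it is entirely voluntary on the crawler's side — which is why a bot can simply ignore it. A minimal sketch using Python's standard-library parser shows how a compliant crawler would check permission before fetching a page (the file contents, bot name, and URLs below are hypothetical examples, not National Geographic's actual rules):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: blocks one AI crawler, allows everyone else.
robots_txt = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# A compliant crawler checks before fetching; ignoring this check is
# exactly the behavior the study criticizes.
print(rp.can_fetch("GPTBot", "https://example.com/article"))       # → False
print(rp.can_fetch("SomeBrowser", "https://example.com/article"))  # → True
```

Nothing enforces the `False` result: the file is a request, not an access control, which is why the study could only say chatbots "appeared to violate" these preferences rather than that they circumvented a technical barrier.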
Among the other chatbots, ChatGPT provided the most favorable outcomes – or, more accurately, the least unfavorable ones.
Nonetheless, the study clearly underscores what was already well established: chatbots are useful for brainstorming and generating ideas, but they should never be relied upon for accurate factual information.