AI isn’t very good at history, finds a new paper


AI might excel at certain tasks such as coding or generating a podcast. But it struggles to pass a high-level history exam, a new paper has found.

A team of researchers created a new benchmark to test three top large language models (LLMs) – OpenAI’s GPT-4, Meta’s Llama and Google’s Gemini – on historical questions. The benchmark, Hist-LLM, tests the correctness of answers according to the Seshat Global History Databank, a vast database of historical knowledge named after the ancient Egyptian goddess of wisdom.

The results, which were presented last month at the high-profile AI conference NeurIPS, were disappointing, according to researchers affiliated with the Complexity Science Hub (CSH), a research institute based in Austria. The best-performing LLM was GPT-4 Turbo, but it only achieved about 46% accuracy – not much higher than random guessing.

“The takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task,” said Maria del Rio-Chanona, one of the paper’s co-authors and an associate professor of computer science at University College London.

The researchers shared samples of historical questions with TechCrunch that LLMs got wrong. For example, GPT-4 Turbo was asked if scale armor was present during a specific time period in ancient Egypt. The LLM said yes, but the technology only appeared in Egypt 1,500 years later.

Why are LLMs bad at answering technical history questions when they can be so good at answering very complicated questions about things like coding? Del Rio-Chanona told TechCrunch that it’s likely because LLMs tend to extrapolate from historical data that is very prominent, finding it difficult to retrieve more obscure historical knowledge.

For example, researchers asked GPT-4 if ancient Egypt had a professional army in a specific historical period. While the correct answer is no, the LLM incorrectly answered that it did. This is probably because there is a lot of public information about other ancient empires, such as Persia, that had standing armies.

“If you get told A and B 100 times, and C one time, and then are asked a question about C, you might just remember A and B and try to extrapolate from that,” said del Rio-Chanona.

The researchers also identified other trends, including that the OpenAI and Llama models performed worse for certain regions, such as sub-Saharan Africa, suggesting potential biases in their training data.

The results show that LLMs are still not a substitute for humans when it comes to certain domains, said Peter Turchin, who led the study and is a member of the faculty at CSH.

But researchers are still hoping that LLMs can help historians in the future. They are working to refine their benchmark by including more data from underrepresented regions and adding more complex questions.

“Overall, while our results highlight areas where LLMs need improvement, they also underscore the potential of these models to aid in historical research,” the paper reads.


