A growing number of experts have called for these tests to be ditched, saying they boost AI hype and create “the illusion that [AI language models] have greater capabilities than what truly exists.” Read the full story here.
What stood out to me in Will’s story is that we know remarkably little about how AI language models work and why they generate the things they do. With these tests, we’re trying to measure and glorify their “intelligence” based on their outputs, without fully understanding how they function under the hood.
Our tendency to anthropomorphize makes this messy: “People have been giving human intelligence tests—IQ tests and so on—to machines since the very beginning of AI,” says Melanie Mitchell, an artificial-intelligence researcher at the Santa Fe Institute in New Mexico. “The issue throughout has been what it means when you test a machine like this. It doesn’t mean the same thing that it means for a human.”
Kids vs. GPT-3: Researchers at the University of California, Los Angeles, gave GPT-3 a story about a magical genie transferring jewels between two bottles and then asked it how to transfer gumballs from one bowl to another, using objects such as a posterboard and a cardboard tube. The idea was that the story would hint at ways to solve the problem. GPT-3 proposed elaborate but mechanically nonsensical solutions. “This is the sort of thing that children can easily solve,” says Taylor Webb, one of the researchers.
AI language models are not humans: “With large language models producing text that seems so human-like, it is tempting to assume that human psychology tests will be useful for evaluating them. But that’s not true: human psychology tests rely on many assumptions that may not hold for large language models,” says Laura Weidinger, a senior research scientist at Google DeepMind.
Lessons from the animal kingdom: Lucy Cheke, a psychologist at the University of Cambridge, UK, suggests AI researchers could adapt techniques used to study animal cognition, which were developed to avoid jumping to conclusions based on human bias.
Nobody knows how language models work: “I think that the fundamental problem is that we keep focusing on test results rather than how you pass the tests,” says Tomer Ullman, a cognitive scientist at Harvard University.