As the excitement about the immense potential of large language models (LLMs) dies down, now comes the hard work of ironing out the things they don’t do well.
Hallucination is the most obvious example, but at least output that is wildly fictitious stands out as wrong. It’s the lesser mistakes – factual inaccuracies, bias, misleading references – that are more of a problem, because they go unnoticed.
It’s become a big enough issue that a paper by the Oxford Internet Institute argued last year that the technology is so inclined to sloppy output that it poses a risk to science, education, and perhaps democracy itself.
The digital era struggles with factual accuracy across many spheres, and LLMs struggle with facts in particular. This isn’t primarily the fault of the models themselves: if the data used to train an LLM is inaccurate, its output will be too.
Now a team of researchers from IBM, MIT, Boston University, and Monash University in Indonesia has suggested techniques they believe could address the shortcomings in the way LLMs are trained. The paper’s abstract sums up the problem:
“Language models appear knowledgeable, but all they produce are predictions of words and phrases — an appearance of knowledge that doesn’t reflect a coherent grasp on the world. They don’t possess knowledge in the way that a person does.”
One solution is to deploy retrieval-augmented generation (RAG), which improves LLMs by feeding them high-quality specialist data.
The catch is that this requires a lot of computational resources and human labor, which renders the technique impractical for general LLMs.
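For readers unfamiliar with the technique, the sketch below shows the basic RAG loop in Python: retrieve the passages most relevant to a query, prepend them to the prompt, and have the model answer from that context. The document store, the crude word-overlap scoring, and the call_llm() stub are placeholders for illustration, not any particular vendor’s API.

```python
# Minimal RAG sketch. Everything here is illustrative: a toy document
# store, a crude relevance score, and a stubbed-out model call.

DOCUMENTS = [
    "The BRCA1 gene is located on chromosome 17.",
    "Retrieval-augmented generation grounds answers in retrieved text.",
    "LLMs predict the next token based on patterns in training data.",
]

def score(query: str, doc: str) -> int:
    """Crude relevance score: count of shared lowercase words."""
    return len(set(query.lower().split()) & set(doc.lower().split()))

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents with the highest overlap score."""
    return sorted(DOCUMENTS, key=lambda d: score(query, d), reverse=True)[:k]

def call_llm(prompt: str) -> str:
    """Placeholder for a real model call (e.g. a hosted LLM endpoint)."""
    return f"[model answer grounded in a prompt of {len(prompt)} characters]"

def answer(query: str) -> str:
    # Prepend retrieved context so the model answers from it, not from memory.
    context = "\n".join(retrieve(query))
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
    return call_llm(prompt)

print(answer("Which chromosome carries the BRCA1 gene?"))
```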
Marking its own homework
The team’s alternative is something called deductive closure training (DCT), whereby the LLM assesses the accuracy of its own output.
In unsupervised mode, the LLM is given “seed” statements from which it generates a cloud of inferred statements, some true and some false. The model then analyses the probability that each of these statements is true by building a graph of their consistency with one another. When supervised by humans, the model can also be seeded with statements known to be true.
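As a rough illustration of that consistency step, the Python sketch below scores a handful of generated statements and keeps the most probable subset that contains no contradictions. The statements, probabilities, and contradiction list are invented for the example; this is not the researchers’ implementation.

```python
# Loose sketch of the idea described above: score generated statements,
# then keep the most probable mutually consistent subset. All data here
# is a toy placeholder, not the paper's actual procedure.
from itertools import combinations

statements = {
    "A": ("The seed statement.", 0.95),
    "B": ("A statement implied by the seed.", 0.80),
    "C": ("A statement that contradicts B.", 0.40),
}
contradictions = {("B", "C")}  # pairs that cannot both be true

def consistent(subset: tuple[str, ...]) -> bool:
    """A subset is consistent if it contains no contradictory pair."""
    return not any({x, y} <= set(subset) for x, y in contradictions)

def best_consistent_subset() -> tuple[str, ...]:
    """Brute-force search for the consistent subset with the highest joint score."""
    best, best_score = (), 0.0
    ids = list(statements)
    for r in range(1, len(ids) + 1):
        for subset in combinations(ids, r):
            if not consistent(subset):
                continue
            score = 1.0
            for s in ids:
                p = statements[s][1]
                # Selected statements count as true, the rest as false.
                score *= p if s in subset else 1 - p
            if score > best_score:
                best, best_score = subset, score
    return best

print(best_consistent_subset())  # the statements to keep, e.g. for fine-tuning
```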
“Supervised DCT improves LM fact verification and text generation accuracy by 3-26%; on CREAK, fully unsupervised DCT improves verification accuracy by 12%,” reported the team’s research paper (PDF).
Meanwhile, a second team has suggested a way to refine this further using a technique called self-specialization, essentially a way of turning a generalist model into a specialist one by ingesting material from specific areas of knowledge.
“They could give the model a genetics dataset and ask the model to generate a report on the gene variants and mutations it contains,” IBM explained. “With a small number of these seeds planted, the model begins generating new instructions and responses, calling on the latent expertise in its training data and using RAG to pull facts from external databases when necessary to ensure accuracy.”
This might sound rather like a way of implementing RAG. The difference is that these specialist models are only called upon, via an API, when they are needed, the researchers said.
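A toy sketch of that routing idea: a query is sent to a specialist endpoint only when it matches that specialist’s domain, and to the general model otherwise. The endpoint URLs and keyword matching below are assumptions made purely for illustration.

```python
# Illustrative routing sketch: call a specialist model only when needed.
# Endpoints and keyword-based routing are invented for this example.
SPECIALISTS = {
    "genetics": "https://example.com/api/genetics-model",
    "finance": "https://example.com/api/finance-model",
}
GENERAL = "https://example.com/api/general-model"

def pick_endpoint(query: str) -> str:
    """Route to a specialist if the query mentions its domain, else the generalist."""
    for domain, endpoint in SPECIALISTS.items():
        if domain in query.lower():
            return endpoint
    return GENERAL

print(pick_endpoint("Summarise the genetics results in this report"))
```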
Still bad at facts
According to Mark Stockley, who co-presents The AI Fix podcast with Graham Cluley, the underlying problem is that LLMs are widely misunderstood. They are good at specific tasks but are not, nor were they ever intended to be, straightforward fact- or truth-checking engines.
“The IBM research doesn’t seem to address the root cause of why LLMs are bad at facts, but it suggests there is a useful but unspectacular modification that might make them less bad at the things they’re currently bad at,” he said.
“You can look at that and say the route to a truly intelligent AI doesn’t go through LLMs and so improving them is a sideshow, or you can look at that and say LLMs are useful in their own right, and a more useful LLM is therefore a more useful tool, whether it’s en route to artificial general intelligence (AGI) or ultimately a cul-de-sac.”
What is not in doubt, however, is that LLMs need to evolve rapidly or face either becoming specialized, expensive tools for the few or glorified grammar checkers for everyone else.