The good, the bad, and the completely made-up: Newsrooms on wrestling accurate answers out of AI
Erlend Ofte Arntsen has filed more Freedom of Information Act requests than he can count — triple digits by one tally, quadruple digits when you include follow-ups and related requests.
Now, a new newsroom assistant at one of Norway’s largest newspapers is transforming Arntsen’s workflow, saving time that could be better spent on shoe-leather reporting than on arguing in legalese with government bureaucrats.
That assistant is called FOIA Bot and is powered by generative AI. When the government pushes back on a request or rejects it outright, the bot drafts a competent rejoinder, drawing on the full text of Norway’s FOIA law and 75 templates of similar responses from the Norwegian Press Association.
“It’s something I would have had to use a half a day [for] when I’m back in my investigative unit, where I have time to think those long thoughts,” Arntsen, who works at Verdens Gang, told Nieman Lab. “I was able to get this done on a night shift working breaking news, because I used that bot.”
FOIA Bot is part of an emerging tech stack of newsroom tools that leverage a specialized AI architecture called retrieval-augmented generation, or RAG. (Apparently, no one ever asked a chatbot to use its creative writing powers to come up with a catchier name.) It’s the same method that powers search bots like The Financial Times’ Ask FT, which draws on FT content to answer reader queries and has been used by 35,000 readers since its formal launch this April.
RAG’s jargon-filled moniker belies a fairly simple approach — one that boosts reliability, key for journalists who find themselves in the reliability business. The model doesn’t create an answer from the vast expanses of Amazon reviews, medieval literature, and Reddit comments that general-purpose chatbots are typically trained on. Instead, a RAG-powered model retrieves information from a journalist-defined database, then uses that to augment what it generates with attributions to boot. The database can be a newsroom’s archives of fact-checked articles, a book of case law, or even a single PDF.
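Stripped to its essentials, that retrieve-then-augment step is only a few lines of code. The sketch below is a minimal illustration in Python: the keyword-overlap scorer stands in for a real embedding model and vector database, and the final model call (`call_llm`) is a hypothetical placeholder, not any newsroom’s actual setup.

```python
# A toy retrieve-then-augment loop. The keyword scorer is a deliberately naive
# stand-in for an embedding model plus vector database; this is an
# illustration, not any newsroom's actual implementation.

def relevance(question: str, chunk: str) -> int:
    """Crude relevance score: count question words that appear in the chunk."""
    question_words = set(question.lower().split())
    return sum(1 for word in chunk.lower().split() if word in question_words)

def build_grounded_prompt(question: str, archive: list[tuple[str, str]], top_k: int = 3) -> str:
    """Pick the most relevant (source_id, text) chunks and wrap them in a prompt that demands citations."""
    ranked = sorted(archive, key=lambda item: relevance(question, item[1]), reverse=True)[:top_k]
    sources = "\n\n".join(f"[{i + 1}] ({src})\n{text}" for i, (src, text) in enumerate(ranked))
    return (
        "Answer the question using ONLY the numbered sources below, citing them like [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"{sources}\n\nQuestion: {question}"
    )

# Usage: feed the prompt to any chat model (call_llm is hypothetical).
# archive = [("article-2023-041", "The hospital logged 312 unfilled night shifts in March..."), ...]
# answer = call_llm(build_grounded_prompt("How many night shifts went unfilled?", archive))
```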
“If I was just to use, for example, ChatGPT, I would struggle because it hallucinates sources,” said Lars Adrian Giske, head of AI at iTromsø, an AI-forward newspaper in Norway. “Sure, it can give you an actual source like, ‘Check page 14, paragraph three on this page.’ But it can also hallucinate that, and then it’s really hard for me to go from the chat, look up the actual documentation, find the paragraph that it’s edited and figure out how it used that information. So you need systems that can do that in a way more secure way.”
Even with a more trustworthy AI workflow, hesitations abound. For many, AI and journalism remain an unholy marriage. Can a machine really atomize the entire journalistic process down into database-friendly chunks and vectors? What gets lost in the process of summarization? Are publishers mounting an unwinnable battle for attention against a new crop of Big Tech giants? And what if the genie is already out of the bottle?
“News media is about to change,” Giske said. “The article as we know it may not be the preferred format of readers or listeners or viewers in the years to come. People are getting used to generative ecosystems, and that won’t change.”
How RAG is showing up in newsrooms
A good RAG-based system is only as good as its database.
At iTromsø, Giske’s team used the method for an investigation into understaffing at a local hospital. FOIA requests returned thousands of pages of dense documents, so they broke them down into chunks before converting them into vectors, or numerical representations. If a RAG-powered system is like an open-book exam, these chunks are the highlighted excerpts in the textbook provided to the model to write its essay.
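As a rough illustration of that chunking step (the window sizes below are arbitrary examples, and the `embed` call in the comments is a placeholder for a real embedding model, not iTromsø’s actual pipeline):

```python
# Illustrative sketch of chunking: long FOIA documents are split into
# overlapping windows small enough to embed and retrieve one by one.
# The sizes are arbitrary examples, not iTromsø's actual parameters.

def chunk_document(text: str, chunk_size: int = 800, overlap: int = 100) -> list[str]:
    """Split a document into overlapping character windows."""
    chunks = []
    start = 0
    step = chunk_size - overlap  # overlap keeps sentences from being severed at chunk edges
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += step
    return chunks

# Each chunk is then converted into a vector and stored with a pointer back to
# its source, so the system can cite the exact page it drew from, e.g.:
# for page_number, page_text in enumerate(foia_pages):
#     for chunk in chunk_document(page_text):
#         index.add(vector=embed(chunk), metadata={"page": page_number, "text": chunk})
```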
The journalists asked the RAG system to surface the most newsworthy elements in the documents. What it returned — after plenty of tweaking to teach the system what they meant by newsworthiness — helped earn the team a Data-SKUP award, one of Norway’s most prestigious journalism honors.
“We used the RAG to do what we call smelling the data, and then we narrow it down as we go,” Giske said. “This led to uncovering something that was hidden within all this documentation: A doctor from Denmark, who was working remotely, spent four seconds reviewing X-ray images.”
Giske said the project would have easily taken at least three months of manual research time.
“These are approaches that help you get an overview of very large datasets,” he said. “There’s a ton of knowledge out there waiting to be discovered in open public data. But it’s really hard for one journalist or a team of journalists to go through all that data manually… I feel like the RAG-supported investigative journalism is just an extension of data journalism. It’s a natural evolution.”
In nearby Finland, data scientist Vertti Luostarinen used the same process to build an Olympics Bot for public broadcaster Yle. If you ask ChatGPT 4o to name the top ten greatest Finnish wrestlers, it may very well try to convince you that pentathlete Eero Lehtonen belongs on the list. (Now, that’s nine more Finnish wrestlers than most people can name, but a glaring factual inaccuracy like that does not a helpful chatbot make.)
During the 2024 Olympics, Yle’s crackerjack team of sports commentators, who were churning out something like 200 articles a day, constantly needed access to stats such as these. Luostarinen fed the Olympics Bot sports history, bios of athletes on Finland’s national team, the rules of every sport, tabular data about schedules, and articles from Yle’s live coverage news feed.
“I was expecting a lot more hallucinations — that’s the main thing that people are usually scared of with these models,” Luostarinen said. “There were a lot less hallucinations than I thought.”
Instead, the bot’s primary drawback was poor Finnish skills (perkele!). Sometimes it spelled athletes’ names wrong because it picked up a differently spelled variant from the user’s question. Sometimes it retrieved the correct information but refused to answer because of its language limitations.
Ultimately, Luostarinen came to a similar conclusion as Giske: RAGs have great potential when it comes to filtering and surfacing information from immense piles of data. It’s the act of summarizing that gives him pause.
They tend to “summarize information even when you specifically ask them not to,” he said. “It is nice when you need that kind of overview and summary, but in journalistic work you’re often interested in the details. I’m a bit worried what will happen to the way we as a society search for information if it’s always going through this kind of system that makes it more generic and loses specific details.”
JournalistGPT
Summarization is, in fact, the very application newsrooms are embracing the fastest. In addition to The Financial Times, The Washington Post unveiled “Ask the Post AI” last November, and The San Francisco Chronicle rolled out “the Kamala Harris News Assistant,” which pulled from nearly three decades of California political coverage to answer questions about the then-presidential candidate.
In a 2025 Reuters Institute survey, more than half of the 326 respondents said “they would be looking into AI chatbots and search interfaces” in the year ahead.
Deutsche Presse-Agentur (DPA), Germany’s largest wire agency, has taken all of its content from 2018 onward as well as its current newsfeed and built a real-time database that users and staffers alike can query. As the bot generates its summary, each answer comes with a little green number that links to the corresponding DPA article.
Inside the DPA newsroom, journalists are also using the new tool as a timesaver, with permission from higher-ups to include AI-generated copy in their stories, provided they first verify the information. DPA is even contemplating integrating the RAG-based tool directly into their content management system.
Because it is programmed to cite sources and include quotes, the system “has proven for us to be more robust against hallucinations,” said AI team lead Yannick Franke. And every piece of published copy still goes through the fact-checking process, so there’s an extra guardrail against inaccuracy.
“Every error is a catastrophe for news and for an agency in particular,” Astrid Maier, DPA’s deputy editor-in-chief, said. “But let’s be honest, people make mistakes too. In the end, you as a writer and then the editors are responsible for what’s in there. The human’s responsibility cannot change or be delegated to the AI.”
The greater risk, Maier thinks, is that DPA will lose its standing as a verification authority in Germany as media habits and the information ecosystem shift.
“We have to be capable of using these tools for our benefit,” she added. “If we sit on the sideline and observe, I think the risk is too high that we are gonna be left behind. It’s better for us to be able to master this technology for our own and our customer’s good and to be able to fulfill our mission and vision in the next ten or hopefully 75 years.”
The FT sees it similarly, its marketing team explained to me. They identify three ways enterprise customers, who have access to the FT’s search-bot, consume the news: deep research mode, monitoring mode, and habitual mode. AI summaries address the first by providing comprehensive answers on near-endless topics in seconds, but they don’t serve as a wholesale replacement for editorial curation or the act of scrolling an app.
Not everyone is convinced.
“There’s multiple ways of using RAGs,” said Robin Berjon, a technologist and The New York Times’ former vice president of data governance. “If the LLM fetches a RAG that has reliable information, but then munches it and summarizes it back, then I wouldn’t trust that unless it quoted directly from the relevant documents. It is likely to introduce errors in the summarization.”
Room for improvement
Much of the newsroom discussion around RAGs centers on helpfulness. New research from Bloomberg spotlights the potential harmfulness of these systems.
Bloomberg’s Responsible AI team built a retrieval database out of nothing but Wikipedia articles — what they call a “pure vanilla RAG setup” — and asked 5,000 questions on topics like malware, disinformation, fraud, and illegal activity. The RAG-based models answered questions that non-RAG models almost always refused.
The key to ameliorating these risks is the same as in boosting reliability: evaluate systems continuously and build in appropriate guardrails.
“If you have a good understanding of how well it actually works, how often it hallucinates, how often it produces something that’s made up, how often it responds to unsafe queries — then you can make a much more informed decision whether this is something you want to roll out, or whether you need to add more components to your system to decrease those risks,” Sebastian Gehrmann, head of responsible AI at Bloomberg, said.
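One guardrail pattern implied here is to screen queries before they ever reach retrieval, and to log every refusal so the evaluation numbers stay honest. The sketch below uses a toy keyword block list purely for illustration; production systems rely on trained safety classifiers, and none of this reflects Bloomberg’s actual setup.

```python
# A toy query guardrail placed in front of a RAG pipeline. The hand-written
# block list is illustrative only; real deployments use trained safety
# classifiers and more nuanced policies.

BLOCKED_TOPICS = {"malware", "phishing", "fraud"}  # example categories, not a real policy

refusal_log: list[str] = []

def handle_query(question: str, run_rag_pipeline) -> str:
    """Refuse unsafe queries before retrieval; otherwise run the RAG pipeline."""
    if set(question.lower().split()) & BLOCKED_TOPICS:
        refusal_log.append(question)  # feeds the "how often does it refuse unsafe queries" metric
        return "I can't help with that."
    return run_rag_pipeline(question)
```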
DPA had its own journalists stress-test the search-bot before trying it out with customers. Apparently, male editors loved asking the machine to list off the coaches of a beloved German soccer team over a specific period of time, which helped them realize the system’s struggles with counting. They’re also working with the German Research Center for AI to create a scientific evaluation process and benchmarks.
The FT beta-tested its product in waves and incorporated feedback from customers. They waited until 80% of users deemed it useful before rolling it out to the 7,000 businesses, institutions, and universities that subscribe to FT Professional.
And at VG, the newspaper automated part of FOIA Bot’s evaluation, using a method known as LLM-as-judge. They took 43 sample bot-written FOIA complaints and had a reviewer from the Norwegian Press Association come up with a list of expectations that each complaint should hit. They then used AI to score the model’s performance, finding that 381 of 548 expectations were fulfilled.
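In general terms, an LLM-as-judge loop like that can be sketched as follows; the prompt wording and the `judge_llm` call are assumptions made for illustration, not VG’s actual code.

```python
# A generic LLM-as-judge tally, sketched for illustration (not VG's actual code).
# For each bot-written complaint, a separate judge model is asked whether each
# human-written expectation is satisfied; the yes-verdicts are counted.
# `judge_llm` is a placeholder for whatever model serves as the judge.

def evaluate(complaints: list[str], expectations: list[list[str]], judge_llm) -> tuple[int, int]:
    """Return (expectations fulfilled, expectations checked)."""
    fulfilled = 0
    total = 0
    for complaint, checklist in zip(complaints, expectations):
        for expectation in checklist:
            verdict = judge_llm(
                "You are grading a FOIA appeal letter.\n\n"
                f"Letter:\n{complaint}\n\n"
                f"Requirement: {expectation}\n"
                "Does the letter satisfy the requirement? Answer YES or NO."
            )
            total += 1
            fulfilled += verdict.strip().upper().startswith("YES")
    return fulfilled, total
```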
Even when a RAG-based tool clears internal standards or benchmarks, the tool can’t simply speak for itself. Readers need to understand how it works and how best to engage with it.
“News organizations are already spectacularly bad at conveying the level of confidence and the amount of work that went into establishing a piece. And then you slap an AI chatbot on top of that? It’s not gonna be great,” Berjon said. “It will require serious user experience work to make it clear to people what they can expect from this.”
The real challenge, Berjon said, is designing a news experience that doesn’t pass AI tools off as all-knowing or overly powerful. His advice: Skip the legal disclaimers and don’t over-rely on “this text was generated by a large language model” fine print.
“You have to make it part of the experience that the reliability is what it is,” Berjon said.
Josh Axelrod is the author of the Nature Briefing: AI and Robotics newsletter and a 2023–24 Fulbright journalism scholar based in Berlin. His reporting has appeared in outlets including Wired, NPR, Mother Jones, and The Boston Globe.