The Wayback Machine’s snapshots of news homepages plummet after a “breakdown” in archiving projects

On September 7, Russia carried out a massive drone attack on Ukraine’s capital, Kyiv, killing four people and injuring 40. The Associated Press reported that it was the largest aerial attack since the war between the two countries began in 2022.
The Kyiv Post, one of Ukraine’s leading English-language news outlets, covered the story, but no public record of its homepage exists in the Internet Archive’s Wayback Machine for that day. A homepage snapshot — a viewable version of a page captured by the Wayback Machine’s crawlers — does not appear until September 8, more than 24 hours after the attack.
In the first five months of 2025, the Wayback Machine captured snapshots of the Kyiv Post an average of 85 times per day. Between May 17 and October 1, though, the daily average dropped to one. For 52 days between May and October, the Wayback Machine shows no snapshots of the Kyiv Post at all.

The Wayback Machine, an initiative of the nonprofit Internet Archive, has been archiving the webpages of news outlets — alongside millions of other websites — for nearly three decades. Earlier this month, it announced that it will soon archive its trillionth web page.

The Internet Archive has long stressed the importance of archiving homepages, particularly to fact-check politicians’ claims. In 2018, for instance, when Donald Trump accused Google of failing to promote his State of the Union address on its homepage, Google used the Wayback Machine’s archive of its homepage to disprove the statement.
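Readers can check a specific day themselves: the Internet Archive exposes a public “availability” endpoint that returns the capture closest to a given date. Below is a minimal Python sketch; the domain and date mirror the Kyiv Post example above, and the helper name closest_snapshot is ours.

```python
# Query the Wayback Machine's public availability API for the snapshot
# closest to a given date. Minimal sketch; error handling is omitted.
import json
import urllib.request

def closest_snapshot(url: str, timestamp: str) -> dict | None:
    """Return metadata for the capture nearest to `timestamp` (YYYYMMDD), if any."""
    api = f"https://archive.org/wayback/available?url={url}&timestamp={timestamp}"
    with urllib.request.urlopen(api) as resp:
        data = json.load(resp)
    return data.get("archived_snapshots", {}).get("closest")

snap = closest_snapshot("kyivpost.com", "20250907")
if snap:
    # For a gap like the one described above, the nearest capture
    # will carry a timestamp from a different day.
    print(snap["timestamp"], snap["url"])
else:
    print("no snapshot found")
```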
“[Google’s] job isn’t to make copies of the homepage every 10 minutes,” Mark Graham, the director of the Wayback Machine, said at the time. “Ours is.”
But a Nieman Lab analysis shows that the Wayback Machine’s snapshots of news outlets’ homepages have plummeted in recent months. Between January 1 and May 15, 2025, the Wayback Machine shows a total of 1.2 million snapshots collected from 100 major news sites’ homepages. Between May 17 and October 1, 2025, it shows 148,628 snapshots from those same 100 sites — a decline of 87%. (You can see our data here.)
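Counts like these can be approximated with the Wayback Machine’s public CDX API, which lists every capture of a URL within a date range. The Python sketch below shows one way to tally captures for a single homepage; it is an illustration of the general approach, not Nieman Lab’s actual methodology.

```python
# Count Wayback Machine captures of a homepage over a date range using
# the public CDX API. Very large result sets may require the API's
# pagination, which this sketch skips.
import json
import urllib.parse
import urllib.request

CDX = "https://web.archive.org/cdx/search/cdx"

def count_captures(url: str, start: str, end: str) -> int:
    params = urllib.parse.urlencode({
        "url": url,         # homepage to look up, e.g. "kyivpost.com"
        "from": start,      # YYYYMMDD
        "to": end,
        "output": "json",
        "fl": "timestamp",  # timestamps alone are enough to count captures
    })
    with urllib.request.urlopen(f"{CDX}?{params}") as resp:
        body = resp.read().decode()
    rows = json.loads(body) if body.strip() else []
    return max(len(rows) - 1, 0)  # first row is the column header

before = count_captures("kyivpost.com", "20250101", "20250515")
after = count_captures("kyivpost.com", "20250517", "20251001")
print(f"Jan 1-May 15: {before} captures; May 17-Oct 1: {after}")
```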
While our analysis focused on news sites, they’re not the only URLs impacted. We documented a similarly large decrease in the number of snapshots available of federal government website homepages after May 16, during a period when the Trump administration has taken down pages on government sites and made changes, often without disclosure, a practice known as “stealth editing.”
When we contacted Graham for this story, he confirmed there had been “a breakdown in some specific archiving projects in May that caused less archives to be created for some sites.” He did not answer our questions about which projects were impacted, saying only that they included “some news sites.”
Graham confirmed that the number of homepage archives is indicative of the amount of archiving happening across a website. He also said, though, that homepage crawling is just one of several processes the Internet Archive runs to find and save individual pages, and that “other processes that archive individual pages from those sites, including various news sites, [were] not affected by this breakdown.”
After the Wayback Machine crawls websites, it builds indexes that structure and organize the material it’s collected. Graham said some of the missing snapshots we identified will become available once the relevant indexes are built.
“Some material we had archived post-May 16th of this year is not yet available via the Wayback Machine as their corresponding indexes have not yet been built,” he said.
Under normal circumstances, building these indexes can cause a delay of a few hours or a few days before the snapshots appear in the Wayback Machine. The delay we documented is more than five months long. Graham said there are “various operational reasons” for this delay, namely “resource allocation,” but otherwise declined to specify.
According to Graham, the “breakdown” in archiving projects has been fixed and the number of snapshots will soon return to its pre-May 16 levels. He did not share any more specifics on the timeframe. But when we re-analyzed our sample set on October 19, we found that the total number of snapshots for our testing period had actually declined since we first conducted the analysis on October 7.
Frequent snapshots matter most when news breaks, said Ian Milligan, a history professor at the University of Waterloo who studies web archives. “It’s less [about] having that one daily snapshot and more about having archives that are responsive so that if something goes on in Minnesota, there’s an ability to turn up the dial and get it more often,” Milligan said.
Archiving homepages isn’t just important for historical records, though. Homepages are also one of the central ways the Wayback Machine finds individual pages to save.
“Your entry point to a crawl is generally the homepage, because the homepage gives you the map to the structure of what is underlying that page,” said Matthew Weber, a communications professor at Rutgers University who researches local news ecosystems using the Internet Archive, adding, “Crawling that initial page is critical to being able to archive and store the article page.”
Crawlers that are set to regularly archive news publications often treat homepages as “seed URLs.” The crawler gets to individual articles by “hopping” to links found on that homepage, such as promoted stories.
“It tends to be the case that a homepage will be the seed and then the way that the crawler gets to the individual articles is by finding links to them off that page,” said Trevor Owens, the former director of digital services at the Library of Congress. “For any given crawl, the person running it will set multiple seeds, scopes, and configure how many hops it should make.”
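To make the seed-and-hop pattern Owens describes concrete, here is a toy Python crawler that starts at a homepage seed, gathers its links, and follows them a fixed number of hops. Real archival crawlers (the Internet Archive’s Heritrix, for instance) add scoping rules, politeness delays, deduplication, and WARC output, none of which is modeled here.

```python
# Toy illustration of a seed-and-hop crawl: hop 0 is the homepage seed,
# hop 1 is every page it links to, and so on. Not a production crawler.
from html.parser import HTMLParser
from urllib.parse import urljoin
import urllib.request

class LinkParser(HTMLParser):
    """Collect absolute URLs from every <a href=...> on a page."""
    def __init__(self, base: str):
        super().__init__()
        self.base = base
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(urljoin(self.base, href))

def crawl(seed: str, max_hops: int = 1) -> set[str]:
    """Breadth-first crawl from `seed`, following links for `max_hops` rounds."""
    seen, frontier = {seed}, [seed]
    for _ in range(max_hops):
        next_frontier = []
        for url in frontier:
            try:
                with urllib.request.urlopen(url, timeout=10) as resp:
                    html = resp.read().decode("utf-8", errors="replace")
            except OSError:
                continue  # skip unreachable pages
            parser = LinkParser(url)
            parser.feed(html)
            for link in parser.links:
                if link.startswith("http") and link not in seen:
                    seen.add(link)  # an archiver would save this page here
                    next_frontier.append(link)
        frontier = next_frontier
    return seen

# One hop from a homepage reaches its promoted stories, as described above.
pages = crawl("https://example.com/", max_hops=1)
print(len(pages), "pages discovered")
```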
Weber said he would expect a drop in the homepage crawls to result in decreased article page crawls, too. While we did look into the frequency of article page archiving, we weren’t able to conduct a systematic analysis across our entire sample set. Graham told us that individual page archiving processes were not affected by the May breakdown.
None of the experts we consulted for this story had noticed the dip in news homepage snapshots until we brought it to their attention. While they were surprised by our findings, they also highlighted a larger issue in the United States: There’s no real mandate for the internet to be preserved at all.
Outside the Internet Archive, the Library of Congress operates perhaps the second-largest web archiving initiative in the U.S., but on a significantly smaller scale. According to Owens, the project currently has “on the order of 20 billion archived resource files,” while the Wayback Machine archives more than 500 million URLs per day.
In France, by contrast, the National Library is required by law to preserve and make accessible all French websites and other digital works. That approach works for .fr domains but would be much harder in the United States, Weber said. “How do you preserve the entirety of the .com?”
“It’s challenging that in the United States, we are reliant on a nonprofit organization that was started in the late 1990s with a lot of different individuals literally donating data to an entity that wanted to collect and aggregate this information,” Weber said. “We’re very grateful that the Internet Archive has built a systematic program out of that. But there is always the risk that threats and challenges cause that program to change in some ways. And it seems like that may be going on right now, which is concerning.”