How OpenAI’s bot took down this seven-person company’s website ‘like a DDoS attack’


On Saturday, Triplegangers CEO Oleksandr Tomchuk was alerted that his company’s e-commerce site was down. It appeared to be some kind of distributed denial-of-service attack.

He soon discovered that the culprit was a bot from OpenAI that was relentlessly attempting to scrape his entire, enormous site.

“We have over 65,000 products, every product has a page,” Tomchuk told TechCrunch. “Each page has at least three pictures.”

OpenAI was sending “tens of thousands” of server requests trying to download all of it: hundreds of thousands of photos, along with their detailed descriptions.

“OpenAI used 600 IPs to scrape data, and we’re still analyzing the logs from last week, maybe it’s a lot more,” he said of the IP addresses the bot used to try to consume his site.

“Their crawlers crushed our site,” he said. “It was basically a DDoS attack.”

The Triplegangers website is their business. The seven-employee company has spent more than a decade assembling what it calls the largest database of “human digital doubles” on the web, meaning 3D image files scanned from real human models.

It sells 3D object files as well as photos, covering everything from hands to hair, skin, and full bodies, to 3D artists, video game makers, and anyone who needs to digitally recreate authentic human characteristics.

Tomchuk’s team, based in Ukraine but also licensed in the United States out of Tampa, Florida, has a terms of service page on its site that forbids bots from taking its images without permission. But that alone did nothing. Websites must use a properly configured robots.txt file with tags that specifically tell OpenAI’s bot, GPTBot, to leave the site alone. (OpenAI also has a couple of other bots, ChatGPT-User and OAI-SearchBot, which have their own tags, according to its information page on its crawlers.)

Robots.txt, otherwise known as the Robots Exclusion Protocol, was created to tell search engines what not to crawl as they index the web. OpenAI says on its information page that it honors such files when they are configured with its own set of do-not-crawl tags, although it also warns that its bots can take up to 24 hours to recognize an updated robots.txt file.
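To make the mechanics concrete, here is a minimal sketch of what such a file can look like and how a site owner might sanity-check it with Python's standard `urllib.robotparser` module. The three user-agent names are the ones mentioned above; the blanket `Disallow: /` policy, the `example.com` URL, and the helper names are assumptions made for illustration, not Triplegangers' actual configuration.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content. The bot names come from the article;
# blocking the whole site with "Disallow: /" is an assumed policy.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: OAI-SearchBot
Disallow: /

User-agent: *
Allow: /
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text in memory, without fetching anything."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, user_agent: str, url: str) -> bool:
    """Return True if the rules permit this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


parser = build_parser(ROBOTS_TXT)
page = "https://example.com/products/1"  # placeholder URL

for agent in ("GPTBot", "ChatGPT-User", "OAI-SearchBot", "Mozilla/5.0"):
    verdict = "allowed" if is_allowed(parser, agent, page) else "blocked"
    print(f"{agent}: {verdict}")
```

A check like this only confirms what the file says. Whether a given crawler actually obeys it is up to the crawler.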

As Tomchuk experienced, if a site isn’t properly using robots.txt, OpenAI and others take that to mean they can scrape to their hearts’ content. It is not an opt-in system.

To add insult to injury, not only was Triplegangers knocked offline by OpenAI’s bot during US business hours, but Tomchuk is expecting a hefty AWS bill thanks to all the CPU and download activity from the bot.

Robots.txt also isn’t a failsafe. AI companies comply with it voluntarily. Another AI startup, Perplexity, was pretty famously called out last summer by a Wired investigation when some evidence implied Perplexity wasn’t honoring it.

Each of these is a product, with a product page that includes several more photos. Used with permission. Image credits: Triplegangers

It cannot be known for certain what was taken

As of Wednesday, after days of OpenAI’s bot returning, Triplegangers had a properly configured robots.txt file in place, as well as a Cloudflare account set up to block GPTBot and several other bots it discovered, including Barkrowler (an SEO crawler) and Bytespider (TikTok’s crawler). Tomchuk is also hopeful he has blocked crawlers from other AI model companies. As of Thursday morning, the site had not crashed, he said.

But Tomchuk still has no reasonable way to find out exactly what OpenAI successfully took or to get that material removed. He has found no way to contact OpenAI and ask. OpenAI did not respond to TechCrunch’s request for comment. And OpenAI has so far failed to deliver its long-promised opt-out tool, as TechCrunch recently reported.

This is a particularly tricky problem for Triplegangers. “We’re in a business where rights are kind of a serious issue, because we’re scanning real people,” he said. With laws like Europe’s GDPR, “they can’t just take a picture of someone on the web and use it.”

The Triplegangers website was also an especially delicious find for AI crawlers. Multibillion-dollar startups, like Scale AI, have been built around humans painstakingly tagging images to train AI. The Triplegangers site contains photos tagged in detail: ethnicity, age, tattoos vs. scars, all body types, and so on.

The irony is that the OpenAI bot’s greediness is what alerted Triplegangers to how exposed it was. Had the bot scraped more gently, Tomchuk never would have known, he said.

“It’s scary because there seems to be a loophole that these companies are using to crawl data by saying ‘you can opt out if you update your robots.txt with our tags,’” says Tomchuk, but that puts the onus on the business owner to understand how to block them.

Triplegangers’ server logs showed how ruthlessly an OpenAI bot accessed the site, from hundreds of IP addresses. Used with permission.

He wants other small online businesses to know that the only way to discover whether an AI bot is taking a website’s copyrighted material is to actively look. He is certainly not alone in being terrorized by them. Owners of other websites recently told Business Insider how OpenAI bots crashed their sites and ran up their AWS bills.

The problem has grown in magnitude in 2024. New research from the digital advertising company DoubleVerify found that AI crawlers and scrapers caused an 86% increase in “general invalid traffic” in 2024 – that is, traffic that does not come from a real user.

However, “most sites remain unaware that they have been scraped by these bots,” warns Tomchuk. “Now we have to monitor daily log activity to see these bots.”
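For readers wondering what that kind of daily log check might involve, here is a rough sketch in Python. It assumes the server writes access logs in the common "combined" format, and it counts requests and distinct IP addresses per watched crawler by matching user-agent substrings. The bot names are those mentioned in this article; the sample log lines, IP addresses (taken from reserved documentation ranges), and function names are invented for illustration and are not Triplegangers' data.

```python
import re

# User-agent substrings to watch for (bot names taken from the article).
WATCHED_BOTS = ("GPTBot", "ChatGPT-User", "OAI-SearchBot",
                "Bytespider", "Barkrowler")

# Assumed "combined" access-log layout:
# ip ident user [time] "request" status size "referer" "user-agent"
LOG_LINE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"$'
)


def summarize_bot_traffic(lines):
    """Count requests and distinct IPs per watched bot."""
    stats = {}
    for line in lines:
        match = LOG_LINE.match(line.strip())
        if not match:
            continue  # skip lines that do not fit the assumed format
        agent = match.group("agent").lower()
        for bot in WATCHED_BOTS:
            if bot.lower() in agent:
                entry = stats.setdefault(bot, {"requests": 0, "ips": set()})
                entry["requests"] += 1
                entry["ips"].add(match.group("ip"))
                break
    return stats


# Made-up sample lines; real user-agent strings will differ.
SAMPLE_LOG = [
    '192.0.2.10 - - [01/Jan/2025:10:00:00 +0000] "GET /product/1 HTTP/1.1" '
    '200 5120 "-" "ExampleAgent/1.0 (compatible; GPTBot/1.0)"',
    '192.0.2.11 - - [01/Jan/2025:10:00:01 +0000] "GET /product/2 HTTP/1.1" '
    '200 4096 "-" "ExampleAgent/1.0 (compatible; GPTBot/1.0)"',
    '198.51.100.7 - - [01/Jan/2025:10:00:02 +0000] "GET /product/3 HTTP/1.1" '
    '200 2048 "-" "ExampleAgent/1.0 (compatible; Bytespider)"',
    '203.0.113.5 - - [01/Jan/2025:10:00:03 +0000] "GET / HTTP/1.1" '
    '200 1024 "-" "Mozilla/5.0 (X11; Linux x86_64)"',
]

stats = summarize_bot_traffic(SAMPLE_LOG)
for bot, entry in sorted(stats.items()):
    print(f"{bot}: {entry['requests']} requests from "
          f"{len(entry['ips'])} IPs")
```

User-agent strings can be spoofed, so a tally like this is a starting point for spotting heavy crawlers rather than proof of who sent the traffic.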

When you think about it, the whole model operates a bit like a mafia shakedown: AI bots will take what they want unless you have protection.

“You should be asking for permission, not just scraping data,” says Tomchuk.
