Here's an example I was given from the web:

动态网自由门 天安門 天安门 法輪功 李洪志 Free Tibet 六四天安門事件 The Tiananmen Square protests of 1989 天安門大屠殺 The Tiananmen Square Massacre 反右派鬥爭 The Anti-Rightist Struggle 大躍進政策 The Great Leap Forward 文化大革命 The Great Proletarian Cultural Revolution 人權 Human Rights 民運 Democratization 自由 Freedom 獨立 Independence 多黨制 Multi-party system 台灣 臺灣 Taiwan Formosa 中華民國 Republic of China 西藏 土伯特 唐古特 Tibet 達賴喇嘛 Dalai Lama 法輪功 Falun Dafa 新疆維吾爾自治區 The Xinjiang Uyghur Autonomous Region 諾貝爾和平獎 Nobel Peace Prize 劉暁波 Liu Xiaobo 民主 言論 思想 反共 反革命 抗議 運動 騷亂 暴亂 騷擾 擾亂 抗暴 平反 維權 示威游行 李洪志 法輪大法 大法弟子 強制斷種 強制堕胎 民族淨化 人體實驗 肅清 胡耀邦 趙紫陽 魏京生 王丹 還政於民 和平演變 激流中國 北京之春 大紀元時報 九評論共産黨 獨裁 專制 壓制 統一 監視 鎮壓 迫害 侵略 掠奪 破壞 拷問 屠殺 活摘器官 誘拐 買賣人口 遊進 走私 毒品 賣淫 春畫 賭博 六合彩 天安門 天安门 法輪功 李洪志 Winnie the Pooh 劉曉波动态网自由门

Next we have the Dissociated Press examples from https://en.wikipedia.org/wiki/Dissociated_press, where a script remixes existing text into interesting-looking replies.
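For anyone curious how such a script works, here is a minimal sketch of the word-level dissociated-press idea in Python. The n-gram length, output length, and the corpus.txt filename are placeholder choices of mine, not anything taken from the Wikipedia article or the original Emacs command:

```python
import random
from collections import defaultdict

def dissociate(corpus: str, ngram: int = 2, length: int = 60) -> str:
    """Word-level dissociated press: stitch together fragments of the
    corpus that share an ngram-word overlap, producing text that is
    locally plausible but globally nonsense."""
    words = corpus.split()
    # Map each ngram-word prefix to the words that can follow it.
    followers = defaultdict(list)
    for i in range(len(words) - ngram):
        key = tuple(words[i:i + ngram])
        followers[key].append(words[i + ngram])

    # Start at a random prefix and walk the table.
    key = random.choice(list(followers))
    output = list(key)
    for _ in range(length):
        choices = followers.get(key)
        if not choices:                       # dead end: jump somewhere new
            key = random.choice(list(followers))
            output.extend(key)
            continue
        output.append(random.choice(choices))
        key = tuple(output[-ngram:])
    return " ".join(output)

if __name__ == "__main__":
    sample = open("corpus.txt").read()        # any reasonably long plain-text corpus
    print(dissociate(sample))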

AI as it stands today is plagiarism on a grand scale.

So the rulers have said it's bad if we plagiarize, but good when they make billions by plagiarizing, or rather PILLAGING, the web.

I don't understand what goal you are trying to achieve.

Is your goal to open a dialog about the pros and cons of AI?

DaniWeb is powered by Cloudflare. One of the functions of Cloudflare is a sophisticated system to analyze and control how AI crawlers scan the website. In other words, if I want to dissuade AI bots from crawling DaniWeb, I would do so much more elegantly than by spamming the forums.

AI as it stands today is plagiarism on a grand scale.

I would have to agree with this. However, as a business model, the web has been set up so as to encourage (aka coerce) publishers to allow the unfettered crawling and indexing of their content in exchange for access to web traffic. We must allow Google to include our content in their generative AI overviews in exchange for any links to our site appearing anywhere in Google search results. Not being in Google is a death sentence, and, thus, we must comply. Preventing OpenAI, Applebot, Anthropic, etc. from crawling all of our content essentially means blocking ourselves from being found in the search engines of tomorrow.

For example, with Meta and others removing fact checking, we should find a way to render their AI and search results full of not-so-useful information.

We are right now veering towards a Fascist state with oligarchs and mega corporations stoking coal into the ovens.

We shouldn't be fuel for those ovens.

I’m not nearly as much of a conspiracy theorist. I also don’t think that spamming Facebook with nonsensical posts is going to make the world a better place.

Don't waste your time, rproffitt. Spamming the web is unlikely to achieve your goals...

Firstly, everything you post online is but a wee drop in the ocean. You'd need to do an illegal amount of spamming in order to sway an opinion.

Secondly, AI bots crawling the web can be instructed to simply ignore pages that contain censored keywords. AI may never get to read your posts!

The entities accusing AI of plagiarism are typically copyright owners who are understandably looking after their own interests. But it's also in everyone's interest for AI to be trained on reliable information, if we want AI to be useful to us, otherwise we'll end up with "garbage in, garbage out". We are going through a period of transformative change. There will be winners and losers. Embrace the future.

commented: Much agreed :) +34
commented: Not spam but poison. +17

I asked around and it appears we can effect change. The immigrant reporting hotline was flooded with reports about Elon Musk, so that line shut down.

As to AI crawlers, the work to poison the AIs is well underway. Examples follow.

Here is a curated list of strategies, offensive methods, and tactics for (algorithmic) sabotage, disruption, and deliberate poisoning; a rough sketch of the approach they share follows the list.

🔻 iocaine
The deadliest AI poison—iocaine generates garbage rather than slowing crawlers.
🔗 https://git.madhouse-project.org/algernon/iocaine

🔻 Nepenthes
A tarpit designed to catch web crawlers, especially those scraping for LLMs. It devours anything that gets too close. @aaron
🔗 https://zadzmo.org/code/nepenthes/

🔻 Quixotic
Feeds fake content to bots and robots.txt-ignoring #LLM scrapers. @marcusb
🔗 https://marcusb.org/hacks/quixotic.html

🔻 Poison the WeLLMs
A reverse-proxy that serves dissociated-press style reimaginings of your upstream pages, poisoning any LLMs that scrape your content. @mike
🔗 https://codeberg.org/MikeCoats/poison-the-wellms

🔻 Django-llm-poison
A django app that poisons content when served to #AI bots. @Fingel
🔗 https://github.com/Fingel/django-llm-poison

🔻 KonterfAI
A model poisoner that generates nonsense content to degenerate LLMs.
🔗 https://codeberg.org/konterfai/konterfai
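The tools above differ in the details, but they share one pattern: recognise a crawler you don't want and answer it with machine-generated filler plus links that lead only to more filler. Below is a rough standard-library sketch of that pattern; the trapped user-agent names and the filler text are placeholders I picked, and this is not the actual code of any project listed above.

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical crawler signatures to trap; real deployments usually
# match many more agents, or trap anything that wanders into a path
# that robots.txt already forbids.
TRAP_AGENTS = ("GPTBot", "CCBot", "Bytespider")
WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

def babble(n_words: int = 200) -> str:
    """Cheap stand-in for a proper scrambler (see the dissociated-press sketch above)."""
    return " ".join(random.choice(WORDS) for _ in range(n_words))

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if not any(sig in agent for sig in TRAP_AGENTS):
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"Normal page for normal visitors.\n")
            return
        # Trapped crawler: nonsense text plus links that only lead to
        # more nonsense, so the crawl never terminates usefully.
        links = "".join(
            f'<a href="/maze/{random.randint(0, 10**9)}">more</a> '
            for _ in range(5)
        )
        body = f"<html><body><p>{babble()}</p>{links}</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```

Real deployments often hide the trap behind a path that robots.txt already disallows, so only crawlers that ignore the file ever fall in.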

commented: Excellent resource list +16

If you're not a part of the solution, you're a part of the precipitate.

I think this sounds terrible. The global population is, more and more, relying on AI to serve up accurate answers. There's already the gigantic problem of hallucinations, with AI consistently spewing out false information that sounds entirely believable and thereby spreading misinformation.

How is making the problem worse going to help with your mission of turning the world into a better place?

commented: AI appears to be making things worse. Better for the robber barons, not so much for us. +0

As an example, the person who developed Iocaine found that 94% of the traffic to his site was caused by bots. When you price and design a site for an expected human load, and then you get overwhelmed by bots, you can throw more money at it or you can take action against the bots. In my meagre understanding of all things web related, robots.txt is supposed to specify which pages of a website should be crawled or not crawled by bots. But it seems that the AI bots are ignoring this file. As such, any action taken against them by site owners is, in my mind, justified, including poisoning the data and sending them down rabbit holes.

The increasing energy demands caused by wider adoption of AI are only going to accelerate the already critical global warming crisis. I think that instead of building more powerful AI engines we should focus on developing lower-energy versions. Alternatively, we could arrange with Iceland to build the data centres where they can be run entirely on geothermal energy. I'm sure they wouldn't mind the added revenue, as long as it could be done while preserving their environment.

If you have a few minutes to kill you might want to read the Wikipedia entry on enshittification.

commented: Thanks for this. Since AI has brought us to this point, we must poison those bots. +17

When you price and design a site for an expected human load, and then you get overwhelmed by bots, you can throw more money at it or you can take action against the bots.

It's true that the majority of websites on the Internet today spend more bandwidth on bots than they do on human visitors. However, there are both bad bots and good bots, and they are not created equally.

In my meagre understanding of all things web related, robots.txt is supposed to specify which pages of a website should be crawled or not crawled by bots.

This is true. The primary difference between good bots and bad bots is that good bots respect your robots.txt file, which dictates which part of your site the specific bot is allowed to crawl, as well as how often it is able to be crawled, while bad bots tend to ignore this file.
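For readers who haven't poked at robots.txt directly, here is a small illustration using Python's standard urllib.robotparser. The directives and the Googlebot / GPTBot group names are example values of my own choosing, and Crawl-delay in particular is a widely honoured but non-standard directive:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: let Googlebot crawl everything except
# /private/, keep GPTBot out entirely, and slow everyone else down.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Googlebot", "/articles/ai"))   # True
print(rp.can_fetch("Googlebot", "/private/data"))  # False
print(rp.can_fetch("GPTBot", "/articles/ai"))      # False
```

A well-behaved crawler runs exactly this kind of check before requesting a page; a bad bot simply never asks.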

However, that does not mean it's not possible to tame bad bots. Bad bots (and even good bots) can easily be tamed by serving them the appropriate HTTP status code. Instead of a 200 OK, you would send them a 429 to indicate a temporary block for too many requests, or a 403 Forbidden if your intent is to permanently block the bot.

Good bots (and even most bad bots) tend to understand the intent of the status codes (e.g. 429 means try again later, but at a slower crawl speed), and, either way, you won't be wasting any bandwidth by serving these error codes.
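A minimal sketch of that status-code approach, written with Flask purely for illustration (the same logic usually lives in the web server or CDN layer); the agent lists are hypothetical placeholders:

```python
from flask import Flask, request, Response

app = Flask(__name__)

# Hypothetical lists; tune them to the bots you actually see in your logs.
BLOCKED_AGENTS = ("Bytespider",)           # never welcome
THROTTLED_AGENTS = ("GPTBot", "CCBot")     # welcome, but slow down

@app.before_request
def tame_bots():
    agent = request.headers.get("User-Agent", "")
    if any(sig in agent for sig in BLOCKED_AGENTS):
        # Permanent refusal: cheap to serve, essentially no body.
        return Response("Forbidden", status=403)
    if any(sig in agent for sig in THROTTLED_AGENTS):
        # Temporary refusal: well-behaved bots back off and retry later.
        return Response("Too Many Requests", status=429,
                        headers={"Retry-After": "3600"})
    return None  # fall through to the normal page

@app.route("/")
def index():
    return "Normal content for humans and well-behaved bots."
```

In practice you would send the 429 only once a bot exceeds some request budget rather than on every hit; the point is that the status code itself carries the message, with almost no bandwidth spent on a body.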

But it seems that the AI bots are ignoring this file.

That is completely untrue. Bad bots tend to ignore the robots.txt file, but, as mentioned, that just means there's a different way of handling them that is just as effective, and does not cost any money. However, bad bots are not AI bots. Nearly all AI bots fall under good bots, such as OpenAI, Googlebot, etc., which of course respect robots.txt.

As mentioned in my previous post, if publishers wanted to block bandwidth resources from AI bots such as Googlebot, we easily could. The problem is that, in doing so, we're shooting ourselves in the foot. The Internet (the entire world economy, really) has been set up in such a way as to make the little guys reliant on big enterprises such as Google. rproffitt, it seems, thinks he can take down Google by spewing garbage across the web.

As such, any action taken against them by site owners is, in my mind, justified, including poisoning the data and sending them down rabbit holes.

I have to wholeheartedly disagree with this. Being a wannabe hacktivist because you don't like OpenAI or Google or big enterprise or whatever it is, by purposefully spewing garbage across the web, does nothing to make the world a better place.

Thanks for the extra info, although I disagree with the spewing comment. Nepenthes and Iocaine do not spew garbage across the web. They feed garbage to bots that access the protected sites. AIs that return bogus results, on the other hand, ARE spewing garbage across the web. BTW, Nepenthes makes it clear that implementation will result in being unindexed by Google.

The creator of Nepenthes says that it is ineffective against OpenAI which I take to mean that OpenAI is ignoring robots.txt.

commented: I'll play the tune for us: "Bad bots, bad bots, what you gonna do when they come for you?" (poison them.) +0

The OpenAI bot appears to be a bad bot.

This is not my experience. OpenAI respects my robots.txt file perfectly. I do want to add, though, that robots.txt files are very finicky, and I have seen many, many times people blaming the bots when the problem lies with a syntax or logic error in their robots.txt.

Nepenthes and Iocaine do not spew garbage across the web. They feed garbage to bots that access the protected sites.

The technique you're referring to is called spoofing, and it's what happens when you serve one set of content up to certain user agents or IP addresses, and a different set of content up to other user agents or IP addresses. It's still considered spewing garbage across the web. That garbage is being fed into Google. Into Wikipedia. Into the Internet Archive. Into ChatGPT. And, ultimately, it will end up being consumed by innocent users of the web.
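To make the term concrete, spoofing here just means the response depends on who is asking. A toy sketch, where the suspect-agent list and the stand-in poison() scrambler are my own placeholders rather than anyone's real implementation:

```python
SUSPECT_AGENTS = ("GPTBot", "CCBot")           # hypothetical list
PAGES = {"/about": "DaniWeb is a community of developers."}

def poison(text: str) -> str:
    # Stand-in for a real scrambler (see the dissociated-press sketch above).
    return " ".join(reversed(text.split()))

def respond(path: str, user_agent: str) -> str:
    """Same URL, different answers, depending on who is asking."""
    page = PAGES.get(path, "Not found")
    if any(sig in user_agent for sig in SUSPECT_AGENTS):
        return poison(page)
    return page

print(respond("/about", "Mozilla/5.0"))               # real content
print(respond("/about", "GPTBot/1.1 (+openai.com)"))  # scrambled content
```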

The creator of Nepenthes says that it is ineffective against OpenAI which I take to mean that OpenAI is ignoring robots.txt.

I would say it's ineffective against OpenAI because OpenAI can detect the content thrown at it is nonsensical, and/or they're being delivered spoofed content, and they choose to actively ignore it.

The OpenAI bot appears to be a bad bot.

Specifically, I would bet quite a large sum of money that the people who are complaining they can't get OpenAI to respect their robots.txt file either have a syntax error in their file, and/or aren't naming the correct user agents. I've seen people mistakenly try to reference a user agent called "OpenAI"! https://platform.openai.com/docs/bots/
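To illustrate why the naming matters, the snippet below runs both versions through Python's urllib.robotparser, which approximates the matching rules crawlers use: the group token has to correspond to the bot's advertised product token, which for OpenAI's crawler is GPTBot per the page linked above.

```python
from urllib.robotparser import RobotFileParser

# Wrong: "OpenAI" is not the product token the crawler looks for,
# so this group never applies to it.
WRONG = "User-agent: OpenAI\nDisallow: /\n"

# Right: the crawler identifies itself as GPTBot.
RIGHT = "User-agent: GPTBot\nDisallow: /\n"

for label, txt in (("wrong", WRONG), ("right", RIGHT)):
    rp = RobotFileParser()
    rp.parse(txt.splitlines())
    print(label, rp.can_fetch("GPTBot", "/forums/"))
# wrong True  -> the bot is still allowed everywhere
# right False -> the bot is actually excluded
```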

The creator of Nepenthes says that it is ineffective against OpenAI which I take to mean that OpenAI is ignoring robots.txt.

As mentioned, Nepenthes uses the spoofing technique. Spoofing does not rely whatsoever on bots following robots.txt.

But it's also in everyone's interest for AI to be trained on reliable information, if we want AI to be useful to us

Yeah, that ship slipped its mooring when Facebook appeared, drifted out to sea on the Twitter tide, and promptly sank when muck took it over.

Domain-specific AIs trained on the likes of https://arxiv.org/ might be worth something.

The garbage on social media just needs to be left to rot.

OpenAI can detect the content thrown at it is nonsensical

So OpenAI doesn't crawl Facebook and Twitter? How about Fox News and related sites? And if it ignores Fox, etc., are we then going to get Trump screaming about radical liberal bias? How does AI distinguish between conspiracy theory and reality?

commented: Let's include what we see at the US Gov websites now. +17

Remember what happened with Microsoft's chatbot, Tay? It was shut down after only 16 hours when trolls trained it to spout racist slurs and profanity. OpenAI and similar systems are trained on the cesspool that is the entire internet. Sturgeon's Law says 90% of everything is crap. That may well apply to the internet. I'm surprised it hasn't collapsed under the digital weight of the massive amounts of data uploaded daily just to YouTube.

commented: I'm going to say it has. Many places ban or remove AI generated content. But hey, so many bots. +0

Many places ban or remove AI generated content.

We are one of them! :)

As a human, can you detect gibberish content? You may think you can fool AI today or tomorrow, but what about a year from now? At some point in the future AI will match our intelligence and then quickly surpass us. Generating gibberish content might impede AI for a while but it's only delaying the inevitable. Resistance is useless!

commented: "Take that meatbags" +17

Even human-generated content <edit - gibberish> can be hard to detect, except of course for Jordan Peterson.

commented: That and the one that writes "Covfefe." +17

To Pebble's point, I genuinely believe that the **** that was spewed in the first post of this thread is not any more sophisticated than those chain messages circulating on Facebook that say things like copy and paste the sentence, "I don't give Facebook the authority to blah or the copyright to blah" into a FB post, thinking it will be legally binding.

commented: Today it's clear that "Rule Of Law" is fantasy south of Canada +17

Note: in the previous post I meant to say gibberish instead of content.

commented: I'll just write from the institutions. Sorry about "the incident." +17

I'm realizing that "poisoning AI web crawls" could suggest malicious actions, which are often prohibited. Thus, providing guidance for such a request is inappropriate and against policy.

commented: "Kiss my shiny metal ***" -4

"Kiss my shiny metal ***"

Seriously?!

commented: OpenAI rips content, no one bats an eye. DeepSeek does the same, "They are ripping off our work." +0

OpenAI rips content, no one bats an eye. DeepSeek does the same, "They are ripping off our work."

I don't know why you think that. In the SEO publishing industry, us publishers have been very vocally complaining that OpenAI, Google, etc. have been stealing our content for at least 2 years now.

I think the difference is, as I pointed out in my previous post here, us publishers have a symbiotic/codependent relationship with OpenAI, Google, etc. because it's those services that send us the majority of our web traffic.

When it comes to some random Chinese company that we aren't relying on for our own business model, we can take action to shoo them away without repercussions. We can't afford to do that with OpenAI.

Sending away AI spiders isn't a technical problem at all; that's why I don't understand your whole poisoning-with-gibberish approach. For us publishers it's a business problem, not a technical one.

commented: Also: OpenAI Claims DeepSeek Plagiarized Its Plagiarism Machine +0

I think people are not understanding what I'm saying here. Please allow me to demonstrate:

Looking at our Google Analytics right now, I can see that, aside from the top search engines such as Google, Bing, and DuckDuckGo, the next biggest place we get traffic from is ChatGPT. Moreover, the average engagement time per session for visitors finding DaniWeb through ChatGPT is more than double that of visitors finding DaniWeb from all other sources.

Us publishers are very aware that ChatGPT plagiarizes our content. We don't like that ChatGPT plagiarizes our content. Similarly, we are aware that Google plagiarizes our content, and we don't like that either. But, ultimately, it's a symbiotic relationship because, in return, ChatGPT gives us a good amount of quality web traffic we can't get from anywhere else. Google gives us nearly all our web traffic.

Poisoning ChatGPT isn't going to solve any problems. Rather, put your energy towards finding a way to give publishers like DaniWeb a way to earn an income without being dependent on ChatGPT and Google.

commented: Wow! I find it surprising most of your traffic comes from ChatGPT. I guess AI is replacing traditional search engine queries? +8

I guess AI is replacing traditional search engine queries?

ChatGPT traffic still doesn't surpass Google, but it's definitely way up there. I believe it's heading in that direction, yes.

Update, February 25, 2025: others are kicking it into high gear to resist certain government data collection.
[attached image: image_2025-02-25_085603458.png]

And here I was only thinking about poison for the AI bots.

commented: Disgusting display of hacktivism -8
commented: Fantastic display of hacktivism +16

As someone who has made a career out of working with ad agencies, and who has 3 patents on data mining user behavior within social platforms, I find that all absolutely abhorrent.

commented: Seems we should know what people are doing so you can adjust your data mining. Can you say "Arms race"? +17

This makes me think that we need WAAAY more apps that generate junk data.

Right. That's what we need. Still more junk. We'll just push Sturgeon's law from 90% to 99.99%. That will make things better.

commented: "Just one more lane and that will fix traffic." +17