Here's an example I was given from the web:

动态网自由门 天安門 天安门 法輪功 李洪志 Free Tibet 六四天安門事件 The Tiananmen Square protests of 1989 天安門大屠殺 The Tiananmen Square Massacre 反右派鬥爭 The Anti-Rightist Struggle 大躍進政策 The Great Leap Forward 文化大革命 The Great Proletarian Cultural Revolution 人權 Human Rights 民運 Democratization 自由 Freedom 獨立 Independence 多黨制 Multi-party system 台灣 臺灣 Taiwan Formosa 中華民國 Republic of China 西藏 土伯特 唐古特 Tibet 達賴喇嘛 Dalai Lama 法輪功 Falun Dafa 新疆維吾爾自治區 The Xinjiang Uyghur Autonomous Region 諾貝爾和平獎 Nobel Peace Prize 劉暁波 Liu Xiaobo 民主 言論 思想 反共 反革命 抗議 運動 騷亂 暴亂 騷擾 擾亂 抗暴 平反 維權 示威游行 李洪志 法輪大法 大法弟子 強制斷種 強制堕胎 民族淨化 人體實驗 肅清 胡耀邦 趙紫陽 魏京生 王丹 還政於民 和平演變 激流中國 北京之春 大紀元時報 九評論共産黨 獨裁 專制 壓制 統一 監視 鎮壓 迫害 侵略 掠奪 破壞 拷問 屠殺 活摘器官 誘拐 買賣人口 遊進 走私 毒品 賣淫 春畫 賭博 六合彩 天安門 天安门 法輪功 李洪志 Winnie the Pooh 劉曉波动态网自由门

Next we have the Dissociated Press examples from https://en.wikipedia.org/wiki/Dissociated_press, where a script remixes existing text into interesting-looking replies.
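For anyone curious how such a script works, here is a minimal sketch of the word-level dissociated-press idea in Python. The n-gram length, output length, and the corpus.txt filename are placeholder choices of mine, not anything taken from the Wikipedia article or the original Emacs command:

```python
import random
from collections import defaultdict

def dissociate(corpus: str, ngram: int = 2, length: int = 60) -> str:
    """Word-level dissociated press: stitch together fragments of the
    corpus that share an ngram-word overlap, producing text that is
    locally plausible but globally nonsense."""
    words = corpus.split()
    # Map each ngram-word prefix to the words that can follow it.
    followers = defaultdict(list)
    for i in range(len(words) - ngram):
        key = tuple(words[i:i + ngram])
        followers[key].append(words[i + ngram])

    # Start at a random prefix and walk the table.
    key = random.choice(list(followers))
    output = list(key)
    for _ in range(length):
        choices = followers.get(key)
        if not choices:                       # dead end: jump somewhere new
            key = random.choice(list(followers))
            output.extend(key)
            continue
        output.append(random.choice(choices))
        key = tuple(output[-ngram:])
    return " ".join(output)

if __name__ == "__main__":
    sample = open("corpus.txt").read()        # any reasonably long plain-text corpus
    print(dissociate(sample))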

AI as it stands today is plagiarism on a grand scale.

So the rulers have said it's bad if we plagiarize, but good when they make billions by plagiarizing, or rather PILLAGING, the web.

I don't understand what goal you are trying to achieve.

Is your goal to open a dialog about the pros and cons of AI?

DaniWeb is powered by Cloudflare. One of the functions of Cloudflare is a sophisticated system to analyze and control how AI crawlers scan the website. In other words, if I want to dissuade AI bots from crawling DaniWeb, I would do so much more elegantly than by spamming the forums.

AI as it stands today is plagiarism on a grand scale.

I would have to agree with this. However, as a business model, the web has been set up so as to encourage (aka coerce) publishers to allow the unfettered crawling and indexing of their content in exchange for access to web traffic. We must allow Google to include our content in their generative AI overviews in exchange for any links to our site appearing anywhere in Google search results. Not being in Google is a death sentence, and, thus, we must comply. Preventing OpenAI, Applebot, Anthropic, etc. from crawling all of our content essentially means blocking ourselves from being found in the search engines of tomorrow.

For example, with Meta and others removing fact checking, we should find a way to render their AI and search results full of not-so-useful information.

We are right now veering towards a Fascist state with oligarchs and mega corporations stoking coal into the ovens.

We shouldn't be fuel for those ovens.

I’m not nearly as much of a conspiracy theorist. I also don’t think that spamming Facebook with nonsensical posts is going to make the world a better place.

Don't waste your time, rproffitt. Spamming the web is unlikely to achieve your goals...

Firstly, everything you post online is but a wee drop in the ocean. You'd need to do an illegal amount of spamming in order to sway an opinion.

Secondly, AI bots crawling the web can be instructed to simply ignore pages that contain censored keywords. AI may never get to read your posts!

The entities accusing AI of plagiarism are typically copyright owners who are understandably looking after their own interests. But it's also in everyone's interest for AI to be trained on reliable information, if we want AI to be useful to us, otherwise we'll end up with "garbage in, garbage out". We are going through a period of transformative change. There will be winners and losers. Embrace the future.

commented: Much agreed :) +34
commented: Not spam but poison. +17

I asked around and it appears we can effect change. The immigrant reporting hotline was flooded with reports about Elon Musk, so that line shut down.

As to AI crawlers, the work to poison the AIs is well underway. Examples follow.

Here is a curated list of strategies, offensive methods, and tactics for (algorithmic) sabotage, disruption, and deliberate poisoning; a rough sketch of the approach they share follows the list.

🔻 iocaine
The deadliest AI poison—iocaine generates garbage rather than slowing crawlers.
🔗 https://git.madhouse-project.org/algernon/iocaine

🔻 Nepenthes
A tarpit designed to catch web crawlers, especially those scraping for LLMs. It devours anything that gets too close. @aaron
🔗 https://zadzmo.org/code/nepenthes/

🔻 Quixotic
Feeds fake content to bots and robots.txt-ignoring #LLM scrapers. @marcusb
🔗 https://marcusb.org/hacks/quixotic.html

🔻 Poison the WeLLMs
A reverse-proxy that serves dissociated-press style reimaginings of your upstream pages, poisoning any LLMs that scrape your content. @mike
🔗 https://codeberg.org/MikeCoats/poison-the-wellms

🔻 Django-llm-poison
A django app that poisons content when served to #AI bots. @Fingel
🔗 https://github.com/Fingel/django-llm-poison

🔻 KonterfAI
A model poisoner that generates nonsense content to degenerate LLMs.
🔗 https://codeberg.org/konterfai/konterfai
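The tools above differ in the details, but they share one pattern: recognise a crawler you don't want and answer it with machine-generated filler plus links that lead only to more filler. Below is a rough standard-library sketch of that pattern; the trapped user-agent names and the filler text are placeholders I picked, and this is not the actual code of any project listed above.

```python
import random
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical crawler signatures to trap; real deployments usually
# match many more agents, or trap anything that wanders into a path
# that robots.txt already forbids.
TRAP_AGENTS = ("GPTBot", "CCBot", "Bytespider")
WORDS = "lorem ipsum dolor sit amet consectetur adipiscing elit".split()

def babble(n_words: int = 200) -> str:
    """Cheap stand-in for a proper scrambler (see the dissociated-press sketch above)."""
    return " ".join(random.choice(WORDS) for _ in range(n_words))

class TarpitHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        agent = self.headers.get("User-Agent", "")
        if not any(sig in agent for sig in TRAP_AGENTS):
            self.send_response(200)
            self.send_header("Content-Type", "text/plain")
            self.end_headers()
            self.wfile.write(b"Normal page for normal visitors.\n")
            return
        # Trapped crawler: nonsense text plus links that only lead to
        # more nonsense, so the crawl never terminates usefully.
        links = "".join(
            f'<a href="/maze/{random.randint(0, 10**9)}">more</a> '
            for _ in range(5)
        )
        body = f"<html><body><p>{babble()}</p>{links}</body></html>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body.encode())

if __name__ == "__main__":
    HTTPServer(("127.0.0.1", 8080), TarpitHandler).serve_forever()
```

Real deployments often hide the trap behind a path that robots.txt already disallows, so only crawlers that ignore the file ever fall in.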

commented: Excellent resource list +16

If you're not a part of the solution, you're a part of the precipitate.

I think this sounds terrible. The global population is, more and more, relying on AI to serve up accurate answers. There's already the gigantic problem of hallucinations, with AI consistently spewing out false information that sounds entirely believable and thereby spreading misinformation.

How is making the problem worse going to help with your mission of turning the world into a better place?

commented: AI appears to be making things worse. Better for the robber barons, not so much for us. +0

As an example, the person who developed Iocaine found that 94% of the traffic to his site was caused by bots. When you price and design a site for an expected human load, and then you get overwhelmed by bots, you can throw more money at it or you can take action against the bots. In my meagre understanding of all things web related, robots.txt is supposed to specify which pages of a website should be crawled or not crawled by bots. But it seems that the AI bots are ignoring this file. As such, any action taken against them by site owners is, in my mind, justified, including poisoning the data and sending them down rabbit holes.

The increasing energy demands caused by wider adoption of AI are only going to accelerate the already critical global warming crisis. I think that instead of building more powerful AI engines we should focus on developing lower-energy versions. Alternatively, we could arrange with Iceland to build the data centres where they can be run entirely on geothermal energy. I'm sure they wouldn't mind the added revenue, as long as it could be done while preserving their environment.

If you have a few minutes to kill you might want to read the Wikipedia entry on enshittification.

commented: Thanks for this. Since AI has brought us to this point, we must poison those bots. +17

When you price and design a site for an expected human load, and then you get overwhelmed by bots, you can throw more money at it or you can take action against the bots.

It's true that the majority of websites on the Internet today spend more bandwidth on bots than they do on human visitors. However, there are both bad bots and good bots, and they are not created equally.

In my meagre understanding of all things web related, robots.txt is supposed to specify which pages of a website should be crawled or not crawled by bots.

This is true. The primary difference between good bots and bad bots is that good bots respect your robots.txt file, which dictates which part of your site the specific bot is allowed to crawl, as well as how often it is able to be crawled, while bad bots tend to ignore this file.
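For readers who haven't poked at robots.txt directly, here is a small illustration using Python's standard urllib.robotparser. The directives and the Googlebot / GPTBot group names are example values of my own choosing, and Crawl-delay in particular is a widely honoured but non-standard directive:

```python
from urllib.robotparser import RobotFileParser

# A hypothetical robots.txt: let Googlebot crawl everything except
# /private/, keep GPTBot out entirely, and slow everyone else down.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /private/

User-agent: GPTBot
Disallow: /

User-agent: *
Crawl-delay: 10
Disallow: /private/
"""

rp = RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())

print(rp.can_fetch("Googlebot", "/articles/ai"))   # True
print(rp.can_fetch("Googlebot", "/private/data"))  # False
print(rp.can_fetch("GPTBot", "/articles/ai"))      # False
```

A well-behaved crawler runs exactly this kind of check before requesting a page; a bad bot simply never asks.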

However, that does not mean it's not possible to tame bad bots. Bad bots (and even good bots) can easily be tamed by serving them the appropriate HTTP status code. Instead of a 200 OK, you would send them a 429 to indicate a temporary block for too many requests, or a 403 Forbidden if your intent is to permanently block the bot.

Good bots (and even most bad bots) tend to understand the intent of the status codes (e.g. 429 means try again later, but at a slower crawl speed), and, either way, you won't be wasting any bandwidth by serving these error codes.
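A minimal sketch of that status-code approach, written with Flask purely for illustration (the same logic usually lives in the web server or CDN layer); the agent lists are hypothetical placeholders:

```python
from flask import Flask, request, Response

app = Flask(__name__)

# Hypothetical lists; tune them to the bots you actually see in your logs.
BLOCKED_AGENTS = ("Bytespider",)           # never welcome
THROTTLED_AGENTS = ("GPTBot", "CCBot")     # welcome, but slow down

@app.before_request
def tame_bots():
    agent = request.headers.get("User-Agent", "")
    if any(sig in agent for sig in BLOCKED_AGENTS):
        # Permanent refusal: cheap to serve, essentially no body.
        return Response("Forbidden", status=403)
    if any(sig in agent for sig in THROTTLED_AGENTS):
        # Temporary refusal: well-behaved bots back off and retry later.
        return Response("Too Many Requests", status=429,
                        headers={"Retry-After": "3600"})
    return None  # fall through to the normal page

@app.route("/")
def index():
    return "Normal content for humans and well-behaved bots."
```

In practice you would send the 429 only once a bot exceeds some request budget rather than on every hit; the point is that the status code itself carries the message, with almost no bandwidth spent on a body.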

But it seems that the AI bots are ignoring this file.

That is completely untrue. Bad bots tend to ignore the robots.txt file, but, as mentioned, that just means there's a different way of handling them that is just as effective, and does not cost any money. However, bad bots are not AI bots. Nearly all AI bots fall under good bots, such as OpenAI, Googlebot, etc., which of course respect robots.txt.

As mentioned in my previous post, if publishers wanted to block bandwidth resources from AI bots such as Googlebot, we easily could. The problem is that, in doing so, we're shooting ourselves in the foot. The Internet (the entire world economy, really) has been set up in such a way as to make the little guys reliant on big enterprises such as Google. rproffitt, it seems, thinks he can take down Google by spewing garbage across the web.

As such, any action taken against them by site owners is, in my mind, justified, including poisoning the data and sending them down rabbit holes.

I have to wholeheartedly disagree with this. Being a wannabe hacktivist because you don't like OpenAI or Google or big enterprise or whatever it is, by purposefully spewing garbage across the web, does nothing to make the world a better place.

Thanks for the extra info, although I disagree with the spewing comment. Nepenthes and Iocaine do not spew garbage across the web. They feed garbage to bots that access the protected sites. AIs that return bogus results, on the other hand, ARE spewing garbage across the web. BTW, Nepenthes makes it clear that implementation will result in being unindexed by Google.

The creator of Nepenthes says that it is ineffective against OpenAI which I take to mean that OpenAI is ignoring robots.txt.

commented: I'll play the tune for us: "Bad bots, bad bots, what you gonna do when they come for you?" (poison them.) +0

The OpenAI bot appears to be a bad bot.

This is not my experience. OpenAI respects my robots.txt file perfectly. I do want to add, though, that robots.txt files are very finicky, and I have seen many, many times people blaming the bots when the problem lies with a syntax or logic error in their robots.txt.

Nepenthes and Iocaine do not spew garbage across the web. They feed garbage to bots that access the protected sites.

The technique you're referring to is called spoofing, and it's what happens when you serve one set of content up to certain user agents or IP addresses, and a different set of content up to other user agents or IP addresses. It's still considered spewing garbage across the web. That garbage is being fed into Google. Into Wikipedia. Into the Internet Archive. Into ChatGPT. And, ultimately, it will end up being consumed by innocent users of the web.
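To make the term concrete, spoofing here just means the response depends on who is asking. A toy sketch, where the suspect-agent list and the stand-in poison() scrambler are my own placeholders rather than anyone's real implementation:

```python
SUSPECT_AGENTS = ("GPTBot", "CCBot")           # hypothetical list
PAGES = {"/about": "DaniWeb is a community of developers."}

def poison(text: str) -> str:
    # Stand-in for a real scrambler (see the dissociated-press sketch above).
    return " ".join(reversed(text.split()))

def respond(path: str, user_agent: str) -> str:
    """Same URL, different answers, depending on who is asking."""
    page = PAGES.get(path, "Not found")
    if any(sig in user_agent for sig in SUSPECT_AGENTS):
        return poison(page)
    return page

print(respond("/about", "Mozilla/5.0"))               # real content
print(respond("/about", "GPTBot/1.1 (+openai.com)"))  # scrambled content
```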

The creator of Nepenthes says that it is ineffective against OpenAI which I take to mean that OpenAI is ignoring robots.txt.

I would say it's ineffective against OpenAI because OpenAI can detect the content thrown at it is nonsensical, and/or they're being delivered spoofed content, and they choose to actively ignore it.

The OpenAI bot appears to be a bad bot.

Specifically, I would bet quite a large sum of money that the people who are complaining they can't get OpenAI to respect their robots.txt file either have a syntax error in their file, and/or aren't naming the correct user agents. I've seen people mistakenly try to reference a user agent called "OpenAI"! https://platform.openai.com/docs/bots/
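To illustrate why the naming matters, the snippet below runs both versions through Python's urllib.robotparser, which approximates the matching rules crawlers use: the group token has to correspond to the bot's advertised product token, which for OpenAI's crawler is GPTBot per the page linked above.

```python
from urllib.robotparser import RobotFileParser

# Wrong: "OpenAI" is not the product token the crawler looks for,
# so this group never applies to it.
WRONG = "User-agent: OpenAI\nDisallow: /\n"

# Right: the crawler identifies itself as GPTBot.
RIGHT = "User-agent: GPTBot\nDisallow: /\n"

for label, txt in (("wrong", WRONG), ("right", RIGHT)):
    rp = RobotFileParser()
    rp.parse(txt.splitlines())
    print(label, rp.can_fetch("GPTBot", "/forums/"))
# wrong True  -> the bot is still allowed everywhere
# right False -> the bot is actually excluded
```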

The creator of Nepenthes says that it is ineffective against OpenAI which I take to mean that OpenAI is ignoring robots.txt.

As mentioned, Nepenthes uses the spoofing technique. Spoofing does not rely whatsoever on bots following robots.txt.

But it's also in everyone's interest for AI to be trained on reliable information, if we want AI to be useful to us

Yeah, that ship slipped its mooring when Facebook appeared, drifted out to sea on the Twitter tide, and promptly sank when muck took it over.

Domain-specific AIs trained on the likes of https://arxiv.org/ might be worth something.

The garbage on social media just needs to be left to rot.

OpenAI can detect the content thrown at it is nonsensical

So OpenAI doesn't crawl Facebook and Twitter? How about Fox News and related sites? And if it ignores Fox, etc., are we then going to get Trump screaming about radical liberal bias? How does AI distinguish between conspiracy theory and reality?

commented: Let's include what we see at the US Gov websites now. +17

Remember what happened with Microsoft's chatbot, Tay? It was shut down after only 16 hours when trolls trained it to spout racist slurs and profanity. OpenAI and similar systems are trained on the cesspool that is the entire internet. Sturgeon's Law says 90% of everything is crap. That may well apply to the internet. I'm surprised it hasn't collapsed under the digital weight of the massive amounts of data uploaded daily just to YouTube.

commented: I'm going to say it has. Many places ban or remove AI generated content. But hey, so many bots. +0

Many places ban or remove AI generated content.

We are one of them! :)

As a human, can you detect gibberish content? You may think you can fool AI today or tomorrow, but what about a year from now? At some point in the future AI will match our intelligence and then quickly surpass us. Generating gibberish content might impede AI for a while but it's only delaying the inevitable. Resistance is useless!

commented: "Take that meatbags" +17

Even human-generated content <edit - gibberish> can be hard to detect, except of course for Jordan Peterson.

commented: That and the one that writes "Covfefe." +17

To Pebble's point, I genuinely believe that the **** that was spewed in the first post of this thread is not any more sophisticated than those chain messages circulating on Facebook that say things like copy and paste the sentence, "I don't give Facebook the authority to blah or the copyright to blah" into a FB post, thinking it will be legally binding.

commented: Today it's clear that "Rule Of Law" is fantasy south of Canada +17

Note: in the previous post I meant to say gibberish instead of content.

commented: I'll just write from the institutions. Sorry about "the incident." +17

I'm realizing that "poisoning AI web crawls" could suggest malicious actions, which are often prohibited. Thus, providing guidance for such a request is inappropriate and against policy.

commented: "Kiss my shiny metal ***" -4

"Kiss my shiny metal ***"

Seriously?!

commented: OpenAI rips content, no one bats an eye. DeepSeek does the same, "They are ripping off our work." +0

OpenAI rips content, no one bats an eye. DeepSeek does the same, "They are ripping off our work."

I don't know why you think that. In the SEO publishing industry, us publishers have been very vocally complaining that OpenAI, Google, etc. have been stealing our content for at least 2 years now.

I think the difference is, as I pointed out in my previous post here, us publishers have a symbiotic/codependent relationship with OpenAI, Google, etc. because it's those services that send us the majority of our web traffic.

When it comes to some random Chinese company that we aren't relying on for our own business model, we can take action to shoo them away without repercussions. We can't afford to do that with OpenAI.

Sending away AI spiders isn't a technical problem at all; that's why I don't understand your whole poisoning-with-gibberish approach. For us publishers it's a business problem, not a technical one.

commented: Also: OpenAI Claims DeepSeek Plagiarized Its Plagiarism Machine +0

I think people are not understanding what I'm saying here. Please allow me to demonstrate:

Looking at our Google Analytics right now, I can see that, aside from the top search engines such as Google, Bing, and DuckDuckGo, the next biggest place we get traffic from is ChatGPT. Moreover, the average engagement time per session for visitors finding DaniWeb through ChatGPT is more than double that of visitors finding DaniWeb from all other sources.

Us publishers are very aware that ChatGPT plagiarizes our content. We don't like that ChatGPT plagiarizes our content. Similarly, we are aware that Google plagiarizes our content, and we don't like that either. But, ultimately, it's a symbiotic relationship because, in return, ChatGPT gives us a good amount of quality web traffic we can't get from anywhere else. Google gives us nearly all our web traffic.

Poisoning ChatGPT isn't going to solve any problems. Rather, put your energy towards finding a way to give publishers like DaniWeb a way to earn an income without being dependent on ChatGPT and Google.

commented: Wow! I find it surprising most of your traffic comes from ChatGPT. I guess AI is replacing traditional search engine queries? +8

I guess AI is replacing traditional search engine queries?

ChatGPT traffic still doesn't surpass Google, but it's definitely way up there. I believe it's heading in that direction, yes.

Update, February 25, 2025: others are kicking it into high gear to resist certain government data collection.
[attached image: image_2025-02-25_085603458.png]

And here I was only thinking about poison for the AI bots.

commented: Disgusting display of hacktivism -8
commented: Fantastic display of hacktivism +16

As someone who has made a career out of working with ad agencies, and who has 3 patents on data mining user behavior within social platforms, I find that all absolutely abhorrent.

commented: Seems we should know what people are doing so you can adjust your data mining. Can you say "Arms race"? +17

This makes me think that we need WAAAY more apps that generate junk data.

Right. That's what we need. Still more junk. We'll just push Sturgeon's law from 90% to 99.99%. That will make things better.

commented: "Just one more lane and that will fix traffic." +17