this post was submitted on 02 Jun 2025
34 points (100.0% liked)

TechTakes


Big brain tech dude got yet another clueless take over at HackerNews etc? Here's the place to vent. Orange site, VC foolishness, all welcome.

This is not debate club. Unless it’s amusing debate.

For actually-good tech, you want our NotAwfulTech community

[–] HedyL@awful.systems 7 points 3 days ago* (last edited 3 days ago) (3 children)

Even if it's not the main topic of this article, I'm personally pleased that RationalWiki is back. And if the AI bots are now getting the error messages instead of me, then that's all the better.

Edit: But also - why do AI scrapers request pages that show differences between versions of wiki pages (or perform other similarly complex requests)? What's the point of that anyway?

[–] sailor_sega_saturn@awful.systems 10 points 3 days ago (1 children)

> Edit: But also - why do AI scrapers request pages that show differences between versions of wiki pages (or perform other similarly complex requests)? What’s the point of that anyway?

This is just naive web crawling: Crawl a page, extract all the links, then crawl all the links and repeat.

Any crawler that doesn't know what it's doing and doesn't respect robots.txt, but wants to crawl an entire domain, will end up following these sorts of links naturally. It has no sense that the requests are "complex"; it's just fetching a URL with a few more query parameters than the one it started from.
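
In Python the whole thing is roughly this (a toy, stdlib-only sketch of that loop, not any particular bot's code):

```python
# Toy breadth-first crawler: fetch a page, pull out every <a href>,
# queue anything on the same host, repeat. No robots.txt check, no
# notion of which URLs are "worth" fetching.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def naive_crawl(start_url, max_pages=100):
    host = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([start_url])
    while queue and len(seen) < max_pages:
        url = queue.popleft()
        try:
            html = urlopen(url, timeout=10).read().decode("utf-8", "replace")
        except OSError:
            continue
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            # A MediaWiki ?diff=...&oldid=... link is just another URL here;
            # the crawler has no idea the request is "complex".
            if urlparse(absolute).netloc == host and absolute not in seen:
                seen.add(absolute)
                queue.append(absolute)
    return seen
```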

The article even alludes to how to take advantage of this with its "trap the bots in a maze of fake pages" suggestion. Even crawlers that know what they're doing will sometimes struggle with infinite URL spaces.
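
A trap along those lines can be as dumb as a server that answers every request with links to more freshly invented pages. Hypothetical sketch (not whatever tool the article actually suggests):

```python
# Toy "maze" server: every GET returns ten links to random URLs that
# this same handler will also answer, so a crawler that follows every
# link never runs out of pages. Host/port are made up for the example.
import random
import string
from http.server import BaseHTTPRequestHandler, HTTPServer

def random_slug(length=8):
    return "".join(random.choices(string.ascii_lowercase, k=length))

class MazeHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        links = "".join(
            f'<a href="/maze/{random_slug()}">{random_slug()}</a><br>'
            for _ in range(10)
        )
        body = f"<html><body>{links}</body></html>".encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), MazeHandler).serve_forever()
```

Point the naive crawler above at it and `seen` just grows until it hits `max_pages`; a bot with no such limit would hammer the host forever.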

[–] HedyL@awful.systems 7 points 3 days ago* (last edited 3 days ago)

> This is just naive web crawling: Crawl a page, extract all the links, then crawl all the links and repeat.

It's so ridiculous: supposedly these people have access to a super-smart AI (which is allegedly going to take all our jobs soon), yet that AI can't even tell them which pages are worth scraping multiple times per second and which are not. Instead, they regularly appear to kill their hosts like maladapted parasites. It's probably not surprising, but still absurd.

Edit: Of course, I strongly suspect that the scrapers don't use the AI in this context (I guess they only used it to write their crawler code based on old Stack Overflow posts). That doesn't make it any less ridiculous, though.
