Just realized my old forum posts from 3 years ago are now being scraped for AI training.

I was on a niche hobby forum last week and saw a user with a familiar, generic writing style, and a reverse image search on their profile picture led to a stock photo site. The whole thing felt like a bot scraping our old conversations to generate new 'authentic' content. Does anyone have a reliable method for checking if a specific text block has been used in a training dataset?

3 comments

tyler_hunt72 3mo ago

Actually, that stock photo thing is a huge red flag in my book. knight.drew makes a fair point about generic writing, but bots are getting better at copying casual styles. Those old forum posts are a goldmine for training because they show real people talking in natural language. Wouldn't the big companies want that exact kind of messy, authentic data? How do we even know what gets scraped anymore?

the_riley 2mo ago

Right, because nothing says "authentic human" like that perfectly lit stock photo.

knight.drew 3mo ago

Honestly, I doubt most hobby forums are a primary target. The big AI companies need massive, clean datasets, not random threads about fixing a carburetor. That generic writing style is just how a lot of people type online. The stock photo is more likely a new user who didn't want to upload a real picture.