datahoarder
Sorry, I don’t really understand why I’d potentially be subject to disciplinary action for downloading something that's fully available to the public?
Like I could give you the url right now and you could access it (for privacy, I won’t). It’s not sensitive information or anything.
Ah! Ok. From what you said, it sounded like it was something only available on an internal network. Like a corporate Wiki or some such. It's totally possible I just missed something in the original post that would have disabused me of that incorrect notion.
(Disclaimer, I am not a lawyer and like to geek out on intellectual property law studies sometimes, but ultimately don't really know what I'm talking about here. But...)
Even if it's publicly available, that probably doesn't make scraping a copy legal. (It's called "copyright" because it places restrictions on "copying", though it also places restrictions on other things like "public performance" and such.)
The knowledge bases being publicly available might make it less likely that they'll pursue you over anything. Maybe. But that depends on your boss's state of mind and such. One thought might be to do the scraping through Tor, but if you're doing this scraping close to the time you're leaving the company, that may not really give you any significant amount of plausible deniability.
Now, the fact that it's available on the open internet may open up some other options. If the number of pages in question isn't too terribly large, you could request one-by-one that the Wayback Machine save pages. (The "save page now" section at the bottom right.) That would mean a) no learning curve b) it's not you copying things so if there were any legal issues, it would most likely (again, IANAL) be Archive.org fighting them not you. That archive would stay up for a long time. You wouldn't have a copy locally, but if you had it bookmarked, it'd stay out there on the internet in a frozen state for you to reference any time.
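If clicking "save page now" for every page gets tedious, here's a rough sketch of doing the same thing from a script. It assumes the Wayback Machine still accepts a plain GET on https://web.archive.org/save/ followed by the page URL, which is the form the "save page now" box uses as far as I know. The example.com URLs, the User-Agent string, and the 15-second delay are all placeholders I made up, and Archive.org may rate-limit you if you go too fast.

```python
import time
import urllib.request

# Assumed endpoint: the "save page now" form on web.archive.org
SAVE_ENDPOINT = "https://web.archive.org/save/"


def build_save_url(page_url: str) -> str:
    """Return the URL that asks the Wayback Machine to capture page_url."""
    return SAVE_ENDPOINT + page_url


def request_snapshot(page_url: str, timeout: float = 60.0) -> int:
    """Ask for one capture and return the HTTP status code of the response."""
    req = urllib.request.Request(
        build_save_url(page_url),
        headers={"User-Agent": "kb-archiver-sketch/0.1"},  # placeholder UA
    )
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return resp.status


def save_all(page_urls, delay_seconds: float = 15.0) -> dict:
    """Request captures one by one, pausing between them to stay polite."""
    results = {}
    for url in page_urls:
        try:
            results[url] = request_snapshot(url)
        except OSError as exc:  # urllib's URLError/HTTPError subclass OSError
            results[url] = f"failed: {exc}"
        time.sleep(delay_seconds)
    return results


# Usage (hypothetical URLs, swap in the real knowledge base pages):
# save_all(["https://example.com/kb/page1", "https://example.com/kb/page2"])
```

You'd still want to spot-check a few of the captures on the Wayback Machine afterwards, since a successful request doesn't guarantee the page rendered the way you expect.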
Also, if you did go the wget route, you wouldn't have to deal with the cookies or anything, which makes things easier.
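For the wget route, something like the following might be a starting point. It just assembles a wget command and runs it, so wget has to be installed already. The flags are standard wget options for mirroring a site; the start URL, the kb-mirror directory name, and the one-second wait are my own placeholders, not anything from the original post.

```python
import subprocess


def build_wget_command(start_url: str, dest_dir: str, wait_seconds: int = 1) -> list:
    """Build a wget invocation that mirrors a public site for offline reading."""
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--convert-links",     # rewrite links so the local copy is browsable
        "--adjust-extension",  # add .html where the server omits it
        "--page-requisites",   # also fetch the CSS, images, and scripts pages need
        "--no-parent",         # never climb above the starting directory
        f"--wait={wait_seconds}",           # pause between requests
        f"--directory-prefix={dest_dir}",   # where the copy is written
        start_url,
    ]


def mirror(start_url: str, dest_dir: str) -> int:
    """Run wget and return its exit code (0 means it finished cleanly)."""
    return subprocess.run(build_wget_command(start_url, dest_dir), check=False).returncode


# Usage (hypothetical URL and directory):
# mirror("https://example.com/kb/", "kb-mirror")
```

Since the pages are public, no cookie or login options are needed here, which is what keeps the command this short.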