this post was submitted on 28 Aug 2023
16 points (100.0% liked)

datahoarder


Greetings!

I have two related knowledge bases that I have been working on for a long time, and I would like to download them for my portfolio. I reduced the size of one by 5/6 and the other by 4/5, on top of very heavy ease-of-use editing. It is a substantial improvement, and it's the kind of work I want to market myself for if I have to keep working. I actively enjoy doing it.

My company will not approve the third-party backup application for Zendesk (the host of these KBs), and I don't know how to write a script to back them up or anything. I don't know how any of that works, but Zendesk recommends a Python script.

I checked a few indexers and couldn't find the pages, but even if they were indexed by URL, I can't really go through and download individual page links, so I'm not sure what to do. I don't know Python, or anyone who knows Python.

For context, the project as a whole has been my world for the last 9 months. It's not what I was explicitly hired to fix, which is why they dgaf about my portfolio, but it's what I want to do (and I'm great at it, even have a degree for it!). I want a snapshot of the whole thing on my last day, ideally including the archived articles I got rid of.

Any ideas for someone who doesn’t code?

top 10 comments
[–] TootSweet@lemmy.world 7 points 1 year ago (2 children)

Hoo boy.

So, first off, this might be something you've already considered, but you mentioned "your company" a couple of times. Chances are, if you've contributed to these knowledge bases as an employee, the content in those knowledge bases is owned by your employer. If you don't get permission to take a copy and incorporate it into your portfolio, you're potentially asking for legal trouble.

So, I guess what I'm saying is if you're not 100% sure your employer will be totally cool with it, try not to get caught. Also, depending on how on-the-ball the IT department is, they might notice that you're scraping specifically because you're sending a lot of extra traffic to those knowledge bases.

So, with that out of the way: wget will most likely do it, but it'll probably take a lot of tinkering.

A few considerations. First, if you have to be authenticated to access the content you're wanting to scrape, that'll definitely add a layer of complexity to this endeavor.

Second, if the Zendesk application interface uses a lot of AJAX, it's likely to be much more of a bear.

If you're using wget, a basic version of what you'll need is a command something like:

wget --mirror --tries 5 -o log --continue --show-progress --wait 2 --waitretry 2 --convert-links --page-requisites https://example.com/

If you need to be authenticated to view the data you're wanting, you'll have to do some magic with cookies or some such. The way I'd deal with that:

  • Log in to the application in a browser.
  • Hit F12 to open developer tools.
  • Hit the "Storage" tab. (Sorry, I don't remember if the equivalent tab is called "Storage" or something else in Chromium-based browsers. I'm using Firefox.)
  • Expand out "Cookies" on the left. (Again, sorry if you're using Chrome or some such.)
  • Make a text file with one line per cookie. This bit is likely to be quite finicky. You'll have to use an editor that saves plain text, not rich text (think Notepad, not Microsoft Word). Notepad++ will work nicely; I don't think plain Notepad will, because you'll probably need to use Unix-style line endings.
  • Each line needs to follow the standard Netscape cookies.txt format that wget expects. Note in particular that the fields are separated by tab characters, not spaces. (There's an example line after this list.)
  • Save that text to a file named "cookies.txt" in the directory where you're running wget, and add --load-cookies cookies.txt to your wget command. You'll need to add that right before the URL.
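
For reference, a single line in that file looks roughly like this (fields separated by tabs; the domain, expiry timestamp, cookie name, and value are made-up placeholders):

example.zendesk.com	FALSE	/	TRUE	1767225600	_zendesk_session	abc123def456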

If you're on Windows, you can get wget here.

And, honestly, all this is a little like programming, really. Unfortunately, I'm not aware of any friendlier kind of app for this sort of thing. Hopefully this ends up getting you what you're hoping for.
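
One more thought: if you ever find someone willing to run the Python script Zendesk suggests, a rough, untested sketch of that approach could look something like the block below. It assumes the knowledge base is public and exposes the standard Help Center articles endpoint; "yourcompany" is a placeholder subdomain.

import json
import pathlib

import requests

# Placeholder subdomain; swap in the real one. Assumes the knowledge base
# is public, so the articles endpoint needs no authentication.
BASE_URL = "https://yourcompany.zendesk.com/api/v2/help_center/articles.json"

out_dir = pathlib.Path("kb_backup")
out_dir.mkdir(exist_ok=True)

url = BASE_URL
while url:
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    data = response.json()
    for article in data.get("articles", []):
        # Save each article's rendered HTML body plus its full metadata.
        (out_dir / f"{article['id']}.html").write_text(article.get("body") or "", encoding="utf-8")
        (out_dir / f"{article['id']}.json").write_text(json.dumps(article, indent=2), encoding="utf-8")
    url = data.get("next_page")  # None once the last page has been fetched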

[–] scrubbles@poptalk.scrubbles.tech 5 points 1 year ago (1 children)

Seconding the first bit. I'm a software dev, and it's generally understood that anything I write for a company is not mine. I can talk about it to future employers, and I can think about how I solved past problems, but what I wrote is the company's. I do not own it and I claim no ownership.

OP here is not only looking at disciplinary action but possibly legal action against them if they try to rip this stuff without written permission.

If you do it for a company, it's the company's property. Not yours.

[–] ApathyTree@lemmy.dbzer0.com 2 points 1 year ago (1 children)

Sorry I don’t really understand why I’d be potentially subject to disciplinary action for downloading something fully available to the public..?

Like I could give you the url right now and you could access it (for privacy, I won’t). It’s not sensitive information or anything.

[–] TootSweet@lemmy.world 3 points 1 year ago

Ah! Ok. From what you said, it sounded like it was something only available on an internal network. Like a corporate Wiki or some such. It's totally possible I just missed something in the original post that would have disabused me of that incorrect notion.

(Disclaimer: I am not a lawyer. I like to geek out on intellectual property law sometimes, but ultimately I don't really know what I'm talking about here. But...)

Even if it's publicly available, that probably doesn't make scraping a copy legal. (It's called "copyright" because it places restrictions on "copying," though it also places restrictions on other things like "public performance" and such.)

The knowledge bases being publicly available might make it less likely that they'll pursue you over anything. Maybe. But that depends on your boss's state of mind and such. One thought might be to do the scraping through Tor, but if you're doing it close to the time you're leaving the company, that may not really give you any significant amount of plausible deniability.

Now, the fact that it's available on the open internet may open up some other options. If the number of pages in question isn't too terribly large, you could request one by one that the Wayback Machine save the pages (the "Save Page Now" section at the bottom right). That would mean a) no learning curve, and b) it's not you copying things, so if there were any legal issues, it would most likely (again, IANAL) be Archive.org fighting them, not you. That archive would stay up for a long time. You wouldn't have a copy locally, but if you had it bookmarked, it'd stay out there on the internet in a frozen state for you to reference any time.
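
If clicking "Save Page Now" hundreds of times sounds tedious, that part can be scripted too. Here's a rough, untested sketch; the article URLs are placeholders, and it simply appends each target URL to the Wayback Machine's save endpoint.

import time

import requests

# Hypothetical list of public KB article URLs; replace with the real ones.
urls = [
    "https://yourcompany.zendesk.com/hc/en-us/articles/111111",
    "https://yourcompany.zendesk.com/hc/en-us/articles/222222",
]

for url in urls:
    # Requesting this endpoint asks the Wayback Machine to capture the page.
    response = requests.get("https://web.archive.org/save/" + url, timeout=120)
    print(url, response.status_code)
    time.sleep(5)  # go easy on the service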

Also, since it's public, if you did go the wget route you wouldn't have to deal with the cookies or anything, which makes things easier.

[–] ApathyTree@lemmy.dbzer0.com 1 points 1 year ago

Thank you for taking the time to write this out, I’ve done a bit of digging into it and will do a bit more when time permits.

These KBs are fully accessible to the public, so I'm not really that worried about using something from them that I couldn't otherwise get by saving the pages or using a document of links to send people there directly. I just want to ensure it continues to reflect my work and stays available to me (one of them will only be live for another five or so years).

[–] Shdwdrgn@mander.xyz 4 points 1 year ago (1 children)

If you have a Linux machine, wget can create a mirror copy of a website, but it does so by following the URLs, link by link. While you are still there, do you have the access to make a direct copy of the database and the back-end code that drives the website?

As an alternative that might help show the changes you have made: if these pages are exposed to the internet, you could check the Internet Archive, which will have snapshots from different dates that could provide a comparison of the changes.

[–] ApathyTree@lemmy.dbzer0.com 1 points 1 year ago (1 children)

This is all "cloud," so there's no real database, or at least I wouldn't know how to download or access it. There's a tool that would perform backups, but my company won't approve it... they don't understand that cloud means "someone else's computer." It would solve my problem, but I'm not permitted to install it. I'm the lowliest of grunts: customer service tech support. But also, "fix all our documentation plz and thx."

The pages are exposed to the internet, but I've looked on various internet archives and they don't come up; my company may block archiving. Even if they did, though, I have hundreds of pages (thousands if you count the archived stuff).

[–] Shdwdrgn@mander.xyz 2 points 1 year ago

Yeah I figured the solutions wouldn't be quite that easy since you were here asking, but sometimes it helps to get the simple ideas out of the way first so others can come along with better methods. Good luck on your search!

[–] nicoweio@lemmy.world 1 points 1 year ago (1 children)

I'm curious, since you mentioned it: What degree is it that qualifies you to edit knowledge bases?

[–] ApathyTree@lemmy.dbzer0.com 2 points 1 year ago

My degree is in technical communications. Making tech information accessible and coherent for broad audiences of end-users, mostly.