datahoarder

Who are we?

We are digital librarians. Among us are represented the various reasons to keep data -- legal requirements, competitive requirements, uncertainty of permanence of cloud services, distaste for transmitting your data externally (e.g. government or corporate espionage), cultural and familial archivists, internet collapse preppers, and people who do it themselves so they're sure it's done right. Everyone has their reasons for curating the data they have decided to keep (either forever or For A Damn Long Time). Along the way we have sought out like-minded individuals to exchange strategies, war stories, and cautionary tales of failures.

We are one. We are legion. And we're trying really hard not to forget.

-- 5-4-3-2-1-bang from this thread

founded 4 years ago

MODERATORS

archivist@lemmy.ml

Downloading all archive.org metadata (lemmy.dbzer0.com)

submitted 3 days ago* (last edited 3 days ago) by BermudaHighball@lemmy.dbzer0.com to c/datahoarder@lemmy.ml

I'd love to know if anyone's aware of a bulk metadata export feature or repository. I would like to have a copy of the metadata and .torrent files of all items.

I guess one way is to use the CLI, but that relies on knowing which item you want, and I don't know if there's a way to get a list of all items.

I believe downloading via BitTorrent and seeding back is a win-win: it bolsters the Archive's resilience while easing server strain. I'll be seeding the items I download.

Edit: If you want to enumerate all item names in the entire archive.org repository, take a look at https://archive.org/developers/changes.html. This will do that for you!

[–] thingsiplay@beehaw.org 2 points 3 days ago* (last edited 3 days ago) (1 children)

I just found out you can get all of an item's metadata with ia metadata ID > metadata.json (replace ID with the item's identifier, gamefaqs_txt in my example). From there you can extract any information you need, if you know how to handle JSON. (Edit: Just load metadata.json in your browser to see a better-formatted view.)
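As a rough illustration of "handling the JSON", here is a minimal Python sketch that reads the output of ia metadata and picks out the .torrent file names. It assumes the record has a top-level "files" list whose entries carry a "name" key, which is how the Metadata API output looks as far as I know; the function name torrent_files is made up for this example.

```python
import json


def torrent_files(metadata_json: str) -> list[str]:
    """Return the names of all .torrent files listed in one item's metadata.

    Assumes the record has a top-level "files" list of objects with a
    "name" key. A missing "files" key simply yields an empty list.
    """
    record = json.loads(metadata_json)
    return [
        entry["name"]
        for entry in record.get("files", [])
        if entry.get("name", "").endswith(".torrent")
    ]


if __name__ == "__main__":
    # metadata.json is the file written by: ia metadata ID > metadata.json
    with open("metadata.json", encoding="utf-8") as handle:
        for name in torrent_files(handle.read()):
            print(name)
```

The same approach works for any other field in the record; only the key names change.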

[–] drspod@lemmy.ml 2 points 3 days ago (1 children)

I like to pipe my json to python -m json.tool for quick formatting in the terminal.
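For anyone who wants the same formatting inside a script instead of a pipe, a small sketch using only the standard library follows. The helper name pretty is hypothetical, and the 4-space indent is chosen to mirror what json.tool prints by default.

```python
import json


def pretty(raw: str) -> str:
    """Re-serialize a JSON string with 4-space indentation."""
    return json.dumps(json.loads(raw), indent=4)


if __name__ == "__main__":
    print(pretty('{"identifier": "gamefaqs_txt", "files_count": 3}'))
```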

[–] CHKMRK@programming.dev 1 points 3 days ago

Take a look at jq; it's a really nice tool for handling JSON in the terminal. There's also gron for searching JSON.

[–] thingsiplay@beehaw.org 2 points 3 days ago (1 children)

I use the CLI tool; I'm even waiting for some downloads to finish right now. The CLI tool can actually give you a list of all files in an item with ia list {ID} (replace {ID} with the actual ID of the item you want to download). But you don't even need to list the files, because you can download with a glob, just like in your shell (for example *.torrent). Or, if you have the ID anyway, you can specify the filename directly as {ID}_archive.torrent.

Here is an example of how to do this with my own upload, https://archive.org/details/gamefaqs_txt, where the ID is gamefaqs_txt:

ia download gamefaqs_txt --glob "*.torrent"

Or use a variable to set the ID and download all files whose names start with the ID, which should be all the metadata files:

id=gamefaqs_txt ; ia download "${id}" --glob "${id}_*"
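To preview which files a pattern would pick up before downloading, here is a small Python sketch. It uses fnmatch from the standard library as a stand-in for shell-style globbing; that ia's --glob matches in exactly the same way is an assumption, and both the helper name matching_files and the file list below are illustrative only.

```python
from fnmatch import fnmatchcase


def matching_files(names: list[str], pattern: str) -> list[str]:
    """Return the names that match a shell-style glob, in their original order."""
    return [name for name in names if fnmatchcase(name, pattern)]


if __name__ == "__main__":
    # Illustrative file list, not the real contents of the item.
    names = [
        "gamefaqs_txt_archive.torrent",
        "gamefaqs_txt_meta.xml",
        "gamefaqs_txt_files.xml",
        "readme.txt",
    ]
    item_id = "gamefaqs_txt"
    print(matching_files(names, "*.torrent"))
    print(matching_files(names, f"{item_id}_*"))
```

The file names could come from ia list {ID}, one per line.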

[–] BermudaHighball@lemmy.dbzer0.com 2 points 3 days ago (1 children)

Thank you for the tips. I am actually interested in enumerating the metadata for every "item" ever uploaded, as the API page defines the term. For example, one item = one ID:

Archive.org is made up of “items”. An item is a logical “thing” that we represent on one web page on archive.org. An item can be considered as a group of files that deserve their own metadata.

You did cause me to look at the API docs again, though, and I think I found something that does enumerate all item names, and as a bonus, it will keep you updated when changes are made: https://archive.org/developers/changes.html

We'll see how much progress I can make. It might take a while to get through all the millions of them.
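My rough plan for walking the feed looks like the sketch below. Everything about the response shape is an assumption to check against the changes.html page: that each page holds a "changes" list of objects with an "identifier" key, and a "next_token" to pass to the following request. The fetch_page callable is hypothetical; it stands for whatever function actually performs the authenticated request, so that the loop itself can be tried without network access.

```python
from typing import Callable, Iterator, Optional


def iter_identifiers(
    fetch_page: Callable[[Optional[str]], dict],
) -> Iterator[str]:
    """Yield item identifiers page by page until the feed runs dry.

    fetch_page(token) is a hypothetical callable that returns one decoded
    JSON page; token is None for the first request. The key names
    "changes", "identifier" and "next_token" are assumptions.
    """
    token: Optional[str] = None
    while True:
        page = fetch_page(token)
        changes = page.get("changes", [])
        for change in changes:
            yield change["identifier"]
        next_token = page.get("next_token")
        # Stop on an empty page or when there is no token to continue with.
        if not changes or not next_token:
            break
        token = next_token


if __name__ == "__main__":
    # Fake pages for a dry run; a real fetch_page would call the API.
    fake_pages = {
        None: {"changes": [{"identifier": "item_a"}], "next_token": "t1"},
        "t1": {"changes": [], "next_token": "t1"},
    }
    print(sorted(set(iter_identifiers(lambda token: fake_pages[token]))))
```

Collecting into a set is deliberate: a change feed may well report the same identifier more than once, so deduplicating before fetching metadata seems prudent.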

[–] thingsiplay@beehaw.org 1 points 3 days ago (1 children)

Aren't "item" and "id" basically the same thing? Every item has a unique ID, so in my example gamefaqs_txt would be both the item and the ID.

[–] BermudaHighball@lemmy.dbzer0.com 1 points 3 days ago

Yes, I think so. I'll definitely use the example for downloading some of the files (.torrent, metadata file) once I have some items. But first I need to find all the items ever uploaded.