Digital Archivists: Protecting Public Data from Erasure
(spectrum.ieee.org) 184 points by rbanffy a day ago
Public data is incompatible with secrecy. Expunged records still appear in newspaper archives if the local reporter on the crime beat captured the proceedings. IMO, "expunged" means removed from official court records, not from the public memory, including newspapers, archived websites, police blotters and prosecutors' files.
Getting something removed from your criminal record doesn't mean it gets forgotten. Think about a newspaper writing about your crime. That will be public and archived forever.
There's a lot of panic and overlap in the space; a way to coordinate these efforts would be helpful.
Internet Archive et al. made noise and promises but told volunteers to stop because they couldn't actually handle the ingest.
https://www.reddit.com/r/Archiveteam/comments/1jbgycm/us_gov...
These folks made a notable effort.
https://webrecorder.net/blog/2025-03-25-govarchive-us-and-mi...
As someone who spent the last 2 days figuring out how best to digitise my father's old Hi8, Digital8 and MiniDV tapes, I take umbrage with this!
Keep originals if you can, but make copies ASAP, as close to lossless as possible. Don't depend on the right hardware being around in the future.
I can see the value in this, but .. originals, and the gear to read them, do not last forever. Plus for many formats the act of reading puts wear on the physical artifacts. So if you want to actually use the information, you have to format shift it to digital in the first place. And then you're back to the same question as the rest of us, how to maintain the bits.
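One common partial answer to "how to maintain the bits" is fixity checking: record a checksum for every file when you make the copy, then re-verify on a schedule and after every migration. Below is a minimal sketch, assuming a plain directory of files and SHA-256 as the hash. The function names (sha256_of, build_manifest, verify) are made up for illustration and are not from any particular archiving tool.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large video captures don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root):
    """Map each file's path (relative to root) to its SHA-256 hex digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root, manifest):
    """Return the relative paths that are missing or whose hash has changed."""
    root = Path(root)
    bad = []
    for rel, expected in manifest.items():
        p = root / rel
        if not p.is_file() or sha256_of(p) != expected:
            bad.append(rel)
    return bad
```

Store the manifest alongside each copy (and ideally somewhere else too). A non-empty result from verify tells you which files to restore from another copy; it detects damage but does not repair it.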
Guess: If properly stored (physically), good-quality paper documents and photographs will last for centuries. But as soon as you digitize them - you're now chained to the treadmill of maintaining/upgrading/migrating digital archiving systems. Compared to keeping the old-fashioned Archive Storage Room dry (and fire-free), that's 100X the labor and expense. Forever.
True.
But from fire-resistant storage cabinets, to concrete-lined file rooms, to underground archives, the tech to make archives ~99.5% fire-proof is more than a century old. And if you add redundant storage sites for the high-value stuff...
Whereas anything digital is far more vulnerable to digital malice.
Hypothetically:
- Government leader says they're nuking data
- Mad rush to back up data through other means
- Government leader declares they've 'transferred the cost of maintaining data out of government, thus making for a smaller, more efficient, government'
I hate everything about this.
I think many people are very much not OK with how the government handles data: https://news.ycombinator.com/item?id=43237352
How does this relate to dox?
Let’s say an individual posted identifying or incriminating information online, inadvertently or intentionally, in a public place.
Then a third party decides to store it, and possibly make it accessible to others.
If the original self-doxxing user then pulled the original dox, but was unable to scrub the rest, would that information still be considered public, or would it be private? Was it ever truly public? Or private, for that matter?
If you intentionally post something publicly, it's public. Full stop.
The tricky part is dealing with inadvertent or malicious (i.e. by some other party) posting of private information to a public space. That's really hard to deal with on multiple levels.
For one, the archives would retain the information and scrubbing it is effectively impossible.
Secondly, legitimate things which should remain public (i.e. were posted publicly, are of public interest, etc.) can be argued to have been inadvertently or maliciously posted. So you need some way to moderate and create rulings for each individual case, which quickly becomes untenable due to the sheer volume of information being posted and the inordinate amount of time required to investigate vs. post.
That's a really good question.
In my head, I'm imagining someone early in the morning posting a flyer up on a bulletin board downtown.
Throughout the day many folks walked by and took photos of the flyer with their cell phone.
At the end of the day, the original person came back and removed the flyer.
IMO, at the time that the folks took the photo of the flyer, that flyer was public information. It remains public information even after the flyer is removed[0].
This isn't a great analogy of mine, and has plenty of holes, but was interesting to me after I read your comment. I know it was in the context of doxxing, but I think it's pretty interesting philosophically.
I think something similar applies to photos taken of other people in public spaces. Both the person who took the photo and the subject of the photo are no longer in that physical public space, but the actions took place within that space.
I think something similar applies to digital "public spaces". But what does a public space even mean in the context of walled gardens[1], etc.
[0] You then run into the question of what happens if someone posts non-public information publicly.
[1] Are digital walled garden communities that different from physical communities that gate access, whether free or paid? Whether information shared within those contexts is public or private is an interesting thread as well.
I made this related submission[0] recently but it was flagged.
This stuff is very important to talk about so I hope that this submission by rbanffy isn't also flagged.
No it isn't. It's merely a cause du jour for data hoarders to justify their hobby in light of this Chicken Little hysteria.
30 years ago it was thought collecting every issue of magazines like TV Guide was important. No one even knows what that is anymore.
No one is ever going to look at 99% of this data. In the meantime, send more hard drives for my NAS!!
My wife takes thousands of photos every year; when my daughter was young she took even more.
When we were moving out of our apartment there was damage to a door hinge that we never noticed when we moved in but that had definitely been there from the onset of our two years of living in that apartment.
Guess what? I had a photo from the day after we moved in of that door hinge in a state of damage! Not because we took the photo for that intention, but because my daughter was playing in the hallway and my wife snapped a photo and it just happened to capture the damage. Saved me several hundreds of dollars in repair costs from my landlord.
You are right, 99% of the data will never be looked at. But do you know what the 1% is today? I'm guessing you don't.
Your example of personal family photos is in no way comparable to storing terabytes of essentially unindexed data about which one has no detailed knowledge, under the notion that the government is somehow lighting a match to everything, and they're going to save it.
The government doesn't delete anything. It might be moved or inaccessible to the public but that data is somewhere in perpetuity.
It's one of the most deranged larps I've ever seen, then they pat each other on the back on BlueSky, desperately wanting to be a part of something.
These people envision themselves as folk heroes when what they really need to do is go outside and touch grass.
Among the deleted data there was the police accountability database. You probably won't have to deal with thugs now feeling omnipotent and immune from prosecution because of this.
https://www.police1.com/federal-law-enforcement/national-law...
Typo that I can't correct anymore: that would be "won't want to deal".
It might be of some interest to cultural historians in the future. But I think it makes more sense to take sample+curated data. But in any case if we can afford it, eh why not.
We don't know now what to curate for the future. We should preserve as much of everything we can - we don't know what will be important in 50, or 500 years.
Case in point: retrocomputing is my hobby. I buy, restore, preserve, and use old computers. Most of them are home computers, because business computers go directly from the office to the recycling facility or the landfill. Unless someone deliberately preserved, say, a Burroughs B-25 desktop, or the similar from Data General, they are gone.
I think the data being discussed is quite a bit different than old TV Guides...
I was, believe it if you wish, thinking about old TV guides just this morning and wondering how one would even go about archiving those. Most of the stumbling blocks for taking apart the glued binding for scanning have been figured out, of course, but for any given week there may have been as many as 60 or 70 editions (one for each television market, I think). None of these have proper ISSNs as far as I'm aware, and other than the listings they can be visually indistinguishable. Then there is the challenge of finding those, and not knowing whether this or that edition is missing (from time to time, the company would create new editions for new regions, or fold old ones back into some other area), along with even parsing the content. Many of these TV shows aren't on themoviedb or thetvdb, and if the shows are, then there won't be episode listings (there were 6000 Donahue talk show episodes, after all). On top of all of that, you can't necessarily know what was on TV at a given time and day, with federal government preemptions, commercials, unreported last-minute rescheduling, etc.
But I can also see why people might want to keep more interesting data, like when the Federal Cheese-Sniffing Agency moved offices back in 1982 and they have meticulous records of the 483 filing cabinets that had to be moved from the original location to their new home in Furrytown, Pennsylvania.
I wonder if those would be useful in identifying the potential contents of specific Marion Stokes tapes (my understanding is that they're sorted, but are only labeled with channel and date/time and are being archived slowly): https://libwww.freelibrary.org/blog/post/5393
I’ve had the idea of recreating TV channels on my Plex server by using TV guide data from the late 90s and early 00s.
The insurmountable part of that project would be getting the guide data.
You don’t know what other people will want in the future
That's a great idea.
There are sites that stream old content with an old tube TV UI wrapped around the video frame, but they don't have all the commercials and they don't follow the old schedules like you suggest.
I've got a friend who has hoarded digitized copies of VHS recordings of old cartoons from that era complete with the commercials, so the content is definitely out there.
https://blog.archive.org/2023/10/20/celebrating-1-petabyte-o...
Though given the space in general and some of the people involved it all should be audited very carefully.
Many criminal records, petty or otherwise, are public record. Once archived, expunged or dismissed infractions never truly go away. A traffic violation or other petty misdemeanor from 20 years ago that has been expunged from the official record can still show up on a background check because companies archive public data. So, there is a flip side to this.