Comment by HocusLocus 2 days ago

I've been trying to extract historycommons.org from the Wayback Machine and it is an uphill battle, even to grab the ~198 pages it says it collected. Even back in the days after 9/11, when the site rose to prominence, I was shuddering at its dynamically served implementation. Those were the days of Java, and they loaded down the server side with CPU time when it would rather have been serving static items... from REAL directories. With REAL If-Modified-Since: support, virtual file attributes set from the combined database update times... a practice that seems to have gone by the wayside on the Internet completely.

Everything everywhere is now Last-Modified today, now, just for YOU! Even if it hasn't changed. Doesn't that make you happy? Do you have a PROBLEM with that??
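
For what it's worth, the behavior I'm pining for is simple to express. Here is a minimal sketch in Go of a dynamic page doing it right: the Last-Modified comes from the data's real update time (hard-coded below as a stand-in for the combined database update times), and an If-Modified-Since that is at least that new gets a 304 instead of a freshly spawned page. None of this is what historycommons.org actually ran; it's just the shape of the idea.

    package main

    import (
        "fmt"
        "net/http"
        "time"
    )

    // Stand-in for the newest update time among the database rows
    // that feed this page.
    var lastUpdated = time.Date(2010, 5, 1, 12, 0, 0, 0, time.UTC)

    func handler(w http.ResponseWriter, r *http.Request) {
        // If the client's copy is at least as new as the data,
        // answer 304 and skip the expensive page build entirely.
        if ims := r.Header.Get("If-Modified-Since"); ims != "" {
            if t, err := http.ParseTime(ims); err == nil && !lastUpdated.After(t) {
                w.WriteHeader(http.StatusNotModified)
                return
            }
        }
        w.Header().Set("Last-Modified", lastUpdated.UTC().Format(http.TimeFormat))
        fmt.Fprintln(w, "<html><body>page rendered from the database</body></html>")
    }

    func main() {
        http.HandleFunc("/", handler)
        http.ListenAndServe(":8080", nil)
    }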

Everything unique at the site was after the ?, and there was more than one way to get 'there', 'there' being anywhere.
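
The fix on the crawling side is the usual one: collapse the many query-string aliases for the same page into one canonical key, so each page is fetched and archived exactly once. A hypothetical sketch in Go (the parameter name "sessionid" and the URLs are made up, not the site's actual scheme):

    package main

    import (
        "fmt"
        "net/url"
    )

    // canonicalize drops noise parameters and re-encodes the query in
    // sorted order, so ?a=1&b=2 and ?b=2&a=1 become the same key.
    func canonicalize(raw string) (string, error) {
        u, err := url.Parse(raw)
        if err != nil {
            return "", err
        }
        q := u.Query()
        q.Del("sessionid")      // hypothetical session/tracking parameter
        u.RawQuery = q.Encode() // Encode() sorts keys
        u.Fragment = ""
        return u.String(), nil
    }

    func main() {
        seen := map[string]bool{}
        for _, raw := range []string{
            "http://example.org/context.jsp?item=a42&sessionid=xyz",
            "http://example.org/context.jsp?sessionid=abc&item=a42",
        } {
            key, err := canonicalize(raw)
            if err != nil {
                continue
            }
            if !seen[key] {
                seen[key] = true
                fmt.Println("fetch once:", key)
            }
        }
    }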

I suspect that many tried to whack the site (mirror it wholesale) and then finally gave up. I got a near-successful whack once after lots of experimenting, but said to myself then, "This thing will go away, and it's sad".

That treasure is not reliably archived.

Suggestion: Even if the whole site is spawned from a database, choose a view that presents everything once and only once, and present to the world a group of pages that completely divulge the content with slash separators only, /x/y/z/xxx.(html|jpg|etc), and no duplicative tangents even IF the crawler ignores everything after the ? ... and place the actual static items in a hierarchy. The most satisfying crawl is one where you can do this, knowing that the archive will be complete and relevant and there is no need to 'attack' the server side with process spawning.
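
To make that concrete, here is a rough sketch in Go of the "once and only once" export: walk the content (stubbed below as an in-memory slice standing in for the database), write each item to a single canonical /section/topic/slug.html path, and set the file's mtime from the record's update time so If-Modified-Since comes back for free. The record fields and layout are my own assumptions, not the site's.

    package main

    import (
        "os"
        "path/filepath"
        "time"
    )

    type record struct {
        Section, Topic, Slug string
        HTML                 string
        Updated              time.Time
    }

    func main() {
        // Stand-in for a "SELECT everything" view over the database.
        records := []record{
            {"timelines", "september-2001", "entry-001",
                "<html><body>...</body></html>",
                time.Date(2004, 3, 9, 0, 0, 0, 0, time.UTC)},
        }
        for _, rec := range records {
            // One canonical path per item, slash separators only.
            p := filepath.Join("site", rec.Section, rec.Topic, rec.Slug+".html")
            if err := os.MkdirAll(filepath.Dir(p), 0o755); err != nil {
                panic(err)
            }
            if err := os.WriteFile(p, []byte(rec.HTML), 0o644); err != nil {
                panic(err)
            }
            // REAL file attributes: mtime = the data's update time.
            if err := os.Chtimes(p, rec.Updated, rec.Updated); err != nil {
                panic(err)
            }
        }
    }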