Comment by geye1234 2 days ago

I've been trying to download various blogs, hosted on blogspot.com and wordpress.com, as well as a couple that now exist only on archive.org, using Linux CLI tools. I cannot make it work. Everything either misses the CSS, follows links to the wrong depth, stops arbitrarily, or has some other problem.
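For concreteness, the wget invocations I've been working from look roughly like this (a sketch rather than my exact command; example.blogspot.com is a placeholder, and --page-requisites and --level are the flags that are supposed to address the missing-CSS and link-depth problems):

    # sketch: mirror one blog, pulling in CSS/images and rewriting links for offline reading
    wget --recursive --level=inf \
         --page-requisites \
         --convert-links \
         --adjust-extension \
         --no-parent \
         --wait=1 --random-wait \
         https://example.blogspot.com/
    # if the CSS lives on a different host, --span-hosts plus
    # --domains=... would be needed on top of this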

If I had a couple of days to devote to it entirely, I think I could make it work, but I've only been able to work at it sporadically, even though it's cost me a ton of time cumulatively. I've tried wget, httrack, and a couple of other, more obscure tools -- all with various options and parameters, of course.
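The httrack attempts were along these lines (again a sketch; the output directory is just an example and the +filter pattern, which is meant to keep the crawl on the one blog, may need adjusting):

    # sketch: mirror the same blog with httrack, restricted to its own domain
    httrack "https://example.blogspot.com/" \
        -O ./example-mirror \
        "+*example.blogspot.com/*" \
        -v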

One issue is that blog content is duplicated -- you might get domainname.com/article/article.html, domainname.com/page/1, and domainname.com/2015/10/01, all of which contain the same links. Could there be some vicious circularity taking place, confusing the downloader about what it has done and what it has yet to do? I wouldn't have thought so, but static, non-blog sites are obviously much simpler than blogs.
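If the duplication is what trips the crawler up, one thing that might help is telling wget to skip the archive-index URLs entirely, e.g. something like this (a sketch; the pattern assumes posts live under /article/ rather than under /page/ or the date paths, so it would need adjusting per blog):

    # sketch: mirror the blog but reject /page/N and /YYYY/MM[/DD] archive indexes
    wget --mirror \
         --page-requisites --convert-links --adjust-extension \
         --reject-regex '/(page/[0-9]+|[0-9]{4}(/[0-9]{2}){1,2})/?$' \
         https://domainname.com/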

Anyway, is there a known, standardized way to download blogs? I haven't found one yet, but it seems like such a common use case. Does anybody have any advice?