Comment by 1vuio0pswjnm7 7 days ago
The proxy log contains the timestamps but not the titles
For the titles I could extract them from pcaps; I also have a running tcpdump capture that logs to a (daemontools) multilog directory
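As a rough illustration of the title-extraction step, here is a minimal Python sketch. It assumes the HTTP response body has already been reassembled from the capture into a bytes object (that part is not shown), and the names `TitleParser` and `extract_title` are made up for this example; it is not the tooling described in this comment.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Only the first <title> is of interest.
        if tag == "title" and not self.done:
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self.in_title:
            self.in_title = False
            self.done = True

    def handle_data(self, data):
        if self.in_title:
            self.parts.append(data)


def extract_title(body, encoding="utf-8"):
    """Return the page title from an HTML body (bytes), or None if absent.

    Assumption: `body` is an already-reassembled HTTP response body.
    """
    parser = TitleParser()
    parser.feed(body.decode(encoding, errors="replace"))
    parser.close()
    # Collapse runs of whitespace so the title fits on one line.
    title = " ".join("".join(parser.parts).split())
    return title or None
```

Something like `extract_title(body)` could then be paired with the timestamp and URL from the proxy log.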
The URL consumption might be different, and difficult to compare, for a number of reasons, e.g.,
I do not use a browser that sends automatic HTTP requests for resources like images, CSS files, JavaScript files, etc.
I do not use a browser that runs JavaScript so there are no XHR or other JavaScript-triggered requests
I do not use remote DNS, I use "curated" DNS data, so the URLs are only for resources at domains I specifically request
I use HTTP/1.1 pipelining so I have large numbers of URLs that are for resources from a single domain, for example DoH queries (I do not include these in the URL database)
Generally the proxy log is rather clean and excludes garbage requests that are being sent automatically; IME, use of a "modern" browser will fill a log with such garbage
The proxy's self-signed certificate blocks many potential requests from hardware with pre-installed software from so-called "tech" companies, e.g., Google, Apple, Microsoft, because the TLS connections fail
These attempted connections to the mothership are incessant; they would fill a proxy log with garbage URLs if they were accepted
All this makes it easier for me to keep a URL database; storing all those garbage URLs would make the database less useful
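To make the idea concrete, here is a minimal Python sketch of loading a proxy log into a URL database while skipping excluded hosts such as a DoH server. Everything specific in it is an assumption for illustration only: the log format (a timestamp, then a URL, separated by whitespace), the host name `doh.example.net`, the table layout, and the function name `load_urls`. It is not the actual setup described above.

```python
import sqlite3
from urllib.parse import urlsplit

# Hypothetical host whose URLs (e.g. DoH queries) are kept out of the database.
EXCLUDED_HOSTS = {"doh.example.net"}


def load_urls(lines, db, excluded_hosts=EXCLUDED_HOSTS):
    """Insert URLs from log lines into `db`; return the number of new rows.

    Assumed (hypothetical) log format: "<timestamp> <url>" per line.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS urls (url TEXT PRIMARY KEY, first_seen TEXT)"
    )
    added = 0
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip malformed lines
        timestamp, url = fields[0], fields[1]
        host = urlsplit(url).hostname
        if host is None or host in excluded_hosts:
            continue
        # INSERT OR IGNORE keeps the first timestamp seen for each URL.
        cur = db.execute(
            "INSERT OR IGNORE INTO urls (url, first_seen) VALUES (?, ?)",
            (url, timestamp),
        )
        added += cur.rowcount
    db.commit()
    return added
```

With a clean log, a filter this simple may be enough; a log full of automatic requests would need a much longer exclusion list.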