Comment by geokon
Big picture, why does everyone scrape the web?
Why doesn't one company do it and then resell the data? Is it a legal/liability issue? If you scrape, it's a legal grey area, but if you sell what you scrape, it's clearly copyright infringement?
My bet is that they believe https://commoncrawl.org isn't good enough and, precisely as you are suggesting, the "rest" is where their competitive advantage might stem from.