Comment by andai 2 days ago

1 reply

Has anyone taken a look at a random sample of web data? It's mostly crap. I was thinking of making my own search engine, knowledge database, etc., based on a random sample of web pages, but I found that almost all of them were drivel. Flame wars, asinine blog posts, and most of all, advertising. Forget spam, most of the legit pages are trying to sell something too!

The conclusion I arrived at was that making my own crawler actually is feasible (and given my goals, necessary!) because I'm only interested in a very, very small fraction of what's out there.

andai a day ago

The unspoken question here, of course, is "you wouldn't happen to have already done this for me?" ;)