Comment by kstrauser
For any given regex, an opponent can craft a string which is valid HTML but that the regex cannot parse. There are a million edge cases like:
<!—- Don't count <hr> this! -—> but do count <hr> this -->
and <!-- <!-- Ignore <ht> this --> but do count <hr> this —->
Now your regex has to include balanced comment markers. Solve thatYou need a context-free grammar to correctly parse HTML with its quoting rules, and escaping, and embedded scripts and CDATA, etc. etc. etc. I don't think any common regex libraries are as powerful as CFGs.
Basically, you can get pretty far with regexes, but it's provably (like in a rigorous compsci kinda way) impossible to correctly parse all valid HTML with only regular expressions.
HTML comments do not nest. The obvious tokenizer you can create with regular expressions is the correct one.