Comment by tiagod

Comment by tiagod 10 hours ago

I think even for single opening tags, as asked, there are impossible edge cases.

For example, this is perfectly valid XHTML:

    <a href="/" title="<a /> />"></a>

chungy 7 hours ago

No, that is not valid. The "<" and ">" characters in string values must always be escaped as &lt; and &gt;. The correct form would be:

    <a href="/" title="&lt;a /&gt; /&gt;"></a>
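
One way to check this is to hand both forms to an XML parser. The sketch below uses Python's standard `xml.etree.ElementTree`. The helper name `is_well_formed` is made up for illustration, and well-formedness as XML is only a proxy for XHTML validity, since no DTD or schema is checked. The XML rules forbid a literal "<" inside an attribute value, which is what trips the parser here.

```python
import xml.etree.ElementTree as ET

unescaped = '<a href="/" title="<a /> />"></a>'
escaped = '<a href="/" title="&lt;a /&gt; /&gt;"></a>'


def is_well_formed(markup: str) -> bool:
    """Return True if the XML parser accepts the markup (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(unescaped))            # False: literal '<' in an attribute value
print(is_well_formed(escaped))              # True
print(ET.fromstring(escaped).get("title"))  # <a /> />
```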

comex 9 hours ago

If you already know where the start of the opening tag is, then I think a regex is capable of finding the end of that same opening tag, even in cases like yours. In that sense, it’s possible to use a regex to parse a single tag. What’s not possible is finding opening tags within a larger fragment of HTML.
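
A minimal sketch of that idea in Python, assuming the start offset of the tag is already known. The pattern and the function name `end_of_opening_tag` are illustrative, not taken from any library. It skips over quoted attribute values, so a ">" inside quotes does not end the tag, and it makes no attempt to handle comments, scripts, or CDATA.

```python
import re

# From the '<' of an opening tag to its closing '>'. Quoted attribute values
# are consumed as a unit, so '<' and '>' inside quotes are ignored.
OPEN_TAG = re.compile(r'''<[A-Za-z][^\s/>]*(?:"[^"]*"|'[^']*'|[^'">])*>''')


def end_of_opening_tag(html: str, start: int) -> int:
    """Return the index just past the opening tag beginning at `start`, or -1."""
    m = OPEN_TAG.match(html, start)
    return m.end() if m else -1


sample = '<a href="/" title="<a /> />"></a>'
end = end_of_opening_tag(sample, 0)
print(sample[:end])  # <a href="/" title="<a /> />">
```

The three alternatives inside the repeated group each start with a different character, so the match does not backtrack badly on unterminated input; it simply fails and the function returns -1.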

kstrauser 8 hours ago

For any given regex, an opponent can craft a string which is valid HTML but that the regex cannot parse. There are a million edge cases like:

    <!-- Don't count <hr> this! --> but do count <hr> this -->

and

     but do count <hr> this -->

Now your regex has to include balanced comment markers. Solve that.

You need a context-free grammar to correctly parse HTML with its quoting rules, and escaping, and embedded scripts and CDATA, etc. etc. etc. I don't think any common regex libraries are as powerful as CFGs.

Basically, you can get pretty far with regexes, but it's provably (like in a rigorous compsci kinda way) impossible to correctly parse all valid HTML with only regular expressions.
- marcosdumay 6 hours ago
  
  HTML comments do not nest. The obvious tokenizer you can create with regular expressions is the correct one.
  
  kstrauser 5 hours ago
  
  If you're talking about tokenizers, then you're no longer parsing HTML with a regex. You're tokenizing it with a regex and processing it with an actual parser.
  
  marcosdumay 2 hours ago
  
  If you are talking about detecting tags, you (and the person asking that SO question) are talking about tokenization, and everybody (like the one making that famous answer) bringing parsing into the discussion is just being an asshole.
  
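
The tokenizer-versus-parser distinction in this exchange can be sketched in Python. This is a simplified illustration under stated assumptions, not a conforming HTML tokenizer: it ignores scripts, CDATA, doctype declarations, and the HTML rules for implied end tags. The names `TOKEN`, `tokenize`, `count_tags`, `is_balanced`, and the partial `VOID` set are all hypothetical. Because comments do not nest, a comment token runs from `<!--` to the first `-->`, which a regex can express directly. Checking that tags are properly nested needs a stack, which is the part a regular expression alone cannot do.

```python
import re

# Regex tokenizer: comments are tried first, then tags, then plain text.
TOKEN = re.compile(
    r"(?P<comment><!--.*?-->)"
    r'''|(?P<tag></?[A-Za-z][^\s/>]*(?:"[^"]*"|'[^']*'|[^'">])*>)'''
    r"|(?P<text>[^<]+|<)",
    re.DOTALL,
)

VOID = {"hr", "br", "img", "input", "meta", "link"}  # partial list, illustration only


def tokenize(html):
    """Yield (kind, text) pairs that together cover the whole input."""
    for m in TOKEN.finditer(html):
        yield m.lastgroup, m.group()


def tag_name(tag):
    """Extract the lowercased element name from a tag token."""
    return re.match(r"</?([A-Za-z][^\s/>]*)", tag).group(1).lower()


def count_tags(html, name):
    """Count opening tags called `name` that appear outside comments."""
    return sum(
        1
        for kind, text in tokenize(html)
        if kind == "tag" and not text.startswith("</") and tag_name(text) == name
    )


def is_balanced(html):
    """Stack-based nesting check over the token stream (the 'parser' step)."""
    stack = []
    for kind, text in tokenize(html):
        if kind != "tag":
            continue
        name = tag_name(text)
        if name in VOID or text.endswith("/>"):
            continue  # void or self-closing: nothing to balance
        if text.startswith("</"):
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack


example = "<!-- Don't count <hr> this! --> but do count <hr> this -->"
print(count_tags(example, "hr"))                # 1
print(is_balanced("<div><p>hi</p><hr></div>"))  # True
print(is_balanced("<div><p>hi</div></p>"))      # False
```

In the first example the trailing `-->` is ordinary text, since the comment already ended at the first `-->`.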