Comment by SoftTalker

Comment by SoftTalker 9 hours ago

12 replies

A big one that isn't listed is looking for stuff that isn't there.

Using != or NOT IN (...) is almost always going to be inefficient (but can be OK if other predicates have narrowed down the result set already).

Also, understand how your DB handles nulls. Are nulls and empty strings the same? Does null == null? Not all databases do this the same way.

magicalhippo 9 hours ago

> Also, understand how your DB handles nulls.

Also in regards to indexing. The DBs I've used have not indexed nulls, so a "WHERE col IS NULL" is inefficient even though "col" is indexed.

If that is the case and you really need it, have a computed column with a char(1) or bit indicating if "col" is NULL or not, and index that.

  • SoftTalker 9 hours ago

    NULL should generally never be used to "mean" anything.

    If your business rules say that "not applicable" or "no entry" is a value, store a value that indicates that, don't use NULL.

    • crazygringo 5 hours ago

      Not sure what you mean.

      If you have a table of customers and someone of them don't have addresses, it's standard to leave the address fields NULL. If some of them don't belong to a company, it's standard to leave the company_id field NULL.

      This is literally what NULL is for. It's a special value precisely because missing data or a N/A field is so common.

      If you're suggesting mandatory additional has_address and has_customer_id fields, I would disagree. You'd be reinventing a database tool that already exists precisely for that purpose.

      • MaxBarraclough 5 hours ago

        > This is literally what NULL is for. It's a special value precisely because missing data or a N/A field is so common.

        Kinda. You need null for outer joins, but you could have a relational DBMS that prohibits nullable columns in tables. Christopher Date thought that in properly normalised designs, tables should never use nullable columns. Codd disagreed. [0]

        > If you're suggesting mandatory additional has_address and has_customer_id fields, I would disagree. You'd be reinventing a database tool that already exists precisely for that purpose.

        The way to do it without using a nullable column is to introduce another table for the 'optional' data, and use a left outer join.

        [0] https://en.wikipedia.org/wiki/First_normal_form#Christopher_...

        • crazygringo 4 hours ago

          > The way to do it without using a nullable column

          I mean, you could, but having separate tables for every optional field would be an organizational and usability nightmare. Queries would be longer and slower for no good reason. Not to mention a gigantic waste of space with all those repeated primary keys and their indexes.

          And you could have databases that prohibited NULL values, but we mostly don't, because they're so useful.

      • SoftTalker 5 hours ago

        No null is fine if you don’t know or there’s literally no value. But don’t interpret a null phone number to mean the customer doesn’t have a phone number. You can’t infer anything from that, other than you don’t have it.

        • crazygringo 4 hours ago

          I'm not sure I agree.

          If I have a column for the ID of the customer's current active subscription, and that column is NULL, it seems perfectly fine to interpret that the customer has no active subscription.

          That is a valid inference. You don't need a separate has_active_subscription field.

          On the other hand, your phone number example is just common sense. The database doesn't represent the external world. The database just knows the customer didn't provide a phone number.

    • rplnt 8 hours ago

      Interesting, I don't think I've seen that while NULLs are very common.

      I guess you would handle it in the application and not in the query, right?

      • SoftTalker 7 hours ago

        I've seen it too, very often. But it's good if you can just keep NULL meaning NULL (i.e. "the absence of any value"), because otherwise you will eventually be surprised by behavior.

geysersam 9 hours ago

> Using != or NOT IN (...) is almost always going to be inefficient.

Why do you say that?

My understanding is that as long as the RHS of NOT IN is constant (in the sense that it doesn't depend on the row) the condition is basically a hash table lookup, which is typically efficient if the lookup table is not massive.

What's the more efficient alternative?

  • Sesse__ 9 hours ago

    I'm going to assume here that we're talking about a subquery here (SELECT * FROM t1 WHERE x NOT IN ( SELECT x FROM t2 )). If you're just talking about a static list, then the basic problem is the amount of data you get back. :-)

    The biggest problem with NOT IN is that it has very surprising NULL behavior: Due to the way it's defined, if there is any NULL in the joined-on columns, then _all_ rows must pass. If the column is non-nullable, then sure, you can convert it into an antijoin and optimize it together with the rest of the join tree. If not, it usually ends up being something more complicated.

    For this reason, NOT EXISTS should usually be preferred. The syntax sucks, but it's much easier to rewrite to antijoin.

  • SoftTalker 9 hours ago

    Because they can't use indexes.

    If I have a table of several million rows and I want to find rows "WHERE foo NOT IN ('A', 'B', 'C')" that's a full table scan, or possibly an index scan if foo is indexed, unless there are other conditions that narrow it down.