SQL Anti-Patterns

(datamethods.substack.com)

196 points by zekrom 9 hours ago

The single biggest thing that helped me speed up my queries and lower resource usage on the server was focusing on making my queries more sargable.

https://en.wikipedia.org/wiki/Sargable

https://www.brentozar.com/blitzcache/non-sargable-predicates...
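
For anyone who hasn't met the term: a rough sketch of sargable vs. non-sargable, assuming SQLite through Python's sqlite3 module. The `orders` table, its index and the dates are made up for illustration, and other engines word their plans differently.

```python
import sqlite3

# Hypothetical table and index, used only to illustrate the idea.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, created TEXT)")
conn.execute("CREATE INDEX idx_orders_created ON orders (created)")
conn.executemany(
    "INSERT INTO orders (created) VALUES (?)",
    [("2023-12-31",), ("2024-03-15",), ("2024-11-02",), ("2025-01-01",)],
)

def plan(sql):
    """Return the detail column of SQLite's EXPLAIN QUERY PLAN output."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + sql)]

# Non-sargable: the function wraps the indexed column, so every row is examined.
non_sargable = "SELECT id FROM orders WHERE substr(created, 1, 4) = '2024'"
# Sargable: the bare column is compared against a range, so the index can seek.
sargable = ("SELECT id FROM orders "
            "WHERE created >= '2024-01-01' AND created < '2025-01-01'")

non_sargable_plan = plan(non_sargable)
sargable_plan = plan(sargable)
print(non_sargable_plan)
print(sargable_plan)

# Both forms return the same rows; only the access path differs.
same_rows = (sorted(conn.execute(non_sargable).fetchall())
             == sorted(conn.execute(sargable).fetchall()))
```

In SQLite the first plan is reported as a SCAN and the second as a SEARCH on the index.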


EvanAnderson 7 hours ago

> Overusing DISTINCT to “Fix” Duplicates

Any time I see DISTINCT in a query I immediately become suspicious that the query author has an incomplete understanding of the data model, a lack of comprehension of set theory, or more likely both.


sigwinch28 6 hours ago

Or it’s simply an indicator of a schema that has not been excessively normalised (why create an addresses_cities table just to ensure no duplicate cities are ever written to the addresses table?)

- valiant55 4 hours ago
  
  It depends when you see it, but I agree that DISTINCT shouldn't be used in production. If I'm writing a one-off query and DISTINCT gets me over the finish line, sparing me a few minutes, then that's fine.
  
- echelon 2 hours ago
  
  DISTINCT, as well as the other aggregation functions, are fantastic for offline analytics queries. I find a lot of use for them in reporting, non-production code.
  
bts89 7 hours ago

That’s almost always my experience too.
Though fairly recently I learned that even with all the correct joins in place, sometimes adding a DISTINCT within a CTE can dramatically increase performance. I assume there are some optimizations the query planner can make when it’s been guaranteed record uniqueness.

dotancohen 6 hours ago

I've been told similar nasty things for adding LIMIT 1 to queries that I expect to return at most a single result, such as querying for an ID. But on large tables (at least in SQLite, MySQL, and maybe Postgres too) the database will continue to search the entire table after the given record was found.

- giovannibonetti an hour ago
  
  I've noticed that LIMIT 1 makes a huge difference when working with LATERAL JOINs in Postgres, even when the WHERE condition has a unique constraint.
  
- Guillaume86 4 hours ago
  
  Only if your table is missing a unique index on that column, which it should have to enforce your assumption, so yeah, LIMIT 1 is a code (or schema, in this case) smell.
  
  
  dotancohen 3 hours ago
  
  IDs are typically a unique primary key. But in my experience, adding LIMIT 1 would on average halve the time taken to retrieve the record.
  I'll test again, really the last time I tested that was two decades ago.
  
mcv 4 hours ago

It's the exact opposite in Cypher. I'm currently working with some complex data in neo4j, and wondered why my perfectly fine looking queries were so slow, until I remembered to use DISTINCT. It's very easy to get duplicate nodes in your results, especially when you use variable length relationships, and DISTINCT is the only fix I'm aware of that fixes that.

- dleeftink 4 hours ago
  
  Yeah, similarly combining DISTINCT with recursive CTEs in SQL can be the difference between an n×n blowout and a performant graph walk that only visits nodes once.
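
A small sketch of that difference, assuming SQLite via Python's sqlite3. It uses UNION (the deduplicating set operator) in the recursive CTE rather than a literal DISTINCT keyword, and the `edges` table is a made-up diamond-shaped graph.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE edges (src TEXT, dst TEXT)")
# Diamond-shaped graph: a -> b -> d -> e and a -> c -> d -> e.
conn.executemany(
    "INSERT INTO edges VALUES (?, ?)",
    [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("d", "e")],
)

WALK = """
WITH RECURSIVE reach(node) AS (
    SELECT 'a'
    {set_op}
    SELECT e.dst FROM edges AS e JOIN reach AS r ON e.src = r.node
)
SELECT node FROM reach
"""

# UNION ALL keeps every path, so 'd' and 'e' are each reached twice.
all_paths = [r[0] for r in conn.execute(WALK.format(set_op="UNION ALL"))]
# UNION discards rows already produced, so each node is expanded once.
visited_once = [r[0] for r in conn.execute(WALK.format(set_op="UNION"))]

print(len(all_paths), sorted(all_paths))
print(len(visited_once), sorted(visited_once))
```

On a graph with cycles the UNION ALL form would not terminate at all, which is why it is only run on an acyclic graph here.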
  
bandrami 7 hours ago

IDK, "which ZIP codes do we have customers in?" seems like a reasonable thing to want to know

- DavidWoof 3 hours ago
  
  In OP's defense, "becoming suspicious" doesn't mean it's always wrong. I would definitely suggest an explaining comment if someone is using DISTINCT in a multi-column query.
  
- mbb70 6 hours ago
  
  The very next ask will be "order the zipcodes by number of customers" at which point you'll be back to aggregations, which is where you should have started
  
  
  wvbdmp 6 hours ago
  
  Anti-Patterns You Should Avoid: overengineering for potential future requirements. Are there real-life cases where you should design with the future in mind? Yes. Are there real-life cases where DISTINCT is the best choice by whatever metric you prioritize at the time? Also yes.
  
  
  majormajor 5 hours ago
  
  Here we start to get close to analytics sql vs application sql, and I think that's a whole separate beast itself with different patterns and anti-patterns.
  
  
  bandrami an hour ago
  
  Ah, yeah, you beat me to it. I do reporting, not applications.
  
  
  bandrami an hour ago
  
  I do reporting, not application development. If somebody wants to know different information I'd write a different query.
  
  
  kristjansson 6 hours ago
  
  Whole seconds will have been wasted!
  
  
  sql_nitpicker 6 hours ago
  
  distinct seems like an aggregation to me
  
  
  
  edoceo 6 hours ago
  
  count(id) group by post_code order by 1
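
Spelled out as a runnable sketch (SQLite via Python's sqlite3; the `customers` table and its rows are invented for the example), the two questions from this subthread look like this:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, post_code TEXT)")
conn.executemany(
    "INSERT INTO customers (post_code) VALUES (?)",
    [("10001",), ("10001",), ("94105",), ("10001",), ("60601",), ("94105",)],
)

# "Which ZIP codes do we have customers in?"
distinct_codes = [r[0] for r in conn.execute(
    "SELECT DISTINCT post_code FROM customers ORDER BY post_code")]

# "Order the ZIP codes by number of customers."
codes_by_count = conn.execute(
    "SELECT post_code, count(id) AS n FROM customers "
    "GROUP BY post_code ORDER BY n DESC, post_code"
).fetchall()

print(distinct_codes)
print(codes_by_count)
```

Both are legitimate queries; the GROUP BY form simply carries the count along with each code.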
  
jmull 6 hours ago

I'd be wary of overgeneralizing on that. I guess it depends on whose queries you're usually reading.

- RHSeeger 6 hours ago
  
  I think you're reading more into what was said than is really there
  > I immediately become suspicious
  All I read from that is, when DISTINCT is used, it's worth taking a look to make sure the person in question understands the data/query; and isn't just "fixing" a broken query with it. That doesn't mean it's wrong, but it's a "smell", a "flag" saying pay attention.
  
ch2026 3 hours ago

Or maybe they’re on OLAP not OLTP.

dragonwriter 6 hours ago

In my experience, it's nearly as often a problem with the design of the database as with the query author.

9rx 5 hours ago

Or believe more in Codd’s relational model than SQL’s tabulational model.

ryandv 6 hours ago

Set theory...
There are self-identifying "senior software engineers" that cannot understand what even an XOR is, even after you draw out the entire truth table, all four rows.

- BuyMyBitcoins 6 hours ago
  
  I am surprised at how common it is for software engineers to not treat booleans properly. I can’t tell you how many times I’ve seen ‘if(IsFoo(X) != false)’
  It never used to bug me as a junior dev, but once a peer pointed this out it became impossible for me to ignore.
  
  
  furyofantares 3 hours ago
  
  The most egregious one I saw, I was tracking down a bug and found code like this:
      bool x;
      ...
      if (x == true) { DoThing1(); } else if (x == false) { DoThing2(); }
  And of course neither branch was hit, because this is C, and the uninitialized x was neither 0 nor 1, but some other random value.
  
  
  tomjakubowski 2 hours ago
  
  Sometimes this kind of thing happens after a few revisions of code, where in earlier versions the structure of the code made more sense: maybe several conditions which were tested and then, due to changing requirements, they coalesced into something which now reads as nonsense.
  When making a code change which touches a lot of places, it's not always obvious to "zoom out" and read the surrounding context to see if the structure of the code can be updated. The developer may be chewing through a grep list of a few dozen locations that need to be changed.
  
  
  munchlax 4 hours ago
  
  People do that? This hurts my brain. if(IsFoo(X)) is clear and readable.
  
  
  catlifeonmars 4 hours ago
  
  Clearly the correct spelling is
  `if(X&IsFooMask != 0)`
  :)
  
- hyperman1 5 hours ago
  
  I've spent a lot of time not seeing how xor is just the 'not equals' operator for booleans.
  
- layer8 5 hours ago
  
  Or, for a boolean type, that XOR is the same as the inequality operator.
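
The entire truth table, all four rows, as a quick check in Python (chosen only because it is easy to run; nothing here is SQL-specific):

```python
from itertools import product

# All four rows of the truth table: XOR on booleans matches "not equal".
truth_table = [(a, b, a ^ b, a != b) for a, b in product([False, True], repeat=2)]
for a, b, xor, neq in truth_table:
    print(a, b, xor, neq)

xor_is_inequality = all(xor == neq for _, _, xor, neq in truth_table)
```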
  
  
  avalys 4 hours ago
  
  Maybe it’s confusing because it’s misnamed?
  
- catlifeonmars 4 hours ago
  
  XOR is for key splitting.
  
ryandv 6 hours ago

PostgreSQL's `DISTINCT ON` extension is useful for navigating bitemporal data in which I want, for example, the latest recorded version of an entry, for each day of the year.
There are few other legitimate use cases of the regular `DISTINCT` that I have seen, other than the typical one-off `SELECT DISTINCT(foo) FROM bar`.

- dotancohen 6 hours ago
  
  Without DISTINCT ON (which I've never used) you can use a window function via the OVER clause with PARTITION BY. I'm pretty sure that's standard SQL.
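
A minimal sketch of the window-function version, assuming SQLite 3.25 or newer (the first release with window functions) via Python's sqlite3. The `entries` table, with a `day` the entry is about and a `recorded_at` timestamp for when it was written, is a hypothetical stand-in for the bitemporal data mentioned above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE entries (day TEXT, recorded_at TEXT, value INTEGER)")
conn.executemany(
    "INSERT INTO entries VALUES (?, ?, ?)",
    [
        ("2024-01-01", "2024-01-01T09:00", 10),
        ("2024-01-01", "2024-01-03T12:00", 11),  # later correction for Jan 1
        ("2024-01-02", "2024-01-02T09:00", 20),
    ],
)

# Number the versions of each day from newest to oldest, then keep number 1.
latest_per_day = conn.execute("""
    SELECT day, value
    FROM (
        SELECT day, value,
               ROW_NUMBER() OVER (
                   PARTITION BY day ORDER BY recorded_at DESC
               ) AS rn
        FROM entries
    ) AS ranked
    WHERE rn = 1
    ORDER BY day
""").fetchall()

print(latest_per_day)
```

The ORDER BY inside OVER needs to be deterministic; if two versions share a `recorded_at`, which one wins is up to the engine.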
  
  
  ryandv 6 hours ago
  
  Yes, this is the implementation I have seen in other dialects.
  
Sesse__ 7 hours ago

Or just doesn't know how to do semijoins in SQL, since they don't follow the same syntax as normal joins for whatever historical reason.

wvbdmp 7 hours ago

Eh, sometimes you need a quick fix and it’s just extremely concise and readable. I’ll take an INNER JOIN over EXISTS (nice but insanely verbose) or CROSS APPLY (nice but slow) almost every time. Obviously you have to know what you’re dealing with, and I’m mostly talking about reporting, not perf critical application code.
Distinct is also easily explained to users, who are probably familiar with Excel’s “remove duplicate rows”.
It can also be great for exploring unfamiliar databases. I ask applicants to find stuff in a database they would never see by scrolling, and you’d be surprised how many don’t find it.

- Sesse__ 6 hours ago
  
  The less verbose way of doing semijoins is by an IN subquery.
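
To make the comparison concrete, a sketch with SQLite via Python's sqlite3; the `customers` and `orders` tables are invented for the example. It shows the join fan-out that DISTINCT is usually papering over, next to the two semijoin spellings.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER);
    INSERT INTO customers VALUES (1, 'Ada'), (2, 'Bob'), (3, 'Cy');
    INSERT INTO orders (customer_id) VALUES (1), (1), (1), (2);
""")

def names(sql):
    """Run a query and return its first column as a list."""
    return [row[0] for row in conn.execute(sql)]

# Inner join: one output row per matching order, so Ada appears three times.
joined = names(
    "SELECT c.name FROM customers c JOIN orders o ON o.customer_id = c.id "
    "ORDER BY c.name")
# DISTINCT hides the fan-out after the fact.
joined_distinct = names(
    "SELECT DISTINCT c.name FROM customers c "
    "JOIN orders o ON o.customer_id = c.id ORDER BY c.name")
# Semijoin via IN: each customer is tested once, so no duplicates arise.
semi_in = names(
    "SELECT name FROM customers "
    "WHERE id IN (SELECT customer_id FROM orders) ORDER BY name")
# Semijoin via EXISTS: same result, more verbose.
semi_exists = names(
    "SELECT name FROM customers c WHERE EXISTS "
    "(SELECT 1 FROM orders o WHERE o.customer_id = c.id) ORDER BY name")

print(joined, joined_distinct, semi_in, semi_exists)
```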
  
  
  wvbdmp 6 hours ago
  
  >subquery
  >less verbose
  Well…
  In any case, it depends. OP nicely guarded himself by writing “overusing”, so at that point his pro-tip is just a tautology and we are in agreement: not every use of DISTINCT is an immediate smell.
  
leptons 7 hours ago

And that's okay. Not every developer knows every single thing there is to know about every single tech. Sometimes you just need a solution, and someone with more specific knowledge can optimize later. How many non-database related mistakes would you make if you had to build every part of a system yourself?

- pessimizer 6 hours ago
  
  But what if they don't know that they need your approval not to know things?
  

petalmind 6 hours ago

> Overusing DISTINCT to “Fix” Duplicates

I wrote a small tutorial (~9000 words in two parts) on how to design complicated queries so that they don't need DISTINCT and are basically correct by construction.

https://kb.databasedesignbook.com/posts/systematic-design-of...


joshmn 6 hours ago

Nice articles in there. Bookmarked.
Edit: it’s also actually a book!


SoftTalker 6 hours ago

A big one that isn't listed is looking for stuff that isn't there.

Using != or NOT IN (...) is almost always going to be inefficient (but can be OK if other predicates have narrowed down the result set already).

Also, understand how your DB handles nulls. Are nulls and empty strings the same? Does null == null? Not all databases do this the same way.
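
A quick way to answer those questions for one particular engine, here SQLite through Python's sqlite3. The results are specific to SQLite; Oracle, for instance, treats an empty string as NULL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def scalar(expr):
    """Evaluate a single SQL expression and return its value."""
    return conn.execute("SELECT " + expr).fetchone()[0]

null_equals_null = scalar("NULL = NULL")   # None: the comparison is unknown
null_is_null = scalar("NULL IS NULL")      # 1: IS NULL is the reliable test
empty_is_null = scalar("'' IS NULL")       # 0 in SQLite: '' is a real value

print(null_equals_null, null_is_null, empty_is_null)
```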


geysersam 6 hours ago

> Using != or NOT IN (...) is almost always going to be inefficient.
Why do you say that?
My understanding is that as long as the RHS of NOT IN is constant (in the sense that it doesn't depend on the row) the condition is basically a hash table lookup, which is typically efficient if the lookup table is not massive.
What's the more efficient alternative?

- Sesse__ 5 hours ago
  
  I'm going to assume that we're talking about a subquery here (SELECT * FROM t1 WHERE x NOT IN ( SELECT x FROM t2 )). If you're just talking about a static list, then the basic problem is the amount of data you get back. :-)
  The biggest problem with NOT IN is that it has very surprising NULL behavior: due to the way it's defined, if there is any NULL in the subquery's column, then the predicate can never evaluate to true and _no_ rows pass. If the column is non-nullable, then sure, you can convert it into an antijoin and optimize it together with the rest of the join tree. If not, it usually ends up being something more complicated.
  For this reason, NOT EXISTS should usually be preferred. The syntax sucks, but it's much easier to rewrite to antijoin.
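
A minimal sketch of the surprise, using the same t1/t2 names as the query above and SQLite via Python's sqlite3 (the rows are made up). A single NULL on the subquery side makes NOT IN return no rows, while NOT EXISTS behaves like the antijoin most people have in mind.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (x INTEGER);
    CREATE TABLE t2 (x INTEGER);
    INSERT INTO t1 VALUES (1), (2), (3);
    INSERT INTO t2 VALUES (1), (NULL);
""")

# 2 NOT IN (1, NULL) evaluates to NULL, not TRUE, so the row is filtered out.
not_in = [r[0] for r in conn.execute(
    "SELECT x FROM t1 WHERE x NOT IN (SELECT x FROM t2) ORDER BY x")]
# NOT EXISTS only asks whether a matching row exists; NULL never matches.
not_exists = [r[0] for r in conn.execute(
    "SELECT x FROM t1 WHERE NOT EXISTS "
    "(SELECT 1 FROM t2 WHERE t2.x = t1.x) ORDER BY x")]

print(not_in)
print(not_exists)
```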
  
- SoftTalker 5 hours ago
  
  Because they can't use indexes.
  If I have a table of several million rows and I want to find rows "WHERE foo NOT IN ('A', 'B', 'C')" that's a full table scan, or possibly an index scan if foo is indexed, unless there are other conditions that narrow it down.
  
magicalhippo 6 hours ago

> Also, understand how your DB handles nulls.
Also in regards to indexing. The DBs I've used have not indexed nulls, so a "WHERE col IS NULL" is inefficient even though "col" is indexed.
If that is the case and you really need it, have a computed column with a char(1) or bit indicating if "col" is NULL or not, and index that.

- SoftTalker 6 hours ago
  
  NULL should generally never be used to "mean" anything.
  If your business rules say that "not applicable" or "no entry" is a value, store a value that indicates that, don't use NULL.
  
  
  crazygringo 2 hours ago
  
  Not sure what you mean.
  If you have a table of customers and some of them don't have addresses, it's standard to leave the address fields NULL. If some of them don't belong to a company, it's standard to leave the company_id field NULL.
  This is literally what NULL is for. It's a special value precisely because missing data or a N/A field is so common.
  If you're suggesting mandatory additional has_address and has_customer_id fields, I would disagree. You'd be reinventing a database tool that already exists precisely for that purpose.
  
  
  rplnt 5 hours ago
  
  Interesting, I don't think I've seen that while NULLs are very common.
  I guess you would handle it in the application and not in the query, right?
  
  
  SoftTalker 4 hours ago
  
  I've seen it too, very often. But it's good if you can just keep NULL meaning NULL (i.e. "the absence of any value"), because otherwise you will eventually be surprised by behavior.
  

btilly 6 hours ago

The biggest SQL antipattern is failing to recognize that SQL is actually a programming language.

Therefore you should create a consistent indentation style for SQL. See https://bentilly.blogspot.com/2011/02/sql-formatting-style.h... for mine. Second, you should try to group logical things together. This is why people should move subqueries into common table expressions. And finally, don't be afraid of commenting wisely.


xyzzy_plugh 5 hours ago

Style opinions are borderline irrelevant without appropriate linters.

- javcasas 4 hours ago
  
  Go and use Google BigQuery auto-formatter in a complex query with CASE and EXTRACT YEAR FROM date, and you will have a totally different opinion.
  How that auto-formatter indents is borderline a hate crime. A thousand times better to indent manually.
  
  
  OscarCunningham 3 hours ago
  
  I've even seen the BigQuery formatter change the behaviour of a query, by mixing a keyword from a comment into the real code.
  

thehours 4 hours ago

> Mishandling Excessive Case When Statements

User Defined Functions (UDFs) are another option to consolidate the logic in one place.

> Using Functions on Indexed Columns

In other words, the query is not sargable [0]

> Overusing DISTINCT to “Fix” Duplicates

Orthogonal to author's point about dealing with fanout from joins, I'm a fan of using something like this for 'de-duping' records that aren't exact matches in order to conform the output to the table grain:

    ROW_NUMBER() OVER (PARTITION BY <grain> ORDER BY <deterministic sort>) = 1

Some database engines have QUALIFY [1], which lends itself to a fairly clean query.

[0] https://en.wikipedia.org/wiki/Sargable

[1] https://docs.aws.amazon.com/redshift/latest/dg/r_QUALIFY_cla...


andersmurphy 4 hours ago

Non-sargability is easy to solve with expression indexes. At least in SQLite.


aerzen 5 hours ago

These "anti-patterns" are just workarounds for bad language design of SQL (or lack of design actually). I'm working on a language that can run on SQL databases, so I hope it will do better with every one of these points.

If anyone wants to check out a half-done lang with lacking documentation, I'd be happy to read your feedback: https://lutra-lang.org


mkeedlinger an hour ago

Hey, this looks really cool! Best wishes and I’ll try to watch out for when this is more ready


wmonk 8 hours ago

The section on using functions on indexed columns could do with a more explicit and deeper explanation. When you wrap the indexed column in a function, the query becomes a full scan of the data instead, as the query runner has to run the function on every row, effectively removing any benefit of the index.

Unfortunately I learned this the hard way!


tremon 7 hours ago

The given solution (create an indexed UPPER(name) column) is not the best way to solve this, at least not on MS SQL Server. Not sure if this is equally supported in other databases, but the better solution is to create a case-insensitive computed column:

    ALTER TABLE example ADD name_ci AS name COLLATE Latin1_General_CI_AS;

(season to taste)

- layer8 5 hours ago
  
  It depends on the database system, but for systems that support functional indexes, you can create an index using the same function expression that you use in the query, and the query optimizer will recognize that they match up and use the index.
  For example, you define an index on UPPER(name_column), and in your query you can use WHERE UPPER(name_to_search_for) = UPPER(name_column), and it will use the index.
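
A hedged sketch of that matching, using SQLite (which supports indexes on expressions) via Python's sqlite3. The `people` table and index name are hypothetical, and the plan wording is SQLite's own.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO people (name) VALUES (?)",
                 [("Alice",), ("alice",), ("Bob",)])

QUERY = "SELECT id FROM people WHERE upper(name) = 'ALICE'"

def plan():
    """Return the detail column of EXPLAIN QUERY PLAN for QUERY."""
    return [row[3] for row in conn.execute("EXPLAIN QUERY PLAN " + QUERY)]

plan_before = plan()
# Index the same expression that the WHERE clause uses.
conn.execute("CREATE INDEX idx_people_upper_name ON people (upper(name))")
plan_after = plan()

matches = [r[0] for r in conn.execute(QUERY + " ORDER BY id")]

print(plan_before)
print(plan_after)
print(matches)
```

The expression in the query has to match the indexed expression for the optimizer to pick the index up.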
  
LikesPwsh 8 hours ago

Some well-known docs on the topic: https://use-the-index-luke.com/sql/where-clause/obfuscation

crazygringo 6 hours ago

The blog has a typo. The first line needs to have the text in uppercase:
> query WHERE name = ‘ABC’
> create an indexed UPPER(name) column
The point is that the index itself is already on the data with the function applied. So it's not a full scan, the way the original query was.
Of course, in this particular example you just want to use a case-insensitive collation to begin with. But the general concept is valid.

readthenotes1 8 hours ago

"Unfortunately I learned this the hard way!" ... Seems to be the motto of SQL developers.
Otoh, it seems a fairly stable language (family of dialects?) so finding the pitfalls has long leverage


ddxv 2 hours ago

I've built myself a few problems that I haven't fixed yet:

Many materialized views that rely on other materialized views. When one at the bottom, or a table, needs a change, all the views need to be dropped and recreated.

Using a warm standby for production. I love having a read-only production database, but since it's not the primary, it always feels like it's on the losing end of the system. Recently upgraded to Postgres 18 and forgot that means I need to rm -rf the standby and pg_basebackup to rebuild... That wasn't fun.


echelon 2 hours ago

I'd like to call views, triggers, and integrity constraints antipatterns.
Your code should handle the data model and never allow bad states to enter the database.
There's too much performance loss and too many footguns from these "features".


seanhunter 2 hours ago

I can’t take any article like this seriously if it doesn’t lead with the #1 sql antipattern which kills performance all the time - doing things row-by-row instead of understanding that databases operate on relations, so you need to do operations over whole relations.

Very often I have seen this problem buried in code design and it always sucks. Sometimes an orm obscures this but the basic antipattern looks like

   Select some stuff
   For each row in stuff:
      … do some important things …
      Select a thing to do with this row
      … maybe do some other things …

Early on in my career an old-hand sql guru said to me “any time you are doing sql in a loop, you are probably doing it wrong”.

The non-sucky version of the code above is

   Select some stuff, joining on all the things you need for the rows because databases are great
   For each row in stuff:
      … do some important things …
      … maybe do some other things …
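
The two shapes above, as a runnable sketch with SQLite via Python's sqlite3. The `authors` and `books` tables and the `run` helper are invented for the example; the helper only exists to count round trips.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE authors (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE books (id INTEGER PRIMARY KEY, author_id INTEGER, title TEXT);
    INSERT INTO authors VALUES (1, 'Ann'), (2, 'Ben');
    INSERT INTO books (author_id, title) VALUES (1, 'First'), (2, 'Second');
""")

query_count = 0

def run(sql, params=()):
    """Execute a query, counting how many statements were sent."""
    global query_count
    query_count += 1
    return conn.execute(sql, params).fetchall()

# Anti-pattern: one query for the list, then one more query per row.
loop_result = []
for book_id, author_id, title in run(
        "SELECT id, author_id, title FROM books ORDER BY id"):
    (name,) = run("SELECT name FROM authors WHERE id = ?", (author_id,))[0]
    loop_result.append((title, name))
loop_queries = query_count

# Set-based version: let the database join, then loop over finished rows.
query_count = 0
join_result = run(
    "SELECT b.title, a.name FROM books b "
    "JOIN authors a ON a.id = b.author_id ORDER BY b.id")
join_queries = query_count

print(loop_queries, join_queries)
```

With N books the loop sends 1 + N statements; the join sends one regardless of N.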


dgb23 7 hours ago

If "select *" breaks your code, then there's something wrong with your code. I think Rich Hickey talked about this. Providing more than is needed should never be a breaking change.

Certain languages, formats and tools do this correctly by default. For the others you need a source of truth that you generate from.


sql_nitpicker 7 hours ago

I don't see anything wrong with what the article is saying. If you have a view over a join of A and B, and the view uses "select *", then what is gonna happen when A adds a column with the same name as a column in B?
In sqlite, the view definition will be automatically expanded and one of the columns in the output will automatically be distinguished with an alias. Which column name changes is dependent on the order of tables in the join. This can absolutely break code.
In postgres, the view columns are qualified at definition time so nothing changes immediately. But when the view definition gets updated you will get a failure in the DDL.
In any system, a large column can be added to one of the constituent tables and cause a performance problem. The best advice is to avoid these problems and never use "select *" in production code.

rileymat2 7 hours ago

The reasoning is in the article, and true.
> Schema evolution can break your view, which can have downstream effects
Select * is the problem itself in the face of schema evolution and things like name collision.

0xbadcafebee 6 hours ago

`select *` is bad for many reasons, but the biggest is that the "contract" your code has with the remote data store isn't immutable. The database can change, for many different reasons, independent of your code. If you want to write reliable code, you need to make as few assumptions as possible. One of those assumptions is what the remote schema is.

- hvb2 5 hours ago
  
  Sure, but columns can change data types too, which 'select columns' doesn't protect you from either.
  
  
  layer8 5 hours ago
  
  A column changing its data type is generally considered a breaking change for the schema (for obvious reasons), while adding more columns isn’t. Backwards-compatible schema evolution isn’t practical without the latter; otherwise you’d have to add a new secondary table whenever you want to add more columns.
  This mirrors how adding additional fields to an object type in a programming language usually isn’t considered a breaking change, but changing the type of an existing field is.
  
tremon 7 hours ago

If you have select * in your code, there already is something wrong with your code, whether it breaks or not: the performance and possibly output of your code is now dependent on the table definition. I'm pretty sure Rich Hickey has also talked about the importance of avoiding non-local dependencies and effects in your code.

- onli 7 hours ago
  
  The performance and partly the output of the code is always dependent on the table definition. * instead of column names just removes an output limiter, which can be useful or can be irrelevant, depending on the context.
  Though sure, known to negatively affect performance, I think in some database systems more than in others?
  

jasonpbecker 8 hours ago

We did the views-on-views thing once when triggers, at least how we implemented them, failed. This became a huge regret that we lived with for years and not-so-affectionately called "view mountain". We finally slayed view mountain over the last 2 years and it feels so good.


BrenBarn 3 hours ago

> SQL is one of those languages that looks simple on the surface but grows in complexity as teams and systems scale.

The funny thing is it's actually several of those languages. :-)


anthonyIPH 8 hours ago

"Instead you should:

query WHERE name = ‘abc’

create an indexed UPPER(name) column"

Should there be an "or" between these 2 points, or am I missing something? Why create an UPPER index column and not use it?


karmakaze 7 hours ago

[and a third] OR use a case-insensitive collation for the name column.

MiscCompFacts 3 hours ago

I think they reversed the 2 expressions. You should use “WHERE UPPER(name) = ‘ABC’” if you want to use the index.


skybrian 7 hours ago

> three or four layers of subqueries, each one filtering or aggregating the results of the previous one, totaling over 5000 lines of code

In a better language, this would be a pipeline. Pipelines are conceptually simple but annoying to debug, compared to putting intermediate results in a variable or file. Are there any debuggers that let you look at intermediate results of pipelines without modifying the code?


tremon 7 hours ago

This is not a pipeline in the control flow sense; the full query is compiled into a single processing statement, and the query compiler is free to remove and/or reorder any of the subqueries as it sees fit. The intermediate results during query execution (e.g. temp table spools) do not follow the structure of the original query, as CTEs and subqueries are not execution boundaries. It's more accurate to compare this to a C compiler that performs aggressive link-time optimization, including new rounds of copy elision, loop unrolling and dead code elimination.
If you want to build a pipeline and store each intermediate result, most tooling will make that easy for you. E.g. in dbt, just put each subquery in its separate file, and the processing engine will correctly schedule each subresult after the other. Just make sure you have enough storage available, it's not uncommon for intermediate results to be hundreds of times larger than the end result (e.g. when you perform a full table join in the first CTE, and do target filtering in another).

- skybrian 4 hours ago
  
  Sure, a sufficiently smart compiler can do what it wants, but it's often conceptually a pipeline and could be implemented as one in debug mode, without having to rewrite the code. Not in production, though, since you don't want to store stuff in temporary files when you're not debugging them.
  In some languages, a series of assignments and a large expression will often compile to the same thing, but if written as assignments, it will make it easier to set breakpoints.
  

chongli 8 hours ago

When working with larger enterprise software, it is common to have large CASE WHEN statements translating application status codes into plain English. For example, status code 1 could mean the item is out of stock.

Why wouldn’t you store this information in a table and query it when you need it? What if you need to support other languages? With a table you can just add more columns for more languages!
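
A sketch of what that table could look like, assuming SQLite via Python's sqlite3. Only "status code 1 means out of stock" comes from the article; the table names, the second status and the French labels are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE items (id INTEGER PRIMARY KEY, status_code INTEGER);
    -- One row per status code, one column per language.
    CREATE TABLE status_labels (
        status_code INTEGER PRIMARY KEY,
        label_en TEXT,
        label_fr TEXT
    );
    INSERT INTO items VALUES (1, 1), (2, 2), (3, 1);
    INSERT INTO status_labels VALUES
        (1, 'Out of stock', 'En rupture de stock'),
        (2, 'In stock', 'En stock');
""")

# The translation becomes a join instead of a CASE WHEN repeated per query.
labelled = conn.execute("""
    SELECT i.id, s.label_en, s.label_fr
    FROM items AS i
    JOIN status_labels AS s ON s.status_code = i.status_code
    ORDER BY i.id
""").fetchall()

print(labelled)
```

A column per language is the layout described above; a (status_code, language) row per label is a common alternative once there are many languages.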


megaman821 8 hours ago

I usually use generated columns for this. It still uses CASE WHEN but it is obvious to all consumers of the table that it exists.


Arch-TK 5 hours ago

Forgot to add (all seen in production):

* Don't store UUIDs as strings.

* Don't use random UUID variants for your primary key (or don't use UUIDs for your primary key).

* Don't use a random column in your clustered index.
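
On the first point, a small sketch of the size difference using Python's uuid module and a BLOB column in SQLite; the `things` table and the UUID value are arbitrary examples. Engines with a native UUID type make this unnecessary.

```python
import sqlite3
import uuid

u = uuid.UUID("12345678-1234-5678-1234-567812345678")  # arbitrary example value

as_text = str(u)    # canonical 36-character form
as_bytes = u.bytes  # the same 128 bits as 16 raw bytes

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE things (id BLOB PRIMARY KEY, label TEXT)")
conn.execute("INSERT INTO things VALUES (?, ?)", (as_bytes, "example"))

# Round-trip: rebuild the UUID from the stored bytes.
(stored,) = conn.execute("SELECT id FROM things").fetchone()
round_tripped = uuid.UUID(bytes=stored)

print(len(as_text), len(as_bytes), round_tripped == u)
```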


MichaelNolan 2 hours ago

I guess things are DB dependent. Spanner for instance not only recommends using uuidv4 as a PK, it also stores it as string(36). Uuidv4 as a PK works fine on Postgres as well.


ftchd 3 hours ago

the points are fine and helpful, but they seem like a note from the author to themself rather than a cheatsheet that tries to be exhaustive.

was surprised to not see anything about dates/time.


egeozcan 6 hours ago

I don't know about anti patterns but what I like to do is putting 1=1 after each WHERE to align ANDs nicely and this is enough to create huge dramas in PR reviews.


ffsm8 6 hours ago

It's always perfectly aligned for me, because Enter prefixes two spaces in my IDE in SQL files, ending up with

    where a=1
      And k=2
      And v=3

- egeozcan 4 hours ago
  
  But the first condition looks special while it isn't and it sometimes leads to changes touching one too many lines.
  
DrewADesign 6 hours ago

> what I like to do is putting 1=1 after each WHERE to align ANDs nicely
Frankly, that sounds like one of those things that totally makes sense in the author’s head, but inconsiderately creates terrible code ergonomics and needless cognitive load for anyone reading it. You know to just ignore those expressions when you’re reading it because you wrote it and know they have no effect, but to a busy code reviewer, it’s annoying functionless clutter making their job more annoying. “Wait, that should do nothing… but does it actually do something hackish and ‘clever’ that they didn’t comment? Let’s think about this for a minute.” Use an editor with proper formatting capability, and don’t use executable expressions for formatting in code that other people look at.

- tombert 5 hours ago
  
  Using `WHERE 1=1` is such a common pattern that I seriously doubt it's realistically increasing "cognitive load".
  I've seen it used in dozens of places, in particular places that programmatically generate the AND parts of queries. I wasn't really that confused the first time I saw it and I was never confused any time after that.
  
- MobiusHorizons 6 hours ago
  
  I use `WHERE true` for this. Very little cognitive load parsing that. And it makes AND conditions more copy pastable. Effectively the trailing comma of SQL where clauses
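
For the programmatic-generation case mentioned in this subthread, a minimal sketch in Python with sqlite3. `build_query`, the `users` table and the allow-list are all hypothetical; it uses `1=1` rather than `true` only because older SQLite versions lack the boolean keyword.

```python
import sqlite3

def build_query(filters):
    """Build a SELECT in which every condition is a uniform 'AND ...' line.

    `filters` maps column name -> value. Column names are checked against an
    allow-list because identifiers cannot be bound as parameters.
    """
    allowed = {"status", "country"}
    lines = ["SELECT id FROM users", "WHERE 1=1"]
    params = []
    for column, value in filters.items():
        if column not in allowed:
            raise ValueError("unexpected column: " + column)
        lines.append("  AND " + column + " = ?")
        params.append(value)
    lines.append("ORDER BY id")
    return "\n".join(lines), params

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, status TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users (status, country) VALUES (?, ?)",
    [("active", "NL"), ("active", "US"), ("banned", "NL")],
)

def ids(filters):
    """Run the generated query and return the matching ids."""
    sql, params = build_query(filters)
    return [row[0] for row in conn.execute(sql, params)]

everyone = ids({})
active_nl = ids({"status": "active", "country": "NL"})
print(build_query({"status": "active"})[0])
```

Because the first condition is never special, zero filters and many filters go through the same code path, and adding or removing one touches exactly one line.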
  
  
  DrewADesign 6 hours ago
  
  I absolutely cannot see how this would do what IDE formatting can’t, but admittedly the last time I wrote any significant amount of SQL directly was in a still-totally-relevant Perl 5 application. Could you give an example or link to a file in a public repository or whatever that would show this practice in context?
  

kijin 6 hours ago

Some of these things happen because people try to come up with a single clever query that does everything at once and returns a perfect spreadsheet.

Translating status codes into English or some other natural language? That's better done in the application, not the database. Maybe even leave it to the frontend if you have one. As a rule of thumb, any transformation that does not affect which rows are returned can be applied in another layer after those rows have been returned. Just because you know SQL doesn't mean you have to do everything in SQL.

Deeply nested subqueries? You might want to split that up into simpler queries. There's nothing shameful about throwing three stones to kill three birds, as long as you don't fall into the 1+N pattern. Whoever has to maintain your code will thank you for not trying to be too clever.

Also, a series of simple queries often runs faster than a single large query, because there's a limit to how well the query planner can optimize an excessively complicated statement. With proper use of transactions, you shouldn't have to worry about the data changing under your feet as you make these queries.
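
One way this could look, as a sketch only: two simple queries inside one transaction, with the second one fetching all related rows in a single `IN (...)` query rather than once per parent row (which would be the 1+N pattern). The `customers` and `orders` tables are hypothetical, and whether both reads see the same snapshot depends on the database and its isolation level; SQLite is used here just so the example runs.

```python
import sqlite3
from collections import defaultdict

# isolation_level=None puts the connection in autocommit mode,
# so the transaction below is controlled explicitly.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, active INTEGER);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
    INSERT INTO customers VALUES (1, 'Ada', 1), (2, 'Bob', 0), (3, 'Cy', 1);
    INSERT INTO orders VALUES (10, 1, 5.0), (11, 1, 7.5), (12, 2, 3.0), (13, 3, 1.0);
""")

conn.execute("BEGIN")
try:
    # Query 1: which customers are we interested in?
    customers = conn.execute(
        "SELECT id, name FROM customers WHERE active = 1 ORDER BY id"
    ).fetchall()

    # Query 2: all of their orders in ONE round trip, not one query per customer.
    ids = [customer_id for customer_id, _ in customers]
    placeholders = ",".join("?" for _ in ids)
    orders = conn.execute(
        f"SELECT customer_id, total FROM orders WHERE customer_id IN ({placeholders})",
        ids,
    ).fetchall()
    conn.execute("COMMIT")
except Exception:
    conn.execute("ROLLBACK")
    raise

# Combine the two result sets in the application.
totals = defaultdict(float)
for customer_id, total in orders:
    totals[customer_id] += total
report = [(name, totals[customer_id]) for customer_id, name in customers]
```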

Reply View 0 replies

exceldrawing 4 hours ago

Oracle's DATE type also stores a time component. You have to be aware of that and make your queries explicit about the time part.
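
A small sketch of the pitfall, using SQLite as a stand-in because it runs without an Oracle server (SQLite keeps these timestamps as text, so this only imitates the behaviour; the `events` table is hypothetical). Comparing against a bare date effectively means "midnight", so the usual fix is a half-open range. In Oracle itself the same idea would be written with DATE literals, e.g. `created_at >= DATE '2024-01-15' AND created_at < DATE '2024-01-16'`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, "2024-01-15 00:00:00"),
        (2, "2024-01-15 13:45:10"),
        (3, "2024-01-16 00:00:00"),
    ],
)

# Equality against midnight only matches rows stamped exactly at midnight.
midnight_only = [row[0] for row in conn.execute(
    "SELECT id FROM events WHERE created_at = '2024-01-15 00:00:00' ORDER BY id"
)]

# A half-open range [start of day, start of next day) matches every
# time of day on the 15th and nothing from the 16th.
whole_day = [row[0] for row in conn.execute(
    "SELECT id FROM events WHERE created_at >= ? AND created_at < ? ORDER BY id",
    ("2024-01-15 00:00:00", "2024-01-16 00:00:00"),
)]
```

The range form also leaves the column bare on the left-hand side, so an index on it stays usable, which wrapping the column in a truncation function generally would not.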

Reply View 0 replies

jwsteigerwalt 8 hours ago

That’s my rap sheet…

Reply View 0 replies

jacknews 8 hours ago

"When handling large CASE WHEN statements, it is better to create a dimension table or view, ideally sourced from the landed table where the original status column is populated."

Is this code for 'use a lookup table' or am I falling behind on the terminology? The modern term should be 'sum table' or something similar surely.
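
For what the quoted advice amounts to in practice, here is a minimal sketch (Python with the built-in `sqlite3`). The `fact_orders` and `dim_status` tables and the status codes are hypothetical; the point is only that the `CASE WHEN` mapping and the join against a lookup table return the same rows, while the table version keeps the mapping in data rather than in every query.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE dim_status (status_code TEXT PRIMARY KEY, status_label TEXT);
    INSERT INTO dim_status VALUES ('P', 'Pending'), ('S', 'Shipped'), ('C', 'Cancelled');
    CREATE TABLE fact_orders (order_id INTEGER PRIMARY KEY, status_code TEXT);
    INSERT INTO fact_orders VALUES (1, 'P'), (2, 'S'), (3, 'X');
""")

# Mapping hard-coded in the query: must be repeated wherever labels are needed.
case_version = conn.execute("""
    SELECT order_id,
           CASE status_code
               WHEN 'P' THEN 'Pending'
               WHEN 'S' THEN 'Shipped'
               WHEN 'C' THEN 'Cancelled'
           END
    FROM fact_orders
    ORDER BY order_id
""").fetchall()

# Mapping stored once in a dimension (lookup) table and joined in.
# LEFT JOIN keeps orders whose code has no label, mirroring the CASE's NULL.
join_version = conn.execute("""
    SELECT f.order_id, d.status_label
    FROM fact_orders AS f
    LEFT JOIN dim_status AS d ON d.status_code = f.status_code
    ORDER BY f.order_id
""").fetchall()
```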

Reply View 4 replies

LikesPwsh 8 hours ago

"Dimension table" is the name for lookup tables in a star or snowflake schema.

Reply View | 2 replies
- jacknews 7 hours ago
  
  TIL, Thanks.
  'Landed table'? Is that the 'fact table', the one that contains the codes that need to be looked-up?
  
  Reply View | 1 reply
  
  tremon 7 hours ago
  
  I'm pretty sure the landed table refers to the local copy of the original source. In an ETL* pipeline, the place where source data is stored for further processing is usually called the landing zone. Fact and Dimension tables are outputs of the process, whereas the landing tables are the inputs.
  * in whatever order they're used
  
  Reply View | 0 replies
parpfish 7 hours ago

but sometimes large case statements can't be turned into a simple dimension table/lookup table because it's not a simple key-value transformation.
if your case statement is just a series of straight-ahead "WHEN x=this THEN that", you're very lucky.
the nasty case statements are the ones where the when expression sometimes uses different pieces of data and/or the ordering of the statements is important.
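
A small illustration of that kind of CASE, as a sketch with a hypothetical `orders` table (Python with the built-in `sqlite3`). The branches test different columns and overlap, so the first matching branch wins and swapping two of them changes the result. Neither property fits a plain key-value lookup table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, is_vip INTEGER);
    INSERT INTO orders VALUES (1, 500, 0), (2, 50, 1), (3, 50, 0), (4, 5, 0);
""")

# Branches look at different columns, and a row can satisfy several of them.
ORDERED = """
    SELECT id, CASE
        WHEN is_vip = 1   THEN 'priority'
        WHEN amount > 100 THEN 'large'
        WHEN amount > 10  THEN 'medium'
        ELSE 'small'
    END
    FROM orders ORDER BY id
"""

# Same branches with the two amount checks swapped: 'large' becomes unreachable,
# because anything over 100 is also over 10 and is caught first.
SWAPPED = """
    SELECT id, CASE
        WHEN is_vip = 1   THEN 'priority'
        WHEN amount > 10  THEN 'medium'
        WHEN amount > 100 THEN 'large'
        ELSE 'small'
    END
    FROM orders ORDER BY id
"""

ordered = conn.execute(ORDERED).fetchall()
swapped = conn.execute(SWAPPED).fetchall()
```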

Reply View | 0 replies

JohnHaugeland 8 hours ago

these aren’t anti patterns. these are just things you shouldn’t do

Reply View 9 replies

em500 8 hours ago

Still waiting for the definitive article on how using the term anti-pattern is an anti-pattern.

Reply View | 4 replies
- readthenotes1 7 hours ago
  
  If a pattern is a common problem (e.g., becoming accustomed to a spectacular view) and generally-useful solution to that problem (blocking the view so that effort is required to obtain it), then an anti-pattern is what?
  I think most people think an anti-pattern is an aberration in the "solution" section that creates more problems.
  So here, the anti-pattern is that people use a term so casually (e.g., DevOps) that no one knows what it's referring to anymore.
  (The problem: need a way to refer to concept(s) in a pithy way. The solution: make up or reuse an existing word/phrase to incorporate the concept(s) by reference so that it can, unambiguously, be used as a replacement for the longer description.)
  
  Reply View | 3 replies
  
  JohnHaugeland 4 hours ago
  
  > If a pattern is a common problem
  it isn't, is the thing.
  if you read the book design patterns, they spell out what a pattern is.
  if you read the book anti-patterns, he spells out what an anti-pattern is.
  people have gotten the wrong idea by learning the phrases from casual usage.
  
  Reply View | 1 reply
  
  MaxBarraclough 2 hours ago
  
  Pointing to books isn't very helpful here. Please just state the definition you are advocating.
  
  Reply View | 0 replies
  
  JadeNB 7 hours ago
  
  > If a pattern is a common problem (e.g., becoming accustomed to a spectacular view) and generally-useful solution to that problem (blocking the view so that effort is required to obtain it), then an anti-pattern is what?
  Strange choice of example! I'm not sure I agree that your example is a common problem, and I'm even less sure that the proposed solution to it is generally useful.
  
  Reply View | 0 replies
hobs 7 hours ago

https://pragprog.com/titles/bksqla/sql-antipatterns/ There's an actual book on them that had me nodding along the entire time.

Reply View | 2 replies
- evanelias 6 hours ago
  
  Agreed, it’s an excellent book by a great author. Bill is also quite prolific on Stack Overflow, and generally if you see an answer from him there, you can be confident it’s solid advice.
  
  Reply View | 0 replies
- JohnHaugeland 4 hours ago
  
  that's a fantastic book; one of the best i've read, and i'm glad to see it get brought up
  but also, the book anti-patterns is pretty clear here
  
  Reply View | 0 replies
karmakaze 7 hours ago

I'm waiting for the anti-patterns we shouldn't avoid.

Reply View | 0 replies

0xbadcafebee 6 hours ago

At this point it's malpractice not to use AI to analyze your SQL statements and tables for optimizations

Reply View 2 replies

jpnc 6 hours ago

Are we on bizarro HN?
No, you ask the DB to EXPLAIN itself to you.
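
A minimal sketch of what asking the database looks like, using SQLite's `EXPLAIN QUERY PLAN` through Python's built-in `sqlite3` (other databases have their own `EXPLAIN` syntax and output format). The `users` table, the index, and the `plan` helper are hypothetical. It also shows the sargability point from the top of the thread: wrapping the indexed column in a function turns an index search into a full scan.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT);
    CREATE INDEX idx_users_email ON users (email);
""")

def plan(sql, params=()):
    """Return the human-readable detail column of SQLite's query plan."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[3] for row in rows]

# Bare indexed column compared to a value: the planner can search the index.
sargable = plan("SELECT * FROM users WHERE email = ?", ("a@example.com",))

# Same column wrapped in a function: the plain index on email no longer applies.
wrapped = plan("SELECT * FROM users WHERE lower(email) = ?", ("a@example.com",))

for label, steps in (("sargable", sargable), ("wrapped", wrapped)):
    print(label, "->", "; ".join(steps))
```

The exact wording of the plan text varies between SQLite versions, but the first query should report a SEARCH using `idx_users_email` and the second a SCAN of the table.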

Reply View | 1 reply
- Arch-TK 5 hours ago
  
  Next you'll be telling me that instead of asking AI to find my bug I should just use print statements or a debugger to observe the state of my program over time to find where it deviates from expectations and figure it out that way.
  
  Reply View | 0 replies