Comment by olau a year ago

16 replies

A warning about Aurora: it's opaque tech. I was on a project that switched to it on the recommendation of the hosting provider, and we had to switch away because it turned out that it did not support queries requiring temporary storage, i.e. queries exceeding the memory of the instances.

It manifested like this: the Aurora instances would use up their available (meagre) memory, then start thrashing, taking everything down. Apparently the instances did not have access to any temporary local storage. There was no way to fix that, and it took some time to understand. After reading what little material I could find on Aurora, my personal conclusion is that Aurora is perhaps best thought of as a big hack. I think it's likely there are more gotchas like that.

We moved the database back to a simple VM on SSD, and Postgres handled everything just fine.
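
For anyone wanting to check whether their own queries fall into the "requires temporary storage" category: on stock Postgres, the text output of EXPLAIN (ANALYZE) reports a sort that spilled to disk as `Sort Method: external merge  Disk: ...kB`. Below is a minimal sketch that scans such output for spills. The function name `disk_spills` and the sample plan are hypothetical and made up for illustration, and it only looks at sort nodes, not hash or materialize nodes.

```python
import re

# Matches sort nodes in PostgreSQL EXPLAIN (ANALYZE) text output, e.g.
#   "Sort Method: external merge  Disk: 102400kB"
#   "Sort Method: quicksort  Memory: 25kB"
SORT_LINE = re.compile(
    r"Sort Method:\s+(?P<method>.+?)\s+(?P<where>Disk|Memory):\s+(?P<kb>\d+)kB"
)


def disk_spills(plan_text):
    """Return (sort method, size in kB) for every sort that spilled to disk."""
    spills = []
    for match in SORT_LINE.finditer(plan_text):
        if match.group("where") == "Disk":
            spills.append((match.group("method"), int(match.group("kb"))))
    return spills


# Illustrative plan fragment (invented, not taken from the project above).
SAMPLE_PLAN = """\
Sort  (actual time=... rows=... loops=1)
  Sort Key: created_at
  Sort Method: external merge  Disk: 102400kB
  ->  Sort  (actual time=... rows=... loops=1)
        Sort Key: id
        Sort Method: quicksort  Memory: 25kB
"""

print(disk_spills(SAMPLE_PLAN))
```

A non-empty result means the query needed temporary files, which is exactly the kind of workload described above.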

deergomoo a year ago

We’ve generally been happy with Aurora, but we run into gotchas every so often that don’t seem to be documented anywhere and it’s very annoying.

Example: in normal MySQL, “RENAME TABLE x TO old_x, new_x TO x;” allows for atomically swapping out a table.

But since we moved to Aurora MySQL, we very occasionally get reports landing in the bug tracker with “table x does not exist”, suggesting this is not atomic in Aurora.

Is this documented anywhere? Not that I’ve been able to find. I’m fine with there being subtle differences, especially considering the crazy stuff they’re doing with the storage layer, but if you’re gonna sell it as “MySQL compatible” then please at least tell me the exceptions.
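
To make the swap pattern concrete, here is a small sketch that builds the single multi-rename statement from table names. The helpers `quote_ident` and `swap_statement` are hypothetical names, not part of any MySQL client library. The sketch only constructs the SQL text; it says nothing about whether a given engine executes it atomically.

```python
def quote_ident(name):
    """Quote a MySQL identifier with backticks, doubling any embedded backtick."""
    return "`" + name.replace("`", "``") + "`"


def swap_statement(live, replacement, retired):
    """Build one RENAME TABLE statement that retires `live` and promotes `replacement`.

    Both renames go into a single statement, which is the form the comment
    above relies on for an atomic swap in stock MySQL.
    """
    return "RENAME TABLE {} TO {}, {} TO {};".format(
        quote_ident(live),
        quote_ident(retired),
        quote_ident(replacement),
        quote_ident(live),
    )


print(swap_statement("x", "new_x", "old_x"))
# RENAME TABLE `x` TO `old_x`, `new_x` TO `x`;
```

Issuing the two renames as separate statements would leave a window in which `x` does not exist, so keeping them in one statement is the whole point of the pattern.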

orf a year ago

The first result on Google shows that Aurora certainly does have temporary local storage https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...

  • alexey-salmin a year ago

    I believe this issue is (or was) real. There are important differences in how Aurora treats temporary data. Normal Postgres and RDS Postgres write it to the main data volume (unless configured otherwise). Aurora, however, always separates shared storage from local storage, and it's not entirely clear to me what this local storage physically is for non-read-optimized instance types. The only way to increase it is to increase the instance size. [1][2] This is indeed frustrating, because with Postgres or RDS Postgres you just increase the volume and that's it.

    Luckily, since November 2023 it also has r6gd/r6id classes with local NVMe storage for temp files. [3] This should in theory solve the problem, but I haven't tried it yet.

    [1] https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide...

    [2] https://www.reddit.com/r/aws/s/sIhBQhsG80

    [3] https://aws.amazon.com/about-aws/whats-new/2023/11/amazon-au...

    • p0seidon a year ago

      I think Aurora has to go through the same development process as every database. They changed essential patterns in the database, and there are severe side effects that need to be addressed. You can see the same with Aurora Serverless and the changes in V2; there were some quite quirky issues in the first versions.

dalyons a year ago

Calling it a hack is pretty unfair. The log storage engine is a huge innovation, and in my experience it makes large MySQL/pg clusters much more reliable and performant at scale in a variety of ways.

It has a couple of quirks, but on balance it feels like the future - the next evolution of what traditional rdbms are capable of.

But if you don’t have scale or resiliency needs it probably doesn’t matter to you.

pquki4 a year ago

Isn't Aurora mainly about its unique handling of logging and replication, which leads to high availability and fast recovery? If you switch to a VM, how do you handle availability in multiple locations and backups? If database checkpoints are good enough for you, it sounds like Aurora was overkill in the first place.

vips7L a year ago

We’re currently struggling with switching from RDS to Aurora. The replica times are absolutely bonkers long for the simplest of writes.

  • dalyons a year ago

    Aurora has essentially constant replica lag times; it’s one of the best features. It should always be around 30-50ms. Are you seeing something different?

    • vips7L a year ago

      7-20 seconds depending on location.

      • dalyons a year ago

        Location? Are you doing multi region?

  • alexey-salmin a year ago

    Can you elaborate? Which exact timings are bad?

    • vips7L a year ago

      I’ve added more details to a sibling comment. I’m not sure I can add much more. I’m not an ops person, just the team lead on the development side.