You know you need to be careful when an Amazon engineer argues for a database architecture that fully leverages (and makes you dependent on) the strengths of their employer's product. In particular:
> Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails).
This is surely true for certain use cases, say financial applications which must guarantee 100% uptime, but I'd argue the vast, vast majority of applications are perfectly ok with local commit and rapid recovery from remote logs and replicas. The point is, the cloud won't give you that distributed consistency for free, you will pay for it both in money and complexity that in practice will lock you in to a specific cloud vendor.
I.e., it makes cloud and hosting services impossible for the database vendors to commoditize, which is exactly the point.
Skipping flushing the local disk seems rather silly to me:
- A modern high end SSD commits faster than the one way time to anywhere much farther than a few miles away. (Do the math. A few tens of microseconds specified write latency is pretty common. NVDIMMs (a sadly dying technology) can do even better. The speed of light is only so fast.)
- Unfortunate local correlated failures happen. IMO it’s quite nice to be able to boot up your machine / rack / datacenter and have your data there.
- Not everyone runs something on the scale of S3 or EBS. Those systems are awesome, but they are (a) exceedingly complex and (b) really very slow compared to SSDs. If I’m going to run an active/standby or active/active system with, say, two locations, I will flush to disk in both locations.
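To put rough numbers on the "do the math" point, here is a back-of-the-envelope sketch. It assumes light in optical fiber travels at about two thirds of its vacuum speed, and it takes 10, 20 and 50 µs as stand-ins for "a few tens of microseconds"; both are illustrative assumptions, not measurements.

```python
# Back-of-the-envelope: how far can a signal travel one way in the time
# a local SSD takes to acknowledge a write?

C_VACUUM_KM_PER_S = 299_792.458   # speed of light in vacuum
FIBER_FACTOR = 2 / 3              # assumed slowdown in optical fiber


def one_way_reach_km(latency_us: float, factor: float = 1.0) -> float:
    """Distance covered in `latency_us` microseconds at `factor` times c."""
    return C_VACUUM_KM_PER_S * factor * latency_us * 1e-6


if __name__ == "__main__":
    for latency_us in (10, 20, 50):  # assumed SSD write latencies
        vacuum = one_way_reach_km(latency_us)
        fiber = one_way_reach_km(latency_us, FIBER_FACTOR)
        print(f"{latency_us:>3} us: {vacuum:5.1f} km in vacuum, {fiber:5.1f} km in fiber")
```

At 20 µs that is roughly 6 km in vacuum and 4 km in fiber, before counting any switching, serialization, or the remote side's own write.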
Committing to an NVMe drive properly is really costly. I'm talking about using O_DIRECT | O_SYNC or fsync here.
It can easily be on the order of whole milliseconds. And it is much worse if you are using cloud systems.
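If you want to check this on your own hardware, a minimal probe could look like the sketch below. It uses plain `os.write` plus `os.fsync` rather than O_DIRECT (which needs aligned buffers), and the file name, block size, and sample count are arbitrary choices for illustration. Results depend heavily on the drive, the filesystem, and whether the target directory is on real storage at all (a tmpfs-backed temp directory will look unrealistically fast).

```python
import os
import tempfile
import time


def time_synced_writes(path: str, count: int = 100, size: int = 4096) -> list[float]:
    """Write `count` blocks of `size` bytes, fsync after each, return seconds per write."""
    payload = b"\0" * size
    latencies = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    try:
        for _ in range(count):
            start = time.perf_counter()
            os.write(fd, payload)
            os.fsync(fd)  # ask the OS to push the data to stable storage
            latencies.append(time.perf_counter() - start)
    finally:
        os.close(fd)
    return latencies


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        samples = sorted(time_synced_writes(os.path.join(tmp, "probe.bin")))
        print(f"median {samples[len(samples) // 2] * 1e6:.0f} us, "
              f"max {samples[-1] * 1e6:.0f} us")
```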
I think it is fair to argue that there is a strong correlation between criticality of data and network scale. Most small businesses don't need anything S3 scale, but they also don't need 24 hour uptime, and losing the most recent day of data is annoying rather than catastrophic. So they can probably get away without flushing, given daily asynchronous backups to a different machine and a 1 minute UPS to allow for safe storage in the event of a power outage.
I should add that the bond between relational databases and spinning rust goes back further. My dad, who started working as a programmer in the 60s with just magtape as storage, talked about the early era of disks as a big step forward but requiring a lot of detailed work to decide where to put the data and how to find it again. For him, databases were a solution to the problems that disks created for programmers. And I can certainly imagine that. Suddenly you have to deal with way more data stored in multiple dimensions (platter, cylinder, sector) with wildly nonlinear access times (platter rotation, head movement). I can see how commercial solutions to that problem would have been wildly popular, but also built around solving a number of problems that don't matter.
I'm not sure I totally understand the timeline you're describing, but my understanding is that relational databases themselves were only invented in the 1970s. Is your reference to the 60s just giving context for when he started but before this link happened (with the idea that the problems predated the solution)?
Non-relational databases existed in the 60s, and many programmers who worked in the 60s presumably continued working into the 70s, so either way I don't see any problems with the timeline GP mentions.
> Design decisions like write-ahead logs, large page sizes, and buffering table writes in bulk were built around disks where I/O was SLOW, and where sequential I/O was order(s)-of-magnitude faster than random.
Overall speed is irrelevant, what mattered was the relative speed difference between sequential and random access.
And since there's still a massive difference between sequential and random access with SSDs, I doubt the overall approach of using buffers needs to be reconsidered.
Can you clarify? I thought a major benefit of SSDs is that there isn't any difference between sequential and random access. There's no physical head that needs to move.
Edit: thank you for all the answers -- very educational, TIL!
I think that is a bigger impact on writes than reads, but certainly means there is some gap from optimal.
To me a 4k read seems anachronistic from a modern application perspective. But I gather 4kb pages are still common in many file systems. But that doesn’t mean the majority of reads are 4kb random in a real world scenario.
SSD controllers and VFSs are often optimized for sequential access (e.g. readahead cache) which leads to software being written to do sequential access for speed which leads to optimization for that access pattern, and so on.
- The access block size (LBA size). Either 512 bytes or 4096 bytes modulo DIF. Purely a logical abstraction.
- The programming page size. Something in the 4K-64K range. This is the granularity at which an erased block may be programmed with new data.
- The erase block size. Something in the 1-128 MiB range. This is the granularity at which data is erased from the flash chips.
SSDs always use some kind of journaled mapping to cope with the actual block size being roughly five orders of magnitude larger than the write API suggests. The FTL probably looks something like an LSM with some constant background compaction going on. If your writes are larger chunks, and your reads match those chunks, you would expect the FTL to perform better, because it can allocate writes contiguously and reads within the data structure have good locality as well. You can also expect drives to further optimize sequential operations, just like the OS does.
(N.b. things are likely more complex, because controllers will likely stripe data with the FEC across NAND planes and chips for reliability, so the actual logical write size from the controller is probably not a single NAND page)
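As a rough illustration of the mapping idea (a toy model, not how any particular controller works), the sketch below appends every write to the next free page and invalidates the old copy. `ToyFTL`, its method names, and the tiny four-page erase block are all invented for the example. For scale, 128 MiB / 512 B = 262,144, which is where the "roughly five orders of magnitude" comes from at the upper end of the ranges above.

```python
class ToyFTL:
    """Append-only mapping from logical blocks to (erase block, page) slots."""

    def __init__(self, pages_per_erase_block: int = 4):
        self.pages_per_erase_block = pages_per_erase_block
        self.mapping = {}    # logical block address -> (erase block, page)
        self.valid = {}      # erase block -> set of pages still holding live data
        self.next_slot = 0   # next free physical page, in append order

    def write(self, lba: int) -> None:
        # Flash pages cannot be overwritten in place: invalidate the old
        # copy and append the new one at the head of the log.
        old = self.mapping.get(lba)
        if old is not None:
            self.valid[old[0]].discard(old[1])
        block, page = divmod(self.next_slot, self.pages_per_erase_block)
        self.next_slot += 1
        self.mapping[lba] = (block, page)
        self.valid.setdefault(block, set()).add(page)

    def live_pages(self) -> dict:
        """Live page count per erase block; partially live blocks need compaction."""
        return {block: len(pages) for block, pages in sorted(self.valid.items())}


if __name__ == "__main__":
    ftl = ToyFTL()
    for lba in range(8):        # sequential fill: two fully live erase blocks
        ftl.write(lba)
    for lba in (1, 6, 3, 4):    # scattered overwrites
        ftl.write(lba)
    print(ftl.live_pages())     # {0: 2, 1: 2, 2: 4}
```

After the scattered overwrites, erase blocks 0 and 1 are each half stale, so reclaiming either one means copying its two live pages elsewhere first. Had the overwrites covered logical blocks 0 to 3 instead, block 0 would be entirely stale and could be erased without copying anything.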
It depends on the size of the read - most SSDs have internal block sizes much larger than a typical (actual) random read, so they internally have to do a lot more work for a given byte of output in a random read situation than they would in a sequential one.
Most filesystems read in 4K chunks (or sometimes even worse, 512 bytes), and internally the actual block is often multiple MB in size, so this internal read multiplication is a big factor in performance in those cases.
Note the only real difference between a random read and a sequential one is the size of the read in one sequence before it switches location - is it 4K? 16mb? 2G?
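One way to make that concrete is to treat "random" and "sequential" as the two ends of a single knob: how many bytes are read contiguously before jumping. The generator below is a sketch with invented names (`access_offsets`, `run_bytes`); it assumes `run_bytes` is a multiple of `io_size` and no larger than the file.

```python
import random


def access_offsets(file_size: int, run_bytes: int, io_size: int,
                   total_bytes: int, seed: int = 0) -> list[int]:
    """Offsets for a workload that reads `run_bytes` contiguously, then jumps."""
    rng = random.Random(seed)
    offsets = []
    while len(offsets) * io_size < total_bytes:
        # Pick an io_size-aligned start that leaves room for a full run.
        start = rng.randrange(0, (file_size - run_bytes) // io_size + 1) * io_size
        for off in range(start, start + run_bytes, io_size):
            offsets.append(off)
    return offsets[: total_bytes // io_size]


def jumps(offsets: list[int], io_size: int) -> int:
    """Count how often the next I/O is not adjacent to the previous one."""
    return sum(1 for a, b in zip(offsets, offsets[1:]) if b != a + io_size)


if __name__ == "__main__":
    gib = 1 << 30
    for run in (4096, 16 << 20, gib):  # 4K runs, 16 MiB runs, one 1 GiB run
        offs = access_offsets(gib, run, 4096, 64 << 20)
        print(f"run={run:>10}: {jumps(offs, 4096)} jumps in {len(offs)} reads")
```

Feeding these offsets to `os.pread` on a large file is one way to benchmark the same device across the whole spectrum instead of only at its two extremes.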
Only a few companies are global, so only a few of them should optimize for that kind of workload. However, maybe every startup in SV must aim to become global, so probably that's what most of them must optimize for, even the ones that eventually fail to get traction.
24/7 is different because even the customers of local companies, even B2B ones, might feel like doing some work at midnight once in a while. They'll be disappointed to find the server down.
The author could have started by surveying the current state of the art instead of just falsely assuming that DB devs have been resting on their laurels for the past decades. If you want to see a (relational) DB for SSDs, just check out stuff like myrocks on zenfs+; it's pretty impressive stuff.
There has also been some significant academic study of DBMS design for persistent memory - which SSD technology can serve as (e.g. as NVDIMMs or abstractly) : Think of no distinction between primary and secondary storage, RAM and disk - there's just a huge amount of not-terribly-fast memory; and whatever you write to memory never goes away. It's an interesting model.
At first glance this reads like a storage interface argument, but it’s really about media characteristics. SSDs collapse the random vs sequential gap, yet most DB engines still optimize for throughput instead of latency variance and write amplification. That mismatch is the interesting part.
> Commit-to-disk on a single system is both unnecessary
If you believe this, then what you want already exists. For example: MySQL has in memory tables, but also this design pretty much sounds like NDB.
I don’t think I’d build a database the way they are describing for anything serious. Maybe a social network or other unimportant app where the consequences of losing data aren’t really a big deal.
I'm a little bit surprised enterprise isn't sticking to optane for this. It's EoL tech at this point, but it'll still smoke top of the line nvmes for small Q1 which I'd think you'd want for some databases.
Median database workloads are probably doing writes of just a few bytes per transaction, i.e. 'set last_login_time = now() where userid=12345'.
Due to the interface between SSD and host OS being block based, you are forced to write a full 4k page. Which means you really still benefit from a write ahead log to batch together all those changes, at least up to page size, if not larger.
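The arithmetic behind that claim is easy to sketch. The 32-byte record size and the count of 1000 updates are made-up numbers, and the batched case assumes the updates can actually share a flush (for example via group commit); it says nothing about whether a given engine does so.

```python
BLOCK_SIZE = 4096  # bytes; the 4k write unit mentioned above


def blocks_written(record_sizes: list[int], batched: bool) -> int:
    """Number of 4 KiB block writes needed to persist the given records."""
    if not batched:
        # One block write per record, however small (records assumed <= BLOCK_SIZE).
        return len(record_sizes)
    # Batched: pack records back to back, as an append-only log would.
    total = sum(record_sizes)
    return -(-total // BLOCK_SIZE)  # ceiling division


if __name__ == "__main__":
    records = [32] * 1000  # assumed: 1000 updates of about 32 bytes each
    for batched in (False, True):
        n = blocks_written(records, batched)
        amp = n * BLOCK_SIZE / sum(records)
        print(f"batched={batched}: {n} block writes, {amp:.1f}x write amplification")
```

With these numbers, writing each update on its own costs 1000 block writes (128x amplification), while packing them costs 8.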
A write-ahead log isn't a performance tool to batch changes, it's a tool to get durability of random writes. You write your intended changes to the log, fsync it (which means you get a 4k write), then make the actual changes on disk just as if you didn't have a WAL.
If you want to get some sort of sub-block batching, you need a structure that isn't random in the first place, for instance an LSM (where you write all of your changes sequentially to a log and then do compaction later)—and then solve your durability in some other way.
The actual writes don’t need to be persisted on transaction commit, only the WAL. In most DBs the actual writes won’t be persisted until the written page is evicted from the page cache. In this sense, writing WAL generally does provide better perf than synchronously doing a random page write.
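A stripped-down sketch of that commit path: the only synchronous I/O is an append to the log, while the "page" is just updated in memory and rebuilt from the log after a crash. `TinyWAL` and its JSON line format are invented for illustration; a real engine adds checksums, checkpoints, log truncation, and actual page write-back.

```python
import json
import os
import tempfile


class TinyWAL:
    """Key-value store whose only synchronous I/O is an append to a log file."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.pages = {}  # stands in for the in-memory page cache
        self._replay()
        self.log = open(log_path, "ab")

    def _replay(self) -> None:
        # Crash recovery: rebuild state from whatever reached the log.
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, "rb") as f:
            for line in f:
                try:
                    key, value = json.loads(line)
                except ValueError:
                    break  # torn final record: stop replaying
                self.pages[key] = value

    def commit(self, key: str, value) -> None:
        record = json.dumps([key, value]).encode() + b"\n"
        self.log.write(record)
        self.log.flush()
        os.fsync(self.log.fileno())  # durability point: a sequential append
        self.pages[key] = value      # page updated in memory; write-back can wait

    def close(self) -> None:
        self.log.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "wal.log")
        db = TinyWAL(path)
        db.commit("user:12345:last_login", 1700000000)
        db.close()  # pretend the process died here
        recovered = TinyWAL(path)
        print(recovered.pages)
        recovered.close()
```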
I would guess by now none have that internally. As a rule of thumb every major flash density increase (SLC, TLC, QLC) also tended to double internal page size. There were also internal transfer performance reasons for large sizes. Low level 16k-64k flash "pages" are common, and sometimes with even larger stripes of pages due to the internal firmware sw/hw design.
Also due to error correction issues. Flash is notoriously unreliable, so you get bit errors _all the time_ (correcting errors is absolutely routine). And you can make more efficient error-correcting codes if you are using larger blocks. This is why HDDs went from 512 to 4096 byte blocks as well.
Not for SSD specifically, but I assume the compact design doesn't hurt: duckdb saved my sanity recently. Single file, columnar, with builtin compression I presume (given that in a columnar layout even the simplest compression may be very effective), and with $ duckdb -ui /path/to/data/base.duckdb opening a notebook in the browser. Didn't find a single thing to dislike about duckdb - as a single user. To top it off - afaik it can be zero-copy 'overlaid' on top of a bunch of parquet binary files to provide SQL over them?? (didn't try it; would be amazing if it works well)
Postgres allows you to choose a different page size (at initdb time? At compile time?). The default is 8K. I've always wondered if 32K wouldn't be a better value, and this article points in the same direction.
On the other hand, smaller pages mean that more pages can fit in your CPU cache. Since CPU speed has improved much more than memory bus speed, and since cache is a scarce resource, it is important to use your cache lines as efficiently as possible.
Ultimately, it's a trade-off: larger pages mean faster I/O, while smaller pages mean better CPU utilisation.
> WALs, and related low-level logging details, are critical for database systems that care deeply about durability on a single system. But the modern database isn’t like that: it doesn’t depend on commit-to-disk on a single system for its durability story. Commit-to-disk on a single system is both unnecessary (because we can replicate across storage on multiple systems) and inadequate (because we don’t want to lose writes even if a single system fails).
And then a bug crashes your database cluster all at once and now instead of missing seconds, you miss minutes, because some smartass thought "surely if I send request to 5 nodes some of that will land on disk in reasonably near future?".
I love how this industry invents best practices that are actually good then people just invent badly researched reasons to just... not do them.
But we know this is not actually robust because storage and power failures tend to be correlated. The most recent Jepsen analysis again highlights that it's flawed thinking: https://jepsen.io/analyses/nats-2.12.1
The Aurora paper [0] goes into detail of correlated failures.
> In Aurora, we have chosen a design point of tolerating (a) losing an entire AZ and one additional node (AZ+1) without losing data, and (b) losing an entire AZ without impacting the ability to write data. [..] With such a model, we can (a) lose a single AZ and one additional node (a failure of 3 nodes) without losing read availability, and (b) lose any two nodes, including a single AZ failure and maintain write availability.
As for why this can be considered durable enough, section 2.2 gives an argument based on their MTTR (mean time to repair) of storage segments:
> We would need to see two such failures in the same 10 second window plus a failure of an AZ not containing either of these two independent failures to lose quorum. At our observed failure rates, that’s sufficiently unlikely, even for the number of databases we manage for our customers.
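The availability claims in the first quote can be checked with a few lines of counting. The layout used here (six copies, two per AZ across three AZs, write quorum of 4, read quorum of 3) comes from the Aurora paper itself rather than from the passages quoted above; the function and key names are just for this sketch.

```python
from itertools import combinations

# Layout from the Aurora paper (not from the quotes above):
# six copies, two per availability zone, write quorum 4, read quorum 3.
NODES = [(az, i) for az in "abc" for i in range(2)]
WRITE_QUORUM = 4
READ_QUORUM = 3


def survivors(failed: set) -> int:
    """Number of copies still reachable after the given nodes fail."""
    return len(NODES) - len(failed)


def check() -> dict:
    results = {
        "any_two_nodes_writable": True,
        "az_writable": True,
        "az_plus_one_readable": True,
        "az_plus_one_writable": True,
    }
    for pair in combinations(NODES, 2):
        if survivors(set(pair)) < WRITE_QUORUM:
            results["any_two_nodes_writable"] = False
    for az in "abc":
        down = {n for n in NODES if n[0] == az}  # the whole AZ is gone
        if survivors(down) < WRITE_QUORUM:
            results["az_writable"] = False
        for extra in set(NODES) - down:          # AZ+1: one more node fails
            left = survivors(down | {extra})
            if left < READ_QUORUM:
                results["az_plus_one_readable"] = False
            if left < WRITE_QUORUM:
                results["az_plus_one_writable"] = False
    return results


if __name__ == "__main__":
    for name, ok in check().items():
        print(f"{name}: {ok}")
```

Only the last entry comes out false: after AZ+1 three copies remain, enough to read (and so to repair) but not enough to accept new writes. That matches the quote, which promises read availability in that case. Note that this counts availability only and says nothing about how likely the failures are to be correlated.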
The biggest lie we’ve been told is that databases require global consistency and a global clock. Traditional databases are still operating with Newtonian assumptions about absolute time, while the real world moves according to Einstein’s relativistic theory, where time is local and relative. You don't need global order, you don't need a global clock.
Till the financial controller shows up at the very least.
Also, even if not required, it makes reasoning about how systems work a hell of a lot easier. So for the vast majority that don't need massive throughput, sacrificing some speed for an easier-to-understand consistency model is a worthy tradeoff.
Pretty much all financial transactions are settled with a given date, not instantly.
Go sell some stocks, it takes 2 days to actually settle. (May be hidden by your provider, but that's how it works).
For that matter, the ultimate in BASE for financial transactions is the humble check.
That is a great example of "money out" that will only be settled at some time in the future.
There is a reason there is this notion of a "business day" and re-processing transactions that arrived out of order.
It happens all the time (ignoring best practices because it’s convenient, or doing something different ‘just because’), literally everywhere, including in normal society.
SSDs are more of a black box per se. The FTL adds another layer of indirection, and FTLs are mostly proprietary and vendor-specific. So the performance of SSDs is not generalizable.
Please give dbzero a try. It eliminates the database from the developer's stack completely, by replacing a database with the DISTIC memory model (durable, infinite, shared, transactional, isolated, composable). It's built for the SSD/NVMe drive era.
I’m a bit disappointed the article doesn’t mention Aerospike. It’s not a rdbms but a kvdb commonly used in adtech, and extremely performant on that use case. Anyway, it’s actually designed for ssds, which makes it possible to persist all writes even when the nic is saturated with write operations. Of course the aggregated bandwidth of the attached ssd hardware needs to be faster than the throughput of the nic, but not much, there’s very little overhead in the software.
How does that work? Is that an open source solution like the ZCRX stuff with io_uring or does it require proprietary hardware setups? I'm hopeful that the open source solutions today are competitive.
I was familiar with Solarflare and Mellanox zero-copy setups in a previous fintech role, but at that time it all relied on black boxes (specifically out-of-tree kernel modules, delivered as blobs without DKMS or equivalent support, a real headache to live with) that didn't always work perfectly. It was pretty frustrating overall, because the customer paying the bill (rightfully) had less than zero tolerance for performance fluctuations. And fluctuations were annoyingly common, despite my best efforts (dedicating a core to IRQ handling, bringing up the kernel masked to another core, then pinning the user-space workloads to specific cores, and stuff like that). It was quite an extreme setup: GPS-disciplined oscillator with millimetre-perfect antenna wiring for the NTP setup, etc. We built two identical setups, one in Hong Kong and one in New York. Very good fun overall, but frustrating because of stack immaturity at that time.
but... but... SSD/NVMes are not really block devices. Not wrangling them into a block device interface but using the full set of features can already yield major improvements. Two examples: metadata and indexes need smaller granularities compared to data, and an NVMe can do this quite naturally. Another example is that the data can be sent directly from the device to the network, without the CPU being involved.
Unpopular opinion: databases were designed for 1980s-90s mechanics, and the one thing that never innovates is the DB. It still uses B-trees and LSM trees that were optimized for spinning disks. The inefficiency is masked by hardware innovation and speed (Moore's Law).
Optimising hardware to run existing software is how you sell your hardware.
The amount of performance you can extract from a modern CPU if you really start optimising cache access patterns is astounding
High performance networking is another area like this. High performance NICs still go to great lengths to provide a BSD socket experience to devs. You can still get 80-90% of the performance advantages of kernel bypass without abandoning that model.
https://i.imgur.com/t5scCa3.png
https://ssd.userbenchmark.com/ (click on the orange double arrow to view additional columns)
That is a latency of about 50 µs for a random read, compared to 4-5 ms latency for HDDs.
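Converted to operations per second at queue depth 1 (one request in flight at a time), those latencies work out as below. The 4.5 ms HDD figure is simply the midpoint of the 4-5 ms range quoted above.

```python
def qd1_iops(latency_seconds: float) -> float:
    """Operations per second when each request waits for the previous one."""
    return 1.0 / latency_seconds


if __name__ == "__main__":
    ssd = qd1_iops(50e-6)    # about 50 us per random read
    hdd = qd1_iops(4.5e-3)   # midpoint of the 4-5 ms range
    print(f"SSD: {ssd:,.0f} IOPS, HDD: {hdd:,.0f} IOPS, ratio about {ssd / hdd:.0f}x")
```

That is about 20,000 versus about 222 random reads per second, a gap of roughly 90x before any parallelism is used.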
> Companies are global, businesses are 24/7
A massive number of companies have global customers, regardless of where the company itself has employees.
For example my b2b business is relatively tiny, yet my customer base spans four continents. Or six continents if you count free users!
anything like this, but for postgres?
actually, is it even possible to write a new db engine for postgres? like mysql has innodb, myisam, etc
Why not both?
That would be asynchronous replication. But IIUC the author is instead advocating for a distributed log with synchronous quorum writes.
[0] https://pages.cs.wisc.edu/~yxy/cs764-f20/papers/aurora-sigmo...
Frankly, it’s shocking anything works at all.
So, essentially just CQRS, which is usually handled in the application level with event sourcing and similar techniques.
[1] https://www.dr-josiah.com/2010/08/databases-on-ssds-initial-...
https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf
I think this was one, and I want to emphasise this, of the main points behind the Odin programming language.