Modern storage is plenty fast. It is the APIs that are bad.

Nov 26, 2020

I have spent almost the entire last decade in a fairly specialized product company, building high-performance I/O systems. I had the opportunity to see storage technology evolve rapidly and decisively. Talking about storage and its developments felt like preaching to the choir.

This year, I switched jobs. Being at a larger company, with engineers from many different backgrounds, I was taken by surprise: although every one of my peers is certainly extremely bright, most of them carried misconceptions about how best to exploit the performance of modern storage technology, leading to suboptimal designs, even when they were aware of how much the hardware had improved.

As I reflected on the causes of this disconnect, I realized that a large part of the reason such misconceptions persist is that if my peers were to spend the time to validate their assumptions with benchmarks, the data would show that those assumptions are, or at least appear to be, true.

Common examples of such misconceptions include:

  • “Well, it is fine to copy memory here and perform this expensive computation because it saves us one I/O operation, which is even more expensive”.
  • “I am designing a system that needs to be fast. Therefore it needs to be in memory”.
  • “If we split this into multiple files it will be slow because it will generate random I/O patterns. We need to optimize this for sequential access and read from a single file”.
  • “Direct I/O is very slow. It only works for very specialized applications. If you don’t have your own cache you are doomed”.

Yet if you skim through the specs of modern NVMe devices, you see commodity devices with latencies in the microsecond range, several GB/s of throughput, and support for hundreds of thousands of random IOPS. So where’s the disconnect?

In this article I will demonstrate that while hardware has changed dramatically over the past decade, software APIs have not, or at least not enough. Riddled with memory copies, memory allocations, overly optimistic read-ahead caching, and all sorts of expensive operations, legacy APIs prevent us from making the most of our modern devices.

In the process of writing this piece I had the immense pleasure of getting early access to one of Intel’s next-generation Optane devices. While they are not yet commonplace in the market, they certainly represent the crowning of a trend towards faster and faster devices. The numbers you will see throughout this article were obtained using this device.

In the interest of time I will focus this article on reads. Writes have their own unique set of issues, as well as opportunities for improvement, which I intend to cover in a later article.

The claims

There are three main problems with traditional file-based APIs:

  • They perform a lot of expensive operations because “I/O is expensive”.

When legacy APIs need to read data that is not cached in memory, they generate a page fault. Then, after the data is ready, an interrupt. Lastly, a traditional system-call based read incurs an extra copy to the user buffer, and mmap-based access requires updating the virtual memory mappings.
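
To make the sequence concrete, here is a minimal C sketch of both traditional paths. The file name is a placeholder, error handling is omitted, and the comments mark where each expensive step happens:

    #include <fcntl.h>
    #include <sys/mman.h>
    #include <sys/stat.h>
    #include <unistd.h>

    int main(void) {
        char buf[4096];
        int fd = open("datafile", O_RDONLY); /* hypothetical file */

        /* Path 1: system-call based read. The kernel brings the page into
         * the page cache (device I/O, then an interrupt) and finally
         * copies it into buf. */
        pread(fd, buf, sizeof(buf), 0);

        /* Path 2: mmap. No copy into a user buffer, but the first access
         * to a non-resident page triggers a page fault and an update of
         * the virtual memory mappings. */
        struct stat st;
        fstat(fd, &st);
        char *map = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
        char first = map[0]; /* the fault happens here, not at mmap() time */
        (void)first;

        munmap(map, st.st_size);
        close(fd);
        return 0;
    }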

None of these operations (page faults, interrupts, copies, or virtual memory mapping updates) is cheap. But years ago they were still ~100 times cheaper than the cost of the I/O itself, making this approach acceptable. This is no longer the case as device latency approaches single-digit microseconds: those operations are now in the same order of magnitude as the I/O operation itself.
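
To put rough numbers on that (these are assumed round figures for illustration, not measurements of any particular device):

    device read, fast NVMe, 4kB          ~5 µs
    page fault                           ~1 µs
    interrupt + context switch           ~1-2 µs
    copy into the user buffer            ~1 µs
    page cache / VM bookkeeping          ~1-2 µs
    ---------------------------------------------
    total                                ~9-11 µs, only ~5 µs of which
                                         is spent talking to the device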

A quick back-of-the-napkin calculation along those lines shows that in the worst case less than half of the total busy cost is the cost of communication with the device per se. That’s not counting all the waste, which brings us to the second problem:

  • Read amplification.

Although there are details I will brush over (like the memory used by file descriptors, or the various metadata caches in Linux), if modern NVMe devices support many concurrent operations, there is no reason to believe that reading from many files is more expensive than reading from one. However, the aggregate amount of data read certainly matters.

The operating system reads data at page granularity, meaning it can only read a minimum of 4kB at a time. So if you need to read 1kB split across two files, 512 bytes in each, you are effectively reading 8kB to serve 1kB, wasting 87% of the data read. In practice, the OS also performs read-ahead, with a default setting of 128kB, in anticipation of saving you cycles later when you do need the remaining data. But if you never do, as is often the case for random I/O, then you just read 256kB to serve 1kB and wasted 99% of it.

If you feel tempted to validate my claim that reading from multiple files shouldn’t be fundamentally slower than reading from a single file, you may end up proving yourself right, but only because read amplification greatly increased the amount of data effectively read.
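
One way to isolate the read-ahead component of this amplification is to tell the kernel that your access pattern is random. A minimal sketch (hypothetical file name, error handling omitted); note that the 4kB page granularity still applies:

    #include <fcntl.h>
    #include <unistd.h>

    int main(void) {
        int fd = open("datafile", O_RDONLY);

        /* Without this hint, a 512-byte read can trigger up to 128kB of
         * read-ahead. POSIX_FADV_RANDOM disables read-ahead for this
         * file descriptor. */
        posix_fadvise(fd, 0, 0, POSIX_FADV_RANDOM);

        char buf[512];
        pread(fd, buf, sizeof(buf), 0); /* still reads at least one 4kB page */

        close(fd);
        return 0;
    }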

Since the issue is the OS page cache, what happens if you just open a file with Direct I/O, all else being equal? Unfortunately that likely won’t get faster either. But that’s because of our third and last issue:

  • Traditional APIs don’t exploit parallelism.

A file is seen as a sequential stream of bytes, and whether data is in memory or not is transparent to the reader. Traditional APIs will wait until you touch data that is not resident to issue an I/O operation. That operation may be larger than what the user requested due to read-ahead, but it is still just one.

However, as fast as modern devices are, they are still slower than the CPU. While the CPU waits for the I/O operation to come back, it is not doing anything.
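
To give a taste of what exploiting that parallelism looks like, here is a sketch using Linux’s io_uring (via liburing) together with Direct I/O, keeping several reads in flight so the CPU has work to submit and reap while the device is busy. The file name, queue depth, and offsets are arbitrary placeholders, and error handling is omitted:

    #define _GNU_SOURCE /* for O_DIRECT */
    #include <fcntl.h>
    #include <liburing.h>
    #include <stdlib.h>
    #include <unistd.h>

    #define DEPTH 8    /* reads kept in flight; an arbitrary choice */
    #define BLOCK 4096 /* O_DIRECT requires aligned buffers and offsets */

    int main(void) {
        int fd = open("datafile", O_RDONLY | O_DIRECT);

        struct io_uring ring;
        io_uring_queue_init(DEPTH, &ring, 0);

        /* Queue DEPTH reads at different offsets before waiting for any. */
        for (int i = 0; i < DEPTH; i++) {
            void *buf;
            posix_memalign(&buf, BLOCK, BLOCK);
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, buf, BLOCK, (off_t)i * BLOCK);
        }
        io_uring_submit(&ring); /* one system call submits all of them */

        /* Reap the completions; the device worked on them concurrently. */
        for (int i = 0; i < DEPTH; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            io_uring_cqe_seen(&ring, cqe);
        }

        io_uring_queue_exit(&ring);
        close(fd);
        return 0;
    }

Compiled with -luring, this issues all eight reads with a single submission; a synchronous read loop would instead pay the full device latency once per read.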
