The difference matters more if you're manipulating a whole lot of different files simultaneously. You know, like multitasking on a modern machine with a lot of cores/threads. While true, unless you're manipulating huge files, it doesn't matter. The next jump after NVMe will be huge too.
ITS 50% FASTER, SO WINDOWS NOW BOOTS IN .2 SECONDS RATHER THAN .4!!!1111
Every time a SATA SSD reads a block, a single CPU interrupt fires, and the driver servicing that interrupt has to interrogate the AHCI host adapter, figure out which entry in the queue the drive is returning, figure out which process on the machine wanted that data, then talk to the AHCI controller AGAIN so the controller can DMA the data where it needs to go... so that a second interrupt can fire when that transfer completes. And since the AHCI queue is only 32 entries deep, odds are there's also a software-managed I/O queue on top of it that the driver has to handle as well, reading the next request off the software queue and adding it to the drive queue.
This didn't matter so much when we were using spinning-rust drives with slow seek times, but on a modern machine with lots of cores simultaneously fighting over data, and SATA SSDs with far lower access latency, the software/AHCI funnel becomes the next bottleneck.
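That funnel can be sketched as a toy model. Everything here is illustrative (the class and method names are made up, not any real kernel API); the only number taken from the hardware is the 32-entry AHCI/NCQ queue depth. The point it demonstrates: every single completion, no matter which task issued it, has to take a trip through one central handler.

```python
from collections import deque

AHCI_QUEUE_DEPTH = 32  # NCQ limit per SATA port

class AhciPort:
    """Toy model of one AHCI port: a 32-slot hardware queue plus a
    software backlog, with every completion funneled through one
    'interrupt handler'. Illustrative only, not a real driver."""
    def __init__(self):
        self.hw_slots = {}        # slot number -> request tag
        self.backlog = deque()    # software queue on top of the 32 slots
        self.driver_interventions = 0
        self.completed = []

    def submit(self, tag):
        if len(self.hw_slots) < AHCI_QUEUE_DEPTH:
            free = next(s for s in range(AHCI_QUEUE_DEPTH)
                        if s not in self.hw_slots)
            self.hw_slots[free] = tag
        else:
            self.backlog.append(tag)  # queue full: driver must stage it

    def interrupt(self, slot):
        """One completion = one trip through the central handler:
        identify the slot, hand back the result, then refill the slot
        from the software backlog."""
        self.driver_interventions += 1
        self.completed.append(self.hw_slots.pop(slot))
        if self.backlog:                      # driver feeds the next one in
            self.hw_slots[slot] = self.backlog.popleft()

port = AhciPort()
for tag in range(100):       # 100 outstanding requests from many tasks
    port.submit(tag)
assert len(port.hw_slots) == 32 and len(port.backlog) == 68

while port.hw_slots:
    port.interrupt(next(iter(port.hw_slots)))

# Every completion passed through the one driver funnel.
assert port.driver_interventions == 100
assert len(port.completed) == 100
```

With 100 requests in flight, only 32 fit in the hardware queue; the other 68 sit in software, and the driver does per-request bookkeeping on all 100 completions.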
NVMe gets rid of that funnel. A drive supports up to 64K independent queues, each up to 64K entries deep. Each task on the machine can be given its own path into the NVMe drive and can read/write from the drive completely independently of anything else going on.
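The same kind of toy model shows the NVMe side (again, the class and method names are made up for illustration; the 64K queue-count and queue-depth limits are from the NVMe spec): each task owns a private submission/completion queue pair, so nothing ever waits behind some other task's traffic in a shared driver queue.

```python
NVME_MAX_QUEUES = 64 * 1024       # spec allows up to ~64K I/O queue pairs
NVME_MAX_QUEUE_DEPTH = 64 * 1024  # each up to ~64K entries deep

class NvmeQueuePair:
    """Toy per-task submission/completion queue pair. Once created, a
    task pushes commands and reaps completions without touching any
    shared driver-side queue. Illustrative model, not a real driver."""
    def __init__(self, qid, depth=1024):
        assert depth <= NVME_MAX_QUEUE_DEPTH
        self.qid = qid
        self.sq = []   # submission queue
        self.cq = []   # completion queue

    def submit(self, cmd):
        self.sq.append(cmd)   # real HW: write entry, ring per-queue doorbell

    def drive_process(self):
        # The drive services each queue independently of all the others.
        self.cq.extend(self.sq)
        self.sq.clear()

# Give each of 8 "tasks" its own private path into the drive.
queues = [NvmeQueuePair(qid) for qid in range(8)]
assert len(queues) <= NVME_MAX_QUEUES

for q in queues:
    for i in range(100):
        q.submit((q.qid, i))
for q in queues:
    q.drive_process()

# No queue ever contended with another; each reaped all its completions.
assert all(len(q.cq) == 100 for q in queues)
```

Contrast with the AHCI model: there is no shared 32-slot bottleneck and no central backlog to drain, just one ring per task.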
There's still some overhead setting up and tearing down queues, but it's far less. It's analogous to a process calling malloc() once to get some RAM, then doing whatever it wants with that RAM. Yeah, there's still malloc()/free() overhead, but at least the actual act of reading/writing the memory doesn't have to pass through a central driver on every access.
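The malloc() analogy in miniature (a Python sketch of the same amortization argument, with a made-up allocate() wrapper standing in for malloc): you pay the setup cost once, then thousands of accesses go straight to the buffer with no middleman counted per access.

```python
# Count trips through the "allocator" (standing in for queue setup).
calls = {"alloc": 0}

def allocate(n):
    """Hypothetical malloc() stand-in: the one-time setup cost."""
    calls["alloc"] += 1
    return bytearray(n)

buf = allocate(4096)      # one trip through the allocator...
for i in range(4096):     # ...then thousands of direct accesses,
    buf[i] = i & 0xFF     # none of which involve the allocator at all

assert calls["alloc"] == 1   # setup happened exactly once
assert buf[255] == 255
```

Same shape as NVMe: queue creation/teardown is the allocate/free, and the per-I/O doorbell writes are the direct buffer accesses.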
And I don't measure this shit by how quickly Windows boots. I measure it by doing things like iteratively processing centuries' worth of underwater recordings on a 16-core/32-thread Xeon server and profiling which parts run fast and slow. Staging files on NVMe for processing has greatly sped things up.