How did you test 2 GB/s? I have a Fusion-io ioDrive2 1.2 TB card in a 32-core Xeon E5 server with 1600 MHz RAM, and I only get 850 MB/s:
# for i in {1..4} ; do ( time sh -c "dd if=/dev/zero of=/fusionio1/ddtest.$i bs=1M count=4000 oflag=direct" ) ; done
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 4.90524 s, 855 MB/s
real 0m4.908s
user 0m0.006s
sys 0m0.785s
4000+0 records in
4000+0 records out
4194304000 bytes (4.2 GB) copied, 5.05399 s, 830 MB/s
The server has 128 GB of RAM and the card is in a PCIe 3.0 slot.
Also, the FusionIO card has 20 PB of write endurance, which the SanDisk card obviously doesn't need.
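One note on the test itself: that loop runs the four dd's back to back, so there's only ever a single 1 MB direct write in flight at a time. A quick variant (same paths and sizes as above, just backgrounded) shows whether a single stream is the limit; add up the four reported rates:
for i in {1..4} ; do
  dd if=/dev/zero of=/fusionio1/ddtest.$i bs=1M count=4000 oflag=direct &
done
wait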
1) What happens if you use bigger block sizes, say 16M? (Rough example commands for points 1-3 and 5 are sketched after this list.)
2) Which filesystem are you using? Could it be fragmented? Can you test against the raw device to rule out any filesystem influence?
3) Is dd running on the same NUMA node (CPU socket) that your ioDrive2's PCIe link is attached to?
4) To expand on 3, is there a chance of QPI saturation (traffic between CPU sockets)? Have you made sure all the software uses CPU-local RAM whenever possible rather than the other socket's RAM?
5) Are you sure all PCIe lanes are active? (Try lspci.)
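If it helps, rough versions of those commands look something like this (the /dev/fioa device name, the 41:00.0 PCI address, and node 0 are placeholders for illustration; the raw-device write in 2) destroys whatever is on the card, so only run it against a scratch device):
# 1) bigger blocks through the filesystem
dd if=/dev/zero of=/fusionio1/ddtest.big bs=16M count=256 oflag=direct
# 2) straight to the raw device, no filesystem (DESTRUCTIVE -- device name is a guess, check ls /dev/fio*)
dd if=/dev/zero of=/dev/fioa bs=16M count=256 oflag=direct
# 3) find which NUMA node the card sits on, then pin dd and its memory there
lspci | grep -i fusion                           # note the bus address, e.g. 41:00.0
cat /sys/bus/pci/devices/0000:41:00.0/numa_node  # address and node number are examples
numactl --cpunodebind=0 --membind=0 \
    dd if=/dev/zero of=/fusionio1/ddtest.1 bs=16M count=256 oflag=direct
# 5) negotiated link width/speed for that slot (compare LnkSta to LnkCap)
lspci -vv -s 41:00.0 | grep -iE 'lnkcap|lnksta'
If the raw-device number comes out much higher than the filesystem one, fragmentation or filesystem overhead is the first suspect.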
Also:
Did you low-level format it to 4K blocks, or is it still using 512-byte blocks?
Various options for the kernel module also have a significant effect, as do BIOS settings (C-states, etc.).
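To see what the card is currently formatted to without reformatting, something like this should work (the device name is again a guess; fio-status is Fusion-io's own utility from their tools package, if it's installed):
blockdev --getss /dev/fioa     # logical sector size as the block layer sees it
blockdev --getpbsz /dev/fioa   # physical block size
fio-status -a                  # iirc this also reports the formatted sector size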
I got 2.5-2.9 GB/sec (iirc) with a FusionIO ioDrive2 Duo. It has been more than a year so I do not remember all the details.
Weird; even on an old PCIe 2.0 Opteron machine I easily get 1.4 GB/s from an HGST NVMe card, and I've gotten 800 MB/s from a single 12 Gb/s SAS SSD. Something's probably wrong in your setup.
Nothing fancy, just multiple copies of large files at the filesystem level, making sure that the cache wasn't giving false results. The transfer rates and the I/O counters matched pretty well. This was on a vanilla Windows 10 workstation.
We run a bunch of these things, and regularly bump up against the PCIe bus limits. There's something going on with your setup.