A couple questions about that design. Is it worth paying the cost of an extra word in every slice, if strided slicing is rare in most programs? And when you wrap OS APIs that don't understand striding, do you have to use a different pointer type?
It's a subjective question, but given that A68's slices are already burdened with carrying their lower bound wherever they go, not much is lost by adding another piece of data to them. They are already quite fat, and get fatter with each dimension. I think the ability to take a column and just apply any sequence function to it is well worth it.
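To make that concrete, here is a rough sketch in Python rather than Algol 68. The `Slice` descriptor and the `row`/`column` helpers are names I made up for illustration, not anything from A68 itself; the point is only that once the descriptor carries a stride, a column is just another slice.

```python
from dataclasses import dataclass
from typing import Iterator, List


@dataclass
class Slice:
    """Hypothetical 1-D slice descriptor over shared storage."""
    data: List[int]  # backing storage, shared and never copied
    offset: int      # position of the first element in `data`
    lower: int       # lower bound, carried along as in A68
    length: int      # number of elements
    stride: int = 1  # the extra word: distance between consecutive elements

    def __getitem__(self, i: int) -> int:
        if not (self.lower <= i < self.lower + self.length):
            raise IndexError(i)
        return self.data[self.offset + (i - self.lower) * self.stride]

    def __iter__(self) -> Iterator[int]:
        for i in range(self.lower, self.lower + self.length):
            yield self[i]


def row(data: List[int], cols: int, r: int) -> Slice:
    """Row r (0-based) of a row-major matrix: contiguous, stride 1."""
    return Slice(data, r * cols, 1, cols, 1)


def column(data: List[int], rows: int, cols: int, c: int) -> Slice:
    """Column c (0-based) of a row-major matrix: stride equals the row width."""
    return Slice(data, c, 1, rows, cols)


matrix = [1, 2, 3,
          4, 5, 6]  # 2 x 3, row-major

col = column(matrix, 2, 3, 1)
print(list(col))  # the middle column
print(sum(col))   # any function that takes a sequence works unchanged
```

Nothing is copied here: `sum`, `max`, `sorted` and so on see the column through the same interface as a row, and the price is the one extra integer per descriptor.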
I don't think you could use OS APIs without boxing and unboxing data, but I think that's the case with every language except C.
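As a sketch of what that conversion looks like for a strided slice (again in Python, with `pack` and `write_strided` being hypothetical helpers of mine): the elements get copied into a contiguous buffer first, and only that buffer is handed to the OS call.

```python
import os


def pack(buf: bytes, offset: int, length: int, stride: int) -> bytes:
    """Copy a strided run of bytes into a fresh contiguous buffer."""
    return bytes(buf[offset + i * stride] for i in range(length))


def write_strided(fd: int, buf: bytes, offset: int, length: int, stride: int) -> int:
    """os.write only understands contiguous data, so pack before calling it."""
    return os.write(fd, pack(buf, offset, length, stride))


# Usage: send every third byte, starting at index 1, through a pipe.
r, w = os.pipe()
write_strided(w, b"abcdef", 1, 2, 3)
os.close(w)
print(os.read(r, 16))
os.close(r)
```

The copy is the cost of crossing the boundary. A slice with stride 1 could skip it and pass its storage straight through.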