*use the 24 bytes directly for 23 chars and a NULL terminator.* Ruby strings are...

iclelland · on Jan 4, 2012

Pardon?

The code is right there in the article -- in this case, for this specific kind of string, at the C level, they very clearly are null-terminated strings.

No, you don't get to see that at the Ruby level, it's all nicely abstracted away, but that is exactly what is happening inside the VM.

ComputerGuru · on Jan 4, 2012

Not really. RString operates in the same way std::string does - it has a character array and it has a member variable denoting the length.

It's not null-terminated. You can store a sequence of nulls and that will not affect the result of std::string::size().

In C, you'll be forgiven for thinking it was null terminated, because attempting to assign a std::string a value from a null-containing array of characters would terminate the copy upon reaching the null, but that's only because the original char array is null-terminated when read as a C string.

However, you can manually construct a std::string with \0 sequences in the middle and that will not terminate the string, nor affect the separate length calculation. The same applies for Ruby's RString.

So that was the reason they're not null-terminated. Now the reason why they technically are (i.e. the reason NULLs are stored at the end of the string) is for compatibility and optimization. At the cost of one byte per string (for the trailing \0), we get instant compatibility with non-RString/std::string functions. If a function needs a C string, we can just pass the internal pointer to the character array - no need to copy the string to a temporary buffer and append a null.

Therefore, while null-termination is absolutely NOT required when dealing with an exclusively counted-length implementation of C strings (a la RString, CString, std::string, etc.) if you can just pass the pointer and the length separately, it would be a ridiculously foolish optimization for a general string implementation to NOT have the option of directly exposing the underlying null-terminated string to any functions that need it, with the caveat that null-containing counted strings will obviously terminate sooner than expected.
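A rough Ruby-level sketch of that caveat, offered as an analogy rather than as what MRI does internally: `String#unpack` with the `Z*` directive reads bytes up to the first NUL, which approximates how a C function such as `strlen` would see the same buffer. The variable names are made up for illustration.

```ruby
# Counted view: the length is tracked separately, so embedded NULs count.
s = "hello\0world"
counted_length = s.bytesize          # 11

# C-string view (approximation): "Z*" stops at the first NUL byte,
# much as strlen() or printf("%s") would when handed the raw pointer.
c_view = s.unpack("Z*").first        # "hello"

puts counted_length                  # 11
puts c_view.bytesize                 # 5
```

The two numbers differ only because the string contains a NUL in the middle; for ordinary text both views agree, which is why exposing the internal pointer usually works.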

mbell · on Jan 5, 2012

You seem to be confusing C and C++: there is no std::string in C, and MRI is written in C.

A 'C string' is, by quasi-definition, a segment of memory that can be properly processed by the string functions in the C standard library, which require null termination.

>Now the reason why they technically are (i.e. the reason NULLs are stored at the end of the string) is for compatibility and optimization.

Really, it's just because they are C strings; that is, they use the C standard library string functions, and if you want to use those, you must null-terminate.

>Therefore, while null-termination is absolutely NOT required when dealing with an exclusively counted-length implementation of C strings (a la RString, CString, std::string, etc.)

None of those are implementations of "C strings"; they aren't even available for C.

The determination as to whether you're using null-terminated strings or not comes down to the string library you're using. If you're on C, you're probably using the C standard library and need to null-terminate your 'strings'. There really isn't much more to it than that.

ComputerGuru · on Jan 5, 2012

No, I'm perfectly well-versed in the differences between C and C++, having written in one or the other for a long time. A trivial look-alike implementation of std::string can be written in C, and would look a lot like the RString struct.

Your argument is actually, essentially, mine. The need to use the platform's string functions heavily swings (but does not force) the choice of null-terminating the RString members. As I mentioned, it would be really stupid but entirely possible to simply clone the non-null-terminated string into a temporary null-terminated char array every time you want to use a function that takes standard "C strings" if you really, truly, madly wanted to have an RString implementation that was one byte smaller to store. But that would be insane.

haileys · on Jan 4, 2012

Well I dunno how they could possibly be null terminated if you can stick a null in any old string:

    >> "hello\0world".length
    => 11

zokier · on Jan 4, 2012

That could be explained by Ruby escaping null characters automatically in the background, although I don't believe that to be the case.

dchest · on Jan 4, 2012

It seems like the length is stored inside RString's RBasic->flags:

    #define RSTRING_EMBED_LEN(str) \
         (long)((RBASIC(str)->flags >> RSTRING_EMBED_LEN_SHIFT) & \
                (RSTRING_EMBED_LEN_MASK >> RSTRING_EMBED_LEN_SHIFT))

dkarl · on Jan 5, 2012

>but as far as I can tell, Ruby doesn't rely on this directly

They may be null-terminated to make writing C extensions easier. If they weren't, you'd have to make a null-terminated copy every time you wanted to pass the bytes to a function expecting a null-terminated string, which most C APIs do.
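To make the cost of the alternative concrete, here is a small Ruby sketch of a hypothetical counted string that keeps no trailing NUL. The helper `to_c_buffer` is invented for this illustration and is not part of any Ruby API; it relies on `Array#pack` with `Z*`, which appends a single NUL byte.

```ruby
# Hypothetical counted string stored WITHOUT a trailing NUL
# (an assumption for this sketch, not how MRI stores strings).
raw = "hello"                        # 5 bytes, length tracked separately

# Made-up helper: build a NUL-terminated copy suitable for a C API.
def to_c_buffer(str)
  [str].pack("Z*")                   # "Z*" appends one "\0" byte
end

buf = to_c_buffer(raw)
puts buf.bytesize                    # 6: one byte longer than raw
puts buf.equal?(raw)                 # false: a new object for every call
```

Every call that crosses into C would pay for that extra allocation and copy, which is the overhead a stored terminator avoids.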

throw_away · on Jan 4, 2012

They probably are when they are inlined like this, else, how would you know the length?

haileys · on Jan 4, 2012

The length is stored in a flag.

include/ruby/ruby.h:

    #define RSTRING_EMBED_LEN(str) \
         (long)((RBASIC(str)->flags >> RSTRING_EMBED_LEN_SHIFT) & \
                (RSTRING_EMBED_LEN_MASK >> RSTRING_EMBED_LEN_SHIFT))

    #define RSTRING_LEN(str) \
        (!(RBASIC(str)->flags & RSTRING_NOEMBED) ? \
         RSTRING_EMBED_LEN(str) : \
         RSTRING(str)->as.heap.len)
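As a hedged illustration of what those two macros compute, the same bit arithmetic can be sketched in Ruby. The constant values below (a shift of 14, a 5-bit mask, and bit 13 for the no-embed flag) are assumptions chosen for the example and are not quoted in this thread; the real definitions live in include/ruby/ruby.h. The method names are likewise made up.

```ruby
# Illustrative constants only; assumed values, see include/ruby/ruby.h.
EMBED_LEN_SHIFT = 14
EMBED_LEN_MASK  = 0b11111 << EMBED_LEN_SHIFT   # 5 bits, so lengths 0..31
NOEMBED         = 1 << 13                      # set when the bytes live on the heap

# Mirrors RSTRING_EMBED_LEN: shift the flags down, then keep the length bits.
def embed_len(flags)
  (flags >> EMBED_LEN_SHIFT) & (EMBED_LEN_MASK >> EMBED_LEN_SHIFT)
end

# Mirrors RSTRING_LEN: embedded strings read the flags,
# heap strings read a separate length field.
def string_len(flags, heap_len)
  (flags & NOEMBED).zero? ? embed_len(flags) : heap_len
end

embedded_flags = (11 << EMBED_LEN_SHIFT) | 0b111   # 11-byte string, unrelated low bits set
heap_flags     = NOEMBED

puts string_len(embedded_flags, nil)   # 11
puts string_len(heap_flags, 40)        # 40
```

Either way the length comes from stored metadata rather than from scanning for a terminator, which is why "hello\0world".length can report 11.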