On the weirdness of strings

classic Classic list List threaded Threaded
13 messages Options
Reply | Threaded
Open this post in threaded view
|

On the weirdness of strings

Gareth Smith
Hi rust-dev,

On Github, PCWalton said
(https://github.com/mozilla/rust/issues/3222#issuecomment-8306828):
 > The reason why |@str| cannot be dereferenced is that |str| is
dynamically sized. If we allowed strings to be copied to the stack like
ints can, then we'd have to add dynamic allocas and that would break the
segmented stacks model. Other than that, |@int| and |@str| are intended
to be conceptually very similar.

How about having str be represented internally - but not in the type
system - as a pointer to the actual string data. Copying a str would
copy the pointed-to string data in addition to the pointer, so str would
not be implicitly copyable.

The advantage of such an arrangement is that str would be fixed size and
~str/@str/&str would be consistent with other types.

Is this actually a reasonable system?

Thanks
Gareth
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/rust-dev/attachments/20120905/fa4bedf7/attachment.html>

Reply | Threaded
Open this post in threaded view
|

On the weirdness of strings

Patrick Walton
On 9/5/12 1:10 PM, Gareth Smith wrote:
> How about having str be represented internally - but not in the type
> system - as a pointer to the actual string data. Copying a str would
> copy the pointed-to string data in addition to the pointer, so str would
> not be implicitly copyable.

Where would the contents be stored?

Patrick


Reply | Threaded
Open this post in threaded view
|

On the weirdness of strings

Graydon Hoare
In reply to this post by Gareth Smith
On 12-09-05 1:10 PM, Gareth Smith wrote:

> The advantage of such an arrangement is that str would be fixed size and
> ~str/@str/&str would be consistent with other types.
>
> Is this actually a reasonable system?

That's what they used to be. People passed them by-value too much and we
made too many copies, and balked at the double-indirection implied by
passing around an &str or such. That also doesn't handle the fixed-size,
constant-memory and substring-slice use-cases.

The current scheme is a very delicate balance between a large number of
pressures; I think it's about the best we're going to get.

-Graydon


Reply | Threaded
Open this post in threaded view
|

On the weirdness of strings

Gareth Smith
In reply to this post by Patrick Walton
On 05/09/12 21:13, Patrick Walton wrote:
> On 9/5/12 1:10 PM, Gareth Smith wrote:
>> How about having str be represented internally - but not in the type
>> system - as a pointer to the actual string data. Copying a str would
>> copy the pointed-to string data in addition to the pointer, so str would
>> not be implicitly copyable.
>
> Where would the contents be stored?
>
On the the non-task-local (AKA shared?) heap.


Graydon wrote:
 >> Is this actually a reasonable system?
 > That's what they used to be. People passed them by-value too much and
we made too many copies, and balked at the double-indirection implied by
passing around an &str or such.

I was one of those passing them by value too much. I did it because it
seemed like the idiomatic thing to do. Even rustc did it - that made it
seem legit. It no longer seems like the idiomatic thing to do because
the compiler emits a warning about it unless it is done explicitly, so I
try to avoid it. I think that documentation and compiler warnings will
determine typical use.

 > That also doesn't handle the fixed-size, constant-memory and
substring-slice use-cases.

Fair enough.

 > The current scheme is a very delicate balance between a large number
of pressures; I think it's about the best we're going to get.

The problem with rust's strings is that any rust program I write seems
to be more complicated because of features that strings have that 99% of
the time I will not use. I have to pay for safe concurrency even though
it looks like I will barely be using it. Ditto with fixed size and
constant memory strings.

You created a nice language for programs that are mostly non-concurrent
(regardless of how nice it is for highly concurrent programs), so I and
others are going to try using it for that :) ... and sometimes wondering
why the strings are so hard to use.

I don't know what the fix is, but I think this issue is going to keep
coming up, because I think *for some people* there is a better balance
to be had.

Thanks
Gareth

Reply | Threaded
Open this post in threaded view
|

On the weirdness of strings

Patrick Walton
On 9/6/12 12:08 PM, Gareth Smith wrote:
> You created a nice language for programs that are mostly non-concurrent
> (regardless of how nice it is for highly concurrent programs), so I and
> others are going to try using it for that :) ... and sometimes wondering
> why the strings are so hard to use.
>
> I don't know what the fix is, but I think this issue is going to keep
> coming up, because I think *for some people* there is a better balance
> to be had.

What is wrong with just using ~str?

Patrick


Reply | Threaded
Open this post in threaded view
|

On the weirdness of strings

Graydon Hoare
In reply to this post by Gareth Smith
On 12-09-06 12:08 PM, Gareth Smith wrote:
> I was one of those passing them by value too much. I did it because it
> seemed like the idiomatic thing to do. Even rustc did it - that made it
> seem legit. It no longer seems like the idiomatic thing to do because
> the compiler emits a warning about it unless it is done explicitly, so I
> try to avoid it. I think that documentation and compiler warnings will
> determine typical use.

Right. So then, typical use API-use would be &str, access to the bytes
would be double-indirect, and we'd be unable to do any constant-string
or substring optimizations, correct?

>  > The current scheme is a very delicate balance between a large number
> of pressures; I think it's about the best we're going to get.
>
> The problem with rust's strings is that any rust program I write seems
> to be more complicated because of features that strings have that 99% of
> the time I will not use. I have to pay for safe concurrency even though
> it looks like I will barely be using it. Ditto with fixed size and
> constant memory strings.

Every time you write "foo", it is a constant-memory string; and in the
near-ish future, all slicing (hence substring-extraction) operations in
core::str (from which a great many derived strings originate) will
happen via borrowing, not allocating.

These are actually really important cases. Important enough that most
other languages dedicate built-in machinery to handle them non-uniformly
as well: substrings often pin the outer string alive in the GC heap (or
refcount it independently), constants often get their own pool and/or
separate representations, often all sorts of optimization apply too,
like in-place concat, doubling-growth, inline storage for small strings,
etc. etc.

I'm not trying to be a jerk. In C, a string is just a char* that you can
move around at as-near-as-possible zero cost, like an integer. Better
than just that: since it points to constant memory, the compiler can see
through it and boil off bounds checking or indexing operations
(extracting the element-bytes as sub constants). It's very cheap, and
sets people's expectations for "how fast it can be done", but it's not
safe. We want to be safe, and as close-to-as-fast as we can be while
being safe. So here's what we tried:

   - Implementation #1: all strings were shared, refcounted, there's a
     magic refcount that means "constant". Every time you copy one you
     have to check both the magic refcount and the non-magic one, and
     adjust it. Costly. Also meant you could never send them over
     channels, since that'd require atomic refcounting. We don't want
     that.

   - Implementation #2: all strings were unique. Now you can send them
     over channels, but must double-indirect to share them, and "foo"
     causes a memory allocation, where it _should_ just be a pointer
     to constant memory. No constants or substrings.

It's difficult to think of other practical versions that don't involve
either copying or refcounting all the time, even on constant strings,
which always puts is back into the same place you're suggesting: &str
for most APIs, and double-indirect, and losing all the constant-string
and substring optimization opportunities.

> You created a nice language for programs that are mostly non-concurrent
> (regardless of how nice it is for highly concurrent programs), so I and
> others are going to try using it for that :) ... and sometimes wondering
> why the strings are so hard to use.

Yeah, I'm .. sympathetic, I do want them to be "easy", or as easy as
they can be; can you describe _exactly_ what the difficulties you're
having are? Not just that they're "hard" or "weird", but like, a
use-case that you keep doing, that you want to be able to stop-doing?

Also note: many of our APIs (core::str for example) are still far more
~str-centric than they ought to be longer-term; we did a bulk-conversion
from str to ~str, and need to go through and fully convert over to &str
whenever possible.

-Graydon


Reply | Threaded
Open this post in threaded view
|

On the weirdness of strings

Gareth Smith
Graydon,

Firstly I understand you know a lot more about this than I do, and I have
read and appreciated everything that you wrote.

On 6 September 2012 22:42, Graydon Hoare <graydon at mozilla.com> wrote:


> Right. So then, typical use API-use would be &str, access to the bytes
> would be double-indirect, and we'd be unable to do any constant-string or
> substring optimizations, correct?
>
>
OK then, that scheme would suck.

So how about a slightly different scheme where a str is (internally) a
tuple of (pointer-to-char-data-on-the-unique-heap, start-index, end-index)
- i.e. 3 words - pretty cheap to shallow copy - right?.

A str would be a non-implicitly copyable type.

A ~str/@str/&str would be the same as any other ~/@/& pointers (this is
what I am aiming for).

A constant string would be a &static/str, and start-index and end-index
would be the start and end of the character data (which is stored in
constant memory).

A substring (AKA slice) would point to the same character data as its
parent string, but have different start/end indexes. I believe that the
region system would keep a substring's character data alive for as long as
its parent lives (right?).

Idiomatic usage would be to pass most string parameters as &str.

Apart from fixed-length strings, which could perhaps be treated specially,
is this a reasonable way to accomplish the objective of making string
pointers the same as other pointers (apart from vector pointers, which in
are in another kettle of fish).

Thanks
Gareth
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <http://mail.mozilla.org/pipermail/rust-dev/attachments/20120907/8709581c/attachment.html>

Reply | Threaded
Open this post in threaded view
|

On the weirdness of strings

Patrick Walton
On 9/7/12 10:02 AM, Gareth Smith wrote:
> Apart from fixed-length strings, which could perhaps be treated
> specially, is this a reasonable way to accomplish the objective of
> making string pointers the same as other pointers (apart from vector
> pointers, which in are in another kettle of fish).

I'm opposed to this. Having the string data for @str actually be on the
exchange heap is unintuitive. Requiring two allocations (one on the
exchange heap, one on the task heap) to reference count a string is
inefficient. The extra layer of indirection for &str also imposes a tax.

Patrick


Reply | Threaded
Open this post in threaded view
|

On the weirdness of strings

Graydon Hoare
In reply to this post by Gareth Smith
On 07/09/2012 10:02 AM, Gareth Smith wrote:

> Firstly I understand you know a lot more about this than I do, and I
> have read and appreciated everything that you wrote.

Thanks! I appreciate your patience in discussing it.

> A substring (AKA slice) would point to the same character data as its
> parent string, but have different start/end indexes. I believe that the
> region system would keep a substring's character data alive for as long
> as its parent lives (right?).

Yes, though you might as well just advance the pointer and subtract the
length a bit, make it a 2-word representation. So long as the data is
pinned, you don't need to point to the head of the string buffer
(indeed, in a read-only data section, often multiple strings get merged
together).

> Idiomatic usage would be to pass most string parameters as &str.

Right, so this makes it double-indirect to get at the bytes most of the
time.

> Apart from fixed-length strings, which could perhaps be treated
> specially, is this a reasonable way to accomplish the objective of
> making string pointers the same as other pointers (apart from vector
> pointers, which in are in another kettle of fish).

Well ... it lets you do substrings (as does a 2-word representation,
ptr+length), it's just double-indirect most of the time, because you're
usually passing around &str. The only way this differs from what we're
doing now is that we are currently single-indirect: &str is actually
that (ptr,len) pair itself, not a pointer-to-it, which is the closest we
can get to full-speed like C.

That's really all we've done: merged the layer of indirection implied-by
/ described-by a sigil -- necessary for control of allocation, copying
and ownership behavior of the underlying buffer -- and the
layer-of-indirection necessary for working with variable-sized-thingies
in containers that need to be fixed-size (other structures, stack
frames, etc.) So we wind up with only one indirection, not two. We
didn't think uniformity of representation was sufficient to justify
imposing the systemic double-indirection cost, that's all.

In case you're concerned that this makes str (and []) the sole warts on
a type system that is otherwise uniform in terms of treating non-sigil'd
type names as allocatable, interior-value things, keep in mind that it's
not. Traits-as-types (i.e. when used as fn(x:Trait), not as bounds on
type parameters) are currently in the same boat as strings used to be a
while back (implicitly @), but they're changing so that they can only be
represented through sigils; again so that we can gain-back control over
allocation (not always have to box, be able to send, handle constants,
perform move-on-self, etc). Likewise closures (fn() types) are not
really representable as interior values, since they close over an
environment (or no environment) in a variety of ways that the sigils
control.

Sorry to sound so stubborn on this; again, can you perhaps point to
specific awkwardnesses in the current scheme? Maybe we can mitigate the
difficulty by changing evaluation / borrowing / constant /
pattern-matching rules a touch.

-Gradon


Reply | Threaded
Open this post in threaded view
|

On the weirdness of strings

Patrick Walton
In reply to this post by Gareth Smith
If you prefer to think about &str, ~str, and @str as separate types, an
analogy to C++ might be helpful:

&T is like T *.
~T is like std::auto_ptr<T> in C++03 or std::unique_ptr<T> in C++0x.
@T is like std::shared_ptr<T>.

&str is like char *.
~str is like std::string.
@str is like CString in Microsoft's ATL (reference counted).

&[T] is like a (T *, size_t length) pair.
~[T] is like std::vector<T>.
@[T] is like std::shared_ptr<std::vector<T>>, but without the extra
level of indirection.

Essentially, it's like C++ with several of the most common smart pointer
types made type-safe and built into the language.

Patrick


Reply | Threaded
Open this post in threaded view
|

On the weirdness of strings

Gareth Smith
In reply to this post by Graydon Hoare
On 07/09/12 18:25, Graydon Hoare wrote:
>
> Sorry to sound so stubborn on this; again, can you perhaps point to
> specific awkwardnesses in the current scheme? Maybe we can mitigate
> the difficulty by changing evaluation / borrowing / constant /
> pattern-matching rules a touch.

Here are a couple of things that have confused me:

1. AFAIK you can't actually create values of @str. It seems like only
~str and &str are the blessed string types. Sinful be you if your data
has lifetimes that do not follow a stack discipline. OK, so @~str seems
to be used instead of @str, but it is ugly to keep reading that and
surely it is not efficient.

2. I start off passing a string to a particular function as a &str. But
then later I need to change the function so that it saves the string in
a structure that the function returns (the structure will then get
stored in the task-heap or the unique-heap).

I have to ask myself if I should copy the string in order to save it, or
should I pass in an @str to avoid copies? I imagine that copying the
string is expensive (it may be large), so I change the function to take
@str instead of &str. I then have to change all the calling functions to
convert their &str to @str and pass it in (but I don't want to do this
because it seems just like as many copies will happen), or I change the
caller to also take @str instead of &str - and the callers caller, and
so on. Then some of my functions take &str and some take @str, depending
on their implementation. Maybe this isn't weird. Maybe its good thing.
Maybe I am doing it wrong. But to me it feels weird that the
implementation leaks.

This is not really string specific though to be fair, and I don't see
how it can be fixed without ubiquitous global garbage collection.

Thanks
Gareth

Reply | Threaded
Open this post in threaded view
|

On the weirdness of strings

Patrick Walton
On 9/7/12 2:54 PM, Gareth Smith wrote:
> 1. AFAIK you can't actually create values of @str. It seems like only
> ~str and &str are the blessed string types. Sinful be you if your data
> has lifetimes that do not follow a stack discipline. OK, so @~str seems
> to be used instead of @str, but it is ugly to keep reading that and
> surely it is not efficient.

You should be able to create values of @str -- @"Hello world!" works for
me. The remaining @~str types are due to legacy code that we haven't
ported over yet -- sorry about being slow on that...

Actually, I'm beginning to think that we should overload string
literals, so you can write `let x: @str = "Hello world!"` and it should
just work.

> 2. I start off passing a string to a particular function as a &str. But
> then later I need to change the function so that it saves the string in
> a structure that the function returns (the structure will then get
> stored in the task-heap or the unique-heap).
>
> I have to ask myself if I should copy the string in order to save it, or
> should I pass in an @str to avoid copies? I imagine that copying the
> string is expensive (it may be large), so I change the function to take
> @str instead of &str. I then have to change all the calling functions to
> convert their &str to @str and pass it in (but I don't want to do this
> because it seems just like as many copies will happen), or I change the
> caller to also take @str instead of &str - and the callers caller, and
> so on. Then some of my functions take &str and some take @str, depending
> on their implementation. Maybe this isn't weird. Maybe its good thing.
> Maybe I am doing it wrong. But to me it feels weird that the
> implementation leaks.
>
> This is not really string specific though to be fair, and I don't see
> how it can be fixed without ubiquitous global garbage collection.

Yeah :( You're 100% right that this is a bummer. Essentially it's, as
you say, the price we pay for not having pervasive GC.

C++ code feels the same pain with the STL types and the various strings
that libraries tend to implement; much of the reason we standardized on
these "smart pointers" is so that libraries don't write their own string
types and exacerbate the problem.

If GC-induced latency isn't critical to you, feel free to write your
functions to take @str pervasively. There's no need to feel guilty doing
so :) The type is there so that code that isn't micromanaging memory can
be written conveniently. As standard library writers, we write
everything using &str and ~str because we have to cater to applications
that don't want to pay the GC tax, but we definitely don't want everyone
to write that way.

The downside of @str is that support for it isn't great in the standard
library at the moment, but please feel free to file bugs -- it should be
a first-class, convenient, well-supported type.

Patrick


Reply | Threaded
Open this post in threaded view
|

On the weirdness of strings

Graydon Hoare
In reply to this post by Gareth Smith
On 12-09-07 2:54 PM, Gareth Smith wrote:
> Here are a couple of things that have confused me:
>
> 1. AFAIK you can't actually create values of @str. It seems like only
> ~str and &str are the blessed string types. Sinful be you if your data
> has lifetimes that do not follow a stack discipline. OK, so @~str seems
> to be used instead of @str, but it is ugly to keep reading that and
> surely it is not efficient.

Oh no, that's purely an artifact of how we did the bulk conversion from
str to ~str. Literally, it was a giant search/replace with all energy
devoted to keeping-the-compiler-working. @str is totally legit.

> I have to ask myself if I should copy the string in order to save it, or
> should I pass in an @str to avoid copies?

It depends if ownership-sharing is part of the contract the function is
exposing, or an incident of its implementation. If the former -- if the
whole point is to give it something of shared ownership -- then @str
makes sense. Otherwise &str, as the function taking a local copy is
incidental.

We do make programmers think about ownership, lifetime, sharing, and
copying, at least enough to express their intention. Specifically
because the programmer gets hit with ubiquitous penalties
(everything-unique, everything-copied, everything-GC'ed, or similar) if
we try to save them from thinking about it. And then they profile the
program, see it's slower than C++, and go back to using that. We're
aiming to stay close enough to the C++ performance envelope that this
isn't an urge (whatever else may be).

> This is not really string specific though to be fair

No, not string-specific at all. Same issue happens all through our
memory model.

> how it can be fixed without ubiquitous global garbage collection.

It's not part of our design criteria to fix "having to think about
lifetime, ownership, copies and sharing". That's intentional.

-Graydon