RFC: Syntax for "raw" string literals


Kevin Ballard-2
One feature common to many programming languages that Rust lacks is "raw" string literals. Specifically, these are string literals that don't interpret backslash escapes. There are three obvious applications at the moment: regular expressions, Windows file paths, and format!() strings that want to embed { and } chars. I'm sure there are more as well, such as large string literals that contain things like HTML text.

I took a look at 3 programming languages to see what solutions they had: D, C++11, and Python. I've reproduced their syntax below, plus one more custom syntax, along with pros & cons. I'm hoping we can come up with a syntax that makes sense for Rust.

## Python syntax:

Python supports an "r" or "R" prefix on any string literal (both "short" strings, delimited with a single quote character, and "long" strings, delimited with three). The prefix denotes a "raw string" and has the effect of disabling backslash escapes within the string. For the most part. It actually gets a bit weird: if a sequence of backslashes of odd length occurs immediately before a quote (of the appropriate quote type for the string), then the quote is considered to be escaped, but the backslashes are left in the string. This means r"foo\"" evaluates to the string `foo\"`, and similarly r"foo\\\"" is `foo\\\"`, but r"foo\\" is merely the string `foo\\`. A consequence is that a raw string cannot end in an odd number of backslashes.

Pros:
* Simple syntax
* Allows for embedding the closing quote character in the raw string

Cons:
* Handling of backslashes is very bizarre, and the closing quote character can only be embedded if you want to have a backslash before it.
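
The backslash rule is easy to check in a Python interpreter. A minimal sketch using the three literals from the paragraph above (the variable names are only for illustration):

```python
# The three raw literals discussed above. Each comment describes how the
# tokenizer treats the run of backslashes that precedes a quote.
one_backslash = r"foo\""         # odd run: quote is escaped, backslash is kept
three_backslashes = r"foo\\\""   # odd run of three: same rule applies
two_backslashes = r"foo\\"       # even run: the quote closes the literal

for value in (one_backslash, three_backslashes, two_backslashes):
    # Print each value next to its length to make the kept backslashes visible.
    print(value, len(value))
```

Comparing each value against an ordinary escaped literal (for example `'foo\\"'` for the first one) shows that the backslashes really do stay in the string.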

## C++11 syntax:

C++11 allows for raw strings using a sequence of the form R"seq(raw text)seq". In this construct, `seq` is any sequence of zero to 16 characters, excluding space, (, ), \, and the control characters \t, \v, \f, and \n. The simplest form looks like R"(raw text)", which allows for anything in the raw text except for the sequence `)"`. More generally, the raw text may contain anything except `)seq"`, so the delimiter sequence can be chosen to avoid anything in the represented text, which allows for constructing a raw string containing any sequence at all.

Pros:
* Allows for embedding any character at all (representable in the source file encoding), including the closing quote.
* Reasonably straightforward

Cons:
* Syntax is slightly complicated
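
To make the scanning rule concrete, here is a rough Python sketch of how a lexer might read a C++11-style raw literal. The function name `scan_cpp11_raw` is hypothetical, and the forbidden-character set and 16-character cap follow my reading of the C++11 grammar, so treat them as assumptions rather than a reference implementation.

```python
# Characters assumed to be disallowed in the delimiter sequence.
FORBIDDEN_IN_DELIMITER = set(" ()\\\t\v\f\n\r")
MAX_DELIMITER_LENGTH = 16  # assumed limit on the delimiter length


def scan_cpp11_raw(src):
    """Return the value of the R"seq(...)seq" literal that starts at src[0]."""
    if not src.startswith('R"'):
        raise ValueError("not a raw string literal")
    # Everything between the opening quote and the first '(' is the delimiter.
    open_paren = src.find("(", 2)
    if open_paren == -1:
        raise ValueError("missing '(' after the delimiter")
    seq = src[2:open_paren]
    if len(seq) > MAX_DELIMITER_LENGTH or FORBIDDEN_IN_DELIMITER & set(seq):
        raise ValueError("invalid delimiter sequence")
    # The literal ends at the first occurrence of )seq" and nowhere else.
    closer = ")" + seq + '"'
    end = src.find(closer, open_paren + 1)
    if end == -1:
        raise ValueError("unterminated raw string literal")
    return src[open_paren + 1:end]


print(scan_cpp11_raw('R"(raw text)"'))
print(scan_cpp11_raw('R"x(a)"b)x"'))
```

The second call shows why the delimiter matters: the text contains `)"`, so the empty delimiter would end the literal early, while the delimiter `x` does not.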

## D syntax:

D supports three different forms of raw strings. The first two are similar, being r"raw text" and `raw text`. Besides the choice of delimiters, they behave identically, in that the raw text may contain anything except for the appropriate quote character. The third syntax is a slightly more complicated form of C++11's syntax, and is called a delimited string. It takes two forms.

The first looks like q"(raw text)" where the ( may be any non-identifier non-whitespace character. If the character is one of [(<{ then it is a "nesting delimiter", and the close delimiter must be the matching ])>} character, otherwise the close delimiter is the same as the open. Furthermore, nesting delimiters do exactly what their name says: they nest. If the nesting delimiter is (), then any ( in the raw text must be balanced with a ) in the raw text. In other words, q"(foo(bar))" evaluates to "foo(bar)", but q"(foo(bar)" and q"(foobar))" are both illegal.
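
The nesting rule can be sketched as a small depth counter. The following Python is only an illustration of the behaviour described above, not D's actual lexer; `scan_d_delimited` is a hypothetical name, and identifier delimiters (the heredoc form) are deliberately rejected.

```python
# Opening delimiters that nest, mapped to their closing partners.
NESTING_PAIRS = {"(": ")", "[": "]", "<": ">", "{": "}"}


def scan_d_delimited(src):
    """Return the value of a q"(...)"-style literal that starts at src[0]."""
    if not src.startswith('q"') or len(src) < 3:
        raise ValueError("not a delimited string")
    opener = src[2]
    if opener.isalnum() or opener == "_" or opener.isspace():
        raise ValueError("identifier delimiters are not handled in this sketch")
    nesting = opener in NESTING_PAIRS
    closer = NESTING_PAIRS.get(opener, opener)
    depth = 1
    for i in range(3, len(src)):
        ch = src[i]
        if nesting and ch == opener:
            depth += 1  # a nested opener must be balanced later
        elif ch == closer:
            depth -= 1
            if depth == 0:
                # The outermost closer must be followed directly by a quote.
                if src[i + 1:i + 2] != '"':
                    raise ValueError("closing delimiter not followed by a quote")
                return src[3:i]
    raise ValueError("unterminated delimited string")


print(scan_d_delimited('q"(foo(bar))"'))
```

With this sketch, the two illegal examples from the paragraph above both raise `ValueError`: the first never returns to depth zero, and the second reaches depth zero one character too early.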

The second uses any identifier as the delimiter. In this case, the identifier must immediately be followed by a newline, and in order to close the string, the close delimiter must be preceded by a newline. This looks like

q"delim
this is some raw text
delim"

It's essentially a heredoc. Note that the first newline is not part of the string, but the final newline is, so this evaluates to "this is some raw text\n".
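
A short Python sketch of the heredoc form may help pin down which newlines end up in the value. The name `scan_d_heredoc` is hypothetical, and the sketch assumes ASCII identifiers only.

```python
import re

# Matches q" followed by an identifier and the newline that must follow it.
HEREDOC_OPEN = re.compile(r'q"([A-Za-z_][A-Za-z0-9_]*)\n')


def scan_d_heredoc(src):
    """Return the value of a q"delim ... delim" literal that starts at src[0]."""
    match = HEREDOC_OPEN.match(src)
    if match is None:
        raise ValueError("not an identifier-delimited string")
    delim = match.group(1)
    body_start = match.end()
    # The closing delimiter must be preceded by a newline. Searching from the
    # opening newline lets an empty body (delimiter on the next line) work too.
    closer = "\n" + delim + '"'
    end = src.find(closer, body_start - 1)
    if end == -1:
        raise ValueError("unterminated heredoc")
    # The first newline is excluded; the newline before the closer is kept.
    return src[body_start:end + 1]


print(repr(scan_d_heredoc('q"delim\nthis is some raw text\ndelim"')))
```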

Pros:
* Flexible
* Allows for constructing a raw string that contains any desired sequence of characters (representable in the source file's encoding)

Cons:
* Overly complicated

## Custom syntax

There's another approach that none of these three languages takes, which is to merely allow for doubling up the quote character in order to embed a quote. This would look like R"raw string literal ""with embedded quotes"".", which becomes the string `raw string literal "with embedded quotes".` (trailing period included).

Pros:
* Very simple
* Allows for embedding the close quote character, and therefore, any character (representable in the source file encoding)

Cons:
* Slightly odd to read
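
For comparison with the other schemes, a doubled-quote literal can be decoded with a single pass. This is a minimal Python sketch under the assumption that the literal uses the R"..." spelling proposed here; `scan_doubled_quote_raw` is a made-up name.

```python
def scan_doubled_quote_raw(src):
    """Return the value of an R"..." literal in which "" stands for one quote."""
    if not src.startswith('R"'):
        raise ValueError("not a raw string literal")
    out = []
    i = 2
    while i < len(src):
        if src[i] == '"':
            if src[i + 1:i + 2] == '"':
                # A doubled quote contributes one quote to the value.
                out.append('"')
                i += 2
                continue
            # A lone quote ends the literal.
            return "".join(out)
        out.append(src[i])
        i += 1
    raise ValueError("unterminated raw string literal")


print(scan_doubled_quote_raw('R"say ""hi"""'))
```

Note that the closing quote can only be recognised by looking one character ahead, which is part of why runs of quotes are hard to read in this syntax.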

## Conclusion

Of the three existing syntaxes examined here, I think C++11's is the best. It ties with D's syntax for being the most powerful, but is simpler than D's. The custom syntax is just as powerful, though. The benefit of the C++11 syntax over the custom syntax is that it's slightly easier to read, as the raw text has a one-to-one mapping with the resulting string. The custom syntax is a bit more confusing to read, especially if you want to embed multiple quotes. As a pathological case, let's try representing a Python triple-quoted docstring using both syntaxes:

C++11: R"("""this is a python docstring""")"
Custom: R"""""""this is a python docstring"""""""
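
The two spellings above can also be generated mechanically, which makes the readability difference easy to reproduce for other inputs. A small Python sketch follows; the function names are hypothetical, and the choice of `x` repeated n times as the delimiter is just one simple strategy.

```python
def to_cpp11_raw(text):
    """Spell text as a C++11-style raw literal, picking a safe delimiter."""
    for n in range(17):  # delimiters of length 0 through 16
        seq = "x" * n
        # The delimiter is safe as long as )seq" does not occur in the text.
        if ")" + seq + '"' not in text:
            return 'R"' + seq + "(" + text + ")" + seq + '"'
    raise ValueError("no suitable delimiter found by this simple strategy")


def to_doubled_raw(text):
    """Spell text in the custom syntax by doubling every embedded quote."""
    return 'R"' + text.replace('"', '""') + '"'


docstring = '"""this is a python docstring"""'
print("C++11: ", to_cpp11_raw(docstring))
print("Custom:", to_doubled_raw(docstring))
```

In the C++11 form the text appears verbatim between the parentheses, while in the custom form every quote in the text is rewritten.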

Based on this examination, I'm leaning towards saying Rust should support C++11's raw string literal syntax.

I welcome any comments, criticisms, or suggestions.

-Kevin
_______________________________________________
Rust-dev mailing list
[hidden email]
https://mail.mozilla.org/listinfo/rust-dev

Re: RFC: Syntax for "raw" string literals

Oren Ben-Kiki
Just to make sure - how does the C++ syntax behave in the presence of line breaks? Specifically, what does it do with leading (and trailing) white space of each line? My guess is that they would be included in the string, is that correct?

At any rate, having some sort of here documents would be very nice. The C++ syntax is reasonable, though I really don't have a strong preference here. It might be more Rust-ish to use a macro notation instead: str!(delimiter"....."delimiter), or something like that.

BTW, I found myself creating (in several languages) an "unindent" string function that would (1) if the string starts with a line break, remove it; (2) remove the leading white space of the 1st line from all the lines. Applying this to "here documents" allows indenting them together with the code that includes them. In Rust, the downside of this approach is that the result isn't &'static any more... Not that this warrants making such complex functionality a built-in of the syntax, of course.
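
A rough Python sketch of the kind of "unindent" helper described above. The name `unindent` and the handling of lines that lack the first line's indentation (they are left unchanged) are assumptions, not Oren's actual code.

```python
def unindent(text):
    """Drop a leading line break, then strip the first line's indentation
    from every line that starts with it."""
    # Step 1: remove a single leading line break, if present.
    if text.startswith("\r\n"):
        text = text[2:]
    elif text.startswith("\n"):
        text = text[1:]
    lines = text.split("\n")
    # Step 2: the indentation of the first line is the prefix to remove.
    first = lines[0]
    prefix = first[:len(first) - len(first.lstrip(" \t"))]
    return "\n".join(
        line[len(prefix):] if line.startswith(prefix) else line
        for line in lines
    )


snippet = """
    <ul>
      <li>item</li>
    </ul>"""
print(unindent(snippet))
```

Because the result is computed at run time, it is a new string rather than a literal, which is the `&'static` downside mentioned above.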

Oren.


Re: RFC: Syntax for "raw" string literals

Kevin Ballard-2
On Sep 19, 2013, at 1:56 PM, Oren Ben-Kiki <[hidden email]> wrote:

> Just to make sure - how does the C++ syntax behave in the presence of line breaks? Specifically, what does it do with leading (and trailing) white space of each line? My guess is that they would be included in the string, is that correct?

It includes every single character that occurs in the source between the delimiters. So

        cout << R"(this is
        a string");

will print "this is", newline, horizontal tab, "a string".

> At any rate, having some sort of here documents would be very nice. The C++ syntax is reasonable, though I really don't have a strong preference here. It might be more Rust-ish to use a macro notation instead: str!(delimiter"....."delimiter), or something like that.

Not possible. This syntax needs to be part of the lexer, and macros/syntax extensions operate on token trees, not on raw source characters.

-Kevin



Re: RFC: Syntax for "raw" string literals

Martin DeMello
In reply to this post by Kevin Ballard-2
How complicated would it be to use R"" but with arbitrary paired
delimiters (the way, for instance, Ruby does it)? It's very handy to
pick a delimiter you know does not appear in the string, e.g. if you
had a string containing ')' you could use R{this is a string with a )
in it} or R|this is a string with a ) in it|.

martin


Re: RFC: Syntax for "raw" string literals

Masklinn
In reply to this post by Kevin Ballard-2
On 2013-09-19, at 22:36 , Kevin Ballard wrote:
>
> I welcome any comments, criticisms, or suggestions.

* C# also has raw strings, which were not looked at. C#'s raw strings
  disable escaping entirely but add a new one: doubling quotes will insert
  a single quote in the resulting string (similar to quote-escaping in
  SQL or Smalltalk).
* The docstring comment is incorrect: a docstring is a string in the
  first position of a module, a class statement or a function statement.
  A single-quoted string at these positions will yield a docstring.

  Triple-quoting is a string syntax that allows embedded newlines
  (single-quoted strings cannot contain literal newlines in Python, only
  escaped ones). Obviously, triple-quoted Python strings can be raw.
* The quote-escaping oddness is less of an issue in Python as you can
  also use single quotes for delimiting, or use triple-quoted strings
  (if you need to embed both single and double quotes in raw strings).
* Perl's quotes and quote-like operators would certainly deserve mention.

Also,

> windows file paths

Windows paths can also use forward slashes so that's not a very
interesting justification.

Re: RFC: Syntax for "raw" string literals

Kevin Ballard-2
In reply to this post by Martin DeMello
I didn't look at Ruby's syntax, but what you just described sounds a little too free-form to me. I believe Ruby at least requires a % as part of the syntax, e.g. %q{test}. But I don't think %R{test} is a good idea for Rust, as it would conflict with the % operator. I don't think other punctuation would work well either.

-Kevin

On Sep 19, 2013, at 2:10 PM, Martin DeMello <[hidden email]> wrote:

> How complicated would it be to use R"" but with arbitrary paired
> delimiters (the way, for instance, Ruby does it)? It's very handy to
> pick a delimiter you know does not appear in the string, e.g. if you
> had a string containing ')' you could use R{this is a string with a )
> in it} or R|this is a string with a ) in it|.
>
> martin

Re: RFC: Syntax for "raw" string literals

Martin DeMello
Yes, I figured R followed by a non-alphabetical character could serve
the same purpose as Ruby's %<char>.

martin

On Thu, Sep 19, 2013 at 2:37 PM, Kevin Ballard <[hidden email]> wrote:

> I didn't look at Ruby's syntax, but what you just described sounds a little too free-form to me. I believe Ruby at least requires a % as part of the syntax, e.g. %q{test}. But I don't think %R{test} is a good idea for Rust, as it would conflict with the % operator. I don't think other punctuation would work well either.
>
> -Kevin

Re: RFC: Syntax for "raw" string literals

Kevin Ballard-2
In reply to this post by Masklinn
On Sep 19, 2013, at 2:13 PM, Masklinn <[hidden email]> wrote:

> On 2013-09-19, at 22:36 , Kevin Ballard wrote:
>>
>> I welcome any comments, criticisms, or suggestions.
>
> * C# also has raw strings, which were not looked at. C#'s raw strings
>  disable escaping entirely but add a new one: doubling quotes will insert
>  a single quote in the resulting string (similar to quote-escaping in
>  SQL or Smalltalk).

I've never touched C#. Your description sounds like the "custom syntax" I described. I figured there were existing languages that did this, but none came to mind (I should have known SQL did it though).

> * The docstring comment is incorrect: a docstring is a string in the
>  first position of a module, a class statement or a function statement.
>  A single-quoted string at these positions will yield a docstring.
>
>  Triple-quoting is a string syntax that allows embedded newlines
>  (single-quoted strings cannot contain literal newlines in Python, only
>  escaped ones). Obviously, triple-quoted Python strings can be raw.

Yes, I know, but in my (rather limited) experience with Python, triple-quoted strings are typically used for docstrings. It was just an example anyway.

> * The quote-escaping oddness is less of an issue in Python as you can
>  also use single quotes for delimiting, or use triple-quoted strings
>  (if you need to embed both single and double quotes in raw strings).

If I need to embed both ''' and """ in a single raw string, I'm out of luck. For example, I cannot represent the following:

    Triple-quoted strings in Python use the delimiters ''' and """.
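
For what it's worth, the usual Python workaround is to split the text across two adjacent raw literals with different quote characters, which the compiler concatenates. A small sketch (the variable name is only illustrative) shows that the sentence needs two literals rather than one:

```python
# Adjacent string literals are concatenated at compile time, so each piece
# can use whichever quote character does not occur in it.
sentence = (
    r"Triple-quoted strings in Python use the delimiters ''' and "
    r'""".'
)

print(sentence)
```

This sidesteps the problem rather than solving it, since no single raw literal can hold the whole sentence.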

> * Perl's quotes and quote-like operators would certainly deserve mention.

I'm not a Perl programmer, but IIRC they look like `q{string}`, right? I don't think this is suitable for Rust because how would you lex `do q{foo()}`? Is this the invalid construct `do some-string` or is it calling a function named q with a closure?

> Also,
>
>> windows file paths
>
> Windows paths can also use forward slashes so that's not a very
> interesting justification.

Not always. UNC paths must start with \\ (in my testing, //foo/bar/baz is not interpreted as a UNC path by the Windows File Explorer, but \\foo/bar/baz is). There are also paths that start with the verbatim prefix \\?\, which disables interpretation of forward slashes (among other things).
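
To illustrate the readability gain for such paths, here is a short Python comparison of escaped and raw spellings. The server, share, and file names are made up for the example.

```python
# A UNC path and a verbatim-prefixed path, each written twice: once with
# ordinary escaping and once as a raw string.
unc_escaped = "\\\\server\\share\\file.txt"
unc_raw = r"\\server\share\file.txt"

verbatim_escaped = "\\\\?\\C:\\dir\\file.txt"
verbatim_raw = r"\\?\C:\dir\file.txt"

# Both spellings denote the same characters; only the source text differs.
print(unc_escaped == unc_raw)
print(verbatim_escaped == verbatim_raw)
print(unc_raw)
print(verbatim_raw)
```

In a test suite full of such paths, the raw spellings match what a user would actually type, which is the benefit being argued for here.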

As I am actively engaged in writing a replacement for the path module, and am currently expanding the test suite for Windows paths, raw strings would be extremely useful to me.

-Kevin
_______________________________________________
Rust-dev mailing list
[hidden email]
https://mail.mozilla.org/listinfo/rust-dev

Re: RFC: Syntax for "raw" string literals

Kevin Ballard-2
In reply to this post by Martin DeMello
As I just responded to Masklinn, this is ambiguous. How do you lex `do R{foo()}`?

-Kevin

On Sep 19, 2013, at 2:41 PM, Martin DeMello <[hidden email]> wrote:

> Yes, I figured R followed by a non-alphabetical character could serve
> the same purpose as ruby's %<char>.
>
> martin
>
> On Thu, Sep 19, 2013 at 2:37 PM, Kevin Ballard <[hidden email]> wrote:
>> I didn't look at Ruby's syntax, but what you just described sounds a little too free-form to me. I believe Ruby at least requires a % as part of the syntax, e.g. %q{test}. But I don't think %R{test} is a good idea for rust, as it would conflict with the % operator. I don't think other punctuation would work well either.
>>
>> -Kevin
>>
>> On Sep 19, 2013, at 2:10 PM, Martin DeMello <[hidden email]> wrote:
>>
>>> How complicated would it be to use R"" but with arbitrary paired
>>> delimiters (the way, for instance, ruby does it)? It's very handy to
>>> pick a delimiter you know does not appear in the string, e.g. if you
>>> had a string containing ')' you could use R{this is a string with a )
>>> in it} or R|this is a string with a ) in it|.
>>>
>>> martin

Re: RFC: Syntax for "raw" string literals

Martin DeMello
Ah, good point. You could fix it by having a very small whitelist of
acceptable delimiters, but that probably takes it into overcomplex
territory.

martin

On Thu, Sep 19, 2013 at 2:46 PM, Kevin Ballard <[hidden email]> wrote:

> As I just responded to Masklinn, this is ambiguous. How do you lex `do R{foo()}`?
>
> -Kevin

Re: RFC: Syntax for "raw" string literals

Kevin Cantu
I think designing good traits to support all these text implementations is far more important than whatever Hungarian notation is preferred for literals.


Kevin



Re: RFC: Syntax for "raw" string literals

Andrew Dunham
The way that Lua does raw strings is also fairly nifty.  Check out section 3.1 of http://www.lua.org/manual/5.2/manual.html, or, in short:

- Strings can be opened with a "long bracket" such as "[===[", with any number of equals signs (including none).  The closing delimiter, e.g. "]===]", must use the same number of equals signs.
- No escaping is done.
- Any kind of end-of-line sequence ("\r", "\n", "\r\n", or "\n\r") is converted to just a newline.
- It can run for multiple lines.
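
The rules above can be sketched in a few lines of Python. This is only an illustration of the matching idea, not Lua's actual lexer; `parse_long_bracket` is a hypothetical helper, and the handling of a newline right after the opener follows my reading of the manual.

```python
import re


def parse_long_bracket(src):
    """Parse a Lua-style long-bracket string at the start of src.

    Returns (contents, remaining_source). Hypothetical helper.
    """
    opener = re.match(r"\[(=*)\[", src)
    if opener is None:
        raise ValueError("not a long-bracket string")
    level = opener.group(1)         # the run of '=' signs, possibly empty
    closer = "]" + level + "]"      # must use the same number of '='
    body_start = opener.end()
    body_end = src.find(closer, body_start)
    if body_end == -1:
        raise ValueError("unterminated long-bracket string")
    body = src[body_start:body_end]
    # Normalize every end-of-line sequence to a single "\n".
    body = re.sub(r"\r\n|\n\r|\r|\n", "\n", body)
    # A newline immediately after the opener is not part of the string.
    if body.startswith("\n"):
        body = body[1:]
    return body, src[body_end + len(closer):]


# Brackets with fewer '=' signs pass through untouched.
contents, rest = parse_long_bracket("[==[a]]b]=]c]==] tail")
assert contents == "a]]b]=]c"
assert rest == " tail"
```

Because the closer must repeat the opener's level, the author can always pick a level that does not occur in the text, much like C++11's delimiter sequence.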

--Andrew D



Re: RFC: Syntax for "raw" string literals

Masklinn
In reply to this post by Kevin Ballard-2
On 2013-09-19, at 23:45 , Kevin Ballard wrote:
> Yes I know, but in my (rather limited) experience with Python, triple-quoted strings are typically used for docstrings. It was just an example anyway.

They're also commonly used for multiline strings, as single-quoted strings can't contain literal newlines.

>
>> * The quote-escaping oddness is less of an issue in Python as you can
>> also use single-quotes for delimiting, or use triple-quoted strings
>> (if you need to embed both single and double quotes in rawstrings).
>
> If I need to embed both ''' and """ in a string, I'm out of luck.

The chance of that is as remote as can be. I've never seen or heard of
it happening. And mind, the issue must happen *in a rawstring*, which is
even more unlikely.

>> Also,
>>
>>> windows file paths
>>
>> windows paths can also use forward slashes so that's not a very
>> interesting justification.
>
> Not always. UNC paths must start with \\ (in my testing, //foo/bar/baz is not interpreted as a UNC path by the Windows File Explorer, but \\foo/bar/baz is).

True. Do you expect writing literal UNC paths in Rust to be a common occurrence?

> There's also paths that start with the verbatim prefix \\?\, which disables interpretation of forward-slashes (among other things).

That's not really relevant to a rawstrings proposal, why would a
developer embed such a path literally?

> As I am actively engaged in writing a replacement for the path module, and am currently expanding the test suite for Windows paths, raw strings would be extremely useful to me.

I'd have thought it a better idea to use path builders (maybe macros)
and avoid embedding literal path separators in order to avoid
portability issues.

Re: RFC: Syntax for "raw" string literals

Kevin Ballard-2
On Sep 20, 2013, at 1:13 AM, Masklinn <[hidden email]> wrote:

>>> Also,
>>>
>>>> windows file paths
>>>
>>> windows paths can also use forward slashes so that's not a very
>>> interesting justification.
>>
>> Not always. UNC paths must start with \\ (in my testing, //foo/bar/baz is not interpreted as a UNC path by the Windows File Explorer, but \\foo/bar/baz is).
>
> True. Do you expect writing literal UNC paths in Rust to be a common occurrence?

Maybe not for most people, but I've been writing them a _lot_ lately (I'm rewriting the path module).

Regular expressions are really the most common application here.

>> There's also paths that start with the verbatim prefix \\?\, which disables interpretation of forward-slashes (among other things).
>
> That's not really relevant to a rawstrings proposal, why would a
> developer embed such a path literally?

Perhaps they want to hard-code a path that refers to something that requires the \\?\ prefix (such as a path that contains / as part of a path component, or is longer than 255 characters).

But just in general, \ is the canonical Windows path separator. I don't think "use /" is particularly great advice. What if this string is intended for displaying?
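
To make the escaping cost concrete, here is a minimal sketch in Python, chosen only because it already has raw strings. The path is a made-up example, not one taken from the path module's test suite.

```python
# Hypothetical verbatim-prefixed Windows path, used only for illustration.

# Ordinary literal: every backslash has to be doubled.
escaped = "\\\\?\\C:\\some dir\\file.txt"

# Raw literal: the source text matches the resulting string one-to-one.
raw = r"\\?\C:\some dir\file.txt"

# Both spellings denote exactly the same string.
assert escaped == raw
print(raw)
```

The ordinary literal needs ten backslashes to express the five that end up in the string, which is easy to get wrong when writing many such paths by hand.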

>> As I am actively engaged in writing a replacement for the path module, and am currently expanding the test suite for Windows paths, raw strings would be extremely useful to me.
>
> I'd have thought it a better idea to use path builders (maybe macros)
> and avoid embedding literal path separators in order to avoid
> portability issues.

People still use literal path separators in strings all the time in languages that support path-building methods.

-Kevin

Re: RFC: Syntax for "raw" string literals

Marijn Haverbeke
In reply to this post by Masklinn
>> If I need to embed both ''' and """ in a string, I'm out of luck.
>
> The chance of that is as remote as can be. I've never seen or heard of
> it happen. And mind, the issue must happen *in a rawstring* which is
> even more unlikely.

You should note that, as soon as you include something in the language
itself, that creates meaningful strings (programs in the language)
that include the token, and those are quite likely, at some point, to need
to be written as a multiline string in the language itself.

(As a related example, as someone writing JavaScript-analyzing code in
JavaScript, I've had several bugs caused by the fact that the
nonsense, no-one-is-ever-going-to-use-this word __proto__ has a very
hard to suppress special meaning, and you *are* going to use it when
analyzing the elements in another JavaScript program.)

Re: RFC: Syntax for "raw" string literals

Masklinn
On 2013-09-20, at 10:26 , Marijn Haverbeke wrote:

>>> If I need to embed both ''' and """ in a string, I'm out of luck.
>>
>> The chance of that is as remote as can be. I've never seen or heard of
>> it happen. And mind, the issue must happen *in a rawstring* which is
>> even more unlikely.
>
> You should note that, as soon as you include something in the language
> itself, that creates meaningful strings (programs in the language)
> that include the token, and those are quite likely, at some point, to need
> to be written as a multiline string in the language itself.

It's already noted, my objections are very much that this is highly
unlikely to be an issue as it only comes to a head when needing
*triple-quoted rawstrings* to include *their own* delimiters
(meaning a triple-quoted rawstring which needs to include both
triple-quoted delimiters at the same time).

Even unlikelier given that Python will concatenate adjacent string
literals during parsing.
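
A minimal sketch of that workaround, with made-up literal contents: two adjacent raw literals, each using the delimiter the other needs to embed, are joined by the parser into a single string.

```python
# Adjacent string literals are concatenated by the Python parser.
# The first literal is delimited by ''' so it can contain """;
# the second is delimited by """ so it can contain '''.
both = r'''contains """ here, ''' r"""and ''' here"""

# The result holds both triple-quote sequences in one string.
assert both == 'contains """ here, and \'\'\' here'
print(both)
```

This only helps when the string can be split at a point between the two kinds of delimiter, so it is a workaround rather than a general way to represent every string.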

On 2013-09-20, at 10:25 , Kevin Ballard wrote:
> Regular expressions are really the most common application here.

Right, which was just about all I was saying in the original message.

> People still use literal path separators in strings all the time in languages that support path-building methods.

Something I don't believe should be encouraged.

Re: RFC: Syntax for "raw" string literals

Andres Osinski
Out of all the mentioned syntaxes, Python's seems simple and easy (and the corner cases appear fairly unlikely for the actual use cases for raw strings). Ruby's seems very powerful and, if a couple of restrictions are added, could probably fit well. Lua's seems very well designed in allowing delimiters of arbitrary length.

As a user of higher-level languages, all of these seem appealing to me. I don't really feel that raw strings should be complicated to use, and I don't really think the limitations are bad so long as they are explicitly documented (which is how it should be).


--
Andrés Osinski
http://www.andresosinski.com.ar/


Re: RFC: Syntax for "raw" string literals

Kevin Ballard-2
Python's syntax has really stupid handling of backslashes, and I really don't like how it cannot represent all valid strings. I'd really prefer not to make that same mistake.

Ruby's syntax cannot be used because % lexes as an operator.

Of the 3, Lua's is probably the best, although it's a bit esoteric (using [[ with nary a quote in sight). It seems roughly equivalent to C++11's syntax, though, both in ease of use and flexibility.

-Kevin


Re: RFC: Syntax for "raw" string literals

Alex Crichton
> Of the 3, Lua's is probably the best, although it's a bit esoteric (with
> using [[ and nary a quote in sight).

I think an important thing to keep in mind is that the main reason
behind creating a new form of literal is for things like:

* Escapes in format! strings
* Possible regular expression syntax (this also may be a syntax extension)
* Typing literal Windows paths (escaping \ is hard)
* Otherwise long literals which may contain quotes (like html text)

With those in mind, although Lua's syntax is sufficient, is it nice to
use? If the first thing I saw as an introduction to Rust was:

fn main() {
  println!([[Hello, {}!]], "world");
}

I would be a little confused. Now the [[/]] aren't really necessary in
this case, but I'm personally unsure of how usable [[/]] would be
throughout the language. Raw literals in languages like C++ and Lua I
think aren't intended to be used that often. Instead they should be
used only when necessary, and you frequently don't see them in code.
For Rust, the use cases which are the cause of this discussion are
actually fairly common, and I'm not sure that we'd want to see [[/]]
all over the place, although of course that's just my opinion :)

Skimming back, I haven't seen a suggestion of the backtick character
as a delimiter. Go takes this approach, and I don't believe that in Go
you can have a backtick anywhere in a backtick literal, and otherwise
what you see is what you get. It's at least something to consider,
though.

Re: RFC: Syntax for "raw" string literals

Thad Guidry
Does it HAVE to be a single typed char found on the English 101-key keyboard?

History Lesson:
The industry in the very early, early days of printing, storing, and processing characters, both English and non-English, came up with a solution around the use of Control Characters.

ASCII Char 1 is known as Start of Heading, or abbreviated SOH.
ASCII Char 2 is known as Start of Text, or abbreviated STX.
ASCII Char 3 is known as End of Text, or abbreviated ETX.

It got me thinking of how various industries to this day still use Start of Text and End of Text... what we are discussing as enclosing a String verbatim.

Many data operations that I perform with conversion of string fields are actually done by first wrapping with Control Chars [1] to enclose the String LITERALLY.
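
As a rough sketch of what such wrapping looks like (the helper names `wrap` and `unwrap` and the sample field are made up for illustration, and this is not how any particular feed format is actually parsed):

```python
STX = "\x02"  # ASCII 2, Start of Text
ETX = "\x03"  # ASCII 3, End of Text


def wrap(text):
    """Enclose text between STX and ETX (hypothetical helper)."""
    return STX + text + ETX


def unwrap(framed):
    """Return the text between the first STX and the next ETX."""
    start = framed.index(STX) + 1
    end = framed.index(ETX, start)
    return framed[start:end]


# Backslashes, quotes and braces pass through untouched.
field = 'C:\\dir "quoted" {braces}'
assert unwrap(wrap(field)) == field
```

The scheme works for data that never contains the ETX character itself; a field that did contain it would be cut short at that point.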

Apple's Enterprise Partner Feed is an example that uses such basic Control Chars to separate fields and, interestingly, uses multibyte EOL Control Chars to retain even Unicode contents (foreign-language strings that at times use quotes of a different nature [2], that sometimes appear in its fields, and that need to be retained inside a database field as well).

I am wondering if doing something similar to what the industry does, using Control Chars to represent a STX or ETX, would not be even wiser to supplant the String Literal? i.e. do not reinvent the fast spinning wheel that also has built-in never-go-flat technology. :)

Thoughts?

