At 12:37 05/20/97 PDT, Robert Herriot wrote:
>>> From hastings at cp10.es.xerox.com Tue May 20 11:48:48 1997
>>Snip...
>> 3. The two-octet integer is not padded, so RISC machines that require
>> integers to be aligned on 2-, 4-, or 8-byte boundaries must pick up the
>> two octets wherever they are. We don't want some clients padding the
>> data to meet their server's alignment requirements and then such servers
>> not being able to accept unaligned data from other clients. We should
>> probably outlaw the ASCII NUL (decimal 0) in attribute names and values
>> as well, just to make sure. The current Model document lists the
>> abstract characters that are allowed in keywords, but the encoding document
>> could specify that ASCII NUL could be introduced. I think we need to
>> specifically outlaw ASCII NUL in the encoding document.
>
>Padding is a host issue and not a protocol issue. Since our protocol is
>based on precise locations of bytes in the stream, the protocol
>definition implies that there can be no padding. Because each name and
>value is specified by a length, it shouldn't matter if an ASCII NUL is
>present. In fact, I would NOT want to have two rules for termination
>of a name, such as length or NUL, whichever comes first. This gets us
>back into the HTTP question of Content-length versus boundary-string
>and which one wins. If a name or value contains a NUL, then it must be
>part of the name or value. This rule is especially important if we
>later allow other encodings such as Unicode, which has numerous NUL
>bytes, e.g. in every ASCII character.

I agree we don't want two rules and I wasn't suggesting such. If NUL is
included in character-coded data, it would contribute to the count like
any other character and would not terminate the string.
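A minimal sketch of that counting rule, in Python. The helper name
read_counted_string is hypothetical, and big-endian (network order) lengths
are an assumption made for illustration:

```python
import struct
from typing import Tuple


def read_counted_string(buf: bytes, offset: int) -> Tuple[bytes, int]:
    """Read a two-octet length, then exactly that many octets.

    Returns (value, next_offset). The count is the only terminator;
    a NUL octet inside the value is ordinary data.
    """
    if offset + 2 > len(buf):
        raise ValueError("truncated length field")
    (length,) = struct.unpack_from(">H", buf, offset)  # assumed big-endian
    start = offset + 2
    end = start + length
    if end > len(buf):
        raise ValueError("truncated value")
    return buf[start:end], end


# Two fields back to back: a 5-octet value with an embedded NUL, then "x".
data = b"\x00\x05ab\x00cd" + b"\x00\x01x"
first, next_offset = read_counted_string(data, 0)
second, end_offset = read_counted_string(data, next_offset)
```

Here the embedded NUL is returned as part of the first value, and parsing
resumes at the octet right after the counted span.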

In UTF-8, as opposed to Unicode (which is two octets for every character),
there are no all-zero octets, except for the Unicode NUL character. All
other characters have all non-zero octets. That is one of the reasons
for UTF-8: to allow the C NUL terminated string to be processed by all
existing software. But for our application we don't want NUL terminated
strings, since we have a count instead.
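The difference can be checked with a few lines of Python. This is only a
sketch: the "utf-16-be" codec stands in for the two-octet Unicode encoding
mentioned above (an assumption), and zero_octet_count is a hypothetical
helper:

```python
def zero_octet_count(text: str, encoding: str) -> int:
    """Count the all-zero octets in the encoded form of text."""
    return text.encode(encoding).count(0)


sample = "IPP \u00e9\u20ac"  # ASCII letters, a space, e-acute, euro sign

utf8_zeros = zero_octet_count(sample, "utf-8")
two_octet_zeros = zero_octet_count(sample, "utf-16-be")

# Only the NUL character itself produces a zero octet in UTF-8.
utf8_zeros_with_nul = zero_octet_count("a\u0000b", "utf-8")
```

With the two-octet encoding, every ASCII character and the e-acute carry a
zero high octet, while the UTF-8 form of the same text has none.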

To be clearer, we should outlaw the ASCII NUL character in ASCII-coded strings
and outlaw the UTF-8 NUL *character* in UTF-8 strings. Otherwise, some client
may insert such padding (within the count) in order to align the two-octet
length of the attribute name or attribute value that follows. The particular
server that the client was built with could then pick up the integers as
(aligned) integers, because they happened to be aligned as the data was
unmarshalled into memory, rather than picking up the two-octet integers as
individual octets.
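Picking up a two-octet integer as individual octets can be sketched as
follows. Python has no alignment constraints, so this only illustrates the
arithmetic a server would do in place of an aligned integer load. The name
get_u16 is hypothetical, and big-endian order is an assumption:

```python
def get_u16(buf: bytes, offset: int) -> int:
    """Assemble a two-octet integer from single octets at any offset."""
    if offset < 0 or offset + 2 > len(buf):
        raise ValueError("two-octet integer runs past the buffer")
    high = buf[offset]      # first octet on the wire (assumed most significant)
    low = buf[offset + 1]   # second octet on the wire
    return (high << 8) | low


# The integer starts at an odd offset; no padding is needed to read it.
packet = b"\x07\x01\x02\xff"
unaligned_value = get_u16(packet, 1)
```

Because each octet is fetched separately, the result does not depend on
where the field happens to fall in memory.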

Another reason to outlaw these NUL characters is to prevent servers from
having to filter them out before doing string compares. If one client starts
to embed NULs and some servers filter them out, then we get interoperability
problems with the servers that don't.
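A small sketch of that mismatch, using two hypothetical lookup functions
and an example keyword chosen only for illustration:

```python
KNOWN_NAMES = {b"copies"}  # example keyword, for illustration only


def lookup_strict(name: bytes) -> bool:
    """A server that compares the counted octets exactly as received."""
    return name in KNOWN_NAMES


def lookup_filtering(name: bytes) -> bool:
    """A server that strips NUL octets before comparing."""
    return name.replace(b"\x00", b"") in KNOWN_NAMES


padded = b"copies\x00"  # a client padded the name within the count

strict_result = lookup_strict(padded)
filtering_result = lookup_filtering(padded)
```

The same request is recognized by one server and rejected by the other,
which is the interoperability problem in miniature.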

In short, this is a small problem with a simple fix: just add a sentence
that forbids counted ASCII and UTF-8 strings from containing the NUL character
when representing attribute names and attribute values in IPP.
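On the receiving side, enforcing such a sentence could look roughly like
this. It is a sketch, not proposed text for the encoding document, and
check_no_nul is a hypothetical helper:

```python
def check_no_nul(field: bytes, what: str) -> bytes:
    """Return field unchanged, or raise ValueError if it holds a zero octet.

    In ASCII and UTF-8 a zero octet can only be the NUL character, so an
    octet-level test covers both encodings.
    """
    position = field.find(b"\x00")
    if position != -1:
        raise ValueError(f"{what} contains NUL at octet {position}")
    return field


name = check_no_nul("job-name".encode("utf-8"), "attribute name")

try:
    check_no_nul(b"job\x00", "attribute name")
    padded_name_accepted = True
except ValueError:
    padded_name_accepted = False
```

The check runs on the counted octets after the length has been honored, so
it adds a validity rule without introducing a second termination rule.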
Tom