[jcifs] Character Set discussions

Fri Feb 7 07:10:58 EST 2003

> > The encoding of the file is unrelated to the encoding specified for the
> > URL itself -- from RFC 2396:
> > 
> >    A URI is represented as a sequence of characters, not as a sequence
> >    of octets. That is because URI might be "transported" by means that
> >    are not through a computer network, e.g., printed on paper, read over
> >    the radio, etc.
> > 
> >    A URI scheme may define a mapping from URI characters to octets;
> >    whether this is done depends on the scheme.
> 
> They left it wide open. This is a fancy way of saying a URL is a stream
> of bytes; the encoding of a sequence of characters is not defined.
> 

More accurately, a URI is a stream of characters; specifically (from RFC
2396):

   A URI is a sequence of characters from a very limited set, i.e. the
   letters of the basic Latin alphabet, digits, and a few special
   characters.  A URI may be represented in a variety of ways: e.g.,
   ink on paper, pixels on a screen, or a sequence of octets in a coded
   character set.

There are roughly 80 characters which can appear in a valid URI (the ASCII
letters and digits plus a limited set of punctuation); anything outside of
that set must be escaped (%HH).  What is NOT defined is how characters
outside the valid set are encoded into bytes prior to escaping.  The
recommendation from RFC 2718 is that you encode invalid characters into
UTF-8.
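
The "encode to UTF-8, then %HH-escape each byte" recommendation can be sketched roughly as follows. This is a minimal illustration, not code from jcifs: the class name `UriPathEscaper`, the method `escapePath`, and the choice of which characters to leave unescaped (ASCII alphanumerics, the RFC 2396 "mark" characters, and '/') are assumptions made for the example.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class UriPathEscaper {
    // Assumption: RFC 2396 "mark" characters plus '/' are passed through,
    // so that path separators survive. Everything else is escaped,
    // including '%', so the input must not already contain escapes.
    private static final String SAFE = "-_.!~*'()/";

    static String escapePath(String path) {
        StringBuilder out = new StringBuilder();
        // Step 1: map characters to octets using UTF-8.
        for (byte b : path.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            boolean alnum = (v >= 'a' && v <= 'z')
                    || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9');
            if (alnum || SAFE.indexOf(v) >= 0) {
                out.append((char) v);
            } else {
                // Step 2: escape every other octet as %HH.
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Slovak path written with \\u escapes so the source file stays ASCII.
        String path = "/slovak/m\u00F4\u017Eem/jes\u0165/sklo/nezran\u00ED/ma.zip";
        System.out.println(escapePath(path));
    }
}
```

With the Slovak path used later in this thread, this should yield the same escaped form that appears in the GET request example.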

> > The HTTP URL scheme does not formally define a mapping (that I am aware
> > of); this can lead to confusion in certain situations.  Internet
> > Explorer, for example, will accept Unicode URLs (i.e., typed into the
> > address field).  When the HTTP request is sent, however, the GET request
> > needs an ASCII representation of the URL.  IE has an option (under
> > "Internet Options --> Advanced", "Always send URLs as UTF-89") which
> > defines the mapping; if
> 
> Hmm. What is UTF-89.
> 
> > disabled, it will use the client's code page (which must match the
> > server if the transaction is expected to succeed).  If enabled, URLs
> > will be sent in UTF-8 -- if the server can understand this, then great.
> 
> This is interesting. And expected I guess. So Slovak can be sent without
> escaping provided the server is in the Slovak locale.
> 

Yes, but this is only because Internet Explorer sends an invalid request
(and IIS happens to accept it).  The resource URI is just that -- a URI.  To
be portable across servers, the user agent must present only valid URIs,
which implies encoding and escaping invalid characters.  It would be valid
for IE to escape the code-page bytes directly, without first encoding to
UTF-8, but the result would be non-portable: the server would have to know
which code page the client used.  This is why it is recommended that URL
schemes specify an encoding.
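
To make the portability problem concrete, here is a small sketch showing that the same character escapes to different %HH sequences depending on the encoding chosen before escaping. The class `EscapeCharsetDemo` and its `escape` method are hypothetical names for this example, and ISO-8859-1 merely stands in for "the client's code page".

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, for illustration only.
final class EscapeCharsetDemo {
    // Encodes the string in the given charset and escapes every octet as %HH.
    static String escape(String s, Charset cs) {
        StringBuilder out = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            out.append(String.format("%%%02X", b & 0xFF));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String eAcute = "\u00E9"; // LATIN SMALL LETTER E WITH ACUTE

        // Same character, two different (both syntactically valid) escapes.
        System.out.println(escape(eAcute, StandardCharsets.ISO_8859_1)); // %E9
        System.out.println(escape(eAcute, StandardCharsets.UTF_8));      // %C3%A9

        // A server that assumes UTF-8 cannot recover the character from
        // the single octet 0xE9 produced by the ISO-8859-1 client.
        String misread = new String(new byte[] {(byte) 0xE9}, StandardCharsets.UTF_8);
        System.out.println(misread.equals(eAcute)); // false
    }
}
```

Both escaped forms are legal URIs; only an agreed encoding lets the receiving side map the octets back to the intended characters.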

> > Specifying a mapping in the URL scheme definition clears up a lot of
> > this confusion, which is why it is recommended in RFC 2718; in
> > particular, using UTF-8 as this mapping has been identified as the best
> > practice.
> 
> Isn't this just about escaping though? It says use UTF-8 and *then*
> escape each byte. It sounds like the GET/POST parameter still has to be
> ASCII. I was sort of hoping they would just expand that to UTF-8 without
> requiring escaping.
>

This is the basic concept behind IRIs; see

http://www.w3.org/International/O-URL-and-ident

and

http://www.w3.org/2001/Talks/0912-IUC-IRI/paper.html

especially the section on legacy URI handling.  This is currently a draft
specification, however.

> > > > There is *supposed* to be a header declaring the encoding of the
> > > > file (if it's in HTML, for example).  It will, as you suggest, take
> > > > the Latin world a while to get used to this.
> > > > 
> > 
> > This applies to the content being served; we are talking about the
> > means of identifying a resource.
> 
> We were speculating about how the encoding of the URL in a web page
> could be determined so that it didn't have to be escaped. The charset
> specified in the META tag could be used to decode URLs within the HTML.
> 

Some user agents do exactly this -- but it has been recommended against:

http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1

The recommendation to handle this scenario, i.e.:

<a href="http://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip">this is
currently illegal</a>

is to treat it in a fashion compatible with the IRI specification above --
i.e., use a conversion based on UTF-8, regardless of the document charset.
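
As a rough sketch of that UTF-8-based conversion in Java: `java.net.URI` accepts non-ASCII characters when parsing, and its `toASCIIString()` method encodes them as UTF-8 and escapes the resulting octets, independent of whatever charset the HTML document was served in. The wrapper class `HrefToUri` and method `toAscii` are hypothetical names for this example.

```java
import java.net.URI;

// Hypothetical wrapper, for illustration only.
final class HrefToUri {
    // Converts an href containing non-ASCII characters into a pure-ASCII
    // URI. Characters outside US-ASCII are encoded as UTF-8 and each octet
    // is %HH-escaped; ASCII parts are left as they are.
    static String toAscii(String href) {
        // URI.create throws IllegalArgumentException on malformed input.
        return URI.create(href).toASCIIString();
    }

    public static void main(String[] args) {
        // The href from the example above, written with \\u escapes so the
        // source file itself stays ASCII.
        String href = "http://svr/slovak/m\u00F4\u017Eem/jes\u0165/sklo/nezran\u00ED/ma.zip";
        System.out.println(toAscii(href));
    }
}
```

Because the href is already a Java String (Unicode) by the time it reaches this method, the document charset matters only for decoding the HTML, not for building the URI.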

> > > 	And HTML has the META tag. I think ultimately this is what should
> > > 	happen. Everything should just be widened to Unicode where the
> > > 	encoding is left undefined or if it's protocol transport like HTTP
> > > 	they pick UTF-8 or negotiate an encoding.
> > 
> >
> > 1. User types a Unicode HTTP URL into their browser window:
> >     http://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip
>   2. Browser encodes the URL in UTF-8 [*then* escapes each byte with
>      %HH escapes] and does a GET:
> >    GET /slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip HTTP/1.1
>   3. Server [unescapes and] decodes the URI to get:
> >    "/slovak/môžem/jesť/sklo/nezraní/ma.zip"
> > 
> > 4. Server provides the resource to the client (which may be character
> >    content, which may have its own specified character encoding).
> 
> That's not what I was thinking. Note my bracketed amendments above.
> Instead I was thinking of the following:
> 
> 1. User types a Unicode HTTP URL into their browser window:
>    http://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip
> 
> 2. Browser encodes the URL in UTF-8 (but does not %HH escape each byte)
>    and does a GET:
>    GET http://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip HTTP/1.1
> 
> 3. Server decodes the URI to get:
>    ? - don't know, and it doesn't matter. Whatever encoding the request
>    handler wants. Depending on the architecture you might have:
>   o Java - JAVA which is UCS-2 with a byte-order marker (although you
>    would never actually see that because you would just get a String)
>   o Windows - would likely be UTF-16LE or the local codepage
>   o UNIX - will be one of: wchar_t, which does not define an "encoding"
>    (although if the __STDC_ISO_10646__ macro is defined it will hold UCS
>    codepoints; Unicode, but the size in bytes is still not defined); the
>    default locale of the server, which will be UTF-8 if they want Unicode
>    support; or a custom encoding from a module like ICU.
> 
> 4. Server provides the resource to the client (which may be character
>    content, which may have its own specified character encoding).
> 

This is the functionality provided by IRIs -- an interpretation of unescaped
characters outside the ASCII set.  But at the current time, for existing
URIs, representation must be made using members of the set of valid URI
characters.  This means encoding and escaping; preferably, the URL scheme
specifies a standard encoding, and preferably UTF-8.
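
For the receiving side, a minimal sketch of "unescape, then decode" under the assumption that the scheme has standardized on UTF-8 might look like this. The class `UriPathDecoder` and its methods are hypothetical; the decoder is deliberately strict, so octets that are not valid UTF-8 (for example, escapes produced from a legacy code page) are rejected instead of being silently replaced.

```java
import java.io.ByteArrayOutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
final class UriPathDecoder {
    // Returns the value of an ASCII hex digit, or -1 if c is not one.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    static String unescapePath(String escaped) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        for (int i = 0; i < escaped.length(); i++) {
            char c = escaped.charAt(i);
            if (c == '%') {
                // Step 1: turn each %HH escape back into a single octet.
                if (i + 2 >= escaped.length()) {
                    throw new IllegalArgumentException("truncated escape at " + i);
                }
                int hi = hexValue(escaped.charAt(i + 1));
                int lo = hexValue(escaped.charAt(i + 2));
                if (hi < 0 || lo < 0) {
                    throw new IllegalArgumentException("bad escape at " + i);
                }
                bytes.write((hi << 4) | lo);
                i += 2;
            } else if (c < 0x80) {
                bytes.write(c);
            } else {
                // A valid URI contains only ASCII characters.
                throw new IllegalArgumentException("unescaped non-ASCII at " + i);
            }
        }
        try {
            // Step 2: decode the octets as UTF-8, reporting malformed input.
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes.toByteArray()))
                    .toString();
        } catch (CharacterCodingException e) {
            throw new IllegalArgumentException("escaped octets are not valid UTF-8", e);
        }
    }
}
```

Rejecting malformed input is a design choice for the sketch; a real server might instead fall back to a configured legacy code page.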

Eric
