[jcifs] Character Set discussions

Thu Feb 6 08:03:33 EST 2003

On Wed, 5 Feb 2003 06:02:17 -0500 
"Glass, Eric" <eric.glass at capitalone.com> wrote:
> > > 
> > 	But you will still run into situations where the encoding of files
> > 	or protocol transport does not permit Unicode (like right now with
> > 	web browsers).
> 
> The encoding of the file is unrelated to the encoding specified for the URL
> itself -- from RFC 2396:
> 
>    A URI is represented as a sequence of characters, not as a sequence
>    of octets. That is because URI might be "transported" by means that
>    are not through a computer network, e.g., printed on paper, read over
>    the radio, etc.
> 
>    A URI scheme may define a mapping from URI characters to octets;
>    whether this is done depends on the scheme.

They left it wide open. This is a fancy way of saying a URL is a stream
of bytes; the encoding of a sequence of characters is not defined.

> The HTTP URL scheme does not formally define a mapping (that I am aware of);
> this can lead to confusion in certain situations.  Internet Explorer, for
> example, will accept Unicode URLs (i.e., typed into the address field).
> When the HTTP request is sent, however, the GET request needs an ASCII
> representation of the URL.  IE has an option (under "Internet Options -->
> Advanced", "Always send URLs as UTF-89") which defines the mapping; if

Hmm. What is UTF-89?

> disabled, it will use the client's code page (which must match the server if
> the transaction is expected to succeed).  If enabled, URLs will be sent in
> UTF-8 -- if the server can understand this, then great.

This is interesting. And expected, I guess. So Slovak can be sent without
escaping provided the server is in the Slovak locale.
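To make the code page dependency concrete, here is a minimal sketch in Python. It assumes the client code page is windows-1250 (Central European), which is my guess for a Slovak desktop and not something stated above; server_decode is a hypothetical helper, not any real server API.

```python
# The example path used later in this thread.
path = "/slovak/môžem/jesť/sklo/nezraní/ma.zip"

# Assumption: the client's code page is windows-1250.
cp1250_bytes = path.encode("cp1250")
utf8_bytes = path.encode("utf-8")

def server_decode(raw, server_charset):
    """Hypothetical server step: decode the raw URL bytes in the server's charset."""
    try:
        return raw.decode(server_charset)
    except UnicodeDecodeError:
        return None  # the server cannot make sense of these bytes

# Same characters, different bytes on the wire.
print(cp1250_bytes)
print(utf8_bytes)

# Works only when both ends agree on the charset.
print(server_decode(cp1250_bytes, "cp1250") == path)
print(server_decode(cp1250_bytes, "utf-8"))
```

The code page bytes only round-trip when the server decodes with the same code page; a server expecting UTF-8 rejects them outright.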

> Specifying a mapping in the URL scheme definition clears up a lot of this
> confusion, which is why it is recommended in RFC 2718; in particular, using
> UTF-8 as this mapping has been identified as the best practice.

Isn't this just about escaping though? It says use UTF-8 and *then*
escape each byte. It sounds like the GET/POST parameter still has to be
ASCII. I was sort of hoping they would just expand that to UTF-8 without
requiring escaping.
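For illustration, the "UTF-8 and *then* escape each byte" rule can be sketched with Python's standard urllib.parse module. The path is the example from this thread; treat this as a sketch of the rule, not of what any particular browser does.

```python
from urllib.parse import quote, unquote

path = "/slovak/môžem/jesť/sklo/nezraní/ma.zip"

# quote() encodes to UTF-8 by default, then %HH-escapes every non-ASCII
# byte; "/" is left alone by default.
escaped = quote(path)
request_line = "GET " + escaped + " HTTP/1.1"

# The request line is now plain ASCII; this would raise if it were not.
request_line.encode("ascii")

# The receiving side reverses both steps: unescape, then decode UTF-8.
restored = unquote(escaped)

print(request_line)
print(restored == path)
```

So the parameter on the wire is still ASCII; UTF-8 only determines which bytes get escaped.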

> > > There is *supposed* to be a header declaring the encoding of the file
> > > (if it's in HTML, for example).  It will, as you suggest, take the
> > > Latin world a while to get used to this.
> > > 
> 
> This applies to the content being served; we are talking about the means of
> identifying a resource.

We were speculating about how the encoding of the URL in a web page
could be determined so that it didn't have to be escaped. The charset
specified in the META tag could be used to decode URLs within the HTML.
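A rough sketch of that idea in Python, assuming a page saved in windows-1250 with an unescaped link in it. sniff_meta_charset is a hypothetical helper and the regular expressions are deliberately naive; a real browser does considerably more than this.

```python
import re

def sniff_meta_charset(html_bytes, default="iso-8859-1"):
    """Hypothetical helper: pull charset=... out of a META http-equiv tag."""
    # The tag itself is ASCII, so a lossless latin-1 decode is enough to search it.
    head = html_bytes[:1024].decode("latin-1")
    match = re.search(r'<meta[^>]*charset=["\']?([\w-]+)', head, re.IGNORECASE)
    return match.group(1).lower() if match else default

# Assumed input: page bytes in windows-1250 containing an unescaped URL.
page = (
    '<META http-equiv="Content-Type" content="text/html; charset=windows-1250">'
    '<a href="/slovak/môžem/ma.zip">zip</a>'
).encode("cp1250")

charset = sniff_meta_charset(page)
text = page.decode(charset)

# With the document decoded, the embedded URL is ordinary characters again.
href = re.search(r'href="([^"]+)"', text).group(1)
print(charset, href)
```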

> > 	And HTML has the META tag. I think ultimately this is what should
> > 	happen. Everything should just be widened to Unicode where the
> > 	encoding is left undefined or if it's protocol transport like HTTP
> > 	they pick UTF-8 or negotiate an encoding.
> 
> Note that this is only really applicable if the content is character-based
> (i.e., HTML or text).  HTTP (MIME, really) provides the "Content-Type"
> header to denote the type of the content being provided; the "charset"
> parameter is applicable to all "text" subtypes.  The HTML META tag provides
> the http-equiv mechanism for overriding/adding ANY response header; in
> particular, you can specify:
> 
> <META http-equiv="Content-Type" content="text/html; charset=utf-8">
> 
> to indicate the character set of the document.  The charset parameter is not
> applicable to non-character entities; you won't see a response header like:
> 
> Content-Type: image/jpeg; charset=utf-8
> 
> because a character set only applies to characters.

Naturally. I don't think you can embed URLs in a jpeg :~)
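As a small illustration of the charset parameter on Content-Type, here is a sketch using Python's standard email.message.Message to parse the two header values quoted above. charset_of is a hypothetical wrapper of mine.

```python
from email.message import Message

def charset_of(content_type_header):
    """Hypothetical wrapper: return (media type, charset or None) for a header value."""
    msg = Message()
    msg["Content-Type"] = content_type_header
    return msg.get_content_type(), msg.get_content_charset()

# Character content carries a charset parameter.
print(charset_of("text/html; charset=utf-8"))

# Binary content normally has none, so there is nothing to decode with.
print(charset_of("image/jpeg"))
```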

> In any case, when discussing SMB, you don't particularly CARE about the
> content -- it's a big binary chunk.  You DO care about identifying WHICH
> file to retrieve.  This is where the URL scheme encoding comes into play.
> With HTTP, you have a scenario like this (we'll assume that both client and
> server are using UTF-8 to encode URLs, although as noted above this isn't
> mandated):
>
> 1. User types a Unicode HTTP URL into their browser window:
>     http://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip
  2. Browser encodes the URL in UTF-8 [*then* escapes each byte with
     %HH escapes] and does a GET:
>    GET /slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip HTTP/1.1
  3. Server [unescapes and] decodes the URI to get:
>    "/slovak/môžem/jesť/sklo/nezraní/ma.zip"
> 
> 4. Server provides the resource to the client (which may be character
>    content, which may have its own specified character encoding).

That's not what I was thinking. Note my bracketed amendments above. Instead I
was thinking of the following:

1. User types a Unicode HTTP URL into their browser window:
   http://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip

2. Browser encodes the URL in UTF-8 (but does not %HH escape each byte)
   and does a GET:
   GET http://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip HTTP/1.1

3. Server decodes the URI to get:
   ? - donno and doesn't matter. Whatever encoding the request handler
   wants. Depending on the architecture you might have:
  o Java - a String, which is a sequence of 16-bit UCS-2/UTF-16 code
   units (no byte-order marker is involved once you have a String; that
   only matters when converting to or from bytes)
  o Windows - would likely be UTF-16LE or the local codepage
  o UNIX - will be either wchar_t, which does not define an "encoding"
   (although if the __STDC_ISO_10646__ macro is defined it holds UCS
   codepoints; Unicode, but the size in bytes is still not defined),
   the default locale encoding of the server, which will be UTF-8 if
   they want Unicode support, or a custom encoding from a library like
   ICU.

4. Server provides the resource to the client (which may be character
   content, which may have its own specified character encoding).
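Steps 2 and 3 of this alternative can be sketched in Python as follows. parse_request_target is a hypothetical stand-in for whatever the request handler does, and the UTF-16LE and UTF-8 conversions at the end are only there to show that the server-side representation is a separate choice from the wire encoding.

```python
url = "http://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip"

# Step 2: the client sends the URL as raw UTF-8 bytes, with no %HH escaping.
request_line = b"GET " + url.encode("utf-8") + b" HTTP/1.1"

# The request line is no longer ASCII: it contains bytes >= 0x80.
has_high_bytes = any(b >= 0x80 for b in request_line)

def parse_request_target(line, charset="utf-8"):
    """Hypothetical handler: split the request line and decode the target."""
    method, target, version = line.split(b" ")
    return target.decode(charset)

# Step 3: the server decodes to characters...
decoded = parse_request_target(request_line)

# ...and may then hold them in whatever form its platform prefers.
as_utf16le = decoded.encode("utf-16-le")  # e.g. a Windows-style wide string
as_utf8 = decoded.encode("utf-8")         # e.g. a UTF-8 locale on UNIX

print(has_high_bytes, decoded == url)
print(len(decoded), len(as_utf16le), len(as_utf8))
```

This only works if every hop between client and server passes bytes above 0x7F through untouched, which is exactly the part the escaped form avoids having to assume.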

Mike

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived.