[jcifs] Character Set discussions

Glass, Eric eric.glass at capitalone.com
Wed Feb 5 22:02:17 EST 2003


> > 
> 	But you will still run into situations where the encoding of files
> 	or protocol transport does not permit Unicode (like right now with
> 	web browsers).

The encoding of the file is unrelated to the encoding specified for the URL
itself -- from RFC 2396:

   A URI is represented as a sequence of characters, not as a sequence
   of octets. That is because URI might be "transported" by means that
   are not through a computer network, e.g., printed on paper, read over
   the radio, etc.

   A URI scheme may define a mapping from URI characters to octets;
   whether this is done depends on the scheme.

The HTTP URL scheme does not formally define a mapping (that I am aware of);
this can lead to confusion in certain situations.  Internet Explorer, for
example, will accept Unicode URLs (i.e., typed into the address field).
When the HTTP request is sent, however, the GET request needs an ASCII
representation of the URL.  IE has an option (under "Internet Options -->
Advanced", "Always send URLs as UTF-8") which defines the mapping; if
disabled, it will use the client's code page (which must match the server if
the transaction is expected to succeed).  If enabled, URLs will be sent in
UTF-8 -- if the server can understand this, then great.
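
To make the difference concrete, here is a small Python sketch (not part of
the original exchange) showing how one path segment maps to different octets
under the two settings.  It assumes windows-1250 (Python's "cp1250") as a
stand-in for a Central European client code page; the variable names are
made up for illustration.

```python
from urllib.parse import quote, unquote

segment = "môžem"  # one path segment containing non-ASCII characters

# Mapping 1: UTF-8 (the "Always send URLs as UTF-8" case).
as_utf8 = quote(segment, encoding="utf-8")

# Mapping 2: a client code page; cp1250 is an assumption for illustration.
as_cp1250 = quote(segment, encoding="cp1250")

print(as_utf8)    # m%C3%B4%C5%BEem
print(as_cp1250)  # m%F4%9Eem

# A server that assumes UTF-8 cannot recover the name from the cp1250 form.
misread = unquote(as_cp1250, encoding="utf-8", errors="replace")
print(misread == segment)  # False
```

The same characters yield two different ASCII strings, so the request only
resolves if the server decodes with the mapping the client used.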

Specifying a mapping in the URL scheme definition clears up a lot of this
confusion, which is why it is recommended in RFC 2718; in particular, using
UTF-8 as this mapping has been identified as the best practice.

> 
> > There is *supposed* to be a header declaring the encoding of the file
> > (if it's in HTML, for example).  It will, as you suggest, take the
> > Latin world a while to get used to this.
> > 

This applies to the content being served; we are talking about the means of
identifying a resource.

> 	And HTML has the META tag. I think ultimately this is what should
> 	happen.  Everything should just be widened to Unicode where the
> 	encoding is left undefined or if it's protocol transport like HTTP
> 	they pick UTF-8 or negotiate an encoding.

Note that this is only really applicable if the content is character-based
(i.e., HTML or text).  HTTP (MIME, really) provides the "Content-Type"
header to denote the type of the content being provided; the "charset"
parameter is applicable to all "text" subtypes.  The HTML META tag provides
the http-equiv mechanism for overriding/adding ANY response header; in
particular, you can specify:

<META http-equiv="Content-Type" content="text/html; charset=utf-8">

to indicate the character set of the document.  The charset parameter is not
applicable to non-character entities; you won't see a response header like:

Content-Type: image/jpeg; charset=utf-8

because a character set only applies to characters.
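
As a rough illustration of how a recipient might read that parameter, the
sketch below uses Python's standard email package to parse a Content-Type
value.  The helper name charset_of is hypothetical, and this is only one
possible way to do it, not how any particular browser or server works.

```python
from email.message import Message

def charset_of(content_type):
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() lowercases the value and returns None if absent.
    return msg.get_content_charset()

print(charset_of("text/html; charset=utf-8"))  # utf-8
print(charset_of("image/jpeg"))                # None
```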

In any case, when discussing SMB, you don't particularly CARE about the
content -- it's a big binary chunk.  You DO care about identifying WHICH
file to retrieve.  This is where the URL scheme encoding comes into play.
With HTTP, you have a scenario like this (we'll assume that both client and
server are using UTF-8 to encode URLs, although as noted above this isn't
mandated):

1. User types a Unicode HTTP URL into their browser window:
    http://svr/slovak/môžem/jesť/sklo/nezraní/ma.zip

2. Browser encodes the URL in UTF-8 and does a GET:
    GET /slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip HTTP/1.1

3. Server decodes the URI to get:
   "/slovak/môžem/jesť/sklo/nezraní/ma.zip"

4. Server provides the resource to the client (which may be character
content, which may have its own specified character encoding).
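
The encode and decode steps above can be sketched in a few lines of Python.
This is only an illustration of the round trip under the stated assumption
that both sides use UTF-8; it says nothing about how a real browser or
server is implemented, and the variable names are my own.

```python
from urllib.parse import quote, unquote

path = "/slovak/môžem/jesť/sklo/nezraní/ma.zip"

# Step 2: the client percent-encodes the UTF-8 octets; "/" is left as-is.
request_target = quote(path, safe="/", encoding="utf-8")
request_line = "GET " + request_target + " HTTP/1.1"
print(request_line)

# Step 3: the server reverses the mapping using the same encoding.
decoded = unquote(request_target, encoding="utf-8")
print(decoded == path)  # True
```

The request line contains only 7-bit ASCII, yet the original Unicode path is
recovered exactly because both ends apply the same mapping.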

The key piece is that both the client and the server need to agree on a common
encoding when referring to entities with Unicode filenames.  This ensures
that the filename can be successfully reconstructed from a 7-bit ASCII
representation.  This is what the URL scheme specifies.  RFC 2396 indicates
that a future revision will allow the URI to specify its own encoding:

   It is expected that a systematic treatment of character encoding
   within URI will be developed as a future modification of this
   specification.

But until that point, each scheme must address the issue individually.

Eric
 

