[jcifs] Character Set discussions

Michael B. Allen miallen at eskimo.com
Sat Feb 8 11:15:04 EST 2003


On Fri, 7 Feb 2003 15:38:57 -0600
"Christopher R. Hertel" <crh at ubiqx.mn.org> wrote:

> On Fri, Feb 07, 2003 at 03:48:00PM -0500, Michael B. Allen wrote:
> :
> > Didn't we(I) conclude previously however that once we start accepting
> > escapes we must also return them? 
> 
> I don't think you do need to return escapes if they are not required by 
> the environment (and that's the sticky point).  For example, if you are 
> given:  smb://f%6Fo/  there's no reason not to return that as: smb://foo/
> 
> > And if we return them then do we escape characters as we drill down
> > doing list() operations? We just can't do that.
> 
> You escape characters for presentation only.  There's no reason to keep
> the URLs in escaped format once they are parsed and stored in an internal
> (parsed) format.

Well the Java URL class (and I believe the URI class) does return the
escapes. It returns what it was given which is a good policy.
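
As a rough illustration of that "returns what it was given" behavior, here is a minimal sketch assuming Java 1.4's java.net.URI. The class name UriEscapeDemo, the host "server", and the path are made up for the example. The escapes are placed in the path rather than the host so the URI parses as an ordinary server-based URI.

```java
import java.net.URI;

public class UriEscapeDemo {
    // Returns { full string form, raw (still escaped) path, decoded path }.
    // URI.create is used so a malformed input raises an unchecked exception.
    static String[] describe(String text) {
        URI uri = URI.create(text);
        return new String[] { uri.toString(), uri.getRawPath(), uri.getPath() };
    }

    public static void main(String[] args) {
        // Hypothetical SMB URL with two escaped ASCII letters in the path.
        for (String part : describe("smb://server/sh%61re/f%6Fo/")) {
            System.out.println(part);
        }
    }
}
```

Both toString() and getRawPath() keep the escapes exactly as supplied, while getPath() hands back the decoded form, so the original escaped text is not lost.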

However, now the problem is that if you are given a URL (or URI or IRI)
that contains escapes and then *derive* URLs from it (using list()), you
must also escape the new part. You cannot partially escape a URL.
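
To make the "derived URL" case concrete, here is a sketch of what escaping the new part could look like. It is not jCIFS code: DerivedUrlDemo, escapeComponent, and derive are hypothetical names, and the rule used (percent-escape every UTF-8 byte that is not an unreserved ASCII character) is an assumption for illustration only.

```java
import java.nio.charset.StandardCharsets;

public class DerivedUrlDemo {
    // Percent-escape every byte of the UTF-8 form of name that is not an
    // unreserved ASCII character (letters, digits, '-', '.', '_', '~').
    static String escapeComponent(String name) {
        StringBuilder out = new StringBuilder();
        for (byte b : name.getBytes(StandardCharsets.UTF_8)) {
            int v = b & 0xFF;
            boolean unreserved = (v >= 'a' && v <= 'z') || (v >= 'A' && v <= 'Z')
                    || (v >= '0' && v <= '9')
                    || v == '-' || v == '.' || v == '_' || v == '~';
            if (unreserved) {
                out.append((char) v);
            } else {
                out.append('%').append(String.format("%02X", v));
            }
        }
        return out.toString();
    }

    // Hypothetical stand-in for deriving a child URL during list():
    // the parent is assumed to be escaped already, so the child name
    // is escaped too and the result stays uniformly escaped.
    static String derive(String escapedParent, String childName) {
        return escapedParent + escapeComponent(childName);
    }

    public static void main(String[] args) {
        // Child name contains U+00E9 (e with acute accent) twice.
        System.out.println(derive("smb://server/sh%61re/", "r\u00e9sum\u00e9.txt"));
    }
}
```

Appending the raw name instead would leave an escaped parent followed by an unescaped child, which is the partially escaped URL described above.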

[Side note: I'm just talking about the Unicode characters at this
point. Escaping the 7 special characters is separable (I think)]

Being that SMB is inherently Unicode aware you cannot unconditionally
escape Unicode characters. I think we've all accepted that.

There is one possibility I have not fully explored. We may conclude that
any derived URL does not escape Unicode characters.

Incidentally, here's another problem I just realized. If we accept
both unescaped URLs and URLs that have Unicode characters that have
been converted to UTF-8 and escaped, how will we know after unescaping
them that they are really a sequence of UTF-8 bytes encoding a Unicode
character? Ans: You don't. The only way to know is if you know the URL
will always escape such sequences. It cannot be "optional".
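
A small sketch of that ambiguity, assuming the escaped text is "%C3%A9": the same two bytes decode to one character (é) if they are taken as UTF-8, or to two characters if each byte is taken as a character in its own right (ISO-8859-1 here). The class name AmbiguityDemo is made up for the example, and java.net.URLDecoder is only used as a convenient unescaper.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class AmbiguityDemo {
    // Unescape %XX sequences and interpret the resulting bytes in the
    // named charset. Nothing in the escaped text itself says which
    // charset is the right one.
    static String unescapeAs(String escaped, String charset) {
        try {
            return URLDecoder.decode(escaped, charset);
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String escaped = "%C3%A9";
        String asUtf8 = unescapeAs(escaped, "UTF-8");
        String asLatin1 = unescapeAs(escaped, "ISO-8859-1");
        // Print lengths rather than the characters so the output does
        // not depend on the console encoding.
        System.out.println(asUtf8.length());
        System.out.println(asLatin1.length());
    }
}
```

Unless the URL rules state that such sequences are always UTF-8, both readings are equally valid.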

Ugh.

> > Does the URI class remove and discard the escapes? I don't
> > have Java 1.4 on my machines. Is there any way to get the original escaped
> > URL from your uri object above back? Like from toString()?
> 
> There should be.  Like a computer language, the point here is to translate 
> between something human-readable and computer-usable.  In this case, we 
> also need to go back again because new URL strings can be generated based 
> on combinations (absolute + relative) of other strings.  The new URL 
> strings are then presented back to the user in user-readable form.
> 
> That's what existing browsers, including those that handle SMB URLs, 
> already do.  We're just talking about expanding it to include Unicode.

I seriously doubt these browsers you mention handle Unicode properly. I
doubt they put much thought into it at all really. It would be
irresponsible for us not to check, however.

Mike

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived. 