[jcifs] Character Set discussions

Sat Feb 8 12:35:04 EST 2003

> Well the Java URL class (and I believe the URI class) does return the
> escapes. It returns what it was given which is a good policy.
> 

Javadoc for the URI class can be found at:

http://java.sun.com/j2se/1.4/docs/api/java/net/URI.html

They define an "other" class as:

The Unicode characters that are not in the US-ASCII character set, are 
not control characters (according to the Character.isISOControl 
method), and are not space characters (according to the 
Character.isSpaceChar method) (Deviation from RFC 2396, which is 
limited to US-ASCII)

This allows you to create a URI containing non-ASCII Unicode characters 
(essentially an IRI).  You can then call uri.toString(), which returns 
the string as it was given, or uri.toASCIIString(), which escapes the 
"other" characters as UTF-8 octets and returns a strictly US-ASCII URI.
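
To make the difference between the two methods concrete, here is a minimal
sketch. The class name UriFormsSketch and its two helper methods are made up
for illustration and are not part of jcifs; only java.net.URI is real API.
The input mixes one already-escaped character with one raw Unicode character.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class for illustration only; not part of jcifs.
class UriFormsSketch {

    // Returns the URI text exactly as it was supplied.
    static String asGiven(String text) throws URISyntaxException {
        return new URI(text).toString();
    }

    // Returns the URI with each non-ASCII character replaced by the
    // %-escaped octets of its UTF-8 encoding. Existing escapes are kept.
    static String asAscii(String text) throws URISyntaxException {
        return new URI(text).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // \u017E is "z with caron", written as a Java escape so that the
        // source file itself stays pure ASCII.
        String mixed = "smb://svr/slovak/m%C3%B4\u017Eem/";
        System.out.println(asGiven(mixed));
        System.out.println(asAscii(mixed));
    }
}
```

The second line printed should be smb://svr/slovak/m%C3%B4%C5%BEem/, that
is, the existing %C3%B4 escape is left alone and only the raw character is
escaped. How the first line displays depends on the console encoding.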

> However now the problem is if you are given a URL (or URI or IRI) that
> contains escapes and then *derive* URLs from it (using list()) you must
> also escape the new part. You cannot partially escape a URL.
> 
> [Side note: I'm just talking about the Unicode characters at this
> point. Escaping the 7 special characters is separable (I think)]
> 
> Being that SMB is inherently Unicode aware you cannot unconditionally
> escape Unicode characters. I think we've all accepted that.
> 
> There is one possibility I have not fully explored. We may conclude that
> any derived URL does not escape Unicode characters.
> 
> Incidentally here's another problem I just realized. If we accept
> both unescaped URLs and URLs that have Unicode characters that have
> been converted to UTF-8 and escaped how will we know after unescaping
> them that they are really a sequence of UTF-8 bytes encoding a Unicode
> character? Ans: You don't. The only way to know is if you know the URL
> will always escape such sequences. It cannot be "optional".
> 

I'm not sure I follow some of the above... attached is an example of how 
the URI class interprets some of this.  The attached class takes 2 URIs 
as input, then resolves the second relative to the first and outputs the 
result.  I ran this with an absolute URI (containing a mixture of 
unescaped Unicode characters and escaped, UTF-8 encoded characters) and 
a relative URI (containing similar characters):

java testuri smb://svr/slovak/m%C3%B4žem/ 
./jesť/sklo/nezran%C3%AD/ma.zip > output.txt

Attached is the output -- this seems to be a reasonable interpretation.
Does it address any/all of the above?
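
On the last quoted point (not knowing whether unescaped octets are really
UTF-8), here is a minimal sketch of the ambiguity as I understand it. The
class name EscapeAmbiguitySketch and its unescape helper are made up for
illustration; java.net.URLDecoder is only a convenient stand-in for "unescape
the octets, then pick a charset", not what jcifs or the URI class does
internally.

```java
import java.net.URLDecoder;

// Hypothetical helper class for illustration only; not part of jcifs.
class EscapeAmbiguitySketch {

    // Unescapes %XX sequences and interprets the resulting octets in the
    // given charset. URLDecoder also turns '+' into a space, so this is
    // only suitable for inputs that contain no '+'.
    static String unescape(String escaped, String charset) throws Exception {
        return URLDecoder.decode(escaped, charset);
    }

    public static void main(String[] args) throws Exception {
        String escaped = "m%C3%B4";

        // Read as UTF-8, the two octets form one character (o with circumflex).
        String asUtf8 = unescape(escaped, "UTF-8");

        // Read as ISO-8859-1, the same two octets are two separate characters.
        String asLatin1 = unescape(escaped, "ISO-8859-1");

        System.out.println("UTF-8 reading: " + asUtf8.length() + " chars");       // 2
        System.out.println("ISO-8859-1 reading: " + asLatin1.length() + " chars"); // 3
    }
}
```

Both readings are legal, so the escaped form alone does not say which one was
intended; the reader has to know the convention in advance. As far as I can
tell from the Javadoc, the URI class settles this by always using UTF-8 when
it escapes and when it decodes.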

Eric

-------------- next part --------------
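import java.net.URI;

// Resolves a relative URI against a base URI and prints each one both as
// given (toString) and in escaped US-ASCII form (toASCIIString).
public class testuri {

    public static void main(String[] args) throws Exception {
        if (args.length != 2) {
            System.err.println("usage: java testuri <base-uri> <relative-uri>");
            System.exit(1);
        }

        // Base URI: may mix raw Unicode characters with escaped UTF-8 octets.
        URI uri = new URI(args[0]);
        System.out.println("Input Base URI: " + uri.toString());
        System.out.println("ASCII: " + uri.toASCIIString());
        System.out.println();

        // Relative URI to be resolved against the base.
        URI relative = new URI(args[1]);
        System.out.println("Input Relative URI: " + relative.toString());
        System.out.println("ASCII: " + relative.toASCIIString());
        System.out.println();

        // Derived URI: the relative reference resolved against the base.
        uri = uri.resolve(relative);
        System.out.println("Derived URI: " + uri.toString());
        System.out.println("ASCII: " + uri.toASCIIString());
    }

}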
-------------- next part --------------
Input Base URI: smb://svr/slovak/m%C3%B4žem/
ASCII: smb://svr/slovak/m%C3%B4%C5%BEem/

Input Relative URI: ./jesť/sklo/nezran%C3%AD/ma.zip
ASCII: ./jes%C5%A5/sklo/nezran%C3%AD/ma.zip

Derived URI: smb://svr/slovak/m%C3%B4žem/jesť/sklo/nezran%C3%AD/ma.zip
ASCII: smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip