[jcifs] Character Set discussions
Eric
eglass1 at attbi.com
Sat Feb 8 12:35:04 EST 2003
>
>
> Well the Java URL class (and I believe the URI class) does return the
> escapes. It returns what it was given which is a good policy.
>
Javadoc for the URI class can be found at:
http://java.sun.com/j2se/1.4/docs/api/java/net/URI.html
They define an "other" class as:

    The Unicode characters that are not in the US-ASCII character set, are
    not control characters (according to the Character.isISOControl
    method), and are not space characters (according to the
    Character.isSpaceChar method) (Deviation from RFC 2396, which is
    limited to US-ASCII)

This allows you to create a URI containing non-ASCII Unicode characters
(essentially an IRI). You can then call uri.toString() (which returns
the string as it was given) or uri.toASCIIString() (which encodes the
"other" characters as UTF-8, escapes them, and outputs a valid
US-ASCII URI).
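
As a minimal sketch of the difference (the class name UriForms, its two
helper methods, and the sample URI are made up for illustration; only the
URI constructor, toString() and toASCIIString() come from java.net.URI):

```java
import java.net.URI;
import java.net.URISyntaxException;

public class UriForms {

    // Returns the URI text exactly as it was parsed.
    static String asGiven(String text) throws URISyntaxException {
        return new URI(text).toString();
    }

    // Returns the US-ASCII form: non-ASCII "other" characters are
    // encoded as UTF-8 and percent-escaped; existing escapes are kept.
    static String asAscii(String text) throws URISyntaxException {
        return new URI(text).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // \u017E is the letter z with caron; the "%C3%B4" part is already escaped.
        String text = "smb://svr/slovak/m%C3%B4\u017Eem/";
        System.out.println(asGiven(text));
        System.out.println(asAscii(text));
    }
}
```

The second line printed should contain only ASCII, with the existing
%C3%B4 escape left alone and the unescaped character turned into %C5%BE.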
> However, now the problem is that if you are given a URL (or URI or IRI)
> that contains escapes and then *derive* URLs from it (using list()), you
> must also escape the new part. You cannot partially escape a URL.
>
> [Side note: I'm just talking about the Unicode characters at this
> point. Escaping the 7 special characters is separable (I think)]
>
> Since SMB is inherently Unicode aware, you cannot unconditionally
> escape Unicode characters. I think we've all accepted that.
>
> There is one possibility I have not fully explored. We may conclude that
> any derived URL does not escape Unicode characters.
>
> Incidentally, here's another problem I just realized. If we accept
> both unescaped URLs and URLs whose Unicode characters have been
> converted to UTF-8 and escaped, how will we know after unescaping
> them that they are really a sequence of UTF-8 bytes encoding a Unicode
> character? Answer: you don't. The only way to know is if you know the
> URL will always escape such sequences. It cannot be "optional".
>
I'm not sure I follow some of the above... attached is an example of how
the URI class interprets some of this. The attached class takes 2 URIs
as input, then resolves the second relative to the first and outputs the
result. I ran this with an absolute URI (containing a mixture of
unescaped Unicode characters and escaped, UTF-8 encoded characters) and
a relative URI (containing similar characters):
java testuri smb://svr/slovak/m%C3%B4žem/
./jesť/sklo/nezran%C3%AD/ma.zip > output.txt
Attached is the output -- this seems to be a reasonable interpretation.
Does it address any/all of the above?
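
On the question of telling escaped UTF-8 apart from other escaped bytes,
here is a rough sketch of how I read it. It uses only
java.net.URLDecoder.decode(String, String) and java.net.URI.getPath(); the
class name EscapeAmbiguity, its helper methods, and the sample strings are
illustrative assumptions. The same escaped octets give different text
depending on the charset you assume, and the URI class sidesteps the
question by always assuming UTF-8 when it decodes:

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URLDecoder;

public class EscapeAmbiguity {

    // Unescapes %XX sequences and interprets the bytes in the given charset.
    static String unescape(String escaped, String charset)
            throws UnsupportedEncodingException {
        return URLDecoder.decode(escaped, charset);
    }

    // Returns the path with escapes decoded the way java.net.URI does it
    // (escaped octets are always treated as UTF-8).
    static String decodedPath(String uri) throws URISyntaxException {
        return new URI(uri).getPath();
    }

    public static void main(String[] args) throws Exception {
        String escaped = "m%C3%B4";
        // Treated as UTF-8, the two octets form one character after the 'm'.
        System.out.println(unescape(escaped, "UTF-8"));
        // Treated as ISO-8859-1, the same octets form two characters.
        System.out.println(unescape(escaped, "ISO-8859-1"));
        // URI.getPath() decodes the escape as UTF-8 and passes the
        // unescaped non-ASCII character (\u017E) through untouched.
        System.out.println(decodedPath("smb://svr/slovak/m%C3%B4\u017Eem/"));
    }
}
```

So the bytes alone do not say which reading is right; something has to
fix the convention. getRawPath() is available when the escapes need to be
kept as they were given.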
Eric
-------------- next part --------------
import java.net.URI;

public class testuri {

    public static void main(String[] args) throws Exception {
        // args[0]: absolute base URI; args[1]: URI to resolve against it
        URI uri = new URI(args[0]);
        System.out.println("Input Base URI: " + uri.toString());
        System.out.println("ASCII: " + uri.toASCIIString());
        System.out.println();

        URI relative = new URI(args[1]);
        System.out.println("Input Relative URI: " + relative.toString());
        System.out.println("ASCII: " + relative.toASCIIString());
        System.out.println();

        // Resolve the second URI relative to the first
        uri = uri.resolve(relative);
        System.out.println("Derived URI: " + uri.toString());
        System.out.println("ASCII: " + uri.toASCIIString());
    }
}
-------------- next part --------------
Input Base URI: smb://svr/slovak/m%C3%B4žem/
ASCII: smb://svr/slovak/m%C3%B4%C5%BEem/
Input Relative URI: ./jesť/sklo/nezran%C3%AD/ma.zip
ASCII: ./jes%C5%A5/sklo/nezran%C3%AD/ma.zip
Derived URI: smb://svr/slovak/m%C3%B4žem/jesť/sklo/nezran%C3%AD/ma.zip
ASCII: smb://svr/slovak/m%C3%B4%C5%BEem/jes%C5%A5/sklo/nezran%C3%AD/ma.zip