[jcifs] Creating file with hash ('#') in filename

Fri Jan 17 20:10:14 EST 2003

On Fri, 17 Jan 2003 00:53:16 -0600
"Christopher R. Hertel" <crh at ubiqx.mn.org> wrote:

> On Fri, Jan 17, 2003 at 12:22:08AM -0500, Allen, Michael B (RSCH) wrote:
> :
> > > This is the Unicode issue.  URLs must handle full Unicode in order to deal 
> > > with this stuff.  It is a problem, but it's not specific to the SMB URL.
> > > 
> > Yes, it does seem to be coming down to Unicode. But it is specific to the SMB URL.
> > RFC 2396 does not appear to address Unicode. You cannot have Unicode HTTP URLs
> > for example. Thus the problem.
> 
> HTTP URLs are just one example.  Still, "You cannot have Unicode HTTP
> URLs"...  That's why it's *not* specific to the SMB URL.  People may post
> all sorts of files on a web server.  If the filename is in Cyrillic
> then...well, same problem even if it's accessed via HTTP.  It's a general
> URL issue.  URLs *should* support Unicode... but they don't.

No. You cannot have a Cyrillic filename on a web server. You can have a
Cyrillic *link* displayed in the page but the filenames and all parts of
the URL are ASCII. There might be extensions to this. I don't know. But
URLs are 100% good ol' ASCII.

> 
> :
> > > I don't follow your argument there.  What is it you are trying to say?
> > > Sorry, I'm being dense, I guess.
> > > 
> > What the example shows is that the HTTP URL handler shipped with Java does what
> > jCIFS' SMB URL handler does now, which, as you put it, is syntactically incorrect, not
> > a URL, and incomplete.
> 
> Lost again.  Sorry, that didn't scan.

Java ships with an HTTP URL handler. jCIFS provides an SMB URL
handler. You're arguing that the SMB handler should be compliant. I'm just
pointing out that the HTTP URL handler shipped with Java isn't compliant
either and Eric's example illustrated that. It took the spaces. And it
passed them back unescaped. So someone punted on the escapes just like me,
and probably for transcribability reasons. Don't really know.

> > Notice it's the URI class rather than the URL class that provokes the exception. I don't
> > have Java 1.4 so I don't know what toExternalForm would return but I suspect it would
> > be escaped.
> 
> Which would cover Eric's final point, about handing the URL back to the 
> user with correct syntax.

That's the UR-eye class not the UR-ell. The URI class was introduced
in Java 1.4. We don't use it. I'm not sure what it's for. Maybe Sun
decided everything needed to be escaped and came up with the URI
class. Don't know.
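
To make the URL-versus-URI difference concrete, here is a minimal sketch using only java.net classes. The host name and path are made-up examples, and the class and method names are hypothetical. It shows the URL class accepting a literal space and handing it back unescaped, the single-argument URI constructor rejecting the same string, and the multi-argument URI constructor quoting the space.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

final class UrlVersusUri {
    /** Returns what java.net.URL hands back for the given spec. */
    static String viaUrl(String spec) {
        try {
            // URL does not validate or escape the path; it keeps the space.
            return new URL(spec).toExternalForm();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    /** Returns true if the single-argument URI constructor rejects the spec. */
    static boolean uriRejects(String spec) {
        try {
            new URI(spec);
            return false;
        } catch (URISyntaxException e) {
            return true;
        }
    }

    /** Builds a URI from components; this constructor quotes illegal characters. */
    static String viaUriComponents(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String spec = "http://example.com/my file.txt"; // example value only
        System.out.println(viaUrl(spec));
        System.out.println(uriRejects(spec));
        System.out.println(viaUriComponents("http", "example.com", "/my file.txt"));
    }
}
```

The http scheme is used here because a stock JDK has no smb handler registered; with jCIFS's handler installed the URL class should treat an smb:// string the same way, but that is not verified by this sketch.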

> > I am arguing that transcribability is better without the escapes. In fact
> > if we were to require what the URI (not L) class mandates it would be
> > rather difficult for the average person to construct certain URLs. They
> > would then need application support for performing the escaping.
> 
> Well... yes, it's easier to transcribe something if it doesn't have 
> escapes.  That doesn't solve the problem, though.  Invalid syntax is still 
> invalid syntax.  That's why the escapes are there.
> 
> Putting aside non-ASCII characters, the set of valid SMB filename
> characters that are invalid URL path characters is fairly small.  The
> remaining few need to be escaped.  Let's see, the SNIA doc lists the 
> excluded characters as:
> 
> '"', '.', '\', '/', '[', ']', ':', '+', '|', '<', '>', '=', ';', ',', '*', '?'
> 
> ...but those are for 8.3 notation so some of them can be removed (like the 
> dot).  RFC2396 allows alpha-numerics plus the following:
> 
>   ":" | "@" | "&" | "=" | "+" | "$" | "," | "-" | "_" | "." | "!" | "~" | 
>   "*" | "'" | "(" | ")"
> 
> The characters to worry about are all those that appear in *neither* of 
> the above.  That would be:
> 
>       space: ' '
>        hash: '#'
>     percent: '%'
>   back-tick: '`'
>       caret: '^'
>      lbrace: '{'
>      rbrace: '}'

Interesting. I was wondering what the precise list was.
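
The list can be checked mechanically. The sketch below walks printable ASCII, skips alphanumerics, and keeps whatever appears in neither of the two quoted sets. The two string constants are copied from the quoted lists above; the class and method names are hypothetical.

```java
final class UnsafeSmbUrlChars {
    // Excluded filename characters, as quoted from the SNIA doc.
    static final String SNIA_EXCLUDED = "\".\\/[]:+|<>=;,*?";
    // Non-alphanumeric characters RFC 2396 allows, as quoted.
    static final String RFC2396_ALLOWED = ":@&=+$,-_.!~*'()";

    /** Printable ASCII characters that appear in neither set, in ASCII order. */
    static String charsToWorryAbout() {
        StringBuilder result = new StringBuilder();
        for (char c = 0x20; c <= 0x7E; c++) {
            if (Character.isLetterOrDigit(c)) continue;     // alphanumerics are fine
            if (SNIA_EXCLUDED.indexOf(c) >= 0) continue;    // cannot occur in a name
            if (RFC2396_ALLOWED.indexOf(c) >= 0) continue;  // legal in a URL path
            result.append(c);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        // Print each remaining character in quotes, one per line.
        for (char c : charsToWorryAbout().toCharArray()) {
            System.out.println("'" + c + "'");
        }
    }
}
```

Run as written, this produces the same seven characters: space, hash, percent, caret, back-tick and the two braces.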

> If we say that non-ASCII characters do not need to be escaped (which I
> claim is the right thing to do) then those seven are the only remaining
> which really should be.  For jCIFS, you've already got a work-around for
> the hash ('#'), so that's optional.
> 
> Escaping the percent is necessary in any string in which it is followed by
> characters which are hex digits.  "Foo%bar", for instance... is that "Foo"
> + 0xBA + "r" or "Foo" + "%" + "bar"?  You could argue that "this%that" can
> only have one interpretation...
> 
> Escaping the space may seem optional to the user, but it's a bad idea not
> to escape it.  What about a name like "Foo  bar"?  There are two spaces 
> there, but my text editor wanted to reduce them to one.  How do you know 
> it's not a tab?  What about linewrap?  Urq.  For transcribability, you 
> want to escape spaces.
> 
> The rest of the characters belong to the "unwise" set, listed in RFC2396.  
> These *should* be escaped by the user, but you could fudge those too.  
> Again, it's not that jCIFS requires that they be escaped, it's that a 
> smart user (do they exist? I think there are some) will escape them.

And being smart they will quickly find that it won't work.

> So, except for the percent sign and the space, you could argue that all of
> the above are fudge-able.  That still doesn't get you out of having to
> translate escape sequences if the user hands them to you.

I don't understand. We don't have to do anything. We just pass them
right through. Only the cross hatch '#' was syntactically important. And
we handled that. This is good!
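
The "Foo%bar" ambiguity from the quoted text can be seen with java.net.URLDecoder. This is only an illustration: URLDecoder implements form decoding (it also turns '+' into a space), the ISO-8859-1 charset is an arbitrary choice for the example, and the class and method names are hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

final class PercentAmbiguity {
    /** Decodes %XX escapes, treating each escaped byte as ISO-8859-1. */
    static String decode(String s) {
        try {
            return URLDecoder.decode(s, "ISO-8859-1");
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is required on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /** Returns false if the string contains a '%' not followed by two hex digits. */
    static boolean isDecodable(String s) {
        try {
            decode(s);
            return true;
        } catch (IllegalArgumentException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // "%ba" is a valid escape, so a decoder reads it as the byte 0xBA.
        System.out.println(decode("Foo%bar").equals("Foo\u00BAr"));
        // "%th" is not hex, so this string has only one possible reading.
        System.out.println(isDecodable("this%that"));
        // Escaping the percent itself removes the ambiguity.
        System.out.println(decode("Foo%25bar"));
    }
}
```

A handler that passes everything straight through never hits this case; it only arises once something on the path starts decoding escapes.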

> > > The fatal flaw is that it is not valid URL syntax.  Trying to be 'nice' to
> > > users is one thing, but we shouldn't be 'breaking' URL syntax to do it.
> > > 
> > 	It's not "fatal". And jCIFS proves it (knock on wood!).
> 
> It is if it means that jCIFS doesn't comply with standards.  [Not SMB URL
> standards, because there is no such thing yet... URL standards, in this
> case.]  Having (finally) done the analysis I don't see a problem except 
> for the percent and the space.  I would really like to see jCIFS return 
> proper escapes if either the percent or space are entered literally, 
> though.

But we just keep butting into the same problem: non-ASCII characters.
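
For the ASCII side of that request, a sketch of what "returning proper escapes" could look like: percent-encode only the seven characters listed earlier and leave everything else, including non-ASCII characters, untouched. This is a hypothetical helper, not anything jCIFS provides.

```java
final class SevenCharEscaper {
    // space, hash, percent, caret, back-tick, braces
    static final String MUST_ESCAPE = " #%^`{}";

    /** Percent-encodes the seven characters; all other characters pass through. */
    static String escape(String path) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < path.length(); i++) {
            char c = path.charAt(i);
            if (MUST_ESCAPE.indexOf(c) >= 0) {
                sb.append(String.format("%%%02X", (int) c)); // e.g. ' ' becomes %20
            } else {
                sb.append(c); // includes non-ASCII, which stays as Unicode
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Foo  bar"));   // two spaces stay distinguishable
        System.out.println(escape("100%.txt"));
        System.out.println(escape("\u0444\u0430\u0439\u043b {1}.txt"));
    }
}
```

Note that it escapes every literal percent sign, so feeding it a string that is already escaped would escape it twice; deciding which kind of input has been handed in is exactly the part this sketch does not solve.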

> 
> Here's what the RFC says about the "unwise" set:
> 
>    Other characters are excluded because gateways and other transport
>    agents are known to sometimes modify such characters, or they are
>    used as delimiters.
> 
>    unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"
> 
>    Data corresponding to excluded characters must be escaped in order to
>    be properly represented within a URI.
> 
> Note that the "[" and "]" were later removed from the unwise set to be
> used as delimiters for IPv6 addresses.
> 
> > > Besides, people (especially those who might use jCIFS as the toolkit to
> > > build their killer application) will expect to be able to hand in a valid
> > > URL with the smb:// prefix and get expected results.
> > > 
> > 	I think the results are well defined. We just don't use escapes.
> 
> It's not a question of whether *we* use escapes.  Users are the ones who 
> will enter the escapes.  Either through cut-and-paste, or by clicking on 
> the URL in a web page (yes, it could happen--particularly on an intranet).

Well browsers do not understand the SMB URL yet. And when they do they
will have to deal with this problem too. If an application expects to
get escaped SMB URLs then it will have to deal with that but I don't
see how it can.

> 
> > > ...and here's something else that confuses me.  It seems you're arguing in
> > > favor of avoiding the escapes for the sake of user convenience (whack me
> > > on the head if'n I'm misinterpreting that), but at the same time, the
> > > current code is pedantic, in places, about the trailing slash '/'.  
> > > People forget the trailing slash all the time.
> > > 
> > Yes, that is a pain. Unfortunately that is mandated by Java's URL parser. So
> > any other Java URL implementation will exhibit the same behavior. It's just
> > less noticeable because of the way SMB URLs are used (listing directories).
> 
> There's got to be some way 'round it...  Hmmm...

The *real* problem is that we don't know if the resource refers to a
directory or a file, and when you combine a base URL with a name using the
two-argument constructor, the URL class cuts off the last path component of
the base if it doesn't have a trailing slash:

  smb://server/share/dir/ + file.txt => smb://server/share/dir/file.txt
  smb://server/share/dir + file.txt => smb://server/share/file.txt

The 'dir' gets dropped. I don't think we're going to fix that. I think
this is built into the URL specs somewhere. Maybe you can come up with a solution.
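
The dropped 'dir' can be reproduced without jCIFS. In the sketch below, java.net.URI.resolve is used for the smb:// strings because a stock JDK has no smb handler for the URL class, and the two-argument URL constructor is shown with http as a stand-in. Both follow the relative-reference rules of RFC 2396 section 5.2, where everything after the last slash of the base path is discarded before the relative name is appended. Class and method names are hypothetical.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

final class TrailingSlashResolution {
    /** Resolves name against base using java.net.URI (works for any scheme). */
    static String resolveWithUri(String base, String name) {
        return URI.create(base).resolve(name).toString();
    }

    /** Resolves name against base using the two-argument URL constructor. */
    static String resolveWithUrl(String base, String name) {
        try {
            return new URL(new URL(base), name).toExternalForm();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // With the trailing slash, 'dir' is kept.
        System.out.println(resolveWithUri("smb://server/share/dir/", "file.txt"));
        // Without it, 'dir' is treated as a file name and replaced.
        System.out.println(resolveWithUri("smb://server/share/dir", "file.txt"));
        // The URL class behaves the same way (http used as a stand-in scheme).
        System.out.println(resolveWithUrl("http://server/share/dir", "file.txt"));
    }
}
```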

> > > Ah.  So, by extension, a URI string is an encoded UTF-8 string, where all 
> > > disallowed and non-ASCII characters (including all two-byte characters) 
> > > are escaped.
> > > 
> > No. This is from the *XML* specification. It explicitly states that RFC 2396 does not
> > support Unicode. There may be another RFC that does deal with that. I don't know. But
> > currently URLs and URIs do not support Unicode. I would imagine many products *do*
> > use UTF-8 conversion and escaping to get around that. They must. But that is not part
> > of the standard.
> 
> Right.  That's not what I was saying.  It's badly worded, sorry.
> 
> I should say that jCIFS should accept non-ASCII characters as Unicode so 
> that users do not have to escape non-ASCII characters.

Well all strings are Unicode in Java. And we don't escape them so this
is what jCIFS does and has always done (although until we introduced
the jcifs.encoding property I don't think it converted things properly
internally).

> > There are no codepage issues. Everything in Java is Unicode unless specified
> > otherwise using the file.encoding property. jCIFS will convert all Strings to the encoding
> > specified by the jcifs.encoding property.
> 
> Sorry, there *are* codepage issues.  If you connect to a Samba 2.2.x
> server it won't support Unicode.  The file and directory names are,
> therefore, presented in whatever 8-bit DOS OEM codepage the Samba server
> is configured to use.  The client has no idea what codepage the server is
> using, and vice versa.  There's nothing in the protocol that lets you
> know.  So, the Unicode used by jCIFS may not be compatible with the DOS
> Codepage where non-ASCII characters are concerned.  Therefore, the
> names--if they contain non-ASCII characters--may not match.  The
> likelihood of problems increases the further you get from ASCII.

Ok. There are no codepage issues with *the SMB URL* (because everything
coming in gets converted to Unicode).

> This problem exists with all pre-NT LM 0.12 dialects, but jCIFS doesn't
> support those so it's not an issue.  I don't know what other NT LM 0.12
> servers may have this problem.  There may be several NT LM 0.12 servers
> that don't do Unicode.
> 
> There is nothing that can be done to fix it, however, unless the
> jcifs.encoding property can be told to use the (proprietary) DOS OEM
> codepages.  I think it would be ugly to add a ?CODEPAGE= option to the SMB 
> URL, and doing so would require that all implementations come with a 
> complete set of DOS codepages.

I like it. Throw it in. Make it optional what codepages are supported.

> 
> I think we should just stick to Unicode and punt on the codepages.

There are NO CODEPAGE ISSUES in jCIFS. The jcifs.encoding property
DOES specify a codepage. JCIFS supports ALL codepages listed here:

  http://java.sun.com/j2se/1.3/docs/guide/intl/encoding.doc.html

and is fully Unicode. It undoubtedly has the most flexible character
encoding support of ANY client. It can read in Korean encoded SMB URLs
and talk EBCDIC with OS/390.

Really. It's built into the language. I didn't even have to do anything.
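
A small sketch of both halves of this argument, using only java.nio.charset. The transcode method shows what happens when the two ends disagree about the 8-bit encoding of a name, and configuredCharset shows looking up an encoding by name from the jcifs.encoding property. ISO-8859-1 and UTF-8 stand in for real server and client codepages because they are guaranteed on every JVM; the fallback value and all class and method names are assumptions for illustration, not jCIFS's actual code or default.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

final class CodepageMismatch {
    /** Encodes a name as the server would, then decodes it as the client would. */
    static String transcode(String name, Charset serverSide, Charset clientSide) {
        byte[] onTheWire = name.getBytes(serverSide);
        return new String(onTheWire, clientSide);
    }

    /** Looks up the charset named by the jcifs.encoding system property. */
    static Charset configuredCharset() {
        // The fallback below is an assumption for this sketch only.
        String name = System.getProperty("jcifs.encoding", "ISO-8859-1");
        return Charset.forName(name);
    }

    public static void main(String[] args) {
        String name = "caf\u00e9.txt";
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset utf8 = StandardCharsets.UTF_8;

        // Both ends agree: the name survives.
        System.out.println(transcode(name, latin1, latin1).equals(name));
        // The ends disagree: the non-ASCII character is damaged.
        System.out.println(transcode(name, latin1, utf8).equals(name));
        // Pure ASCII names survive either way.
        System.out.println(transcode("cafe.txt", latin1, utf8).equals("cafe.txt"));
        // Whether a given DOS codepage is available depends on the JVM.
        System.out.println(Charset.isSupported("Cp850"));
        System.out.println(configuredCharset());
    }
}
```

So the conversion itself is free once the right encoding name is supplied; what the protocol does not provide is a way to discover which name to supply.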

-- 
A  program should be written to model the concepts of the task it
performs rather than the physical world or a process because this
maximizes  the  potential  for it to be applied to tasks that are
conceptually  similar and, more important, to tasks that have not
yet been conceived.