[jcifs] Creating file with hash ('#') in filename

Fri Jan 17 17:53:16 EST 2003

On Fri, Jan 17, 2003 at 12:22:08AM -0500, Allen, Michael B (RSCH) wrote:
:
> > This is the Unicode issue.  URLs must handle full Unicode in order to deal 
> > with this stuff.  It is a problem, but it's not specific to the SMB URL.
> > 
> Yes, it does seem to be coming down to Unicode. But it is specific to the SMB URL.
> RFC 2396 does not appear to address Unicode. You cannot have Unicode HTTP URLs
> for example. Thus the problem.

HTTP URLs are just one example.  Still, "You cannot have Unicode HTTP
URLs"...  That's why it's *not* specific to the SMB URL.  People may post
all sorts of files on a web server.  If the filename is in Cyrillic
then...well, same problem even if it's accessed via HTTP.  It's a general
URL issue.  URLs *should* support Unicode... but they don't.

:
> > I don't follow your argument there.  What is it you are trying to say?
> > Sorry, I'm being dense, I guess.
> > 
> What the example shows is that the HTTP URL handler shipped with Java does what
> jCIFS' SMB URL handler does now, which is, as you put it, syntactically incorrect, not
> a URL, and incomplete.

Lost again.  Sorry, that didn't scan.

> Notice it's the URI class rather than the URL class that provokes the exception. I don't
> have Java 1.4 so I don't know what toExternalForm would return but I suspect it would
> be escaped.

Which would cover Eric's final point, about handing the URL back to the 
user with correct syntax.
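
To make the URI-versus-URL point concrete, here is a minimal sketch using
java.net.URI from Java 1.4 and later.  The class and method names, the host
"server" and the paths are made up for illustration.  It shows that the
single-argument URI constructor rejects a raw space, while the
multi-argument constructor quotes illegal path characters itself:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of jCIFS.
final class UriStrictnessSketch {
    // True if the single-argument URI constructor accepts the string as-is.
    static boolean parses(String s) {
        try {
            new URI(s);
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    // Builds an smb URI from components.  This constructor quotes
    // characters that are illegal in a path, including '%' itself.
    static String quoted(String host, String path) {
        try {
            return new URI("smb", host, path, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With this, parses("smb://server/share/my file.txt") is false, and
quoted("server", "/share/my file#1.txt") gives
"smb://server/share/my%20file%231.txt".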

:
> I am arguing that transcribability is better without the escapes. In fact
> if we were to require what the URI (not L) class mandates it would be
> rather difficult for the average person to construct certain URLs. They
> would then need application support for performing the escaping.

Well... yes, it's easier to transcribe something if it doesn't have 
escapes.  That doesn't solve the problem, though.  Invalid syntax is still 
invalid syntax.  That's why the escapes are there.

Putting aside non-ASCII characters, the set of valid SMB filename
characters that are invalid URL path characters is fairly small.  The
remaining few need to be escaped.  Let's see, the SNIA doc lists the 
excluded characters as:

'"', '.', '\', '/', '[', ']', ':', '+', '|', '<', '>', '=', ';', ',', '*', '?'

...but those are for 8.3 notation so some of them can be removed (like the 
dot).  RFC2396 allows alpha-numerics plus the following:

  ":" | "@" | "&" | "=" | "+" | "$" | "," | "-" | "_" | "." | "!" | "~" | 
  "*" | "'" | "(" | ")"

The characters to worry about are all those that appear in *neither* of 
the above.  That would be:

      space: ' '
       hash: '#'
    percent: '%'
  back-tick: '`'
      caret: '^'
     lbrace: '{'
     rbrace: '}'

If we say that non-ASCII characters do not need to be escaped (which I
claim is the right thing to do) then those seven are the only ones left
that really should be escaped.  For jCIFS, you've already got a work-around
for the hash ('#'), so that's optional.
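
The list of seven can be checked mechanically.  This is a rough sketch that
takes the two character lists exactly as quoted above (the SNIA exclusions,
8.3 extras included, and the RFC2396 punctuation) and walks the printable
ASCII range.  The class and field names are made up for illustration:

```java
// Hypothetical helper, not part of jCIFS.
final class NeitherSetSketch {
    // Excluded SMB filename characters, as quoted from the SNIA doc.
    static final String SMB_EXCLUDED = "\".\\/[]:+|<>=;,*?";
    // Punctuation that RFC2396 allows in addition to alpha-numerics.
    static final String RFC2396_ALLOWED = ":@&=+$,-_.!~*'()";

    // Printable ASCII characters that appear in neither list.
    static String neither() {
        StringBuilder sb = new StringBuilder();
        for (char c = 0x20; c < 0x7F; c++) {
            if (Character.isLetterOrDigit(c)) {
                continue; // alpha-numerics are fine in both worlds
            }
            if (SMB_EXCLUDED.indexOf(c) < 0 && RFC2396_ALLOWED.indexOf(c) < 0) {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

neither() returns the seven characters in ASCII order: space, '#', '%',
'^', '`', '{' and '}'.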

Escaping the percent is necessary in any string in which it is followed by
two characters which are hex digits.  "Foo%bar", for instance... is that "Foo"
+ 0xBA + "r" or "Foo" + "%" + "bar"?  You could argue that "this%that" can
only have one interpretation...
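
Here is a small sketch of the ambiguity.  It assumes a lenient decoder
(my own invention for illustration, not what jCIFS does) that treats '%'
as an escape only when two hex digits follow it, and as a literal
otherwise:

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of jCIFS.
final class PercentAmbiguitySketch {
    // Value of a hex digit, or -1 if the byte is not one.
    static int hex(int c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // "%XX" becomes one byte; a '%' not followed by two hex digits
    // is passed through unchanged.
    static byte[] decodeLenient(String s) {
        byte[] in = s.getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            if (in[i] == '%' && i + 2 < in.length
                    && hex(in[i + 1]) >= 0 && hex(in[i + 2]) >= 0) {
                out.write(hex(in[i + 1]) * 16 + hex(in[i + 2]));
                i += 2; // skip the two hex digits just consumed
            } else {
                out.write(in[i]);
            }
        }
        return out.toByteArray();
    }
}
```

Under that rule "Foo%bar" comes back as five bytes ("Foo", 0xBA, "r"),
"this%that" comes back untouched, and the only way to get a literal
"Foo%bar" is to write "Foo%25bar".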

Escaping the space may seem optional to the user, but it's a bad idea not
to escape it.  What about a name like "Foo  bar"?  There are two spaces 
there, but my text editor wanted to reduce them to one.  How do you know 
it's not a tab?  What about linewrap?  Urq.  For transcribability, you 
want to escape spaces.

The rest of the characters belong to the "unwise" set, listed in RFC2396.  
These *should* be escaped by the user, but you could fudge those too.  
Again, it's not that jCIFS requires that they be escaped, it's that a 
smart user (do they exist? I think there are some) will escape them.

So, except for the percent sign and the space, you could argue that all of
the above are fudge-able.  That still doesn't get you out of having to
translate escape sequences if the user hands them to you.

> > The fatal flaw is that it is not valid URL syntax.  Trying to be 'nice' to
> > users is one thing, but we shouldn't be 'breaking' URL syntax to do it.
> > 
> 	It's not "fatal". And jCIFS proves it (knock on wood!).

It is if it means that jCIFS doesn't comply with standards.  [Not SMB URL
standards, because there is no such thing yet... URL standards, in this
case.]  Having (finally) done the analysis I don't see a problem except 
for the percent and the space.  I would really like to see jCIFS return 
proper escapes if either the percent or the space is entered literally, 
though.
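
Something along these lines is what I have in mind for handing a name back
in proper form.  It is only a sketch: the class and method names are
hypothetical, it escapes the seven characters listed earlier, and it
deliberately leaves non-ASCII characters alone:

```java
// Hypothetical helper, not part of jCIFS.
final class SmbPathEscapeSketch {
    private static final String MUST_ESCAPE = " #%^`{}";
    private static final String HEX = "0123456789ABCDEF";

    // Replaces each of the seven characters with its %XX escape and
    // copies everything else (including non-ASCII) through unchanged.
    static String escape(String name) {
        StringBuilder sb = new StringBuilder(name.length() + 8);
        for (int i = 0; i < name.length(); i++) {
            char c = name.charAt(i);
            if (MUST_ESCAPE.indexOf(c) >= 0) {
                sb.append('%')
                  .append(HEX.charAt(c >> 4))
                  .append(HEX.charAt(c & 0x0F));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }
}
```

So escape("Foo  bar") gives "Foo%20%20bar", which survives a text editor
that collapses runs of spaces, and escape("100%") gives "100%25".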

Here's what the RFC says about the "unwise" set:

   Other characters are excluded because gateways and other transport
   agents are known to sometimes modify such characters, or they are
   used as delimiters.

   unwise      = "{" | "}" | "|" | "\" | "^" | "[" | "]" | "`"

   Data corresponding to excluded characters must be escaped in order to
   be properly represented within a URI.

Note that the "[" and "]" were later removed from the unwise set to be
used as delimiters for IPv6 addresses.

> > Besides, people (especially those who might use jCIFS as the toolkit to
> > build their killer application) will expect to be able to hand in a valid
> > URL with the smb:// prefix and get expected results.
> > 
> 	I think the results are well defined. We just don't use escapes.

It's not a question of whether *we* use escapes.  Users are the ones who 
will enter the escapes.  Either through cut-and-paste, or by clicking on 
the URL in a web page (yes, it could happen--particularly on an intranet).

> > ...and here's something else that confuses me.  It seems you're arguing in
> > favor of avoiding the escapes for the sake of user convenience (whack me
> > on the head if'n I'm misinterpreting that), but at the same time, the
> > current code is pedantic, in places, about the trailing slash '/'.  
> > People forget the trailing slash all the time.
> > 
> Yes, that is a pain. Unfortunately that is mandated by Java's URL parser. So
> any other Java URL implementation will exhibit the same behavior. It's just
> less noticeable because of the way SMB URLs are used (listing directories).

There's got to be some way 'round it...  Hmmm...

> > Ah.  So, by extension, a URI string is an encoded UTF-8 string, where all 
> > disallowed and non-ASCII characters (including all two-byte characters) 
> > are escaped.
> > 
> No. This is from the *XML* specification. It explicitly states that RFC 2396 does not
> support Unicode. There may be another RFC that does deal with that. I don't know. But
> currently URLs and URIs do not support Unicode. I would imagine many products *do*
> use UTF-8 conversion and escaping to get around that. They must. But that is not part
> of the standard.

Right.  That's not what I was saying.  It's badly worded, sorry.

I should say that jCIFS should accept non-ASCII characters as Unicode so 
that users do not have to escape non-ASCII characters.
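
For comparison, java.net.URI (Java 1.4 and later) already behaves roughly
this way: it accepts non-ASCII characters unescaped and can produce the
UTF-8 escaped form on request.  This sketch is only an illustration of that
behaviour, not a claim about how jCIFS works.  The class name, the host
"server" and the filename are made up:

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper, not part of jCIFS.
final class NonAsciiUriSketch {
    // Builds smb://server<path>; non-ASCII characters are kept as-is.
    static URI build(String path) {
        try {
            return new URI("smb", "server", path, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // A Cyrillic filename, written with \u escapes so the source is ASCII.
        URI u = build("/share/\u0444\u0430\u0439\u043b.txt");
        System.out.println(u.toString());      // Unicode form, easy to transcribe
        System.out.println(u.toASCIIString()); // UTF-8 octets as %XX escapes
        // Going the other way, getPath() decodes the escapes again.
        System.out.println(URI.create(u.toASCIIString()).getPath());
    }
}
```

The user types the readable form, and the escaped form is still available
for anything that insists on plain ASCII.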

> There are no codepage issues. Everything in Java is Unicode unless specified
> otherwise using the file.encoding property. jCIFS will convert all Strings to the encoding
> specified by the jcifs.encoding property.

Sorry, there *are* codepage issues.  If you connect to a Samba 2.2.x
server it won't support Unicode.  The file and directory names are,
therefore, presented in whatever 8-bit DOS OEM codepage the Samba server
is configured to use.  The client has no idea what codepage the server is
using, and vice versa.  There's nothing in the protocol that lets you
know.  So, the Unicode used by jCIFS may not be compatible with the DOS
Codepage where non-ASCII characters are concerned.  Therefore, the
names--if they contain non-ASCII characters--may not match.  The
likelihood of problems increases the further you get from ASCII.
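
A quick illustration of the mismatch, assuming the JVM ships the IBM437 and
IBM850 charsets (they are extended charsets and are not guaranteed to be
present, hence the check).  The class name and the sample bytes are made
up.  The same four bytes off the wire turn into two different names
depending on which codepage the client guesses:

```java
import java.nio.charset.Charset;

// Hypothetical helper, not part of jCIFS.
final class CodepageMismatchSketch {
    // Decodes raw name bytes using the named codepage.
    static String decode(byte[] wire, String codepage) {
        return new String(wire, Charset.forName(codepage));
    }

    public static void main(String[] args) {
        // "caf" followed by one non-ASCII byte, as a server might send it.
        byte[] wire = { 'c', 'a', 'f', (byte) 0x9B };
        if (Charset.isSupported("IBM437") && Charset.isSupported("IBM850")) {
            String as437 = decode(wire, "IBM437");
            String as850 = decode(wire, "IBM850");
            // The ASCII part agrees; the last character does not.
            System.out.println(as437.equals(as850));
        } else {
            System.out.println("DOS codepages not available in this JVM");
        }
    }
}
```

Nothing in the bytes says which decoding is the right one, which is the
whole problem.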

This problem exists with all pre-NT LM 0.12 dialects, but jCIFS doesn't
support those so it's not an issue.  I don't know what other NT LM 0.12
servers may have this problem.  There may be several NT LM 0.12 servers
that don't do Unicode.

There is nothing that can be done to fix it, however, unless the
jcifs.encoding property can be told to use the (proprietary) DOS OEM
codepages.  I think it would be ugly to add a ?CODEPAGE= option to the SMB 
URL, and doing so would require that all implementations come with a 
complete set of DOS codepages.

I think we should just stick to Unicode and punt on the codepages.

Chris -)-----

-- 
Samba Team -- http://www.samba.org/     -)-----   Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/   -)-----   ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/     -)-----   crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/    -)-----   crh at ubiqx.org