[jcifs] Creating file with hash ('#') in filename
Christopher R. Hertel
crh at ubiqx.mn.org
Thu Jan 16 15:35:39 EST 2003
On Wed, Jan 15, 2003 at 10:07:27PM -0500, Allen, Michael B (RSCH) wrote:
:
> > No, I mean that the parsing of URLs is standard, based on the RFC. The #
> > is defined as part of the syntax of generic URLs. It's just that the
> > semantics have no meaning for the SMB URL.
> >
> Ok. I thought it was specific to HTTP.
...it explains why the java.net.URL class parses on the # sign. It will
probably do that on all URLs that match the particular syntax (there are
variations within the URL syntax itself).
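
To illustrate the generic split on '#', here is a small sketch. It uses java.net.URI rather than java.net.URL (an assumption made so that it runs without an "smb" protocol handler installed); the class and method names are made up for the example and are not jCIFS code.

```java
import java.net.URI;

class FragmentSplitDemo {
    // Returns { decoded path, decoded fragment } as split by the generic parser.
    static String[] pathAndFragment(String url) {
        URI uri = URI.create(url); // throws IllegalArgumentException on bad syntax
        return new String[] { uri.getPath(), uri.getFragment() };
    }

    public static void main(String[] args) {
        // A raw '#' ends the path and starts the fragment (the "ref").
        String[] raw = pathAndFragment("smb://server/share/notes#1.txt");
        System.out.println(raw[0] + " | " + raw[1]);

        // An escaped '#' (%23) stays inside the path component.
        String[] esc = pathAndFragment("smb://server/share/notes%231.txt");
        System.out.println(esc[0] + " | " + esc[1]);
    }
}
```

In the first case the path stops at "/share/notes" and "1.txt" becomes the fragment; in the second the '#' survives as part of the file name.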
> True. By removing the '#' as the ref delimiter we will be breaking the
> generic URL syntax (albeit not by much and of no consequence to anyone).
...and that is where a developer gets to make a decision balancing
user-friendly vs. correct syntax. Personally, I would favor correct
syntax. Let the jCIFS code throw a polite exception, and any application
that uses jCIFS can do the work of "cleaning up" the initial URL.
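
A rough sketch of that division of labor, again using java.net.URI as a stand-in for whatever parser is actually in play. Both helpers (parseStrict and cleanUp) are hypothetical names for this example, not jCIFS methods.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;

class StrictSmbUrl {
    // Library side: reject bad syntax with a clear, polite message.
    static URI parseStrict(String url) throws MalformedURLException {
        try {
            return new URI(url); // the single-argument form does no quoting
        } catch (URISyntaxException e) {
            throw new MalformedURLException(
                "Invalid SMB URL (" + e.getReason() + " at index " + e.getIndex()
                + "); illegal characters must be %-escaped: " + url);
        }
    }

    // Application side: build a syntactically valid URL from raw parts.
    // The multi-argument URI constructor quotes illegal path characters.
    static String cleanUp(String host, String rawPath) {
        try {
            return new URI("smb", host, rawPath, null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        String cleaned = cleanUp("server", "/share/my file#1.txt");
        System.out.println(cleaned);
        System.out.println(parseStrict(cleaned).getPath());
    }
}
```

The strict parser refuses "smb://server/share/my file.txt" outright, while the cleaned-up form passes and decodes back to the original path.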
> > > The main problem is that
> > > SMB path names need to represent just about any character including
> > > Unicode which we haven't even touched on. I personally do not want to
> > > decode paths. That is very costly.
> >
> > We only need to unescape them.
> >
> When decoded. But what happens when we pass a string back out? You have
> to encode it. Now if you have spaces and @ and # it all gets escaped.
Right. That's what I think should be done.
> > Two methods: urlEscape() and urlUnEscape().
> >
> We cannot do that. We've been here before. The SMB URLs are not used like
> normal URLs. People will need to specify them manually.
I don't think it's a good idea to predict how URL strings will be used.
I'm rather amazed at what people already do with them.
> They will be used where UNC paths are used. We cannot mandate that
> spaces, '@', '#' and unicode characters be escaped.
On the one hand, you are correct. It's a pain. On the other hand, that's
the nature of URLs. The current RFC doesn't talk about Unicode, though.
I imagine that characters outside of the US ASCII set would not need to be
escaped in this situation. Not if there's a proper Unicode
representation.
You *do* get into a problem when using extended ASCII, however. The
different DOS codepages use different octet values for extended
characters. They don't all map to Latin1 either. So here's the problem:
Someone enters a filename which includes the character 'Ö' (that's
O-umlaut). In the Latin1 character set (Unicode), the octet value is
0xD6, but in DOS Code Page 437 it's 0x99.
So the question is: how do you read something like this:
smb://server/share/path/Övertone.spew
jCIFS would have to know the character set in use at the terminal (maybe
you can do that...if so, it's not a problem) in order to figure out
that the octet value 0x99 maps to Unicode character 0x00D6 (which goes over
the wire as the bytes D6 00, since Microsoft uses UCS2LE encoding which is
two bytes, little-endian--so the bytes are reversed).
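
The octet values can be checked with a few lines of Java. This is only an illustration; the class and method names are made up, and the "IBM437" charset is an optional part of the JDK, so the sketch tests for it before using it.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class OUmlautBytes {
    // Encodes s in the given charset and returns the bytes as upper-case hex.
    static String hexOf(String s, Charset cs) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(cs)) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String o = "\u00D6"; // O-umlaut, written as an escape to avoid source-encoding issues
        System.out.println("Latin1:   " + hexOf(o, StandardCharsets.ISO_8859_1));
        System.out.println("UTF-16LE: " + hexOf(o, StandardCharsets.UTF_16LE));
        if (Charset.isSupported("IBM437")) {
            System.out.println("CP437:    " + hexOf(o, Charset.forName("IBM437")));
        }
    }
}
```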
So, as I said, it's still a problem if the command shell is not using
Unicode as well. On the other hand, if you always read escapes as
Unicode, then
smb://server/share/path/%D6vertone.spew
is not ambiguous.
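
A minimal sketch of that reading, assuming each %XX escape is taken directly as a Unicode code point in the range 0x00-0xFF (which is the same as Latin1). The unescape() helper is hypothetical, not an existing jCIFS method, and it deliberately leaves malformed escapes alone.

```java
class Latin1Unescape {
    // Replaces each %XX with the character whose code point is 0xXX.
    // A '%' that is not followed by two hex digits is copied unchanged.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.append((char) ((hi << 4) | lo));
                    i += 2; // skip the two hex digits just consumed
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String name = unescape("%D6vertone.spew");
        System.out.println(name.charAt(0) == '\u00D6'); // true: first char is O-umlaut
    }
}
```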
> That would happen far too much and annoy people to no end. The only
> reason for escaping things in the first place is to make them portable:
>
> "The space character is excluded because significant spaces may disappear
> and insignificant spaces may be introduced when URI are transcribed or
> typeset or subjected to the treatment of wordprocessing programs."
That's the reason for escaping spaces. There are a lot of other
characters that get escaped. The reasons are:
1) They have syntactic meaning within the URL.
2) They have syntactic meaning when used to identify a URL in another
context.
3) They are non-printing.
4) They may get munged by an intermediary.
RFC 2396:
2. URI Characters and Escape Sequences
URI consist of a restricted set of characters, primarily chosen to
aid transcribability and usability both in computer systems and in
non-computer communications. Characters used conventionally as
delimiters around URI were excluded. The restricted set of
characters consists of digits, letters, and a few graphic symbols
were chosen from those common to most of the character encodings and
input facilities available to Internet users.
uric = reserved | unreserved | escaped
Within a URI, characters are either used as delimiters, or to
represent strings of data (octets) within the delimited portions.
Octets are either represented directly by a character (using the US-
ASCII character for that octet [ASCII]) or by an escape encoding.
This representation is elaborated below.
> These URLs are not going to be embedded in word processing programs.
Why not? I can imagine using Kword to edit a document stored on some
other system. Kword might use the SMB URL string to identify the document
in its "most recently accessed" list or somesuch. Likewise, when I print
the document I might include the filename in the footer. There it is.
In an office environment, where documents are shared by several people, an
SMB URL might be handed around, or even referenced in an internal memo.
Again, it's dangerous to guess how an SMB URL might be used.
> If they are embedded in a web page it will be the path component
> appended to an HTTP URL
If it's a relative URL, then it's an HTTP URL. If you append it to an
HTTP URL then it's just a relative portion and has nothing to do with SMB.
If it is an SMB URL it will be absolute. That's the only way to change
protocols. Eg. If you reference a document available via FTP in a web
page then you would use the absolute form: ftp://ftp.server.net/path/file
Likewise with an SMB URL.
> which the web application needs to escape (i.e. NetworkExplorer).
> No one is going to send an SMB URL to someone in an e-mail. That
> kind of stuff just doesn't fit the protocol. That's what HTTP and
> FTP are for.
I disagree with you here, mostly for reasons already stated. Further,
though, I got an email this very day (geez, I write too much) in which
someone where I worked was talking about the fact that his users just
couldn't get the hang of using FTP or HTTP to share files.
> Again, we've been here before. If you remember when we un-escaped URLs
> in 0.6 we suddenly had to escape them. Then I decided to hang onto
> the URL that was passed in as is and give back what was given in. But
> that didn't quite work either. The end result will be that
> escaping will creep in.
I don't think it should creep back in. It needs to be handled head-on.
> More importantly, all that character manipulation is very costly.
Then it's important to find ways to minimize it by figuring out the locus
at which it must occur.
As usual, I'm on the theory side of the house here, and the job is to
figure out how to make this all practical. Annoying, eh?
Thing one: The unescaping doesn't occur until the URL is split into its
component pieces. That's logical, since the escapes may be
protecting some character that would otherwise be a delimiter.
Thing two: Once the URL is decomposed, the pieces can be unescaped, but
the escaped version would be kept as well. If a change is
made to, say, the path then both versions are updated.
Thing three: Well, maybe not. If the escaped version is updated then
there is no need to update the unescaped version until it's
actually used. A flag would be needed...
Thing four: I know exactly how I would do this if I were not relying on
java.net.URL. I don't know what kind of monkey wrench that
throws into things.
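
Things two and three could look something like the sketch below: the escaped form is the one that gets updated, and the unescaped form is recomputed only when someone asks for it. Everything here (the LazyPath class, its fields, the decodeCount counter) is invented for illustration and ignores java.net.URL entirely.

```java
class LazyPath {
    private String escaped;        // always kept current
    private String unescaped;      // cached decoded form
    private boolean stale = true;  // the flag from "Thing three"
    int decodeCount = 0;           // only here to show how often decoding runs

    LazyPath(String escaped) {
        this.escaped = escaped;
    }

    // Updating the escaped form just marks the cache as out of date.
    void setEscaped(String escaped) {
        this.escaped = escaped;
        this.stale = true;
    }

    String getEscaped() {
        return escaped;
    }

    // The costly character work happens here, at most once per change.
    String getUnescaped() {
        if (stale) {
            unescaped = unescape(escaped);
            stale = false;
            decodeCount++;
        }
        return unescaped;
    }

    // Same assumption as elsewhere in this thread: %XX is code point 0xXX.
    private static String unescape(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                int hi = Character.digit(s.charAt(i + 1), 16);
                int lo = Character.digit(s.charAt(i + 2), 16);
                if (hi >= 0 && lo >= 0) {
                    out.append((char) ((hi << 4) | lo));
                    i += 2;
                    continue;
                }
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        LazyPath p = new LazyPath("/share/my%20file.txt");
        System.out.println(p.getUnescaped() + " (decodes: " + p.decodeCount + ")");
        p.setEscaped("/share/other%23file.txt");
        System.out.println(p.getUnescaped() + " (decodes: " + p.decodeCount + ")");
    }
}
```

The point of the flag is that code which only passes the escaped string around never pays for the decoding at all.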
> > > Can
> > > someone give me a reason why we *have* to require URL encoding of the
> > > path component? Otherwise I think we should punt the '#ref' and just
> > > integrate it into the path. Anything we would use it for can be
> > > done with a query_string parameter.
> >
> > The '#' character isn't the only problem. You could fudge that one.
> > There are other characters (eg., spaces) which are not legal URL
> > characters. Non-English language characters, for example.
> >
> > The key thing, though, is that a user may type in an SMB URL with a URL
> > escape sequence included.
> >
> This is a very unlikely scenario but in that application (a web browser
> maybe) the application will be responsible for un-escaping it.
When and why? The unescaping must be done after the parsing but before
the calls to jCIFS. The reverse--escaping strings returned by jCIFS
before handing them to java.net.URL--also fits in between. Is that
(honestly, I don't know...my head's been in my book) something that is
do-able with jCIFS as it works currently?
> > > Incidentally speaking of query_string parameters we got lucky with the '?'
> > > character. That *is* reserved in SMB pathnames. It's a wildcard character.
> > > Otherwise we really *would* have to require escaping path components.
> >
> > We still do. :)
> >
> Seriously, what will *break* if we do not mandate escaping the path
> component?
Any valid Windows filename or directoryname character that is not also a
valid URL character.
You are using java.net.URL which, if I understand correctly, does the
parsing for you. I imagine that other tools exist out there that also do
generic URL parsing. These tools may simply ignore "illegal" URL
characters, or they may not. If java.net.URL simply passes such
characters along then jCIFS is okay...except that some of the URLs that
work with jCIFS won't work with, say, KDE, MacOS X, or other tools that
support the SMB URL.
>
> > > Anyway it looks like just tacking the '#ref' back onto the path component in
> > > Handler.java is going to do the trick.
> >
> > ...for that *one* case, and it is still a user convenience at the expense
> > of correct syntax.
> >
> This is the debate right here. The SMB URL cannot be used on the
> Internet because its character range is too great. It inherently
> stomps on reserved characters. So are you weighing the "user
> convenience" side enough?
Nope. There is no reason that a file specified by the SMB URL could not
also be offered to the Internet via a web server. The same path, same
file, same problem.
The current RFC specifies the US ASCII set of characters, so Unicode
simply isn't supported by generic URI. We fall into that the same way FTP
and HTTP do. I will have to ask if/when that will be covered. A new
generic URI draft is in the works, by the way.
Anyway, if "the SMB URL cannot be used on the Internet" then there is no
point in pursuing the draft, since it is an "Internet Draft". The point
is that the SMB URL *can* be used on the Internet (no judgement as to
whether this is wise or not...except to say that it's no less secure
than FTP, which sends passwords in cleartext).
More likely, the SMB URL will be used on the "in*tra*net" within an office
or company or suchlike. That, however, is pure conjecture on my part so
who knows...
> > The HTTP URL is just an instance of a URL. A descendant type. The rules
> > apply to all URLs.
> >
> Well in this case I meant the path component really IS an HTTP URL because
> that is how the requested path is passed to the servlet like:
>
> http://miallen3.com:8080/servlets/NetworkExplorer/miallen1/C$/pub/
Ah. Okay.
> > Sorry. :(
> >
> No reason to be. I am just trying to determine with certainty the answer to the
> question: Will the SMB URL *break* if we do not escape the path component?
It's not really a question of whether *we* escape the path. On input, the
user "should" do so. A "user-friendly" application may try to clean the
URL itself and correct the user's mistakes (as well as it can). Doing so,
though, runs a bit of a risk by allowing "invalid" URL strings to
propagate.
When returning a URL to the user, I believe that it should have correct
syntax.
As far as the SMB URL breaking, my thoughts are these:
- Escapes should be handled regardless, simply because they *may* be used.
They may be cut-and-pasted from other sources, for example.
- Other general-purpose parsers may or may not handle URLs that are not
escaped properly. jCIFS can be compatible by aiming for (not
necessarily being) the least-common-denominator.
- Within the ASCII set, the list of characters that would need to be
escaped is limited to those that are valid filename characters but are
not valid URL 'pchar' characters:
    pchar      = unreserved | escaped |
                 ":" | "@" | "&" | "=" | "+" | "$" | ","
    unreserved = alphanum | mark
    mark       = "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
That leaves us with only a small number of disallowed ASCII characters.
The '#' and the space are probably the most conspicuous.
- I have no idea how Unicode and extended ASCII should be handled
off-hand.
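
For the ASCII part of that list, an escaping routine driven by the 'pchar' rule might be sketched as follows. The class and method names are hypothetical. Since the Unicode and extended ASCII question is still open, the sketch simply passes characters above 0x7F through untouched, which is an assumption and not a recommendation.

```java
class PcharEscaper {
    private static final String MARK = "-_.!~*'()";
    private static final String PCHAR_EXTRA = ":@&=+$,";
    private static final String HEX = "0123456789ABCDEF";

    // True if c may appear unescaped in a path segment per the pchar rule.
    static boolean isPchar(char c) {
        return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z')
            || (c >= '0' && c <= '9')
            || MARK.indexOf(c) >= 0 || PCHAR_EXTRA.indexOf(c) >= 0;
    }

    // Escapes one path segment (a single file or directory name).
    static String escapeSegment(String segment) {
        StringBuilder out = new StringBuilder(segment.length());
        for (int i = 0; i < segment.length(); i++) {
            char c = segment.charAt(i);
            if (c > 0x7F || isPchar(c)) {
                out.append(c); // assumption: non-ASCII is left as-is
            } else {
                // c is ASCII here, so both hex digits come from one octet
                out.append('%').append(HEX.charAt(c >> 4)).append(HEX.charAt(c & 0xF));
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeSegment("my file#1.txt"));
    }
}
```

Note that '%' itself is not a pchar, so a literal percent sign in a file name comes out as %25 and cannot be mistaken for the start of an escape.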
> After we answer that question we can debate whether or not "correct syntax"
> outweighs "user convenience".
Good 'nough. :)
Chris -)-----
--
Samba Team -- http://www.samba.org/ -)----- Christopher R. Hertel
jCIFS Team -- http://jcifs.samba.org/ -)----- ubiqx development, uninq.
ubiqx Team -- http://www.ubiqx.org/ -)----- crh at ubiqx.mn.org
OnLineBook -- http://ubiqx.org/cifs/ -)----- crh at ubiqx.org