FileSpec and normalization questions

We currently have DWARF that has a DW_AT_comp_dir that is set to “./” followed by any number of extra ‘/’ characters. I would like to have this path normalized as we parse the DWARF so we don’t end up with line tables with a variety of “.//+” prefixes on each source file.

While looking to solve this issue, I took a look at the functionality that is in FileSpec right now. In:

void FileSpec::SetFile(llvm::StringRef pathname, bool resolve, PathSyntax syntax);

This function always calls a cheaper normalize function:

namespace {
void Normalize(llvm::SmallVectorImpl &path, FileSpec::PathSyntax syntax);
}

This function does nothing for posix paths, but switches backward slashes to forward slashes.

We have a FileSpec function named FileSpec::GetNormalizedPath() which will do much more path normalization on a path by removing redundant “.” and “…” and “//”.

I can fix my DWARF issue in a few ways:
1 - fix FileSpec::SetFile() to still call ::Normalize() but have it do more work and have it normalize out the and redundant or relative path info
2 - call FileSpec::GetNormalizedPath() on each comp dir before using it to actually normalize it

The main question is: do we want paths floating around LLDB that aren’t normalized? It seems like having a mixture of theses path will lead to issues in LLDB so I would vote for solution #1.

Also, looking at the tests for normalizing paths I found the following pairs of pre-normalized and post-normalization paths for posix:

{“//”, “//”},
{“//net”, “//net”},

Why wouldn’t we reduce “//” to just “/” for posix? And why wouldn’t we reduce “//net” to “/net”?

Also I found:

{“./foo”, “foo”},

Do we prefer to not have “./foo” to stay as “./foo”? Seems like if users know that their debug info has line table entries that are “./foo/foo.c” that they might type this in:

(lldb) b ./foo/foo.c:12

But this will fail since it might not match the “foo/foo.c:12” that might result from path normalization. We don’t normalize right now so it doesn’t matter and things would match, but part of my fix is normalizing a path in the DWARF that is currently “.////////foo/foo.c” down to either “./foo/foo.c” or “foo/foo.c” so it will matter depending on what we decide here.

Any input would be appreciated.

Greg Clayton

IIRC We have path normalization functions in llvm, have you looked at them?

Thanks,

The last time I looked at the llvm functions they only support the path syntax of the llvm host, which won't do for lldb. But maybe they have gotten more general recently?

Jim

Also, looking at the tests for normalizing paths I found the following pairs of pre-normalized and post-normalization paths for posix:

{“//”, “//”},
{“//net”, “//net”},

Why wouldn’t we reduce “//” to just “/” for posix? And why wouldn’t we reduce “//net” to “/net”?

I don’t know what the author of this test had in mind, but from the POSIX spec:

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_11

Yes in fact I was the one who updated them to make them more general. You can now specify an enumeration parameter which will cause the algorithm to treat paths as either windows, posix, or whatever the host is.

If I were guessing, I would have guessed that!

Since you probably know how the llvm functions work better than most, were there technical reasons why, having done the work on the llvm side, you didn't adopt them in lldb? Or was it just a matter of time?

Jim

I can't see any reason why we benefit from having these differently spelled equivalent paths floating around. You want to be careful to preserve the path syntax, since it would be weird to be cross debugging to a Windows machine and have to type Posix paths. But other than that, removing all the extraneous cruft seems goodness to me.

Were you proposing just that you get the DWARF parser to do this, or were you proposing that that become a requirement for SymbolFile parsers, so that we can drop all the code that normalizes paths from debug information before comparisons? We still have to normalize paths from users, but it would be nice not to have to do it if we know the source is debug info.

Jim

We actually use it in some places, but it’s limited. Before I did that was when I added the PathSyntax to FileSpec which essentially servers the same purpose. We could in theory drop PathSyntax now that LLVM supports all of the same functionality. It’s a pretty invasive refactor though which I never had time to do.

I think I might have tried to replace some of the low level functions in FileSpec with the LLVM equivalents and gotten a few test failures, but I didn’t have time to investigate. It would be a worthwhile experiment for someone to try again if they have some cycles.

I can't see any reason why we benefit from having these differently spelled equivalent paths floating around. You want to be careful to preserve the path syntax, since it would be weird to be cross debugging to a Windows machine and have to type Posix paths. But other than that, removing all the extraneous cruft seems goodness to me.

We translate all windows slashes all the time in any FileSpec, so it doesn't really matter what the user types.

Were you proposing just that you get the DWARF parser to do this, or were you proposing that that become a requirement for SymbolFile parsers, so that we can drop all the code that normalizes paths from debug information before comparisons? We still have to normalize paths from users, but it would be nice not to have to do it if we know the source is debug info.

I was proposing to fix FileSpec::SetFile(), when false is passed for the "resolve" parameter, to always normalize all paths. Or I could just fix this inside of SymbolFileDWARF. I would rather do this at the "FileSpec" level since it will normalize our comparisons and just remove a ton of issue we can run into. So I vote for fixing FileSpec to "do the right thing".

Greg

Also, looking at the tests for normalizing paths I found the following

pairs of pre-normalized and post-normalization paths for posix:

       {"//", "//"},
       {"//net", "//net"},

Why wouldn't we reduce "//" to just "/" for posix? And why wouldn't we

reduce "//net" to "/net"?

I don't know what the author of this test had in mind, but from the POSIX

spec:

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_11

> A pathname that begins with two successive slashes may be interpreted

in an implementation-defined manner, although more than two leading slashes
shall be treated as a single slash.

Yes, that's exactly what the author of this test (me) had in mind. :slight_smile:
And it's not just a hypothetical posix thing either. Windows and cygwin
both use \\ and // to mean funny things. I remember also seeing something
like that on linux, though I can't remember now what was it being used for.

This is also the same way as llvm path functions handle these prefixes, so
I think we should keep them. I don't know whether we do this already, but
we can obviously fold 3 or more consecutive slashes into one during
normalization. Same goes for two slashes which are not at the beginning of
the path.

    {"./foo", "foo"},

Do we prefer to not have "./foo" to stay as "./foo"?

This is an interesting question. It basically comes down to our definition
of "identical" FileSpecs. Do we consider "foo" and "./foo" to identical? If
we do, then we should do the above normalization (theoretically we could
choose a different normal form, and convert "foo" to "./foo", but I think
that would be even weirder), otherwise we should skip it.

On one hand, these are obviously identical -- if you just take the string
and pass it to the filesystem, you will always get back the same file. But,
on the other hand, we have this notion that a FileSpec with an empty
directory component represents a wildcard that matches any file with that
name in any directory. For these purposes "./foo" and "foo" are two very
different things.

So, I can see the case for both, and I don't really have a clear
preference. All I would say is, whichever way we choose, we should make it
very explicit so that the users of FileSpec know what to expect.

I think I might have tried to replace some of the low level functions in

FileSpec with the LLVM equivalents and gotten a few test failures, but I
didn't have time to investigate. It would be a worthwhile experiment for
someone to try again if they have some cycles.

I can try to take a look at it. The way I remember it, I just copied these
functions from llvm and replaced all #ifdefs with runtime checks, which is
pretty much what you later did in llvm proper. Unless there has been some
significant divergence since then, it shouldn't be hard to reconcile these.

Also, looking at the tests for normalizing paths I found the following

pairs of pre-normalized and post-normalization paths for posix:

      {"//", "//"},
      {"//net", "//net"},

Why wouldn't we reduce "//" to just "/" for posix? And why wouldn't we

reduce "//net" to "/net"?

I don't know what the author of this test had in mind, but from the POSIX

spec:

General Concepts

A pathname that begins with two successive slashes may be interpreted

in an implementation-defined manner, although more than two leading slashes
shall be treated as a single slash.

Yes, that's exactly what the author of this test (me) had in mind. :slight_smile:
And it's not just a hypothetical posix thing either. Windows and cygwin
both use \\ and // to mean funny things. I remember also seeing something
like that on linux, though I can't remember now what was it being used for.

ok, we need to keep any paths starting with // or \\

This is also the same way as llvm path functions handle these prefixes, so
I think we should keep them. I don't know whether we do this already, but
we can obviously fold 3 or more consecutive slashes into one during
normalization. Same goes for two slashes which are not at the beginning of
the path.

   {"./foo", "foo"},

Do we prefer to not have "./foo" to stay as "./foo"?

This is an interesting question. It basically comes down to our definition
of "identical" FileSpecs. Do we consider "foo" and "./foo" to identical? If
we do, then we should do the above normalization (theoretically we could
choose a different normal form, and convert "foo" to "./foo", but I think
that would be even weirder), otherwise we should skip it.

On one hand, these are obviously identical -- if you just take the string
and pass it to the filesystem, you will always get back the same file. But,
on the other hand, we have this notion that a FileSpec with an empty
directory component represents a wildcard that matches any file with that
name in any directory. For these purposes "./foo" and "foo" are two very
different things.

So, I can see the case for both, and I don't really have a clear
preference. All I would say is, whichever way we choose, we should make it
very explicit so that the users of FileSpec know what to expect.

I would say that without a directory it is a wildcard match on base name alone, and with one, the partial directories must match if the path is relative, and the full directory must match if absolute. I will submit a patch that keeps leading "./" and "../" during normalization and we will see what people think.

I think I might have tried to replace some of the low level functions in

FileSpec with the LLVM equivalents and gotten a few test failures, but I
didn't have time to investigate. It would be a worthwhile experiment for
someone to try again if they have some cycles.

I took a look at the llvm file stuff and it has llvm::sys::fs::real_path which always resolves symlinks _and_ normalizes the path. Would be nice to break it out into two parts by adding llvm::sys::fs::normalize_path and have llvm::sys::fs::real_path call it.

I can try to take a look at it. The way I remember it, I just copied these
functions from llvm and replaced all #ifdefs with runtime checks, which is
pretty much what you later did in llvm proper. Unless there has been some
significant divergence since then, it shouldn't be hard to reconcile these.

Ok, I will submit a patch and we will see how things go.

Greg

>
>
> So, I can see the case for both, and I don't really have a clear
> preference. All I would say is, whichever way we choose, we should make

it

> very explicit so that the users of FileSpec know what to expect.

I would say that without a directory it is a wildcard match on base name

alone, and with one, the partial directories must match if the path is
relative, and the full directory must match if absolute. I will submit a
patch that keeps leading "./" and "../" during normalization and we will
see what people think.

Ok, what about multiple leading "./" components? Would it make sense to
collapse those to a single one ("././././foo.cpp" -> "./foo.cpp")?

>
>> I think I might have tried to replace some of the low level functions

in

> FileSpec with the LLVM equivalents and gotten a few test failures, but I
> didn't have time to investigate. It would be a worthwhile experiment

for

> someone to try again if they have some cycles.

I took a look at the llvm file stuff and it has llvm::sys::fs::real_path

which always resolves symlinks _and_ normalizes the path. Would be nice to
break it out into two parts by adding llvm::sys::fs::normalize_path and
have llvm::sys::fs::real_path call it.

> I can try to take a look at it. The way I remember it, I just copied

these

> functions from llvm and replaced all #ifdefs with runtime checks, which

is

> pretty much what you later did in llvm proper. Unless there has been

some

> significant divergence since then, it shouldn't be hard to reconcile

these.

So, I tried playing around with unifying the two implementations today. I
didn't touch the normalization code, I just wanted to try to replace path
parsing functions with the llvm ones.

In theory, it should be as simple as replacing our parsing code in
FileSpec::SetFile with calls to llvm::sys::path::filename and
...::parent_path (to set m_filename and m_directory).

It turned out this was not as simple, and the reason is that the llvm path
api's aren't completely self-consistent either. For example, for "//", the
filename+parent_path functions will decompose it into "." and "", while the
path iterator api will just report a single component ("//").

After a couple of hours of fiddling with +/- ones, I think I have came up
with a working and consistent implementation, but I haven't managed to
finish polishing it today. I'll try to upload the llvm part of the patch on
Monday. (I'll refrain from touching the lldb code for now, to avoid
interfering with your patch).

So, I can see the case for both, and I don't really have a clear
preference. All I would say is, whichever way we choose, we should make

it

very explicit so that the users of FileSpec know what to expect.

I would say that without a directory it is a wildcard match on base name

alone, and with one, the partial directories must match if the path is
relative, and the full directory must match if absolute. I will submit a
patch that keeps leading "./" and "../" during normalization and we will
see what people think.

I have that as part of my current patch, so don't worry about that.

Ok, what about multiple leading "./" components? Would it make sense to
collapse those to a single one ("././././foo.cpp" -> "./foo.cpp")?

yes!

I think I might have tried to replace some of the low level functions

in

FileSpec with the LLVM equivalents and gotten a few test failures, but I
didn't have time to investigate. It would be a worthwhile experiment

for

someone to try again if they have some cycles.

I took a look at the llvm file stuff and it has llvm::sys::fs::real_path

which always resolves symlinks _and_ normalizes the path. Would be nice to
break it out into two parts by adding llvm::sys::fs::normalize_path and
have llvm::sys::fs::real_path call it.

I can try to take a look at it. The way I remember it, I just copied

these

functions from llvm and replaced all #ifdefs with runtime checks, which

is

pretty much what you later did in llvm proper. Unless there has been

some

significant divergence since then, it shouldn't be hard to reconcile

these.

So, I tried playing around with unifying the two implementations today. I
didn't touch the normalization code, I just wanted to try to replace path
parsing functions with the llvm ones.

In theory, it should be as simple as replacing our parsing code in
FileSpec::SetFile with calls to llvm::sys::path::filename and
...::parent_path (to set m_filename and m_directory).

It turned out this was not as simple, and the reason is that the llvm path
api's aren't completely self-consistent either. For example, for "//", the
filename+parent_path functions will decompose it into "." and "", while the
path iterator api will just report a single component ("//").

After a couple of hours of fiddling with +/- ones, I think I have came up
with a working and consistent implementation, but I haven't managed to
finish polishing it today. I'll try to upload the llvm part of the patch on
Monday. (I'll refrain from touching the lldb code for now, to avoid
interfering with your patch).

Sounds good. Feel free to replace my changes with LLVM stuff as long as our test suite passes as I am adding a bunch of tests to cover all the cases.

Greg

The llvm patches have been posted as: <Login, <
Login.

>
>>>
>>>
>>> So, I can see the case for both, and I don't really have a clear
>>> preference. All I would say is, whichever way we choose, we should

make

> it
>>> very explicit so that the users of FileSpec know what to expect.
>
>> I would say that without a directory it is a wildcard match on base

name

> alone, and with one, the partial directories must match if the path is
> relative, and the full directory must match if absolute. I will submit a
> patch that keeps leading "./" and "../" during normalization and we will
> see what people think.

I have that as part of my current patch, so don't worry about that.
>
> Ok, what about multiple leading "./" components? Would it make sense to
> collapse those to a single one ("././././foo.cpp" -> "./foo.cpp")?

yes!

>
>
>>>
>>>> I think I might have tried to replace some of the low level functions
> in
>>> FileSpec with the LLVM equivalents and gotten a few test failures,

but I

>>> didn't have time to investigate. It would be a worthwhile experiment
> for
>>> someone to try again if they have some cycles.
>
>> I took a look at the llvm file stuff and it has

llvm::sys::fs::real_path

> which always resolves symlinks _and_ normalizes the path. Would be nice

to

> break it out into two parts by adding llvm::sys::fs::normalize_path and
> have llvm::sys::fs::real_path call it.
>
>>> I can try to take a look at it. The way I remember it, I just copied
> these
>>> functions from llvm and replaced all #ifdefs with runtime checks,

which

> is
>>> pretty much what you later did in llvm proper. Unless there has been
> some
>>> significant divergence since then, it shouldn't be hard to reconcile
> these.
>
> So, I tried playing around with unifying the two implementations today.

I

> didn't touch the normalization code, I just wanted to try to replace

path

> parsing functions with the llvm ones.
>
> In theory, it should be as simple as replacing our parsing code in
> FileSpec::SetFile with calls to llvm::sys::path::filename and
> ...::parent_path (to set m_filename and m_directory).
>
> It turned out this was not as simple, and the reason is that the llvm

path

> api's aren't completely self-consistent either. For example, for "//",

the

> filename+parent_path functions will decompose it into "." and "", while

the

> path iterator api will just report a single component ("//").
>
> After a couple of hours of fiddling with +/- ones, I think I have came

up

> with a working and consistent implementation, but I haven't managed to
> finish polishing it today. I'll try to upload the llvm part of the

patch on

> Monday. (I'll refrain from touching the lldb code for now, to avoid
> interfering with your patch).

Sounds good. Feel free to replace my changes with LLVM stuff as long as

our test suite passes as I am adding a bunch of tests to cover all the
cases.