FileSpec and normalization questions

clayborg · April 19, 2018, 6:14pm

We currently have DWARF that has a DW_AT_comp_dir that is set to “./” followed by any number of extra ‘/’ characters. I would like to have this path normalized as we parse the DWARF so we don’t end up with line tables with a variety of “.//+” prefixes on each source file.

While looking to solve this issue, I took a look at the functionality that is in FileSpec right now. In:

void FileSpec::SetFile(llvm::StringRef pathname, bool resolve, PathSyntax syntax);

This function always calls a cheaper normalize function:

namespace {
void Normalize(llvm::SmallVectorImpl<char> &path, FileSpec::PathSyntax syntax);
}

This function does nothing for posix paths, but switches backward slashes to forward slashes for Windows paths.
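As a rough illustration of that cheap step, here is a minimal sketch. The `PathSyntax` enum and `NormalizeSlashes` below are hypothetical stand-ins written for this example, not the actual FileSpec code, and the sketch works on a `std::string` rather than an `llvm::SmallVectorImpl<char>`.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical stand-in for FileSpec::PathSyntax (illustration only).
enum class PathSyntax { Posix, Windows };

// Hypothetical helper: leave posix paths untouched, and rewrite every
// backslash as a forward slash for Windows-syntax paths.
void NormalizeSlashes(std::string &path, PathSyntax syntax) {
  if (syntax == PathSyntax::Posix)
    return;
  std::replace(path.begin(), path.end(), '\\', '/');
}
```

Note that nothing here touches ".", ".." or repeated separators, which is why a comp dir such as "./////" passes through unchanged.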

We have a FileSpec function named FileSpec::GetNormalizedPath() which will do much more path normalization on a path by removing redundant “.”, “..”, and “//”.
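To make "much more path normalization" concrete, the following is a minimal sketch of purely lexical normalization for posix-style paths. `NormalizeLexically` is a hypothetical name used for illustration and is not the body of FileSpec::GetNormalizedPath(). It assumes the simplest policy: every "." is dropped, every run of slashes (including a leading "//") is folded, and ".." cancels a preceding named component.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (illustration only): split on '/', drop "." and empty
// components, resolve ".." against the previous named component, and rejoin.
std::string NormalizeLexically(const std::string &path) {
  const bool absolute = !path.empty() && path[0] == '/';
  std::vector<std::string> parts;
  std::size_t i = 0;
  while (i < path.size()) {
    while (i < path.size() && path[i] == '/')
      ++i; // skip a run of separators
    const std::size_t start = i;
    while (i < path.size() && path[i] != '/')
      ++i; // scan one component
    const std::string comp = path.substr(start, i - start);
    if (comp.empty() || comp == ".")
      continue;
    if (comp == "..") {
      if (!parts.empty() && parts.back() != "..") {
        parts.pop_back(); // "foo/.." cancels out
        continue;
      }
      if (absolute)
        continue; // "/.." is the same directory as "/"
    }
    parts.push_back(comp);
  }
  std::string out = absolute ? "/" : "";
  for (std::size_t k = 0; k < parts.size(); ++k) {
    if (k != 0)
      out += '/';
    out += parts[k];
  }
  return out.empty() ? "." : out;
}
```

Under this policy "./foo" becomes "foo" and "//net" becomes "/net", which are exactly the two behaviors questioned later in the thread.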

I can fix my DWARF issue in a few ways:
1 - fix FileSpec::SetFile() to still call ::Normalize() but have it do more work and normalize out any redundant or relative path info
2 - call FileSpec::GetNormalizedPath() on each comp dir before using it to actually normalize it

The main question is: do we want paths floating around LLDB that aren’t normalized? It seems like having a mixture of these paths will lead to issues in LLDB, so I would vote for solution #1.

Also, looking at the tests for normalizing paths I found the following pairs of pre-normalized and post-normalization paths for posix:

{“//”, “//”},
{“//net”, “//net”},

Why wouldn’t we reduce “//” to just “/” for posix? And why wouldn’t we reduce “//net” to “/net”?

Also I found:

{“./foo”, “foo”},

Do we prefer that “./foo” not stay as “./foo”? It seems that if users know that their debug info has line table entries like “./foo/foo.c”, they might type this in:

(lldb) b ./foo/foo.c:12

But this will fail since it might not match the “foo/foo.c:12” that might result from path normalization. We don’t normalize right now so it doesn’t matter and things would match, but part of my fix is normalizing a path in the DWARF that is currently “.////////foo/foo.c” down to either “./foo/foo.c” or “foo/foo.c” so it will matter depending on what we decide here.

Any input would be appreciated.

Greg Clayton

Davide_Italiano2 · April 19, 2018, 6:16pm

IIRC We have path normalization functions in llvm, have you looked at them?

Thanks,

jingham · April 19, 2018, 6:19pm

The last time I looked at the llvm functions they only support the path syntax of the llvm host, which won't do for lldb. But maybe they have gotten more general recently?

Jim

Zachary_Turner1 · April 19, 2018, 6:20pm

> Also, looking at the tests for normalizing paths I found the following pairs of pre-normalized and post-normalization paths for posix:
>
> {“//”, “//”},
> {“//net”, “//net”},
>
> Why wouldn’t we reduce “//” to just “/” for posix? And why wouldn’t we reduce “//net” to “/net”?

I don’t know what the author of this test had in mind, but from the POSIX spec:

http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_11

> A pathname that begins with two successive slashes may be interpreted in an implementation-defined manner, although more than two leading slashes shall be treated as a single slash.

Zachary_Turner1 · April 19, 2018, 6:21pm

Yes in fact I was the one who updated them to make them more general. You can now specify an enumeration parameter which will cause the algorithm to treat paths as either windows, posix, or whatever the host is.

jingham · April 19, 2018, 6:28pm

If I were guessing, I would have guessed that!

Since you probably know how the llvm functions work better than most, were there technical reasons why, having done the work on the llvm side, you didn't adopt them in lldb? Or was it just a matter of time?

Jim

jingham · April 19, 2018, 6:33pm

I can't see any reason why we benefit from having these differently spelled equivalent paths floating around. You want to be careful to preserve the path syntax, since it would be weird to be cross debugging to a Windows machine and have to type Posix paths. But other than that, removing all the extraneous cruft seems goodness to me.

Were you proposing just that you get the DWARF parser to do this, or were you proposing that that become a requirement for SymbolFile parsers, so that we can drop all the code that normalizes paths from debug information before comparisons? We still have to normalize paths from users, but it would be nice not to have to do it if we know the source is debug info.

Jim

Zachary_Turner1 · April 19, 2018, 6:37pm

We actually use it in some places, but it’s limited. Before I did that, I had added PathSyntax to FileSpec, which essentially serves the same purpose. We could in theory drop PathSyntax now that LLVM supports all of the same functionality. It’s a pretty invasive refactor, though, which I never had time to do.

I think I might have tried to replace some of the low level functions in FileSpec with the LLVM equivalents and gotten a few test failures, but I didn’t have time to investigate. It would be a worthwhile experiment for someone to try again if they have some cycles.

clayborg · April 19, 2018, 7:44pm

> I can't see any reason why we benefit from having these differently spelled equivalent paths floating around. You want to be careful to preserve the path syntax, since it would be weird to be cross debugging to a Windows machine and have to type Posix paths. But other than that, removing all the extraneous cruft seems goodness to me.

We translate all Windows slashes all the time in any FileSpec, so it doesn't really matter what the user types.

> Were you proposing just that you get the DWARF parser to do this, or were you proposing that that become a requirement for SymbolFile parsers, so that we can drop all the code that normalizes paths from debug information before comparisons? We still have to normalize paths from users, but it would be nice not to have to do it if we know the source is debug info.

I was proposing to fix FileSpec::SetFile(), when false is passed for the "resolve" parameter, to always normalize all paths. Or I could just fix this inside of SymbolFileDWARF. I would rather do this at the "FileSpec" level since it will normalize our comparisons and remove a ton of issues we can run into. So I vote for fixing FileSpec to "do the right thing".

Greg

Pavel_Labath · April 20, 2018, 8:08am

>> Also, looking at the tests for normalizing paths I found the following
>> pairs of pre-normalized and post-normalization paths for posix:
>>
>> {"//", "//"},
>> {"//net", "//net"},
>>
>> Why wouldn't we reduce "//" to just "/" for posix? And why wouldn't we
>> reduce "//net" to "/net"?

> I don't know what the author of this test had in mind, but from the POSIX
> spec:
>
> http://pubs.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap04.html#tag_04_11
>
> A pathname that begins with two successive slashes may be interpreted
> in an implementation-defined manner, although more than two leading slashes
> shall be treated as a single slash.

Yes, that's exactly what the author of this test (me) had in mind.
And it's not just a hypothetical posix thing either. Windows and cygwin
both use \\ and // to mean funny things. I remember also seeing something
like that on linux, though I can't remember now what it was being used for.

This is also the same way as llvm path functions handle these prefixes, so
I think we should keep them. I don't know whether we do this already, but
we can obviously fold 3 or more consecutive slashes into one during
normalization. Same goes for two slashes which are not at the beginning of
the path.
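The folding rule described above can be sketched as follows. `FoldSlashes` is a hypothetical helper written for this illustration, not an existing LLDB or LLVM function, and it only handles '/' separators.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (illustration only): collapse runs of '/' in a
// posix-style path. A leading run of exactly two slashes is preserved,
// since POSIX leaves its meaning implementation-defined; every other run,
// including three or more leading slashes, becomes a single '/'.
std::string FoldSlashes(const std::string &path) {
  // Measure the leading run of slashes.
  std::size_t lead = 0;
  while (lead < path.size() && path[lead] == '/')
    ++lead;

  std::string out;
  if (lead == 2)
    out = "//";
  else if (lead > 0)
    out = "/";

  // Copy the rest, emitting at most one '/' per run.
  for (std::size_t i = lead; i < path.size(); ++i) {
    if (path[i] == '/' && !out.empty() && out.back() == '/')
      continue;
    out.push_back(path[i]);
  }
  return out;
}
```

With this rule "//net" stays "//net", "///net" becomes "/net", and ".////////foo/foo.c" becomes "./foo/foo.c".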

>> {"./foo", "foo"},
>>
>> Do we prefer to not have "./foo" to stay as "./foo"?

This is an interesting question. It basically comes down to our definition
of "identical" FileSpecs. Do we consider "foo" and "./foo" to be identical? If
we do, then we should do the above normalization (theoretically we could
choose a different normal form, and convert "foo" to "./foo", but I think
that would be even weirder), otherwise we should skip it.

On one hand, these are obviously identical -- if you just take the string
and pass it to the filesystem, you will always get back the same file. But,
on the other hand, we have this notion that a FileSpec with an empty
directory component represents a wildcard that matches any file with that
name in any directory. For these purposes "./foo" and "foo" are two very
different things.

So, I can see the case for both, and I don't really have a clear
preference. All I would say is, whichever way we choose, we should make it
very explicit so that the users of FileSpec know what to expect.

> I think I might have tried to replace some of the low level functions in
> FileSpec with the LLVM equivalents and gotten a few test failures, but I
> didn't have time to investigate. It would be a worthwhile experiment for
> someone to try again if they have some cycles.

I can try to take a look at it. The way I remember it, I just copied these
functions from llvm and replaced all #ifdefs with runtime checks, which is
pretty much what you later did in llvm proper. Unless there has been some
significant divergence since then, it shouldn't be hard to reconcile these.

clayborg · April 20, 2018, 4:14pm

>>> Also, looking at the tests for normalizing paths I found the following
>>> pairs of pre-normalized and post-normalization paths for posix:
>>>
>>> {"//", "//"},
>>> {"//net", "//net"},
>>>
>>> Why wouldn't we reduce "//" to just "/" for posix? And why wouldn't we
>>> reduce "//net" to "/net"?

>> I don't know what the author of this test had in mind, but from the POSIX
>> spec:
>>
>> General Concepts
>>
>> A pathname that begins with two successive slashes may be interpreted
>> in an implementation-defined manner, although more than two leading slashes
>> shall be treated as a single slash.

> Yes, that's exactly what the author of this test (me) had in mind.
> And it's not just a hypothetical posix thing either. Windows and cygwin
> both use \\ and // to mean funny things. I remember also seeing something
> like that on linux, though I can't remember now what it was being used for.

ok, we need to keep any paths starting with // or \\

> This is also the same way as llvm path functions handle these prefixes, so
> I think we should keep them. I don't know whether we do this already, but
> we can obviously fold 3 or more consecutive slashes into one during
> normalization. Same goes for two slashes which are not at the beginning of
> the path.

>> {"./foo", "foo"},
>>
>> Do we prefer to not have "./foo" to stay as "./foo"?

> This is an interesting question. It basically comes down to our definition
> of "identical" FileSpecs. Do we consider "foo" and "./foo" to be identical? If
> we do, then we should do the above normalization (theoretically we could
> choose a different normal form, and convert "foo" to "./foo", but I think
> that would be even weirder), otherwise we should skip it.
>
> On one hand, these are obviously identical -- if you just take the string
> and pass it to the filesystem, you will always get back the same file. But,
> on the other hand, we have this notion that a FileSpec with an empty
> directory component represents a wildcard that matches any file with that
> name in any directory. For these purposes "./foo" and "foo" are two very
> different things.
>
> So, I can see the case for both, and I don't really have a clear
> preference. All I would say is, whichever way we choose, we should make it
> very explicit so that the users of FileSpec know what to expect.

I would say that without a directory it is a wildcard match on base name alone, and with one, the partial directories must match if the path is relative, and the full directory must match if absolute. I will submit a patch that keeps leading "./" and "../" during normalization and we will see what people think.
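One possible reading of that matching rule, as a minimal sketch: `Spec` and `Matches` are hypothetical names for this illustration, not FileSpec or its comparison functions. The sketch assumes that "partial directories must match" means the relative directory has to be a trailing suffix of the candidate's directory, ending on a component boundary, and it does not give a leading "./" any special treatment.

```cpp
#include <cassert>
#include <string>

// Hypothetical stand-in for a FileSpec split into directory and filename.
struct Spec {
  std::string directory; // empty means "no directory given"
  std::string filename;
};

// Hypothetical matcher (illustration only).
bool Matches(const Spec &pattern, const Spec &file) {
  if (pattern.filename != file.filename)
    return false;
  const std::string &p = pattern.directory;
  const std::string &d = file.directory;
  if (p.empty())
    return true; // no directory: wildcard match on base name alone
  if (p[0] == '/')
    return p == d; // absolute: the full directory must match
  // Relative: p must be a suffix of d that starts at a component boundary.
  if (p.size() > d.size())
    return false;
  if (d.compare(d.size() - p.size(), p.size(), p) != 0)
    return false;
  return p.size() == d.size() || d[d.size() - p.size() - 1] == '/';
}
```

Under these assumptions "foo.c" matches "/src/foo/foo.c", "foo/foo.c" matches it as well, but "oo/foo.c" and "/usr/src/foo/foo.c" do not.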

>> I think I might have tried to replace some of the low level functions in
>> FileSpec with the LLVM equivalents and gotten a few test failures, but I
>> didn't have time to investigate. It would be a worthwhile experiment for
>> someone to try again if they have some cycles.

I took a look at the llvm file stuff and it has llvm::sys::fs::real_path which always resolves symlinks _and_ normalizes the path. Would be nice to break it out into two parts by adding llvm::sys::fs::normalize_path and have llvm::sys::fs::real_path call it.

> I can try to take a look at it. The way I remember it, I just copied these
> functions from llvm and replaced all #ifdefs with runtime checks, which is
> pretty much what you later did in llvm proper. Unless there has been some
> significant divergence since then, it shouldn't be hard to reconcile these.

Ok, I will submit a patch and we will see how things go.

Greg

Pavel_Labath · April 20, 2018, 7:17pm

>> So, I can see the case for both, and I don't really have a clear
>> preference. All I would say is, whichever way we choose, we should make it
>> very explicit so that the users of FileSpec know what to expect.

> I would say that without a directory it is a wildcard match on base name
> alone, and with one, the partial directories must match if the path is
> relative, and the full directory must match if absolute. I will submit a
> patch that keeps leading "./" and "../" during normalization and we will
> see what people think.

Ok, what about multiple leading "./" components? Would it make sense to
collapse those to a single one ("././././foo.cpp" -> "./foo.cpp")?
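A minimal sketch of that collapsing step, assuming posix-style separators: `CollapseLeadingDotSlash` is a hypothetical helper for illustration, not code from either patch. It also swallows any extra slashes that follow each leading "./".

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (illustration only): reduce any number of leading
// "./" components (and the redundant slashes after them) to a single "./".
// Paths that do not start with "./" are returned unchanged.
std::string CollapseLeadingDotSlash(const std::string &path) {
  std::size_t pos = 0;
  bool saw_dot_slash = false;
  while (path.compare(pos, 2, "./") == 0) {
    saw_dot_slash = true;
    pos += 2;
    while (pos < path.size() && path[pos] == '/')
      ++pos; // skip extra slashes after "./"
  }
  return saw_dot_slash ? "./" + path.substr(pos) : path;
}
```

A leading "../" is left alone because the comparison only matches a dot immediately followed by a slash.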

>>> I think I might have tried to replace some of the low level functions in
>>> FileSpec with the LLVM equivalents and gotten a few test failures, but I
>>> didn't have time to investigate. It would be a worthwhile experiment for
>>> someone to try again if they have some cycles.

> I took a look at the llvm file stuff and it has llvm::sys::fs::real_path
> which always resolves symlinks _and_ normalizes the path. Would be nice to
> break it out into two parts by adding llvm::sys::fs::normalize_path and
> have llvm::sys::fs::real_path call it.

>> I can try to take a look at it. The way I remember it, I just copied these
>> functions from llvm and replaced all #ifdefs with runtime checks, which is
>> pretty much what you later did in llvm proper. Unless there has been some
>> significant divergence since then, it shouldn't be hard to reconcile these.

So, I tried playing around with unifying the two implementations today. I
didn't touch the normalization code, I just wanted to try to replace path
parsing functions with the llvm ones.

In theory, it should be as simple as replacing our parsing code in
FileSpec::SetFile with calls to llvm::sys::path::filename and
...::parent_path (to set m_filename and m_directory).

It turned out this was not as simple, and the reason is that the llvm path
APIs aren't completely self-consistent either. For example, for "//", the
filename+parent_path functions will decompose it into "." and "", while the
path iterator api will just report a single component ("//").

After a couple of hours of fiddling with +/- ones, I think I have come up
with a working and consistent implementation, but I haven't managed to
finish polishing it today. I'll try to upload the llvm part of the patch on
Monday. (I'll refrain from touching the lldb code for now, to avoid
interfering with your patch).

clayborg · April 20, 2018, 9:12pm

>>> So, I can see the case for both, and I don't really have a clear
>>> preference. All I would say is, whichever way we choose, we should make it
>>> very explicit so that the users of FileSpec know what to expect.

>> I would say that without a directory it is a wildcard match on base name
>> alone, and with one, the partial directories must match if the path is
>> relative, and the full directory must match if absolute. I will submit a
>> patch that keeps leading "./" and "../" during normalization and we will
>> see what people think.

I have that as part of my current patch, so don't worry about that.

> Ok, what about multiple leading "./" components? Would it make sense to
> collapse those to a single one ("././././foo.cpp" -> "./foo.cpp")?

yes!

>>>> I think I might have tried to replace some of the low level functions in
>>>> FileSpec with the LLVM equivalents and gotten a few test failures, but I
>>>> didn't have time to investigate. It would be a worthwhile experiment for
>>>> someone to try again if they have some cycles.

>> I took a look at the llvm file stuff and it has llvm::sys::fs::real_path
>> which always resolves symlinks _and_ normalizes the path. Would be nice to
>> break it out into two parts by adding llvm::sys::fs::normalize_path and
>> have llvm::sys::fs::real_path call it.

>>> I can try to take a look at it. The way I remember it, I just copied these
>>> functions from llvm and replaced all #ifdefs with runtime checks, which is
>>> pretty much what you later did in llvm proper. Unless there has been some
>>> significant divergence since then, it shouldn't be hard to reconcile these.

> So, I tried playing around with unifying the two implementations today. I
> didn't touch the normalization code, I just wanted to try to replace path
> parsing functions with the llvm ones.
>
> In theory, it should be as simple as replacing our parsing code in
> FileSpec::SetFile with calls to llvm::sys::path::filename and
> ...::parent_path (to set m_filename and m_directory).
>
> It turned out this was not as simple, and the reason is that the llvm path
> APIs aren't completely self-consistent either. For example, for "//", the
> filename+parent_path functions will decompose it into "." and "", while the
> path iterator api will just report a single component ("//").
>
> After a couple of hours of fiddling with +/- ones, I think I have come up
> with a working and consistent implementation, but I haven't managed to
> finish polishing it today. I'll try to upload the llvm part of the patch on
> Monday. (I'll refrain from touching the lldb code for now, to avoid
> interfering with your patch).

Sounds good. Feel free to replace my changes with LLVM stuff as long as our test suite passes as I am adding a bunch of tests to cover all the cases.

Greg

Pavel_Labath · April 23, 2018, 9:53am

The llvm patches have been posted.

