[GSOC 2018] Information gathering

Hello,

I'm Paul Semel, a French student in computer science. I am currently in my 4th year (1st year of graduate school) at EPITA and enrolled in the system and security laboratory of the school.

I would be very interested in working on an LLVM project during this GSoC. Implementing a PoC for an unsequenced modification checker in CSA helped me discover LLVM. However, I would like to dive deeper into this project.

I've seen some of the proposals, and I would like to ask a few questions about two of those.

As you might have guessed, I have some interest in the checker for dangling string pointers :

- Do you think it would help if I kept working on improving my unsequenced modification checker to get more familiar with Clang Static Analyzer ?

I'm also interested in the command line replacements for GNU Binutils :

- What tools would you like to replace in priority ?
- Does this subject involve adding options/features to some of the tools, or is it only about command line handling ?

Thank you very much,

Hey,

Hey, welcome!

I'm curious about the unsequenced modification checker: is it something that I should have seen but missed for whatever reason? It might be useful, and I think I see why compiler warnings don't cover all cases, i.e. why the analyzer's path sensitivity would help here. But I can't answer until I see it :slight_smile: (e.g. on our Phabricator).

We currently have two confirmed mentors for the Analyzer (me and George), so we'd most likely be able to mentor one student each, for two projects, and it'd likely be the two projects we proposed - unless someone proposes something really interesting. Two fairly motivated students have already shown up here on the mailing lists, but this shouldn't stop you from posting your own proposal here on cfe-dev (most of the analyzer contributors aren't actively scanning llvm-dev, as far as I know).

I don't know much about the binutils replacement project; someone else should reply on that one.

A couple of words about the use-after-free-like checker for values managed by temporary objects (mostly strings) that go out of scope. Because internals of std::string and other similar classes are too hard for the analyzer's generic use-after-free checker to understand (mostly due to how hard it is to track STL's internal invariants, and how not all of the code is necessarily present in the header), an API-specific checker seems to be necessary. The original plan we've had in mind was to keep track of dangerous values like str.c_str() in the program state (similarly to how SimpleStreamChecker tracks file descriptors) and then see if any of them are still present in memory at the end of the original value's lifetime (similarly to how StackAddrEscape checker finds stack pointers at the end of a function's stack frame).

The unknowns here include how easy it would be to track scopes (for now we only track function scopes, but if the fairly old but recently reincarnated patches [1] and [2] land any time soon, we may get much better granularity); how easy it would be to track objects when they are moved or lifetime-extended by binding to references, which was a large problem for other C++ object checkers, though we may work our way around it to some extent (or do it properly, depending on my current work outlined in [3] and in follow-up mails in February); and how helpful inlining would be (e.g. would we be able to automagically support string_view-like classes by inlining their methods?). So the checker would need an almost indefinite amount of incremental improvements once the initial prototype is done, some of which should be fairly interesting and would certainly expose you to some of the analyzer's internals.
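The plan described above (remember which raw pointers came from which owner, then look for survivors when the owner's lifetime ends) can be pictured with a small toy model. This is only a sketch under stated assumptions: it is plain C++, it does not use any Clang Static Analyzer API, and every name in it (InnerPointerTracker, recordInnerPointer, recordStored, endLifetime) is hypothetical.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Toy model only: none of these names exist in the Clang Static Analyzer.
// Objects and pointers are identified by plain strings for readability.
class InnerPointerTracker {
public:
  // Models evaluating something like owner.c_str(): remember that
  // `pointer` is only valid while `owner` is alive.
  void recordInnerPointer(const std::string &owner,
                          const std::string &pointer) {
    innerPointers[owner].insert(pointer);
  }

  // Models the pointer value being kept somewhere in memory.
  void recordStored(const std::string &pointer) { stored.insert(pointer); }

  // Models the end of the owner's lifetime: returns the inner pointers
  // that are still stored somewhere and therefore dangle from now on.
  std::vector<std::string> endLifetime(const std::string &owner) {
    std::vector<std::string> dangling;
    auto it = innerPointers.find(owner);
    if (it == innerPointers.end())
      return dangling;
    for (const std::string &p : it->second)
      if (stored.count(p))
        dangling.push_back(p);
    innerPointers.erase(it);
    return dangling;
  }

private:
  std::map<std::string, std::set<std::string>> innerPointers;
  std::set<std::string> stored;
};
```

A real checker would keep this information in the analyzer's program state rather than in a mutable object, since each explored path needs its own copy of the facts.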

Hi,

Thanks for replying !

Hey, welcome!

I'm curious about the unsequenced modification checker: is it something that I should have seen but missed for whatever reason? It might be useful, and I think I see why compiler warnings don't cover all cases, i.e. why the analyzer's path sensitivity would help here. But I can't answer until I see it :slight_smile: (e.g. on our Phabricator).

So.. I uploaded the checker on Phabricator !
Please keep in mind that it was a proof of concept for me, and I didn't have in mind to propose this patch at the time I was developing it (and I haven't had the time to improve it yet, as I am currently working on a structure pretty-printing builtin - https://reviews.llvm.org/D44093).

For the moment, this checker is not able to detect all the unsequenced modifications, but can detect things like this :

static int a = 0;

int foo(void)
{
   return a++;
}

int main(void)
{
   int res = a++ + foo();
   return res;
}
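For comparison, here is a minimal sketch of how the same computation could be written so that the order of the two modifications is explicit. The names counter, nextValue and sumInOrder are hypothetical and only mirror a, foo and main from the example above; this is not code from the patch.

```cpp
static int counter = 0;

// Same shape as foo() in the example: modifies the shared variable
// as a side effect and returns its previous value.
int nextValue() { return counter++; }

// Each statement is completely evaluated before the next one starts,
// so the two modifications of `counter` happen in a fixed order.
int sumInOrder() {
  int first = counter++;     // modification in the caller happens first
  int second = nextValue();  // modification inside the callee happens second
  return first + second;
}
```

Starting from counter == 0, sumInOrder() returns 1 and leaves counter at 2.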

So here is the link on Phabricator : https://reviews.llvm.org/D44154

We currently have two confirmed mentors for the Analyzer (me and George), so we'd most likely be able to mentor one student each, for two projects, and it'd likely be the two projects we proposed - unless someone proposes something really interesting. Two fairly motivated students have already shown up here on the mailing lists, but this shouldn't stop you from posting your own proposal here on cfe-dev (most of the analyzer contributors aren't actively scanning llvm-dev, as far as I know).

I don't know much about the binutils replacement project; someone else should reply on that one.

Sure, I would really like to have some more info on this one ! Maybe you know someone I could add in cc of this thread ? :slightly_smiling_face:

A couple of words about the use-after-free-like checker for values managed by temporary objects (mostly strings) that go out of scope. Because internals of std::string and other similar classes are too hard for the analyzer's generic use-after-free checker to understand (mostly due to how hard it is to track STL's internal invariants, and how not all of the code is necessarily present in the header), an API-specific checker seems to be necessary. The original plan we've had in mind was to keep track of dangerous values like str.c_str() in the program state (similarly to how SimpleStreamChecker tracks file descriptors) and then see if any of them are still present in memory at the end of the original value's lifetime (similarly to how StackAddrEscape checker finds stack pointers at the end of a function's stack frame).

OK, I think I understand the idea. So this checker will be an API that makes it possible to track those invariants (and we will use this API to track str.c_str()).
Am I right ?

The unknowns here include how easy it would be to track scopes (for now we only track function scopes, but if the fairly old but recently reincarnated patches [1] and [2] land any time soon, we may get much better granularity); how easy it would be to track objects when they are moved or lifetime-extended by binding to references, which was a large problem for other C++ object checkers, though we may work our way around it to some extent (or do it properly, depending on my current work outlined in [3] and in follow-up mails in February); and how helpful inlining would be (e.g. would we be able to automagically support string_view-like classes by inlining their methods?). So the checker would need an almost indefinite amount of incremental improvements once the initial prototype is done, some of which should be fairly interesting and would certainly expose you to some of the analyzer's internals.

Wow. This project sounds really cool, it's really too bad that there are already two students on this project.

Hey,

Adding cfe-dev..

Regards,

By the way, if you have some free time, I would really appreciate some advice on a better way to do the unsequenced modification checker. :slightly_smiling_face:

Thanks,

Hi,

Thanks for replying !

Hey, welcome!

I'm curious about the unsequenced modification checker: is it something that I should have seen but missed for whatever reason? It might be useful, and I think I see why compiler warnings don't cover all cases, i.e. why the analyzer's path sensitivity would help here. But I can't answer until I see it :slight_smile: (e.g. on our Phabricator).

So.. I uploaded the checker on Phabricator !

Yay! I'll comment with my thoughts on this, so that you could polish it when you have time.

Note that this shouldn't necessarily have anything to do with GSoC - we're accepting code in all seasons :slight_smile:

Please keep in mind that it was a proof of concept for me, and I didn't have in mind to propose this patch at the time I was developing it (and I haven't had the time to improve it yet, as I am currently working on a structure pretty-printing builtin - https://reviews.llvm.org/D44093).

For the moment, this checker is not able to detect all the unsequenced modifications, but can detect things like this :

static int a = 0;

int foo(void)
{
  return a++;
}

int main(void)
{
  int res = a++ + foo();
  return res;
}

This sounds like, for once, a bug that the analyzer might be really good at finding, and the check isn't going to be super loud, which makes me quite excited about this check.

So here is the link on Phabricator : https://reviews.llvm.org/D44154

We currently have two confirmed mentors for the Analyzer (me and George), so we'd most likely be able to mentor one student each, for two projects, and it'd likely be the two projects we proposed - unless someone proposes something really interesting. Two fairly motivated students have already shown up here on the mailing lists, but this shouldn't stop you from posting your own proposal here on cfe-dev (most of the analyzer contributors aren't actively scanning llvm-dev, as far as I know).

I don't know much about the binutils replacement project; someone else should reply on that one.

Sure, I would really like to have some more info on this one ! Maybe you know someone I could add in cc of this thread ? :slightly_smiling_face:

Sorry, I'm completely out of topic on that one. This project has two assigned mentors, as mentioned in http://llvm.org/OpenProjects.html#replace_binary_utilities - you might try to contact them directly in case they accidentally missed your mail.

A couple of words about the use-after-free-like checker for values managed by temporary objects (mostly strings) that go out of scope. Because internals of std::string and other similar classes are too hard for the analyzer's generic use-after-free checker to understand (mostly due to how hard it is to track STL's internal invariants, and how not all of the code is necessarily present in the header), an API-specific checker seems to be necessary. The original plan we've had in mind was to keep track of dangerous values like str.c_str() in the program state (similarly to how SimpleStreamChecker tracks file descriptors) and then see if any of them are still present in memory at the end of the original value's lifetime (similarly to how StackAddrEscape checker finds stack pointers at the end of a function's stack frame).

OK, I think I understand the idea. So this checker will be an API that makes it possible to track those invariants (and we will use this API to track str.c_str()).
Am I right ?

No-no, I mean that .c_str() is (part of) a certain API :slight_smile: ...and we want to see if it's used correctly. But in order to do that, we don't want to understand how it works in a particular implementation of, say, the C++ standard library. Instead, we know how it is supposed to work, and encode part of this knowledge about the API into the analyzer so that it can find misuses of it. E.g., we don't care what exact value is returned by .c_str() and how exactly it is allocated or deleted. The only thing we care about is that we shouldn't keep it around after the string is destroyed. In this sense, the checker is API-specific: it works by knowing about a particular API, not through generic knowledge of the language. Similarly, SimpleStreamChecker doesn't want to know what it takes to open a file: it only knows that a file that was opened must also be closed. For this checker it's more realistic to fully understand how the API works internally, but still hard. Just in case, I'm mentioning SimpleStreamChecker because it's essentially an example/hello-world checker described in a very detailed manner in https://youtu.be/kdxlsP5QVPw (totally recommended).
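To make the rule "don't keep the pointer around after the string is destroyed" concrete, here is a minimal sketch of the misuse next to a safe variant. The helper names makeGreeting, greetingLength and use are hypothetical; the problematic lines are left commented out on purpose so that the snippet itself has well-defined behavior.

```cpp
#include <cstddef>
#include <string>

std::string makeGreeting() { return "hello"; }

// Problematic pattern (commented out on purpose): the temporary returned by
// makeGreeting() is destroyed at the end of the full expression, so `p`
// would already dangle on the following line.
//   const char *p = makeGreeting().c_str();
//   use(p);

// Safe variant: the owning string outlives every use of the pointer.
std::size_t greetingLength() {
  std::string owner = makeGreeting();
  const char *p = owner.c_str();  // valid while `owner` is alive and unmodified
  return std::char_traits<char>::length(p);
}
```

Nothing in the safe variant depends on how a particular standard library stores the characters, which is the sense in which the rule is about the API rather than its implementation.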

Hi Eric,

As you are listed as the confirmed mentor for the "Command line replacements for GNU Binutils" GSOC 2018 subject, I am taking the liberty of adding you to this thread !

If you have a few minutes to answer my questions, that'd be really great :slightly_smiling_face:

Hi Paul,

I’m also interested in the command line replacements for GNU Binutils :

  • What tools would you like to replace in priority ?
  • Does this subject involve adding options/features to some of the
    tools, or is it only about command line handling ?

I just replied with this in another thread:

"It’s currently still available. The basic idea is that we’d be working on getting each of the llvm tools or libraries with a front end that is command line compatible with the GNU binutils counterpart to serve as a replacement. Whether or not we made them output compatible is something else, but we’ll probably want to have a couple different modes there from:

a) The compatible tool,
b) The tool we all want.

A and B could be the same, but then again, they might not. The low bar for the SoC project is going to be A."

And in priority order I’d probably want to finish off objcopy support (see the recent thread on llvm-dev) and objdump/readobj/readelf and then go from there.

Thoughts?

-eric
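As a rough illustration of what a command line compatible front end means in practice, the sketch below accepts a few option spellings that GNU objcopy documents (-S / --strip-all, -g / --strip-debug, -R name / --remove-section=name) and turns them into a plain configuration record. Everything else is an assumption made for the example: CopyConfig and parseGnuStyleArgs are hypothetical names and are not taken from llvm-objcopy.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical configuration record; not a type from llvm-objcopy.
struct CopyConfig {
  bool stripAll = false;
  bool stripDebug = false;
  std::vector<std::string> removeSections;
  std::vector<std::string> positional;  // input file and optional output file
  std::vector<std::string> unknown;     // options this sketch does not handle
};

// Accepts a handful of GNU objcopy option spellings and records the rest.
CopyConfig parseGnuStyleArgs(const std::vector<std::string> &args) {
  CopyConfig cfg;
  const std::string removePrefix = "--remove-section=";
  for (std::size_t i = 0; i < args.size(); ++i) {
    const std::string &a = args[i];
    if (a == "-S" || a == "--strip-all") {
      cfg.stripAll = true;
    } else if (a == "-g" || a == "--strip-debug") {
      cfg.stripDebug = true;
    } else if (a == "-R" && i + 1 < args.size()) {
      cfg.removeSections.push_back(args[++i]);  // short form takes the next arg
    } else if (a.compare(0, removePrefix.size(), removePrefix) == 0) {
      cfg.removeSections.push_back(a.substr(removePrefix.size()));
    } else if (!a.empty() && a[0] == '-') {
      cfg.unknown.push_back(a);
    } else {
      cfg.positional.push_back(a);
    }
  }
  return cfg;
}
```

Keeping the parsed options separate from the code that rewrites the object file is one way to let a "compatible" mode and a differently shaped native mode share the same back end.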

Hi Eric,

I saw the thread you are talking about. So basically, the idea would be to make the correct calls to either the COFF subset of functions or the ELF ones, depending on whether we have a COFF or ELF file as input. Am I right ?

I am really interested in writing a proposal for this subject. What do you expect to be in it ? I was thinking of something like presenting the things I’ve done in LLVM/Clang and the schedule for the 3 months (but for this, I need to talk with you about the high priority tools, as I’m not sure it is possible to do all the frontend tools in such an amount of time)…

Anyway, I am really happy that you answered my email. Hope to hear from you soon !

– Paul

Hi Eric,

I saw the thread you are talking about. So basically, the idea would be to make the correct calls to either the COFF subset of functions or the ELF ones, depending on whether we have a COFF or ELF file as input.
Am I right ?

Basically what I’m looking for first is a command line equivalent replacement first for gnu objcopy. I’d focus on ELF first, and then move to COFF/PE. I’d start from the work that Jake (cc’d) has already done and work with Zach (cc’d) on the COFF stuff if he’s still interested. Of course, I’ll be around for the first bit.

Then follow up with objcopy, etc as there’s time.

I am really interested in writing a proposal for this subject. What do you expect to be in it ? I was thinking of something like presenting the things I’ve done in LLVM/Clang and the schedule for the 3 months (but for this, I need to talk with you about the high priority tools, as I’m not sure it is possible to do all the frontend tools in such an amount of time)…

Showing off your previous work is absolutely great in a proposal. A timeline and some proof that you’ve at least looked at what’s missing and have ideas at how to do the work would be key. And I don’t really expect you to finish all of them - at least not without help, but with some luck there might be other contributors to help :slight_smile:

Sound good? We can definitely work on the details as you’re interested - I’ll also be more responsive in the near future as well.

Thanks!

-eric

Hi,

    I saw the thread you are talking about. So basically, the idea would
    be to make the correct calls to either the COFF subset of functions or
    the ELF ones, depending on whether we have a COFF or ELF file as input.
    Am I right ?

Basically what I'm looking for first is a command line equivalent replacement first for gnu objcopy. I'd focus on ELF first, and then move to COFF/PE. I'd start from the work that Jake (cc'd) has already done and work with Zach (cc'd) on the COFF stuff if he's still interested. Of course, I'll be around for the first bit.

Then follow up with objcopy, etc as there's time.

I think you meant objdump, right ? (you talked about objcopy in your previous paragraph).

    I am really interested in writing a proposal for this subject. What do
    you expect to be in it ? I was thinking of something like
    presenting the things I've done in LLVM/Clang and the schedule for the 3
    months (but for this, I need to talk with you about the high
    priority tools, as I'm not sure it is possible to do all the
    frontend tools in such an amount of time)..

Showing off your previous work is absolutely great in a proposal. A timeline and some proof that you've at least looked at what's missing and have ideas at how to do the work would be key. And I don't really expect you to finish all of them - at least not without help, but with some luck there might be other contributors to help :slight_smile:

Alright, that sounds very good ! For the moment, what I've done is list the tools that need command line replacements (for some of those, the work is really benign).

Do I need to take LLD into account in my timeline ?

Then, I investigated the command lines of the different tools a bit, and what I have learnt so far is that objdump and objcopy are the ones that require the biggest amount of work (again, I have not taken LLD into account so far).

Sound good? We can definitely work on the details as you're interested - I'll also be more responsive in the near future as well.

I have shared my draft in the GSOC 2018 Dashboard, but here is a link so that you have it right in the email[0]. I would really like to have feedback on it, especially on the timeline I made (but I'd really appreciate feedback on the rest of the draft too :slightly_smiling_face:).

I am actually not sure at all about the time it would take for the replacement of llvm-objcopy, so maybe Jake and/or Zach would have an idea about it, as they already worked on this subject ! :slightly_smiling_face:

Thanks!

-eric

Thanks,

Hi,

FWIW I’m happy enough with the proposal, and while the timeline isn’t necessarily the best - it’s not like we have particularly amazing thoughts here either.

-eric