Regular Expression lib support

We would like to have access to some kind of regular expression
library inside LLVM. For example, we need this to extend the FileCheck
test case checking tool to support regular expressions.

There are three obvious options:
  1. Roll our own library. Multiple unnamed individuals may even
already have implementations lying around! :slight_smile:

  2. Use POSIX regcomp facilities. This implies importing some
implementation of this interface, e.g., Windows. On Linux, BSD, etc.
we would try to use the platform version if available (and non-buggy).

  3. Import a more heavy weight library such as PCRE, and use it universally.

My personal preference is #2, and I may work on this in the near-term,
but I wanted to check for conflicting opinions first.

My main reasons for preferring #2 is to use a standard interface and
language, and it should be relatively easy to implement and maintain
(I'm assuming there is some nice BSD implementation out their we can
snarf, I haven't actually validated this assumption).

The main downside of #2 vs #1 or #3 is that if we wanted fancier
regexp features we would be somewhat stuck, but I don't think this is
a significant concern.

Ok?

- Daniel

We would like to have access to some kind of regular expression
library inside LLVM. For example, we need this to extend the FileCheck
test case checking tool to support regular expressions.

There are three obvious options:
1. Roll our own library. Multiple unnamed individuals may even
already have implementations lying around! :slight_smile:
2. Use POSIX regcomp facilities. This implies importing some
implementation of this interface, e.g., Windows. On Linux, BSD, etc.
we would try to use the platform version if available (and non-buggy).

3. Import a more heavy weight library such as PCRE, and use it universally.

Personally, I'm a big fan of the Boost libraries. They've got a regex
library, and a full-blown parser library (which I am using in my
front-end). It's definitely heavier than POSIX, but it's portable,
well-tested, and loaded with features.

GAH! Why on bloody earth does the LLVM mailing list server not set
the response address to the LLVM mailing list server so I actually
respond to the server instead of the person above me, that is so
different then the other 20+ mailing lists I am subscribed to and I
could have sworn the LLVM mailing list was not broken like that
before... My message is below:

We would like to have access to some kind of regular expression
library inside LLVM. For example, we need this to extend the FileCheck
test case checking tool to support regular expressions.

There are three obvious options:
1. Roll our own library. Multiple unnamed individuals may even
already have implementations lying around! :slight_smile:
2. Use POSIX regcomp facilities. This implies importing some
implementation of this interface, e.g., Windows. On Linux, BSD, etc.
we would try to use the platform version if available (and non-buggy).

3. Import a more heavy weight library such as PCRE, and use it universally.

Personally, I'm a big fan of the Boost libraries. They've got a regex
library, and a full-blown parser library (which I am using in my
front-end). It's definitely heavier than POSIX, but it's portable,
well-tested, and loaded with features.

Boost.Xpressive supports both dynamic and static regex's, what that
means is that you can use a regex dynamically (as a string), or you
can create it statically (by building up the AST in a *very*
easy-to-use way). Honestly, I prefer Boost.Spirit though, which is a
PEG parser, it is pure static and it compiles faster then just about
anything else yet found in existence (we have been testing it on
everything, even the string->int simple parser is faster then atoi,
and it just blows yax/etc... away, no comparison). PEG's have a
syntax very much like regex, however they are recursable, and fully
greedy with unlimited lookahead. If you want to include
Boost.Spirit2.1 (and I can easily rip it out so you would just need to
include it and the parts of it that it requires, you would not need to
include boost, and the license not only allows doing this it
encourages it), then I can help you integrate it. I have been using
Spirit2.1 for a long time and have helped in its creation so I know it
quite well now and would be happy to help include it in any/all of the
needs you would need it here in, and yes, I can just about guarantee
that it will be faster then anything else you could possibly use, and
it is very easy to learn, especially if you already know regex
(although you will have to note that everything is greedy, this is a
'good thing' in PEG grammars, it contributes to their very high speed,
and remember that PEG's can be recursive). Spirit2.1 also compiles to
*very* tight code (being smaller and faster then identical optimized
hand-coded parsers that a certain company wanted us to compare it
with), it may increase compile time, but you get huge runtime
increases, as well as Spirit2.1 and all it depends on are header-only
files, there are no other libs to include, just headers.

Er, by compiles faster I meant to say "executes faster", its compiling
time is a minute or two for the usual grammar, a few seconds for
something simple like string->int.

Also, Spirit also has a generater part, it can take a data structure
and save it as a text stream, this is also blazingly fast.

This is too heavy, and we don't need the extra features, and regexec
is well tested and much more standard. Unless there is an overwhelming
agreement to add another option, I'd like to keep the discussion to
the obvious choices. That is, I need to be convinced *not* to use #2,
before I get derailed into discussing which form of #3 to take.

- Daniel

Blast, LLVM list not filling in the response headers still...

We would like to have access to some kind of regular expression
library inside LLVM. For example, we need this to extend the FileCheck
test case checking tool to support regular expressions.

There are three obvious options:
1. Roll our own library. Multiple unnamed individuals may even
already have implementations lying around! :slight_smile:
2. Use POSIX regcomp facilities. This implies importing some
implementation of this interface, e.g., Windows. On Linux, BSD, etc.
we would try to use the platform version if available (and non-buggy).

3. Import a more heavy weight library such as PCRE, and use it universally.

Personally, I'm a big fan of the Boost libraries. They've got a regex
library, and a full-blown parser library (which I am using in my
front-end). It's definitely heavier than POSIX, but it's portable,
well-tested, and loaded with features.

This is too heavy, and we don't need the extra features, and regexec
is well tested and much more standard. Unless there is an overwhelming
agreement to add another option, I'd like to keep the discussion to
the obvious choices. That is, I need to be convinced *not* to use #2,
before I get derailed into discussing which form of #3 to take.

POSIX has a tendency to be rather useless on some major platforms,
notably Windows (which has no built-in regex library), which is why I
still recommend Spirit, it can be as lightweight as you want, you only
pay for what you use, and it works fast and everywhere.
Boost.xpressive is also quite good if you need the dynamic
functionality (you really should not, how often is the grammar/regex
going to be generated at runtime anyway?).

'regexec' I had never heard of, figured it was a library, turns out it
is a function call on *nix systems, yea, that is very much not usable
in any way shape or form, and is certainly not a standard if it does
not work on one of the major LLVM platforms (and it is still not a
standard in any pure form since it is not part of the C/C++ standard
headers). If that is option #2, then option #2 is very unusable.

And yes, if you must know, I program on Windows, which is why I am
pushing to use something that actually works everywhere instead of
just someone's favorite OS (I prefer BSD honestly, but Windows is what
the desktop world is sadly stuck on, so that is what I have to program
for).

Daniel Dunbar wrote:

This is too heavy, and we don't need the extra features, and regexec
is well tested and much more standard. Unless there is an overwhelming

actually, Boost is much more standard. IIRC the Boost library corresponds to
tr1::regex (or std::regex in C++0x), which has the incredible advantage of
being available on all standard conforming C++ compilers. Admittedly the C++
compiler has to be fairly new to support this, but you can use Boost as a
drop-in replacement until the compiler supports it on its own.

Thomas

I think you're seriously confused about the proposal. To put it bluntly, there is no way we'll use boosts regex support, sorry.

The proposal is to use the unix standard regexec library interface. The LLVM tree would include an imported BSD-licenced implementation from one of many sources. We'd then have configury logic detect when the host OS already supports the regexec interfaces, and if so, don't build our imported copy.

We'd have a simple layer on top of it to make the interface to the regex library less horrible than what regexec provides.

Again, forget boost regex. :slight_smile:

-Chris

We would like to have access to some kind of regular expression
library inside LLVM. For example, we need this to extend the FileCheck
test case checking tool to support regular expressions.

There are three obvious options:
1. Roll our own library. Multiple unnamed individuals may even
already have implementations lying around! :slight_smile:

2. Use POSIX regcomp facilities. This implies importing some
implementation of this interface, e.g., Windows. On Linux, BSD, etc.
we would try to use the platform version if available (and non-buggy).

Don't do it.
They are ridiculous slow, and posix made some really dumb choices in regexps.

3. Import a more heavy weight library such as PCRE, and use it universally.

PCRE is actually quite slow too.
Google released (as part of V8, I believe) an almost-always
linear-time regular expression library (written by Russ Cox and Rob
Pike).

It's quite small code wise (unlike boost), quite fast (and supports
perl style regexps). I'll see if i can get a pointer to it.

My personal preference is #2, and I may work on this in the near-term,
but I wanted to check for conflicting opinions first.

My main reasons for preferring #2 is to use a standard interface and
language, and it should be relatively easy to implement and maintain
(I'm assuming there is some nice BSD implementation out their we can
snarf, I haven't actually validated this assumption).

Warning: All the builtin platform ones are actually fairly slow, and
some are quite buggy, even between different versions of the same
platform (IE different glibc versions, etc) Google went through this
before originally settling on something that was at least consistently
buggy on all platforms (PCRE) and then moving to writing one that was
much better (what was released in V8, though i think that one may be
different than the exact one used internally). PCRE could easily blow
the stack in some cases, sadly.

FWIW: Most people are much more familiar with perl tyle regexps, than
they are familiar with posix style.

That said, if you really have the urge to have a BSD'd implementation
of regexec, at least choose something like TRE — The free and portable approximate regex matching library.
which is a linear time in size of string.

I've never performance timed it myself, but Russ Cox says it is "efficient" :wink:

The Google impl was, at least if i remember right, made up of 3
engines: 1 that was a special case matcher that handled plain old
anchored strings super fast, 1 that dealt with regular expressions
that did not require backtracking, and 1 that dealt with regular
expressions that did.

If you are curious, see Implementing Regular Expressions for how bad the
other libraries can get on fairly simple regexps because they don't
use linear time matchers.

This is too heavy, and we don't need the extra features, and regexec
is well tested and much more standard. Unless there is an overwhelming

'regexec' I had never heard of, figured it was a library, turns out it
is a function call on *nix systems, yea, that is very much not usable
in any way shape or form, and is certainly not a standard if it does
not work on one of the major LLVM platforms (and it is still not a
standard in any pure form since it is not part of the C/C++ standard
headers). If that is option #2, then option #2 is very unusable.

And yes, if you must know, I program on Windows, which is why I am
pushing to use something that actually works everywhere instead of
just someone's favorite OS (I prefer BSD honestly, but Windows is what
the desktop world is sadly stuck on, so that is what I have to program
for).

I think you're seriously confused about the proposal. To put it bluntly,
there is no way we'll use boosts regex support, sorry.

The proposal is to use the unix standard regexec library interface. The
LLVM tree would include an imported BSD-licenced implementation from one of
many sources. We'd then have configury logic detect when the host OS
already supports the regexec interfaces, and if so, don't build our imported
copy.

We'd have a simple layer on top of it to make the interface to the regex
library less horrible than what regexec provides.

Again, forget boost regex. :slight_smile:

What about std::regex? It is even now still a lot more standard then
regexec (which does not exist on my platform). Of which Boost just
happens to have a fully standards conforming implementation (as does
dinkumware too I think...).

In comparison (since you mentioned horrible interfaces), how about
something simple like returning a float followed by a pipe followed by
an integer list separated by commas and parsing it into a
std::pair<float,vector<int> > myPair; Also, I might be wrong on some
of the regex syntax, it has been a *long* time since I ever touched
regex inside a program (ever since I started using boost::spirit, only
use it for grep and its kin anymore), but it should get the idea
across. I use this regex for each of the things below:
  std::string regexStr("(-?\\d+(?:\\.\\d*)?)\\|(-?\\d+)*(?:\\,(-?\\d+))");
// I hope this is correct...
  std::string testStr("3.14|1,2,3,42,128");

The "(-?\\d+(?:\\.\\d*)?)\\|(-?\\d+)*(?:\\,(-?\\d+))" does not check
that the checked integer fits inside an int, does not account for
overflow (I could set a max repeat as well, but that still means it
will be too short to fit, or could overflow) for note...

regexec/or_whatever_in_regex.h_that_is_*nix_only (*nix standard, not
anywhere else): // dynamic regex, so it is slow in comparison to
other alternatives
  I have no clue how use this one, to be honest, I do not know the
syntax, if anyone else could show me how for this example? I doubt it
would be any shorter or faster to write (and especially to execute)
then the spirit example for note.

std/tr1/boost::regex (the C++ standard): // dynamic regex, so it is
slow in comparison to other alternatives
bool parse_test(std::string &testStr, myPair &ret)
{
  match_results<IteratorType> m;
  regex e(regexStr);
  bool successful = regex_match(testStr.begin(),testStr.end(),m,e,match_extra);
  if(successful)
  {
    float f;
    vector<int> &i_list = myPair.second;
    f = atof(m[1].c_str());
    myPair.first=f;
    for(int i = 2; i < what.captures(2).size(); ++i)
    {
      i_list.push_back(atoi(m.captures(2)[i]));
    }
  }
  return successful;
}

boost::xpressive(dynamic): // this uses the same back-end as
std/tr1/boost::regex, except you can freely mix-n-match dynamic and
static xpressive regex's, so refer to std/tr1/boost::regex

boost::xpressive(static): // This will be faster then all of the
above methods as the regex parse tree is created at compile-time
instead of run-time
bool parse_test(std::string &testStr, myPair &ret)
{
  sregex e = (s1=(!as_xpr('-')>>+_d>>!('.'>>*_d)))[ref(ret.first)=atof_adapt(s1)]
    >> '|'>> (s2=*_d)[push_back(ref(ret.second),atoi_adapt(s2))] >>
*(',' >> (s2=*_d)[push_back(ref(ret.second),atoi_adapt(s2))])
  bool successful = regex_match(testStr.begin(),testStr.end(),e,match_extra);
  return successful;
}

spirit::qi: // This will be the fastest, quite literally no other
parser has got close to the execution speed that this one can parse at
for anything like this, and it is easy to write!
bool parse_test(std::string &testStr, myPair &ret)
{
  return parse(testStr.begin(),testStr.end(),
    float_ >> '|' >> int_%','
    ret);
}

And as stated, the spirit version will execute faster, is very easy to
write, is very readable, and has adapters so it can stuff into
everything from any stl container (or anything that supports insert or
push_back for containers) or any generic struct/class (thanks to
fusion), and it is simple to create new things to do just about
anything you want.
Even something as simple as
parse(testStr.begin(),testStr.end(),int_,myInt); it executes faster
then atoi. And of course, you cannot beat the license.

Daniel just responded, so let me add based on his post. I know regex
was slow, but if it is as slow as he is saying, that is even more
incentive not to use it, not just based on how unusable it is on some
platforms. If you want speed, I dare you to show me anything that
beats Spirit2.1, even ignoring its ease of use and the fact it works
everywhere.

No, we have to build with c++'98 compilers. I think you're missing the point here. We care about code size in llvm, and the best code size you can get is to link to the one already in your address space because it's part of libc.

-Chris

2. Use POSIX regcomp facilities. This implies importing some
implementation of this interface, e.g., Windows. On Linux, BSD, etc.
we would try to use the platform version if available (and non-buggy).

Don't do it.
They are ridiculous slow, and posix made some really dumb choices in regexps.

We want to use this from FileCheck, which we build at -O0 today. Also, each regex will be matched once. Most testcases use fixed strings (in fact 100% of them do today!). This really is not very performance sensitive.

Regex engines like this are inherently more powerful but slower than fixed-purpose matching logic. I don't see a reason not to use a (slow!) simple regexec version.

I would also prefer not to have all the crazy features. Just supporting simple matching stuff is perfectly acceptable. We don't need unicode character classes, negative assertions, etc. We don't need the full power of perl regex's.

My personal preference is #2, and I may work on this in the near-term,
but I wanted to check for conflicting opinions first.

My main reasons for preferring #2 is to use a standard interface and
language, and it should be relatively easy to implement and maintain
(I'm assuming there is some nice BSD implementation out their we can
snarf, I haven't actually validated this assumption).

Warning: All the builtin platform ones are actually fairly slow, and
some are quite buggy, even between different versions of the same
platform (IE different glibc versions, etc)

I am more concerned about bugginess, but I doubt that affects simple regexes.

That said, if you really have the urge to have a BSD'd implementation
of regexec, at least choose something like TRE — The free and portable approximate regex matching library.
which is a linear time in size of string.

Nice, we need something for windows.

The Google impl was, at least if i remember right, made up of 3
engines: 1 that was a special case matcher that handled plain old
anchored strings super fast, 1 that dealt with regular expressions
that did not require backtracking, and 1 that dealt with regular
expressions that did.

FWIW, I plan to have different syntax in filecheck for fixed string matching vs regex matching. This should eliminate most of the common reasons for needing "leaning toothpicks".

If you are curious, see Implementing Regular Expressions for how bad the
other libraries can get on fairly simple regexps because they don't
use linear time matchers.

Yep, I'm very familiar with that paper. Thanks Danny,

-Chris

Again, forget boost regex. :slight_smile:

What about std::regex?

No, we have to build with c++'98 compilers. I think you're missing the
point here. We care about code size in llvm, and the best code size you can
get is to link to the one already in your address space because it's part of
libc.

There are multiple ones that have been created that fullfill
std::regex that work just fine on C++98, such as boost::regex and
dinkumware's and I know there is at least one other.

2. Use POSIX regcomp facilities. This implies importing some
implementation of this interface, e.g., Windows. On Linux, BSD, etc.
we would try to use the platform version if available (and non-
buggy).

Don't do it.
They are ridiculous slow, and posix made some really dumb choices in
regexps.

We want to use this from FileCheck, which we build at -O0 today.
Also, each regex will be matched once. Most testcases use fixed
strings (in fact 100% of them do today!). This really is not very
performance sensitive.

Regex engines like this are inherently more powerful but slower than
fixed-purpose matching logic. I don't see a reason not to use a
(slow!) simple regexec version.

I would also prefer not to have all the crazy features. Just
supporting simple matching stuff is perfectly acceptable. We don't
need unicode character classes, negative assertions, etc. We don't
need the full power of perl regex's.

Again, why not Spirit2.1, works just fine on C++98, and it is fast,
and it is split up into the smallest bits so you only include what you
use, and the assembly it compiles into is *very* tiny, far far less
then any regex library could possibly be.

I am more concerned about bugginess, but I doubt that affects simple
regexes.

Spirit2.1 has testcases for every possible aspect of it, thoroughly
tested and well proven.

I think you're missing the whole "we value small code size more than speed of matching" thing.

-Chris

Hello LLVM Devs,

I thought I'd weigh in on some of these non-backtracking linear time RegEx algorithms. If they're anything like the PackRat parsing algorithms they take at least 4x the amount of memory in terms of storage as the string length itself by not backtracking. That should be fine for small RegExes but it wouldn't do so well for more elaborate and long expressions.

If what you need is something for a short regex, a packrat algorithm will do fine. A 256-character string may bloat up to a couple of KB and still be reasonably cheap. If you're looking for a full parser, Spirit2.x might be better.

--Sam

I gathered that, hence why I have been pushing spirit, it makes for
very tight compiled code while working everywhere and being easy to
use (while being fast, but that is a side effect in this case). For
just plain matching you just call parse(iterbegin, iterend,
toMatchToRules). You can even just feed it the file iterators
directly without needing to stringize it first.

OvermindDL1 wrote:

Again, why not Spirit2.1, works just fine on C++98, and it is fast,
and it is split up into the smallest bits so you only include what you
use, and the assembly it compiles into is *very* tiny, far far less
then any regex library could possibly be.
  
Spirit is not an option for one simple reason: FileCheck needs to parse
regexes from its instruction comments at runtime and execute them. It
has no idea, at compile time, what these regexes will be.

Sebastian

I agree with Daniel Berlin, system's regcomp/regexec shouldn't be used.

Slow can mean taking 30 minutes, or hanging indefinetely on some
platforms with buggy regcomp/regexec.

Some examples:
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=497
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=598
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=635
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=658
https://wwws.clamav.net/bugzilla/show_bug.cgi?id=679
http://bugs.opensolaris.org/bugdatabase/printableBug.do?bug_id=4346175

That is why in ClamAV we are using the OpenBSD implementation of
regcomp/regexec, regardless if the system has a regcomp/regexec available.
The code is fairly small (~100k), BSD licensed, easy to make it portable
(memmove->memcpy, pull in strlcpy impl., etc.) and doesn't explode in
execution time.

I wasn't aware of Google's regexp library at that time (perhaps it
didn't exist yet), but if it provides linear execution time that sounds
good too.

If LLVM is going to have an integrated regex library I suggest using it
regardless if the platform has one.
The LLVM integrated regex library will provide consistent behaviour and
execution time, the system one will not.

Best regards,
--Edwin

Ah, did not know it required dynamic generation, I talked about
dynamic/static in a previous email but no one stated which was needed
and no one ever clarified over the course of my messages, does no one
else know this either?
In that case, I would recommend xpressive, it is the faster between it
and regex.