[libcxx] Multiline regular expression matching

Hello,

N3242 is silent on the issue of multiline regular expression matching, i.e. whether ^ and $ match only the beginning
and end of the string, respectively, or also match at occurrences of \n or \r inside the string. It is only
possible to turn the former off (via match_not_bol and match_not_eol, respectively). ECMAScript, on
whose regular expressions the C++0x regex library is based (among others), provides an additional flag in the
RegExp constructor to turn on multiline matching (see also <http://www.regular-expressions.info/anchors.html>
for more information on multiline matching).
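To make the two flags mentioned above concrete, here is a minimal sketch of how match_not_bol and match_not_eol switch off ^ and $ at the ends of the searched range. The helper name found is made up for this illustration; only std::regex_search and the std::regex_constants flags come from the library.

```cpp
#include <regex>
#include <string>

// Hypothetical helper for illustration: returns true if `pattern` is found
// somewhere in `text` when searching with the given match flags.
bool found(const std::string& text, const std::string& pattern,
           std::regex_constants::match_flag_type flags =
               std::regex_constants::match_default)
{
  const std::regex re(pattern);  // default grammar is ECMAScript
  return std::regex_search(text, re, flags);
}
```

With default flags, found("abc", "^a") and found("abc", "c$") succeed; passing match_not_bol or match_not_eol, respectively, makes them fail. Neither flag says anything about line endings inside the string, which is the gap described above.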

I looked into previous standard committee documents about regular expressions, but was unable to find anything
regarding this issue.

I then tried my hand at a workaround using a non-captured disjunction in the following test program, using
libc++ trunk:

// /opt/bin/clang -std=c++0x -stdlib=libc++ -lc++ clang.cpp
#include <cstdio>   // std::printf
#include <regex>
#include <string>

static const std::regex INCLUDE_REGEXP("(?:^|[\\n\\r])#include\\s*<([^>]+)>");

static const std::string s =
  "attribute vec3 vertexUV0;\n"
  "#include <shaders/include/Lighting.glsl>\n"
  "#include <shaders/include/ProjectTextureOnCube.glsl>\n"
  "uniform mat4 mvp;\n";

int main(int, char**)
{
  std::sregex_iterator it(s.begin(), s.end(), INCLUDE_REGEXP);
  std::sregex_iterator const end;
  if (it == end)
  {
    std::printf("Not found\n");
  }
  else
  {
    while (it != end)
    {
      std::printf("Found '%s [%s]'\n", it->str().c_str(), it->str(1).c_str());
      ++it;
    }
  }
}

This resulted in the output

  "Not found".

Exchanging the disjunction's alternatives ("(?:^|[\\n\\r])" => "(?:[\\n\\r]|^)") resulted in a (seemingly)
endless stream of
  Found ' []'
  Found ' []'
  Found ' []'
  Found ' []'
  ...

Removing the disjunction results in two matches (as expected):
  Found '#include <shaders/include/Lighting.glsl> [shaders/include/Lighting.glsl]'
  Found '#include <shaders/include/ProjectTextureOnCube.glsl> [shaders/include/ProjectTextureOnCube.glsl]'

From my reading of the ECMAScript standard (<http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf>),
the above regular expressions are (at least syntactically) valid. So I have the following questions:

- Is libc++'s current behaviour a bug?
- Is there another, simpler way to perform multiline matching using std::regex?

Thanks in advance,
Jonathan

- Is libc++'s current behaviour a bug?

I believe it is a libc++ bug. I've committed a fix in revision 128350.

- Is there another, simpler way to perform multiline matching using std::regex?

Your way looks as good as any to me.
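For reference, the workaround can be packaged as a small helper. This is only a sketch: the function name include_paths is an assumption made for illustration, and the pattern is the one from the original test program. Because the path is taken from capture group 1, the result does not depend on whether the full match includes the preceding line ending.

```cpp
#include <regex>
#include <string>
#include <vector>

// Hypothetical helper: collects the path between '<' and '>' of every
// #include directive that starts at the beginning of the input or directly
// after a \n or \r.
std::vector<std::string> include_paths(const std::string& source)
{
  // Same expression as in the test program: a non-capturing disjunction
  // standing in for a multiline ^.
  static const std::regex include_re("(?:^|[\\n\\r])#include\\s*<([^>]+)>");

  std::vector<std::string> paths;
  const std::sregex_iterator end;
  for (std::sregex_iterator it(source.begin(), source.end(), include_re);
       it != end; ++it)
  {
    paths.push_back(it->str(1));  // group 1: the text inside <...>
  }
  return paths;
}
```

Applied to the shader string from the test program, this should yield the two paths shaders/include/Lighting.glsl and shaders/include/ProjectTextureOnCube.glsl.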

Thanks for bringing this to our attention.

-Howard

Hello,

- Is libc++'s current behaviour a bug?

I believe it is a libc++ bug. I've committed a fix in revision 128350.

It works now as expected. Thank you!

To avoid capturing the previous line ending, I changed the expression slightly to use a lookahead
assertion (written as a C string literal, hence the doubled backslashes):
"(?=^|[\\n\\r])#include\\s*<([^>]+)>". This also worked.
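One caveat: a lookahead consumes no input, so in (?=^|[\n\r])#include the [\n\r] branch would need the same character to be both a line ending and '#'. The lookahead form therefore relies on the ^ branch alone. A possible alternative, sketched below and not taken from this thread, keeps the consuming disjunction and wraps the directive in an extra capture group, so that the line ending stays out of the captured text. The function name include_directives is made up for illustration.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: each result is {directive text, path}, where the
// directive text excludes any line ending that preceded it.
std::vector<std::pair<std::string, std::string>>
include_directives(const std::string& source)
{
  // Group 1 wraps the directive itself; group 2 is the path inside <...>.
  static const std::regex include_re(
      "(?:^|[\\n\\r])(#include\\s*<([^>]+)>)");

  std::vector<std::pair<std::string, std::string>> result;
  const std::sregex_iterator end;
  for (std::sregex_iterator it(source.begin(), source.end(), include_re);
       it != end; ++it)
  {
    result.emplace_back(it->str(1), it->str(2));
  }
  return result;
}
```

Here it->str() may still begin with \n or \r, but it->str(1) starts at the '#'.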

Jonathan