Clang and UTF BOM characters

Hi,

According to the UTF-8 standard, the BOM character sequence may be present at the beginning of a file. Clang doesn’t seem to support this, and produces an error, specifying the characters as unknown tokens.

This should be fixed IMHO. The way I handle it (if input is through std::ifstream):

inline void processBOM( std::ifstream &stream )
{
    const unsigned char BOM[] = { 0xef, 0xbb, 0xbf };
    char first3chars[3];
    if( !stream.read( first3chars, 3 ) )
        throw std::runtime_error( "Unexpected end of file" );
    if( strcmp(reinterpret_cast<const char*>(BOM), first3chars) )
        stream.seekg( 0, std::ios::beg ); // reset to beginning of file
}

This essentially skips the BOM if present. But the solution is of course up to you and Clang’s design in this aspect.

Ruben

In message <AANLkTikSBG4iizY_RuSoBodkR+wgoN4ihDxDCv7FnBkz@mail.gmail.com>

2010/10/16 Ruben Van Boxem <vanboxem.ruben@gmail.com>

I actually meant strncmp(…,…,3), by the way :)

Ruben

Why strQQQ when you don't expect null terminators anywhere?

~ Scott

Why not? I’m not big on C; I just used it here for brevity. There should be no difference between the two in this use case, IMHO?

Ruben

Correct, should do memcmp.

strncmp() will compare up to n characters, stopping when a null byte is encountered. memcmp() will compare exactly n bytes. Both will stop if there is a difference between the two inputs. A null byte in the input stream is a) not a terminator, and b) not valid. Therefore, testing for it is less efficient.

Of course, when testing three bytes, performance is not likely to be an issue, and (hopefully) the pass that lowers standard library calls will turn it into a trivial comparison, avoiding the function call. Avoiding the test for null will have almost no impact on performance. Using strncmp(), however, does hurt code readability, because it implies to the reader that at least one of the arguments is a null-terminated C string.
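
To make the difference concrete, here is a small sketch (the helper names are hypothetical, chosen only for this illustration). Two 3-byte buffers agree up to an embedded null byte and differ after it: strncmp() stops at the null and reports them equal, while memcmp() looks at all three bytes and reports a difference.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: equality as seen by strncmp(), which stops
// comparing once it has matched a null byte in both inputs.
bool equalByStrncmp(const char* a, const char* b, std::size_t n) {
    return std::strncmp(a, b, n) == 0;
}

// Hypothetical helper: equality as seen by memcmp(), which treats
// all n bytes as plain data, nulls included.
bool equalByMemcmp(const char* a, const char* b, std::size_t n) {
    return std::memcmp(a, b, n) == 0;
}

// Buffers that match up to an embedded null and differ after it.
void demoEmbeddedNull() {
    const char a[3] = { 'x', '\0', 'A' };
    const char b[3] = { 'x', '\0', 'B' };
    assert(equalByStrncmp(a, b, 3));   // the third byte is never examined
    assert(!equalByMemcmp(a, b, 3));   // the third byte differs
}
```

The BOM bytes contain no null, so for this particular check both calls give the same answer; the sketch only shows why memcmp() states the intent (raw bytes, fixed length) more accurately.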

David

-- Sent from my Cray X1

Mostly pointless pedanticism, but memcmp() is also allowed to read
ahead more than strncmp() is, which makes it slightly more
optimizable.

- Daniel
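
Putting the suggestions from this thread together, a BOM check based on memcmp() might look roughly like the sketch below. This is only an illustration, not how Clang handles its input: the name skipUtf8Bom is made up, it takes a generic std::istream rather than std::ifstream, and, unlike the original snippet, it assumes that an input shorter than three bytes should be treated as "no BOM" rather than as an error.

```cpp
#include <cassert>
#include <cstring>
#include <istream>
#include <sstream>
#include <string>

// Hypothetical helper: consumes a leading UTF-8 BOM (EF BB BF) if present.
// Returns true if a BOM was skipped; otherwise rewinds to the start.
inline bool skipUtf8Bom(std::istream& stream) {
    static const unsigned char bom[3] = { 0xEF, 0xBB, 0xBF };
    char first[3];
    stream.read(first, sizeof first);
    if (stream.gcount() == 3 && std::memcmp(first, bom, sizeof bom) == 0)
        return true;                    // BOM consumed, stream is past it
    stream.clear();                     // a short read sets eofbit/failbit
    stream.seekg(0, std::ios::beg);     // no BOM: reset to beginning
    return false;
}

// Small check using in-memory streams instead of files.
inline void demoSkipUtf8Bom() {
    std::string rest;
    std::istringstream withBom(std::string("\xEF\xBB\xBF" "int x;"));
    assert(skipUtf8Bom(withBom));
    std::getline(withBom, rest);
    assert(rest == "int x;");

    std::istringstream noBom("int y;");
    assert(!skipUtf8Bom(noBom));
    std::getline(noBom, rest);
    assert(rest == "int y;");
}
```

The clear() call matters because a read that hits end of file leaves the stream in a failed state, and seekg() does nothing useful on a failed stream.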