[Introduction]

Unix Incompatibility Notes:
Regular Expression Libraries

Jan Wolter

All Unix systems seem to have some form of regular expression parsing library that can be invoked from C programs, but these libraries are not very compatible with each other. Both the regular expression syntax and the C-language API vary considerably.

The best solution for programmers wishing to use regular expression routines in portable C programs is probably not to depend on the regular expression libraries that may be found on the host computer, but to include a copy of Henry Spencer's implementation of the POSIX 1003.2 standard or Philip Hazel's Perl-Compatible Regular Expression Library in your distribution. This is what Apache, MSQL and many other packages do. The copyright terms for these libraries are liberal enough that almost anyone should be able to make use of them, and both packages work well.

If you don't want to do this, then you may want to require people to install a POSIX 1003.2 compliant library on their system. In addition to the Henry Spencer package, there is a free Gnu package (which includes the usual Gnu copyright which prevents it from being rolled into commercial software distributions).

If you are misguided, you can do what I did in earlier versions of Backtalk and try to support a wide range of different regular expression libraries in your program. Not only will this keep you awake late into the night coding in support for many strange and arcane regular expression libraries, but you will have to figure out how to deal with the fact that the regular expression syntax will be different depending on which library you compiled with.

In support of such misguided souls, here are some rough notes describing some different libraries and how they might be supported. Don't expect me to update this though - I've ripped all of that code out of Backtalk and hope never to have to look at this again.

Nearly all regular expression libraries work in two steps - first there is a function call that compiles the regular expression, and then there is a function call that uses that compiled regular expression to search a string.

Libraries

Posix Regex

POSIX 1003.2 section 2.8 apparently sets a standard for regular expression syntax and for the C API to regular expression libraries. This is presumably the wave of the future. Freeware implementations exist by Henry Spencer and by Gnu. All the open source Unixes have these.

It includes support for two different regular expression syntaxes, basic and extended. Basic regular expressions are similar to those in ed, while extended regular expressions are more like those in egrep, adding the '|', '+' and '?' operators and not requiring backslashes on parenthesized subexpressions or curly-bracketed bounds. Basic is the default, but extended is preferred.
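To make the difference between the two syntaxes concrete, here is a minimal sketch using the POSIX regcomp() and regexec() calls. The helper name pattern_matches() is made up for this example and is not part of any library. Passing REG_EXTENDED selects extended syntax; passing 0 leaves the basic default in place.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper for illustration: returns 1 if `pattern` matches
 * somewhere in `text`, 0 if it does not, and -1 if the pattern does not
 * compile. Pass REG_EXTENDED in `cflags` for extended syntax, or 0 for
 * the basic default. */
static int pattern_matches(const char *pattern, const char *text, int cflags)
{
    regex_t re;
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_NOSUB: we only want a yes/no answer, not match offsets. */
    if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
        return -1;

    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With this helper, the pattern "a+" matches "aaa" only when REG_EXTENDED is given; in basic syntax the '+' is an ordinary character, so the same pattern matches the literal text "a+" instead. Likewise the basic pattern "\(ab\)\{2\}" and the extended pattern "(ab){2}" both match "abab".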

C API uses the regex.h header file. Compilation is by the regcomp() function, search is done with the regexec() function.
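As a rough illustration of the two-step compile and search flow with this API, the sketch below compiles an extended expression, optionally with case-insensitive matching, and reports where the first match starts and ends. The function name find_match() and its return convention are assumptions made for this example; regcomp(), regexec(), regerror() and regfree() are the standard POSIX calls.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Hypothetical helper for illustration: searches `text` for the extended
 * regular expression `pattern`, ignoring case when `icase` is nonzero.
 * Returns 1 and stores the match offsets on success, 0 if there is no
 * match, and -1 if the pattern fails to compile. */
static int find_match(const char *pattern, const char *text, int icase,
                      int *start, int *end)
{
    regex_t re;
    regmatch_t m[1];
    char msg[128];
    int cflags = REG_EXTENDED | (icase ? REG_ICASE : 0);
    int rc;

    assert(pattern != NULL && text != NULL);
    assert(start != NULL && end != NULL);

    /* Step one: compile the expression. */
    rc = regcomp(&re, pattern, cflags);
    if (rc != 0) {
        regerror(rc, &re, msg, sizeof msg);
        fprintf(stderr, "regcomp: %s\n", msg);
        return -1;
    }

    /* Step two: use the compiled expression to search the string. */
    rc = regexec(&re, text, 1, m, 0);
    if (rc == 0) {
        *start = (int)m[0].rm_so;   /* offset of first matched character */
        *end = (int)m[0].rm_eo;     /* offset just past the match */
    }

    regfree(&re);
    return rc == 0 ? 1 : 0;
}
```

A real program would normally keep the compiled regex_t around and call regexec() on many strings before freeing it; compiling on every call, as this sketch does, is only to keep the example short.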

Bell Regexp

This originated in Bell Version 8 Unix, and exists in 4.3bsd-Tahoe and BSDI. It resembles the Posix interface, but isn't. It is always case sensitive.

C API uses the regexp.h header file. Compilation is by the regcomp() function, search is done with the regexec() function.

Gnu RE_Compile_Pattern

The Gnu regular expression package has a POSIX interface, but also has its own native interface, which maybe I ought to look at someday, but it seems more portable to use the POSIX interface.

Compilation is by re_compile_pattern(), searching is with re_search().
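For the curious, here is a minimal sketch of the native interface as it is exposed by the GNU C library's regex.h. It assumes glibc, where these declarations only appear if _GNU_SOURCE is defined before any header is included, and where regex_t is the same type as struct re_pattern_buffer. The helper name gnu_first_match() is made up for this example, and other copies of the Gnu package may differ in detail.

```c
#define _GNU_SOURCE   /* must precede every #include to expose the GNU API */
#include <assert.h>
#include <regex.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper for illustration: returns the offset of the first
 * match of `pattern` in `text`, -1 if there is no match, or -2 on a
 * compile or internal error. Assumes glibc's regex.h. */
static int gnu_first_match(const char *pattern, const char *text)
{
    struct re_pattern_buffer buf;
    reg_syntax_t old_syntax;
    const char *err;
    int len, pos;

    assert(pattern != NULL && text != NULL);

    /* A zeroed buffer means no preallocated storage, no fastmap and no
     * translate table; the library allocates what it needs. */
    memset(&buf, 0, sizeof buf);

    /* The syntax is a global setting in this interface; select POSIX
     * extended syntax for the compile and then restore the old value. */
    old_syntax = re_set_syntax(RE_SYNTAX_POSIX_EXTENDED);
    err = re_compile_pattern(pattern, strlen(pattern), &buf);
    re_set_syntax(old_syntax);
    if (err != NULL) {
        fprintf(stderr, "re_compile_pattern: %s\n", err);
        return -2;
    }

    /* Try start positions 0..len; NULL means no register information. */
    len = (int)strlen(text);
    pos = (int)re_search(&buf, text, len, 0, len, NULL);

    regfree(&buf);
    return pos;
}
```

Note that re_compile_pattern() returns an error string rather than an error code, and that the syntax is chosen through a global rather than a per-call flag, which is part of why the POSIX interface is the easier one to use portably.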

Lib-Gen Basic

Solaris and IRIX have two different regular expression libraries that can be linked in with the -lgen flag at compile time. This one handles a regular expression syntax similar to POSIX Basic expressions. It is always case sensitive.

C API uses the regexpr.h header file. Compilation is by the compile() function, searching is with step().

Lib-Gen Extended

Solaris and IRIX have two different regular expression libraries that can be linked in with the -lgen flag at compile time. This one handles a regular expression syntax a bit closer to POSIX Extended expressions, but lacks some of the fancier features. It is always case sensitive.

C API uses the libgen.h header file. Compilation is by the regcmp() function, searching is with regex().

BSD

The old BSD library (which apparently originally comes from System V) uses the rather weird regexp.h header file, which contains not only function definitions, but the actual source code for the regular expression matcher. You have to define a bunch of macros that get inserted into the program at compile time. Though the interface is hideous, this library has the advantage that you can (in theory) move it to another machine simply by copying the regexp.h header file. Since it is under the BSD license, this is generally even legal. Because of this, the library is available on many systems, including some that don't have a manual page for it (there are usually some comments in the header file itself that will help). It exists (at least) on SunOS, IRIX and AIX and probably many more.

The expressions are normally pretty much ed-style basic expressions. But at least some versions of Linux include a regexp.h header that actually handles a more extended regular expression syntax.

C API uses the regexp.h header file, but you have to define macros INIT, GETC(), PEEKC(), UNGETC(), RETURN() and ERROR() before including it. Compilation is by the compile() function, searching is with step().

The Linux version of this gives a compilation error unless you include the declaration "static getrnge();" before including the regexp.h header file. You can safely include this declaration on all systems; it will be ignored by systems that don't need it.

RE_Match

This exists on NeXTstep. It supports case insensitive matching.

Regular expression syntax is similar to POSIX Basic expressions.

C API uses the regex.h header file. Compilation is by the re_compile() function, searching is with re_match().

RE_Exec

This exists on SunOS, AIX, NeXTstep and IRIX. Gnu's package can emulate this interface. It is always case sensitive.

Regular expression syntax is similar to POSIX Basic expressions.

C API uses no header file. Compilation is by the re_comp() function, searching is with re_exec().

Recmp

This exists on NeXTstep. It is just a single function, recmp(), that does both compilation and execution. It is always case sensitive.

Regular expression syntax is similar to POSIX Basic expressions.

AutoConf

When I still tried to use the host system's regular expression library in Backtalk, this was the autoconf code I used to decide which library to use. I was preferring libraries that supported case independent matching.

bt_regexp=none
AC_MSG_CHECKING(for posix regex.h)
AC_CACHE_VAL(bt_cv_posix_regex, [dnl
AC_EGREP_HEADER([regex_t],regex.h,
  [bt_cv_posix_regex=yes],[bt_cv_posix_regex=no])])
if test "$bt_cv_posix_regex" = yes; then
  AC_DEFINE(RE_POSIX)
  bt_regexp=posix
fi
AC_MSG_RESULT($bt_cv_posix_regex)

if test $bt_regexp = none; then
AC_CHECK_FUNC(re_compile,
[ AC_DEFINE(RE_MATCH)
  bt_regexp=rematch
])
fi

if test $bt_regexp = none; then
AC_CHECK_LIB(gen,compile,
[ AC_DEFINE(RE_GENBAS)
  LIBS="-lgen $LIBS"
  bt_regexp=gen
])
fi

if test $bt_regexp = none; then
AC_CHECK_LIB(gen,regcmp,
[ AC_DEFINE(RE_GENEXT)
  LIBS="-lgen $LIBS"
  bt_regexp=gensgi
])
fi

if test $bt_regexp = none; then
AC_MSG_CHECKING(for old-style bsd regexp)
AC_CACHE_VAL(bt_cv_old_regexp, [dnl
AC_EGREP_CPP([a_pig_by_the_nose],[#define UNGETC(c) a_pig_by_the_nose
#include <regexp.h>
],[bt_cv_old_regexp=yes],[bt_cv_old_regexp=no])])
if test "$bt_cv_old_regexp" = yes; then
  AC_DEFINE(RE_BSD)
  bt_regexp=bsd
fi
AC_MSG_RESULT($bt_cv_old_regexp)
fi

if test $bt_regexp = none; then
AC_CHECK_FUNC(regcomp,
[ AC_DEFINE(RE_BELL)
  bt_regexp=bell
])
fi

if test $bt_regexp = none; then
AC_CHECK_FUNC(re_comp,
[ AC_DEFINE(RE_EXEC)
  bt_regexp=exec
])
fi

if test $bt_regexp = none; then
AC_CHECK_FUNC(recmp,
[ AC_DEFINE(RE_RECMP)
  bt_regexp=recmp
])
fi
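Macros like RE_POSIX and RE_EXEC from the checks above would then be tested in the C source to pick an implementation. The sketch below shows roughly how that might look for just two of the libraries. It is not Backtalk's actual code: the names xre_compile() and xre_search() are invented for this example, the fallback that defines RE_POSIX exists only so the fragment compiles on its own, and the re_comp()/re_exec() prototypes are written the way the GNU library declares them, which may not match every system.

```c
#include <assert.h>
#include <stddef.h>

/* Assumption for this sketch only: fall back to the POSIX branch when no
 * configure-generated macro is defined, so the fragment stands alone. */
#if !defined(RE_POSIX) && !defined(RE_EXEC)
#define RE_POSIX 1
#endif

#if defined(RE_POSIX)
#include <regex.h>

static regex_t compiled;
static int have_compiled = 0;

/* Compile `pattern` (basic syntax); returns 0 on success, 1 on error. */
static int xre_compile(const char *pattern)
{
    assert(pattern != NULL);
    if (have_compiled) {
        regfree(&compiled);
        have_compiled = 0;
    }
    if (regcomp(&compiled, pattern, REG_NOSUB) != 0)
        return 1;
    have_compiled = 1;
    return 0;
}

/* Returns 1 if the last compiled pattern matches `text`, else 0. */
static int xre_search(const char *text)
{
    assert(text != NULL);
    return have_compiled && regexec(&compiled, text, 0, NULL, 0) == 0;
}

#elif defined(RE_EXEC)
/* This interface has no header file, so declare the functions here.
 * Exact prototypes vary between systems. */
extern char *re_comp(const char *);
extern int re_exec(const char *);

static int xre_compile(const char *pattern)
{
    assert(pattern != NULL);
    /* re_comp() returns NULL on success, an error string on failure. */
    return re_comp(pattern) != NULL;
}

static int xre_search(const char *text)
{
    assert(text != NULL);
    return re_exec(text) == 1;
}
#endif
```

Both branches keep the compiled expression in hidden static state because that is all re_comp() and re_exec() offer. Even with a wrapper like this, the caller still has to cope with the accepted syntax changing from one branch to the next.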

Jan Wolter
Tue May 23 14:42:30 EDT 2000 - Original Release.
Tue Nov 19 10:00:56 EST 2002 - Add reference to PCRE.