The GNU Awk User's Guide

C.4 Probable Future Extensions

AWK is a language similar to PERL, only considerably more elegant.
Arnold Robbins

Larry Wall

This section briefly lists extensions and possible improvements that indicate the directions we are currently considering for gawk. The file `FUTURES' in the gawk distribution lists these extensions as well.

Following is a list of probable future changes visible at the awk language level:

Loadable module interface
It is not clear that the awk-level interface to the modules facility is as good as it should be. The interface needs to be redesigned, particularly to take namespace issues into account, and possibly also to address library search path order and versioning.

RECLEN variable for fixed-length records
Along with FIELDWIDTHS, this would speed up the processing of fixed-length records. PROCINFO["RS"] would be "RS" or "RECLEN", depending upon which kind of record processing is in effect.

Additional printf specifiers
The 1999 ISO C standard added a number of additional printf format specifiers. These should be evaluated for possible inclusion in gawk.

Databases
It may be possible to map a GDBM/NDBM/SDBM file into an awk array.

Large character sets
It would be nice if gawk could handle UTF-8 and other character sets that are larger than eight bits.

More lint warnings
There are more things that could be checked for portability.
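
The RECLEN item above can be illustrated with a rough sketch. It is written in Python rather than awk, since RECLEN is only a proposal and does not exist in gawk. The function names `read_fixed_records` and `split_fields` are hypothetical. The point is that a reader that knows the record length can slice the input at fixed offsets instead of scanning for a record separator, and a list of widths then plays the role of FIELDWIDTHS.

```python
import io

def read_fixed_records(stream, reclen):
    """Yield successive records of `reclen` characters each.

    No scan for a record separator is needed. A short trailing
    record, if there is one, is yielded as-is.
    """
    while True:
        record = stream.read(reclen)
        if not record:
            break
        yield record

def split_fields(record, widths):
    """Cut a record into fields of the given widths (in the spirit of FIELDWIDTHS)."""
    fields, pos = [], 0
    for width in widths:
        fields.append(record[pos:pos + width])
        pos += width
    return fields

# Two 8-character records, each with fields of width 2, 3, and 3.
data = io.StringIO("AB123xyzCD456uvw")
rows = [split_fields(rec, [2, 3, 3]) for rec in read_fixed_records(data, 8)]
```

Both steps are plain offset arithmetic, which is where the hoped-for speedup would come from.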

Following is a list of probable improvements that will make gawk's source code easier to work with:

Loadable module mechanics
The current extension mechanism works (see section Adding New Built-in Functions to gawk), but is rather primitive. It requires a fair amount of manual work to create and integrate a loadable module. Nor is the current mechanism as portable as might be desired. The GNU libtool package provides a number of features that would make using loadable modules much easier. gawk should be changed to use libtool.

Loadable module internals
The API to its internals that gawk "exports" should be revised. Too many things are needlessly exposed. A new API should be designed and implemented to make module writing easier.

Better array subscript management
gawk's management of array subscript storage could use revamping, so that using the same value to index multiple arrays only stores one copy of the index value.

Integrating the DBUG library
Integrating Fred Fish's DBUG library would be helpful during development, but it's a lot of work to do.
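
The array subscript item above describes what is often called interning: keeping one shared copy of each distinct index value. The following Python sketch shows only the general technique and is not how gawk stores subscripts. The class name `SubscriptPool` is hypothetical.

```python
class SubscriptPool:
    """Keep one shared copy of each distinct subscript string."""

    def __init__(self):
        self._pool = {}

    def intern(self, key):
        # setdefault() returns the object already stored when an equal
        # key exists; otherwise it stores and returns `key` itself.
        return self._pool.setdefault(key, key)

pool = SubscriptPool()
a, b = {}, {}                      # stand-ins for two awk arrays

k1 = "".join(["user", "42"])       # two equal strings built at run time
k2 = "".join(["user", "42"])

a[pool.intern(k1)] = 1
b[pool.intern(k2)] = 2

# Both arrays now refer to the same key object, so only one copy is kept.
shared = next(iter(a)) is next(iter(b))
```

The cost is one extra table lookup per subscript, so whether the memory saved is worth it would depend on how often the same index value recurs across arrays.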

Following is a list of probable improvements that will make gawk perform better:

An improved version of dfa
The dfa pattern matcher from GNU grep has some problems. Either a new version or a fixed one will deal with some important regexp matching issues.

Compilation of awk programs
gawk uses a Bison (YACC-like) parser to convert the script given to it into a syntax tree; the syntax tree is then executed by a simple recursive evaluator. This method incurs a lot of overhead, since the recursive evaluator performs many procedure calls to do even the simplest things.

It should be possible for gawk to convert the script's parse tree into a C program which the user would then compile, using the normal C compiler and a special gawk library to provide all the needed functions (regexps, fields, associative arrays, type coercion, and so on).

An easier possibility might be for an intermediate phase of gawk to convert the parse tree into a linear byte code form like the one used in GNU Emacs Lisp. The recursive evaluator would then be replaced by a straight-line byte code interpreter that would be intermediate in speed between running a compiled program and doing what gawk does now.
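
The difference between the two execution strategies can be sketched on a toy arithmetic language. This Python example is an illustration under simplifying assumptions and says nothing about gawk's actual internals. The tuple-based tree format, the instruction names, and the functions `eval_tree`, `compile_tree`, and `run` are all hypothetical.

```python
# Parse tree nodes: ("num", value) or (op, left, right) with op in "+", "-", "*".

def eval_tree(node):
    """Recursive evaluator: one function call per tree node."""
    if node[0] == "num":
        return node[1]
    op, left, right = node
    a, b = eval_tree(left), eval_tree(right)
    return a + b if op == "+" else a - b if op == "-" else a * b

def compile_tree(node, code=None):
    """Flatten the tree into a linear list of stack-machine instructions."""
    if code is None:
        code = []
    if node[0] == "num":
        code.append(("push", node[1]))
    else:
        op, left, right = node
        compile_tree(left, code)     # operands first ...
        compile_tree(right, code)
        code.append((op, None))      # ... then the operator (postfix order)
    return code

def run(code):
    """Straight-line interpreter: a single loop over the instruction list."""
    stack = []
    for opcode, arg in code:
        if opcode == "push":
            stack.append(arg)
        else:
            b = stack.pop()
            a = stack.pop()
            stack.append(a + b if opcode == "+" else a - b if opcode == "-" else a * b)
    return stack[-1]

tree = ("+", ("num", 2), ("*", ("num", 3), ("num", 4)))   # 2 + 3 * 4
program = compile_tree(tree)
```

The tree is walked recursively only once, at compile time. Each later execution of `program` is a flat loop with no nested calls, which is the overhead the byte code approach aims to remove.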

Finally, the programs in the test suite could use documenting in this guide.

See section Making Additions to gawk, if you are interested in tackling any of these projects.

