
The GNU C Library

iconv module interfaces

With the knowledge about the data structures, we can now describe the conversion function itself. Understanding the interface requires some knowledge of how the C library loads the objects that contain the conversions.

It is often the case that one conversion is used more than once (i.e., there are several iconv_open calls for the same set of character sets during one program run). The mbsrtowcs function and its relatives in the GNU C library also use the iconv functionality, which increases the number of uses of the same functions even more.

Because of this multiple use of conversions, the modules do not get loaded exclusively for one conversion. Instead, a module, once loaded, can be used by an arbitrary number of iconv or mbsrtowcs calls at the same time. Splitting the information into conversion-function-specific information and conversion data makes this possible. The last section showed the two data structures used to do this.
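The split described above can be pictured with a minimal sketch. The two structures and the demo_open function are hypothetical stand-ins invented for illustration; the real struct __gconv_step and struct __gconv_step_data carry more fields, and use_count is an assumption, not a documented element. The point is only that several uses share one step object while each keeps its own state.

```c
#include <stddef.h>

/* Hypothetical stand-in for the shared, conversion-function-specific
   information: one object per loaded conversion.  */
struct demo_step
{
  const char *from_name;
  const char *to_name;
  int use_count;            /* assumed bookkeeping, for illustration only */
};

/* Hypothetical stand-in for the conversion data: one object per use.  */
struct demo_step_data
{
  const struct demo_step *step;   /* points at the shared object */
  int shift_state;                /* private to this use */
};

/* Start one more use of an already loaded conversion.  The shared
   object is only referenced, never copied.  */
static struct demo_step_data
demo_open (struct demo_step *shared)
{
  struct demo_step_data d = { shared, 0 };
  ++shared->use_count;
  return d;
}
```

Two calls to demo_open with the same shared object yield two data records whose shift_state values evolve independently, which is why the shared object must not hold per-conversion state.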

This is of course also reflected in the interface and semantics of the functions that the modules must provide. There are three functions that must have the following names:

gconv_init
The gconv_init function initializes the conversion-function-specific data structure. This very same object is shared by all conversions that use this conversion and, therefore, no state information about the conversion itself may be stored in it. If a module implements more than one conversion, the gconv_init function will be called multiple times.

gconv_end
The gconv_end function is responsible for freeing all resources allocated by the gconv_init function. If there is nothing to do, this function can be omitted. Special care must be taken if the module implements more than one conversion and the gconv_init function does not allocate the same resources for all conversions.

gconv
This is the actual conversion function. It is called to convert one block of text. It gets passed the conversion step information initialized by gconv_init and the conversion data, specific to this use of the conversion functions.

There are three data types defined for the three module interface functions and these define the interface.

Data type: int (*__gconv_init_fct) (struct __gconv_step *)
This specifies the interface of the initialization function of the module. It is called exactly once for each conversion the module implements.

As explained in the description of the struct __gconv_step data structure above, the initialization function has to initialize parts of it.

__min_needed_from
__max_needed_from
__min_needed_to
__max_needed_to
These elements must be initialized to the exact numbers of the minimum and maximum number of bytes used by one character in the source and destination character sets, respectively. If the characters all have the same size, the minimum and maximum values are the same.

__stateful
This element must be initialized to a nonzero value if the source character set is stateful. Otherwise it must be zero.
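For a fixed-width, stateless pair of character sets the values are particularly simple. The following is a minimal sketch, assuming a conversion from ISO-8859-1 (one byte per character) to a four-byte internal form; struct demo_sizes and demo_init_latin1 are hypothetical names that mirror only the elements just described.

```c
/* Hypothetical stand-in holding only the elements discussed above.  */
struct demo_sizes
{
  int min_needed_from;
  int max_needed_from;
  int min_needed_to;
  int max_needed_to;
  int stateful;
};

/* Every source character is one byte, every destination character is
   four bytes, and no shift state exists, so minimum and maximum
   coincide and the stateful flag is zero.  */
static void
demo_init_latin1 (struct demo_sizes *step)
{
  step->min_needed_from = 1;
  step->max_needed_from = 1;
  step->min_needed_to = 4;
  step->max_needed_to = 4;
  step->stateful = 0;
}
```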

If the initialization function needs to communicate some information to the conversion function, this communication can happen using the __data element of the __gconv_step structure. But since this data is shared by all the conversions, it must not be modified by the conversion function. The example below shows how this can be used.

#define MIN_NEEDED_FROM         1
#define MAX_NEEDED_FROM         4
#define MIN_NEEDED_TO           4
#define MAX_NEEDED_TO           4

int
gconv_init (struct __gconv_step *step)
{
  /* Determine which direction.  */
  struct iso2022jp_data *new_data;
  enum direction dir = illegal_dir;
  enum variant var = illegal_var;
  int result;

  if (__strcasecmp (step->__from_name, "ISO-2022-JP//") == 0)
    {
      dir = from_iso2022jp;
      var = iso2022jp;
    }
  else if (__strcasecmp (step->__to_name, "ISO-2022-JP//") == 0)
    {
      dir = to_iso2022jp;
      var = iso2022jp;
    }
  else if (__strcasecmp (step->__from_name, "ISO-2022-JP-2//") == 0)
    {
      dir = from_iso2022jp;
      var = iso2022jp2;
    }
  else if (__strcasecmp (step->__to_name, "ISO-2022-JP-2//") == 0)
    {
      dir = to_iso2022jp;
      var = iso2022jp2;
    }

  result = __GCONV_NOCONV;
  if (dir != illegal_dir)
    {
      new_data = (struct iso2022jp_data *)
        malloc (sizeof (struct iso2022jp_data));

      result = __GCONV_NOMEM;
      if (new_data != NULL)
        {
          new_data->dir = dir;
          new_data->var = var;
          step->__data = new_data;

          if (dir == from_iso2022jp)
            {
              step->__min_needed_from = MIN_NEEDED_FROM;
              step->__max_needed_from = MAX_NEEDED_FROM;
              step->__min_needed_to = MIN_NEEDED_TO;
              step->__max_needed_to = MAX_NEEDED_TO;
            }
          else
            {
              /* The roles of the two character sets are swapped.  */
              step->__min_needed_from = MIN_NEEDED_TO;
              step->__max_needed_from = MAX_NEEDED_TO;
              step->__min_needed_to = MIN_NEEDED_FROM;
              step->__max_needed_to = MAX_NEEDED_FROM + 2;
            }

          /* Yes, this is a stateful encoding.  */
          step->__stateful = 1;

          result = __GCONV_OK;
        }
    }

  return result;
}

The function first checks which conversion is wanted. The module from which this function is taken implements four different conversions; which one is selected can be determined by comparing the names. The comparison should always be done without paying attention to the case.
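The case-insensitive name check can be tried outside the library with the POSIX strcasecmp function from <strings.h>, which stands in here for the internal __strcasecmp. This is only a sketch: the enum and demo_direction are hypothetical names, and only one of the two variants is checked to keep the example short.

```c
#include <strings.h>

/* Hypothetical direction codes, for illustration only.  */
enum demo_dir { demo_illegal_dir, demo_from_iso2022jp, demo_to_iso2022jp };

/* Decide the direction by comparing the step names without regard
   to case.  Names that match neither side are rejected.  */
static enum demo_dir
demo_direction (const char *from_name, const char *to_name)
{
  if (strcasecmp (from_name, "ISO-2022-JP//") == 0)
    return demo_from_iso2022jp;
  if (strcasecmp (to_name, "ISO-2022-JP//") == 0)
    return demo_to_iso2022jp;
  return demo_illegal_dir;
}
```

A name written as "iso-2022-jp//" is therefore treated exactly like "ISO-2022-JP//".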

Next, a data structure, which contains the necessary information about which conversion is selected, is allocated. The data structure struct iso2022jp_data is locally defined since, outside the module, this data is not used at all. Please note that if all four conversions this module supports are requested, there are four data blocks.

One interesting thing is the initialization of the __min_ and __max_ elements of the step data object. A single ISO-2022-JP character can consist of one to four bytes. Therefore the MIN_NEEDED_FROM and MAX_NEEDED_FROM macros are defined this way. The output is always the INTERNAL character set (aka UCS-4) and therefore each character consists of exactly four bytes. For the conversion from INTERNAL to ISO-2022-JP we have to take into account that escape sequences might be necessary to switch the character sets. Therefore the __max_needed_to element for this direction gets assigned MAX_NEEDED_FROM + 2. This takes into account the two bytes needed for the escape sequences to signal the switching.

The asymmetry in the maximum values for the two directions can be explained easily: when reading ISO-2022-JP text, escape sequences can be handled alone (i.e., it is not necessary to process a real character since the effect of the escape sequence can be recorded in the state information). The situation is different for the other direction. Since it is in general not known which character comes next, one cannot emit escape sequences to change the state in advance. This means the escape sequences have to be emitted together with the next character. Therefore one needs more room than only for the character itself.
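To see what these limits mean in practice, the following sketch computes a rough upper bound on the output size for a given amount of input. The function worst_case_output is hypothetical and not part of the library; it only illustrates the arithmetic implied by the minimum and maximum values from the example.

```c
#include <stddef.h>

#define MIN_NEEDED_FROM         1
#define MAX_NEEDED_FROM         4
#define MIN_NEEDED_TO           4
#define MAX_NEEDED_TO           4

/* INBYTES bytes of input hold at most INBYTES / MIN_IN characters
   (rounded up), and each of them produces at most MAX_OUT bytes.  */
static size_t
worst_case_output (size_t inbytes, size_t min_in, size_t max_out)
{
  size_t max_chars = (inbytes + min_in - 1) / min_in;
  return max_chars * max_out;
}
```

With these values, 10 bytes of ISO-2022-JP input need at most 10 * 4 = 40 bytes of INTERNAL output, while 40 bytes of INTERNAL input (10 characters) need at most 10 * (MAX_NEEDED_FROM + 2) = 60 bytes in the other direction.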

The possible return values of the initialization function are:

__GCONV_OK
The initialization succeeded.

__GCONV_NOCONV
The requested conversion is not supported in the module. This can happen if the `gconv-modules' file has errors.

__GCONV_NOMEM
Memory required to store additional information could not be allocated.

The function called before the module is unloaded is significantly simpler. It often has nothing at all to do, in which case it can be left out completely.

Data type: void (*__gconv_end_fct) (struct __gconv_step *)
The task of this function is to free all resources allocated in the initialization function. Therefore only the __data element of the object pointed to by the argument is of interest. Continuing the example from the initialization function, the finalization function looks like this:

void
gconv_end (struct __gconv_step *data)
{
  free (data->__data);
}
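When the initialization function does not allocate the same resources for every conversion, the finalization function has to free exactly what was allocated. The following is a minimal sketch of such a pair under stated assumptions: struct demo_end_step, struct demo_private, demo_init and demo_end are hypothetical, the needs_table argument stands in for whatever distinguishes the conversions, and the plain 0 and -1 return values are placeholders for the real status codes.

```c
#include <stdlib.h>

/* Hypothetical stand-in for the __data element of the step.  */
struct demo_end_step { void *data; };

struct demo_private
{
  unsigned char *table;     /* allocated only for some conversions */
};

/* Returns 0 on success and -1 if memory is exhausted.  */
static int
demo_init (struct demo_end_step *step, int needs_table)
{
  struct demo_private *p = malloc (sizeof *p);
  if (p == NULL)
    return -1;
  p->table = needs_table ? calloc (256, 1) : NULL;
  if (needs_table && p->table == NULL)
    {
      free (p);
      return -1;
    }
  step->data = p;
  return 0;
}

/* Frees whatever demo_init allocated for this particular step.  */
static void
demo_end (struct demo_end_step *step)
{
  struct demo_private *p = step->data;
  if (p != NULL)
    {
      free (p->table);      /* free (NULL) does nothing */
      free (p);
      step->data = NULL;
    }
}
```

Keeping a null pointer for resources that were never allocated lets one finalization routine serve all conversions of the module.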

The most important function is the conversion function itself, which can get quite complicated for complex character sets. But since this is not of interest here, we will only describe a possible skeleton for the conversion function.

Data type: int (*__gconv_fct) (struct __gconv_step *, struct __gconv_step_data *, const char **, const char *, size_t *, int)
The conversion function can be called for two basic reasons: to convert text or to reset the state. From the description of the iconv function it can be seen why the flushing mode is necessary. Which mode is selected is determined by the sixth argument, an integer. This argument being nonzero means that flushing is selected.

Common to both modes is where the output buffer can be found. The information about this buffer is stored in the conversion step data. A pointer to this information is passed as the second argument to this function. The description of the struct __gconv_step_data structure has more information on the conversion step data.

What has to be done for flushing depends on the source character set. If the source character set is not stateful, nothing has to be done. Otherwise the function has to emit a byte sequence to bring the state object into the initial state. Once this has happened, the other conversion modules in the chain of conversions have to get the same chance. Whether another step follows can be determined from the __is_last element of the step data structure to which the second parameter points.
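As an illustration of the flushing work, here is a minimal sketch for a module that writes ISO-2022-JP, where the sequence ESC ( B switches back to ASCII, the initial state. The function demo_flush, the two-value toy state and the DEMO_ status codes are assumptions made for this sketch; a real module keeps its state in the object referenced by __statep and returns __GCONV_ values.

```c
#include <stddef.h>
#include <string.h>

enum { DEMO_OK = 0, DEMO_FULL_OUTPUT = 1 };   /* stand-in status codes */
enum { DEMO_ASCII = 0, DEMO_JISX0208 = 1 };   /* toy shift states */

/* Emit the bytes that return to the initial state, if any are needed,
   and advance *OUTBUF past them.  */
static int
demo_flush (int *state, unsigned char **outbuf, unsigned char *outend)
{
  static const unsigned char to_ascii[3] = { 0x1b, '(', 'B' };

  if (*state == DEMO_ASCII)
    return DEMO_OK;                     /* already in the initial state */
  if ((size_t) (outend - *outbuf) < sizeof to_ascii)
    return DEMO_FULL_OUTPUT;            /* no room for the sequence */

  memcpy (*outbuf, to_ascii, sizeof to_ascii);
  *outbuf += sizeof to_ascii;
  *state = DEMO_ASCII;
  return DEMO_OK;
}
```

Only after this succeeds would the function pass the flush request on to the next step in the chain.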

The more interesting mode is when actual text has to be converted. The first step in this case is to convert as much text as possible from the input buffer and store the result in the output buffer. The start of the input buffer is determined by the third argument, which is a pointer to a pointer variable referencing the beginning of the buffer. The fourth argument is a pointer to the byte right after the last byte in the buffer.

The conversion has to be performed according to the current state if the character set is stateful. The state is stored in an object pointed to by the __statep element of the step data (second argument). Once either the input buffer is empty or the output buffer is full the conversion stops. At this point, the pointer variable referenced by the third parameter must point to the byte following the last processed byte (i.e., if all of the input is consumed, this pointer and the fourth parameter have the same value).

What now happens depends on whether this step is the last one. If it is the last step, the only thing that has to be done is to update the __outbuf element of the step data structure to point after the last written byte. This update gives the caller the information on how much text is available in the output buffer. In addition, the variable pointed to by the fifth parameter, which is of type size_t, must be incremented by the number of characters (not bytes) that were converted in a non-reversible way. Then, the function can return.
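The bookkeeping for a last step can be sketched as follows. This is not code from the library: demo_to_latin1, struct demo_out and the DEMO_ codes are hypothetical, the input is assumed to be four-byte characters in host byte order, and replacing unrepresentable characters with '?' is an assumption chosen only to have something to count as non-reversible.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { DEMO_EMPTY_INPUT, DEMO_FULL_OUTPUT, DEMO_INCOMPLETE_INPUT };

/* Hypothetical stand-in for the output part of the step data.  */
struct demo_out { unsigned char *outbuf, *outbufend; };

static int
demo_to_latin1 (struct demo_out *data, const char **inbuf,
                const char *inbufend, size_t *written)
{
  const char *in = *inbuf;
  unsigned char *out = data->outbuf;
  int status = DEMO_EMPTY_INPUT;

  while (in != inbufend)
    {
      uint32_t ch;

      if ((size_t) (inbufend - in) < sizeof ch)
        {
          status = DEMO_INCOMPLETE_INPUT;
          break;
        }
      if (out == data->outbufend)
        {
          status = DEMO_FULL_OUTPUT;
          break;
        }

      memcpy (&ch, in, sizeof ch);
      if (ch > 0xff)
        {
          ch = '?';             /* assumed substitution policy */
          ++*written;           /* count the non-reversible conversion */
        }
      *out++ = (unsigned char) ch;
      in += sizeof ch;
    }

  *inbuf = in;                  /* byte following the last processed one */
  data->outbuf = out;           /* tells the caller how much was written */
  return status;
}
```

When the output buffer fills up, *inbuf is left at the first unconverted character, so a later call with a fresh buffer continues exactly where this one stopped.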

In case the step is not the last one, the later conversion functions have to get a chance to do their work. Therefore, the appropriate conversion function has to be called. The information about the functions is stored in the conversion data structures, passed as the first parameter. This information and the step data are stored in arrays, so the next element in both cases can be found by simple pointer arithmetic:

int
gconv (struct __gconv_step *step, struct __gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
{
  struct __gconv_step *next_step = step + 1;
  struct __gconv_step_data *next_data = data + 1;
  ...

The next_step pointer references the next step information and next_data the next data record. The call of the next function therefore will look similar to this:

  next_step->__fct (next_step, next_data, &outerr, outbuf,
                    written, 0)

But this is not yet all. Once the function call returns, the conversion function might have some more to do. If the return value of the function is __GCONV_EMPTY_INPUT, more room is available in the output buffer. Unless the input buffer is empty, the conversion functions start all over again and process the rest of the input buffer. If the return value is not __GCONV_EMPTY_INPUT, something went wrong and we have to recover from this.

A requirement for the conversion function is that the input buffer pointer (the third argument) always point to the last character that was put in converted form into the output buffer. This is trivially true after the conversion performed in the current step, but if the conversion functions deeper downstream stop prematurely, not all characters from the output buffer are consumed and, therefore, the input buffer pointers must be backed off to the right position.

Correcting the input buffers is easy to do if the input and output character sets have a fixed width for all characters. In this situation we can compute how many characters are left in the output buffer and, therefore, can correct the input buffer pointer appropriately with a similar computation. Things get tricky if either character set has characters represented with variable-length byte sequences, and it gets even more complicated if the conversion has to take care of the state. In these cases the conversion has to be performed once again, from the known state before the initial conversion (i.e., if necessary the state of the conversion has to be reset and the conversion loop has to be executed again). The difference now is that it is known how much output must be created, and the conversion can stop before converting the first unused character. Once this is done the input buffer pointers must be updated again and the function can return.
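For the fixed-width case the correction is a short computation. The sketch below uses a hypothetical helper, demo_back_off, with the widths passed in as parameters; it assumes the next step left the bytes from outerr up to outbuf unconsumed.

```c
#include <stddef.h>

/* Move *INBUF back by the number of characters that the next step
   did not consume.  Valid only if both character sets are fixed
   width: FROM_WIDTH bytes per input character and TO_WIDTH bytes
   per output character.  */
static void
demo_back_off (const char **inbuf, const char *outerr, const char *outbuf,
               size_t from_width, size_t to_width)
{
  size_t chars_left = (size_t) (outbuf - outerr) / to_width;
  *inbuf -= chars_left * from_width;
}
```

For example, if 10 one-byte characters were converted into 40 bytes and the next step consumed only 28 of them, 12 / 4 = 3 characters remain, so the input pointer moves back by 3 bytes.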

One final thing should be mentioned. If it is necessary for the conversion to know whether it is the first invocation (in case a prolog has to be emitted), the conversion function should increment the __invocation_counter element of the step data structure just before returning to the caller. See the description of the struct __gconv_step_data structure above for more information on how this can be used.
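A minimal sketch of this use of the counter follows. It assumes a conversion whose prolog is a two-byte big-endian byte order mark; struct demo_counted, demo_begin_call and demo_finish_call are hypothetical names that mirror only the elements needed here.

```c
#include <stddef.h>

/* Hypothetical stand-in for the relevant parts of the step data.  */
struct demo_counted
{
  unsigned char *outbuf, *outbufend;
  int invocation_counter;
};

/* Emit the prolog on the first invocation only.  Returns 0 on
   success and -1 if the output buffer has no room for it.  */
static int
demo_begin_call (struct demo_counted *data)
{
  if (data->invocation_counter == 0)
    {
      if (data->outbufend - data->outbuf < 2)
        return -1;
      *data->outbuf++ = 0xfe;   /* byte order mark, big endian */
      *data->outbuf++ = 0xff;
    }
  return 0;
}

/* Called just before the conversion function returns.  */
static void
demo_finish_call (struct demo_counted *data)
{
  ++data->invocation_counter;
}
```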

The return value must be one of the following values:

__GCONV_EMPTY_INPUT
All input was consumed and there is room left in the output buffer.

__GCONV_FULL_OUTPUT
No more room in the output buffer. In case this is not the last step, this value is propagated down from the call of the next conversion function in the chain.

__GCONV_INCOMPLETE_INPUT
The input buffer is not entirely empty since it contains an incomplete character sequence.
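The last of these cases arises with variable-width input when a buffer ends in the middle of a character. As a sketch only, using UTF-8 as a familiar example of such an encoding, the hypothetical helpers below detect that situation from the lead byte; they do not validate the continuation bytes.

```c
#include <stddef.h>

/* Length of a UTF-8 sequence as announced by its lead byte, or 0 if
   the byte cannot start a sequence.  */
static size_t
demo_utf8_length (unsigned char lead)
{
  if (lead < 0x80)
    return 1;
  if ((lead & 0xe0) == 0xc0)
    return 2;
  if ((lead & 0xf0) == 0xe0)
    return 3;
  if ((lead & 0xf8) == 0xf0)
    return 4;
  return 0;
}

/* Nonzero if the bytes from IN up to INEND start a character whose
   remaining bytes have not arrived yet.  */
static int
demo_is_incomplete (const char *in, const char *inend)
{
  size_t need;

  if (in == inend)
    return 0;
  need = demo_utf8_length ((unsigned char) *in);
  return need != 0 && (size_t) (inend - in) < need;
}
```

A conversion function that finds such a tail would stop before it and report the incomplete-input status, leaving the bytes for the caller to supply again together with the rest of the character.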

The following example provides a framework for a conversion function. If a new conversion has to be written, only the holes in this implementation have to be filled.

int
gconv (struct __gconv_step *step, struct __gconv_step_data *data,
       const char **inbuf, const char *inbufend, size_t *written,
       int do_flush)
{
  struct __gconv_step *next_step = step + 1;
  struct __gconv_step_data *next_data = data + 1;
  __gconv_fct fct = next_step->__fct;
  int status;

  /* If the function is called with no input this means we have
     to reset to the initial state.  The possibly partly
     converted input is dropped.  */
  if (do_flush)
    {
      status = __GCONV_OK;

      /* Possibly emit a byte sequence which puts the state object
         into the initial state.  */

      /* Call the steps down the chain if there are any but only
         if we successfully emitted the escape sequence.  */
      if (status == __GCONV_OK && ! data->__is_last)
        status = fct (next_step, next_data, NULL, NULL,
                      written, 1);
    }
  else
    {
      /* We preserve the initial values of the pointer variables.  */
      const char *inptr = *inbuf;
      char *outbuf = data->__outbuf;
      char *outend = data->__outbufend;
      char *outptr;

      do
        {
          /* Remember the start value for this round.  */
          inptr = *inbuf;
          /* The outbuf buffer is empty.  */
          outptr = outbuf;

          /* For stateful encodings the state must be saved here.  */

          /* Run the conversion loop.  status is set
             appropriately afterwards.  */

          /* If this is the last step, leave the loop.  There is
             nothing we can do.  */
          if (data->__is_last)
            {
              /* Store information about how many bytes are
                 available.  */
              data->__outbuf = outbuf;

              /* If any non-reversible conversions were performed,
                 add the number to *written.  */

              break;
            }

          /* Write out all output that was produced.  */
          if (outbuf > outptr)
            {
              const char *outerr = data->__outbuf;
              int result;

              result = fct (next_step, next_data, &outerr,
                            outbuf, written, 0);

              if (result != __GCONV_EMPTY_INPUT)
                {
                  if (outerr != outbuf)
                    {
                      /* Reset the input buffer pointer.  We
                         document here the complex case.  */
                      size_t nstatus;

                      /* Reload the pointers.  */
                      *inbuf = inptr;
                      outbuf = outptr;

                      /* Possibly reset the state.  */

                      /* Redo the conversion, but this time
                         the end of the output buffer is at
                         outerr.  */
                    }

                  /* Change the status.  */
                  status = result;
                }
              else
                {
                  /* All the output is consumed, we can make
                     another run if everything was ok.  */
                  if (status == __GCONV_FULL_OUTPUT)
                    status = __GCONV_OK;
                }
            }
        }
      while (status == __GCONV_OK);

      /* We finished one use of this step.  */
      ++data->__invocation_counter;
    }

  return status;
}

This information should be sufficient to write new modules. Anybody doing so should also take a look at the available source code in the GNU C library sources. It contains many examples of working and optimized modules.

  Copyright 2003   by The Free Software Foundation     Updated Jun 2003