Variable Length Arrays


5.2.1 Variable Length Arrays

In some tables, the number of elements in the array contained by a particular field may vary from row to row. One approach would be to set the repeat count equal to the largest number possible and use fill values for the remaining elements in the field when the number of actual values is smaller. However, if the variations from row to row are large, the resulting binary table will contain a large number of fill values, an inefficient use of storage. The binary table rules contain provisions that support a convention for efficient storage of variable length arrays, although the convention itself is not part of the formal proposal.

The basic concept is that the data for variable length arrays are not stored physically in the main table but in a heap area that follows the table. This heap area takes up part of or all of the space following the main table reserved by the PCOUNT keyword in the header. At the entry in the table where the array would be located, there is instead an array descriptor or pointer, which tells the user or software (1) where the data to be logically included as that entry of the table are to be found in the heap and (2) the size of the data.

The data identified in the heap are treated as if actually located in that entry of the table instead of in the heap. Since an array descriptor has a single, standard size, all rows in the main table will have the same length. This structure allows software that is unable to handle variable-length arrays to read the rest of the main table and then use the value of PCOUNT in the header to move on to the next extension when it has finished reading the fixed-length entries in the table. Variable length arrays may be of any of the data types used in binary tables except array descriptor and may be multidimensional.

The value of the TFORMn keyword in the header, which has the form 'rPt(maxelem)', identifies the data type (t) of the array to be logically included in field n and specifies the largest number of elements the array can have (maxelem) in any row of the table. The maxelem value, which the wording in CTP includes as an essential component of the keyword value, allows software to read the table directly into a data base that supports only fixed size data arrays without first having to read the table solely to determine the size of the arrays. Without it, the task of the reader is more complicated. The P tells the reader that the actual contents of the field in the main table will not be the data array but the descriptor. For example, the card image

TFORM8 = 'PB(1800)' / Variable length byte array

signifies that field 8 of the table should be regarded as logically containing a byte array of up to 1800 elements (here bytes). The actual physical contents of field 8 are two 32-bit integers. The first 32-bit integer is the number of heap elements to be considered a part of the current row. This value may differ from row to row, but, in the illustration here, cannot exceed 1800. A row in the table data may have a value of zero in this field, thus providing for fields where data are not present in every row. The second 32-bit integer is the offset of the start of the byte array from the beginning of the heap, counted in number of bytes. An array starting at the beginning of the heap has an offset of zero. The remaining elements of the array follow in sequence in the heap; the contents of a single field of a single row cannot be broken up in the heap. However, different fields in the same row need not appear in the heap in the order in which they appear in the table. Gaps in the heap, with no data, are possible. If two or more table fields have identical contents, the descriptors for all of them may point to the same area of the heap.
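The decoding described above can be sketched in Python. This is only an illustration, not part of the convention: the function names parse_varlen_tform and read_varlen_array are hypothetical, the heap is assumed to be already in memory as a bytes object, the descriptor is assumed to be stored as two big-endian signed 32-bit integers, and the element sizes are assumed from the usual binary table type codes (bit arrays are left out).

```python
import re
import struct

# Bytes per element for some binary table type codes (an assumption of this
# sketch; bit arrays, code 'X', are deliberately not handled here).
ELEMENT_SIZE = {"L": 1, "B": 1, "A": 1, "I": 2, "J": 4,
                "E": 4, "D": 8, "C": 8, "M": 16}


def parse_varlen_tform(tform):
    """Split a value of the form 'rPt(maxelem)' into (r, t, maxelem)."""
    match = re.fullmatch(r"\s*(\d*)P([A-Z])\((\d+)\)\s*", tform)
    if match is None:
        raise ValueError("not a variable length array format: %r" % tform)
    repeat = int(match.group(1)) if match.group(1) else 1
    return repeat, match.group(2), int(match.group(3))


def read_varlen_array(field_bytes, heap, dtype, maxelem):
    """Return the raw heap bytes that an 8-byte array descriptor points to."""
    # First integer: number of elements; second: byte offset into the heap.
    count, offset = struct.unpack(">ii", field_bytes)
    if count < 0 or count > maxelem:
        raise ValueError("element count %d outside 0..%d" % (count, maxelem))
    nbytes = count * ELEMENT_SIZE[dtype]
    if offset < 0 or offset + nbytes > len(heap):
        raise ValueError("descriptor points outside the heap")
    return heap[offset:offset + nbytes]


# Usage with made-up data: a 16-byte heap and a descriptor for 4 bytes
# starting 10 bytes into the heap.
repeat, dtype, maxelem = parse_varlen_tform("PB(1800)")
heap = bytes(range(16))
descriptor = struct.pack(">ii", 4, 10)
data = read_varlen_array(descriptor, heap, dtype, maxelem)
```

A descriptor with a count of zero yields an empty result, which corresponds to a row with no data in that field.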

The heap need not begin at the end of the main table data. The user may choose to provide a gap between the main table and the beginning of the heap. The size of this gap can be found using the value of the THEAP keyword. If there is no gap, the value of THEAP is NAXIS1 × NAXIS2. If there is, the size of the gap is the difference between the value of THEAP and NAXIS1 × NAXIS2.

As an example, suppose the main table consists of five 168-byte rows, 840 bytes in all, but the user wishes to have the heap begin on a new 2880-byte record. (This construct is provided for purposes of this example; it is not necessarily recommended usage.) The value of NAXIS1 is 168, representing the length in bytes of a row in the table, and the value of NAXIS2 is 5, signifying 5 rows. There is then a gap of 2880 - 840 = 2040 bytes between the main table and the heap. The value of THEAP, representing the number of bytes between the start of the main table data and the start of the heap, is 2880. Suppose further that the heap itself has a length of 5760 bytes. The value of PCOUNT is equal to the size of the gap area (2040) plus the size of the heap (5760), or 7800.
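The arithmetic of this example can be checked with a few lines of Python. The function name heap_layout and the dictionary keys are hypothetical, chosen only for this sketch; the numbers are those of the example above.

```python
def heap_layout(naxis1, naxis2, theap, heap_size):
    """Derive the gap size and PCOUNT from the table and heap dimensions."""
    table_size = naxis1 * naxis2      # bytes in the main table
    gap = theap - table_size          # bytes between main table and heap
    if gap < 0:
        raise ValueError("THEAP places the heap inside the main table")
    pcount = gap + heap_size          # everything that follows the main table
    return {"table_size": table_size, "gap": gap, "pcount": pcount}


layout = heap_layout(naxis1=168, naxis2=5, theap=2880, heap_size=5760)
print(layout)  # {'table_size': 840, 'gap': 2040, 'pcount': 7800}
```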

The storage location to which the array descriptor points must be located entirely within the heap. In particular, it may not be in the main table.
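A reader might enforce this rule before following a descriptor. The sketch below is one possible check, not a prescribed one: the function name heap_byte_range is hypothetical, positions are counted in bytes from the start of the main table data, and the heap is assumed to extend from THEAP to the end of the area reserved by PCOUNT.

```python
def heap_byte_range(offset, nbytes, naxis1, naxis2, theap, pcount):
    """Return (start, end) of the array data, counted from the start of the
    main table data, or raise if the data would lie outside the heap."""
    table_size = naxis1 * naxis2
    area_end = table_size + pcount    # end of the space reserved by PCOUNT
    start = theap + offset
    end = start + nbytes
    if offset < 0 or theap < table_size or end > area_end:
        raise ValueError("array descriptor does not point entirely into the heap")
    return start, end


# Using the values of the example above: a 100-byte array at heap offset 0.
span = heap_byte_range(offset=0, nbytes=100, naxis1=168, naxis2=5,
                       theap=2880, pcount=7800)
print(span)  # (2880, 2980)
```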

Reading a variable-length array on a random access storage medium, such as disk, is straightforward. When an array descriptor is encountered, the information in it can be used immediately to find the appropriate data in the heap and read them; the program can then move to the next table entry. On sequential access storage media, such as tape, the process is more complicated. As the main table is read, a table of the array descriptors must be generated and stored. Space is reserved in the output data set to hold the actual arrays, based upon the specifications of the array descriptors. When the main table has been read, the array descriptors are first sorted in order of appearance in the heap, and then the data are read from the heap in order and stored as specified by the array descriptors. If more than one field has the same stored descriptor, the same heap data may be stored in more than one place.
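The sequential procedure can be sketched as follows. This is an illustration under stated assumptions rather than a reference implementation: the function name read_heap_sequentially is hypothetical, each descriptor is assumed to have been saved during the pass over the main table as a (row, field, count, offset) tuple, and the heap is represented by any object with a forward-only read method. Bytes already read are kept in a small buffer so that descriptors sharing or overlapping a heap area can be served without moving backward.

```python
import io


def read_heap_sequentially(descriptors, heap_stream, elem_size=1):
    """Fill in array data for saved descriptors using forward-only reads."""
    results = {}
    position = 0       # bytes of the heap consumed so far
    buf_start = 0      # heap offset of the first byte held in buf
    buf = b""          # bytes of the region currently being served
    # Sort the descriptors in order of appearance in the heap.
    for row, field, count, offset in sorted(descriptors, key=lambda d: d[3]):
        if count == 0:
            results[(row, field)] = b""   # no data present for this row
            continue
        end = offset + count * elem_size
        if offset >= position:
            # Skip any gap, then start a new buffered region at this offset.
            heap_stream.read(offset - position)
            position, buf_start, buf = offset, offset, b""
        if end > position:
            chunk = heap_stream.read(end - position)
            if len(chunk) != end - position:
                raise ValueError("heap ended before the array was complete")
            buf += chunk
            position = end
        results[(row, field)] = buf[offset - buf_start:end - buf_start]
    return results


# Usage with made-up data: rows 0 and 2 share the same heap area.
heap = io.BytesIO(bytes(range(20)))
saved = [(0, "a", 3, 10), (1, "a", 2, 0), (2, "a", 3, 10),
         (3, "a", 0, 0), (4, "a", 4, 11)]
arrays = read_heap_sequentially(saved, heap)
```

Keeping the buffer only for the region currently being served bounds memory use when heap areas do not overlap; long chains of overlapping arrays would make the buffer grow, which a production reader might want to limit.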

Because variable-length arrays are more complicated to deal with than ordinary fixed-length arrays and require an extra data access per array to obtain all the data for a field, they should be used only when absolutely necessary.

As this convention is not part of the formal rules for binary tables, not all binary tables software will necessarily be able to read variable-length arrays, except for using the value of PCOUNT to skip the heap. In addition, before generally distributing a file with variable length arrays, be sure, by means of practical tests, that the file can be read on a machine other than the one used to write it.
