Data is received from government departments in a wide variety of formats. The goal of the data transformation process is to
produce data in one of a number of standard formats suitable for browsing access by users of the system and for generating copies
of data for use by researchers in current computing systems. Clearly it is essential that the transformation process preserve the
content and the intellectual ordering (as distinct from physical ordering) of the original dataset as far as possible.
The transformation process aims to preserve the structural layout of the original data. For instance, if the data is already in
the tabular form typical of relational databases, this will be retained but, in general, data not already in true third normal
form will not be manipulated into this format. Changes to the format and extent of the data may be made in order to meet anonymisation
requirements. In some cases, files or fields will be closed; in others, data will be anonymised by summarising the data. Errors in
the data will not be corrected (following the principle of archives to faithfully preserve the information given to them) but are
documented in the finding aids.
Data formats
Data is held in two main types of format:
(i) CSV (Comma Separated Values)
This is used when the data is entirely textual (i.e. the characters and numbers are all held as text as opposed to there
being any binary-encoded data). Commas are used to separate fields within a record, and end-of-line to separate records.
Any field which contains a comma, a double quote character, or an end-of-line character, is itself enclosed in double
quotes and embedded double quotes are duplicated.
Such data could also be held in fixed-length character format but CSV is the preferred structure.
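
The quoting rule above can be illustrated with a minimal Python sketch. The function names quote_field and write_record are hypothetical and are not part of any archive software; the sketch simply applies the rule as stated and then checks that Python's standard csv module reads the result back unchanged.

```python
import csv
import io

def quote_field(value: str) -> str:
    """Quote one field: enclose it in double quotes if it contains a comma,
    a double quote or an end-of-line character, doubling embedded quotes."""
    if any(ch in value for ch in (",", '"', "\n", "\r")):
        return '"' + value.replace('"', '""') + '"'
    return value

def write_record(fields) -> str:
    """Join already-textual fields into one CSV record (no trailing newline)."""
    return ",".join(quote_field(f) for f in fields)

record = ["Smith, J", 'said "yes"', "plain"]
line = write_record(record)
print(line)  # "Smith, J","said ""yes""",plain

# Round trip: the standard csv reader recovers the original fields.
parsed = next(csv.reader(io.StringIO(line)))
assert parsed == record
```

Because the rule matches the default dialect of most CSV readers, files written this way should load without special options in common tools.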
(ii) Binary/mixed.
This is used where the data is entirely binary-encoded or mixed character/binary. Every field is either a fixed-width character
field or a numeric or date field which can also be represented in a fixed-size cell. The length of each field is given in
the Finding Aids field lists.
Integer data is stored in a twos-complement integer field of 8, 16, 32 or 64 bits. Unsigned integer data in the original
is converted to a signed positive integer using a larger storage width if necessary. Floating-point data is stored in IEEE
format using 64-bit (double precision) unless the original storage format used IEEE single precision (32-bit) or an even
less precise storage form in which case IEEE single precision is used.
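
As a rough illustration of the storage rules above, the following Python sketch picks a signed width for data that was originally unsigned and packs integers and floating-point values into fixed-size cells. The function names are hypothetical. The byte order (little-endian here) is an assumption, since the text does not state one; so is choosing the width from the largest value actually present in the field.

```python
import struct

SIGNED_WIDTHS = (8, 16, 32, 64)
STRUCT_CODES = {8: "b", 16: "h", 32: "i", 64: "q"}  # signed integer codes

def signed_width(original_bits: int, max_value: int) -> int:
    """Smallest signed width, not narrower than the original field,
    that can hold max_value as a positive twos-complement integer."""
    for width in SIGNED_WIDTHS:
        if width >= original_bits and max_value <= (1 << (width - 1)) - 1:
            return width
    raise ValueError("value does not fit in a 64-bit signed integer")

def pack_int(value: int, width: int, byteorder: str = "<") -> bytes:
    """Pack a signed integer into a cell of the given width (assumed byte order)."""
    return struct.pack(byteorder + STRUCT_CODES[width], value)

def pack_float(value: float, single: bool = False, byteorder: str = "<") -> bytes:
    """Pack as IEEE double precision, or single precision if requested."""
    return struct.pack(byteorder + ("f" if single else "d"), value)

print(signed_width(16, 40000))          # 32: too large for 16-bit signed
print(signed_width(16, 1000))           # 16: no wider field needed
print(len(pack_float(1.5)))             # 8
print(len(pack_float(1.5, single=True)))  # 4
```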
Notes on the transformation process
Where character data is not already in ISO 10046 (extended ASCII, 8-bit ASCII), data is converted to that character set. Where
data is to be stored in character format, floating-point numbers are stored without leading or trailing zeroes (except for a
possible single leading 0 before a decimal point) or in scientific notation (n.nnnnEddd) using sufficient digits to correctly
represent the precision of the original number. Encoded numbers (such as those in BCD) are converted to fixed-width character
fields containing leading blanks, a correctly-positioned decimal point and trailing zeroes.
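
The conversion of encoded numbers can be sketched as follows. This is only an illustration under stated assumptions: the input is taken to be unsigned packed BCD (two decimal digits per byte), and the number of decimal places (scale) and the output width are assumed to be known from the field description. Real source formats may carry a sign nibble or use a different layout, which this sketch does not handle. The function name bcd_to_text is hypothetical.

```python
def bcd_to_text(raw: bytes, scale: int, width: int) -> str:
    """Convert unsigned packed BCD to a fixed-width character field with
    leading blanks, a decimal point and trailing zeroes.

    scale and width are assumed to come from the field description."""
    if not raw:
        raise ValueError("empty BCD field")
    digits = []
    for byte in raw:
        high, low = byte >> 4, byte & 0x0F
        if high > 9 or low > 9:
            raise ValueError(f"not a BCD digit pair: {byte:#04x}")
        digits.append(f"{high}{low}")
    # Guarantee at least one digit before the decimal point.
    all_digits = "".join(digits).zfill(scale + 1)
    if scale:
        whole, frac = all_digits[:-scale], all_digits[-scale:]
    else:
        whole, frac = all_digits, ""
    whole = whole.lstrip("0") or "0"   # keep a single leading 0 if needed
    text = whole + ("." + frac if frac else "")
    if len(text) > width:
        raise ValueError("field width too small for value")
    return text.rjust(width)           # pad on the left with blanks

print(repr(bcd_to_text(b"\x01\x23\x45", scale=2, width=10)))  # '    123.45'
print(repr(bcd_to_text(b"\x12\x00", scale=2, width=8)))       # '   12.00'
```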
Where a date was held in character format, if the dates are imprecise (such as those which sometimes contain a full date, and
sometimes only a month or year) and/or if two digits were used for the year and the appropriate century could not be determined with
certainty, the field is left as a character field. Where the correct century can be determined unambiguously, the century is
inserted. Dates which are precise (i.e. always fully specified with day, month and year) are held as 32-bit Julian day numbers.
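
For readers converting these values back to calendar dates, a minimal Python sketch follows. It assumes that "Julian day number" means the standard astronomical Julian Day Number counted in the proleptic Gregorian calendar; the text does not state the exact epoch used, so this should be checked against the finding aids for a given dataset. The function names are hypothetical.

```python
from datetime import date

# Difference between Python's date ordinal (0001-01-01 is day 1) and the
# astronomical Julian Day Number. Assumes the standard JDN epoch.
JDN_OFFSET = 1721425

def to_julian_day(d: date) -> int:
    """Calendar date to Julian Day Number."""
    return d.toordinal() + JDN_OFFSET

def from_julian_day(jdn: int) -> date:
    """Julian Day Number back to a calendar date."""
    return date.fromordinal(jdn - JDN_OFFSET)

jdn = to_julian_day(date(2000, 1, 1))
print(jdn)                   # 2451545
print(from_julian_day(jdn))  # 2000-01-01
```

Values of this size fit comfortably in a signed 32-bit integer, which is consistent with the storage width given above.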
Handling of BLOBs (binary large objects) may differ in particular situations (and further information will be added here as
specific cases arise which require different treatment). In general, BLOBs are extracted into separate files which are then placed
inside a single container file per table. The BLOB files are given names which are consecutive integers of fixed width sufficient
to name all BLOBs in a table, and extensions appropriate to the file type (e.g. 0000.tif, 0001.tif for a set of image BLOBs
numbering no more than 10,000). The BLOB data is converted to a standard form: TIFF for images (unless the original data is
JPEG-encoded, in which case it is usually left as is) and the NeXT/Sun '.au' format for audio. The corresponding field in the
table then contains the ID of the BLOB itself.
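
The naming scheme can be sketched in Python as below. The function name blob_names is hypothetical, and numbering from zero is inferred from the 0000.tif example above rather than stated as a rule.

```python
def blob_names(count: int, extension: str) -> list:
    """Return consecutive fixed-width integer file names for `count` BLOBs.

    The width is just large enough to name the highest-numbered BLOB,
    assuming numbering starts at zero."""
    if count <= 0:
        return []
    width = len(str(count - 1))
    return [f"{i:0{width}d}.{extension}" for i in range(count)]

names = blob_names(10000, "tif")
print(names[0], names[1], names[-1])  # 0000.tif 0001.tif 9999.tif
```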
Digital documentation formats
Digital documentation is available in at least two formats: plain text and marked-up text. For documentation which originates in
a richer format (a Microsoft Word or WordPerfect file, for instance), if the dataset is open, a proprietary form of the document,
using the Rich Text Format (RTF), can also be provided. With documentation which originated in PostScript, it may not always be
possible to convert this to plain text in a reliable fashion; in such cases, the PostScript documentation is stored as-is and also
converted to TIFF image form (using Ghostscript).
Applications (for processing the datasets)
The types of software systems you will need to carry out further analysis of the data depend on the volume of data and its format.
Small to medium-size datasets held as CSV (comma-separated values) or fixed character format can be easily read into and
manipulated using desktop database packages such as Microsoft Access or MySQL.
Many small, single-table datasets can even be successfully manipulated using spreadsheet packages, such as Microsoft Excel or
OpenOffice Calc. Larger and more complex datasets may require more
powerful packages, either database products such as Oracle or Ingres, or statistical packages such as SPSS and SAS.
ULCC can offer some advice as to suitable packages and can also seek out and arrange consultations with outside experts for those
users who desire specialist assistance over and above that which ULCC provides.