The National Archives NDAD
 


Dataset formats

 

Data is received from government departments in a wide variety of formats. The goal of the data transformation process is to produce data in one of a number of standard formats suitable for browsing access by users of the system and for generating copies of data for use by researchers in current computing systems. Clearly it is essential that the transformation process preserve the content and the intellectual ordering (as distinct from physical ordering) of the original dataset as far as possible.

The transformation process aims to preserve the structural layout of the original data. For instance, if the data is already in the tabular form typical of relational databases, this will be retained; in general, however, data not already in true third normal form will not be manipulated into that form. Changes to the format and extent of the data may be made in order to meet anonymisation requirements. In some cases, files or fields will be closed; in others, data will be anonymised by summarising it. Errors in the data will not be corrected (following the archival principle of faithfully preserving the information received) but are documented in the finding aids.

Data formats

Data is held in two main types of format:

(i) CSV (Comma Separated Values)

This is used when the data is entirely textual (i.e. the characters and numbers are all held as text, with no binary-encoded data). Commas separate fields within a record, and an end-of-line separates records. Any field which contains a comma, a double quote character, or an end-of-line character is itself enclosed in double quotes, and embedded double quotes are doubled.
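The quoting rule above can be sketched with Python's standard csv module, whose QUOTE_MINIMAL mode quotes only the fields that need it and doubles embedded double quotes. This is an illustrative sketch, not the archive's own tooling; the sample records and the function names write_csv and read_csv are hypothetical.

```python
import csv
import io

# Hypothetical sample records; the last field of each row exercises one quoting rule.
records = [
    ["1", "Smith", "plain text"],
    ["2", "Jones", "contains, a comma"],
    ["3", "Brown", 'contains "quotes"'],
    ["4", "Green", "line one\nline two"],
]


def write_csv(rows):
    """Serialise rows, quoting only fields that contain a comma, quote or newline."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()


def read_csv(text):
    """Parse CSV text back into a list of rows (lists of strings)."""
    return list(csv.reader(io.StringIO(text, newline="")))


if __name__ == "__main__":
    text = write_csv(records)
    print(text)
    # Reading the text back should recover the original fields unchanged.
    print(read_csv(text) == records)
```

A field without special characters is written bare, while the comma, quote and newline cases are wrapped in double quotes, so the round trip preserves the content of every field.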

Such data could also be held in fixed-length character format but CSV is the preferred structure.

(ii) Binary/mixed

This is used where the data is entirely binary-encoded or is mixed character/binary. Every field is either a fixed-width character field or a numeric or date field that can be represented in a fixed-size cell. The length of each field is given in the Finding Aids field lists.

Integer data is stored in a twos-complement integer field of 8, 16, 32 or 64 bits. Unsigned integer data in the original is converted to a signed positive integer using a larger storage width if necessary. Floating-point data is stored in IEEE format using 64-bit (double precision) unless the original storage format used IEEE single precision (32-bit) or an even less precise storage form in which case IEEE single precision is used.
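As a rough illustration of reading such fixed-size fields, the sketch below uses Python's struct module. The record layout (a 4-character code, a 16-bit integer, a 64-bit integer and a 64-bit double), the little-endian byte order, and the Latin-1 decoding of character fields are all assumptions made for the example; the actual layout of a dataset comes from its field lists. The helper signed_width_for is likewise hypothetical and shows one reading of "a larger storage width if necessary".

```python
import struct

# Assumed layout: 4-byte character field, int16, int64, IEEE double.
# "<" selects little-endian with no padding; byte order is an assumption.
RECORD = struct.Struct("<4shqd")


def decode_record(raw):
    """Unpack one fixed-size record into Python values."""
    code, count, total, rate = RECORD.unpack(raw)
    # Character fields are assumed here to be single-byte (Latin-1) text.
    return code.decode("latin-1"), count, total, rate


def signed_width_for(max_value):
    """Smallest signed width (8, 16, 32 or 64 bits) that holds max_value as a positive integer."""
    for bits in (8, 16, 32, 64):
        if max_value <= 2 ** (bits - 1) - 1:
            return bits
    raise ValueError("value does not fit in a signed 64-bit field")


if __name__ == "__main__":
    raw = RECORD.pack(b"AB12", -5, 2 ** 40, 0.25)
    print(RECORD.size, decode_record(raw))
    # An unsigned 8-bit field that reaches 255 needs a 16-bit signed field.
    print(signed_width_for(255))
```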

Notes on the transformation process

Where character data is not already in ISO 10046 (extended ASCII, 8-bit ASCII), data is converted to that character set. Where data is to be stored in character format, floating-point numbers are stored without leading or trailing zeroes (except for a possible single leading 0 before a decimal point) or in scientific notation (n.nnnnEddd) using sufficient digits to correctly represent the precision of the original number. Encoded numbers (such as those in BCD) are converted to fixed-width character fields containing leading blanks, a correctly-positioned decimal point and trailing zeroes.
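The conversion of encoded numbers can be sketched as follows. The example assumes unsigned packed BCD (two decimal digits per byte, high nibble first, no sign nibble), which is only one of several BCD variants; the function name bcd_to_text and its scale and width parameters are hypothetical.

```python
def bcd_to_text(raw, scale, width):
    """Convert unsigned packed BCD bytes to a fixed-width character field.

    raw   -- bytes holding two decimal digits each (assumed packed BCD)
    scale -- number of digits after the decimal point
    width -- width of the output field, padded on the left with blanks
    """
    if any((byte >> 4) > 9 or (byte & 0x0F) > 9 for byte in raw):
        raise ValueError("not a valid packed BCD value")
    digits = "".join(f"{byte >> 4}{byte & 0x0F}" for byte in raw)
    if scale:
        # Make sure there is at least one digit before the decimal point.
        digits = digits.rjust(scale + 1, "0")
        integer_part, fraction = digits[:-scale], digits[-scale:]
    else:
        integer_part, fraction = digits, ""
    # Drop leading zeroes but keep a single 0 if nothing else remains.
    integer_part = integer_part.lstrip("0") or "0"
    text = integer_part + ("." + fraction if scale else "")
    if len(text) > width:
        raise ValueError("value does not fit in the field width")
    # Leading blanks pad the field; trailing zeroes of the fraction are kept.
    return text.rjust(width)


if __name__ == "__main__":
    print(repr(bcd_to_text(bytes([0x00, 0x12, 0x50]), scale=2, width=8)))
    print(repr(bcd_to_text(bytes([0x00, 0x05]), scale=2, width=6)))
```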

Where a date was held in character format, the field is left as a character field if the dates are imprecise (such as those which sometimes contain a full date, and sometimes only a month or year) and/or if two digits were used for the year and the appropriate century could not be determined with certainty. Where the correct century can be determined unambiguously, the century is inserted. Dates which are precise (i.e. always fully specified with day, month and year) are held as 32-bit Julian day numbers.
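A precise date can be mapped to a day number and back as sketched below. The text does not say which day-number convention is used; this example assumes the astronomical Julian Day Number for proleptic Gregorian calendar dates, and the constant and function names are hypothetical.

```python
from datetime import date

# date.toordinal() counts days with 0001-01-01 as day 1; that date has
# Julian Day Number 1721426, hence this offset (assumed convention).
JDN_OFFSET = 1721425


def to_julian_day_number(d):
    """Return the Julian Day Number of a fully specified calendar date."""
    return d.toordinal() + JDN_OFFSET


def from_julian_day_number(jdn):
    """Return the calendar date for a Julian Day Number."""
    return date.fromordinal(jdn - JDN_OFFSET)


if __name__ == "__main__":
    jdn = to_julian_day_number(date(2000, 1, 1))
    print(jdn)
    print(from_julian_day_number(jdn))
    # Values of this size fit comfortably in a signed 32-bit integer.
    print(jdn < 2 ** 31)
```

Under this convention, consecutive dates differ by exactly one, so date arithmetic on the stored values reduces to integer subtraction.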

Handling of BLOBs (binary large objects) may differ in particular situations, and further information will be added here as specific cases arise which require different treatment. In general, BLOBs are extracted into separate files which are then placed inside a single container file per table. The BLOB files are given names which are consecutive integers of fixed width sufficient to name all BLOBs in a table, and extensions appropriate to the file type (e.g. 0000.tif, 0001.tif for a set of image BLOBs numbering no more than 10,000). The BLOB data is converted to a standard form: TIFF for images (unless the original data is JPEG-encoded, in which case it is usually left as is) and 'NeXT .au' format for audio. The field that held the BLOB contains the ID of the BLOB itself.
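The naming scheme can be sketched as below: the width is chosen so that the highest index still fits, and numbering starts at zero as in the 0000.tif example. The function name blob_file_names is hypothetical.

```python
def blob_file_names(count, extension):
    """Return fixed-width, zero-padded file names for `count` BLOBs, numbered from 0."""
    if count <= 0:
        return []
    # Width needed to write the largest index, count - 1.
    width = len(str(count - 1))
    return [f"{index:0{width}d}.{extension}" for index in range(count)]


if __name__ == "__main__":
    names = blob_file_names(10000, "tif")
    print(names[0], names[1], names[-1])
```

With 10,000 image BLOBs the indices run from 0 to 9999, so four digits are enough; one more BLOB would widen every name to five digits.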

Digital documentation formats

Digital documentation is available in at least two formats: plain text and marked-up text. For documentation which originates in a richer format (a Microsoft Word or WordPerfect file, for instance), if the dataset is open, a proprietary form of the document, using the Rich Text Format (RTF), can also be provided. Documentation which originated in PostScript cannot always be converted to plain text reliably; in such cases, the PostScript documentation is stored as-is and also converted to TIFF image form (using Ghostscript).

Applications (for processing the datasets)

The types of software systems you will need to carry out further analysis of the data depend on the volume of data and its format. Small to medium-size datasets held as CSV (Comma Separated Values) or fixed character format can be easily read into and manipulated using desktop database packages such as Microsoft Access or MySQL. Many small, single-table datasets can even be successfully manipulated using spreadsheet packages, such as Microsoft Excel or OpenOffice Calc. Larger and more complex datasets may require more powerful packages, either database products such as Oracle or Ingres, or statistical packages such as SPSS and SAS.

ULCC can offer some advice as to suitable packages and can also seek out and arrange consultations with outside experts for those users who want specialist assistance beyond that which ULCC provides.

 
 

NDAD v3.0

 
 