Long-term archiving (LTA) refers to the collection and long-term storage of information and the preservation of its permanent availability. The long-term archiving of digitally available information (digital preservation) raises certain new problems. For the preservation of digital resources, “long-term” does not mean issuing a guarantee for five or fifty years, but the responsible development of strategies that can cope with the constant change driven by the information market.
A generally valid definition of the term does not yet exist. Since many of the problems of digital long-term archiving, such as large version jumps in the software used, only appear after about ten years, this value serves as a threshold in considerations of long-term archiving. Long-term archiving must also be distinguished from mere data backup.
While physical objects have long been stored and preserved in archives, museums and libraries, among others, electronic publications pose completely new problems. If data is stored in analogue form, the data quality deteriorates as the medium degrades, which is why the focus there is on preserving the medium. Digitally stored data, on the other hand, can be reconstructed after small errors in the medium by means of suitable redundant (error-correcting) encoding, which ensures constant data quality despite the deterioration of the medium. If the errors in the medium become too large, the data can no longer be completely reconstructed and is irretrievably lost (“digital forgetting”). The focus of the long-term archiving of digital data is therefore no longer on preserving the medium, but on copying the data in good time, before it is lost. Since the media (e.g. magnetic tape and DVD), formats and readers/writers for digital storage change rapidly over time, regular testing and continuity across these changes require constant attention and long-term planning. When transferring to new systems, proprietary formats and copyright restrictions, among other things, cause problems.
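The reconstruction of digital data despite small media errors rests on redundant encoding. A toy sketch in Python illustrates the principle with a triple-redundancy code; real storage media use far more efficient error-correcting codes, so this is only a minimal demonstration of the idea:

```python
# Toy illustration (not a production scheme): every bit is written to
# the medium three times, and a single flipped copy is outvoted during
# decoding, so a small media error does not corrupt the recovered data.

def encode(bits):
    """Store each bit as three identical copies."""
    return [b for bit in bits for b in (bit, bit, bit)]

def decode(stored):
    """Recover each bit by majority vote over its three copies."""
    return [int(sum(stored[i:i + 3]) >= 2) for i in range(0, len(stored), 3)]

data = [1, 0, 1, 1]
stored = encode(data)
stored[4] ^= 1                    # simulate a single bit error on the medium
assert decode(stored) == data     # the original data is still recovered
```

Only when errors accumulate beyond what the code can outvote, e.g. two of the three copies of a bit, does the “digital forgetting” described above set in.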
While old parchment and paper, for example, can last many hundreds of years if stored well, this does not apply to newer storage media. Most publications from the first half of the 20th century are printed on paper that decomposes through acid corrosion. Other problems arise with older printed works and manuscripts: if an iron compound was used in producing the ink, ink corrosion can set in when the ink components are unbalanced, that is, when there is an excess of acid or an excess of iron in the ink. The cellulose is attacked in a similar way to acid corrosion, and under varying and changing moisture levels the paper can break along the lines of writing.
Even analogue films, photos and magnetic tapes have only a limited durability. The service life of digital storage media such as floppy disks, hard disks and burned CDs/DVDs is shorter still. Digital data carriers lose their media-specifically structured data in several ways: environmental influences can destroy the data directly (for example, sufficiently strong magnetic fields near floppy disks and magnetic tapes), or chemical or physical influences change the data structure so much that no data can be stored in it, or data already written can no longer be read (for example, sufficiently long exposure of CD-ROMs to UV radiation). Often readability fails simply because, at a later time, suitable readers and programs are no longer available, because older data formatting standards can no longer be interpreted, or because the technical interfaces of very old data readers are no longer supported. To avoid these problems, it may be useful to convert selected electronically stored data back into non-electronic form and engrave it by ion beam into an almost indestructible nickel plate, a modern counterpart to our ancestors' cultural habit of carving important information permanently in stone.
Another method of permanently storing images and texts in analogue, directly readable form is to fire them onto stoneware slabs using ceramic colour bodies. The Memory of Mankind (MOM) project records images of museum cultural assets and everyday cultural products on stoneware slabs and deposits them in chambers in the Salt Mountain of Hallstatt. The theoretical shelf life is given as hundreds of thousands of years; the durability of ceramic data carriers has been demonstrated for at least 5,000 years (cuneiform tablets).
Rapid Media and System Change
Especially with digitally stored information there is the additional problem that data may no longer be accessible even though the medium itself has been preserved. To access stored information, the respective carrier medium must be readable. With some media, such as stone tablets or books, a person can do this without technical aids. With digitally stored media, a corresponding reader, often a drive, is usually required. If such readers are no longer available, for example because of technological change, the data can no longer be read out, or only with difficulty; outdated tape formats are an example.
Even if the storage medium is preserved and still readable, access to the stored data may be impossible. Since digitally stored data is not directly accessible but digitally encoded and structured in a media-specific way, it can only be read if a program and an operating system exist that “understand” the contents of the file. Because many operating systems and programs use their own methods of encoding data, readability can no longer be ensured once an operating system or program stops being maintained. The problem is exacerbated by the policy of many software manufacturers of releasing new program versions with changed storage formats that can no longer fully read the older storage formats of the same program.
Proprietary systems and copyright restrictions make the copying and migration of data required for long-term archiving difficult, because the necessary steps are not known or not permitted. The spread of Digital Rights Management (DRM) in particular will exacerbate this problem in the future. A set of rules for digital data and documents is nevertheless necessary because, just as with conventional material, copyright issues must be clarified before archiving. The difference between conventional material and electronic documents is that with the latter, copy and original are practically indistinguishable. Migration in particular requires making copies and possibly altering original documents, so the author's consent for such measures must be obtained in advance. Copies handed over to readers must be adequately remunerated and, where free disclosure is not permitted, accompanied by blocking notices.
It is not enough simply to copy the original data: it must also remain findable on the new medium. Certain additional data about the structure and content of the original data, so-called metadata, must therefore be entered in catalogues, databases or other finding aids so that it is available for later reading or searching.
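What such metadata looks like can be sketched in a few lines. The field names below loosely follow the Dublin Core convention and the values are invented for illustration; serialising the record as XML keeps the finding aid itself in an open, publicly documented format:

```python
# Minimal sketch of a descriptive metadata record for one archived
# object. Field names are Dublin-Core-like and the values are
# hypothetical examples, not taken from any real catalogue.
import xml.etree.ElementTree as ET

record = ET.Element("record")
for field, value in {
    "title": "Annual report 1998",
    "creator": "Example City Archive",
    "format": "application/pdf",
    "identifier": "urn:example:1998-0042",
}.items():
    ET.SubElement(record, field).text = value

# Serialised record, ready to be loaded into a catalogue or database.
xml_bytes = ET.tostring(record, encoding="utf-8")
```

A later search only needs to parse the record, not the archived object itself, which is precisely why the metadata must survive every migration alongside the data.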
An often overlooked problem in long-term archiving as well as in short-term archiving is the verification of the correctness of the data. Data can be intentionally modified, but can also be changed unnoticed by system errors.
One way out could be distributed storage in different locations at different organizations and protection with distributed cryptographic checksums. This is practised, among other things, with the open-source solution LOCKSS.
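The idea behind such a distributed audit can be shown in a simplified sketch: several organisations hold copies of the same document, and comparing cryptographic checksums across the copies exposes one that was altered, whether by a system error or deliberately. (The replica contents and organisation names are invented; LOCKSS itself uses a more elaborate polling protocol.)

```python
# Simplified sketch of a distributed checksum audit: the copy whose
# checksum disagrees with the majority is flagged and can be repaired
# from the agreeing copies.
import hashlib
from collections import Counter

def checksum(data: bytes) -> str:
    """SHA-256 fingerprint of a document's contents."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical replicas held at three independent organisations.
replicas = {
    "org_a": b"archived document",
    "org_b": b"archived document",
    "org_c": b"archived docum3nt",   # silently corrupted copy
}

votes = Counter(checksum(d) for d in replicas.values())
consensus, _ = votes.most_common(1)[0]
damaged = [org for org, d in replicas.items() if checksum(d) != consensus]
# damaged now names the organisations whose copies need repair
```

The scheme only works with enough independent copies for a meaningful majority, which is why the storage is spread across different locations and organisations in the first place.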
Procedure of Long-Term Archiving
In electronic archiving, a distinction can be made between the methods of migration/conversion and emulation. By using open standards such as the graphic formats TIFF, PNG and JFIF or free document formats such as XML, PDF and OpenDocument, which are considered relatively durable and whose structure is publicly documented, the cycles after which stored data must be reformatted become longer. The probability that systems and programs able to read such data will still exist in a few years is therefore significantly higher.
To prevent loss of data through the ageing of data carriers, the data must be copied to new carriers regularly, within the guaranteed data-security period of the medium. This also makes it possible to switch to a new carrier format as soon as the one used so far has become obsolete through technical development.
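A single refresh step of this kind amounts to copying plus verification: the copy must be proven bit-identical before the old medium is retired. A minimal sketch using only the Python standard library (the function names are illustrative, not from any archiving tool):

```python
# Sketch of one media-refresh step: copy a file to the new carrier and
# verify via checksum that the copy matches the original bit for bit.
import hashlib
import shutil

def sha256_of(path):
    """Checksum of a file's full contents."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def refresh(src, dst):
    """Copy src to dst; refuse to succeed if the copy differs."""
    shutil.copyfile(src, dst)
    if sha256_of(src) != sha256_of(dst):
        raise IOError("copy differs from original; do not retire the source")
    return dst
```

Only after `refresh` succeeds for every object would the old carrier be taken out of service; in practice the checksums would also be recorded as metadata for later audits.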
However, the high costs of this maintenance of the data stocks mean that only the most important data can be preserved in this way. Today's flood of data and metadata, caused not least by the constantly increasing use of digital data processing systems, further exacerbates the problem of classifying which data volumes are worth storing. The proportion of data preserved in the long term will necessarily remain relatively small, which places high technical and subject-specific demands on the selection of the information to be preserved. An additional problem arises from the growing divergence between data volume and transfer bandwidth: the volume grows significantly faster than the bandwidth available to move data from one medium to another. This applies not only to data in the state and commercial sectors. Even in the private sphere, conventional media that can often be stored for a long time are being replaced by more easily manageable digital media (photographs and negatives by digital images on a CD-ROM).