Semi-structured data is the term used in database research (computer science) for information that does not follow a general, fixed structure but carries part of its structural information with it. Whereas structured data storage must be based on a database model that defines the form of the data elements (objects), semi-structured data has no such model. Semi-structured data does not need to conform to a type model, so a collection of semi-structured data can be extended as desired.
With the help of rules, semi-structured data can be brought into a form that has the following properties: (1) The data collection consists of one or more sequences of objects. (2) Objects can either be decomposed into attributes (complex objects) or are atomic objects. (3) Atomic objects contain values of a known, elementary data type. Semi-structured data with properties (1), (2), and (3) is called well-formed semi-structured data. The Object Exchange Model (OEM) has become a de facto model for semi-structured data. Data that has these properties can also be described as well-formed XML documents.
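The three properties can be made concrete with a small check. The following is only a sketch under stated assumptions: complex objects are modeled as Python dictionaries, sequences as lists, and the set of "elementary" types (`ATOMIC_TYPES`) is chosen for illustration; none of these names come from the OEM literature.

```python
from typing import Any

# Assumption: these Python types stand in for "known, elementary data types".
ATOMIC_TYPES = (str, int, float, bool)


def is_object(value: Any) -> bool:
    """Property (2) and (3): an object is atomic or decomposes into attributes."""
    if isinstance(value, ATOMIC_TYPES):
        return True
    if isinstance(value, dict):
        return all(
            isinstance(name, str) and is_attribute_value(child)
            for name, child in value.items()
        )
    return False


def is_attribute_value(value: Any) -> bool:
    """An attribute holds one object or a sequence of objects."""
    if isinstance(value, list):
        return all(is_object(item) for item in value)
    return is_object(value)


def is_well_formed(collection: Any) -> bool:
    """Property (1): the collection is a sequence of objects."""
    return isinstance(collection, list) and all(is_object(o) for o in collection)


people = [
    {"name": "Ada", "age": 36},
    {"name": "Alan", "emails": ["alan@example.org"]},  # different attributes
    "a bare atomic object",
]
print(is_well_formed(people))
```

Note that the two complex objects carry different attributes and the check still passes: no shared type model is required, only the three properties.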
Is semi-structured not also structured?
Semi-structured data cannot be placed in a structured database model, with the one exception described below. There are, however, procedures for detecting the data types of semi-structured data. Once the data types (classes), and thus the relations, are known, an entity-relationship model is available. Such a model can then only be filled with data that has exactly this structure, not with other semi-structured data. For semi-structured files that follow the OEM, the formal description of the OEM additionally makes it possible to create a matching structured data model, which can look like this:
- This data model contains only three basic types: nodes, which represent the objects; edges, which represent attributes or references; and leaves, which represent the properties of the reference.
- Thus, all semi-structured objects of an OEM can also be written into this data model. The following is an OEM DB model.
- Semi-structured data cannot be written into a DB model, except into models that have only one abstract data type for all objects.
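A generic model of this kind can be sketched in a few lines. The class names `Node`, `Edge`, and `Leaf` and the helper `to_graph` are hypothetical, and the reading that edges carry the attribute name while leaves carry atomic values is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Union

Atomic = Union[str, int, float, bool]


@dataclass
class Leaf:
    """Holds an atomic value."""
    value: Atomic


@dataclass
class Node:
    """Represents a complex object; its attributes are outgoing edges."""
    edges: List["Edge"] = field(default_factory=list)


@dataclass
class Edge:
    """A labeled attribute or reference pointing at a node or a leaf."""
    label: str
    target: Union[Node, Leaf]


def to_graph(value) -> Union[Node, Leaf]:
    """Store arbitrary nested dict/list data using only the three types."""
    if isinstance(value, dict):
        node = Node()
        for label, child in value.items():
            # A list becomes several edges that share the same label.
            children = child if isinstance(child, list) else [child]
            for item in children:
                node.edges.append(Edge(label, to_graph(item)))
        return node
    return Leaf(value)


graph = to_graph({"name": "Ada", "emails": ["a@example.org", "b@example.org"]})
print([edge.label for edge in graph.edges])
```

Because every object, whatever its attributes, is stored as the same `Node` type, any well-formed semi-structured object fits into this one schema.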
Notation of semi-structured data
The notation of semi-structured data in XML, which has been standardized by the World Wide Web Consortium (W3C), is very widespread. XML serves as a data exchange format on the Internet and is also used as a data storage format in many applications. In XML, attributes are written with the following notation for so-called elements, whose names can be chosen freely:
<element [attribute_1="value_1"] [attribute_2="value_2"] [attribute_n="value_n"]> content1 <subelement_1/> <subelement_2/> .... </element>
There are two ways to specify properties of objects within XML: (1) by XML attributes and (2) by sub-elements.
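Both variants can be compared using Python's standard `xml.etree.ElementTree` module. The `person` element and its `name` property are made-up illustration data.

```python
import xml.etree.ElementTree as ET

# (1) The property is written as an XML attribute.
as_attribute = ET.fromstring('<person name="Ada"/>')

# (2) The same property is written as a sub-element.
as_subelement = ET.fromstring("<person><name>Ada</name></person>")

name_from_attribute = as_attribute.get("name")
name_from_subelement = as_subelement.findtext("name")

print(name_from_attribute, name_from_subelement)
```

Both documents carry the same information, but only the sub-element form can be decomposed further or repeated.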
The so-called ssd (semi-structured data) notation is less well known than XML, but it provides a very short and clear representation of semi-structured data. XML documents have a further, accompanying notation called DTD (Document Type Definition), which describes the structure of an XML document.
XML files with a DTD are more structured than XML files without one. XML files without a DTD have no typing: within the XML document, elements (tags) and their attributes can be defined as desired, without any restrictions. In principle, a DTD may define only a portion of the elements within the XML document. A DTD can define which elements may exist and which attributes these elements may or must have; the set of possible values can also be restricted. In addition, DTDs can define the set of permitted child elements. The types described in the DTD can be implied. Although such an XML document is subject to an object description, it still cannot be called structured data.
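The kinds of constraints a DTD expresses (declared elements, required and optional attributes, permitted children) can be imitated with a plain dictionary. This is a simplified stand-in, not DTD syntax and not a DTD validator: Python's standard library does not validate against DTDs, and the `RULES` table, the `violations` function, and the `library`/`book`/`title` vocabulary are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules, loosely mirroring what a DTD can state.
RULES = {
    "library": {"children": {"book"}, "required": set(), "optional": set()},
    "book": {"children": {"title"}, "required": {"isbn"}, "optional": {"lang"}},
    "title": {"children": set(), "required": set(), "optional": set()},
}


def violations(element):
    """Return a list of rule violations for an element and its descendants."""
    rule = RULES.get(element.tag)
    if rule is None:
        return [f"undeclared element <{element.tag}>"]
    problems = []
    attrs = set(element.attrib)
    for name in sorted(rule["required"] - attrs):
        problems.append(f"<{element.tag}> is missing attribute '{name}'")
    for name in sorted(attrs - rule["required"] - rule["optional"]):
        problems.append(f"<{element.tag}> has undeclared attribute '{name}'")
    for child in element:
        if child.tag not in rule["children"]:
            problems.append(f"<{child.tag}> not allowed inside <{element.tag}>")
        else:
            problems.extend(violations(child))
    return problems


good = ET.fromstring('<library><book isbn="1"><title>T</title></book></library>')
bad = ET.fromstring('<library><book lang="en"><title>T</title><price/></book></library>')
print(violations(good))
print(violations(bad))
```

Both documents parse without error; the rules only come into play when the check is run explicitly, which mirrors the point that the XML file exists independently of its DTD.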
Despite the possibility of further structuring with DTDs, this is still the semi-structured level of data storage. The reason is that structured data is technically subject to a so-called data dictionary, which describes the structure of the data. The structure of the entities comprises the relationships, attributes, and values with their data types. Without the data dictionary, the stored data cannot be accessed. Semi-structured data is different: it is basically structured like a text file. Moreover, the values of the attributes are not declared with data types such as string, integer, float, date, or number, but are generally represented as strings. An XML file validated against a DTD can therefore be edited and modified independently of the DTD. Different XML files that can be validated against the same DTD belong to the same equivalence class.
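The point about string-valued attributes shows up directly when an XML document is read. In this sketch the `item` element and its attribute names are invented for illustration; the consuming program, not the document, decides which type each value has.

```python
import xml.etree.ElementTree as ET
from datetime import date

item = ET.fromstring('<item price="19.90" quantity="3">2024-05-01</item>')

# Every value arrives as a plain string, whatever it looks like.
raw_price = item.get("price")
raw_quantity = item.get("quantity")

# The reader has to supply the types itself.
total = float(raw_price) * int(raw_quantity)
ordered_on = date.fromisoformat(item.text)

print(type(raw_price).__name__, total, ordered_on.year)
```

In a structured database, the data dictionary would already fix `price` as a number and the order date as a date; here that knowledge lives only in the processing program.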
Since the structure of the DTD is derived from the processing algorithms, semi-structured data in XML with a DTD can only be generated by one version of a program and processed further by a matching program and version, unless semantically oriented queries or processing methods are used in further processing. DTDs may also be created by type recognition methods such as simulation, because such a method detects the types (“classes”) of objects. Program changes, as seen here in the analysis system, also require the DTD to be adapted. In addition, the semi-structured concept allows elements, which in this case describe words and sentence phrases, to follow one another in any order. DTD notation provides parameter entities that allow any order and number of sub-elements of a parent. This is not directly possible with structured ER modeling.
JSON is an open standard format that uses human-readable text to convey data objects consisting of attribute–value pairs. It is used primarily to transmit data between a server and a web application, as an alternative to XML. JSON has been popularized by web services that use REST. Databases such as MongoDB and Couchbase can store semi-structured data natively in JSON format.
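A short sketch with Python's standard `json` module shows attribute–value pairs, nesting, and lists in one place. The records are invented illustration data; the two objects deliberately do not share the same attributes.

```python
import json

text = """
[
  {"name": "Ada", "emails": ["ada@example.org"]},
  {"name": "Alan", "address": {"city": "London"}}
]
"""

records = json.loads(text)

# Each record brings its own attribute names; there is no common schema.
attribute_names = [sorted(record) for record in records]
print(attribute_names)

# Serializing and parsing again yields the same data.
round_trip = json.loads(json.dumps(records))
print(round_trip == records)
```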
Advantages and disadvantages of semi-structured data
Programmers can avoid worrying about object-relational impedance mismatch. Nested or hierarchical data simplifies data models in many situations, as does support for lists of objects. However, because constraints are removed, this method of data storage is prone to garbage in, garbage out.