Hierarchical Data Transformation

Mar 16, 2022

RDF, XML, JSON, and many other formats represent hierarchical data, that is, data organized in a tree or forest-like structure. However, conversion/transformation between those formats is not always straightforward.

To be precise, transformations to RDF from formats such as CSV, XML, and JSON are simple and well supported. The same is not true for the opposite direction. A solution? A language and processor designed for highly customizable and extensible transformation of hierarchical data. The objective is to design and implement this tool, or a part of it.

A transformation (lifting) from CSV or JSON to RDF is handled using CSVW or JSON-LD, respectively. When it comes to XML, there is XSLT. jq, or alternative solutions, can be employed for JSON-to-JSON transformation. In general, each transformation between formats requires knowledge of specific languages and libraries.

In addition, many transformation tools and languages provide functionality far beyond an ordinary use case, i.e. the tools are too powerful and complex. This creates space for a new tool focusing on a unified data model, simplicity, and extensibility. The main idea is that all the formats can be described as hierarchical. With this abstraction, we can define a unifying transformation language with a limited set of operations. The differences between the formats may then be addressed by using different data models. For example, the difference between JSON and XML may be the existence of an artificial node @attribute that provides access to XML attributes.

The project should build upon the existing implementation of hierarchical data transformation.

2025-09-13 Update:

An attempt to solve this has been made in the bachelor thesis Hierarchical data transformation. The author proposes a unified representation for hierarchical data called Ur. For example, consider the JSON document:

{
  "library": {
    "name": "Open Library",
    "books": [{
      "attributes": { "condition": "good" },
      "book-title": "Příliš hlučná samota",
      "page-count": 98
    }]
  }
}

In Ur, it is represented as:

{
  "@type": ["object"],
  "library": [{
    "@type": ["object"],
    "name": [{ "@type": ["string"], "@value": ["Open Library"] }],
    "books": [{
      "@type": ["array"],
      "0": [{
        "@type": ["object"],
        "attributes": [{
          "@type": ["object"],
          "condition": [{ "@type": ["string"], "@value": ["good"] }]
        }],
        "book-title": [{ "@type": ["string"], "@value": ["Příliš hlučná samota"] }],
        "page-count": [{ "@type": ["number"], "@value": ["98"] }]
      }]
    }]
  }]
}

We can see that the representation is quite verbose. For example, every value is wrapped in an array, which may not be necessary for built-in keys such as those prefixed with "@". The author defined mappings from JSON, CSV, and XML to Ur.
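The JSON-to-Ur mapping can be sketched in a few lines of Python. This is only an illustration inferred from the example above, not the thesis implementation: the function name `to_ur` is hypothetical, and the handling of booleans and null is an assumption, since the example shows only objects, arrays, strings, and numbers.

```python
import json


def to_ur(value):
    """Convert a parsed JSON value into an Ur-style dictionary (sketch)."""
    if isinstance(value, dict):
        node = {"@type": ["object"]}
        for key, child in value.items():
            # Every property value is wrapped in an array.
            node[key] = [to_ur(child)]
        return node
    if isinstance(value, list):
        node = {"@type": ["array"]}
        for index, child in enumerate(value):
            # Array items are stored under their index as a string key.
            node[str(index)] = [to_ur(child)]
        return node
    if isinstance(value, str):
        return {"@type": ["string"], "@value": [value]}
    # bool is checked before numbers because bool is a subclass of int.
    if isinstance(value, bool):
        # Assumption: booleans are not covered by the example above.
        return {"@type": ["boolean"], "@value": ["true" if value else "false"]}
    if isinstance(value, (int, float)):
        # Numbers are stored as strings, as in the example ("98").
        return {"@type": ["number"], "@value": [json.dumps(value)]}
    if value is None:
        # Assumption: null is not covered by the example above.
        return {"@type": ["null"], "@value": ["null"]}
    raise TypeError(f"Unsupported value: {value!r}")


document = json.loads("""
{
  "library": {
    "name": "Open Library",
    "books": [{ "book-title": "Příliš hlučná samota", "page-count": 98 }]
  }
}
""")

ur = to_ur(document)
print(json.dumps(ur["library"][0]["name"], ensure_ascii=False))
# prints [{"@type": ["string"], "@value": ["Open Library"]}]
```

Note that this sketch does not handle a JSON property that is itself named "@type" or "@value"; such a key would collide with the built-in ones.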

Next, the author defined UrPath, a way to navigate Ur, e.g. ["library", "books", "[0]"].
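A minimal sketch of how such a path could be resolved against an Ur document follows. The function `resolve` is hypothetical, and it assumes that a step written as "[0]" addresses the key "0" of an array node; the thesis may define the semantics differently. Because every Ur value is an array, the result is a list of matching nodes.

```python
import re

# Assumption: a step such as "[0]" selects the array item stored under key "0".
INDEX_STEP = re.compile(r"^\[(\d+)\]$")


def resolve(ur, path):
    """Return all Ur nodes reachable from `ur` by following `path`."""
    nodes = [ur]
    for step in path:
        match = INDEX_STEP.match(step)
        key = match.group(1) if match else step
        next_nodes = []
        for node in nodes:
            # A missing key simply contributes no nodes.
            next_nodes.extend(node.get(key, []))
        nodes = next_nodes
    return nodes


ur = {
    "@type": ["object"],
    "library": [{
        "@type": ["object"],
        "books": [{
            "@type": ["array"],
            "0": [{
                "@type": ["object"],
                "book-title": [{"@type": ["string"], "@value": ["Příliš hlučná samota"]}],
            }],
        }],
    }],
}

books = resolve(ur, ["library", "books", "[0]"])
print(books[0]["book-title"][0]["@value"][0])
# prints Příliš hlučná samota
```

One consequence of this reading is that a property literally named "[0]" cannot be addressed, which may or may not match the original design.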

The last step is the definition of a transformation language for Ur. Here the author explores an interesting idea: the transformation language is defined as a set of operations. The operations include shift, remove, default, filter, array-map, and replace. An example of a shift operation:

{
  "operation": "shift",
  "comment": "Move the value of name from the 'library' object to the 'org/lib' object",
  "specs": [{
    "input-path": ["library", "name"],
    "output-path": ["org", "lib", "name"]
  }]
}
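To make the idea concrete, here is a sketch of how a shift operation might be applied to an Ur document. The exact semantics are an assumption: this version moves the value within a copy of the same tree, creates missing intermediate nodes as objects, and follows only the first value at each step. The function names `shift` and `apply_operation` are hypothetical.

```python
import copy


def shift(ur, spec):
    """Move the value at spec["input-path"] to spec["output-path"] (sketch)."""
    result = copy.deepcopy(ur)  # leave the input document untouched

    # Walk to the parent of the input path and detach the value.
    *parents, last = spec["input-path"]
    node = result
    for key in parents:
        node = node[key][0]  # assumption: follow the first value only
    moved = node.pop(last)  # raises KeyError if the input path is missing

    # Walk to the parent of the output path, creating objects as needed.
    *parents, last = spec["output-path"]
    node = result
    for key in parents:
        node = node.setdefault(key, [{"@type": ["object"]}])[0]
    node[last] = moved
    return result


def apply_operation(ur, operation):
    """Apply one operation; only "shift" is sketched here."""
    if operation["operation"] != "shift":
        raise NotImplementedError(operation["operation"])
    for spec in operation["specs"]:
        ur = shift(ur, spec)
    return ur


ur = {
    "@type": ["object"],
    "library": [{
        "@type": ["object"],
        "name": [{"@type": ["string"], "@value": ["Open Library"]}],
    }],
}
operation = {
    "operation": "shift",
    "specs": [{
        "input-path": ["library", "name"],
        "output-path": ["org", "lib", "name"],
    }],
}

result = apply_operation(ur, operation)
print(result["org"][0]["lib"][0]["name"][0]["@value"][0])
# prints Open Library
```

An alternative reading, where shift builds a fresh output tree containing only the shifted values, would change the first line of `shift` and little else.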

2025-09-13 Next iteration:

Simplify Ur and remove unnecessary arrays. We also need to include RDF conversion.
Instead of transformation, focus on navigation. Output can be produced as a side effect of navigation. For example, we can open an array when visiting an object and then emit a value for each property.
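The navigation idea could look roughly like the sketch below. Everything here is hypothetical: the `ListWriter` class, its `open_array`, `close_array`, and `emit` methods, and the `navigate` function are illustrative names, and the example still uses the current (unsimplified) Ur form. Visiting an object or array opens an output array, and each primitive value is emitted into it.

```python
class ListWriter:
    """Hypothetical output writer that builds nested Python lists."""

    def __init__(self):
        self.root = None
        self._stack = []

    def open_array(self):
        array = []
        self._add(array)
        self._stack.append(array)

    def close_array(self):
        self._stack.pop()

    def emit(self, value):
        self._add(value)

    def _add(self, item):
        # Append to the innermost open array, or become the root.
        if self._stack:
            self._stack[-1].append(item)
        else:
            self.root = item


def navigate(node, writer):
    """Walk an Ur node and produce output as a side effect."""
    kind = node["@type"][0]
    if kind in ("object", "array"):
        writer.open_array()
        for key, children in node.items():
            if key.startswith("@"):
                continue  # skip built-in keys such as @type
            for child in children:
                navigate(child, writer)
        writer.close_array()
    else:
        writer.emit(node["@value"][0])


ur = {
    "@type": ["object"],
    "name": [{"@type": ["string"], "@value": ["Open Library"]}],
    "page-count": [{"@type": ["number"], "@value": ["98"]}],
}

writer = ListWriter()
navigate(ur, writer)
print(writer.root)
# prints ['Open Library', '98']
```

Swapping `ListWriter` for a writer that serializes JSON, XML, or RDF would change the output format without touching the navigation code, which is the appeal of this approach.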