2022-06-30

The structure of personal data

https://dazzle.town/blog/2022-06-30-structure-of-personal-data/

Data has structure.

For example, the data in a spreadsheet is mainly organized in rows and columns. The data in a text document, by contrast, is organized in a series of blocks that follow each other, where a block can be a headline, a paragraph, or an inlined image. As a page fills, we break the block and continue it on the next page, unlike with a spreadsheet, where we scroll on the same screen to get to a part that we can’t currently see.

We use spreadsheets and text editors for different data, and the structure of the data we want to enter is one of the main reasons for that. For example, how would you write a spreadsheet formula if you couldn’t refer to a cell like A5 (as in a text editor), or add a table of contents to your thesis if your text was all on the same sheet (as in a spreadsheet)? You couldn’t.

What is the structure of personal data? If our goal is to take our personal data back, we had better understand the structure of what we’ll take back. Otherwise we’ll end up putting the equivalent of a spreadsheet into a text document, and then wonder why none of its formulas update, or why the personal data makes no sense to us.

As a reminder, at Dazzle we use a broad definition of the term “personal data”.

Having worked on this for many years, I have found the following core characteristics of the structure of personal data across the many domains in which personal data exists:

Personal data contains discrete “data objects” that refer to discrete things in the real world (and sometimes the virtual world). For example, an e-commerce site may have discrete data objects for each “order” and “shipment” or the beach ball that I might order. Or, a social media site may have a discrete data object representing my friend Joe.
These discrete data objects have types. Clearly, an order is an order and not a beach ball, and my friend Joe is clearly something other than a shipment. It’s important not to confuse those types. Note that there are many, many potential types for personal data objects. If you were to make a list, it would be very long, as it can potentially include every single thing and idea in the human experience.
These discrete data objects more often than not have properties. For example, an order at an e-commerce site may have a time when it was placed, or the object representing my friend Joe may have a property for his nickname ‘Joe’ or perhaps his birth date.
These discrete data objects are related to each other, reflecting the real-world relationships they have, or had, with each other. For example, a shipment object may represent the shipment that was sent to fulfill a particular order at the store (and not some other order), and so personal data needs to capture that relationship between that particular order and that particular shipment. Or my friend Joe, and not me, may be the recipient of that shipment.

There is also a myriad of different types of relationships those data objects can have with each other, reflecting the myriad of different types of relationships that things in the real world can have with each other. For example, a shipment may indeed be the shipment fulfilling a particular order, but it may also be the shipment that returns the product after the order was canceled. It would be dangerous to confuse those two, so paying attention to relationship types is important.
Often, objects are related to many other objects. Think of my friend Joe: he might be the recipient of that order, and a few dozen others, but he also sent me hundreds of messages over the years and interacted with lots of other people and things. Each of those interactions or relationships is potentially reflected in his personal data, which can easily lead to thousands of relationships, or more, for the data object representing him.
Properties have types, too, but these are less complex than the types of the discrete data objects representing real-world objects. In our example, Joe is of type person, but his birth date is of type date, and his nickname is a (fairly short) text string.
Some properties carry very large values best thought of as blobs. For example, a data object representing an X-Ray image that was taken of my broken wrist might have properties for when it was taken, but also a property that carries the actual image, which may be many megabytes in size.

(Note that we are not discussing here how to efficiently represent this in software. Only that there are certain things in personal data – like the notion of an X-Ray image – that have certain properties – like the large number of pixels in the image.)
Personal data changes over time, just like the world referenced by personal data objects changes over time. It is important to know when a given data object was accurate, and if it changed over time, how it changed. This is also true for relationships: for example, Joe and I were people long before we became friends, and perhaps one day we won’t be friends any more. Personal data needs to be able to express that data objects, their properties, and their relationships can change over time, and can come into being and go away over time.
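To make the characteristics above a little more concrete, here is a minimal sketch in Python. It is not how Dazzle represents personal data, and it is not meant to be efficient; all names in it (DataObject, Relationship, holds_on, outgoing, and the type names like "fulfills") are hypothetical and chosen only for illustration, as are all the dates and values. It shows typed objects with typed properties, a blob-like property, typed relationships, and relationships that are only valid for a span of time.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Dict, List, Optional, Union

# Property values have simple types; bytes stands in for large blobs.
PropertyValue = Union[str, int, float, date, bytes]


@dataclass
class DataObject:
    """A discrete, typed object standing for one thing in the world."""
    object_type: str  # e.g. "Person", "Order", "Shipment"
    properties: Dict[str, PropertyValue] = field(default_factory=dict)


@dataclass
class Relationship:
    """A typed link between two objects, valid for a span of time."""
    relationship_type: str  # e.g. "fulfills", "returns", "friend_of"
    source: DataObject
    target: DataObject
    valid_from: Optional[date] = None   # None: start not recorded
    valid_until: Optional[date] = None  # None: has not ended

    def holds_on(self, day: date) -> bool:
        """True if the relationship was in effect on the given day."""
        started = self.valid_from is None or self.valid_from <= day
        not_ended = self.valid_until is None or day < self.valid_until
        return started and not_ended


# All values below are made up for illustration.
me = DataObject("Person", {"nickname": "me"})
joe = DataObject("Person", {"nickname": "Joe"})
order = DataObject("Order", {"placed_on": date(2022, 6, 1)})
shipment = DataObject("Shipment")
xray = DataObject(
    "XRayImage",
    {"taken_on": date(2022, 5, 1), "image": bytes(1024)},  # placeholder blob
)

relationships: List[Relationship] = [
    Relationship("fulfills", shipment, order),
    Relationship("recipient_of", joe, shipment),
    Relationship("friend_of", me, joe, valid_from=date(2015, 1, 1)),
]


def outgoing(obj: DataObject, relationship_type: str, on: date) -> List[DataObject]:
    """Objects that obj points to via relationship_type on the given day."""
    return [
        r.target
        for r in relationships
        if r.source is obj
        and r.relationship_type == relationship_type
        and r.holds_on(on)
    ]


print([o.object_type for o in outgoing(shipment, "fulfills", date(2022, 7, 1))])
print([o.properties["nickname"] for o in outgoing(me, "friend_of", date(2020, 1, 1))])
```

Keeping the relationship type separate from the objects it connects is what lets this sketch tell a "fulfills" shipment apart from a "returns" shipment, and the valid_from and valid_until fields are one simple way to say that Joe and I existed before we became friends.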

Many applications for personal data do not make use of all these concepts, so it is very possible that a given set of personal data does not contain, say, large blob-like properties or data that changed over time.

But if we are embarking on taking our personal data back, we need to expect all of the above. Otherwise we will encounter data we cannot take back, and we don’t want technology to limit what data we can take back.