GEDCOM Redesign
by W. Wesley Johnston

Bringing Genealogical Computing Fully into the 21st Century


I am particularly distressed by the failure of the genealogical computing community to deal with the complete re-design of the GEDCOM database format. GEDCOM is really two things in one:

  1. It is a database design.
  2. It is a means of transferring databases, preferably without loss of information, either to another database (one's own or someone else's) or to a web page.

The second aspect really depends on the first: the design of the database is the most fundamental aspect of GEDCOM.

The reality is that the database design of GEDCOM is based on the limitations of the technology of the early 1980's -- almost 30 years ago. The resulting design sacrificed the principles of good database design (especially third normal form) in order to make things work within the limits of the technology of that time. We all did our database designs that way in those days: we had to live within the limits of the technology, or our systems would not work, no matter how perfectly their design followed good database design principles. So we de-normalized and made things work. Various commercial genealogical computing products have added features on top of the GEDCOM design, but the fundamental flaws of the underlying design doom such add-on efforts to frustration. The add-ons also undermine the transferability of information that is the second goal of GEDCOM.

But that was then and this is now. Those days and their severe technological limitations are gone. And yet we are still using essentially the GEDCOM database design that was developed for the technology of the 1980's. And we are paying for it in ways that we should no longer have to. Until there is a fundamental redesign of the GEDCOM database structure, we will continue to be unable to exploit modern technology fully for what it can do for genealogical computing, and we will continue to be unable to share our databases fully, without significant loss of information, especially of added media. Now that many of those implementation limitations are no longer a problem, we really need a design for lineage-linked databases that supports all of the relationships in which people existed in their lives, since all of those relationships were potential sources of records and potential sources of learning more about the people through information far beyond bare-bones blood pedigrees.

So I have put together this web page that addresses the topic in detail.




What is wrong with the current design?

Good database design relies on normalization of the data. There are several normal forms, but the most important in practice is the third. A database design that conforms to it is said to be in third normal form, or 3NF. This form eliminates redundancy within the design and so protects the integrity of the data: each fact is stored in exactly one place. The downside is that any view of the data that combines attributes from different relations has to be computed every time you want to see the data that way. In the past, the time this took to display a screen with the information you wanted was prohibitively high. So in the 1980's, databases were not designed to conform to 3NF but instead were de-normalized, so that the data was stored redundantly but could be displayed quickly.
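As a minimal sketch of that trade-off, the Python fragment below uses the standard sqlite3 module to store a place name once and join it to people at display time. The table and column names (`person`, `place`, `birth_place_id`) and the function `birth_view` are invented for illustration and are not part of GEDCOM or of any proposed redesign.

```python
import sqlite3

# Hypothetical two-table layout: each place name is stored exactly once.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE place (
        place_id INTEGER PRIMARY KEY,
        name     TEXT NOT NULL
    );
    CREATE TABLE person (
        person_id      INTEGER PRIMARY KEY,
        name           TEXT NOT NULL,
        birth_place_id INTEGER REFERENCES place(place_id)
    );
    INSERT INTO place VALUES (1, 'Chicago, Cook, Illinois, USA');
    INSERT INTO person VALUES (1, 'Cornelius', 1), (2, 'Maggie', 1);
""")

def birth_view(conn):
    """Combine person and place at display time (the cost of normalization)."""
    return conn.execute("""
        SELECT person.name, place.name
        FROM person
        JOIN place ON place.place_id = person.birth_place_id
        ORDER BY person.person_id
    """).fetchall()

before = birth_view(conn)

# Correcting the place touches one row, not one row per person.
conn.execute(
    "UPDATE place SET name = 'Chicago, Cook County, Illinois, USA' "
    "WHERE place_id = 1"
)
after = birth_view(conn)
```

In a de-normalized layout the place string would be repeated in every person row, so the same correction would have to be made once per person, with the risk of missing some.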

The 4-Level Location Name Nightmare

The reality is that someone lived in a house or worked in a place that had an address and was in a city. These were the two fundamental ways that they thought about where they lived. Certainly, the state and country were significant. And if they were in a rural area in the United States, the county was important. But essentially you had a place that stayed where it was while the identification at various levels altered.
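One possible way to model a place that stays put while its names change is sketched below in Python with sqlite3. Everything in it is an assumption made for illustration: the tables `place` and `place_name`, the function `names_in_year`, and the invented names "Exampleville", "Old County" and "New County" do not come from GEDCOM or from any real locality.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE place (
        place_id INTEGER PRIMARY KEY,
        note     TEXT
    );
    -- A place has as many or as few named levels as the records give it,
    -- and each name can be limited to a range of years.
    CREATE TABLE place_name (
        place_id   INTEGER NOT NULL REFERENCES place(place_id),
        level      TEXT NOT NULL,   -- e.g. 'address', 'city', 'county'
        name       TEXT NOT NULL,
        valid_from INTEGER,         -- first year the name applies; NULL = open
        valid_to   INTEGER          -- last year the name applies; NULL = open
    );
    INSERT INTO place VALUES (1, 'invented example house');
    INSERT INTO place_name VALUES
        (1, 'address', '12 Example Street', NULL, NULL),
        (1, 'city',    'Exampleville',      NULL, NULL),
        (1, 'county',  'Old County',        NULL, 1849),
        (1, 'county',  'New County',        1850, NULL);
""")

def names_in_year(conn, place_id, year):
    """Return {level: name} for the names in force in the given year."""
    rows = conn.execute("""
        SELECT level, name
        FROM place_name
        WHERE place_id = ?
          AND (valid_from IS NULL OR valid_from <= ?)
          AND (valid_to   IS NULL OR valid_to   >= ?)
    """, (place_id, year, year)).fetchall()
    return dict(rows)

in_1840 = names_in_year(conn, 1, 1840)
in_1860 = names_in_year(conn, 1, 1860)
```

The same house resolves to "Old County" in 1840 and "New County" in 1860, and no level is forced to exist where the records do not supply one.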

The problem with de-normalized location data in the existing GEDCOM design is that every location field is forced into four levels, as in "Chicago, Cook, Illinois, USA" -- a pattern based on cities within the United States. This led to all sorts of torturing of location data on a rack of de-normalized design. Here are some examples:


Restriction to only Parent-Child and Parent-Spouse Human Relationships

The relationships that immersed people in their lives -- and which we find in the records -- are far more numerous than parent-child and parent-spouse relationships. But those are the only relationships that GEDCOM supports. You can compute other familial relationships from these. But you have no way to represent, in a retrievable fashion as a full relation in the database sense, a vast array of other relationships, of which several were particularly important in a person's life:

In addition, there are human relationships that appear in records that you want to capture but which you cannot fit into your existing database, since they depend on distant linkages that you cannot determine:

While there were probably technological reasons for not designing the GEDCOM database to include these relationships, the original design also had a purely lineage-linked focus. After my 2003 article "Non-blood Relationship Searches" appeared in "Genealogical Computing" magazine, a reader wrote a letter to the editor to express dismay that such a subject would even be discussed, since she had this narrow ancestors-only focus -- which really blinded her to understanding the lives of her ancestors. There may have been a good deal of this same attitude in the original GEDCOM design, and the technological limitations of 1984 made it possible to cast that attitude into concrete that still encases and constricts us today.

We really need a database design that supports all of the key relationships in which people existed in their lives, since all of those relationships were potential sources of records and potential sources of learning more about the people through information far beyond bare-bones blood pedigrees.
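A minimal sketch of what such support could look like, again in Python with sqlite3, is a table of typed links between any two people. The schema, the function `relationships_of`, the relationship labels, and the person called "Person C" are all invented for this illustration; they are assumptions, not a proposed standard.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    -- New kinds of relationship are added as rows, not as schema changes.
    CREATE TABLE relationship_type (
        type_id INTEGER PRIMARY KEY,
        label   TEXT NOT NULL UNIQUE
    );
    CREATE TABLE relationship (
        from_person INTEGER NOT NULL REFERENCES person(person_id),
        to_person   INTEGER NOT NULL REFERENCES person(person_id),
        type_id     INTEGER NOT NULL REFERENCES relationship_type(type_id),
        source_note TEXT
    );
    INSERT INTO person VALUES (1, 'Cornelius'), (2, 'Maggie'), (3, 'Person C');
    INSERT INTO relationship_type VALUES
        (1, 'neighbor of'), (2, 'godparent of'), (3, 'employer of');
    INSERT INTO relationship VALUES
        (1, 2, 1, 'invented census entry'),
        (3, 1, 2, 'invented baptismal record'),
        (3, 2, 3, 'invented city directory');
""")

def relationships_of(conn, person_id):
    """Return (from name, label, to name) for every link touching the person."""
    return conn.execute("""
        SELECT f.name, t.label, o.name
        FROM relationship r
        JOIN person f ON f.person_id = r.from_person
        JOIN person o ON o.person_id = r.to_person
        JOIN relationship_type t ON t.type_id = r.type_id
        WHERE r.from_person = ? OR r.to_person = ?
        ORDER BY r.rowid
    """, (person_id, person_id)).fetchall()

maggie_links = relationships_of(conn, 2)
```

Because each link is an ordinary row, it can be searched like any other data instead of being buried in a free-text note.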

....................STILL UNDER CONSTRUCTION.............................

Absence of Geographical Relationships

Where people lived in relationship to each other was important in their lives. Perhaps Cornelius met Maggie because they lived across the street from each other, or maybe they were living in different apartments in the same building, or worked together at the same place, or attended the same church. Or maybe the baptismal record of a child shows that the child was born at "[...]". None of these geographical relationships are supported in GEDCOM. You can enter the information in a generic data field, but there is no way that you can easily search the GEDCOM database to discover which people were together in these ways.

There are also census-based relationships that may be more complicated to implement: is in the same household as (a servant, a boarder, ...), is in the same building as, is on the same or next census page as, is in the same enumeration district as.

Clearly if there were an address field, many of these geographical relationships could be discovered. But there are censuses and other records that do not include addresses but do include the information that two people were in some geographical relationship with each other, and it would be useful to include that connection between those people rather than to ignore it, as the current GEDCOM design does.
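Both cases can be sketched together in Python with sqlite3: people who share a recorded address in the same year, and people whom a record places together without giving any address. The tables `residence` and `co_location`, the function `found_together`, the address, the year, and "Person C" are all invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person (
        person_id INTEGER PRIMARY KEY,
        name      TEXT NOT NULL
    );
    -- Case 1: the record gives an address.
    CREATE TABLE residence (
        person_id INTEGER NOT NULL REFERENCES person(person_id),
        address   TEXT NOT NULL,
        year      INTEGER NOT NULL
    );
    -- Case 2: the record only says two people were together somehow.
    CREATE TABLE co_location (
        person_a INTEGER NOT NULL REFERENCES person(person_id),
        person_b INTEGER NOT NULL REFERENCES person(person_id),
        kind     TEXT NOT NULL,     -- e.g. 'same census page'
        year     INTEGER NOT NULL
    );
    INSERT INTO person VALUES (1, 'Cornelius'), (2, 'Maggie'), (3, 'Person C');
    INSERT INTO residence VALUES
        (1, '12 Example Street', 1900),
        (2, '12 Example Street', 1900);
    INSERT INTO co_location VALUES (2, 3, 'same census page', 1900);
""")

def found_together(conn):
    """Return (name, name, how) for every pair the records place together."""
    return conn.execute("""
        SELECT a.name, b.name, 'same address'
        FROM residence ra
        JOIN residence rb
          ON ra.address = rb.address
         AND ra.year = rb.year
         AND ra.person_id < rb.person_id   -- report each pair once
        JOIN person a ON a.person_id = ra.person_id
        JOIN person b ON b.person_id = rb.person_id
        UNION ALL
        SELECT a.name, b.name, c.kind
        FROM co_location c
        JOIN person a ON a.person_id = c.person_a
        JOIN person b ON b.person_id = c.person_b
    """).fetchall()

pairs = found_together(conn)
```

The first half of the query derives the connection from shared address data, while the second half simply reads back a connection that was recorded directly from the source.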

....................STILL UNDER CONSTRUCTION.............................

Relationships of media items to people and events and places

....................STILL UNDER CONSTRUCTION.............................


Send E-mail to wwjohnston01@yahoo.com
Copyright © 2011 by Wesley Johnston
All rights reserved


Last updated November 3, 2011 - further development of the specific problems