Using GEPHI to Create a Network Graph from a Generations Matrix

Copyright © 2024 by Wesley Johnston - All rights reserved
Created 15 Feb 2024 - Last updated 16 Feb 2024

GEDmatch supports tag groups, which collect multiple kits so that they can be analyzed together in an autosomal DNA project. The GEDmatch Generations Matrix tool can then be applied to that tag group. It generates an n by n matrix (where n is the number of kits) in which each cell contains the DNA-based estimated number of generations that the two people who intersect at that cell are from their MRCA (most recent common ancestor). The Generations Matrix is an extremely effective tool for quickly identifying which kits in an existing tag group match a newly identified kit.

In our Johnston Research Group, we now have more than 230 kits of descendants of Johns(t)on(e) ancestors. While the Generations Matrix presents a 2-dimensional grid view of all the connections, a February 2024 Legacy webinar by Diana Elder of Family Locket led me to wonder if using GEPHI to create a network graph from the Generations Matrix might provide some insight that we could not easily see in the grid format. Her daughter Nicole Dyer had written several excellent web pages to help people generate network graphs from their DNA matches. These provided a jumping-off point, but the input data from a Generations Matrix is different from the input data from a match list. So, I had to navigate GEPHI parameters differently for the Generations Matrix. This web page explains what I figured out.

All names and kit numbers on this web page are fabricated and not the real names or kit numbers of any tester.
-- Wesley Johnston

GEPHI Download and Install

You can freely download GEPHI from the gephi.org website. Installation is simple.

The Input File

The GEDmatch Generations Matrix is the input file. You can use CTRL+A to select the entire page and then CTRL+C to copy the selection. You can then paste the result into a spreadsheet and eliminate all but the labels and the cells.

You do have to modify the labels, both down the side and across the top. I concatenate the kit number and name, separated by a hyphen so that I have a single column of labels on the left instead of a kit number column and a kit name column. I then copy that entire column and paste it transposed into the top row, which was originally just the kit numbers.

That is it. The input data setup is pretty simple.
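The label setup described above can also be sketched in code. This is only an illustration, not the spreadsheet procedure itself: the function name `build_gephi_matrix`, the comma-separated output, and the empty corner cell are all assumptions, and the kits and names are fabricated.

```python
import csv
import io

def build_gephi_matrix(kits, names, cells):
    """Return CSV text for a square matrix with matching row and column labels.

    kits and names are parallel lists; cells is a list of rows (lists of
    strings), one row per kit, in the same order as the kits.
    """
    # Concatenate kit number and name with a hyphen, as described in the text.
    labels = [f"{kit}-{name}" for kit, name in zip(kits, names)]
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Top row: an empty corner cell, then the labels transposed across the top.
    writer.writerow([""] + labels)
    # Each data row starts with its own label.
    for label, row in zip(labels, cells):
        writer.writerow([label] + list(row))
    return out.getvalue()

# Fabricated kits, names, and generation estimates, for illustration only.
kits = ["A000001", "A000002", "A000003"]
names = ["Ann Example", "Bob Sample", "Cy Test"]
cells = [
    ["", "3.2", "4.5"],
    ["3.2", "", ""],
    ["4.5", "", ""],
]
csv_text = build_gephi_matrix(kits, names, cells)
print(csv_text)
```

The point of the sketch is that the same label list is used for both the left column and the top row, so every kit is identified the same way in both directions.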

Generations Matrix in GEDmatch after removing two columns to right of name column

Modified Generations Matrix Input File to GEPHI

GEPHI Input

Start GEPHI and select "New Project". Then click on "File" and "Open" and select the spreadsheet file you created with the GEPHI input of the Generations Matrix.

Note that GEPHI detects that the spreadsheet is a matrix and sets the "Import as:" option to Matrix. This is one of the differences from the DNA match list input described in Nicole Dyer's instructions: we have only one file and not separate files for nodes and edges.

Click "Next". Then on the next popup window, click "Finish".

This will open the "Import report" popup window. The number of nodes should be the number of kits in your Generations Matrix. The number of edges will vary depending on how many of the kits have cells with MRCA generation estimates. In this window, change the "Graph Type" to "Undirected". Then click "OK".

This will pop up the warning "Issues after import process" window stating "- mutual edges removed to fulfill undirected type". Simply click "Close". I do not fully understand this setting. But the resulting graph appears to be okay.

You will then see in the "Data Laboratory" tab the "Data Table" and the "Nodes". These are your kits with identical pairs of "id" and "Label".

If you click on "Edges", you will see all the pairs of "Source" kit and "Target" kit. They will have a sequential number for their "id". The "weight" will be the cell value (DNA-based estimated generations to MRCA) for that pair of kits.
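One plausible reading of the "mutual edges removed" warning is that a symmetric matrix lists every pair twice (once as A to B and once as B to A), and an undirected graph keeps only one edge per pair. The sketch below illustrates that idea under stated assumptions: `matrix_to_edges` is a hypothetical helper, empty cells are treated as "no edge", and keeping the first value seen for a pair is an arbitrary choice here, not a statement of what GEPHI does internally.

```python
def matrix_to_edges(labels, cells):
    """Collapse a symmetric matrix into one undirected edge per pair of kits."""
    edges = {}
    for i, source in enumerate(labels):
        for j, target in enumerate(labels):
            value = cells[i][j]
            if i == j or value == "":
                continue  # skip the diagonal and cells with no estimate
            # Sorting the pair makes (A, B) and (B, A) the same key.
            pair = tuple(sorted((source, target)))
            # Keep the first value seen for the pair (assumption for this sketch).
            edges.setdefault(pair, float(value))
    return [(a, b, w) for (a, b), w in sorted(edges.items())]

# Fabricated labels and generation estimates, for illustration only.
labels = ["K1-Ann", "K2-Bob", "K3-Cy"]
cells = [
    ["", "3.2", "4.5"],
    ["3.2", "", ""],
    ["4.5", "", ""],
]
edges = matrix_to_edges(labels, cells)
for source, target, weight in edges:
    print(source, target, weight)
```

In this toy matrix, four filled cells become two edges, which is the same kind of reduction the warning appears to describe.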

Working with the Graph

The graph will initially appear as a black "hairball" in a solid black rough square. You need to spread the graph apart so that you can see the nodes and how they connect. You also need to identify clusters and color them so that the graph allows you to visualize the clusters. You also need to reduce the dimensionality to include only the most-connected kits so that the graph is not overly complex. And you need to label the nodes so that you know what you are seeing.

This image highlights the key places to click in the following steps.

Spreading the Graph Apart: The graph's overall visual shape varies depending on which "Layout" you choose in the "Overview" tab on the left side tool bar. After experimenting with different layouts, I opted for the "Fruchterman Reingold" layout with its default parameters. Choose that layout from the pulldown menu, click "Run" and then click "Stop". You can always recenter the graph image with the magnifying glass icon at bottom left and zoom in or out with the scroll wheel on your mouse. Zooming centers on the part of the graph where your cursor is hovering.

Run then Stop

Reducing the Dimensionality: With 229 kits in my input data, the layout spreads them out. But the result is still a "hairball" mass of so many tangled lines that it is still just solid black even at high zoom. To have a graph from which to achieve some insight, I have to reduce the dimensionality. I do this by reducing the number of kits by eliminating the kits that have fewer than some number of other kits to which they connect. On the right side tool bar, in the "Statistics" tab, click "Run" next to the "Average Degree" text in the "Network Overview" section. In my case, the resulting number is 20.943. So, the average number of connections that my testers have to each other is just under 21 connections.
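For readers who want to see what the "Average Degree" figure measures, here is a minimal sketch, assuming an undirected graph stored as a plain list of pairs. The helper names and the toy data are hypothetical. Each edge adds one connection to both of its endpoints, so the average is twice the number of edges divided by the number of nodes.

```python
from collections import Counter

def degrees(edges):
    """Count how many edges touch each node in an undirected edge list."""
    counts = Counter()
    for a, b in edges:
        counts[a] += 1
        counts[b] += 1
    return counts

def average_degree(edges, nodes):
    """Mean number of connections per node (nodes with no edges count as 0)."""
    deg = degrees(edges)
    return sum(deg[n] for n in nodes) / len(nodes)

# Fabricated toy graph, for illustration only.
nodes = ["K1", "K2", "K3", "K4"]
edges = [("K1", "K2"), ("K1", "K3"), ("K1", "K4"), ("K2", "K3")]
avg = average_degree(edges, nodes)
print(avg)  # 2.0 (8 edge endpoints spread over 4 nodes)
```

By the same arithmetic, an average of 20.943 across 229 kits would correspond to about 2,398 edges, if all 229 kits were counted as nodes.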

Then, click the "Filters" tab (just to the left of the "Statistics" tab). Click the ">" to the left of "Topology" (or double-click on "Topology") which opens a pulldown menu. Double-click on "Degree Range". This puts "Degree Range" into the "Queries" section below. It also opens up a chart at the bottom, showing the "Degree Range Settings". In the chart for my kits, the degree ranges from 4 to 67. Recall that the average was just under 21 connections. To eliminate the kits with fewer than some degree, I have to first select the part of the range that I will eliminate. I can either move the slider on the top end of the degree range to the left, or I can click on the number under that slider and enter a new number. For my initial exploration, I want to focus on the most-connected testers. So, I set the high-end number to "30". Then I click "Filter" below the chart. My graph will now show only the nodes (people/kits) with 4 to 30 connections. (Remember that clicking the magnifying glass recenters and zooms out the graph so that you can see the entire graph.)

On the left side toolbar in the "Graph" window, click the rectangular box that is the second icon from the top. Then sweep to highlight the entire graph. Then zoom in so that you can see a single node. Nodes along the outside of the circular hairball are the easiest to zoom in on. Right click on a node, and then click on "X Delete". This pops up a "Delete nodes" window, where you click "Yes" to delete all the nodes for kits with 30 or fewer connections. Then go back to the chart at the bottom right and click "Stop" where you had first clicked "Filter". Then right-click on "Degree Range" in the "Queries" section and click "Remove" to remove the filter and see what is left in your graph of your most-connected kits.
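The filter-and-delete steps above amount to dropping every node at or below a degree threshold, along with any edge that touched a dropped node. A minimal sketch of that logic follows; `drop_low_degree` is a hypothetical helper and the data is fabricated. It is not how GEPHI implements the filter.

```python
from collections import Counter

def drop_low_degree(nodes, edges, max_removed_degree):
    """Remove nodes whose degree is at or below the threshold.

    Returns the surviving nodes and the edges that connect two survivors.
    """
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    keep = {n for n in nodes if deg[n] > max_removed_degree}
    # An edge survives only if both of its endpoints survive.
    kept_edges = [(a, b) for a, b in edges if a in keep and b in keep]
    return sorted(keep), kept_edges

# Fabricated toy graph: K5 has a single connection and will be removed.
nodes = ["K1", "K2", "K3", "K4", "K5"]
edges = [
    ("K1", "K2"), ("K1", "K3"), ("K1", "K4"),
    ("K2", "K3"), ("K2", "K4"), ("K3", "K4"),
    ("K4", "K5"),
]
kept_nodes, kept_edges = drop_low_degree(nodes, edges, max_removed_degree=2)
print(kept_nodes)
print(len(kept_edges))
```

One consequence worth keeping in mind: the surviving nodes lose their edges to the deleted nodes, so their degree in the reduced graph is lower than it was in the full graph. In the toy data, K4 goes from four connections to three.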

Since the resulting graph can be skewed to one side or otherwise visually distorted, I again run the "Fruchterman Reingold" layout in the "Layout" section. This time, I reduce the "Gravity" to 5 in order to spread the graph more.

Create and Color Clusters: Now, we need to identify clusters of similar people and give the clusters different colors so that we can start to make sense of what we see and try to gain insight from the graph. In the "Statistics" tab, click on "Run" on the "Modularity" line of the "Community Detection" section. Use the defaults, and click "OK" in the popup window. This is a non-deterministic operation, so if you click "Run" again, it will give a slightly different number. It uses the Louvain community detection algorithm which Dr. David Stumpf reports in his "Graphs for Genealogists" software does a very good job of separating out the different branches of his own family tree.
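To give a feel for what a modularity number measures, the sketch below scores a given grouping of nodes using the standard modularity formula: for each cluster, the fraction of edges that fall inside it, minus the fraction expected if edges were placed at random with the same degrees. This is not the Louvain algorithm itself, which searches for a grouping with a high score. The sketch only scores a grouping you hand it, and it ignores edge weights and any resolution setting. The function name and data are hypothetical.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Modularity of a partition of an unweighted, undirected graph."""
    m = len(edges)
    inside = defaultdict(int)       # edges with both ends in the same cluster
    degree_sum = defaultdict(int)   # total degree of the nodes in each cluster
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree_sum[ca] += 1
        degree_sum[cb] += 1
        if ca == cb:
            inside[ca] += 1
    return sum(
        inside[c] / m - (degree_sum[c] / (2 * m)) ** 2
        for c in degree_sum
    )

# Fabricated toy graph: two triangles joined by a single edge (C-D).
edges = [
    ("A", "B"), ("B", "C"), ("A", "C"),
    ("D", "E"), ("E", "F"), ("D", "F"),
    ("C", "D"),
]
two_clusters = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1}
one_cluster = {name: 0 for name in "ABCDEF"}
q_split = modularity(edges, two_clusters)
q_single = modularity(edges, one_cluster)
print(round(q_split, 4), round(q_single, 4))
```

Splitting the toy graph into its two triangles scores higher than lumping everything into one cluster, which is the kind of difference a community detection run tries to exploit.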

Back on the left side tool bar, in the "Appearance" section's "Nodes" tab's "Partition" tab, select the attribute of "Modularity Class". Then click "Apply". Since the default is that the icon of an artist's palette was selected, you will see that each cluster/class has a unique color and number (starting from class 0). Once you click "Apply", your graph will show these colors applied to it.

Enhancing the Nodes and Edges: So far, our nodes and edges show no information other than the connections. We need to know which testers are in which nodes, and it will be useful to set the node's circle size based on how connected that person is to the others in the Generations Matrix. In the "Appearance" section, click on the nested circles icon at the top (to the right of the artist's palette icon). Then in the "Ranking" tab, click on the "Degree" attribute in the pulldown list. I change the minimum size to 10 and the maximum to 200. Then click "Apply". Keep in mind that, because these are the most-connected people, even the smallest circles are highly connected.
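The size ranking can be pictured as mapping each node's degree onto the chosen size range. The sketch below assumes a simple straight-line mapping from the lowest degree to the highest. That is an assumption made for illustration, and GEPHI's own interpolation may differ. The function name and the degree values are hypothetical.

```python
def scale_sizes(degrees, min_size=10.0, max_size=200.0):
    """Map each node's degree linearly onto the range min_size..max_size."""
    lo, hi = min(degrees.values()), max(degrees.values())
    if hi == lo:
        # All nodes equally connected: give them all the minimum size.
        return {node: min_size for node in degrees}
    span = hi - lo
    return {
        node: min_size + (deg - lo) / span * (max_size - min_size)
        for node, deg in degrees.items()
    }

# Fabricated degrees for three highly connected kits, for illustration only.
degrees = {"K1": 31, "K2": 49, "K3": 67}
sizes = scale_sizes(degrees)
print(sizes)
```

Under this mapping, the least-connected remaining kit gets size 10, the most-connected gets size 200, and a kit halfway between them in degree gets size 105.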

Set the node labels on the tool bar at the bottom of the "Graph" section. At the right end of the bottom toolbar is a stylized up-arrow (looks like a tiny house). Click on that to open the full toolbar. Then click on the "Labels" tab. Click the empty box to check the "Node" section. You can change the font or use the slider to make the labels larger or smaller.

What Next?

Here is how our final graph looks.

If we hover the cursor over a node, we can see all the other highly connected testers that the kit matches.

There are two main next steps.

1. You can export the graph to a PDF file where it can be zoomed and shared. This is done in the "Preview" section but requires a good deal of tweaking since it is not WYSIWYG. I would really like to find a way to export the graph to a web page where it can be shared in a way that it can be interactively explored. But I have not really looked for that yet.

2. You can see what insights the graph reveals. I have not gone deeply into this yet. But I do see that the color of the nodes/clusters (using the Louvain community identification in the modularity statistic) does indeed group the most closely related people together. In this case, I have the option of running the modularity statistic again and then re-coloring the nodes. This breaks the exact same kits into five clusters instead of three. I have not yet tried including the Generations Matrix cell values (the DNA-based estimated number of generations to MRCA) as labels or size-determinants of the edges.