### Generate a universe from seeds

The aim of this work is to reveal how two or more elements (concepts, people, works) are linked to one another. Starting from these elements (Wikipedia entries, hereinafter referred to as ‘seeds’), we defined a subgraph of the wikiLinksClean digraph by taking the nearest neighbors of each seed s, that is, all nodes of the digraph that lie within a distance D of s. Because Wikipedia’s internal link network is dense (its average shortest path length is around 4.1), restricting D ≤ 2 avoids irrelevant links between seeds in the subgraph. For the case studied here, the seeds are “Pablo Picasso”, “Albert Einstein” and “James Joyce”. Although we focus on the Einstein-Picasso relationship, including Joyce allows us to compare the relationships between art, science, and literature and thus to perform a more in-depth comparative analysis. From these seeds we obtained a subgraph (called the universe) containing 79,454 nodes and 3,166,325 edges. Once this subgraph was defined, the algorithm resolved the aforementioned problem of unresolved redirects by rerouting the incoming links of nodes with out-degree equal to one to the corresponding successor node and then removing the (redirecting) nodes. Although this procedure can also remove a small fraction of loosely connected nodes, their influence on the final results is negligible. Additionally, after extracting the subgraph, nodes with zero in-degree or out-degree were also removed. The resulting universe had 78,444 nodes and 3,159,866 edges.
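
The two steps above (collecting everything within D ≤ 2 of a seed, then collapsing redirect-like nodes) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the graph is a plain dictionary mapping each node to the set of nodes it links to, edge direction is ignored when measuring distance (the text does not say which direction is followed), and the function names `build_universe` and `collapse_redirects` are hypothetical. The final pruning of zero-degree nodes is not shown.

```python
from collections import deque


def build_universe(succ, seeds, max_dist=2):
    """Return the subgraph induced by all nodes within max_dist hops of a seed.

    succ maps each node to the set of nodes it links to. Edge direction is
    ignored during the search (an assumption, not stated in the text).
    """
    # Predecessor sets let the search follow links in both directions.
    pred = {}
    for u, targets in succ.items():
        for v in targets:
            pred.setdefault(v, set()).add(u)

    keep = set()
    for seed in seeds:
        dist = {seed: 0}
        queue = deque([seed])
        while queue:
            u = queue.popleft()
            if dist[u] == max_dist:
                continue  # do not expand beyond the distance limit
            for v in succ.get(u, set()) | pred.get(u, set()):
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        keep.update(dist)

    # Induced subgraph: keep only edges whose endpoints both survive.
    return {u: succ.get(u, set()) & keep for u in keep}


def collapse_redirects(succ):
    """Reroute links into out-degree-1 nodes to their successor, then drop them."""
    redirect = {u: next(iter(t)) for u, t in succ.items() if len(t) == 1}

    def resolve(v):
        # Follow chains of redirects; the seen set guards against cycles.
        seen = set()
        while v in redirect and v not in seen:
            seen.add(v)
            v = redirect[v]
        return v

    collapsed = {}
    for u, targets in succ.items():
        if u in redirect:
            continue
        resolved = {resolve(v) for v in targets}
        # Dropping self-loops created by rerouting is an assumption.
        collapsed[u] = {w for w in resolved if w != u and w not in redirect}
    return collapsed


# Toy graph with made-up node names: "d" and "e" lie more than 2 hops away.
toy = {"seed": {"a", "b"}, "a": {"c"}, "c": {"d"}, "d": {"e"}, "x": {"seed"}}
universe = build_universe(toy, ["seed"])
```

Note that `collapse_redirects` treats every node with exactly one outgoing link as a redirect, which is why, as the text says, a few loosely connected genuine articles can be lost along the way.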

### Measure the relationship between nodes

From the universe obtained, we wanted to work only with the N nodes most related to each of the given seeds. It was therefore necessary to define a way to measure the relationship between each pair of nodes (Wikipedia articles). It is important to note that two articles can be strongly related even when there is no direct link between them. For example, two articles can be co-linked by other articles, or they can both link to the same articles. In these cases, we say that the two articles are structurally related. Among the many metrics available for measuring the distance (or relationship) between elements of a complex network, the normalized Google distance (NGD; Cilibrasi and Vitanyi, 2007) provides excellent results for our purpose. It is defined as

$$d_{\mathrm{in}/\mathrm{out}}(a, b) = \frac{\log\left(\max\left(|A|, |B|\right)\right) - \log\left(|A \cap B|\right)}{\log\left(|W|\right) - \log\left(\min\left(|A|, |B|\right)\right)}$$

where a and b refer to the two articles of interest, A and B represent the sets of nodes (articles) that link to (for \(d_{\mathrm{in}}\)) or are linked from (for \(d_{\mathrm{out}}\)) a and b, respectively, and \(|W|\) is the total number of nodes in the universe. Here log denotes the base-2 logarithm, and \(|A|\) is the number of nodes in A. If \(|A \cap B| = 0\), the corresponding distance is infinite. There are thus two different distances between nodes a and b: one for the nodes that link to a and b, \(d_{\mathrm{in}}(a, b)\), and another for the nodes that are linked from a and b, \(d_{\mathrm{out}}(a, b)\). The total distance \(d(a, b)\) was taken as the harmonic mean of the in- and out-distances. Finally, the relationship between nodes a and b was defined as \(r(a, b) = \exp(-d(a, b))\), which always lies in the range [0,1].
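
As an illustration of these definitions, the sketch below computes the in- and out-distances, their harmonic mean, and the resulting relationship value. The function names and the dictionary-of-sets inputs (`pred` for incoming links, `succ` for outgoing links) are hypothetical. Two conventions are assumptions not fixed by the text: the harmonic mean treats the reciprocal of an infinite distance as zero, and a 0/0 distance (both link sets covering the whole universe) is taken to be zero.

```python
import math


def ngd(A, B, n_nodes):
    """Normalized Google distance between link sets A and B (base-2 logs)."""
    common = len(A & B)
    if common == 0:
        return math.inf  # no shared neighbors: infinite distance
    larger = max(len(A), len(B))
    smaller = min(len(A), len(B))
    denom = math.log2(n_nodes) - math.log2(smaller)
    if denom == 0:
        return 0.0  # assumption: both sets span the whole universe
    return (math.log2(larger) - math.log2(common)) / denom


def harmonic_mean(x, y):
    """Harmonic mean of two non-negative distances, allowing infinity."""
    if x == 0 or y == 0:
        return 0.0
    inv = 1 / x + 1 / y  # 1 / inf evaluates to 0.0 in Python
    return math.inf if inv == 0 else 2 / inv


def relatedness(pred, succ, a, b, n_nodes):
    """r(a, b) = exp(-d(a, b)), with d the harmonic mean of d_in and d_out."""
    d_in = ngd(pred.get(a, set()), pred.get(b, set()), n_nodes)
    d_out = ngd(succ.get(a, set()), succ.get(b, set()), n_nodes)
    return math.exp(-harmonic_mean(d_in, d_out))


# Worked check with made-up sets: |A| = 4, |B| = 2, |A ∩ B| = 2, |W| = 16
# gives (log2 4 - log2 2) / (log2 16 - log2 2) = 1 / 3.
d_example = ngd({1, 2, 3, 4}, {3, 4}, 16)
```

With these conventions, two articles that share no neighbors in either direction get an infinite distance and therefore a relationship of exactly zero.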

Based on this definition of the relationship between two nodes, we determined the \(N_j\) nodes most related to each of the given seeds, with \(N_j = k_j\), where \(k_j\) is the degree of seed j in the wikiLinksClean digraph. Once the \(N_j\) “closest” nodes of each seed were known, the relationship matrix R was calculated for the nodes of this subset of the universe (hereinafter called the close universe). This matrix was then used to create an undirected weighted graph g that represents the relationships between the different elements of the close universe. The weight of the link connecting nodes i and j was given by the corresponding element \(R_{i,j}\) of the relationship matrix. The resulting graph g contained (in our case) 856 nodes and 143,307 edges.
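
The selection and graph-construction steps can be sketched as below. This is only a sketch: `closest_nodes` and `weighted_graph` are hypothetical names, `rel` stands for any symmetric function returning the relationship between two nodes, and omitting zero-weight pairs from the edge list is an assumption (the text does not say how such pairs were handled). The scores in the toy example are made up.

```python
from itertools import combinations


def closest_nodes(seed, candidates, n, rel):
    """Return the n candidates with the highest relationship to the seed."""
    ranked = sorted(
        (c for c in candidates if c != seed),
        key=lambda c: rel(seed, c),
        reverse=True,
    )
    return ranked[:n]


def weighted_graph(nodes, rel):
    """Undirected weighted edges {frozenset({i, j}): R_ij} for R_ij > 0."""
    edges = {}
    for i, j in combinations(sorted(nodes), 2):
        weight = rel(i, j)
        if weight > 0:  # assumption: unrelated pairs get no edge
            edges[frozenset((i, j))] = weight
    return edges


# Toy relationship table with made-up values; missing pairs count as 0.
scores = {
    frozenset(("s", "a")): 0.9,
    frozenset(("s", "b")): 0.5,
    frozenset(("s", "c")): 0.1,
    frozenset(("a", "b")): 0.4,
}


def toy_rel(x, y):
    return scores.get(frozenset((x, y)), 0.0)


top_two = closest_nodes("s", ["a", "b", "c"], 2, toy_rel)
graph = weighted_graph(["s"] + top_two, toy_rel)
```

Storing each edge under a `frozenset` key makes the graph undirected by construction, since the pair (i, j) and the pair (j, i) map to the same entry.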

### Data clustering and visualization

The way in which the N elements closest to each seed were chosen forces the formation of clusters and decreases inter-cluster connectivity. The remaining links between elements of different clusters can therefore be considered sufficiently relevant for the purposes of this work. Nodes were assigned to clusters according to the seed they were linked to in the original graph (wikiLinksClean). When a node was connected to more than one seed, it was assigned to the seed to which it was most related. The resulting graph (with the identified clusters) was drawn using a force-directed layout that applies attractive forces between adjacent nodes and repulsive forces between distant nodes (Fruchterman and Reingold, 1991). The result is shown in figure 1.
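
The cluster-assignment rule can be illustrated with a short sketch. The function name `assign_clusters`, the input format (a dictionary from each node to the set of seeds it is linked to), and the toy values are all hypothetical. Two behaviors are assumptions: a node linked to no seed falls back to comparing all seeds, and ties are broken alphabetically.

```python
def assign_clusters(linked_seeds, all_seeds, rel):
    """Assign each node to one seed.

    linked_seeds maps a node to the set of seeds it is linked to in the
    original graph. A node linked to several seeds goes to the one with
    the highest relationship value.
    """
    clusters = {}
    for node, seeds in linked_seeds.items():
        # Sorting makes tie-breaking deterministic (first in alphabetical order).
        options = sorted(seeds) if seeds else sorted(all_seeds)
        clusters[node] = max(options, key=lambda s: rel(node, s))
    return clusters


# Toy data with made-up relationship values.
toy_scores = {
    ("n1", "seed_a"): 0.7,
    ("n2", "seed_a"): 0.3,
    ("n2", "seed_b"): 0.6,
}


def toy_relationship(node, seed):
    return toy_scores.get((node, seed), 0.0)


toy_links = {"n1": {"seed_a"}, "n2": {"seed_a", "seed_b"}}
clusters = assign_clusters(toy_links, ["seed_a", "seed_b"], toy_relationship)
```

For the drawing step, NetworkX's `spring_layout` is one readily available implementation of the Fruchterman-Reingold algorithm, although the text does not say which tool was used.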