Database

For this work, I used the publicly available WikiLinksGraphs datasets (Consonni et al., 2019) which contain the internal link network (only those intentionally added by editors in the main text of articles) for different dumps of Wikipedia. Specifically, this work was done with a snapshot from Wikipedia from March 1, 2018. The original dataset consists of a spreadsheet file with a size of 9.56 GB (wikilink_graph.2018- 03-01.cvs) containing a table with four columns (page_id_from, page_title_from, page_id_to, page_title_to) and 163,380,007 lines corresponding to each of the internal links (from-to) of the full Wikipedia content (see the original article (Consonni et al., 2019) for more details). From that file, I took the second and fourth columns (page titles) and imported them into MATLAB into an unweighted directed graph (digraph) with 13,680,532 nodes and 163,380,007 edges. Then it was necessary to iteratively delete the badly linked Wikipedia pages, those with zero in or outdegree that correspond to some redirect, disambiguation pages and others without relevant information for the purposes of this work. The resulting graph, representing Wikipedia’s network of internal links, was called wikiLinksClean and contains 7,879,531 nodes and 150,995,780 edges (4.17 GB). The number of nodes was always higher than the number of articles in Wikipedia (5.6M for 2018) due to the presence of many unresolved redirects (pages without “real content” which automatically redirect visitors to another page). This issue could affect the scan and has therefore been addressed in a later step as described below.

Generate a universe from seeds

For the purposes of this work, the idea is to reveal how two or more elements (concepts, people, works) are linked and connected to each other. Therefore, from these elements (Wikipedia entries hereinafter referred to as ‘seeds’), we have defined a subgraph from the wikiLinksClean digraph by taking the nearest neighbors of each seed (s), that is to say all the nodes of the digraph which are at a distance D of each node s. Since Wikipedia’s internal link network is a dense network (with an average shortest path length of around 4.1), a value of D≤ 2 avoids having irrelevant links between seeds in the subgraph. For the case studied here, the seeds are “Pablo Picasso”, “Albert Einstein” and “James Joyce”. Although we focus on the Einstein-Picasso relationship, including Joyce allows us to compare the relationships between art, science, and literature and therefore perform a more in-depth comparative analysis. So, based on these seeds, we got a subgraph (called the universe ) containing 79,454 nodes and 3,166,325 edges. Once this subgraph was defined, the algorithm solved the aforementioned problem with unresolved redirects by redirecting inputs from nodes with outer degrees equal to one to the corresponding successor node and then removing the (redirecting) nodes. While this procedure can also remove a small fraction of loosely connected nodes, their influence on the final results is negligible. Additionally, after extracting the subgraph, nodes with zero entry / exit degree have also been removed. Obtained universehad 78,444 knots and 3,159,866 edges.

Measure the relationship between nodes

From the obtained universe , we wanted to work only with NOT nodes most related to each of the given seeds. Therefore, it was necessary to define a way to measure the relationship between each pair of nodes (Wikipedia articles). It is important to note that two articles can be strongly related even though there is no direct connection between them. For example, two articles can be co-linked by other articles, or they can co-link other articles. In these cases, we say that the two articles are structurally related. Among the many metrics available to measure the distance (or relationship) between elements of a complex network, the use of normalized Google distance (Cilibrasi and Vitanyi, 2007) (NGD) provides excellent results for our purpose. It is defined as

$$ d _ {{ mathrm {in}} / { mathrm {out}}} left ({a, b} right) = frac {{{ log left ({ max left ({ left | A right |, left | B right |} right)} right) – { mathrm {log}} left ({ left | {A cap B} right |} right)} } {{ log left ({ left | W right |} right) – log left ({ min left ({ left | A right |, left | B right |} right)} right)}} $$

or a and b refer to the two articles of interest, A and B represent the sets of nodes (articles) that connect to / from (Denter exit) aand b , respectively, and Wis the total number of nodes in the universe . Log refers to the base two logarithm while | A | represents the number of nodes in A. If ( left | {A cap B} right | = 0 ), then the corresponding distance is infinite. There are two different distances between the nodes aand b : one for the nodes which link to a and b(Din(a B) ) and another for the nodes which are linked from a and b(Doutside(a B)). The total distance (hit) ) was taken as the harmonic mean between the input / output distances. Finally, the relation between the nodes a and bwas defined as r(a,b) = exp (-D(a, b )), which is always in range [0,1].

Based on this definition of the relationship between two nodes, we determined the NOTj nodes most related to each of the given seeds, with NOTJ= kJ, or kJ is the degree of seed J in thewikiLinksCleandigram. Once the NOTj the “closest” nodes of each seed were known, the kinship matrix ( R ) was calculated for the nodes of this subset of theuniverse(hereinafter called close to the universe). This matrix was then used to create an unoriented weighted graph (g) which represents the relationships between the different elements of theclose to the universe. The weight of the link connecting the nodes ( i, j ) was given by the corresponding element in the relationship matrix ( Ri, j). The resulting graphg contained (in our case) 856 nodes and 143,307 edges.

Data clustering and visualization

The way theNOTthe elements closest to each seed were chosen force the formation of clusters and decrease the inter-cluster connectivity. Therefore, the remaining links between the elements of different clusters can be considered sufficiently relevant for the purposes of this work. Nodes were assigned to different clusters based on the seed they were linked to in the original graph (wikiLinksClean). When a node was connected to more than a single seed, it was assigned to the seed to which it was most related. The resulting graph (with the identified clusters) was drawn using a force-directed layout that uses attractive forces between adjacent nodes and repulsive forces between distant nodes (Fruchterman and Reingold, 1991). The result is shown in figure 1.

Fig. 1: Map of interdisciplinary knowledge.

Each point represents a Wikipedia entry (related to people, concepts, works). The thin lines represent the links between the different elements according to the kinship defined from the standard Google distance (see Methods). The colors represent different groups (disciplines) made up of the nodes most related to the given seeds: Picasso (red), Einstein (green) and Joyce (blue). The proximity between nodes / clusters is proportional to their relationship.


Source link

Leave a Reply

Your email address will not be published.