
Stack Overflow Developer Tag Network Analysis

An Interactive Visualization of Developer Skillsets

Stack Overflow is a question and answer website for professional and enthusiast programmers. A now scrapped feature called Developer Stories allowed developers to create a profile and attach tags to it that reflected their skillset. This visualization is an attempt to show the relationships between these tags.


The first visualization is a force-directed graph (note: the graph links are undirected). Each node represents a tag that denotes a specific software tool or concept. By default, node size is proportional to the number of developers who have that tag on their "story". Link width is proportional to link strength, which is the correlation coefficient between the two tags multiplied by 100.
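As a rough illustration of the link encoding described above, the mapping from correlation to link strength and stroke width can be sketched in Python. The function names and the `scale` constant are hypothetical and chosen only for this example; the visualization's own scaling code is not shown here.

```python
def link_strength(correlation: float) -> float:
    """Link strength as described above: correlation coefficient * 100."""
    return correlation * 100


def link_width(strength: float, scale: float = 0.1) -> float:
    """Stroke width proportional to link strength.

    `scale` is an assumed constant, not a value taken from the visualization.
    """
    return strength * scale


# Example: a pair of tags with a correlation coefficient of 0.5
strength = link_strength(0.5)
width = link_width(strength)
```

Because width is a plain multiple of strength, doubling the correlation between two tags doubles the drawn width of their link.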

Hover over a node or link to see what nodes it links to, as well as information about it. Pan and zoom are also available.

Several options are also available for node size: Size, Degree, Weighted Degree, Eigenvector Centrality, Closeness Centrality, and PageRank.
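To make two of these options concrete, the sketch below computes Degree and Weighted Degree for a small undirected graph in plain Python. The edge list is a made-up toy example, not data from the actual tag network, and `degree_metrics` is a hypothetical helper name.

```python
from collections import defaultdict

# Hypothetical toy edge list: (tag_a, tag_b, link_strength)
edges = [
    ("javascript", "jquery", 60.0),
    ("javascript", "css", 45.0),
    ("css", "html", 70.0),
    ("java", "spring", 50.0),
]


def degree_metrics(edges):
    """Return (degree, weighted_degree) dicts for an undirected edge list."""
    degree = defaultdict(int)
    weighted_degree = defaultdict(float)
    for tag_a, tag_b, strength in edges:
        # Each undirected link counts once for both of its endpoints.
        for tag in (tag_a, tag_b):
            degree[tag] += 1
            weighted_degree[tag] += strength
    return dict(degree), dict(weighted_degree)


degree, weighted_degree = degree_metrics(edges)
```

In this toy graph `javascript` and `css` both have degree 2, but `css` has the larger weighted degree because its links are stronger.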


The second visualization is an adjacency matrix showing the links between tag nodes. Each square represents a link between two nodes. The color of the square indicates the group of the linked nodes, and its opacity reflects the strength of the link. For links between nodes of different groups, the mean of the two group colors is used.
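The cell coloring described above could be sketched as follows. This is only an illustration under stated assumptions: colors are averaged channel by channel in RGB, opacity is a linear function of link strength, and the helper names `hex_to_rgb` and `cell_color` are hypothetical.

```python
def hex_to_rgb(color):
    """Convert a '#rrggbb' string to an (r, g, b) tuple of ints."""
    color = color.lstrip("#")
    return tuple(int(color[i:i + 2], 16) for i in (0, 2, 4))


def cell_color(group_color_a, group_color_b, strength, max_strength):
    """Return (r, g, b, opacity) for one matrix cell.

    Assumptions: the color is the per-channel mean of the two group colors
    (identical groups yield the group color unchanged), and opacity scales
    linearly with link strength.
    """
    rgb_a = hex_to_rgb(group_color_a)
    rgb_b = hex_to_rgb(group_color_b)
    mean_rgb = tuple((a + b) // 2 for a, b in zip(rgb_a, rgb_b))
    opacity = strength / max_strength
    return mean_rgb + (opacity,)


# A link between a red group and a blue group at half the maximum strength
mixed = cell_color("#ff0000", "#0000ff", 50, 100)
```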

Hover over a link in the matrix to see information about it.

Insights

There are several interesting insights that can be drawn from observing the graph.

  • When scaling nodes by size, it is clear that JavaScript and Java are the most popular tags across all profiles.
  • When observing the grouping of nodes, the relatedness of tech skills can be seen (e.g. web development skills like JavaScript, HTML, and CSS are grouped together because they are often used together, and thus appear together on the profiles of developers who do frontend web development).
  • It is worth noting that larger nodes do not necessarily have a higher weighted degree. One could assume that the tags with the highest overall occurrence would have the strongest connections as well, but this is not always the case. Observing how the neighbors of the Java node (group 8) change in size relative to it when switching from Size to Weighted Degree scaling illustrates this.
  • When scaling nodes by weighted degree, the sizes reflect the sum of link strengths on that node. This shows the frequency of tags being mentioned together.
  • Scaling nodes by Eigenvector Centrality reveals the most influential tags. These turn out to be web development skills, both frontend and backend.
  • Scaling nodes by Closeness Centrality shows that the skills on the perimeter are mostly mentioned only alongside each other, separate from the main network. These are mostly software testing and project management tags.
  • Scaling nodes by PageRank ranks the tags by how often a user randomly following links would reach the node "page". It is interesting to see that Linux has the highest PageRank value. To me, this suggests that Linux comes up in most development environments.
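For readers unfamiliar with PageRank, the sketch below shows a minimal power-iteration version for an undirected, weighted graph, following the usual random-surfer description. It is an illustration only: the `pagerank` function, the damping factor of 0.85, the iteration count, and the toy edge list are assumptions for this example, not the method or data used to produce the values in the visualization.

```python
def pagerank(edges, damping=0.85, iterations=100):
    """Power-iteration PageRank on an undirected weighted edge list.

    Assumes positive weights; every node that appears has at least one link,
    so there are no dangling nodes to handle.
    """
    # Build a symmetric weighted adjacency map.
    neighbors = {}
    for tag_a, tag_b, weight in edges:
        neighbors.setdefault(tag_a, {})[tag_b] = weight
        neighbors.setdefault(tag_b, {})[tag_a] = weight

    nodes = list(neighbors)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    total_weight = {node: sum(neighbors[node].values()) for node in nodes}

    for _ in range(iterations):
        # Every node receives a small baseline share (the random jump).
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        # Each node passes the rest of its rank along its links,
        # split in proportion to link weight.
        for node in nodes:
            for neighbor, weight in neighbors[node].items():
                new_rank[neighbor] += damping * rank[node] * weight / total_weight[node]
        rank = new_rank
    return rank


# Hypothetical toy graph in which one tag links to all the others
toy_edges = [
    ("linux", "bash", 40.0),
    ("linux", "git", 30.0),
    ("linux", "python", 30.0),
]
ranks = pagerank(toy_edges)
top_tag = max(ranks, key=ranks.get)
```

In this toy graph the tag that every other tag links to ends up with the highest rank, which mirrors the kind of pattern described for Linux above.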

Data

Primary data sourced from Kaggle Stack Overflow Tag Network

Notes on dataset:

This data was gathered in 2016/2017, so the sizes of nodes may not reflect current development practices.
This dataset includes only a subset of tags used on Developer Stories. Tags that were used by at least 0.5% of users and were correlated with another tag with a correlation coefficient above 0.1 are included. This means that very sparsely used tags and tags that are not used with other tags were filtered out.
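The inclusion rule above can be sketched as a simple filter. This is not the code used to build the dataset: the function name, the input shapes, and the toy numbers are hypothetical, and the sketch assumes that both tags in a correlated pair must pass the usage threshold.

```python
def filter_tags(tag_user_share, correlations, min_share=0.005, min_corr=0.1):
    """Keep tags used by at least `min_share` of users that also have a
    correlation above `min_corr` with another kept tag.

    tag_user_share: {tag: fraction of users with that tag}
    correlations:   {(tag_a, tag_b): correlation coefficient}
    Assumption: both tags in a pair must meet the usage threshold.
    """
    popular = {tag for tag, share in tag_user_share.items() if share >= min_share}
    links = {
        pair: corr
        for pair, corr in correlations.items()
        if corr > min_corr and pair[0] in popular and pair[1] in popular
    }
    kept_tags = {tag for pair in links for tag in pair}
    return kept_tags, links


# Hypothetical toy inputs
shares = {"python": 0.10, "django": 0.02, "cobol": 0.001, "fortran": 0.01}
corrs = {
    ("python", "django"): 0.4,
    ("python", "cobol"): 0.3,     # dropped: cobol is used by under 0.5% of users
    ("python", "fortran"): 0.05,  # dropped: correlation is not above 0.1
}
kept_tags, links = filter_tags(shares, corrs)
```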

Methods

Visualization created with D3.js, jQuery, and Bootstrap.

Data prep done in Python (Jupyter Notebook) and Gephi.


This Block by Mike Bostock was used as a template when starting the graph visualization.
