We have wrapped up another round of excellent work with the University of Virginia (UVA) Data Science capstone project. Capstone work entails having students collaborate with community partners using data science methodology and some powerful computing to provide new insights about a dataset. This is Wiki Education’s second round working with a UVA capstone group and I’m excited to share their hard work with you. I want to acknowledge the hours of processing, analyzing, and making sense of Wikidata’s data that the UVA team – Quinton Mays, Antoine Edelman, and Olu Omosebi – did. They were an excellent team and I’m proud of their work.
This group started with a classic challenge on Wikidata: how do we know what we are describing (given a little data, can we guess what a thing/entity is?) and what properties do we use to describe any given thing? Phrased differently – how do we know how complete or incomplete something is? This is hard to answer for many reasons.
- There are millions of different kinds of things in Wikidata (people, countries, organizations)
- There are multiple ways to describe these things (how do you describe an organization?)
- Even if you know what something is, how do you know what’s missing or what to add to it? (is this a complete description of an organization?)
Sounds tough, but I’ve got good news. Even if we know very little about an item, a little data science magic can predict a lot about what your mystery item may be.
In their paper, “Review of Knowledge Graph Embedding Models for Link Prediction on Wikidata Subsets,” this group analyzed different subsets of items on Wikidata (countries, people, bridges, and films, to name a few). They ran several algorithms on these sets to sort the items and guess what each one may be. They found that some algorithms worked better than others, and they recommend the stronger performers for future use in prediction tools. This could have an impact on evaluating data quality, consistency, and item completeness, which are essential metrics on Wikidata. So how did they do this?
Let’s take a look at those subsets they selected. From this list you can start to guess how Wikidata describes these things. Countries and bridges have locations. Humans must have a place of birth. Films almost always have a director and actors. Bridges must start somewhere and end somewhere. The set of properties used to describe a kind of thing is known as a schema or shape (don’t think geometry; think of a specific set of things used to explicitly define or describe something). Their research takes these shapes into account and also considers how items relate to other items. Sticking with humans as the example, a specific person has a two-way relationship with their parents. A date of birth is a one-way relationship. And a teacher of a class of students is a one-to-many relationship. For the information architecture superfans, the term for these kinds of relationships is cardinality. The group analyzed item cardinality and data models among these subsets within Wikidata.
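To make cardinality a little more concrete, here is a minimal Python sketch that labels each property as one-to-one, one-to-many, or many-to-many by counting how many objects each subject points to and how many subjects point to each object. The names, properties, and the `classify_cardinality` helper are all made up for illustration; this is not the UVA team’s code, and real Wikidata statements use item and property IDs rather than plain labels.

```python
from collections import defaultdict

# Hypothetical (subject, property, object) triples, loosely modeled on
# Wikidata statements. None of these are real Wikidata items.
TRIPLES = [
    ("Alice", "date of birth", "1990-01-01"),
    ("Bob", "date of birth", "1985-05-05"),
    ("Ms. Lee", "teacher of", "Alice"),
    ("Ms. Lee", "teacher of", "Bob"),
    ("Alice", "child of", "Carol"),
    ("Alice", "child of", "Dan"),
    ("Eve", "child of", "Carol"),
]


def classify_cardinality(triples):
    """Label each property by how subjects and objects pair up."""
    objects_per_subject = defaultdict(lambda: defaultdict(set))
    subjects_per_object = defaultdict(lambda: defaultdict(set))
    for subject, prop, obj in triples:
        objects_per_subject[prop][subject].add(obj)
        subjects_per_object[prop][obj].add(subject)

    labels = {}
    for prop in objects_per_subject:
        # Does any single subject point to more than one object?
        many_objects = any(len(v) > 1 for v in objects_per_subject[prop].values())
        # Is any single object pointed to by more than one subject?
        many_subjects = any(len(v) > 1 for v in subjects_per_object[prop].values())
        left = "many" if many_subjects else "one"
        right = "many" if many_objects else "one"
        labels[prop] = f"{left}-to-{right}"
    return labels


if __name__ == "__main__":
    for prop, label in classify_cardinality(TRIPLES).items():
        print(f"{prop}: {label}")
```

On this toy data, “teacher of” comes out as one-to-many because one teacher points to several students, while “child of” comes out as many-to-many because a child has two parents and a parent can have several children.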
So can something as small as analyzing these basic relationships reveal that much? It turns out that this is foundational for identification and recommendation features. Adding complexity reveals more and more about data models and makes identification easier and more accurate. In their analysis, they ran fifty-four different algorithms to analyze and identify items. A major takeaway is that these algorithms can successfully process Wikidata at this level, but selecting subsets (a set of humans, a set of countries) will likely yield better results since there is more consistency within those subsets. Subsets also process faster and require fewer resources. The paper details their rationale for rating these algorithms, and they recommend a few for link prediction on Wikidata. Best of all? They share all of their findings as a set of analysis tools on a GitHub page for anyone to use.
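For readers curious what link prediction with a knowledge graph embedding looks like, here is a rough sketch using the scoring idea behind TransE, one well-known embedding model. This post does not list which models the team evaluated, so treating TransE as representative is an assumption. In TransE, every item and every property is a vector, and a statement is plausible when the subject vector plus the property vector lands close to the object vector. The two-dimensional vectors below are hand-picked for illustration; a real model learns them from many existing statements.

```python
import math

# Hypothetical, hand-picked 2-dimensional embeddings. A trained model would
# learn vectors like these (in far more dimensions) from existing statements.
ENTITY = {
    "Golden Gate Bridge": (0.0, 0.0),
    "San Francisco": (1.0, 0.0),
    "Paris": (0.0, 1.0),
    "Jane Doe": (0.9, 0.9),
}
RELATION = {"located in": (1.0, 0.1)}


def transe_score(head, relation, tail):
    """TransE-style score: negative distance between (head + relation) and tail.

    Scores closer to zero mean the statement looks more plausible.
    """
    h, r, t = ENTITY[head], RELATION[relation], ENTITY[tail]
    return -math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def predict_tail(head, relation):
    """Rank every other entity as a candidate object, best guess first."""
    candidates = [entity for entity in ENTITY if entity != head]
    return sorted(
        candidates,
        key=lambda entity: transe_score(head, relation, entity),
        reverse=True,
    )


if __name__ == "__main__":
    for candidate in predict_tail("Golden Gate Bridge", "located in"):
        score = transe_score("Golden Gate Bridge", "located in", candidate)
        print(f"{candidate}: {score:.3f}")
```

With these toy vectors, “San Francisco” ranks first as the missing location. Evaluating a model then comes down to checking how often the true answer lands near the top of rankings like this one.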
Let’s return to our initial question: how do we know what we’re describing? It turns out analyzing basic relationships between sets of things can reveal a great deal about those things. Since Wikidata is machine readable, knowing about these relationships can allow for the creation of recommendation tools (like Recoin) so Wikidata community members can make better edits on Wikidata. These kinds of tools could also be used to identify erroneous information and take guesses at what an unknown item or entity may be. All of this encourages better data consistency, quality, and completeness.
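As a loose illustration of how a recommendation tool might work, the sketch below suggests missing properties for an item by looking at which properties are most common among other items of the same kind. It is a simple frequency count in the spirit of completeness tools, not Recoin’s actual implementation and not the UVA team’s method. The film names, the `suggest_missing` function, and the 50% threshold are all assumptions made for the example.

```python
from collections import Counter

# Hypothetical items of one class (films) and the properties each already has.
FILMS = {
    "Film A": {"director", "cast member", "publication date"},
    "Film B": {"director", "cast member", "genre"},
    "Film C": {"director", "publication date"},
    "Film D": {"director", "cast member", "publication date", "genre"},
}


def suggest_missing(item_props, peers, min_share=0.5):
    """Suggest properties the item lacks that many of its peers have.

    Returns (property, share of peers using it) pairs, most common first.
    """
    counts = Counter(prop for props in peers.values() for prop in props)
    total = len(peers)
    suggestions = [
        (prop, count / total)
        for prop, count in counts.items()
        if prop not in item_props and count / total >= min_share
    ]
    # Sort by share (descending), then alphabetically for stable output.
    return sorted(suggestions, key=lambda pair: (-pair[1], pair[0]))


if __name__ == "__main__":
    # A sparse film item that only has a director so far.
    for prop, share in suggest_missing({"director"}, FILMS):
        print(f"Consider adding '{prop}' (used by {share:.0%} of similar items)")
```

The same counts can be read the other way around: a property that almost no similar item uses may be worth a second look as a possible error.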
As great as Wikidata is, it’s not perfect. The community regularly deals with inconsistency, missing descriptions, and data that’s misplaced, out of date, or just wrong. This kind of work from UVA is exactly what is needed to make Wikidata even better. There’s a lot of work left to do, and tools like the ones this group produced are an important step in engaging with some towering Wikidata challenges. We hope that the Wikidata community (and others outside of it) finds these tools and approaches helpful in analyzing other knowledge bases and using the results to improve the data even more.
A special thanks again to Olu, Quinton, and Antoine, and the UVA data science department for supporting this work.
Want to learn Wikidata or brush up on your skills? We have online training courses starting in January, March, and May 2023. Visit wikiedu.org/learn to learn more.