Wikipedia is a tremendous resource. It is also a biased one, lacking diversity in any number of ways. Wikipedia isn’t the only source that suffers from systemic bias; most collections do, whether they belong to an art museum, a library, an archive, or another institution. Initiatives to improve representation within collections are becoming more commonplace, and they are an important first step toward improving diversity in any given collection. The best-known Wikipedia project working in this area is Women in Red, which strives to improve the representation of women on Wikipedia. The project does this by generating detailed worklists of women who are missing from, or underrepresented on, Wikipedia. But where do these lists come from? And how can projects analyze other demographic variables to improve representation?
The answer to both of these questions is another Wikimedia project called Wikidata. Wikidata provides linked data representations of nearly 100 million people, things, and events (as of December 2021), including an item for every single Wikipedia article. The Wikipedia article for the astronaut Sunita Williams, for example, has a corresponding Wikidata item. Wikidata is both human and machine readable, which means we can query all of Wikidata, based on the statements it contains, to answer any number of questions that pop into our heads. You can think of a statement as a simple sentence that describes something. An example of a statement is: Sunita Williams’ occupation is astronaut (among her many other accomplished occupations). Taking a closer look at Sunita Williams’ Wikidata item, we can see that she identifies as female, went to Needham High School, and is of Indian and Slovene descent. These are all other statements on Wikidata. Women in Red generates its lists through Wikidata queries about gender statements. As we can see from that linked list, we can get more specific with our queries and ask Wikidata about ethnicity as well. In fact, we can pick any of these statements and run queries about them (what are some occupations of Slovene Americans? How many people from Needham High School became astronauts? Spoiler: just one, for now).
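Under the hood, queries like these are written in SPARQL and run against the public Wikidata Query Service. As a minimal sketch (not Women in Red’s exact criteria, which are more elaborate), the Python snippet below asks for humans whose “sex or gender” (P21) is female (Q6581072) and whose occupation (P106) is astronaut (Q11631), using only the standard requests library:

```python
# A minimal sketch of the kind of query behind lists like Women in Red:
# humans whose "sex or gender" is female and whose occupation is
# astronaut, fetched from the public Wikidata Query Service.
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

QUERY = """
SELECT ?person ?personLabel WHERE {
  ?person wdt:P31 wd:Q5 ;        # instance of: human
          wdt:P21 wd:Q6581072 ;  # sex or gender: female
          wdt:P106 wd:Q11631 .   # occupation: astronaut
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en". }
}
LIMIT 20
"""

response = requests.get(
    SPARQL_ENDPOINT,
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "diversity-query-example/0.1"},  # WDQS asks clients to identify themselves
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["personLabel"]["value"])
```

You can paste the SPARQL portion directly into query.wikidata.org to experiment without writing any Python at all.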
We can really start to learn about representation on Wikipedia when we ask about every single person on it (thanks to Wikidata’s machine readability, this takes seconds to do). For example, we could ask which languages astronauts most commonly speak. We could see which ethnicities are best represented (or underrepresented) in certain industries. Or we could swap ethnicity for sexual orientation and generate a similar list. We can pick any variable and run a query about it. Querying also allows us to do a more detailed analysis across all the language versions of Wikipedia: we can see which language versions cover certain topics more than others, compare article size across languages, and check whether an article even exists in a particular language at all.
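To make that concrete, here is a hedged sketch of the “most common languages of astronauts” question, counting values of the “languages spoken, written or signed” property (P1412). Remember, the counts reflect only what has been recorded in Wikidata, not reality:

```python
# A sketch of an aggregate query: which languages do astronauts most
# commonly speak, according to Wikidata? Groups people by language
# (P1412) and counts distinct speakers.
import requests

QUERY = """
SELECT ?languageLabel (COUNT(DISTINCT ?person) AS ?speakers) WHERE {
  ?person wdt:P31 wd:Q5 ;        # instance of: human
          wdt:P106 wd:Q11631 ;   # occupation: astronaut
          wdt:P1412 ?language .  # languages spoken, written or signed
  ?language rdfs:label ?languageLabel .
  FILTER(LANG(?languageLabel) = "en")
}
GROUP BY ?languageLabel
ORDER BY DESC(?speakers)
LIMIT 10
"""

response = requests.get(
    "https://query.wikidata.org/sparql",
    params={"query": QUERY, "format": "json"},
    headers={"User-Agent": "diversity-query-example/0.1"},
)
response.raise_for_status()

for row in response.json()["results"]["bindings"]:
    print(row["languageLabel"]["value"], row["speakers"]["value"])
```

Swapping P1412 for another property, such as “ethnic group” (P172) or “educated at” (P69), turns the same skeleton into the other kinds of questions described above.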
Generating lists like this is enormously helpful in identifying areas for improvement on Wikipedia. Before we continue, though, we need to address two concerns. The first is that these lists do not represent reality; they represent only what is (or isn’t) on Wikipedia and Wikidata. A lot will continue to be missing, and the more we add to Wikidata, the better this will get. The second concern is that even though Wikidata holds data about sexual orientation, gender, and ethnicity, that doesn’t mean it does so perfectly, or even well. For example, the “sex or gender” property (P21) has evolved over time, but it still has a long way to go before it can accurately represent the gender spectrum for all people. Demographic data is sensitive, and there are still many questions the community is working to answer in a respectful, accurate way. It’s also true that individuals may not want this kind of data shared publicly, which may skew the numbers. This isn’t a Wikidata-specific problem: many systems and ontologies contend with it, and in fact Wikidata’s openness has allowed properties like “sex or gender” to evolve and adapt very quickly.
In spite of these concerns, what we do have is a set of tools that can help us start to understand how diverse a given corner of Wikipedia is and how we can begin to improve it. The insights that Wikipedia and Wikidata can provide go far beyond what was previously possible, and we must take advantage of them if we are ever to improve representation on these platforms.
Projects like Women in Red demonstrate that we can organize the community to begin to tackle systemic bias. If you have ideas for other ways to engage with systemic bias on Wikipedia and Wikidata, feel free to reach out to us to help make Wikipedia and Wikidata more accurate and representative of the world we live in.
Want to learn more? Enroll in our Wikidata Institute: wikiedu.org/wikidata.
Image credit: NASA, Public domain, via Wikimedia Commons