Perhaps unsurprisingly, I have come across quite a few instances of bad metadata practice in my recent studies.
I don’t find this very surprising for two reasons. First, as a researcher in the so-called “data science” field, I rely heavily on datasets, especially open datasets, to conduct my studies. Second, people, even researchers, still have limited knowledge of and skills in metadata.
One recent example of poor metadata practice that I came across (and one of the more frustrating ones) is the dataset about TED Talks speakers published on Wikidata.
The big problem with this dataset lies in the “date of birth” field. First of all, it lacks a consistent format, which is always a huge pain when the next person (especially someone like me, who is not particularly good at R) tries to manipulate the data. Three date formats are used in this dataset, “YYYY-MM-DD”, “YYYY”, and “MM/DD/YY”, not to mention a few values of “2000s”, which really don’t help anyone much. It reminds me of another ongoing project, which analyzes the open metadata released by the Museum of Modern Art (MoMA). Similarly, that dataset contains quite a few date formats as well.
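To get a feel for how mixed the field is before trying to clean it, the formats can be counted first. Below is a minimal sketch in Python, not the code I actually used. The function names, the pattern labels, and the sample values are all made up for illustration, and I am assuming the slash format may carry either a two-digit or a four-digit year.

```python
import re
from collections import Counter

# Patterns for the formats described above. The labels are my own,
# and anything that matches none of them is reported as "other".
PATTERNS = {
    "YYYY-MM-DD": re.compile(r"\d{4}-\d{2}-\d{2}"),
    "YYYY": re.compile(r"\d{4}"),
    # Assumption: accept both two-digit and four-digit years here.
    "MM/DD/YY": re.compile(r"\d{1,2}/\d{1,2}/(\d{2}|\d{4})"),
    "decade": re.compile(r"\d{3}0s"),
}


def classify(value):
    """Return the label of the first pattern that matches the whole value."""
    text = value.strip()
    for name, pattern in PATTERNS.items():
        if pattern.fullmatch(text):
            return name
    return "other"


def birth_year(value):
    """Return the year as an int when it is unambiguous, otherwise None."""
    text = value.strip()
    if classify(text) in ("YYYY-MM-DD", "YYYY"):
        return int(text[:4])
    # Slash dates and decades are left alone rather than guessed at.
    return None


# Hypothetical sample values, not taken from the real dataset.
sample = ["1969-02-10", "1954", "11/01/06", "2000s", "unknown"]
counts = Counter(classify(v) for v in sample)
print(counts)
```

Leaving the slash dates and decade values as None is a deliberate choice in this sketch: guessing a century or a year would only hide the problem in the source data.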
What is even more annoying about the dates in this TED dataset is that some of them are clearly not dates of birth. For example, Ray Zahab is listed as born on 11/01/2006, which is plainly wrong according to his Wikipedia page. According to the same source, 11/01/2006 is the date when Zahab and two other runners started their expedition to cross the Sahara Desert on foot. The same thing happened with Kavita Ramdas, who is listed as born in 2002, which is also clearly wrong according to other sources.
I believe it is very important that metadata be organized and published in such an open way. However, the two examples above show, once again, the value of good metadata practice. At the center of that practice seems to be a specification of the meaning and format of each field. Preferably, the values would also come in a controlled form. For example, the TED dataset would be much more useful if the occupation field were based on some sort of controlled list.
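To make the idea of a field specification a bit more concrete, here is a minimal sketch in Python of what checking records against one could look like. Everything in it is an assumption for illustration: the field names, the choice of “YYYY-MM-DD” as the single allowed date format, and the short occupation list are hypothetical and do not come from Wikidata or TED.

```python
import re

# Hypothetical controlled list; a real one would be much longer
# and maintained alongside the dataset.
OCCUPATIONS = {"writer", "scientist", "entrepreneur", "musician"}

# One rule per field: a function that returns True for an acceptable value.
SPEC = {
    "date_of_birth": lambda v: re.fullmatch(r"\d{4}-\d{2}-\d{2}", v) is not None,
    "occupation": lambda v: v in OCCUPATIONS,
}


def validate(record):
    """Return the names of the fields whose values violate the spec."""
    return [
        field
        for field, is_valid in SPEC.items()
        if not is_valid(record.get(field, ""))
    ]


# Hypothetical records, not taken from the real dataset.
good = {"date_of_birth": "1969-02-10", "occupation": "writer"}
bad = {"date_of_birth": "2000s", "occupation": "Author/educator"}

print(validate(good))
print(validate(bad))
```

A check like this only covers format and vocabulary. It would not catch a well-formed but wrong date such as the expedition date above, which still needs a comparison against another source.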