Interestingly, though not so surprisingly, I have come across quite a few instances of bad metadata practice in my recent studies. There are two reasons why I find this unsurprising: first, as a researcher in the so-called “data science” field, I rely heavily on datasets, especially open datasets, to conduct my studies; second, many people, even researchers, still have limited knowledge of and skills in metadata.

One recent example of poor metadata practice I came across (and a particularly frustrating one) is the dataset about TED Talks speakers published on Wikidata. The biggest problem with this dataset lies in the “date of birth” field. First, it fails to use a consistent format, which is always a huge pain when the next person (especially someone like me, who is not particularly good at using R) tries to manipulate the data. Three date formats appear in this dataset, “YYYY-MM-DD”, “YYYY”, and “MM/DD/YY”, not to mention a few values such as “2000s”, which don’t help anyone much. It reminds me of an ongoing project analyzing the open metadata released by the Museum of Modern Art (MoMA); very similarly, that dataset contains quite a few date formats as well.

What is even more annoying about the dates in this TED dataset is that some of them are clearly not dates of birth. For example, Ray Zahab is claimed to have been born on 11/01/2006, which is plainly wrong according to his Wikipedia page. According to the same source, 11/01/2006 is the date when Zahab and two other runners started their expedition to cross the Sahara Desert on foot. The same thing happened to Kavita Ramdas, who is claimed to have been born in 2002, which is also clearly wrong based on other sources.

I still believe it is more than worthwhile for metadata to be organized and published in such an open way.
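To show what cleaning up such a field involves, here is a minimal Python sketch of normalizing these mixed date values before analysis. The format list mirrors the ones found in the dataset; the function name and the decision to drop decade values like “2000s” are my own assumptions, not anything the dataset prescribes:

```python
from datetime import datetime

def normalize_dob(raw):
    """Normalize a raw date-of-birth string to ISO 8601 where possible.

    Returns an ISO date string, a bare year, or None when the value
    (e.g. "2000s") is too vague to interpret.
    """
    raw = raw.strip()
    # Decade values like "2000s" carry no usable birth date.
    if raw.endswith("s") and raw[:-1].isdigit():
        return None
    for fmt in ("%Y-%m-%d", "%m/%d/%Y", "%m/%d/%y", "%Y"):
        try:
            parsed = datetime.strptime(raw, fmt)
        except ValueError:
            continue
        # For bare years, keep only the year rather than inventing a month/day.
        return str(parsed.year) if fmt == "%Y" else parsed.date().isoformat()
    return None
```

For example, `normalize_dob("11/01/2006")` yields `"2006-11-01"`, while `normalize_dob("2000s")` yields `None`, so vague values can be filtered out explicitly instead of silently corrupting an analysis.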
However, the two examples above also demonstrate the value of good metadata practice. At the center of that practice seems to be the need for a specification of the meaning and format of each field. Preferably, the data would be recorded in a controlled form: the TED dataset, for example, would be so much more useful if the occupation field were based on some sort of controlled list.
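To make this concrete, a controlled occupation field could be enforced with something as simple as a membership check against the approved list. The vocabulary below is entirely hypothetical, since neither Wikidata nor TED publishes such a list for this dataset:

```python
# Hypothetical controlled vocabulary; these values are illustrative only.
CONTROLLED_OCCUPATIONS = {"writer", "physicist", "activist", "musician"}

def validate_occupation(value, vocabulary=CONTROLLED_OCCUPATIONS):
    """Return the normalized occupation if it is in the controlled list, else None."""
    normalized = value.strip().lower()
    return normalized if normalized in vocabulary else None
```

With such a check in place, a value like “Writer” would pass (normalized to “writer”), while free-text entries outside the list would be flagged for review instead of entering the dataset unchecked.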
The idea that we, as new PhD students, should write about our readings and research interests on a weekly basis has come up a few times in this term’s Info811 class. And since I have fully recovered from my ailment this week, it is the perfect time to start this project. I have two major goals for this project:
* to form a more stable style in academic writing and to improve my writing skills, and
* to improve my short-term memory

That being said, I do feel that my academic writing skills are pretty weak and my style sometimes odd, not to mention my bad memory, which always embarrasses me. To reach these goals, my weekly journal will be organized topically, based on what I read and thought about in the past week. So here we start!
I talked about this topic briefly earlier in my Chinese blog. To summarize what I wrote there: I read a few articles that try to establish higher-level metadata quality evaluation frameworks in somewhat different contexts (Moen, 1997; Guy, 2004; Hillmann, 2004; NISO, 2007; Stvilia, 2007).

Judging from these five papers alone, I feel, as Park’s study suggests, that there are some common themes across the existing studies, namely completeness, consistency, and accuracy (Park, 2010). Moreover, there are clearly connections among these frameworks in terms of their similarities and differences, their contexts, and their underlying theories, which are probably worth exploring further. But what seems more interesting is that, based on my reading, few of the current studies give the user task in a given context a significant position in their overall considerations.

This thought derives from another paper by Stvilia and others about perceptions of data quality and user tasks in the condensed matter physics community, in which the authors cite Juran’s definition of quality as “fitness for use” and state that “quality is dynamic and multidimensional.” Changes in the data items themselves or in their contexts can greatly change people’s perceptions of data/metadata quality (Stvilia, 2015).

There are many contexts in which metadata functions. In relation to my own work, I am mostly interested in two of them: how users perceive and use metadata during different steps of data curation in the context of research data, and how users perceive and use bibliographic metadata in a library catalog or digital repository. The difference between these two topics is that there seems to be a better understanding of what people do with research data: a few models of data curation have already been suggested by researchers.
The most famous one might be the DCC Curation Lifecycle Model developed by Higgins (2007). Similar efforts have certainly been made in the library metadata world, especially the user task framework identified by the FRBR Review Group, namely find, identify, select, obtain, and, most recently, explore (Riva, 2015). But as Karen Coyle argues in her recent book, there is clearly a gap between FRBR’s stated goal of developing a detailed user task model and its final outcomes (Coyle, 2015). Based on the above considerations, I feel that how users see, expect, and actually use bibliographic metadata is an interesting topic to work on in the future.
Values of scientific research
I read Ioannidis’s paper on why most research findings are false, which is the paper I will present in next week’s Info863 class (Ioannidis, 2005).

I don’t want to say too much about the paper itself, except that the author tries to prove that most research findings are false based on a statistical formula, and that it is one of the most highly cited papers I have ever read (more than 3,000 citations according to Google Scholar). What I want to discuss here instead are the “personal” academic emotions this paper aroused in me and my reflections on those emotions.

My most direct, and probably strongest, feeling after reading this paper is that even though it tries to examine the shortcomings of current “research practice,” it adopts an arrogant attitude by using the expression “research findings” in its title and throughout the paper: it doesn’t even try to limit itself to scientific research findings. This suggests that the author believes the statistical formula and findings in the paper cover all research: science, social science, and the humanities (even though at the beginning of the paper, all the examples discussed come from medical science).

I am probably not standing on firm enough ground to attack the author’s points, given my very limited knowledge of statistics and scientific studies. But I do feel that reflexivity is supposed to be one of the best characteristics of scientific communities, meaning one should always take a step, or half a step, back to think about oneself. That is not what I see in much of what I have read recently, including this paper, which tries to use a statistical model to explain everything (including things that are not themselves statistical).

Again, I am not saying that the author’s claims are wrong; honestly, I don’t know. But I can definitely see that it is a dangerous and arrogant effort.
As Latour and Woolgar, who conducted one of the first anthropological studies of scientific practice in the 1970s, put it, “scientific criticism by nonscientists is not practiced in the same way as literary criticism by those who are not novelists or poets” (Latour, 1979). And based on my knowledge of the humanities, I don’t think claims or findings there can be examined in the same way as those in scientific studies.
Coyle, K. (2015, September 22). Coyle’s InFormation: FRBR Before and After – Afterword. Retrieved from http://kcoyle.blogspot.com/2015/09/frbr-before-and-after-afterword.html

Guy, M., Powell, A., & Day, M. (2004). Improving the Quality of Metadata in Eprint Archives. Ariadne, (38). Retrieved from http://www.ariadne.ac.uk/issue38/guy

Hillmann, D. I., & Bruce, T. R. (2004). The Continuum of Metadata Quality: Defining, Expressing, Exploiting. ALA Editions. Retrieved from http://ecommons.cornell.edu/handle/1813/7895

Ioannidis, J. P. A. (2005). Why Most Published Research Findings Are False. PLoS Medicine, 2(8), e124. http://doi.org/10.1371/journal.pmed.0020124

Latour, B., & Woolgar, S. (1979). Laboratory life: the social construction of scientific facts. Beverly Hills: Sage Publications.

Moen, W. E., Stewart, E. L., & McClure, C. R. (1997). The Role of Content Analysis in Evaluating Metadata for the U.S. Government Information Locator Service (GILS): Results from an Exploratory Study. Retrieved November 1, 2015, from http://digital.library.unt.edu/ark:/67531/metadc36312/

NISO Framework Working Group. (2007). A framework of guidance for building good digital collections. Retrieved from http://www.niso.org/publications/rp/framework3.pdf

Riva, P., & Žumer, M. (2015). Introducing the FRBR Library Reference Model. Retrieved from http://library.ifla.org/1084/

Stvilia, B., Gasser, L., Twidale, M. B., & Smith, L. C. (2007). A framework for information quality assessment. Journal of the American Society for Information Science and Technology, 58(12), 1720–1733. http://doi.org/10.1002/asi.20652

Stvilia, B., Hinnant, C. C., Wu, S., Worrall, A., Lee, D. J., Burnett, K., … Marty, P. F. (2015). Research project tasks, data, and perceptions of data quality in a condensed matter physics community. Journal of the Association for Information Science and Technology, 66(2), 246–263. http://doi.org/10.1002/asi.23177
Welcome to my website!
This website is created by Kai Li, MLIS, who works at Ingram Content Group as a cataloger. The website is, and will be, about topics in the information science field, especially metadata and information/data visualization.