Friday, October 23, 2015

Article Response for Lecture 10 - Underwood

Underwood, T. (2014). Theorizing research practices we forgot to theorize twenty years ago. Representations, 127, 64-72.
When we search large datasets, we run into the problem of confirmation bias. That is, if we search for an answer we already believe we know, we are likely to find it, whether or not it is correct. Given a large enough dataset, it is easy to find at least one example of any concept, no matter how fallacious. The number of examples needed to prove a point depends on the size of the dataset, yet few researchers know the size or scope of the datasets they search. Researchers come to the search process with preconceived notions of what is true and what they need to find. They arrive with very specific questions and want to find the particular answer they intuitively believe is correct. Data mining of large datasets is thus often a fishing expedition to find the select few examples that confirm our preset notions of what is true, particularly in the humanities and in linguistic studies.
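To see why sheer scale makes a confirming example almost inevitable, consider a quick back-of-the-envelope calculation (my own sketch, with an invented per-document match probability, not a figure from Underwood):

```python
# If each document independently "matches" a sought-after pattern with
# a tiny probability p, the chance of at least one hit in a corpus of
# n documents is 1 - (1 - p)**n, which races toward 1 as n grows.
p = 1e-6  # hypothetical per-document chance of a spurious match

for n in (10_000, 1_000_000, 100_000_000):
    print(f"corpus of {n:>11,} documents: P(at least one hit) = {1 - (1 - p) ** n:.4f}")
```

At ten thousand documents a spurious hit is unlikely; at a hundred million it is a near certainty, which is exactly the fishing-expedition problem described above.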
Additionally, ranking by relevancy can filter out information that disproves our preconceived notions. When we use search engines, especially full-text ones, algorithms show us immediately what we searched for, regardless of whether the underlying claim is correct. More difficult than the issue of synonym exclusion is the fact that any data that does not conform to the language of our search, and thus to our bias, is filtered down the rankings, so we are less likely to see contradictory information at all.
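A toy bag-of-words ranker (my own illustration, not any real search engine's algorithm) makes the filtering effect concrete: a document that contradicts the query in different vocabulary scores near zero and sinks to the bottom of the results:

```python
# Score each document by how many query terms it shares; this is the
# crudest form of relevance ranking, but the bias it illustrates is real.
docs = {
    "A": "industrial revolution caused rapid urban growth",
    "B": "urban growth during the industrial revolution was rapid",
    "C": "parish records show town populations grew slowly before 1850",
}
query = set("industrial revolution rapid urban growth".split())

def score(text):
    return len(query & set(text.split()))

for name, text in sorted(docs.items(), key=lambda kv: -score(kv[1])):
    print(score(text), name, "-", text)
```

Document C is the one that challenges the searcher's premise, and it is also the one ranked last.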
Scholars in the humanities often search for certain keywords, relying on the distributional hypothesis, which holds that the "meaning of a word is related to its distribution across contexts." This echoes Wittgenstein's language games, wherein meaning is determined by usage. While this approach has merit, seeing that a given word is associated X number of times with another word tells the searcher nothing about the usefulness of that association without also knowing what other words may be associated with it more frequently, what contexts the associations occur in, and how large the dataset is. These considerations are often omitted from scholarly research because of an over-reliance on search algorithms to ascertain value and truth.
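In practice the distributional hypothesis is operationalized by counting co-occurrences within some window, such as a sentence. The sketch below (my own construction, not Underwood's) shows why a raw count means little on its own: normalizing by each word's overall frequency, as pointwise mutual information (PMI) roughly does, can reverse the ranking that raw counts suggest:

```python
from collections import Counter
from itertools import combinations
import math

sentences = [
    "the gothic novel explores terror and the sublime",
    "the gothic cathedral inspires terror and awe",
    "the realist novel explores society and manners",
]

pair_counts, word_counts, total = Counter(), Counter(), 0
for s in sentences:
    words = s.split()
    word_counts.update(words)
    total += len(words)
    # Window = the whole sentence; count each unordered word pair.
    pair_counts.update(frozenset(p) for p in combinations(words, 2) if p[0] != p[1])

def pmi(w1, w2):
    # Rough PMI-style score: association strength discounted by how
    # common each word is on its own, so "the" stops dominating.
    joint = pair_counts[frozenset((w1, w2))] / total
    return math.log2(joint / ((word_counts[w1] / total) * (word_counts[w2] / total)))

for pair in (("gothic", "terror"), ("the", "terror")):
    print(f"{pair[0]}~{pair[1]}: raw count {pair_counts[frozenset(pair)]}, PMI {pmi(*pair):.2f}")
```

By raw count, "the" looks more strongly associated with "terror" than "gothic" does; once frequency is accounted for, the ordering flips.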

An algorithm should not be trusted to automatically confer authority and relevancy, because algorithms are not simply blunt instruments, tools that hammer datasets into shape; they come with their own inherent biases and limitations. Most algorithms, moreover, are proprietary, and thus their mechanisms are not subject to public scrutiny, so we have no way to contextualize the search process and give meaning to datasets of associated search terms. Computer scientists are working to address this by using topic modeling to define associations of terms into clusters, so that words can be associated with other words in given contexts. This process can reveal subjects and ideas we didn't know to look for in our initial search, and it can contribute more effectively to scholarship than simply confirming what we already thought was true.
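Topic modeling here usually means something in the family of latent Dirichlet allocation (LDA). A minimal scikit-learn sketch (my own toy corpus and parameters; the article names no specific tool) shows how term clusters emerge without the searcher naming them in advance:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "whale ship ocean voyage captain harpoon",
    "ship captain sea storm sail ocean",
    "marriage estate inheritance sister letter",
    "letter sister marriage ball estate",
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # bag-of-words term counts
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

terms = vectorizer.get_feature_names_out()
for i, weights in enumerate(lda.components_):
    top = weights.argsort()[::-1][:4]  # highest-weight terms per topic
    print(f"topic {i}:", ", ".join(terms[j] for j in top))
```

The two clusters (roughly nautical terms and domestic terms) fall out of the co-occurrence structure alone, which is what lets a topic model suggest themes the searcher did not think to look for.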

Monday, October 19, 2015

Article Response for Lecture 9 - Rotenberg & Kushmerick

Rotenberg, E., & Kushmerick, A. (2011). The author challenge: Identification of the self in the scholarly literature. Cataloging & Classification Quarterly, 49(6), 503-520.
This article begins as an effective examination of the problems with attribution of scholarly scientific publications, and then proposes a solution. The first half establishes attribution as a necessity for the allocation of government and grant funding, as well as for individual tenure decisions. However, names can be common, and different individuals can have similar names, making attribution tricky. Additionally, scientific scholarly output is increasing at a rapid pace, adding more common names to the jumble. Non-traditional forms of publication, such as web-published pieces and three-dimensional models rather than writing, are proliferating across the scientific landscape.
Several international organizations are currently working on name disambiguation in which authors themselves claim their work. The authors argue that no single company can cover all disambiguation in the world, so disambiguation must necessarily be a collaborative effort: international entities linked to one another, with authors supporting each entity in a pseudo-folksonomic fashion, can create a web of disambiguation. The system discussed in particular is Web of Science and its disambiguation community, ResearcherID.
Web of Science uses an algorithm to collocate works by a single author. The difficulty with algorithmic disambiguation within the Web of Science search engine is that it collocates incorrectly whenever authors don't stick to a strict subject matter, or when authors change names. ResearcherID is an attempt to fix this. The feedback system is described as a critical component of correct disambiguation, because human users can disambiguate such cases better than the algorithms can, without the added cost of staff time spent searching.
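The Web of Science algorithm itself is proprietary, so as a stand-in here is a deliberately simplified clustering heuristic of my own (not the actual method) that reproduces both failure modes: an author who switches fields, or changes names, fractures into multiple inferred identities:

```python
# Cluster records only when the name matches AND a subject is shared;
# this simplistic rule splits one real person into several "authors."
records = [
    {"name": "Smith, J.", "subjects": {"genetics", "proteins"}},
    {"name": "Smith, J.", "subjects": {"proteins", "enzymes"}},
    {"name": "Smith, J.", "subjects": {"medieval", "poetry"}},  # same person, new field
    {"name": "Jones, J.", "subjects": {"genetics"}},            # same person, new name
]

clusters = []
for rec in records:
    for cluster in clusters:
        if cluster[0]["name"] == rec["name"] and cluster[0]["subjects"] & rec["subjects"]:
            cluster.append(rec)
            break
    else:
        clusters.append([rec])

for i, cluster in enumerate(clusters):
    print(f"inferred author {i}:", [sorted(r["subjects"]) for r in cluster])
```

All four records could plausibly belong to a single researcher, yet the heuristic reports three distinct authors; human feedback of the kind ResearcherID collects is what re-merges the wrongly split clusters.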
ResearcherID assigns an identification number to each individual author and provides citation metrics that allow authors to disambiguate themselves. It offers interactive maps of collaborators and citations for analyzing an author's geographic spread of influence. It has been implemented in many programs and communities to help disambiguate authors as well as inventors and principal investigators. Instead of relying solely on the metadata attached to the article itself, it pulls author data from grant databases and other sources, and it allows self-disambiguation.

I find it significant that the authors do not see fit to mention NACO, or indeed any LC disambiguation, and only lend credence to disambiguation performed by authors themselves rather than by catalogers. While I agree with their assessment of the value of folksonomy-style disambiguation, I find it disingenuous not to at least mention a divergent way of doing things and its possible criticisms. The second half of the article read more and more like an advertisement for Thomson Reuters projects and products as I continued. And while the authors seem to believe that further interoperability is the sole goal of future projects, I find it significant that no mention is made of author fraud. I would think that with a folksonomy-style system this would become an issue; if it is not, that is at least worth a mention.

Sunday, October 11, 2015

Article Response for Lecture 8 - Knowlton

Knowlton, S. A. (2005). Three decades since Prejudices and Antipathies: A study of changes in the Library of Congress Subject Headings. Cataloging & Classification Quarterly, 40(2), 123-145.

This article addresses biases inherent in subject cataloging and assesses how well modern improvements have satisfied earlier objections. It points out a philosophical balancing act between the stated goals of search optimization and universal bibliographic control. Subject categories are designed to enable ease of searching, allowing users to find resources by the most common term, or the term they are most likely to search by. However, whenever one presumes to imagine what a user will search under, or what the most common term might be, personal biases and prejudices play a role in cataloging. There is a danger of normalizing a single experience, and overwhelmingly the assumed average viewpoint is white, male, heterosexual, and Christian. This bias runs the risk of stigmatizing any group outside those norms and of creating subject headings that make resources harder to find for certain users.
Specifically, Sanford Berman published one of the first widely regarded critiques of bias in Library of Congress subject headings. Since its publication, many of the modifications Berman suggested have been at least partially implemented. Terminology has also changed over the past several decades, which to some degree necessitated changes different from those Berman suggested, accounting for some of the disparity between his recommendations and the actual revisions. Additionally, the vast majority of his recommendations concerning subject headings for African-Americans and women have been implemented, perhaps reflecting the social movements of the intervening decades.
One subject area that has remained stubborn is religion. Religious subject categories without qualification are assumed to be Christian; thus, religious subheadings relating to Christianity are not qualified as such. While some instances could be considered exclusionary toward other religions, I would argue that most of those listed are subjects particular to Christianity and not subject to confusion. Obviously the term 'God' could be construed across many different religions, but a subject such as 'Virgin Birth' is mythologically associated with Christianity and does not need disambiguation. Such unnecessary disambiguations may account for some of the headings left unaddressed.
Beyond religious subjects, I noticed two other types of headings that were not addressed. Subjects under 'poor,' and many of those regarding poverty and economic disparity, were not disambiguated or revised into less offensive categories. Perhaps this reflects the lack of emphasis on socio-economic disparity until very recently in the history of social justice. Likewise, several problematic subject headings involving indigenous populations were not altered; social justice discourse in US culture has historically not emphasized international themes until very recently, so many of these headings are likely still catching up to the culture.

Ultimately, I think the alterations to LC subject headings since Berman's original study have been fairly adequate and have stayed abreast of modern social attitudes as well as can be expected for a complex cataloging structure. However, the age of the original study makes me wonder whether more recent studies have reassessed LC subject headings to see what a contemporary take on bias would reveal. If we are still using a decades-old study as a litmus test for progress, it's unsurprising that LC subject headings pass. We could use a more modern litmus test.

Sunday, October 4, 2015

Article Response for Lecture 7 - Naun

Naun, C. C. (2008). Objectivity and subject access in the print library. Cataloging & Classification Quarterly, 43(2), 83-95.
This article is extremely dense with practical and philosophical points about the nature of objectivity, the advantages and disadvantages of subject access, and print as compared to electronic resources. It has been my favorite article so far for this class. The author starts with the ideology behind libraries. Libraries are a public good because books and journals are expensive and only become more so. Giving every person access to information is a noble ideology, but "it is underwritten by logistics," specifically those of the economic market. Objectivity is one of these core ideologies underwritten by logistics.
Objective subject representation depends on our social values, and one hopes our social values position the library as an open realm of discourse where all subjects are equal. Often, however, these ideals are not enough to attain objectivity. "An attempt to capture what a document is about requires a frame of reference that may encompass a host of interests, assumptions, and values." Ideally, subject categories should always reflect the most commonly used term. Oftentimes, however, subject categories are changed to less offensive terms, even when these are not the most-used terms.
This exception creates an environment of backhanded censorship, in which controversial subjects likely to cause offense are placed under hidden vocabulary that most users don't know the words to find, because they do not arrive at the subject with the biases of the cataloger. Highly regulated subjects can also be highly normalized: things are shoved into preconceived boxes until the boxes overflow and new subject categories must be created. After all, controlled vocabularies are exclusionary by nature, choosing to use certain words over others. How can such a choice not contain bias?
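Mechanically, a controlled vocabulary is just a mapping from many entry terms to one authorized heading, which makes the exclusion easy to demonstrate (a toy thesaurus of my own devising, purely illustrative):

```python
# Entry terms -> authorized heading. A searcher whose word is missing
# from the mapping finds nothing, though relevant resources exist.
see_references = {"cars": "Automobiles", "autos": "Automobiles"}
catalog = {"Automobiles": ["Title A", "Title B"]}

def subject_search(user_term):
    heading = see_references.get(user_term.lower(), user_term)
    return catalog.get(heading, [])

print(subject_search("cars"))       # ['Title A', 'Title B']
print(subject_search("motorcars"))  # [] -- the vocabulary gap hides both titles
```

Whoever decides which entry terms make it into the mapping decides which searchers find the resources, which is precisely the bias at issue.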
Full-text searching of electronic resources can remove subject and description interpretation and thus remove one source of bias. However, natural language contains its own biases. In defense of print resources, subject classification can also remove bias: competing views are normally shelved together, so the user has many viewpoints to browse. Correct indexing follows usage, not preconception. However, librarians are also free to consider literary warrant in indexing, which is objectivity "in relationship to human discourse." This seems rather flexible, and it could be prone to misuse as well.

If indexing is done most objectively by following how users search, which users are considered? This is another potential level of bias. Users must be imagined as a "potentially diverse community of users" in order to avoid bias, yet it is logistically impossible to poll or visualize every possible type of user. Finally, the author offers a single common-sense answer to these difficult questions of impartiality and objectivity: "Impartiality does not demand infallibility so much as vigilance." In other words, it's impossible to be completely objective every time, but watching for our own biases and checking them as much as possible, while being prepared to correct mistakes, will get us much further than strict implementation of established rules.