Friday, October 23, 2015

Article Response for Lecture 10 - Underwood

Underwood, T. (2014). Theorizing research practices we forgot to theorize twenty years ago. Representations 127, 64-72.
When we search large sets of data, we run into the problem of confirmation bias. That is, if we search for the answer to a question we already believe we know, we are likely to find it, whether or not it is correct. Given a large enough dataset, it is easy to find at least one example of almost any concept, no matter how fallacious. The number of examples needed to prove a point depends on the size of the dataset, and few researchers know the size or scope of the datasets they search. Researchers come to the search process with preconceived notions of what is true and what they need to find. They arrive with very specific questions and want to find the particular answer they intuitively believe is correct. Mining large datasets thus often becomes a fishing expedition for the select few examples that confirm our preset notions of what is true, particularly in the humanities and in the study of linguistics.
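
To make this concrete, here is a minimal sketch (my own illustration, not from Underwood's article) of how the chance of finding at least one "confirming" example rises quickly with corpus size, even for a pattern that appears in any single document only by chance:

```python
# A minimal sketch (my illustration, not Underwood's): the probability of
# finding at least one match for a rare pattern grows fast with corpus size.

def p_at_least_one_match(per_doc_probability: float, corpus_size: int) -> float:
    """Probability that at least one of `corpus_size` documents matches a
    pattern occurring with probability `per_doc_probability` per document."""
    return 1 - (1 - per_doc_probability) ** corpus_size

# A one-in-ten-thousand pattern is nearly guaranteed to appear somewhere
# in a million-document corpus.
for n in (1_000, 100_000, 1_000_000):
    print(f"{n:>9} docs: P(at least one match) = {p_at_least_one_match(1e-4, n):.3f}")
```
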
Additionally, ranking by relevancy can filter out any information that disproves our preconceived notion. When we use search engines, especially full-text ones, algorithms immediately show us what we searched for, regardless of whether the underlying claim is correct. More troubling than the exclusion of synonyms is the fact that any data that does not conform to the language of our search, and thus to our bias, is filtered out, so we are less likely to ever see contradictory information.
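
As a hedged illustration of this filtering effect (the corpus and query below are invented for the example), a simple TF-IDF cosine ranking surfaces only documents that share the query's vocabulary, so a dissenting document phrased in different words scores zero and effectively vanishes:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "critics agreed the novel declined in this period",     # matches the search language
    "sales of long fiction actually grew year after year",  # contradicts, in different words
    "the novel declined as its popularity waned",           # matches the search language
]
query = ["novel declined"]

vectorizer = TfidfVectorizer()
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform(query)

# Rank documents by similarity to the query: the contradicting document,
# sharing no query terms, lands at the bottom with a score of 0.00.
scores = cosine_similarity(query_vector, doc_vectors)[0]
for score, doc in sorted(zip(scores, docs), reverse=True):
    print(f"{score:.2f}  {doc}")
```
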
Scholars of the humanities often search for certain keywords, relying on the distributional hypothesis, which holds that the "meaning of a word is related to its distribution across contexts." This recalls Wittgenstein's language games, wherein meaning is determined by usage. While this approach has merit, seeing that a given word is associated X number of times with another word tells the searcher little without also knowing what other words may be associated with it more frequently, what contexts the associations occur in, and how large the dataset is. These considerations are often omitted from scholarly research because of an over-reliance on search algorithms to ascertain value and truth.
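
A minimal sketch of the distributional hypothesis (toy corpus invented for illustration) shows both the idea and its limit: co-occurrence counts suggest that two words are related, but a raw count means little without the rest of the distribution around it:

```python
from collections import Counter, defaultdict

corpus = "the ship sailed the sea the boat sailed the sea the ship docked".split()
window = 2  # neighbors within two positions count as "context"

# Represent each word by the counts of words that appear near it.
contexts = defaultdict(Counter)
for i, word in enumerate(corpus):
    for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
        if j != i:
            contexts[word][corpus[j]] += 1

# "ship" and "boat" share contexts ("sailed", "sea"), hinting at relatedness,
# but any single count says little without the full profile for comparison.
print(contexts["ship"])
print(contexts["boat"])
```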

An algorithm should not be trusted to automatically confer authority and relevancy, because algorithms are not simply blunt instruments, tools that hammer datasets into shape; they come with their own inherent biases and limitations. Most algorithms, moreover, are proprietary, and thus not subject to public scrutiny of their mechanisms, so we have no way to contextualize the search process or to give meaning to datasets of associated search terms. Computer scientists are working to address this through topic modeling, which clusters associated terms so that words can be related to other words in given contexts. This process can reveal subjects and ideas we didn't know to look for in our initial search and contribute more effectively to scholarship, rather than simply confirming what we already thought was true.
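
As a rough sketch of what topic modeling does (the corpus and parameters below are my own illustration, not Underwood's method), Latent Dirichlet Allocation groups co-occurring words into topics, which can surface themes a keyword search would never have suggested:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "ships sailed the harbor and sea trade grew",
    "harbor trade and sea shipping expanded each year",
    "the poet wrote verse about love and loss",
    "love and loss fill the poet's later verse",
]

vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(counts)

# List each topic's highest-weighted words; ideally the maritime-trade and
# poetic vocabularies fall into separate clusters, unprompted by any query.
words = vectorizer.get_feature_names_out()
for topic_id, weights in enumerate(lda.components_):
    top = [words[i] for i in weights.argsort()[::-1][:4]]
    print(f"topic {topic_id}: {', '.join(top)}")
```

The word clusters emerge from co-occurrence patterns in the documents themselves, not from a searcher's query, which is exactly why they can point to ideas we did not think to look for.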
