Underwood, T. (2014). Theorizing research practices we forgot to theorize twenty years ago. Representations, 127, 64–72.
When we search large datasets, we run into the problem of confirmation bias. That is, if we search with a question we already believe we know the answer to, we are likely to find that answer, whether or not it is correct. Given a large enough dataset, it is easy to find at least one example of almost any claim, no matter how fallacious. The number of examples needed to prove a point depends on the size of the dataset, yet few researchers know the size or scope of the datasets they search. Researchers come to the search process with preconceived notions of what is true and what they need to find. They arrive with very specific questions and want to find the particular answer they intuitively believe is correct. Data mining of large datasets is often a fishing expedition for the select few examples that confirm preset notions of what is true, particularly in the humanities and in linguistics.
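A toy calculation makes the scale problem concrete. Here is a minimal Python sketch, with invented numbers, of why a spurious "confirming" example is all but guaranteed once a corpus is large enough:

```python
# A toy illustration (hypothetical numbers): even a very rare spurious
# pattern is almost certain to appear somewhere in a large enough corpus.
# If each document independently "matches" a false claim with probability p,
# the chance of finding at least one match is 1 - (1 - p)^N.

def chance_of_at_least_one_match(p: float, n_docs: int) -> float:
    """Probability that at least one of n_docs documents matches."""
    return 1 - (1 - p) ** n_docs

for n_docs in (1_000, 100_000, 10_000_000):
    print(f"{n_docs:>10,} docs: {chance_of_at_least_one_match(1e-5, n_docs):.4f}")

# 1,000 docs:      ~0.01  -- the false pattern rarely shows up
# 100,000 docs:    ~0.63  -- better than even odds
# 10,000,000 docs: ~1.00  -- a "confirming" example is all but guaranteed
```

The same match rate that is negligible in a small archive becomes a certainty at scale, which is exactly why a researcher who does not know the corpus size cannot judge what a handful of hits proves.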
Additionally, ranking by relevancy can filter out information that disproves our preconceived notions. When we use search engines, especially full-text ones, algorithms immediately show us what we searched for, regardless of whether the underlying claim is correct. More difficult than the issue of synonym exclusion is the problem that every piece of data that does not conform to the language of our search, and thus to our bias, is filtered out, so we are less likely to see any contradictory information.
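A crude term-overlap ranker, sketched below with an invented corpus, shows the mechanism: a document that contradicts the query but shares none of its vocabulary scores zero and sinks out of sight.

```python
# A minimal sketch of why naive relevance ranking reinforces the query's
# framing. The corpus and query are invented for illustration.

from collections import Counter

docs = [
    "the novel declined sharply in the Victorian era",          # echoes the query
    "novel publication in the Victorian era declined",          # echoes the query
    "print runs of fiction actually grew throughout the 1800s", # contradicts, but shares no query words
]

def score(query: str, doc: str) -> int:
    """Count how many query terms appear in the document (crude term overlap)."""
    doc_terms = Counter(doc.split())
    return sum(doc_terms[t] for t in query.split())

query = "novel declined Victorian era"
for doc in sorted(docs, key=lambda d: score(query, d), reverse=True):
    print(score(query, doc), "|", doc)

# The contradicting document scores 0 and falls to the bottom of the results,
# even though it is the one that would challenge the searcher's assumption.
```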
Scholars in the humanities often search for certain keywords, relying on the distributional hypothesis, which holds that the “meaning of a word is related to its distribution across contexts.” This echoes Wittgenstein’s language games, wherein meaning is determined by use. While this approach has merit, seeing that a given word is associated X number of times with another word tells the searcher nothing about that association’s significance without also knowing what other words may be associated with it more frequently, what contexts the associations occur in, and how large the dataset is. These considerations are often omitted from scholarly research because of an over-reliance on search algorithms to ascertain value and truth.
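Pointwise mutual information (PMI) is one standard way to normalize a raw co-occurrence count against those base frequencies (one common technique, not one the article prescribes); the sketch below, with invented counts, shows why the count alone is uninterpretable.

```python
# A sketch of the point above: a raw co-occurrence count means little without
# the base frequencies behind it. Pointwise mutual information (PMI) is one
# standard way to normalize; the toy counts here are invented.

import math

total_windows = 100_000   # hypothetical number of context windows in the corpus
count_a = 5_000           # windows containing word A (a common word)
count_b = 50              # windows containing word B (a rare word)
count_ab = 40             # windows containing both

def pmi(n_ab: int, n_a: int, n_b: int, n_total: int) -> float:
    """log2( P(a,b) / (P(a) * P(b)) ): > 0 means the pair co-occurs
    more often than chance would predict."""
    p_ab = n_ab / n_total
    p_a = n_a / n_total
    p_b = n_b / n_total
    return math.log2(p_ab / (p_a * p_b))

print(pmi(count_ab, count_a, count_b, total_windows))
# ~4.0: here 40 co-occurrences is strong evidence -- but only because we also
# know how often each word appears on its own and how big the corpus is.
```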
An algorithm should not be trusted to automatically confer authority and relevancy, because algorithms are not simply blunt instruments, tools that hammer datasets into shape; they come with their own inherent biases and limitations. Most algorithms, moreover, are proprietary, and thus not subject to public scrutiny of their mechanisms, so we have no way to contextualize the search process and give meaning to datasets of associated search terms. Computer scientists are working to address this by using topic modeling to group associated terms into clusters, so that words can be linked to other words in given contexts. This process can reveal subjects and ideas we didn’t know to look for in our initial search, and it can contribute more effectively to scholarship, rather than simply confirming what we already thought was true.
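As an illustration, here is a minimal topic-modeling sketch using scikit-learn’s LDA implementation (a common choice, not one the article specifies) on an invented four-document corpus:

```python
# A minimal topic-modeling sketch using scikit-learn's LDA implementation
# (assumed installed; the tiny corpus is invented). Real studies need far
# larger corpora and careful preprocessing.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

corpus = [
    "the novel and the poem shaped Victorian literature",
    "poetry and fiction dominated Victorian literary culture",
    "steam engines and railways transformed industrial cities",
    "factories and railways drove the industrial economy",
]

# Turn documents into word-count vectors, dropping common English stop words.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(corpus)

# Fit a two-topic model; each topic is a probability distribution over words.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
lda.fit(X)

words = vectorizer.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = [words[j] for j in topic.argsort()[-4:][::-1]]
    print(f"topic {i}: {', '.join(top)}")
# Clusters like (novel, poetry, literature) vs. (railways, industrial, ...)
# emerge without the researcher having searched for those terms.
```

Because the clusters fall out of the co-occurrence structure rather than out of a keyword the researcher typed, the output can surface themes the initial query never anticipated.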