Reflections on library-based topic modeling from the Principal Investigator

As the PI for the IMLS Planning Grant titled Investigating the National Need for Library Based Topic Modeling Discovery Systems, I want to thank the onsite and online participants of the four grant workshops and the 324 colleagues, faculty, librarians, and software engineers who responded to our survey on Machine Learning (ML).

On behalf of everyone involved, I would like to thank IMLS for this wonderful opportunity to investigate the applications of ML in scholarship and research.

The workshops were intended to facilitate conversations among research faculty, librarians, and computer engineers in order to:

  • understand and document the current ML practices or projects of each group and
  • identify possibilities for using ML to enhance or augment library classification as a means of meeting cross-disciplinary research needs.

Machine Learning, Artificial Intelligence, and knowledge creation

ML is a branch of Artificial Intelligence (AI), and perhaps one of its most essential pieces.

Its applications are all around us, from optical character recognition to self-driving cars. We currently see numerous applications of ML in knowledge creation, management, curation, and discovery, and ML holds great potential to support these areas further in the future.

For example, while some ML methods, such as Natural Language Processing in data-mining and text-mining, are currently leveraged for scholarship and research, there are more advanced ML techniques, such as Deep Neural Networks, Recurrent Neural Networks, Reinforcement Learning, and Natural Language Understanding, that we may eventually leverage for scholarship and research.

Ethical and privacy implications in Machine Learning / Artificial Intelligence design

Beyond its applications in the library and information science field, I believe that all academic stakeholders should be part of ML/AI conversations in general. Because ML/AI algorithms are often complex and opaque, we also need to consider the ethics and unintended consequences of ML.

Since I am both a librarian and an IT practitioner, the impacts of ML are of particular importance to me. Librarians operate on a set of values, such as user privacy, intellectual freedom, and social responsibility. It is therefore critical that we first understand how ML works and then take part in designing its applications, ensuring that the technology is implemented in the best possible way to promote equal access, diversity, and freedom of speech.

Inspiration for the IMLS planning grant to explore Machine Learning

This grant idea was inspired by Convocate, a collaborative endeavor between the Klau Center for Civil and Human Rights and the Hesburgh Libraries, both at the University of Notre Dame. In that project, we brought together two discrete corpora, one from human rights law and one from Catholic social teaching, with the intent of finding relevant connections between the two.

Experts from law, theology, and the social sciences, librarians, and software engineers worked diligently for almost four years to launch the project. Faculty and their students spent time and effort in and outside the classroom to explore the connections between the two fields.

With help from librarians, subject domain experts created subject headings/controlled keywords for a taxonomy system, not at the document level but at the full-text/paragraph level. It was a very successful academic exercise, with a promising result for furthering the study of this cross-disciplinary area.

However, we recognized the labor-intensiveness of the project, which would only grow as new documents were approved for inclusion.

The team was open to text-mining and other computing techniques to classify more documents, thereby making them quicker to find and interpret. The prior work by experts in identifying topics, together with the paragraph tagging done by students, laid the groundwork for transforming the project into an ML initiative.

  • Within the context of building the Convocate site, we had identified commonalities between the two disciplinary areas and mapped a set of controlled vocabularies that could become ML features.
  • We also had hundreds of documents and, more specifically, tagged paragraphs, which could serve as a training set.
  • We were equipped to create an ML hypothesis algorithm, which could interpret new materials to predict connections and discover new relationships.
  • We were also able to construct an algorithm to evaluate the accuracy of the prediction and improve the learning over time.
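
As a rough illustration of how the four pieces listed above fit together (features, a tagged training set, a predictive model, and an evaluation step), the following is a minimal sketch and not the Convocate implementation. It assumes a simple Naive Bayes text classifier, which is a stand-in technique chosen for brevity. The tags "labor" and "education", the sample paragraphs, and the function names are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase a paragraph and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train(tagged_paragraphs):
    """Count word occurrences per tag from (paragraph, tag) pairs."""
    word_counts = defaultdict(Counter)
    tag_counts = Counter()
    vocabulary = set()
    for text, tag in tagged_paragraphs:
        tokens = tokenize(text)
        word_counts[tag].update(tokens)
        tag_counts[tag] += 1
        vocabulary.update(tokens)
    return word_counts, tag_counts, vocabulary


def predict(model, text):
    """Return the most probable tag for a new paragraph (Naive Bayes)."""
    word_counts, tag_counts, vocabulary = model
    total_paragraphs = sum(tag_counts.values())
    best_tag, best_score = None, -math.inf
    for tag in sorted(tag_counts):
        # Log prior: share of training paragraphs carrying this tag.
        score = math.log(tag_counts[tag] / total_paragraphs)
        denominator = sum(word_counts[tag].values()) + len(vocabulary)
        for token in tokenize(text):
            if token in vocabulary:  # words never seen in training are skipped
                # Add-one smoothing avoids zero probabilities.
                score += math.log((word_counts[tag][token] + 1) / denominator)
        if score > best_score:
            best_tag, best_score = tag, score
    return best_tag


def accuracy(model, held_out):
    """Fraction of held-out paragraphs whose predicted tag matches the expert tag."""
    correct = sum(predict(model, text) == tag for text, tag in held_out)
    return correct / len(held_out)


# Hypothetical tagged paragraphs standing in for the expert-tagged training set.
training_set = [
    ("Workers have the right to a just wage and safe conditions", "labor"),
    ("Trade unions protect the dignity of workers", "labor"),
    ("Every child has the right to free primary education", "education"),
    ("Schools and teachers serve the education of the child", "education"),
]

# Hypothetical held-out paragraphs used only to evaluate the predictions.
held_out = [
    ("A just wage respects the dignity of workers", "labor"),
    ("The education of every child requires schools", "education"),
]

model = train(training_set)
print(predict(model, "A just wage respects the dignity of workers"))
print(accuracy(model, held_out))
```

In practice the controlled vocabulary would supply far richer features than raw word counts, and a paragraph could carry several tags at once, so a production system would likely look quite different from this sketch.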

The resulting process is a far less labor-intensive approach to discovering new materials, so our students, faculty, and librarians can focus on more valuable high-level work.

Visit the Convocate website.
View the CNI presentation about the Convocate project.
Read the IFLA article titled “Topic or Metadata Modeling for Cross-Disciplinary Scholarship: Challenges and Opportunities for Academic Libraries.”

Many voices broaden the potential for the impact of Machine Learning

Our original intent for the grant was to focus on ML for the discovery of multidisciplinary research, but as we reviewed the survey results, we learned that professionals were interested in many varied applications of ML.

As such, we chose to broaden the scope of the grant and identify the greater potential for the impact of ML on scholarship. No doubt, more ML methods will emerge in relation to scholarship, and they will provide greater capabilities for improving learning, research, and knowledge discovery.

For our final activity related to this IMLS grant, we bring this group together in an effort to record and publish some of the knowledge we have discovered through this process. Our hope is that collecting a series of essays will help to advance the discussion of ML and its applications within academia.

John Wang
Associate University Librarian