Digital Humanities and the Ladino Press
This talk focuses on a project called Newspaper Navigator, which uses AI to identify and extract visual content from digitised historic newspapers. The speaker, Benjamin Lee, begins by discussing the massive scale of digitised newspaper collections like Chronicling America, which contains over 20 million pages of historic American newspapers. He highlights the challenge of navigating these vast collections and argues that AI can offer new ways to browse and search these materials.
Newspaper Navigator Project Overview
Lee explains how his project, Newspaper Navigator, leverages machine learning to automatically identify and extract visual content from newspaper pages. He outlines the project's development, which began with a crowdsourcing initiative by the Library of Congress called "Beyond Words". Volunteers in this initiative drew bounding boxes around five types of visual content: photos, illustrations, maps, comics, and editorial cartoons. Lee saw the potential of using these annotations to train a machine-learning model to automatically perform this task.
The Newspaper Navigator project involves several key steps:
Visual Content Recognition: A machine-learning model is trained to identify and draw bounding boxes around seven categories of visual content: photos, illustrations, comics, cartoons, maps, headlines, and advertisements.
Extraction and Captioning: Once visual content is identified, the pipeline extracts the corresponding text from the page's OCR (optical character recognition) data to serve as a caption for each image.
Search System: A search application allows users to browse the extracted visual content using both keyword search and visual similarity search powered by AI.
Public Access: The extracted visual content, data set, and code are made publicly available, enabling researchers and the public to explore and use these resources.
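The captioning step above pairs each detected box with the OCR text that overlaps it. A minimal sketch of that idea in plain Python, assuming a hypothetical data layout in which each OCR word carries its own bounding box (the real pipeline's formats differ):

```python
def overlap_area(a, b):
    """Intersection area of two boxes given as (x0, y0, x1, y1)."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0, w) * max(0, h)

def caption_for_box(image_box, ocr_words, min_overlap=0.5):
    """Collect OCR words whose boxes mostly fall inside a detected image box.

    ocr_words: list of (text, box) pairs, box = (x0, y0, x1, y1).
    A word is kept if at least `min_overlap` of its area lies in image_box.
    """
    kept = []
    for text, box in ocr_words:
        area = (box[2] - box[0]) * (box[3] - box[1])
        if area > 0 and overlap_area(image_box, box) / area >= min_overlap:
            kept.append(text)
    return " ".join(kept)

# Example: one detected photo box and three OCR words, two inside it.
words = [("Gen.", (10, 10, 40, 20)), ("Grant", (45, 10, 90, 20)),
         ("WEATHER", (200, 10, 260, 20))]
print(caption_for_box((0, 0, 100, 100), words))  # → Gen. Grant
```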
Examples and Applications
Lee provides examples of how the Newspaper Navigator model successfully identifies visual content on various newspaper pages. He also shares an anecdote highlighting the project's value for digital humanities research. By simply querying the data set for maps from 1861 to 1865, researchers could quickly access a large collection of Civil War maps, a task that would have been extremely time-consuming using traditional search methods.
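The Civil War maps anecdote boils down to a structured query over per-image metadata. A toy sketch of such a filter, with entirely hypothetical field names (the published data set's schema differs):

```python
# Hypothetical per-image metadata records; field names are illustrative only.
records = [
    {"category": "map", "year": 1862, "file": "va_campaign.jpg"},
    {"category": "map", "year": 1898, "file": "cuba.jpg"},
    {"category": "photograph", "year": 1863, "file": "portrait.jpg"},
]

civil_war_maps = [r for r in records
                  if r["category"] == "map" and 1861 <= r["year"] <= 1865]
print([r["file"] for r in civil_war_maps])  # → ['va_campaign.jpg']
```

Because the visual category and date are explicit fields rather than words on the page, the query succeeds even where keyword search over noisy OCR would not.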
The presentation includes a live demo of the Newspaper Navigator search application. Lee demonstrates how users can search by keyword, browse results, and create collections of images. He also shows how the visual similarity search feature allows users to train an "AI Navigator" to retrieve visually similar content, going beyond the limitations of keyword-based searches.
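The "AI Navigator" idea can be sketched as ranking image embeddings by similarity to the user's chosen examples. This toy version uses cosine similarity against the centroid of the positive examples; the actual application trains a lightweight model, so treat this purely as an illustration of similarity-based retrieval:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def centroid(vectors):
    """Component-wise mean of a list of vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def rank_by_similarity(positives, candidates):
    """Rank (name, embedding) candidates by cosine similarity to the
    centroid of the user's positive examples, most similar first."""
    query = centroid(positives)
    return sorted(candidates, key=lambda item: -cosine(query, item[1]))

# Toy 2-D embeddings: two "map-like" positives, three candidates.
positives = [[1.0, 0.1], [0.9, 0.2]]
candidates = [("ad", [0.1, 1.0]), ("map_a", [1.0, 0.0]), ("map_b", [0.8, 0.3])]
print([name for name, _ in rank_by_similarity(positives, candidates)])
# → ['map_a', 'map_b', 'ad']
```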
Expanding to Ladino Newspapers
Lee concludes by discussing how he applied the Newspaper Navigator approach to a collection of Ladino newspapers at the University of Washington's Stroum Center for Jewish Studies. He notes that OCR engines often struggle with Ladino text, making it difficult to search these materials using keywords. However, the visual content extraction method allows researchers to explore these newspapers in new ways.
By clustering images based on visual content, Lee found patterns of reproduced advertisements across various Ladino newspaper titles. This type of analysis can offer insights into the commercial networks and cultural trends within Sephardic communities.
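One simple way to surface reproduced advertisements like this is to group near-identical image embeddings. The greedy threshold clustering below is a minimal sketch, not Lee's actual method, assuming embeddings where reprinted copies of the same ad land very close together:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def cluster_by_similarity(items, threshold=0.95):
    """Greedy single-pass clustering: each (name, embedding) joins the
    first cluster whose representative it matches above `threshold`,
    otherwise it starts a new cluster. Near-identical images (e.g. the
    same advertisement reprinted across titles) end up grouped."""
    clusters = []  # list of (representative_vector, [names])
    for name, vec in items:
        for rep, members in clusters:
            if cosine(rep, vec) >= threshold:
                members.append(name)
                break
        else:
            clusters.append((vec, [name]))
    return [members for _, members in clusters]

ads = [("ad_1897_titleA", [1.0, 0.0, 0.0]),
       ("ad_1898_titleB", [0.99, 0.01, 0.0]),   # same ad, different paper
       ("photo", [0.0, 1.0, 0.0])]
print(cluster_by_similarity(ads))
# → [['ad_1897_titleA', 'ad_1898_titleB'], ['photo']]
```

Grouping the same ad across different titles is what makes the cross-publication commercial networks visible.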
Lee emphasizes that the Newspaper Navigator project represents just the beginning of his research into AI and cultural heritage. He expresses his excitement about the potential of these technologies to unlock new avenues for exploration and research in various cultural heritage collections.