
Presentation Information     2014-10-06 (14:15)   •  The seminar room at Vi2

Speaker Tomas Wilkinson
Title Visualizing Document Images using Image-based Word Clouds
Abstract In recent years, word clouds (or tag clouds) have been used on the internet to quickly and easily visualize the contents of documents by arranging the most frequently occurring words in the shape of a cloud. To create a word cloud, words are placed on a canvas in a deterministic or random way, with the size of each word proportional to its frequency of occurrence in the data. We introduce image-based word clouds, an attempt to achieve the same visualization using only image data of words. This allows us to create word clouds from scanned historical manuscripts or even large collections of manuscripts. While most of the layout techniques from traditional word cloud algorithms for text data can be used directly, the statistics and semantic knowledge they need are not readily available. In pure image data, one cannot simply count the words to get their frequency, or easily find and disregard stop words (e.g., "and", "if") before visualizing. In fact, the first step is to separate an image of text into individual words. I will describe an approach to approximating the necessary quantities in the visual domain and then present the results we have achieved so far. This project is a part of q2b, From quill to bytes, a framework program sponsored by the Swedish Research Council (Vetenskapsrådet, Dnr 2012-5743) and Uppsala University. The work is done in part as a collaboration with the Swedish Museum of Natural History (Naturhistoriska riksmuseet).