Digital image and video libraries are becoming more common and widely used, as visual information is produced at a rapidly growing rate. Creating and storing these collections is easy, but there is an increasing need for effective ways to manage and process such information. Content-based image and video retrieval systems are gaining popularity for online access to unannotated image and video databases. For computational reasons, fairly low-level visual and auditory features have to be used for measuring object similarity, which is the basis for clustering and search. A challenging problem is the inherently weak connection between such low-level features and the actual high-level semantic concepts that humans associate with pictures and videos, including sounds and speech. Methods for bridging this semantic gap are urgently needed.
The author's group has previously introduced the PicSOM system for content-based retrieval of images and multimedia documents. It is based on Self-Organizing Maps (SOMs) trained on low-level features. Each image or multimedia document is mapped onto several parallel feature maps, and the weighting between the different maps in content-based search is guided by relevance feedback from the user. When the PicSOM system is equipped with automatic image and video segmentation, and some metadata such as keywords are available, keyword annotations given at the image level can be focused on individual image frames and segments. In the absence of keywords, such metadata can be produced automatically from recorded online use of the system. This high-level abstracted information can be used to improve retrieval accuracy, as well as to categorize the objects in the database with semantic concepts.
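As a rough illustration of the relevance-feedback mechanism described above, the following sketch (all function and variable names are hypothetical, and this is a simplification, not the actual PicSOM implementation) scores database objects on several parallel SOM grids: positive impulses are placed at the best-matching units (BMUs) of images the user marked relevant, negative impulses at the BMUs of shown but non-relevant images, each map is low-pass filtered to spread relevance to nearby units, and the responses are summed across maps.

```python
import numpy as np

def smooth(grid, radius=1):
    """Low-pass filter a map grid by averaging over a (2r+1)x(2r+1) window."""
    h, w = grid.shape
    out = np.zeros_like(grid)
    for i in range(h):
        for j in range(w):
            i0, i1 = max(0, i - radius), min(h, i + radius + 1)
            j0, j1 = max(0, j - radius), min(w, j + radius + 1)
            out[i, j] = grid[i0:i1, j0:j1].mean()
    return out

def score_images(bmus_per_map, shape, relevant, shown):
    """Score each database image by summing smoothed relevance impulses
    over all parallel feature maps.

    bmus_per_map: list of (n_images, 2) arrays, one per SOM, giving each
                  image's best-matching unit coordinates on that map.
    shape:        (rows, cols) of every SOM grid.
    relevant:     indices of images the user marked relevant.
    shown:        indices of all images shown to the user so far.
    """
    nonrelevant = [i for i in shown if i not in relevant]
    n = bmus_per_map[0].shape[0]
    total = np.zeros(n)
    for bmus in bmus_per_map:
        grid = np.zeros(shape)
        for i in relevant:          # positive impulse at each relevant BMU
            grid[tuple(bmus[i])] += 1.0
        for i in nonrelevant:       # negative impulse at non-relevant BMUs
            grid[tuple(bmus[i])] -= 1.0
        sgrid = smooth(grid)        # spread relevance to neighboring units
        total += sgrid[bmus[:, 0], bmus[:, 1]]
    return total
```

A map on which the relevant images cluster produces large smoothed responses near their BMUs, so it implicitly receives more weight in the summed score than a map on which the relevant images are scattered; unseen images whose BMUs lie near those of relevant images are ranked first for the next query round.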