Evaluation Datasets

  • for Personalized Recommendation based on Social Tagging

For evaluation purposes, we prepared two real datasets: the I-SEARCH and FLICKR datasets. Each dataset consists of two files: triplets.txt and descriptors.txt.

FLICKR dataset: created using Flickr’s web services, it consists of 63,172 triplets of the form userID-imageID-tagID-1, covering 8,262 users, 10,049 images and 21,866 tags. For each image in the FLICKR dataset, the image descriptor is extracted with the SIFT variant described in [Uijlings et al. 2010]. Row i of SIFT_descriptor.txt holds the descriptor of image i. The dataset is split into 51 batches, one for each text query used to retrieve the images through Flickr’s web services. Each batch has its own triplets.txt and SIFT_descriptor.txt files.

I-SEARCH dataset: created by researchers in 6 different European research institutes and universities using the multimodal engine of I-SEARCH, it consists of 3,532 triplets of the form userID-CO_ID-tagID-1, covering 358 users, 734 Content Objects (COs) and 1,336 tags. For each CO in the I-SEARCH dataset, the multimodal descriptor is extracted following the extraction strategy of [Daras et al. 2012].

In both datasets, tags are treated as regular text, so three preprocessing steps are applied: (a) tags are tokenized and filtered against a standard stop list (e.g. in, the, of, at); (b) tags are converted to lower case; and (c) all characters that are neither letters nor digits (e.g. dots, commas, question marks) are removed.
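The three tag preprocessing steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the code used to build the datasets: the function name `preprocess_tag`, the four-word stop list, whitespace tokenization, and the restriction to ASCII letters and digits are all hypothetical choices, since the description above does not specify them.

```python
import re
from typing import List

# Assumption: a tiny illustrative stop list. The description only names
# "in, the, of, at" as examples of a "standard stop list".
STOP_WORDS = frozenset({"in", "the", "of", "at"})

# Assumption: "letters" means ASCII a-z (applied after lowercasing).
_NON_ALNUM = re.compile(r"[^a-z0-9]+")


def preprocess_tag(tag: str) -> List[str]:
    """Return the cleaned tokens of a raw tag string."""
    tokens = []
    for raw in tag.split():                      # (a) whitespace tokenization
        token = _NON_ALNUM.sub("", raw.lower())  # (b) lower case, (c) strip symbols
        if token and token not in STOP_WORDS:    # (a) stop-list filtering
            tokens.append(token)
    return tokens


print(preprocess_tag("Sunset at the Beach!"))
```

In this sketch the stop-list check runs after lowercasing and character removal, so that variants such as "The," are also filtered; the original pipeline may apply the steps in a different order.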

[Uijlings et al. 2010]: Uijlings, J. R. R., Smeulders, A. W. M., and Scha, R. J. H. 2010. “Real-time visual concept classification”, IEEE Transactions on Multimedia 12(7), pp. 665–681.

[Daras et al. 2012]: Daras, P., Manolopoulou, S., and Axenopoulos, A. 2012. “Search and retrieval of rich media objects supporting multiple multimodal queries”, IEEE Transactions on Multimedia 14(3), pp. 734–746.

Link: data

Relevant papers

D. Rafailidis, A. Axenopoulos, S. Manolopoulou, J. Etzold, P. Daras, “Content-based Tag Propagation and Tensor Factorization for Personalized Recommendation based on Social Tagging”, ACM Trans. on Interact. Intell. Syst., accepted for publication.


  • for Large-scale Content-based Image Search and Retrieval and MSIDX source code

DATASETS: The ImageCLEF 2010 Wikipedia Image Collection, 2010, http://www.imageclef.org/wikidata

L. Amsaleg and H. Jégou, TEXMEX: Datasets for Approximate Nearest Neighbor Search, 2010. http://corpus-texmex.irisa.fr/

MSIDX SOURCE CODE: http://vcl.iti.gr/msidx

Relevant papers

E. Tiakas, D. Rafailidis, A. Dimou, P. Daras, “MSIDX: Multi-Sort Indexing for Efficient Content-based Image Search and Retrieval”, IEEE Transactions on Multimedia, http://dx.doi.org/10.1109/TMM.2013.2247989