People are remarkably good at remembering photographs. To investigate the nature of the stored representations and the fidelity of human memory, it would be useful to quantify the visual similarity of the stimuli presented in experiments. Here, we explored the use of convolutional neural networks (CNNs) as a measure of the perceptual or representational similarity of visual scenes in visual memory research. In Experiment 1, we presented participants with sets of nine images from the same scene category and tested whether they could detect the most distant scene in the image space defined by a CNN. Experiment 2 was a visual variant of the Deese-Roediger-McDermott paradigm: we asked participants to remember a set of photographs from the same scene category, preselected based on their distance to a particular visual prototype (defined as the centroid of the image space). In the recognition test, we observed higher false alarm rates for scenes closer to this visual prototype. Our findings show that the similarity measured by the CNN is reflected in human behavior: people can detect odd-one-out scenes and can be lured into false alarms by similar stimuli. This method can be used in further studies of visual memory for complex scenes.
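The two stimulus-selection ideas described above (the most distant scene in a set, and distance to a centroid prototype) can be illustrated with a minimal sketch. It rests on several assumptions not stated in the abstract: each image is already represented by a CNN feature vector, Euclidean distance is used, and the "most distant" scene is the one with the largest mean distance to the other images in the set. The function names are hypothetical, and random vectors stand in for real CNN activations.

```python
import numpy as np


def pairwise_distances(features):
    """Euclidean distance between every pair of row vectors."""
    diff = features[:, None, :] - features[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def odd_one_out(features):
    """Index of the image with the largest mean distance to the others.

    Assumption: this is one plausible reading of "most distant scene".
    """
    dist = pairwise_distances(features)
    n_images = features.shape[0]
    # The diagonal is zero, so dividing by n - 1 averages over the other images.
    mean_dist = dist.sum(axis=1) / (n_images - 1)
    return int(np.argmax(mean_dist))


def distance_to_prototype(features):
    """Distance of each image to the centroid of the feature vectors."""
    centroid = features.mean(axis=0)
    return np.linalg.norm(features - centroid, axis=1)


# Toy data: nine "images" with 16-dimensional features (placeholder values,
# not real CNN activations). Image 4 is shifted far away from the rest.
rng = np.random.default_rng(0)
features = rng.normal(size=(9, 16))
features[4] += 5.0

outlier_index = odd_one_out(features)
prototype_distances = distance_to_prototype(features)
# Images ordered from closest to farthest from the prototype.
ranking = np.argsort(prototype_distances)
```

In a design like Experiment 2, a ranking of this kind could be used to pick lures near or far from the prototype. The choice of network layer and distance metric would need to follow the original study.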