Online data, including the web and social media, offers an unprecedented visual and textual record of human lives, events, and activities. We, the UW NLP and Computer Vision groups, are working to design scalable new machine learning algorithms that can make sense of this ever-growing body of information by automatically inferring the latent semantic correspondences between language, images, videos, and 3D models of the world.
Our ultimate goal is to achieve deep semantic integration at very large scale. Given nearly unlimited amounts of multimodal data, we aim to build statistical models that can transform information from one modality to another, for example to automatically summarize a visual scene in language, or to retrieve images that match complex textual descriptions. Crucial to this goal will be new methods for extracting and organizing commonsense knowledge about the visual world automatically, from large-scale data, and with minimal supervision. This work exemplifies what we envision as the future of artificial intelligence: highly integrative research that spans disciplines to build a new generation of understanding systems.
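To make the cross-modal retrieval idea concrete, here is a minimal sketch, not the groups' actual system: images and text are projected into a shared embedding space, and images are ranked by cosine similarity to a query description. The feature dimensions, projection matrices, and random features below are hypothetical stand-ins for learned encoders.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical precomputed features: 5 images (e.g., 2048-d CNN features)
# and one query caption (e.g., 300-d averaged word vectors).
image_feats = rng.standard_normal((5, 2048))
caption_feat = rng.standard_normal(300)

# Linear projections into a shared 128-d space; in a trained system these
# weights would be learned from paired image-text data (random here).
W_img = rng.standard_normal((2048, 128))
W_txt = rng.standard_normal((300, 128))

def embed(x, W):
    """Project features into the joint space and L2-normalize."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

img_emb = embed(image_feats, W_img)   # shape (5, 128)
txt_emb = embed(caption_feat, W_txt)  # shape (128,)

# After normalization, cosine similarity is just a dot product;
# the best-matching image is the argmax over scores.
scores = img_emb @ txt_emb
ranking = np.argsort(-scores)
print("images ranked by match to the description:", ranking)
```

The same shared-space construction runs in the other direction as well: embedding a new image and ranking candidate sentences by similarity yields a simple form of description selection, which is why a joint embedding is a natural substrate for modality-to-modality transformation.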