AI2 at CVPR 2023

Highlighted work from our institute appearing at this year’s CVPR conference


The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2023 logo

The IEEE/CVF Computer Vision and Pattern Recognition Conference (CVPR) is the premier annual computer vision event, comprising the main conference and several co-located workshops and short courses. This year, CVPR will be single track, so everyone with a full passport registration can attend everything; there will also be a virtual accompaniment. AI2’s PRIOR team is excited to be represented with several papers at this year’s conference.

Highlighted Papers from AI2

Selected works involving AI2 researchers appearing at this year’s conference (* denotes AI2 affiliation):

A screenshot of the VisProg model, taking an image of a brown bear walking on dirt and changing it to a polar bear walking on snow.
VisProg is capable of performing visual reasoning while executing natural language instructions.

Visual Programming: Compositional visual reasoning without training

Tanmay Gupta*, Aniruddha Kembhavi*

🏆 Award Candidate: We are especially honored to have this paper among the candidates for a CVPR 2023 Paper Award!

Ever since the term AI was coined in 1955, researchers all over the world have strived to build intelligent systems with human or super-human capabilities. For a long time, limited by the available technology, computational power, and our own understanding of human intelligence, we sought “general” intelligence but got “task-specific” or “narrow” intelligence. However, a series of recent breakthroughs in AI is beginning to change this landscape in unprecedented ways. VisProg takes a natural language instruction, uses a large language model to compose a modular program, and then executes that program with off-the-shelf vision and image-processing modules, performing compositional visual reasoning without any task-specific training.
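
To make the visual-programming idea concrete, here is a minimal toy sketch of how an instruction could become an executable program. The module names, the example program, and the tiny interpreter below are illustrative stand-ins, not VisProg’s actual modules or API.

```python
# A toy stand-in for the visual-programming idea: module names, the example
# program, and the interpreter are illustrative, not VisProg's actual API.

def segment(image):
    """Stub segmenter; a real system would call a segmentation model."""
    return ["brown bear", "dirt"]

def select(image, regions, query):
    """Stub grounding module: pick the region matching the text query."""
    return next(r for r in regions if query in r)

def replace(image, region, prompt):
    """Stub inpainting module: 'edit' the image by swapping the region."""
    return f"{image}, with the {region} replaced by {prompt}"

MODULES = {"segment": segment, "select": select, "replace": replace}

# A program an instruction-following language model might emit for:
# "Change the brown bear on dirt into a polar bear on snow."
PROGRAM = [
    "REGIONS = segment(image=IMAGE)",
    "BEAR = select(image=IMAGE, regions=REGIONS, query='brown bear')",
    "IMAGE1 = replace(image=IMAGE, region=BEAR, prompt='a polar bear')",
    "GROUND = select(image=IMAGE1, regions=REGIONS, query='dirt')",
    "RESULT = replace(image=IMAGE1, region=GROUND, prompt='snow')",
]

def run(program, image):
    """Execute each line against the module registry, threading state through."""
    state = {"IMAGE": image, **MODULES}
    for line in program:
        target, expr = (part.strip() for part in line.split("=", 1))
        state[target] = eval(expr, {"__builtins__": {}}, state)  # toy interpreter
    return state["RESULT"]

print(run(PROGRAM, image="photo of a brown bear on dirt"))
```

The appeal of this decomposition is that each step is interpretable and reusable: the program is readable by a human, and each module can be swapped for a stronger model without retraining the system end to end.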

Read more about VisProg on our blog.

A screen capture of Phone2Proc’s scene variation output; a few different room options based on the initial scan.
Deploying agents trained in simulation to the real world has generally proved fraught. PHONE2PROC is a simple approach that uses a cellphone to scan an environment and procedurally generate targeted training scene variations of that location; agents trained on these variations are successful and robust in the real environment.

Phone2Proc: Bringing Robust Robots Into Our Chaotic World

Matt Deitke*, Rose Hendrix*, Luca Weihs*, Ali Farhadi, Kiana Ehsani*, Aniruddha Kembhavi*

As robots become increasingly integrated into our daily lives, it is important to ensure that they are trained to operate in real-world environments. However, creating and testing robots in physical spaces can be time-consuming and costly. That’s where Phone2Proc comes in — a new approach for generating a distribution of training environments that closely match the real-world physical space we are interested in.

Phone2Proc is a three-step process: scan a target scene with a phone, procedurally generate variations of that scene for training agents, and finally transfer the trained agent onto a robot that navigates the physical world.
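
As a rough sketch of that pipeline’s shape (every function and field name below is a hypothetical placeholder for illustration, not Phone2Proc’s actual code):

```python
"""A rough sketch of the scan -> vary -> train pipeline shape; every name below
is a hypothetical placeholder, not the project's actual code."""

import random
from dataclasses import dataclass, field

@dataclass
class RoomLayout:
    walls: list = field(default_factory=list)          # wall segments recovered from the scan
    large_objects: list = field(default_factory=list)  # furniture anchored where the scan saw it

def scan_with_phone(scan_file: str) -> RoomLayout:
    """Step 1: parse a phone scan into a floor plan with large furniture anchored in place."""
    return RoomLayout(walls=["w0", "w1", "w2", "w3"], large_objects=["couch", "tv_stand"])

def generate_variation(layout: RoomLayout, seed: int) -> dict:
    """Step 2: procedurally vary the scanned room while keeping its structure:
    swap object instances, jitter placements, randomize clutter, materials, lighting."""
    rng = random.Random(seed)
    return {
        "walls": layout.walls,
        "objects": [f"{obj}_variant_{rng.randint(0, 99)}" for obj in layout.large_objects],
        "clutter_density": rng.uniform(0.0, 1.0),
        "lighting": rng.choice(["warm", "cool", "dim"]),
    }

def train_and_deploy(scenes: list) -> None:
    """Step 3: train a navigation policy in simulation on the generated scenes,
    then run the same policy on the physical robot (omitted here)."""

# Scan once, generate many targeted variations of that one location, train on them.
layout = scan_with_phone("my_apartment.scan")
scenes = [generate_variation(layout, seed) for seed in range(1_000)]
train_and_deploy(scenes)
```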

Read more about Phone2Proc on our blog.

Examples of objects created in Objaverse.
Example instances from our large-scale 3D asset dataset OBJAVERSE. OBJAVERSE 3D assets are semantically diverse, high quality, and paired with natural-language descriptions.

Objaverse: A Universe of Annotated 3D Objects

Matt Deitke*, Dustin Schwenk*, Jordi Salvador*, Luca Weihs*, Oscar Michel*, Eli VanderBilt*, Ludwig Schmidt, Kiana Ehsani*, Aniruddha Kembhavi*, Ali Farhadi

Massive data corpora like WebText, Wikipedia, Conceptual Captions, WebImageText, and LAION have propelled recent dramatic progress in AI. Large neural models trained on such datasets produce impressive results and top many of today’s benchmarks. A notable omission within this family of large-scale datasets is 3D data. Despite considerable interest and potential applications in 3D vision, datasets of high-fidelity 3D models continue to be mid-sized with limited diversity of object categories. Addressing this gap, we present Objaverse 1.0, a large dataset of 800K+ (and growing) 3D models with descriptive captions, tags, and animations. Objaverse improves upon present-day 3D repositories in terms of scale, number of categories, and the visual diversity of instances within a category. We demonstrate the large potential of Objaverse via four diverse applications: training generative 3D models, improving tail category segmentation on the LVIS benchmark, training open-vocabulary object-navigation models for Embodied AI, and creating a new benchmark for robustness analysis of vision models.
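
For a sense of how the dataset might be used in practice, here is a small sketch that filters annotated assets by category and downloads a handful of models. It assumes the `objaverse` Python package released alongside the dataset and its `load_uids` / `load_annotations` / `load_objects` helpers; the exact interface may differ, so treat the calls below as an approximation and check the project documentation.

```python
"""Sketch of pulling a small, category-filtered slice of Objaverse.

Assumes the `objaverse` Python package and its load_uids / load_annotations /
load_objects helpers; verify the current interface against the project docs.
"""

import random
import objaverse

# Every asset is identified by a UID and paired with metadata (name, tags, ...).
uids = objaverse.load_uids()
annotations = objaverse.load_annotations(uids[:10_000])  # metadata for a sample

# Keep assets whose name mentions a category of interest, e.g. chairs.
chair_uids = [uid for uid, ann in annotations.items()
              if "chair" in ann.get("name", "").lower()]

# Download the corresponding 3D models (GLB files) to the local cache.
objects = objaverse.load_objects(uids=random.sample(chair_uids, k=min(50, len(chair_uids))))
for uid, path in objects.items():
    print(uid, "->", path)
```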

Read more about Objaverse on our blog.

Episode in EXCALIBUR played by a human annotator.
An episode is divided into four sequential phases: in Phase I, the agent explores the house for 2,500 steps (each action takes a step); in Phase II, the agent answers 20 questions (5 shown) about the explored environment; in Phase III, the agent is given a second chance to re-enter the house, now with knowledge of the questions; in Phase IV, the agent answers the questions again.

EXCALIBUR: Encouraging and Evaluating Embodied Exploration

Hao Zhu, Raghav Kapoor, So Yeon Min, Winson Han*, Jiatai Li, Kaiwen Geng, Graham Neubig, Yonatan Bisk, Aniruddha Kembhavi*, Luca Weihs*

Experience precedes understanding. Humans constantly explore and learn about their environment out of curiosity, gather information, and update their models of the world. Machines, on the other hand, are either trained to learn passively from static, fixed datasets or taught to complete specific goal-conditioned tasks. To encourage the development of exploratory interactive agents, we present the EXCALIBUR benchmark. EXCALIBUR allows agents to explore their environment for long durations and then query their understanding of the physical world via inquiries like: “is the small heavy red bowl made from glass?” or “is there a silver spoon heavier than the egg?”. This design encourages agents to perform free-form home exploration without the myopia induced by goal conditioning. Once the agents have answered a series of questions, they can re-enter the scene to refine their knowledge, update their beliefs, and improve their performance on the questions. Our experiments demonstrate the challenges this benchmark poses for present-day state-of-the-art embodied systems and the headroom available for developing new, innovative methods. Finally, we present a virtual reality interface that enables humans to seamlessly interact within the simulated world, and we use it to gather human performance measures. EXCALIBUR poses unique challenges compared to present-day benchmarks and represents the next frontier for embodied AI research.
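
The four-phase episode structure described in the figure caption above can be summarized as a simple control loop. The agent and environment interfaces in this sketch are hypothetical placeholders, not the benchmark’s actual API, and the second exploration budget is a placeholder.

```python
"""Sketch of EXCALIBUR's four-phase episode structure; the agent/environment
interfaces here are hypothetical placeholders, not the benchmark's actual API."""

EXPLORATION_STEPS = 2_500  # Phase I budget described in the figure caption

def run_episode(env, agent, questions):
    # Phase I: free-form exploration of the house, with no goal conditioning.
    obs = env.reset()
    for _ in range(EXPLORATION_STEPS):
        obs = env.step(agent.act(obs))

    # Phase II: answer questions about the explored environment from memory.
    first_answers = [agent.answer(q) for q in questions]

    # Phase III: re-enter the same house, now knowing the questions, to gather
    # whatever evidence is still missing (budget here is a placeholder).
    obs = env.reset(same_house=True)
    for _ in range(EXPLORATION_STEPS):
        obs = env.step(agent.act(obs, questions=questions))

    # Phase IV: answer the same questions again with the refined knowledge.
    second_answers = [agent.answer(q) for q in questions]
    return first_answers, second_answers

class ToyEnv:
    """Minimal stand-in environment."""
    def reset(self, same_house=False):
        return {"rgb": None}
    def step(self, action):
        return {"rgb": None}

class ToyAgent:
    """Minimal stand-in agent that wanders and answers 'no'."""
    def act(self, obs, questions=None):
        return "move_ahead"
    def answer(self, question):
        return "no"

print(run_episode(ToyEnv(), ToyAgent(), ["is there a silver spoon heavier than the egg?"]))
```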

Check out our current openings, follow @allen_ai on Twitter, and subscribe to the AI2 Newsletter to stay current on news and research coming out of AI2.

--

Our mission is to contribute to humanity through high-impact AI research and engineering. We are a Seattle-based non-profit founded in 2014 by Paul G. Allen.