C-Reference: Improving 2D to 3D Object Pose Estimation Accuracy via Crowdsourced Joint Object Estimation

Jean Y. Song, John Joon Young Chung, David F. Fouhey, Walter S. Lasecki

Research output: Contribution to journal › Article › peer-review


Converting widely available 2D images and videos, captured using an RGB camera, to 3D can help accelerate the training of machine learning systems in spatial reasoning domains ranging from in-home assistive robots to augmented reality to autonomous vehicles. However, automating this task is challenging because it requires not only accurately estimating object location and orientation but also knowing camera properties that are currently unknown (e.g., focal length). A scalable way to combat this problem is to leverage people's spatial understanding of scenes by crowdsourcing visual annotations of 3D object properties. Unfortunately, getting people to directly estimate 3D properties reliably is difficult due to the limitations of image resolution, human motor accuracy, and people's 3D perception (i.e., humans do not "see" depth like a laser range finder). In this paper, we propose a crowd-machine hybrid approach that jointly uses crowds' approximate measurements of multiple in-scene objects to estimate the 3D state of a single target object. Our approach can generate accurate estimates of the target object by combining heterogeneous knowledge from multiple contributors regarding different objects that share a spatial relationship with the target object. We evaluate our joint object estimation approach with 363 crowd workers and show that our method can reduce errors in the target object's 3D location estimation by over 40%, while requiring only 35% as much human time. Our work introduces a novel way to enable groups of people with different perspectives and knowledge to achieve more accurate collective performance on challenging visual annotation tasks.
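The abstract does not spell out the estimation procedure, so the following is only a minimal sketch of the general idea of combining several noisy, reference-derived estimates of one target position under soft constraints. It assumes each reference object yields its own guess of the target's 3D location together with a variance expressing how much that guess is trusted; the function name `fuse_estimates`, the variance inputs, and the quadratic cost are illustrative assumptions and are not taken from the paper.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def fuse_estimates(estimates: Sequence[Vec3], variances: Sequence[float]) -> Vec3:
    """Fuse several noisy 3D position estimates of the same target.

    Hypothetical helper (not the paper's algorithm): it returns the point x
    that minimizes the soft-constraint cost
        sum_i (1 / variance_i) * ||x - estimate_i||^2,
    whose closed-form solution is the inverse-variance weighted mean.
    """
    if not estimates or len(estimates) != len(variances):
        raise ValueError("need one positive variance per estimate")
    if any(v <= 0 for v in variances):
        raise ValueError("variances must be positive")

    # Lower variance means a more trusted estimate, hence a larger weight.
    weights = [1.0 / v for v in variances]
    total = sum(weights)

    # Weighted mean, computed independently for the x, y, and z coordinates.
    fused = [
        sum(w * est[axis] for w, est in zip(weights, estimates)) / total
        for axis in range(3)
    ]
    return (fused[0], fused[1], fused[2])


if __name__ == "__main__":
    # Three made-up target guesses, each implied by a different reference object.
    guesses = [(1.0, 0.0, 4.0), (1.2, 0.0, 5.0), (0.8, 0.0, 4.5)]
    trust = [0.5, 2.0, 1.0]  # assumed variances; smaller means more reliable
    print(fuse_estimates(guesses, trust))
```

A weighted mean is the simplest instance of this kind of aggregation. A fuller treatment would also have to handle orientation and the unknown camera properties mentioned above, which this sketch deliberately leaves out.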

Original language: English (US)
Article number: 51
Journal: Proceedings of the ACM on Human-Computer Interaction
Issue number: CSCW1
State: Published - May 28 2020


Keywords

  • 3D pose estimation
  • answer aggregation
  • computer vision
  • crowdsourcing
  • human computation
  • optimization
  • soft constraints

ASJC Scopus subject areas

  • Social Sciences (miscellaneous)
  • Human-Computer Interaction
  • Computer Networks and Communications

