Sean Ryan Fanello1,2  Cem Keskin1  Shahram Izadi1  Pushmeet Kohli1  David Kim1  David Sweeney1
Antonio Criminisi1  Jamie Shotton1  Sing Bing Kang1  Tim Paek1
1 Microsoft Research
2 iCub Facility - Istituto Italiano di Tecnologia
Figure 1: (a, b) Our approach turns any 2D camera into a cheap depth sensor for close-range human capture and 3D interaction scenarios. (c, d) Simple hardware modifications allow actively illuminated near-infrared images to be captured from the camera. (e, f) These images are used as input to our machine learning algorithm for depth estimation. (g, h) Our algorithm outputs dense metric depth maps of hands or faces in real-time.
Abstract
We present a machine learning technique for estimating absolute, per-pixel depth using any conventional monocular 2D camera, with minor hardware modifications. Our approach targets close-range human capture and interaction where dense 3D estimation of hands and faces is desired. We use hybrid classification-regression forests to learn how to map from near infrared intensity images to absolute, metric depth in real-time. We demonstrate a variety of human-computer interaction and capture scenarios. Experiments show an accuracy that outperforms a conventional light fall-off baseline, and is comparable to high-quality consumer depth cameras, but with a dramatically reduced cost, power consumption, and form-factor.
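As a rough illustration of the light fall-off baseline that the abstract compares against: under an inverse-square model, the intensity returned from an actively illuminated surface falls off as 1/z², so depth can be recovered from a single calibration pair (the intensity observed at a known depth). This is a minimal sketch under that assumption; the function name and calibration values are hypothetical and not taken from the paper.

```python
import math

def falloff_depth(intensity, i_ref=1.0, z_ref=0.5):
    """Estimate depth from observed NIR intensity via inverse-square
    light fall-off: I is proportional to 1/z^2, so
    z = z_ref * sqrt(i_ref / I).

    i_ref is the intensity observed at a known calibration depth z_ref
    (both values here are illustrative, not from the paper).
    """
    if intensity <= 0:
        raise ValueError("intensity must be positive")
    return z_ref * math.sqrt(i_ref / intensity)

# A surface returning a quarter of the calibration intensity lies at
# twice the calibration depth: 0.5 m -> 1.0 m.
print(falloff_depth(0.25))  # 1.0
```

Such a baseline ignores surface albedo and orientation, which also modulate observed intensity; this is one reason a learned per-pixel mapping can outperform it, as the abstract reports.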
1 Introduction

While range sensing technologies have existed for a long time, consumer depth cameras such as the Microsoft Kinect have begun to make real-time depth acquisition a commodity. This in turn has opened up many exciting new applications for gaming, 3D scanning and fabrication, natural user interfaces, augmented reality, and robotics. One important domain where depth cameras have had clear impact is in human-computer interaction. In particular, the ability to