A photobubble is a set of overlapping photos taken looking outward from viewpoints located near but not at a common center, for example by a circular or spherical array of fish-eye cameras. An important special case is the classic stereo pair. Because it has multiple viewpoints, a photobubble contains information about the 3D structure of the scene. An ordinary spherical panorama does not, because all viewpoints are the same.
Small photobubbles are currently used to create stereoscopic panoramas and videos. Viewed in VR, these give a strong sense of presence but are ultimately unsatisfactory because the virtual eye positions are fixed. What we see in the real world changes as we move, and VR images should do the same. All current VR headsets track head motion in six degrees of freedom: rotation about and translation along three spatial axes. Some already have eye-tracking hardware, which adds three more degrees of freedom: two for gaze direction and one for the convergence angle, which indicates the distance of whatever we are looking at. The VR experience of the future will have stereoscopic images that change realistically with head and eye motion, providing much better realism and viewing comfort. The information needed to generate them can be obtained from photobubbles.
A photobubble 18 inches across captures almost everything a person seated in a swivel chair might see by turning the chair and moving their head and eyes. Realistic moving views can be generated from the photos by various methods of interpolation, the most practical of which depend on having a depth map for each photo: an image giving the distance from the camera to each visible point. It is possible, in principle, to construct depth maps by analyzing the overlaps of the photos. However, we still lack software that can do that reliably at scale. Given photos shot from widely separated viewpoints, current photogrammetry codes can build remarkably good 3D models, but they fail badly on bubble photos. The problem of deducing depth from stereo photos must be solved by methods other than triangulation. A successful approach will most likely emulate the way our brains extract 3D information from the data available at our retinas.
LESSONS FROM THE BRAIN
What we see is actually a 3D model constructed by our brain, through a process that is still largely mysterious despite centuries of scientific effort. But it has some notable features that might help guide the design of depth-mapping algorithms.
⦁ a stable single model is built from constantly moving pairs of images
⦁ the model covers only part of the visual field, in angle and in depth
⦁ the process involves very detailed comparison of left and right images
⦁ the process depends heavily on edge detection and matching
⦁ edges are matched at sub-pixel accuracy
Sub-pixel accuracy is a critical requirement. The central angular resolution of our eyes is around 50 pixels per degree, comparable to a good camera with a 24mm lens. Yet we can see depth differences that correspond to an image shift of 1/500 degree, or 1/10 of a pixel. Any successful depth-from-stereo code must do as well.
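To get a feel for what that sensitivity means in terms of scene depth, the small-angle stereo relation (disparity ≈ baseline × depth step / distance²) can be inverted. The sketch below is only an illustration; the 65 mm baseline and 2 m viewing distance are assumed values, not figures from the text.

    import math

    # Assumed values, chosen only for illustration.
    baseline_m = 0.065           # typical interpupillary distance, about 65 mm
    distance_m = 2.0             # distance to the fixated object
    disparity_deg = 1.0 / 500    # smallest perceptible image shift, from the text

    # Small-angle stereo relation: disparity ≈ baseline * dz / distance**2
    disparity_rad = math.radians(disparity_deg)
    dz_m = disparity_rad * distance_m ** 2 / baseline_m
    print(f"depth step resolved at 2 m: {dz_m * 1000:.1f} mm")   # about 2 mm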
The dependence on edges distinguishes stereo vision from photogrammetry, which works well using point matching alone. Edge detection begins in the retina, which is known to generate (among other things) data that resemble the Laplacian of brightness. Visual areas of the brain are full of cells that respond to edges at various orientations, contrasts, and states of motion, and to correlations between detected edges.
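As a minimal sketch of that kind of preprocessing, the following computes a Laplacian-of-Gaussian response and marks its zero crossings as edge candidates. It is a generic stand-in for the retinal computation described above, not a model of it, and the filter scale is an arbitrary choice.

    import numpy as np
    from scipy.ndimage import gaussian_laplace

    def edge_candidates(gray, sigma=1.5):
        """Mark zero crossings of the Laplacian of Gaussian of a grayscale image,
        a rough stand-in for center-surround responses."""
        log = gaussian_laplace(gray.astype(np.float64), sigma)
        sign = np.sign(log)
        zc = np.zeros(gray.shape, dtype=bool)
        # A zero crossing is a sign change between horizontal or vertical neighbors.
        zc[:, :-1] |= (sign[:, :-1] * sign[:, 1:]) < 0
        zc[:-1, :] |= (sign[:-1, :] * sign[1:, :]) < 0
        return zc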
The limited range of our internal depth model seems to match the limitations of our eyes, which cover only the central part of the visual field at high resolution and have a limited depth of focus. The practical result is that we see depth only in a certain volume around the point at which our eyes are converged and focused. Outside that volume we actually see split images, to which we rarely pay any attention.
It is actually possible to see 3D objects by fusing pairs of artificial images that have no visible content other than tiny random-looking ‘texture elements’, like sandpaper. This demonstrates that a big part of stereo vision is the correlation of fine details, and that this alone can be enough for seeing depth. However, unlike normal stereo vision, which is effectively instantaneous, fusing such a random-dot stereogram can take many seconds. So it is clear that normal stereo vision involves far more than just point correlations.
The stability of the visual model is perhaps its most remarkable feature. Unlike the optical image in a camera, it never changes shape with gaze direction, and it remains upright no matter how we roll our head. It is constantly being updated, but those changes are almost never perceptible. I would give a great deal to know how this model is represented in the brain; but neither neuroscience nor computer science has yet come up with a good answer. The best guess is something like a layered depth map.
LAYERED DEPTH MAPS
At present, the most practical way to get moving stereo images from a photobubble is to convert the bubble to a layered depth map (LDM). Several groups have demonstrated that the standard GPU pipeline can render images from LDMs at full frame rates, even on mobile hardware. These researchers used different LDM formats; many are possible, and there is not yet a consensus on a best one. But it seems clear that the basic idea is sound.
Basically, an LDM is a collection of overlapping 3D mesh fragments, indexed by position in space. Each node has the usual space coordinates and texture coordinates, a transparency attribute, and possibly also a depth attribute. Each mesh layer references a different texture image, so an LDM can present different pixel values at the same visual position, depending on a movable virtual viewpoint. Source pixels are assigned to layer textures in such a way that everything visible in any of the original photos is visible in some layer, and therefore potentially in some interpolated view.
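One possible in-memory layout is sketched below. The field names are invented for illustration and do not come from any published LDM format; the point is just that each layer carries its own mesh, texture, and per-node transparency, with the depth attribute optional.

    from dataclasses import dataclass, field
    from typing import List, Optional
    import numpy as np

    @dataclass
    class LDMLayer:
        """One mesh layer of a layered depth map (illustrative layout only)."""
        positions: np.ndarray        # (n, 3) node coordinates in bubble space
        texcoords: np.ndarray        # (n, 2) texture coordinates into this layer's image
        alpha: np.ndarray            # (n,)   per-node transparency for blending
        triangles: np.ndarray        # (m, 3) node indices forming the mesh fragments
        texture: np.ndarray          # (h, w, 3) the layer's texture image
        depth: Optional[np.ndarray] = None   # (n,) optional depth attribute (type 1 LDMs)

    @dataclass
    class LayeredDepthMap:
        layers: List[LDMLayer] = field(default_factory=list)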
The basic rendering operation is alpha-blended z-compositing, using fragment shaders that compute node shifts according to the current viewpoint. The transparencies blend the contributions of different layers to simulate a continuous surface where that is appropriate; in other places they simulate a discontinuous surface.
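A minimal CPU sketch of that compositing step is given below. The real operation runs in fragment shaders; here the viewpoint-dependent node shifts are abstracted into a warp function supplied by the caller, and layers are simply blended back to front.

    import numpy as np

    def composite(layers, warp):
        """Back-to-front alpha blending of warped layers.

        `layers` is an iterable of (rgb, alpha, depth) image triples; `warp`
        stands in for the viewpoint-dependent node shifts a fragment shader
        would compute, returning the layer resampled into the output view.
        """
        out = None
        # Blend far to near so closer layers correctly hide what is behind them.
        for rgb, alpha, depth in sorted(layers, key=lambda l: -float(np.median(l[2]))):
            rgb_w, alpha_w = warp(rgb, alpha, depth)
            if out is None:
                out = np.zeros_like(rgb_w, dtype=np.float64)
            a = alpha_w[..., None]               # broadcast alpha over color channels
            out = a * rgb_w + (1.0 - a) * out    # standard "over" blend
        return out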
The node coordinates and depth attribute can be used in various ways to represent scene structure. There seem to be two basic schemes.
type 1: the mesh layers have a regular geometry such as parallel planes or concentric spheres, and a depth attribute specifies distance from a reference point, or from the local mesh surface. This type of LDM is easy to construct from depth maps but tends to need many layers. It is similar in spirit to a volumetric lenticular screen.
type 2: the layers cover overlapping depth ranges, and node coordinates follow the depth structure of the scene, relative to reference points that may be different for each mesh fragment. No depth attribute is needed. This type of LDM is harder to construct but more efficient; five or six layers are enough. It is similar in some ways to the cached proxy views sometimes used to speed up CG renders.
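The geometric difference between the two schemes can be shown with a toy example (hedged: the concentric-sphere layout is just one of the regular geometries mentioned for type 1). A type 1 node recovers its 3D position from the layer's reference surface plus its depth attribute, while a type 2 node stores that position directly in its coordinates.

    import numpy as np

    def type1_node_position(direction, layer_radius, depth_offset):
        """Type 1: regular layer geometry (here a sphere of radius `layer_radius`
        concentric with the bubble) plus a per-node depth attribute measured
        along the viewing direction."""
        d = np.asarray(direction, dtype=float)
        d = d / np.linalg.norm(d)
        return d * (layer_radius + depth_offset)

    def type2_node_position(node_xyz):
        """Type 2: the node coordinates already follow the scene's depth
        structure, so no depth attribute is needed."""
        return np.asarray(node_xyz, dtype=float)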
The most important requirement on an LDM is that layer visibility follow occluding edges, so that foreground objects hide the background correctly. This means that many foreground nodes will lie at scene edge points, with visible pixels on only one side of the edge.
FROM BUBBLES TO DEPTH MAPS
The prerequisites for stereo depth mapping are the same as for photogrammetry:
⦁ overlapping photos of a largely static scene, well focused and exposed
⦁ accurately known camera positions and orientations
⦁ accurately known lens projections
The difference is in the camera geometry: closely spaced, looking out versus widely spaced, looking in.
Unlike photogrammetry, depth from stereo does not generally allow camera poses to be estimated from the image data. Calibration in the workshop is mandatory, except in the simple case of a stereo pair, where the pitch and roll errors between the cameras can be estimated well enough by a panorama stitcher, given accurate prior calibrations for the lenses.
Ideally, the geometry of a bubble camera array will be fixed mechanically. When an array is simulated by moving a group of cameras, the motion should be performed by a numerically controlled mechanism that reproduces position and orientation with an error comparable to one pixel at the focal plane. Such a mechanism is also ideal for calibrating lenses, by taking a series of photos at known angles and stitching them with those angles pre-specified.
Given the necessary calibrations, the general depth mapping flow becomes
⦁ determine overlaps of the photos
⦁ create an aligned image pair for each overlap
⦁ estimate x and y disparity maps for each overlap
⦁ combine disparity maps giving a depth map per photo
Each step makes use of the geometry and lens calibrations. The last step may compute a single depth value for most pixels, but it should in general allow for the possibility of object motion. A rough outline of this flow in code is sketched below.
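The outline mirrors the four steps listed above. Nothing in it is an existing library; the four step functions are passed in as parameters precisely because each one is its own substantial problem.

    def depth_maps_for_bubble(photos, poses, lens_models,
                              find_overlaps, align_pair,
                              estimate_disparity, fuse_disparities):
        """Per-photo depth mapping flow; the four step functions are supplied
        by the caller (placeholders, not an existing API)."""
        overlaps = find_overlaps(photos, poses, lens_models)              # step 1
        depth_maps = {}
        for i, photo in enumerate(photos):
            disparities = []
            for other in overlaps[i]:
                pair = align_pair(photo, photos[other], poses, lens_models)   # step 2
                disparities.append(estimate_disparity(pair))                  # step 3
            # Step 4: fuse the pairwise disparity maps into one depth map,
            # leaving room for inconsistent values caused by object motion.
            depth_maps[i] = fuse_disparities(disparities, poses, lens_models)
        return depth_maps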
The third step is the one we cannot yet do well enough. Existing methods give maps that are, to varying degrees, too incomplete, inaccurate, or distorted to support generating acceptable interpolated images on a production basis. A few research groups have come close, and we must assume that it will eventually be possible; but progress has been slow.
I myself have been working for several years now on an approach that still seems promising, but has yet to create a usable map. It is based on detecting and matching edges at sub-pixel precision, so if and when it works, the disparity maps should be easy to convert to type 2 LDMs.
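To make the idea of sub-pixel edge matching concrete, here is a 1D toy, not the author's method: it locates gradient peaks in a pair of already-aligned scanlines with sub-pixel precision by parabolic interpolation, then matches each left edge to the nearest right edge and reports the disparity.

    import numpy as np

    def subpixel_edges(scanline):
        """Locate brightness edges in a 1D scanline with sub-pixel precision.

        Edges are taken as local maxima of the absolute gradient; a parabola
        fitted to the three samples around each maximum refines the position.
        """
        g = np.abs(np.gradient(scanline.astype(np.float64)))
        edges = []
        for i in range(1, len(g) - 1):
            if g[i] > 0 and g[i] > g[i - 1] and g[i] >= g[i + 1]:
                denom = g[i - 1] - 2.0 * g[i] + g[i + 1]
                offset = 0.0 if denom == 0 else 0.5 * (g[i - 1] - g[i + 1]) / denom
                edges.append(i + offset)
        return np.array(edges)

    def edge_disparities(left_line, right_line, max_disp=20.0):
        """Match each left edge to the nearest right edge; return (position, disparity) pairs."""
        left_edges = subpixel_edges(left_line)
        right_edges = subpixel_edges(right_line)
        if len(right_edges) == 0:
            return []
        pairs = []
        for x in left_edges:
            j = int(np.argmin(np.abs(right_edges - x)))
            d = x - right_edges[j]
            if abs(d) <= max_disp:
                pairs.append((x, d))
        return pairs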