The scene is a darkened hallway. Slowly, almost casually, the humanoid robot walks down some stairs, away from the faint light toward pitch black. Its LiDAR vision sensor captures the outline of a doorknob at the far end. As it gets closer, it reaches out to grasp and turn the knob, gaining entrance to a room that a quick scan reveals to contain only a table with a square block placed in the middle. It grasps the block and switches to its camera vision, lighting the object up with a small array of white LEDs. A Rubik’s cube. It carefully scans each face, capturing the arrangement of colored squares that defines the current state of the cube. It takes about a second to solve the cube and replace it in the middle of the table. Total elapsed time: less than 30 seconds.
What does the (not too) distant robot future hold in the way of a vision system that could do the above task? It turns out that many future states are possible, and they are all being played out right now in real time. The solution to developing a vision system depends a lot on how you define the problem. One of the earliest vision systems came in the form of a disc-shaped vacuum cleaner from iRobot. The main issues it had to solve were building a simple map of the area just above floor level and avoiding bumping into things, while covering the ground thoroughly and minimizing the total distance traveled. iRobot chose LiDAR (Light Detection and Ranging). A 360° rotating laser light source sat on top of their signature Roomba® and measured the ToF (Time of Flight): the time it took a blip of laser light to bounce off a distant object and return. Each blip was a sort of laser pixel (or voxel), and as the Roomba rambled it built up a point cloud, a few voxels high, of its universe just above floor level. It could even go to work at night, since it provided its own light source. But it wasn’t perfect. It couldn’t react quickly to moving objects: pets, toddlers, toys flying through the air. And curtains dangling to floor level were treated the same as a hard stop, like molding.
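To make the time-of-flight idea concrete, here is a minimal sketch of how a single LiDAR blip becomes a point in a floor-level map. The speed of light and the polar-to-Cartesian math are standard; the function names and the example timing value are illustrative, not taken from any real Roomba firmware.

```python
import math

C = 299_792_458.0  # speed of light in a vacuum, m/s

def tof_to_distance(round_trip_seconds: float) -> float:
    """Convert a time-of-flight measurement to a one-way distance.

    The pulse travels out to the target and back, so halve the round trip.
    """
    return C * round_trip_seconds / 2.0

def blip_to_point(round_trip_seconds: float, beam_angle_rad: float,
                  robot_x: float, robot_y: float) -> tuple[float, float]:
    """Place one LiDAR return into the robot's 2-D floor-level point cloud."""
    d = tof_to_distance(round_trip_seconds)
    return (robot_x + d * math.cos(beam_angle_rad),
            robot_y + d * math.sin(beam_angle_rad))

# A wall about 3 m away returns the pulse in roughly 20 nanoseconds:
d = tof_to_distance(20e-9)  # ~3.0 m
```

Sweeping the beam angle through a full rotation and repeating this conversion at each pose is, in essence, how the rambling vacuum accumulates its point cloud.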
If we define the problem as an autonomous vehicle designed to work in traffic, then things get trickier very quickly. At a basic level, a vehicle is just a big Roomba that works outside, is confined to specific lanes of travel, and has to deal with a lot of simultaneous movement in the immediate vicinity. Clearly LiDAR cannot solve all of those problems, but it is a partial solution for some of them. Heavy rain, fog, or snow can confound a LiDAR vision system. Also, some of the distances are large and velocities can be very high, so decision making must be quick and accurate. Even with the same working definition of the problem, various design teams are using cameras, RADAR, and LiDAR in conjunction with some version of AI and ML (Machine Learning) to stitch together a localized map of what is happening and how the vehicle should respond. To get a better sense of the pros and cons of these technologies, check out the table below.
| | LiDAR | Camera | RADAR |
|---|---|---|---|
| Range | 2,500 m | 2,000 m | 200 m |
| Wavelength | 905 nm / 1550 nm | Visual, 400–700 nm | 3.9 mm |
| Color detection | None | Visual, IR | None |
| "Hard" environments | Snow, rain, fog, dust | Snow, rain, fog, dust, dark | Dust; lots of reflective targets |
| Technical issues | No color detection; picks up "foreign" signals from other LiDAR devices | Difficulty measuring depth and distance | Low resolution; objects weighted by reflectivity, not necessarily size; slow response |
| Cost | $100 – $1,000 | $65 for a triple camera | $50 – $100 |
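Since each sensor fails in different conditions, one classic way the "combined sensors" approach can stitch their readings together is inverse-variance weighting: each sensor's estimate is weighted by how precise it is. This is a minimal sketch of the idea, not any particular manufacturer's fusion pipeline, and the noise figures below are illustrative rather than measured values.

```python
def fuse_ranges(estimates: list[tuple[float, float]]) -> float:
    """Fuse (range_m, std_dev_m) pairs into one weighted range estimate.

    Each sensor is weighted by 1/variance, so the most precise sensor
    dominates without the others being ignored entirely.
    """
    weights = [1.0 / (sigma ** 2) for _, sigma in estimates]
    total = sum(weights)
    return sum(w * r for w, (r, _) in zip(weights, estimates)) / total

# Hypothetical readings for the same car ahead:
readings = [
    (41.8, 0.05),  # LiDAR: very precise in clear weather
    (42.5, 0.50),  # camera: depth inferred from imagery, noisier
    (42.1, 0.30),  # RADAR: robust in rain, coarser resolution
]
fused = fuse_ranges(readings)  # dominated by the LiDAR value
```

In bad weather the LiDAR's variance would be inflated (or the sensor dropped), and the same formula would then lean on the RADAR instead.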
The one prominent exception to the combined-sensors approach to vehicle navigation is Tesla. They are singularly focused on a camera-only system connected to a neural net. The argument goes like this: we know that humans can navigate solely by what they see, therefore other sensors are unnecessary for navigation. They just need to understand and train the neural net for the "edge cases" that people handle so adeptly. Tesla has made a lot of progress; enough to remove the forward-facing radar from their cars. And they are accumulating lots of test cases by operating these systems in "shadow mode," where they collect data for later analysis and refinement but don't issue any controlling instructions to the car. They still haven't fully solved the dreaded winter whiteout conditions, but having to pull over and stop is no worse than the real-world situation. Given Tesla's track record, I'd have to say that I wouldn't bet against them. Even iRobot has added a camera to their signature Roomba.
Let’s get back to our autonomous humanoid robot in the dark basement. Being humanoid, we expect it to be capable of human-like physical tasks, and we have given it a challenging one that requires an advanced vision system. The one characteristic of vision that we haven’t talked about yet is depth perception. Our robot friend must shift focus among walking down the hallway and the stairs, finding the doorknob, and finally locating and examining the Rubik’s cube.
The best combination of technologies to solve the depth perception problem is likely a hybrid solution based on LiDAR plus a camera. Thanks to phone-based cameras, the price/performance ratio is very good for camera solutions. Putting two cameras side by side allows simple triangulation to calculate the distance to an object. It just has to have good enough resolution that the robotic equivalent of proprioception (awareness of physical position in space) matches up with what the robot sees. A quick search online for "3D and time of flight camera" shows no shortage of vendors with products available today that combine a binocular camera (for 3D imagery) with built-in time-of-flight (LiDAR) sensing. These are targeted mostly toward making more versatile bin-picking robots; however, there are also laboratory studies examining the physical viability of such systems.
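The side-by-side-camera triangulation mentioned above reduces, for rectified stereo cameras, to a single formula: depth = focal length × baseline / disparity, where disparity is how far an object's image shifts between the left and right views. This is a textbook sketch; the parameter values are illustrative, not taken from any specific camera module.

```python
def stereo_depth(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Depth in metres from the pixel disparity between left/right images.

    focal_px:     focal length expressed in pixels
    baseline_m:   distance between the two camera centres, in metres
    disparity_px: horizontal shift of the object between the two images
    """
    if disparity_px <= 0:
        raise ValueError("disparity must be positive (object visible in both views)")
    return focal_px * baseline_m / disparity_px

# A 700-pixel focal length, 6 cm baseline, and 42-pixel disparity:
z = stereo_depth(700.0, 0.06, 42.0)  # ~1.0 m
```

Note that depth error grows as disparity shrinks, which is why a narrow-baseline stereo pair is precise up close (the Rubik's cube) but benefits from a ToF assist at hallway distances.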
There is at least one company that has followed the path of biomimicry by examining the inner structure of an eagle’s eye. In essence, the eagle has the equivalent of two highly sensitive receptors in each eye: one for forward vision and one for peripheral vision. The silicon equivalent consists of layering lenses on top of a CMOS detector chip so that the center of the image is telephoto, the outer edge is wide-angle, and between the two is a mid-range lens. This helps solve the convergence problem: ensuring that both "eyes" focus on the same spot in order to accurately triangulate the distance to objects. It would be rather straightforward to add a ToF sensor to this system, or even a patterned light source that reveals the 3D contours of an object. There is a lot of clever thought going into this problem, and it will take a large user (likely the automotive industry) to drive demand high enough to lower the price. And if past is prologue, that will likely happen.