3D Spatial Intelligence

Perspective
Frederik Warburg

Anonymised Fall Clips

When a patient falls, our system sends a fall alarm to nurses, who then come to the patient's aid. We are super proud to have built a system that can reduce the reaction time after a fall by 95% and reduce the number of falls by 84%. However, in order to assess the damage caused by a fall and potentially avoid future falls, nurses need to see how and why the fall happened. At the same time, the patients' and nurses' identities, as well as personal belongings such as pictures on the wall, must remain anonymised!

Enter our most recent AI model, TEO-1, which can generate anonymised fall clips that convey the details of the fall and the scene without risking revealing any personal information. TEO-1 runs locally on our device, so the actual footage never leaves the room. Below we show some examples of generated fall clips.

Compared to naive approaches, such as blurring or masking out people’s faces, our fall clips are built with a privacy-first mentality, meaning that even if the network makes a wrong prediction, the person's identity will not be revealed.

TEO-1: Privacy Preserving AI

TEO-1 creates these fall clips by combining multiple outputs predicted by our AI model. The model has a shared backbone and three heads: the first predicts the location of all static objects (bed, chair, etc.) and dynamic objects (patients, nurses, etc.) in the scene, as well as the action performed by each person (lying in bed, etc.). The second head predicts a segmentation map, that is, the semantic class of each pixel in the footage (e.g. this pixel belongs to the floor). The last head predicts a depth map, which describes how far each pixel is from the camera. It is the latter two that we combine to create the fall clips.
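
To make the layout concrete, here is a minimal sketch of a shared-backbone, multi-head model as described above. The module names, channel sizes, and the choice of PyTorch are illustrative assumptions, not our production code, and the detection/action head is omitted for brevity.

```python
import torch
import torch.nn as nn


class MultiHeadPerceptionModel(nn.Module):
    """Sketch: one shared backbone feeding small task-specific heads."""

    def __init__(self, backbone: nn.Module, feat_channels: int, num_classes: int):
        super().__init__()
        self.backbone = backbone  # shared feature extractor (most of the compute)

        # Head 1 (detection + actions) is omitted here for brevity.

        # Head 2: semantic segmentation -- per-pixel class logits.
        self.seg_head = nn.Sequential(
            nn.Conv2d(feat_channels, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, num_classes, kernel_size=1),
        )

        # Head 3: monocular depth -- one positive value per pixel.
        self.depth_head = nn.Sequential(
            nn.Conv2d(feat_channels, 128, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, kernel_size=1),
            nn.Softplus(),  # keep predicted depth positive
        )

    def forward(self, image: torch.Tensor):
        feats = self.backbone(image)        # rich shared features (B, C, H', W')
        seg_logits = self.seg_head(feats)   # (B, num_classes, H', W')
        depth = self.depth_head(feats)      # (B, 1, H', W')
        return seg_logits, depth
```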

We use a shared backbone to yield high inference speed (~10 fps on a Jetson Xavier NX) and learn robust features. Our heads are small (a couple of convolutional layers), meaning that the features produced by our backbone need to encode a rich understanding of the scene. Our prediction tasks are highly correlated, e.g. sharp edges in the depth map often correspond to pixels belonging to different semantic classes. By using small heads, we achieve positive transfer between tasks, such that, for example, the depth supervision of one head improves the semantic borders predicted by another head and vice versa. Similar observations have been made in the literature, see for example link. Thus, by pushing the compute budget from the heads to the backbone, we achieve a fast architecture that produces robust, generalizable features.
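
The training objective this implies is a weighted sum of per-task losses, so gradients from both depth and segmentation shape the shared backbone. The sketch below assumes the model above; the specific loss functions and weights are illustrative, not necessarily the ones we use.

```python
import torch.nn.functional as F


def multitask_loss(seg_logits, depth_pred, seg_target, depth_target,
                   w_seg: float = 1.0, w_depth: float = 1.0):
    """Weighted sum of per-task losses; both terms backpropagate into the
    shared backbone, which is where positive transfer between tasks happens."""
    seg_loss = F.cross_entropy(seg_logits, seg_target, ignore_index=255)
    depth_loss = F.l1_loss(depth_pred.squeeze(1), depth_target)
    return w_seg * seg_loss + w_depth * depth_loss
```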

Video shows the multiple outputs of our model: bounding boxes for static and dynamic objects, actions for the people in the scene, semantic segmentation, and depth estimate (more red is further away).

Tapping into the amazing progress in the AI field

We were able to add depth and segmentation capabilities to TEO-1 in a matter of weeks by distilling knowledge from large open-source models. In short, we use the best open-source models to generate pseudo-ground truth, while estimating the uncertainty of that pseudo-ground truth with test-time data augmentation. We supervise TEO-1 only in the regions where the pseudo-ground truth has low uncertainty. We found this strategy very scalable and efficient, as it does not require human annotations. Instead, it allowed us to directly leverage and benefit from all the amazing work and progress across the AI field.
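
Here is a minimal sketch of that distillation recipe, assuming a teacher segmentation model and a simple horizontal-flip augmentation: run the teacher under several augmentations, keep only the pixels where its predictions agree, and supervise the student only there. The function names, augmentations, and agreement threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

IGNORE_INDEX = 255  # pixels the student loss will skip


@torch.no_grad()
def pseudo_labels_with_tta(teacher, image: torch.Tensor, agreement: float = 0.9):
    """Run the teacher under test-time augmentations and keep only the pixels
    where the augmented predictions agree."""
    preds = []
    for flip in (False, True):  # identity + horizontal flip (illustrative)
        x = torch.flip(image, dims=[-1]) if flip else image
        logits = teacher(x)                        # (B, C, H, W)
        if flip:
            logits = torch.flip(logits, dims=[-1])
        preds.append(logits.argmax(dim=1))         # (B, H, W) class ids
    preds = torch.stack(preds)                     # (T, B, H, W)

    labels = preds[0].clone()
    agree = (preds == preds[0]).float().mean(dim=0)  # agreement ratio per pixel
    labels[agree < agreement] = IGNORE_INDEX         # mask out uncertain pixels
    return labels


def student_loss(student_logits, pseudo_labels):
    """Supervise the student only where the pseudo-ground truth is confident."""
    return F.cross_entropy(student_logits, pseudo_labels, ignore_index=IGNORE_INDEX)
```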

To give an example of how we generate the semantic segmentations, we use a combination of bounding boxes annotated by humans and bounding boxes generated from text prompts. We generate bounding boxes for objects and things we don't have annotations for (such as the floor, walls, etc.) by leveraging a recent open-source model, Grounding Dino (link), that takes a text prompt and outputs bounding boxes.
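
As a rough sketch of this step, the snippet below uses the Hugging Face transformers interface to Grounding DINO. The model id, prompt, thresholds, and post-processing arguments are assumptions that may differ between library versions and from our internal pipeline.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"  # illustrative choice
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("frame.jpg")
# classes we have no human-annotated boxes for
text = "a floor. a wall. a door."

inputs = processor(images=image, text=text, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

results = processor.post_process_grounded_object_detection(
    outputs,
    inputs.input_ids,
    box_threshold=0.35,
    text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)
boxes = results[0]["boxes"]    # (N, 4) boxes in xyxy pixel coordinates
labels = results[0]["labels"]  # the prompted phrase matched to each box
```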

We treat generated and human-annotated bounding boxes similarly, and feed them to SAM (link), which outputs a binary segmentation mask for each box. We collapse these instance segmentations into a semantic map that we use as our pseudo-ground truth. As we show in the figure below, this pseudo-ground truth is sparse, and important regions (such as the patient in the bed) are often missing! However, we found that by only supervising the non-missing regions, TEO-1 learns to predict a dense segmentation map, often surpassing the quality of the pseudo-ground truth!
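
A sketch of the box-to-mask step using Meta's segment-anything package is shown below; the checkpoint name, the class ids, and the use of 255 as the "missing" label are illustrative assumptions.

```python
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load SAM and bind it to one RGB frame (HxWx3, uint8).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # illustrative path
predictor = SamPredictor(sam)
predictor.set_image(image_rgb)

# 255 marks missing / unsupervised pixels in the pseudo-ground truth.
semantic_map = np.full(image_rgb.shape[:2], fill_value=255, dtype=np.uint8)

# boxes_with_classes: list of ((x0, y0, x1, y1), class_id) pairs coming from
# human annotation or Grounding DINO.
for box, class_id in boxes_with_classes:
    masks, scores, _ = predictor.predict(
        box=np.asarray(box, dtype=np.float32),
        multimask_output=False,
    )
    # Collapse the binary instance mask into the semantic map.
    semantic_map[masks[0]] = class_id
```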

We use large open-source models to generate pseudo-ground truth. For semantic segmentation, we use a combination of human-annotated bounding boxes and bounding boxes generated with Grounding Dino, followed by SAM to produce sparse pseudo-ground truth. We train on the non-missing regions and learn to produce a dense segmentation map without the tedious human labelling process.

Besides giving us the ability to share useful and anonymised fall videos with care staff, TEO-1 gives rise to a new paradigm at Teton, namely 3D spatial intelligence.

3D Spatial Intelligence

We live in a 3D world; a camera, however, only perceives a 2D projection of it. With TEO-1, we are able to reconstruct the 3D world by back-projecting our predictions into 3D. This allows us to better reason about occlusions, which occur in the 2D projection but can easily be handled in 3D. It enables us to set physically sensible hyperparameters in our tracker, which is otherwise hard in pixel space, where nearby objects move many pixels while distant objects move only a few. And it allows us to view the scene from novel views, as shown in the videos below.

Video shows the back-projected RGB values, semantic segmentations and fall videos in 3D. We are able to view the scene from novel views, although the scene was only captured from a single view.
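
The back-projection itself is the standard pinhole-camera lift: each pixel (u, v) with predicted depth d becomes the 3D point d * K^-1 [u, v, 1]^T in the camera frame. The sketch below assumes known camera intrinsics K; the per-device calibration is an assumption here.

```python
import numpy as np


def backproject(depth: np.ndarray, K: np.ndarray) -> np.ndarray:
    """Lift every pixel into a 3D point in the camera frame.

    depth: (H, W) predicted depth in metres.
    K:     (3, 3) pinhole intrinsics matrix.
    Returns an (H*W, 3) array of 3D points.
    """
    h, w = depth.shape
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel coordinates
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    points = np.stack([x, y, depth], axis=-1)       # (H, W, 3)
    return points.reshape(-1, 3)
```

The same per-pixel lift can carry the RGB values or semantic labels along with each point, which is what allows the scene to be re-rendered from novel viewpoints as in the videos above.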

TEO-1 has been in production since March 2024, running at ≈ 10 fps on a Jetson Xavier NX. We are just getting started with 3D perception, and we are super excited about further improving the quality of our model predictions, fully utilising its 3D awareness, and continuing to push our AI capabilities.