HomeSense: Fall Detection with Computer Vision
Detecting falls in real time using an OAK-D depth camera, YOLO pose estimation, and a custom state machine — my contribution to the HomeSense home-automation system.
The problem
Falls are the leading cause of injury-related death in adults over 65. Yet most home-monitoring solutions either require the person to actively press a button, or rely on wearable sensors that are forgotten, uncharged, or simply not worn. I wanted a system that works passively — no wristband, no button, just a camera watching a room.
This post covers the fall-detection node I built for HomeSense, a collaborative home-automation platform my team developed in SYSC 3010 (Systems Project). My teammates owned the sensor grid, the controller Pi, and the web dashboard. I owned detection.
System overview
HomeSense is a distributed system: a network of Raspberry Pi nodes that communicate over sockets
through a central controller. My node’s only job is to consume a video stream, decide whether
someone has fallen, and emit a fall_detected event when it’s sure.
The key design choice was putting YOLO pose inference directly on the camera (an OAK-D Lite). The OAK-D has an onboard Myriad X VPU that can run small neural nets at 30 fps without touching the Pi’s CPU. That freed the host to focus on the stateful logic — EMA smoothing, angle computation, and the state machine.
Skeleton → posture label
YOLO outputs 17 COCO keypoints per detected person. I only care about four of them:
| Keypoint | Index | Body part |
|---|---|---|
| Left shoulder | 5 | top of torso |
| Right shoulder | 6 | top of torso |
| Left hip | 11 | bottom of torso |
| Right hip | 12 | bottom of torso |
The torso vector runs from the midpoint of the hips to the midpoint of the shoulders. Its angle from vertical tells me orientation:
```python
import math


def compute_torso_and_label(
    kps: list[list[float]],
    upright_thresh: float = 40.0,
    horiz_thresh: float = 55.0,
) -> tuple[float, str]:
    """
    Return (torso_angle_deg, label) where label is one of
    UPRIGHT / TRANSITION / HORIZONTAL.
    """
    ls, rs = kps[5], kps[6]    # left/right shoulder
    lh, rh = kps[11], kps[12]  # left/right hip

    mid_shoulder = ((ls[0] + rs[0]) / 2, (ls[1] + rs[1]) / 2)
    mid_hip = ((lh[0] + rh[0]) / 2, (lh[1] + rh[1]) / 2)

    dx = mid_shoulder[0] - mid_hip[0]
    dy = mid_shoulder[1] - mid_hip[1]  # image y grows downward, so dy < 0 when upright
    angle = abs(math.degrees(math.atan2(dx, -dy)))  # 0° = upright, 90° = horizontal

    if angle < upright_thresh:
        label = "UPRIGHT"
    elif angle > horiz_thresh:
        label = "HORIZONTAL"
    else:
        label = "TRANSITION"
    return angle, label
```
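As a quick sanity check of the geometry, the sketch below runs two hand-made skeletons through a compact copy of the function: one standing, one lying on their side. The pixel coordinates are made up for illustration, `make_skeleton` is a hypothetical helper, and only the four torso keypoints are filled in.

```python
import math


def torso_angle_and_label(kps, upright_thresh=40.0, horiz_thresh=55.0):
    """Compact copy of compute_torso_and_label so the example runs on its own."""
    ls, rs, lh, rh = kps[5], kps[6], kps[11], kps[12]
    dx = (ls[0] + rs[0]) / 2 - (lh[0] + rh[0]) / 2
    dy = (ls[1] + rs[1]) / 2 - (lh[1] + rh[1]) / 2  # image y grows downward
    angle = abs(math.degrees(math.atan2(dx, -dy)))
    if angle < upright_thresh:
        return angle, "UPRIGHT"
    if angle > horiz_thresh:
        return angle, "HORIZONTAL"
    return angle, "TRANSITION"


def make_skeleton(l_shoulder, r_shoulder, l_hip, r_hip):
    """Build a 17-entry COCO keypoint list with only the torso points set."""
    kps = [[0.0, 0.0] for _ in range(17)]
    kps[5], kps[6], kps[11], kps[12] = l_shoulder, r_shoulder, l_hip, r_hip
    return kps


# Standing: shoulders directly above the hips (smaller y = higher in the image).
standing = make_skeleton([300, 200], [340, 200], [305, 320], [335, 320])
# Lying on one side: shoulders level with the hips, offset sideways.
lying = make_skeleton([420, 400], [420, 430], [300, 405], [300, 425])

standing_result = torso_angle_and_label(standing)
lying_result = torso_angle_and_label(lying)
print(standing_result)
print(lying_result)
```

The standing skeleton comes out at 0° and is labelled UPRIGHT. The lying one comes out at 90° and is labelled HORIZONTAL. Anything between the two thresholds falls into TRANSITION.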
EMA smoothing
Raw keypoint coordinates from YOLO are jittery frame-to-frame. Before computing the torso angle I pass each keypoint through an Exponential Moving Average filter:
```python
ALPHA = 0.4  # higher = more responsive, lower = smoother


def ema_smooth(
    new_kps: list[list[float]],
    prev_kps: list[list[float]] | None,
) -> list[list[float]]:
    if prev_kps is None:
        return new_kps
    return [
        [ALPHA * n[i] + (1 - ALPHA) * p[i] for i in range(len(n))]
        for n, p in zip(new_kps, prev_kps)
    ]
```
α = 0.4 was chosen empirically — low enough to suppress single-frame noise, high enough that a genuine fall (which happens in ~0.5 s) still registers within two or three frames.
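To see why 0.4 lands in a workable range, it helps to look at how quickly an EMA follows a sudden jump. For an input that steps from 0 to 1, the filtered value after n frames is 1 - (1 - α)^n. The sketch below tabulates that fraction for three values of α. The step input is an idealization of a fall, not measured data, and `ema_step_response` is a name made up for this example.

```python
def ema_step_response(alpha: float, frames: int) -> list[float]:
    """Filtered value after each frame when the input jumps from 0 to 1."""
    value = 0.0
    out = []
    for _ in range(frames):
        value = alpha * 1.0 + (1 - alpha) * value
        out.append(value)
    return out


for alpha in (0.2, 0.4, 0.8):
    response = ema_step_response(alpha, frames=3)
    print(alpha, [round(v, 3) for v in response])
```

With α = 0.4 the filter has covered about 78 % of the jump after three frames. With α = 0.2 it has covered just under half, and with α = 0.8 it is already at 80 % after a single frame, which leaves very little smoothing.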
Detecting the fall event
A torso angle alone isn’t enough. Someone lying down to read a book has a horizontal torso too. I need to detect the transition — a rapid drop in keypoint height.
```python
from collections import deque


def compute_rapid_drop_and_down_persist(
    history: deque[tuple[float, str]],  # (angle, label) ring buffer
    drop_window: int = 3,
    drop_threshold: float = 40.0,
    persist_frames: int = 25,
) -> tuple[bool, bool]:
    """
    rapid_drop   → torso angle rose by more than drop_threshold degrees
                   across the last drop_window frames
    down_persist → person has been HORIZONTAL for >= persist_frames consecutive frames
    """
    angles = [h[0] for h in history]
    labels = [h[1] for h in history]

    # rapid_drop: large angle increase over a short window
    if len(angles) >= drop_window:
        rapid_drop = (angles[-1] - angles[-drop_window]) > drop_threshold
    else:
        rapid_drop = False

    # down_persist: the last persist_frames labels are all HORIZONTAL
    trailing = labels[-persist_frames:]
    down_persist = len(trailing) == persist_frames and all(
        label == "HORIZONTAL" for label in trailing
    )
    return rapid_drop, down_persist
```
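A synthetic history makes the timing of the two flags easier to see. The sketch below uses a compact copy of the check so it runs on its own. The angles, the 60-frame buffer length, and the name `drop_and_persist` are assumptions made for the example, not values from the real system.

```python
from collections import deque


def drop_and_persist(history, drop_window=3, drop_threshold=40.0, persist_frames=25):
    """Compact copy of the rapid_drop / down_persist check."""
    angles = [a for a, _ in history]
    labels = [lab for _, lab in history]
    rapid_drop = (
        len(angles) >= drop_window
        and (angles[-1] - angles[-drop_window]) > drop_threshold
    )
    trailing = labels[-persist_frames:]
    down_persist = (
        len(trailing) == persist_frames
        and all(lab == "HORIZONTAL" for lab in trailing)
    )
    return rapid_drop, down_persist


history = deque(maxlen=60)  # buffer length is an assumption

# Ten upright frames, then the torso swings to horizontal within two frames.
for _ in range(10):
    history.append((5.0, "UPRIGHT"))
history.append((45.0, "TRANSITION"))
history.append((85.0, "HORIZONTAL"))
at_impact = drop_and_persist(history)

# The person stays down for 24 more frames (25 HORIZONTAL frames in total).
for _ in range(24):
    history.append((88.0, "HORIZONTAL"))
after_persist = drop_and_persist(history)

print(at_impact)      # flags on the frame the torso reaches horizontal
print(after_persist)  # flags once the person has stayed down
```

In this run the two flags are never true on the same frame: rapid_drop fires at the moment of the fall and clears again, while down_persist only turns true most of a second later. Whatever consumes the flags has to remember the earlier drop.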
The state machine
Three pieces of evidence combine in a simple state machine:
- rapid_drop: hip keypoints fell sharply → enter CANDIDATE
- down_persist: stayed horizontal for 25+ frames → enter CONFIRMED
- Depth: estimated floor distance < 0.7 m → strengthens CONFIRMED
A 30-second cooldown after each alert prevents a single fall from spamming the dashboard.
The transitions map directly to code:
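```python
if state == "NORMAL" and rapid_drop:
    state = "CANDIDATE"
elif state == "CANDIDATE":
    if down_persist and depth_ok:
        state = "CONFIRMED"
        emit_fall_alert()  # Socket.IO → controller Pi
        cooldown_until = time.time() + 30
    elif label == "UPRIGHT":  # person caught themselves (label = current posture label)
        # rapid_drop clears a few frames after the drop, long before down_persist
        # can turn true, so it cannot be the exit condition here.
        state = "NORMAL"
elif state == "CONFIRMED":
    if time.time() > cooldown_until:
        state = "NORMAL"
```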
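To exercise the cooldown without waiting 30 seconds, the transitions can be wrapped in a small class that takes its clock as a parameter. This is a sketch under a few assumptions: the class name, the injectable clock, and the use of `time.monotonic` as the default are mine, and a candidate is abandoned when the posture label returns to UPRIGHT, which is one reasonable reading of "person caught themselves".

```python
import time
from typing import Callable


class FallStateMachine:
    """Sketch of the NORMAL -> CANDIDATE -> CONFIRMED cycle with a cooldown."""

    def __init__(
        self,
        on_alert: Callable[[], None],
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.state = "NORMAL"
        self.on_alert = on_alert
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.cooldown_until = 0.0

    def step(self, rapid_drop: bool, down_persist: bool, depth_ok: bool, label: str) -> str:
        if self.state == "NORMAL":
            if rapid_drop:
                self.state = "CANDIDATE"
        elif self.state == "CANDIDATE":
            if down_persist and depth_ok:
                self.state = "CONFIRMED"
                self.on_alert()
                self.cooldown_until = self.clock() + self.cooldown_s
            elif label == "UPRIGHT":  # assumption: recovery means back on their feet
                self.state = "NORMAL"
        elif self.state == "CONFIRMED":
            if self.clock() > self.cooldown_until:
                self.state = "NORMAL"
        return self.state


alerts = []
now = [0.0]  # fake clock so the example does not sleep
fsm = FallStateMachine(on_alert=lambda: alerts.append(now[0]), clock=lambda: now[0])

trace = [
    fsm.step(True, False, False, "HORIZONTAL"),  # sudden drop
    fsm.step(False, False, True, "HORIZONTAL"),  # down, but not for long enough yet
    fsm.step(False, True, True, "HORIZONTAL"),   # stayed down: alert fires
]
now[0] = 10.0
trace.append(fsm.step(True, True, True, "HORIZONTAL"))  # inside cooldown: no second alert
now[0] = 31.0
trace.append(fsm.step(False, False, True, "UPRIGHT"))   # cooldown over
print(trace, alerts)
```

The trace runs CANDIDATE, CANDIDATE, CONFIRMED, CONFIRMED, NORMAL, and exactly one alert is recorded even though the inputs at t = 10 s still look like a fall.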
DepthAI pipeline setup
The OAK-D pipeline is configured once at startup. The camera feeds frames directly into the neural-net node, with no round-trip to the Pi host, and the detections come back over an XLink output queue.
```python
from pathlib import Path

import depthai as dai


def build_pipeline() -> dai.Pipeline:
    pipeline = dai.Pipeline()

    # RGB camera
    cam = pipeline.create(dai.node.Camera)
    cam.setFps(30)
    cam_out = cam.requestOutput((640, 640), dai.ImgFrame.Type.BGR888p)

    # YOLO pose model (compiled for Myriad X)
    nn = pipeline.create(dai.node.NeuralNetwork)
    nn.setBlobPath(Path("models/yolo_pose.blob"))
    cam_out.link(nn.input)

    # XLink output: detections stream back to the host
    xout = pipeline.create(dai.node.XLinkOut)
    xout.setStreamName("detections")
    nn.out.link(xout.input)

    # Stereo depth
    left = pipeline.create(dai.node.Camera)
    right = pipeline.create(dai.node.Camera)
    left.setBoardSocket(dai.CameraBoardSocket.CAM_B)
    right.setBoardSocket(dai.CameraBoardSocket.CAM_C)

    stereo = pipeline.create(dai.node.StereoDepth)
    stereo.setDefaultProfilePreset(dai.node.StereoDepth.PresetMode.HIGH_DENSITY)
    left.requestOutput((1280, 720)).link(stereo.left)
    right.requestOutput((1280, 720)).link(stereo.right)

    depth_out = pipeline.create(dai.node.XLinkOut)
    depth_out.setStreamName("depth")
    stereo.depth.link(depth_out.input)

    return pipeline
```
Results
Testing in the lab with a crash mat, the detector achieved:
- True positive rate: 94 % across 50 staged falls (various directions, speeds)
- False positive rate: < 2 false alerts per hour of normal activity (sitting, bending, reaching)
- Latency: median 1.1 s from fall to Socket.IO event (dominated by the 25-frame
down_persist window at 30 fps ≈ 0.83 s + network)
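The latency figure can be broken down with a couple of lines of arithmetic. The constants below are the ones quoted in this post, and the split into "persist window" and "everything else" is only a rough budget, not a profiled measurement.

```python
FPS = 30
PERSIST_FRAMES = 25
MEDIAN_LATENCY_S = 1.1  # measured median reported above

persist_window_s = PERSIST_FRAMES / FPS
remainder_s = MEDIAN_LATENCY_S - persist_window_s

print(f"persist window:  {persist_window_s:.2f} s")
print(f"everything else: {remainder_s:.2f} s")
```

The persist window accounts for about 0.83 s, which leaves roughly 0.27 s for everything else on the path, including the network hop. Shortening the window would cut latency almost one-for-one, at the cost of confirming falls on less evidence.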
The biggest failure mode was slow, controlled descents — someone carefully lowering
themselves to the floor doesn’t trigger rapid_drop. That’s intentional: a slow, deliberate
motion is not a fall. The tradeoff is that a person who faints slowly while holding a wall
could be missed.
What I learned
DepthAI’s on-device inference is genuinely powerful. Running YOLO at 30 fps without touching the Pi CPU meant the host Pi stayed cool and responsive even when my roommate was running the dashboard on the same device.
State machines beat thresholds. My first prototype used a single angle threshold and
generated constant false positives. The two-stage rapid_drop → down_persist design dropped
the false positive rate by an order of magnitude.
EMA α matters more than you’d think. With α = 0.8, a stumble could look like a fall. With α = 0.2, an actual fall took too many frames to register. Spending time on this parameter paid off.