A UR5e arm that finds, grips, and places objects on a spoken command — end to end.
Duration: 4 months
Role: Robotics software engineer

Voice-controlled UR5e pick-and-place robot
2025-03-01
Context
I build software for businesses now, but the way I think about systems came from robotics. This is the project that taught it to me: a voice-controlled pick-and-place robot built on a Universal Robots UR5e, a real 6-DOF industrial manipulator, that hears a spoken command, finds the right object with a depth camera, plans a collision-free path, and places the object where it was told to.
The point wasn't a clever demo. It was to build the whole chain the way production software is built — decoupled components, a clear contract between each layer, a simulation that mirrors the hardware so you can develop without breaking anything expensive, and one flag that flips the entire stack from simulation to a physical arm.
A robot is the most honest system you can build. There's no hiding a bad abstraction behind a loading spinner. Either the gripper closes on the object or it closes on air.
The system
The pipeline runs end to end across the cloud and the local machine:
Voice → cloud → robot. You say "Alexa, go to task one." Amazon Alexa handles the natural-language understanding and fires an AWS Lambda function (eu-west-1), which posts the command over an HTTPS tunnel to a Flask endpoint on the robot's machine. A ROS 2 bridge node turns that into a message on the /robot/task_command topic. Nothing downstream knows or cares that the command came from a voice — it's just an integer on a topic, which is exactly the decoupling you want.
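The bridge's contract can be sketched without Flask or ROS 2 installed. This is an illustration only: the `task` field name, the status codes, and the `publish` callback (standing in for a publisher on /robot/task_command) are assumptions, not the project's actual code.

```python
import json
from typing import Callable


def handle_command(body: bytes, publish: Callable[[int], None]) -> tuple[int, dict]:
    """Validate an HTTP request body and forward the task number.

    Returns (http_status, response). `publish` is a stand-in for the
    ROS 2 publisher on /robot/task_command.
    """
    try:
        payload = json.loads(body)
        task = payload["task"]  # hypothetical field name
    except (ValueError, KeyError, TypeError):
        return 400, {"error": "expected a JSON object with a 'task' field"}

    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(task, bool) or not isinstance(task, int) or task < 1:
        return 400, {"error": "task must be a positive integer"}

    publish(task)  # downstream nodes only ever see this integer
    return 200, {"accepted": task}


# Usage: collect "published" values in a list instead of a real topic.
sent: list[int] = []
status, reply = handle_command(b'{"task": 1}', sent.append)
```

Rejecting malformed input at this boundary keeps the downstream nodes free of any voice-specific or HTTP-specific handling.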
Perception. A Microsoft Kinect depth camera continuously scans the workspace. When a target is named, the vision node segments it by colour in HSV space, runs contour analysis to find its centroid and shape, and back-projects those pixel coordinates through the depth channel into a real 3-D pose in the robot's world frame. Colour segmentation over a neural model was a deliberate call — on a controlled workspace it's faster, fully deterministic, and trivial to debug when something goes wrong at the wrong hour.
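The back-projection step can be sketched with the standard pinhole camera model. The intrinsics below are placeholder values and the camera-to-world transform is invented for the example; a real setup would take both from calibration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Intrinsics:
    fx: float  # focal length in pixels, x
    fy: float  # focal length in pixels, y
    cx: float  # principal point, x
    cy: float  # principal point, y


def pixel_to_camera(u: float, v: float, depth_m: float, k: Intrinsics):
    """Back-project pixel (u, v) with metric depth into the camera frame."""
    x = (u - k.cx) * depth_m / k.fx
    y = (v - k.cy) * depth_m / k.fy
    return (x, y, depth_m)


def camera_to_world(point, transform):
    """Apply a 4x4 row-major homogeneous transform to a 3-D point."""
    x, y, z = point
    return tuple(
        transform[r][0] * x + transform[r][1] * y + transform[r][2] * z + transform[r][3]
        for r in range(3)
    )


# Placeholder intrinsics and a pure-translation extrinsic (assumptions).
K = Intrinsics(fx=525.0, fy=525.0, cx=319.5, cy=239.5)
T_WORLD_CAMERA = [
    [1.0, 0.0, 0.0, 0.5],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.2],
    [0.0, 0.0, 0.0, 1.0],
]

centroid_cam = pixel_to_camera(372.0, 239.5, 1.0, K)
centroid_world = camera_to_world(centroid_cam, T_WORLD_CAMERA)
```

This yields a position only. Recovering orientation, for example from the contour's principal axis, would be a separate step.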
Motion planning. The target pose goes to MoveIt 2, which plans a full grasp sequence (pre-grasp approach, grasp, lift, transport, place) using OMPL's RRTConnect planner with position and orientation constraints and collision checking against the planning scene, so the arm never tries to drive through itself or the table. The trajectory executes on the UR5e through the scaled_joint_trajectory_controller, and a Robotiq 2F-85 parallel-jaw gripper opens and closes around the object.
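The five-stage grasp sequence can be pictured as a list of Cartesian waypoints derived from the pick and place positions. This sketch only generates the waypoints. It does not call MoveIt 2, it ignores orientation, and the offsets and stage names are illustrative assumptions.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Pose:
    x: float
    y: float
    z: float  # metres, world frame; orientation omitted for brevity


def grasp_sequence(pick: Pose, place: Pose, approach: float = 0.10, lift: float = 0.15):
    """Return (stage, target pose, gripper action once the pose is reached)."""
    return [
        ("pre_grasp", replace(pick, z=pick.z + approach), "open"),
        ("grasp", pick, "close"),
        ("lift", replace(pick, z=pick.z + lift), "hold"),
        ("transport", replace(place, z=place.z + lift), "hold"),
        ("place", place, "open"),
    ]


# Usage: each waypoint would be handed to the planner in order.
steps = grasp_sequence(Pose(0.40, 0.10, 0.02), Pose(0.20, -0.30, 0.02))
for stage, pose, gripper in steps:
    print(f"{stage:10s} -> ({pose.x:.2f}, {pose.y:.2f}, {pose.z:.2f}) gripper={gripper}")
```

Approaching from above and lifting straight up before transport keeps the gripper's motion near the object simple, which tends to make planning failures easier to diagnose.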
Four nodes, one job each. The whole thing is four decoupled ROS 2 nodes communicating over topics: the Alexa bridge, the vision node, the task executor (a MoveGroup client), and the gripper controller. Tasks themselves live in YAML — you add or edit a pick-and-place routine without touching a line of Python.
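To make the data-driven task idea concrete, the dictionary below shows what one task might look like after the YAML file has been parsed (for example with PyYAML's `yaml.safe_load`). The field names and values are hypothetical; the project's real schema is not shown in this write-up.

```python
# Hypothetical schema: every field name here is an assumption.
REQUIRED_FIELDS = {"name", "target_colour", "place_xyz"}

TASKS = {
    1: {
        "name": "red cube to bin A",
        "target_colour": "red",
        "place_xyz": [0.40, -0.20, 0.05],  # metres, world frame
    },
}


def validate_task(task_id: int, task: dict) -> dict:
    """Fail early on a malformed task instead of mid-motion."""
    missing = REQUIRED_FIELDS - task.keys()
    if missing:
        raise ValueError(f"task {task_id} is missing fields: {sorted(missing)}")
    if len(task["place_xyz"]) != 3:
        raise ValueError(f"task {task_id}: place_xyz must have three values")
    return task


def lookup_task(task_id: int) -> dict:
    """Resolve the integer from /robot/task_command to a validated task."""
    if task_id not in TASKS:
        raise ValueError(f"unknown task id: {task_id}")
    return validate_task(task_id, TASKS[task_id])
```

Validating at load or lookup time means a typo in a task file surfaces as a clear error rather than as an arm moving to the wrong place.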
Decisions & tradeoffs
Simulation that mirrors hardware rather than approximating it. The system runs fully in simulation with a synthetic Kinect stream and a fake UR5e driver, visualised live in RViz2. The same code path runs the real arm: setting use_fake_hardware=false is the only change. This is the single most valuable design decision in the project: every line of logic was developed, broken, and fixed in simulation, so the physical arm only ever ran code that had already worked.
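One way to picture the single-flag switch is as a launch command that differs in exactly one argument. The sketch assumes the stack is started through the Universal Robots ROS 2 driver's `ur_control.launch.py`, which this write-up does not confirm, and the IP address is a placeholder.

```python
def ur_launch_command(use_fake_hardware: bool, robot_ip: str = "192.168.56.101") -> list[str]:
    """Build the argv for launching the UR driver (assumed entry point)."""
    flag = "true" if use_fake_hardware else "false"
    return [
        "ros2", "launch", "ur_robot_driver", "ur_control.launch.py",
        "ur_type:=ur5e",
        f"robot_ip:={robot_ip}",  # placeholder address
        f"use_fake_hardware:={flag}",
    ]


sim_cmd = ur_launch_command(use_fake_hardware=True)
real_cmd = ur_launch_command(use_fake_hardware=False)

# The two commands differ in exactly one argument.
changed = [(a, b) for a, b in zip(sim_cmd, real_cmd) if a != b]
print(changed)
```

Keeping the difference to one argument is what makes the claim checkable: anything else that changes between simulation and hardware is a place for untested behaviour to hide.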
Classical vision over a trained model. A small detection model would generalise better to unusual objects and lighting. On a known workspace with controlled objects, HSV segmentation plus contour analysis was faster, needed no training data, and never surprised me. If the objects or the lighting were uncontrolled, that tradeoff flips — and I'd reach for the model.
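The determinism argument is easy to see in code. The sketch below is a dependency-free stand-in for the OpenCV pipeline: it thresholds pixels in HSV and takes the centroid from the mask's image moments rather than from a traced contour. With OpenCV, `cv2.inRange` and `cv2.moments` would do the same work far faster. The thresholds are illustrative, and hue wrap-around (needed for red) is not handled.

```python
import colorsys


def segment_centroid(image, hue_range, min_sat=0.5, min_val=0.3):
    """Return the (u, v) centroid of pixels inside the HSV band, or None.

    image: rows of (r, g, b) tuples with values 0-255.
    hue_range: (low, high) in degrees, 0-360.
    """
    low, high = hue_range
    m00 = m10 = m01 = 0  # zeroth and first-order image moments of the mask
    for v, row in enumerate(image):
        for u, (r, g, b) in enumerate(row):
            h, s, val = colorsys.rgb_to_hsv(r / 255, g / 255, b / 255)
            if low <= h * 360 <= high and s >= min_sat and val >= min_val:
                m00 += 1
                m10 += u
                m01 += v
    if m00 == 0:
        return None  # nothing matched: report it rather than guess
    return (m10 / m00, m01 / m00)


# Usage: a 4x4 black image with a 2x2 green square in the middle.
BLACK, GREEN = (0, 0, 0), (0, 255, 0)
image = [[BLACK] * 4 for _ in range(4)]
for v in (1, 2):
    for u in (1, 2):
        image[v][u] = GREEN

centroid = segment_centroid(image, hue_range=(90, 150))
```

The same input always produces the same mask and the same centroid, so a misdetection can be replayed and inspected pixel by pixel.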
Three input modes, one pipeline. Alexa voice, a local microphone (Google Speech Recognition), and plain keyboard input all converge on the same /robot/task_command topic. Building the contract at the topic — not at the input — meant adding a new way to command the robot never touched the motion or perception code.
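For the text-based inputs (microphone transcripts and keyboard), converging on one topic implies a shared step that turns a phrase into the task integer. The parser below is a minimal sketch under that assumption: the phrase pattern and the supported number words are invented for illustration, and the Alexa path may well receive the number already resolved by Alexa's own language understanding.

```python
import re

_NUMBER_WORDS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5}
_TASK_PATTERN = re.compile(r"\btask\s+(\d+|[a-z]+)\b")


def parse_task(text: str):
    """Extract a task number from free text, or return None if absent."""
    match = _TASK_PATTERN.search(text.lower())
    if match is None:
        return None
    token = match.group(1)
    if token.isdigit():
        return int(token)
    return _NUMBER_WORDS.get(token)  # None for unknown words


# Every input mode ends in the same integer.
from_voice = parse_task("Alexa, go to task one.")
from_keyboard = parse_task("task 2")
```

Because every mode reduces to the same integer, a new input source only needs its own adapter up to this point.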
Outcome
UR5e arm under full MoveIt 2 motion planning
One flag switches the fake driver to a physical UR5e
Alexa voice, local mic, or keyboard — same pipeline
The robot does what it was asked: hears a command, locates the object, plans a clean path, grips it, moves it, and sets it down. The architecture is the part I'm proud of — a cloud voice layer, a perception layer, a planning layer, and a hardware layer, each able to be tested, swapped, or rebuilt without disturbing the others.
That's the same instinct I bring to every AI and software system I build now: decouple the layers, define the contract between them, and make the thing observable enough that when it breaks — and it will — you know exactly where to look.
What I'd do differently
The HSV thresholds are hand-tuned and assume reasonably stable lighting. A short auto-calibration routine at startup — sample the workspace, fit the thresholds — would make the perception layer robust to a room it had never seen, instead of one it was tuned for.
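A startup calibration of this kind could be as simple as sampling hue values from a known reference patch and fitting a band around them. The sketch below uses mean plus or minus a multiple of the standard deviation. The multiplier and minimum width are guesses, hue is in degrees (OpenCV's 8-bit images use 0-179 instead), and colours near the wrap-around at 0/360 such as red would need circular statistics.

```python
import statistics


def fit_hue_band(hue_samples, k: float = 2.5, min_half_width: float = 5.0):
    """Fit a (low, high) hue band in degrees from sampled pixels."""
    if not hue_samples:
        raise ValueError("need at least one hue sample")
    mean = statistics.fmean(hue_samples)
    spread = statistics.pstdev(hue_samples)
    # Never let the band collapse when the samples are nearly identical.
    half_width = max(k * spread, min_half_width)
    return (max(0.0, mean - half_width), min(360.0, mean + half_width))


# Usage: hue samples taken from a green reference patch at startup.
low, high = fit_hue_band([118.0, 120.0, 122.0])
```

The minimum width guards against over-fitting to a very uniform sample, which would otherwise reject the same object under slightly different light.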
I'd also add a proper grasp-verification step. Right now the system assumes a commanded grasp succeeded; a force-feedback or vision check after closing the gripper would let it detect a miss and retry, which is the difference between a robot that works and a robot you can trust unattended.
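A verification step might look like the retry loop below. It assumes the gripper can report its jaw opening after a close command, so that jaws stopping short of fully closed indicate an object between them. The three callables and the threshold are hypothetical stand-ins, not a real gripper API.

```python
from typing import Callable, Optional


def grasp_with_retry(
    close_gripper: Callable[[], None],
    read_width_mm: Callable[[], float],
    open_gripper: Callable[[], None],
    min_width_mm: float = 2.0,
    attempts: int = 3,
) -> Optional[int]:
    """Return the attempt number that held an object, or None if all missed."""
    for attempt in range(1, attempts + 1):
        close_gripper()
        if read_width_mm() > min_width_mm:
            return attempt  # jaws stopped on something: treat as a hold
        open_gripper()  # closed on air: release and try again
    return None


# Usage with fakes: the first close misses, the second holds a 30 mm object.
widths = iter([0.0, 30.0])
log: list[str] = []
result = grasp_with_retry(
    close_gripper=lambda: log.append("close"),
    read_width_mm=lambda: next(widths),
    open_gripper=lambda: log.append("open"),
)
```

In a real system the retry would likely re-run perception before closing again, since a miss usually means the pose estimate was off.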
Built with
ROS 2 Humble, MoveIt 2, Python 3.10 on Ubuntu 22.04. Universal Robots UR5e with a Robotiq 2F-85 gripper. Microsoft Kinect depth camera with OpenCV for HSV segmentation and contour detection. Amazon Alexa and AWS Lambda for the voice pipeline, bridged to ROS 2 over a Flask/HTTPS tunnel. RViz2 for simulation and visualisation.