The “intelligence” of an intelligent environment is determined not only by the functionality it offers, but to a large extent by the naturalness and intuitiveness of its interaction modes. Gestures are a particularly natural interaction mode, provided the environment’s interface places no strict constraints on how they may be performed. Since gestures are generally defined by hand/arm poses and motions, an important prerequisite for recognizing unconstrained gestures is the robust detection of hands in video images. Due to the strongly articulated nature of hands and the conditions of a realistic (i.e., not strictly controlled) environment, this is a difficult task: hands must be found in almost arbitrary configurations and under strongly varying lighting conditions. In this article, we present an approach to hand detection in the context of an intelligent house that fuses structural cues and color information. We first describe our detection algorithm, which uses scale-invariant salient region features combined with an efficient region-based filtering approach to reduce the number of false positives. The results are fused with the output of a skin color classifier. A detailed experimental evaluation on realistic data, including different cue fusion schemes, is presented. On this challenging task, we demonstrate that, although each of the two feature types (image structure and color) has drawbacks, their combination yields promising results for robust hand detection.
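To make the two cues concrete, the following is a minimal sketch of a per-pixel skin color classifier and a simple late-fusion score, assuming YCrCb input and a weighted-average fusion rule; the threshold ranges, the fusion weight, and the function names are illustrative assumptions, not the values or scheme used in the paper.

```python
import numpy as np

def skin_mask(image_ycrcb, cr_range=(133, 173), cb_range=(77, 127)):
    """Classify pixels as skin by thresholding the Cr and Cb channels.

    `image_ycrcb` is an (H, W, 3) uint8 array in YCrCb order.
    The threshold ranges are common illustrative defaults, not the
    values used in the article.
    """
    cr = image_ycrcb[..., 1]
    cb = image_ycrcb[..., 2]
    return ((cr >= cr_range[0]) & (cr <= cr_range[1]) &
            (cb >= cb_range[0]) & (cb <= cb_range[1]))

def fuse_scores(structure_score, skin_fraction, w=0.5):
    """Late fusion of cues for one candidate region: a weighted average
    of the structural detector's confidence and the fraction of
    skin-colored pixels in the region. This stands in for the paper's
    fusion schemes, which are described in the evaluation."""
    return w * structure_score + (1.0 - w) * skin_fraction

# Hypothetical candidate region: all pixels fall inside the skin ranges.
region = np.zeros((4, 4, 3), dtype=np.uint8)
region[..., 1] = 150  # Cr
region[..., 2] = 100  # Cb
fused = fuse_scores(structure_score=0.8, skin_fraction=skin_mask(region).mean())
```

In a full pipeline, `structure_score` would come from the salient-region detector after region-based filtering, and regions whose fused score falls below a threshold would be rejected.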