AHMED ELGAMMAL, RAMANI DURAISWAMI, MEMBER, IEEE, DAVID HARWOOD, AND LARRY S. DAVIS, FELLOW, IEEE
Invited Paper
Automatic understanding of events happening at a site is the ultimate goal for many visual surveillance systems. Higher level understanding of events requires that certain lower level computer vision tasks be performed. These may include detection of unusual motion, tracking targets, labeling body parts, and understanding the interactions between people. To achieve many of these tasks, it is necessary to build representations of the appearance of objects in the scene. This paper focuses on two issues related to this problem. First, we construct a statistical representation of the scene background that supports sensitive detection of moving objects in the scene, but is robust to clutter arising out of natural scene variations. Second, we build statistical representations of the foreground regions (moving objects) that support their tracking and support occlusion reasoning. The probability density functions (pdfs) associated with the background and foreground are likely to vary from image to image and will not in general have a known parametric form. We accordingly utilize general nonparametric kernel density estimation techniques for building these statistical representations of the background and the foreground. These techniques estimate the pdf directly from the data without any assumptions about the underlying distributions. Example results from applications are presented.

Keywords: Background subtraction, color modeling, kernel density estimation, occlusion modeling, tracking, visual surveillance.
I. INTRODUCTION

In automated surveillance systems, cameras and other sensors are typically used to monitor activities at a site with the goal of automatically understanding events happening at the site. Automatic event understanding would enable functionalities such as detection of suspicious activities and site security. Current systems archive huge volumes of video for eventual off-line human inspection. The automatic detection of events in videos would facilitate efficient archiving and automatic annotation. It could be used to direct the attention of human operators to potential problems. The automatic detection of events would also dramatically reduce the bandwidth required for video transmission and storage, as only interesting pieces would need to be transmitted or stored.

Higher level understanding of events requires certain lower level computer vision tasks to be performed, such as detection of unusual motion, tracking targets, labeling body parts, and understanding the interactions between people. For many of these tasks, it is necessary to build representations of the appearance of objects in the scene. For example, the detection of unusual motions can be achieved by building a representation of the scene background and comparing new frames with this representation. This process is called background subtraction. Building representations for foreground objects (targets) is essential for tracking them and maintaining their identities. This paper focuses on two issues: how to construct a statistical representation of the scene background that supports sensitive detection of moving objects in the scene and how to build statistical representations of the foreground (moving objects) that support their tracking.

One useful tool for building such representations is statistical modeling, where a process is modeled as a random variable in a feature space with an associated probability density function (pdf).
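As a rough illustration of this statistical view of background subtraction, the following Python sketch treats the intensity of a single pixel as a random variable, estimates its pdf from recent samples with a kernel density estimate, and flags a new value as foreground when its estimated density is low. The Gaussian kernel, the `bandwidth` and `threshold` values, the function names, and the sample intensities are all assumptions made for illustration; they are not taken from this paper.

```python
import math

def kde_density(samples, x, bandwidth):
    """Gaussian kernel density estimate at x from a list of scalar samples."""
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))
    total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
    return norm * total / len(samples)

def is_foreground(samples, x, bandwidth=5.0, threshold=1e-3):
    """Flag x as foreground when its estimated background density is low.

    bandwidth and threshold are illustrative values, not tuned settings.
    """
    return kde_density(samples, x, bandwidth) < threshold

# Hypothetical intensity history for one background pixel (made-up values).
history = [100, 102, 99, 101, 98, 100, 103, 97, 101, 100]

background_like = is_foreground(history, 101)  # value close to the history
object_like = is_foreground(history, 200)      # value far from the history
```

In this sketch no parametric form is assumed for the pixel's distribution: the estimate is built directly from the stored samples, so the history could equally well be multimodal.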
The density function could be represented parametrically using a specified statistical distribution, that
Manuscript received May 31, 2001; revised February 15, 2002. This work was supported in part by the ARDA Video Analysis and Content Exploitation project under Contract MDA 90 400C2110 and in part by...