I’m interested in combining probability and logic to solve action recognition tasks across a variety of fields. I consider questions such as the following fundamental:
- Given a large stream of time-stamped observations and a prior model for an activity of interest, how can we best explain the observation stream? For example, in network traffic data, operators are interested in automatically detecting patterns of DDoS or other attacks. The major problem here is that most observations are noise with respect to the activity we want to detect, so we need to build models that can handle huge streams of data while achieving near-perfect recall.
- Discriminative models have shown high classification accuracy and statistical elegance when sufficient training data is available. Unfortunately, the intra-class variation present in high-level scenarios is so large that we can never hope to gather enough training data for them. For instance, we cannot hope to ever create a good feature descriptor for a soccer match or for cooking a steak. In such highly structured yet heavily variable scenarios, mining spatio-temporal structure is essential for recognizing and summarizing activities. To that end, we need to go beyond frequent itemset mining, which leverages only co-occurrence information, and come up with probabilistic models that relate low-level observations in a spatio-temporal sense while still keeping inference tractable. Graphical models such as (semi-)Markov models or more general Dynamic Bayesian Networks could be used to build such hierarchical models.
- How should activity recognition be regularized? What constitutes a fine or coarse characterization of an activity, and which one should we prefer for a given task? For example, when observing multiple videos of weddings, we will likely see a wedding cake in most of them. But in another video we might not observe one, for any of several reasons (maybe it was occluded by a participant, or the video simply ended before the actual cake cutting). Does this mean that we should immediately remove “cutting a wedding cake” as a constituent of wedding videos, or is this an anomaly that should be ignored? From a logical perspective, if we assume that we have to build a set of rules with the event of interest in the head, how many different rules would we need, and how long should each rule be? Which atoms should be placed first so that we unify as efficiently as possible during grounding? From a more theoretical perspective, what should be the structure of the Conjunctive Normal Form of the theory that we build for the event?
To answer these questions, I look to the Probabilistic Graphical Model and Probabilistic Logic Programming communities for tractable and expressive models that can answer interesting probabilistic queries. In the era of Big Data, the hardest problem in this domain appears to be scalability: probabilistic inference is, in general, NP-hard even to approximate, so we have to find efficient approximation algorithms or solve sub-classes of these problems with dynamic programming.
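As a small illustration of the dynamic-programming route, here is a minimal sketch of the Viterbi algorithm, which finds the single most likely hidden-state explanation of an observation stream under a hidden Markov model in time linear in the stream length. Everything specific in it is an assumption made for illustration: the two states (`benign`, `attack`), the observation symbols (`low`, `high`), and all probabilities are invented and not taken from any real traffic model.

```python
import math


def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (log-probability, most likely state path) for an observation sequence."""
    # best[s] holds the log-probability of the best path that ends in state s.
    best = {
        s: math.log(start_p[s]) + math.log(emit_p[s][observations[0]])
        for s in states
    }
    back = []  # one dict of back-pointers per time step after the first
    for obs in observations[1:]:
        prev_best, best, pointers = best, {}, {}
        for s in states:
            # Choose the predecessor r that maximizes the score of reaching s.
            score, arg = max(
                (prev_best[r] + math.log(trans_p[r][s]), r) for r in states
            )
            best[s] = score + math.log(emit_p[s][obs])
            pointers[s] = arg
        back.append(pointers)

    # Follow the back-pointers from the best final state to recover the path.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[last], path


# Hypothetical toy model: all names and numbers below are illustrative assumptions.
states = ("benign", "attack")
start_p = {"benign": 0.9, "attack": 0.1}
trans_p = {
    "benign": {"benign": 0.9, "attack": 0.1},
    "attack": {"benign": 0.2, "attack": 0.8},
}
emit_p = {
    "benign": {"low": 0.8, "high": 0.2},
    "attack": {"low": 0.1, "high": 0.9},
}

log_prob, path = viterbi(["low", "high", "high"], states, start_p, trans_p, emit_p)
print(path, math.exp(log_prob))
```

On this toy model the recovered explanation is `benign`, `attack`, `attack`. The cost is proportional to the number of observations times the square of the number of states, which is what makes chain-structured models attractive for long streams; richer models such as general Dynamic Bayesian Networks do not admit such a simple recursion in general.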