2025
Conference Papers
-
ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos
Lei Shi, Paul-Christian Bürkner, Andreas Bulling
Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2025.
We present ActionDiffusion - a novel diffusion model for procedure planning in instructional videos that is the first to take temporal inter-dependencies between actions into account. Our approach is in stark contrast to existing methods that fail to exploit the rich information content available in the particular order in which actions are performed. Our method unifies the learning of temporal dependencies between actions and denoising of the action plan in the diffusion process by projecting the action information into the noise space. This is achieved 1) by adding action embeddings in the noise masks in the noise-adding phase and 2) by introducing an attention mechanism in the noise prediction network to learn the correlations between different action steps. We report extensive experiments on three instructional video benchmark datasets (CrossTask, Coin, and NIV) and show that our method outperforms previous state-of-the-art methods on all metrics on CrossTask and NIV and all metrics except accuracy on the Coin dataset. We show that by adding action embeddings into the noise mask the diffusion model can better learn action temporal dependencies and improve performance on procedure planning.
@inproceedings{shi25_wacv, author = {Shi, Lei and Bürkner, Paul-Christian and Bulling, Andreas}, title = {ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos}, booktitle = {Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}, year = {2025}, pages = {} }
2024
Journal Articles
-
PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation
Mayar Elfares, Pascal Reisert, Zhiming Hu, Wenwu Tang, Ralf Küsters, Andreas Bulling
Proc. ACM on Human-Computer Interaction (PACM HCI), 8 (ETRA), pp. 1–23, 2024.
Latest gaze estimation methods require large-scale training data but their collection and exchange pose significant privacy risks. We propose PrivatEyes - the first privacy-enhancing training approach for appearance-based gaze estimation based on federated learning (FL) and secure multi-party computation (MPC). PrivatEyes enables training gaze estimators on multiple local datasets across different users and server-based secure aggregation of the individual estimators’ updates. PrivatEyes guarantees that individual gaze data remains private even if a majority of the aggregating servers is malicious. We also introduce a new data leakage attack DualView that shows that PrivatEyes limits the leakage of private training data more effectively than previous approaches. Evaluations on the MPIIGaze, MPIIFaceGaze, GazeCapture, and NVGaze datasets further show that the improved privacy does not lead to a lower gaze estimation accuracy or substantially higher computational costs - both of which are on par with its non-secure counterparts.
doi: 10.1145/3655606
Paper: elfares24_etra.pdf
@article{elfares24_etra, title = {PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation}, author = {Elfares, Mayar and Reisert, Pascal and Hu, Zhiming and Tang, Wenwu and Küsters, Ralf and Bulling, Andreas}, year = {2024}, journal = {Proc. ACM on Human-Computer Interaction (PACM HCI)}, pages = {1--23}, volume = {8}, number = {ETRA}, doi = {10.1145/3655606} }
-
Mindful Explanations: Prevalence and Impact of Mind Attribution in XAI Research
Susanne Hindennach, Lei Shi, Filip Miletic, Andreas Bulling
Proc. ACM on Human-Computer Interaction (PACM HCI), 8 (CSCW), pp. 1–42, 2024.
When users perceive AI systems as mindful, independent agents, they hold them responsible instead of the AI experts who created and designed these systems. So far, it has not been studied whether explanations support this shift in responsibility through the use of mind-attributing verbs like "to think". To better understand the prevalence of mind-attributing explanations we analyse AI explanations in 3,533 explainable AI (XAI) research articles from the Semantic Scholar Open Research Corpus (S2ORC). Using methods from semantic shift detection, we identify three dominant types of mind attribution: (1) metaphorical (e.g. "to learn" or "to predict"), (2) awareness (e.g. "to consider"), and (3) agency (e.g. "to make decisions"). We then analyse the impact of mind-attributing explanations on awareness and responsibility in a vignette-based experiment with 199 participants. We find that participants who were given a mind-attributing explanation were more likely to rate the AI system as aware of the harm it caused. Moreover, the mind-attributing explanation had a responsibility-concealing effect: Considering the AI experts’ involvement led to reduced ratings of AI responsibility for participants who were given a non-mind-attributing or no explanation. In contrast, participants who read the mind-attributing explanation still held the AI system responsible despite considering the AI experts’ involvement. Taken together, our work underlines the need to carefully phrase explanations about AI systems in scientific writing to reduce mind attribution and clearly communicate human responsibility.
doi: 10.1145/3641009
Paper: hindennach24_pacm.pdf
@article{hindennach24_pacm, title = {Mindful Explanations: Prevalence and Impact of Mind Attribution in XAI Research}, author = {Hindennach, Susanne and Shi, Lei and Miletic, Filip and Bulling, Andreas}, year = {2024}, pages = {1--42}, volume = {8}, number = {CSCW}, doi = {10.1145/3641009}, journal = {Proc. ACM on Human-Computer Interaction (PACM HCI)} }
-
Pose2Gaze: Eye-body Coordination during Daily Activities for Gaze Prediction from Full-body Poses
Zhiming Hu, Jiahui Xu, Syn Schmitt, Andreas Bulling
IEEE Transactions on Visualization and Computer Graphics (TVCG), pp. 1–12, 2024.
Human eye gaze plays a significant role in many virtual and augmented reality (VR/AR) applications, such as gaze-contingent rendering, gaze-based interaction, or eye-based activity recognition. However, prior works on gaze analysis and prediction have only explored eye-head coordination and were limited to human-object interactions. We first report a comprehensive analysis of eye-body coordination in various human-object and human-human interaction activities based on four public datasets collected in real-world (MoGaze), VR (ADT), as well as AR (GIMO and EgoBody) environments. We show that in human-object interactions, e.g. pick and place, eye gaze exhibits strong correlations with full-body motion while in human-human interactions, e.g. chat and teach, a person’s gaze direction is correlated with the body orientation towards the interaction partner. Informed by these analyses we then present Pose2Gaze – a novel eye-body coordination model that uses a convolutional neural network and a spatio-temporal graph convolutional neural network to extract features from head direction and full-body poses, respectively, and then uses a convolutional neural network to predict eye gaze. We compare our method with state-of-the-art methods that predict eye gaze only from head movements and show that Pose2Gaze outperforms these baselines with an average improvement of 24.0% on MoGaze, 10.1% on ADT, 21.3% on GIMO, and 28.6% on EgoBody in mean angular error, respectively. We also show that our method significantly outperforms prior methods in the sample downstream task of eye-based activity recognition. These results underline the significant information content available in eye-body coordination during daily activities and open up a new direction for gaze prediction.
Paper: hu24_tvcg.pdf
@article{hu24_tvcg, author = {Hu, Zhiming and Xu, Jiahui and Schmitt, Syn and Bulling, Andreas}, title = {Pose2Gaze: Eye-body Coordination during Daily Activities for Gaze Prediction from Full-body Poses}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, year = {2024}, pages = {1--12} }
-
HOIMotion: Forecasting Human Motion During Human-Object Interactions Using Egocentric 3D Object Bounding Boxes
Zhiming Hu, Zheming Yin, Daniel Haeufle, Syn Schmitt, Andreas Bulling
IEEE Transactions on Visualization and Computer Graphics (TVCG), pp. 1–11, 2024.
We present HOIMotion – a novel approach for human motion forecasting during human-object interactions that integrates information about past body poses and egocentric 3D object bounding boxes. Human motion forecasting is important in many augmented reality applications but most existing methods have only used past body poses to predict future motion. HOIMotion first uses an encoder-residual graph convolutional network (GCN) and multi-layer perceptrons to extract features from body poses and egocentric 3D object bounding boxes, respectively. Our method then fuses pose and object features into a novel pose-object graph and uses a residual-decoder GCN to forecast future body motion. We extensively evaluate our method on the Aria digital twin (ADT) and MoGaze datasets and show that HOIMotion consistently outperforms state-of-the-art methods by a large margin of up to 8.7% on ADT and 7.2% on MoGaze in terms of mean per joint position error. Complementing these evaluations, we report a human study (N=20) that shows that the improvements achieved by our method result in forecasted poses being perceived as both more precise and more realistic than those of existing methods. Taken together, these results reveal the significant information content available in egocentric 3D object bounding boxes for human motion forecasting and the effectiveness of our method in exploiting this information.
Paper: hu24_ismar.pdf
@article{hu24_ismar, author = {Hu, Zhiming and Yin, Zheming and Haeufle, Daniel and Schmitt, Syn and Bulling, Andreas}, title = {HOIMotion: Forecasting Human Motion During Human-Object Interactions Using Egocentric 3D Object Bounding Boxes}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, year = {2024}, pages = {1--11} }
-
Learning User Embeddings from Human Gaze for Personalised Saliency Prediction
Florian Strohm, Mihai Bâce, Andreas Bulling
Proc. ACM on Human-Computer Interaction (PACM HCI), 8 (ETRA), pp. 1–18, 2024.
Reusable embeddings of user behaviour have shown significant performance improvements for the personalised saliency prediction task. However, prior works require explicit user characteristics and preferences as input, which are often difficult to obtain. We present a novel method to extract user embeddings from pairs of natural images and corresponding saliency maps generated from a small amount of user-specific eye tracking data. At the core of our method is a Siamese convolutional neural encoder that learns the user embeddings by contrasting the image and personal saliency map pairs of different users. Evaluations on two saliency datasets show that the generated embeddings have high discriminative power, are effective at refining universal saliency maps to the individual users, and generalise well across users and images. Finally, based on our model’s ability to encode individual user characteristics, our work points towards other applications that can benefit from reusable embeddings of gaze behaviour.
doi: 10.1145/3655603
Paper: strohm24_etra.pdf
@article{strohm24_etra, title = {Learning User Embeddings from Human Gaze for Personalised Saliency Prediction}, author = {Strohm, Florian and Bâce, Mihai and Bulling, Andreas}, year = {2024}, journal = {Proc. ACM on Human-Computer Interaction (PACM HCI)}, pages = {1--18}, volume = {8}, number = {ETRA}, doi = {10.1145/3655603} }
-
VisRecall++: Analysing and Predicting Visualisation Recallability from Gaze Behaviour
Yao Wang, Yue Jiang, Zhiming Hu, Constantin Ruhdorfer, Mihai Bâce, Andreas Bulling
Proc. ACM on Human-Computer Interaction (PACM HCI), 8 (ETRA), pp. 1–18, 2024.
Question answering has recently been proposed as a promising means to assess the recallability of information visualisations. However, prior works are yet to study the link between visually encoding a visualisation in memory and recall performance. To fill this gap, we propose VisRecall++ – a novel 40-participant recallability dataset that contains gaze data on 200 visualisations and five question types, such as identifying the title, and finding extreme values. We measured recallability by asking participants questions after they observed the visualisation for 10 seconds. Our analyses reveal several insights, such as saccade amplitude, number of fixations, and fixation duration significantly differ between high and low recallability groups. Finally, we propose GazeRecallNet – a novel computational method to predict recallability from gaze behaviour that outperforms several baselines on this task. Taken together, our results shed light on assessing recallability from gaze behaviour and inform future work on recallability-based visualisation optimisation.
@article{wang24_etra, title = {VisRecall++: Analysing and Predicting Visualisation Recallability from Gaze Behaviour}, author = {Wang, Yao and Jiang, Yue and Hu, Zhiming and Ruhdorfer, Constantin and Bâce, Mihai and Bulling, Andreas}, year = {2024}, journal = {Proc. ACM on Human-Computer Interaction (PACM HCI)}, pages = {1--18}, volume = {8}, number = {ETRA}, doi = {10.1145/3655613} }
-
Individual differences in visuo-spatial working memory capacity and prior knowledge during interrupted reading
Francesca Zermiani, Prajit Dhar, Florian Strohm, Sibylle Baumbach, Andreas Bulling, Maria Wirzberger
Frontiers in Cognition, 3, pp. 1–9, 2024.
Interruptions are often pervasive and require attentional shifts from the primary task. Limited data are available on the factors influencing individuals’ efficiency in resuming from interruptions during digital reading. The reported investigation – conducted using the InteRead dataset – examined whether individual differences in visuo-spatial working memory capacity (vsWMC) and prior knowledge could influence resumption lag times during interrupted reading. Participants’ vsWMC was assessed using the symmetry span (SSPAN) task, while a pre-test questionnaire targeted their background knowledge about the text. While reading an extract from a Sherlock Holmes story, they were interrupted six times and asked to answer an opinion question. Our analyses revealed that the interaction between vsWMC and prior knowledge significantly predicted the time needed to resume reading following an interruption. The results from our analyses are discussed in relation to theoretical frameworks of task resumption and current research in the field.
Paper: zermiani24_fic.pdf
@article{zermiani24_fic, title = {Individual differences in visuo-spatial working memory capacity and prior knowledge during interrupted reading}, author = {Zermiani, Francesca and Dhar, Prajit and Strohm, Florian and Baumbach, Sibylle and Bulling, Andreas and Wirzberger, Maria}, year = {2024}, doi = {10.3389/fcogn.2024.1434642}, pages = {1--9}, volume = {3}, journal = {Frontiers in Cognition} }
Conference Papers
-
DisMouse: Disentangling Information from Mouse Movement Data
Guanhua Zhang, Zhiming Hu, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 1–13, 2024.
Mouse movement data contain rich information about users, performed tasks, and user interfaces, but separating the respective components remains challenging and unexplored. As a first step to address this challenge, we propose DisMouse – the first method to disentangle user-specific and user-independent information and stochastic variations from mouse movement data. At the core of our method is an autoencoder trained in a semi-supervised fashion, consisting of a self-supervised denoising diffusion process and a supervised contrastive user identification module. Through evaluations on three datasets, we show that DisMouse 1) captures complementary information of mouse input, hence providing an interpretable framework for modelling mouse movements, 2) can be used to produce refined features, thus enabling various applications such as personalised and variable mouse data generation, and 3) generalises across different datasets. Taken together, our results underline the significant potential of disentangled representation learning for explainable, controllable, and generalised mouse behaviour modelling.
Paper: zhang24_uist.pdf
@inproceedings{zhang24_uist, title = {DisMouse: Disentangling Information from Mouse Movement Data}, author = {Zhang, Guanhua and Hu, Zhiming and Bulling, Andreas}, year = {2024}, pages = {1--13}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, doi = {} }
-
SalChartQA: Question-driven Saliency on Information Visualisations
Yao Wang, Weitian Wang, Abdullah Abdelhafez, Mayar Elfares, Zhiming Hu, Mihai Bâce, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–14, 2024.
Understanding the link between visual attention and user’s needs when visually exploring information visualisations is under-explored due to a lack of large and diverse datasets to facilitate these analyses. To fill this gap, we introduce SalChartQA – a novel crowd-sourced dataset that uses the BubbleView interface as a proxy for human gaze and a question-answering (QA) paradigm to induce different information needs in users. SalChartQA contains 74,340 answers to 6,000 questions on 3,000 visualisations. Informed by our analyses demonstrating the tight correlation between the question and visual saliency, we propose the first computational method to predict question-driven saliency on information visualisations. Our method outperforms state-of-the-art saliency models, improving several metrics, such as the correlation coefficient and the Kullback-Leibler divergence. These results show the importance of information needs for shaping attention behaviour and pave the way for new applications, such as task-driven optimisation of visualisations or explainable AI in chart question-answering.
Paper: wang24_chi.pdf
Supplementary Material: wang24_chi_sup.pdf
@inproceedings{wang24_chi, title = {SalChartQA: Question-driven Saliency on Information Visualisations}, author = {Wang, Yao and Wang, Weitian and Abdelhafez, Abdullah and Elfares, Mayar and Hu, Zhiming and B{\^a}ce, Mihai and Bulling, Andreas}, year = {2024}, pages = {1--14}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3613904.3642942} }
-
Mouse2Vec: Learning Reusable Semantic Representations of Mouse Behaviour
Guanhua Zhang, Zhiming Hu, Mihai Bâce, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–17, 2024.
The mouse is a pervasive input device used for a wide range of interactive applications. However, computational modelling of mouse behaviour typically requires time-consuming design and extraction of handcrafted features, or approaches that are application-specific. We instead propose Mouse2Vec – a novel self-supervised method designed to learn semantic representations of mouse behaviour that are reusable across users and applications. Mouse2Vec uses a Transformer-based encoder-decoder architecture, which is specifically geared for mouse data: During pretraining, the encoder learns an embedding of input mouse trajectories while the decoder reconstructs the input and simultaneously detects mouse click events. We show that the representations learned by our method can identify interpretable mouse behaviour clusters and retrieve similar mouse trajectories. We also demonstrate on three sample downstream tasks that the representations can be practically used to augment mouse data for training supervised methods and serve as an effective feature extractor.
@inproceedings{zhang24_chi, title = {Mouse2Vec: Learning Reusable Semantic Representations of Mouse Behaviour}, author = {Zhang, Guanhua and Hu, Zhiming and B{\^a}ce, Mihai and Bulling, Andreas}, year = {2024}, pages = {1--17}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3613904.3642141} }
-
Multi-Modal Video Dialog State Tracking in the Wild
Adnen Abdessaied, Lei Shi, Andreas Bulling
Proc. 18th European Conference on Computer Vision (ECCV), pp. 1–25, 2024.
We present MST-MIXER – a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short in two major aspects: (1) They either track only one modality (mostly the visual input) or (2) they target synthetic datasets that do not reflect the complexity of real-world, in-the-wild scenarios. Our model addresses these two limitations in an attempt to close this crucial research gap. Specifically, MST-MIXER first tracks the most important constituents of each input modality. Then, it predicts the missing underlying structure of the selected constituents of each modality by learning local latent graphs using a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to form a global graph operating on the mix of all modalities which further refines its structure and node embeddings. Finally, the fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). MST-MIXER achieves new state-of-the-art results on five challenging benchmarks.
@inproceedings{abdessaied24_eccv, author = {Abdessaied, Adnen and Shi, Lei and Bulling, Andreas}, title = {Multi-Modal Video Dialog State Tracking in the Wild}, booktitle = {Proc. 18th European Conference on Computer Vision (ECCV)}, year = {2024}, pages = {1--25} }
-
VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal GRaphs
Adnen Abdessaied, Lei Shi, Andreas Bulling
Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 5805–5814, 2024.
We propose VD-GR – a novel visual dialog model that combines pre-trained language models (LMs) with graph neural networks (GNNs). Prior works mainly focused on one class of models at the expense of the other, thus missing out on the opportunity of combining their respective benefits. At the core of VD-GR is a novel integration mechanism that alternates between spatial-temporal multi-modal GNNs and BERT layers, and that covers three distinct contributions: First, we use multi-modal GNNs to process the features of each modality (image, question, and dialog history) and exploit their local structures before performing BERT global attention. Second, we propose hub-nodes that link to all other nodes within one modality graph, allowing the model to propagate information from one GNN (modality) to the other in a cascaded manner. Third, we augment the BERT hidden states with fine-grained multi-modal GNN features before passing them to the next VD-GR layer. Evaluations on VisDial v1.0, VisDial v0.9, VisDialConv, and VisPro show that VD-GR achieves new state-of-the-art results across all four datasets.
@inproceedings{abdessaied24_wacv, author = {Abdessaied, Adnen and Shi, Lei and Bulling, Andreas}, title = {VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal GRaphs}, booktitle = {Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}, year = {2024}, pages = {5805--5814} }
-
OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog
Adnen Abdessaied, Manuel von Hochmeister, Andreas Bulling
Proc. 31st Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), pp. 1–11, 2024.
We present the Object Language Video Transformer (OLViT) – a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): while the OST attends to the most important objects within the video, the LST keeps track of the most important linguistic co-references to previous dialog turns. In stark contrast to previous works, our approach is generic by nature and is therefore capable of learning continuous multi-modal dialog state representations of the most relevant objects and rounds. As a result, they can be seamlessly integrated into Large Language Models (LLMs) and offer high flexibility in dealing with different datasets and tasks. Evaluations on the challenging DVD (response classification) and SIMMC 2.1 (response generation) datasets show that OLViT achieves new state-of-the-art performance across both datasets.
@inproceedings{abdessaied24_coling, author = {Abdessaied, Adnen and von Hochmeister, Manuel and Bulling, Andreas}, title = {OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog}, booktitle = {Proc. 31st Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)}, year = {2024}, pages = {1--11} }
-
EyeSeeIdentity: Exploring Natural Gaze Behaviour for Implicit User Identification during Photo Viewing
Yasmeen Abdrabou, Mariam Hassib, Shuqin Hu, Ken Pfeuffer, Mohamed Khamis, Andreas Bulling, Florian Alt
Proc. Symposium on Usable Security and Privacy (USEC), pp. 1–12, 2024.
Existing gaze-based methods for user identification either require special-purpose visual stimuli or artificial gaze behaviour. Here, we explore how users can be differentiated by analysing natural gaze behaviour while freely looking at images. Our approach is based on the observation that looking at different images, for example, a picture from your last holiday, induces stronger emotional responses that are reflected in gaze behaviour and, hence, is unique to the person having experienced that situation. We collected gaze data in a remote study (N = 39) where participants looked at three image categories: personal images, other people’s images, and random images from the Internet. We demonstrate the potential of identifying different people using machine learning with an accuracy of 85%. The results pave the way towards a new class of authentication methods solely based on natural human gaze behaviour.
Paper: abdrabou24_usec.pdf
@inproceedings{abdrabou24_usec, author = {Abdrabou, Yasmeen and Hassib, Mariam and Hu, Shuqin and Pfeuffer, Ken and Khamis, Mohamed and Bulling, Andreas and Alt, Florian}, title = {EyeSeeIdentity: Exploring Natural Gaze Behaviour for Implicit User Identification during Photo Viewing}, booktitle = {Proc. Symposium on Usable Security and Privacy (USEC)}, year = {2024}, pages = {1--12} }
-
Neural Reasoning About Agents’ Goals, Preferences, and Actions
Matteo Bortoletto, Lei Shi, Andreas Bulling
Proc. 38th AAAI Conference on Artificial Intelligence (AAAI), pp. 456–464, 2024.
We propose the Intuitive Reasoning Network (IRENE) – a novel neural model for intuitive psychological reasoning about agents’ goals, preferences, and actions that can generalise previous experiences to new situations. IRENE combines a graph neural network for learning agent and world state representations with a transformer to encode the task context. When evaluated on the challenging Baby Intuitions Benchmark, IRENE achieves new state-of-the-art performance on three out of its five tasks – with up to 48.9 % improvement. In contrast to existing methods, IRENE is able to bind preferences to specific agents, to better distinguish between rational and irrational agents, and to better understand the role of blocking obstacles. We also investigate, for the first time, the influence of the training tasks on test performance. Our analyses demonstrate the effectiveness of IRENE in combining prior knowledge gained during training for unseen evaluation tasks.
@inproceedings{bortoletto24_aaai, author = {Bortoletto, Matteo and Shi, Lei and Bulling, Andreas}, title = {Neural Reasoning About Agents’ Goals, Preferences, and Actions}, booktitle = {Proc. 38th AAAI Conference on Artificial Intelligence (AAAI)}, year = {2024}, volume = {38}, number = {1}, pages = {456--464}, doi = {10.1609/aaai.v38i1.27800} }
-
Limits of Theory of Mind Modelling in Dialogue-Based Collaborative Plan Acquisition
Matteo Bortoletto, Constantin Ruhdorfer, Adnen Abdessaied, Lei Shi, Andreas Bulling
Proc. 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1–16, 2024.
Recent work on dialogue-based collaborative plan acquisition (CPA) has suggested that Theory of Mind (ToM) modelling can improve missing knowledge prediction in settings with asymmetric skill-sets and knowledge. Although ToM was claimed to be important for effective collaboration, its real impact on this novel task remains under-explored. By representing plans as graphs and by exploiting task-specific constraints we show that, as performance on CPA nearly doubles when predicting one’s own missing knowledge, the improvements due to ToM modelling diminish. This phenomenon persists even when evaluating existing baseline methods. To better understand the relevance of ToM for CPA, we report a principled performance comparison of models with and without ToM features. Results across different models and ablations consistently suggest that learned ToM features are indeed more likely to reflect latent patterns in the data with no perceivable link to ToM. This finding calls for a deeper understanding of the role of ToM in CPA and beyond, as well as new methods for modelling and evaluating mental states in computational collaborative agents.
@inproceedings{bortoletto24_acl, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Abdessaied, Adnen and Shi, Lei and Bulling, Andreas}, title = {Limits of Theory of Mind Modelling in Dialogue-Based Collaborative Plan Acquisition}, booktitle = {Proc. 62nd Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2024}, pages = {1--16}, doi = {} }
-
Benchmarking Mental State Representations in Language Models
Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling
Proc. ICML 2024 Workshop on Mechanistic Interpretability, pp. 1–21, 2024.
While numerous works have assessed the generative performance of language models (LMs) on tasks requiring Theory of Mind reasoning, research into the models’ internal representation of mental states remains limited. Recent work has used probing to demonstrate that LMs can represent beliefs of themselves and others. However, these claims are accompanied by limited evaluation, making it difficult to assess how mental state representations are affected by model design and training choices. We report an extensive benchmark with various LM types with different model sizes, fine-tuning approaches, and prompt designs to study the robustness of mental state representations and memorisation issues within the probes. Our results show that the quality of models’ internal representations of the beliefs of others increases with model size and, more crucially, with fine-tuning. We are the first to study how prompt variations impact probing performance on theory of mind tasks. We demonstrate that models’ representations are sensitive to prompt variations, even when such variations should be beneficial. Finally, we complement previous activation editing experiments on Theory of Mind tasks and show that it is possible to improve models’ reasoning performance by steering their activations without the need to train any probe.
@inproceedings{bortoletto24_icmlw, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Shi, Lei and Bulling, Andreas}, title = {Benchmarking Mental State Representations in Language Models}, booktitle = {Proc. ICML 2024 Workshop on Mechanistic Interpretability}, year = {2024}, pages = {1--21}, doi = {}, url = {https://openreview.net/forum?id=yEwEVoH9Be} }
-
Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions
Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling
Proc. 27th European Conference on Artificial Intelligence (ECAI), pp. 1–8, 2024.
We propose MToMnet - a Theory of Mind (ToM) neural network for predicting beliefs and their dynamics during human social interactions from multimodal input. ToM is key for effective nonverbal human communication and collaboration, yet, existing methods for belief modelling have not included explicit ToM modelling or have typically been limited to one or two modalities. MToMnet encodes contextual cues (scene videos and object locations) and integrates them with person-specific cues (human gaze and body language) in a separate MindNet for each person. Inspired by prior research on social cognition and computational ToM, we propose three different MToMnet variants: two involving fusion of latent representations and one involving re-ranking of classification scores. We evaluate our approach on two challenging real-world datasets, one focusing on belief prediction, while the other examining belief dynamics prediction. Our results demonstrate that MToMnet surpasses existing methods by a large margin while at the same time requiring a significantly smaller number of parameters. Taken together, our method opens up a highly promising direction for future work on artificial intelligent systems that can robustly predict human beliefs from their non-verbal behaviour and, as such, more effectively collaborate with humans.@inproceedings{bortoletto24_ecai, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Shi, Lei and Bulling, Andreas}, title = {Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions}, booktitle = {Proc. 27th European Conference on Artificial Intelligence (ECAI)}, year = {2024}, pages = {1--8}, doi = {} } -
Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements
Jhon Paul Feliciano Charaja Casas, Isabell Wochner, Pierre Schumacher, Winfried Ilg, Martin Giese, Christophe Maufroy, Andreas Bulling, Syn Schmitt, Daniel F.B. Haeufle
Proc. 10th IEEE RAS/EMBS International Conference on Biomedical Robotics and Biomechatronics, pp. 1–6, 2024.
The mimicking of human-like arm movement characteristics involves the consideration of three factors during control policy synthesis: (a) chosen task requirements, (b) inclusion of noise during movement execution and (c) chosen optimality principles. Previous studies showed that when considering these factors (a-c) individually, it is possible to synthesize arm movements that either kinematically match the experimental data or reproduce the stereotypical triphasic muscle activation pattern. However, to date no quantitative comparison has been made on how realistic the arm movement generated by each factor is; as well as whether a partial or total combination of all factors results in arm movements with human-like kinematic characteristics and a triphasic muscle pattern. To investigate this, we used reinforcement learning to learn a control policy for a musculoskeletal arm model, aiming to discern which combination of factors (a-c) results in realistic arm movements according to four frequently reported stereotypical characteristics. Our findings indicate that incorporating velocity and acceleration requirements into the reaching task, employing reward terms that encourage minimization of mechanical work, hand jerk, and control effort, along with the inclusion of noise during movement, leads to the emergence of realistic human arm movements in reinforcement learning. We expect that the gained insights will help in the future to better predict desired arm movements and corrective forces in wearable assistive devices. Paper: casas24_biorob.pdf @inproceedings{casas24_biorob, title = {Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements}, author = {Casas, Jhon Paul Feliciano Charaja and Wochner, Isabell and Schumacher, Pierre and Ilg, Winfried and Giese, Martin and Maufroy, Christophe and Bulling, Andreas and Schmitt, Syn and Haeufle, Daniel F.B.}, year = {2024}, booktitle = {Proc. 10th IEEE RAS/EMBS International Conference on Biomedical Robotics and Biomechatronics}, pages = {1--6} } -
Explaining Disagreement in Visual Question Answering Using Eye Tracking
Susanne Hindennach, Lei Shi, Andreas Bulling
Proc. International Workshop on Pervasive Eye Tracking and Mobile Gaze-Based Interaction (PETMEI), pp. 1–7, 2024.
When presented with the same question about an image, human annotators often give valid but disagreeing answers indicating that their reasoning was different. Such differences are lost in a single ground truth label used to train and evaluate visual question answering (VQA) methods. In this work, we explore whether visual attention maps, created using stationary eye tracking, provide insight into the reasoning underlying disagreement in VQA. We first manually inspect attention maps in the recent VQA-MHUG dataset and find cases in which attention differs consistently for disagreeing answers. We further evaluate the suitability of four different similarity metrics to detect attention differences matching the disagreement. We show that attention maps plausibly surface differences in reasoning underlying one type of disagreement, and that the metrics complementarily detect them. Taken together, our results represent an important first step to leverage eye-tracking to explain disagreement in VQA.Paper: hindennach24_petmei.pdf@inproceedings{hindennach24_petmei, title = {Explaining Disagreement in Visual Question Answering Using Eye Tracking}, author = {Hindennach, Susanne and Shi, Lei and Bulling, Andreas}, year = {2024}, pages = {1--7}, doi = {10.1145/3649902.3656356}, booktitle = {Proc. International Workshop on Pervasive Eye Tracking and Mobile Gaze-Based Interaction (PETMEI)} } -
GazeMotion: Gaze-guided Human Motion Forecasting
Zhiming Hu, Syn Schmitt, Daniel Häufle, Andreas Bulling
Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pp. 1–6, 2024.
We present GazeMotion, a novel method for human motion forecasting that combines information on past human poses with human eye gaze. Inspired by evidence from behavioural sciences showing that human eye and body movements are closely coordinated, GazeMotion first predicts future eye gaze from past gaze, then fuses predicted future gaze and past poses into a gaze-pose graph, and finally uses a residual graph convolutional network to forecast body motion. We extensively evaluate our method on the MoGaze, ADT, and GIMO benchmark datasets and show that it outperforms state-of-the-art methods by up to 7.4% improvement in mean per joint position error. Using head direction as a proxy to gaze, our method still achieves an average improvement of 5.5%. We finally report an online user study showing that our method also outperforms prior methods in terms of perceived realism. These results show the significant information content available in eye gaze for human motion forecasting as well as the effectiveness of our method in exploiting this information.@inproceedings{hu24_iros, author = {Hu, Zhiming and Schmitt, Syn and Häufle, Daniel and Bulling, Andreas}, title = {GazeMotion: Gaze-guided Human Motion Forecasting}, year = {2024}, pages = {1--6}, booktitle = {Proc. IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)}, video = {https://youtu.be/I-ecIvRqOCY?si=kK8SE0r-JadwOKLt} } -
VSA4VQA: Scaling A Vector Symbolic Architecture To Visual Question Answering on Natural Images
Anna Penzkofer, Lei Shi, Andreas Bulling
Proc. 46th Annual Meeting of the Cognitive Science Society (CogSci), 2024.
While Vector Symbolic Architectures (VSAs) are promising for modelling spatial cognition, their application is currently limited to artificially generated images and simple spatial queries. We propose VSA4VQA – a novel 4D implementation of VSAs that implements a mental representation of natural images for the challenging task of Visual Question Answering (VQA). VSA4VQA is the first model to scale a VSA to complex spatial queries. Our method is based on the Semantic Pointer Architecture (SPA) to encode objects in a hyper-dimensional vector space. To encode natural images, we extend the SPA to include dimensions for objects’ width and height in addition to their spatial location. To perform spatial queries we further introduce learned spatial query masks and integrate a pre-trained vision-language model for answering attribute-related questions. We evaluate our method on the GQA benchmark dataset and show that it can effectively encode natural images, achieving competitive performance to state-of-the-art deep learning methods for zero-shot VQA. @inproceedings{penzkofer24_cogsci, author = {Penzkofer, Anna and Shi, Lei and Bulling, Andreas}, title = {{VSA4VQA}: {Scaling} {A} {Vector} {Symbolic} {Architecture} {To} {Visual} {Question} {Answering} on {Natural} {Images}}, booktitle = {Proc. 46th Annual Meeting of the Cognitive Science Society (CogSci)}, year = {2024}, volume = {46}, url = {https://escholarship.org/uc/item/26j7v1nf} } -
Quantifying Human Upper Limb Stiffness Responses Based on a Computationally Efficient Neuromusculoskeletal Arm Model
Maria Sapounaki, Pierre Schumacher, Winfried Ilg, Martin Giese, Christophe Maufroy, Andreas Bulling, Syn Schmitt, Daniel F.B. Haeufle, Isabell Wochner
Proc. 10th IEEE RAS/EMBS International Conference on Biomedical Robotics and Biomechatronics, pp. 1–6, 2024.
The mimicking of human-like arm movement characteristics involves the consideration of three factors during control policy synthesis: (a) chosen task requirements, (b) inclusion of noise during movement execution and (c) chosen optimality principles. Previous studies showed that when considering these factors (a-c) individually, it is possible to synthesize arm movements that either kinematically match the experimental data or reproduce the stereotypical triphasic muscle activation pattern. However, to date no quantitative comparison has been made on how realistic the arm movement generated by each factor is; as well as whether a partial or total combination of all factors results in arm movements with human-like kinematic characteristics and a triphasic muscle pattern. To investigate this, we used reinforcement learning to learn a control policy for a musculoskeletal arm model, aiming to discern which combination of factors (a-c) results in realistic arm movements according to four frequently reported stereotypical characteristics. Our findings indicate that incorporating velocity and acceleration requirements into the reaching task, employing reward terms that encourage minimization of mechanical work, hand jerk, and control effort, along with the inclusion of noise during movement, leads to the emergence of realistic human arm movements in reinforcement learning. We expect that the gained insights will help in the future to better predict desired arm movements and corrective forces in wearable assistive devices. Paper: sapounaki24_biorob.pdf @inproceedings{sapounaki24_biorob, title = {Quantifying Human Upper Limb Stiffness Responses Based on a Computationally Efficient Neuromusculoskeletal Arm Model}, author = {Sapounaki, Maria and Schumacher, Pierre and Ilg, Winfried and Giese, Martin and Maufroy, Christophe and Bulling, Andreas and Schmitt, Syn and Haeufle, Daniel F.B. and Wochner, Isabell}, year = {2024}, booktitle = {Proc. 10th IEEE RAS/EMBS International Conference on Biomedical Robotics and Biomechatronics}, pages = {1--6} } -
Inferring Human Intentions from Predicted Action Probabilities
Lei Shi, Paul-Christian Bürkner, Andreas Bulling
Proc. Workshop on Theory of Mind in Human-AI Interaction at CHI 2024, pp. 1–7, 2024.
Inferring human intentions is a core challenge in human-AI collaboration but while Bayesian methods struggle with complex visual input, deep neural network (DNN) based methods do not provide uncertainty quantifications. In this work we combine both approaches for the first time and show that the predicted next action probabilities contain information that can be used to infer the underlying user intention. We propose a two-step approach to human intention prediction: While a DNN predicts the probabilities of the next action, MCMC-based Bayesian inference is used to infer the underlying intention from these predictions. This approach not only allows for the independent design of the DNN architecture but also the subsequently fast, design-independent inference of human intentions. We evaluate our method using a series of experiments on the Watch-And-Help (WAH) and a keyboard and mouse interaction dataset. Our results show that our approach can accurately predict human intentions from observed actions and the implicit information contained in next action probabilities. Furthermore, we show that our approach can predict the correct intention even if only a few actions have been observed. Paper: shi24_chiw.pdf @inproceedings{shi24_chiw, author = {Shi, Lei and Bürkner, Paul-Christian and Bulling, Andreas}, title = {Inferring Human Intentions from Predicted Action Probabilities}, booktitle = {Proc. Workshop on Theory of Mind in Human-AI Interaction at CHI 2024}, year = {2024}, pages = {1--7}, doi = {} } -
Saliency3D: a 3D Saliency Dataset Collected on Screen
Yao Wang, Qi Dai, Mihai Bâce, Karsten Klein, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2024.
While visual saliency has recently been studied in 3D, the experimental setup for collecting 3D saliency data can be expensive and cumbersome. To address this challenge, we propose a novel experimental design that utilizes an eye tracker on a screen to collect 3D saliency data. Our experimental design reduces the cost and complexity of 3D saliency dataset collection. We first collect gaze data on a screen, then we map them to 3D saliency data through perspective transformation. Using this method, we collect a 3D saliency dataset (49,276 fixations) comprising 10 participants looking at sixteen objects. Moreover, we examine the viewing preferences for objects and discuss our findings in this study. Our results indicate potential preferred viewing directions and a correlation between salient features and the variation in viewing directions.Paper: wang24_etras.pdf@inproceedings{wang24_etras, title = {Saliency3D: a 3D Saliency Dataset Collected on Screen}, author = {Wang, Yao and Dai, Qi and B{\^a}ce, Mihai and Klein, Karsten and Bulling, Andreas}, year = {2024}, pages = {1--9}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3649902.3653350} } -
GazeMoDiff: Gaze-guided Diffusion Model for Stochastic Human Motion Prediction
Haodong Yan, Zhiming Hu, Syn Schmitt, Andreas Bulling
Proc. 32nd Pacific Conference on Computer Graphics and Applications (PG), pp. 1–10, 2024.
Human motion prediction is important for many virtual and augmented reality (VR/AR) applications such as collision avoidance and realistic avatar generation. Existing methods have synthesised body motion only from observed past motion, despite the fact that human eye gaze is known to correlate strongly with body movements and is readily available in recent VR/AR headsets. We present GazeMoDiff – a novel gaze-guided denoising diffusion model to generate stochastic human motions. Our method first uses a gaze encoder and a motion encoder to extract the gaze and motion features respectively, then employs a graph attention network to fuse these features, and finally injects the gaze-motion features into a noise prediction network via a cross-attention mechanism to progressively generate multiple reasonable human motions in the future. Extensive experiments on the MoGaze and GIMO datasets demonstrate that our method outperforms the state-of-the-art methods by a large margin in terms of multi-modal final displacement error (17.3% on MoGaze and 13.3% on GIMO). We further conducted a human study (N=21) and validated that the motions generated by our method were perceived as both more precise and more realistic than those of prior methods. Taken together, these results reveal the significant information content available in eye gaze for stochastic human motion prediction as well as the effectiveness of our method in exploiting this information. Paper: yan24_pg.pdf @inproceedings{yan24_pg, title = {GazeMoDiff: Gaze-guided Diffusion Model for Stochastic Human Motion Prediction}, author = {Yan, Haodong and Hu, Zhiming and Schmitt, Syn and Bulling, Andreas}, year = {2024}, doi = {}, pages = {1--10}, booktitle = {Proc. 32nd Pacific Conference on Computer Graphics and Applications (PG)} } -
InteRead: An Eye Tracking Dataset of Interrupted Reading
Francesca Zermiani, Prajit Dhar, Ekta Sood, Fabian Kögel, Andreas Bulling, Maria Wirzberger
Proc. 31st Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING), pp. 9154–9169, 2024.
Eye movements during reading offer a window into cognitive processes and language comprehension, but the scarcity of reading data with interruptions – which learners frequently encounter in their everyday learning environments – hampers advances in the development of intelligent learning technologies. We introduce InteRead – a novel 50-participant dataset of gaze data recorded during self-paced reading of real-world text. InteRead further offers fine-grained annotations of interruptions interspersed throughout the text as well as resumption lags incurred by these interruptions. Interruptions were triggered automatically once readers reached predefined target words. We validate our dataset by reporting interdisciplinary analyses on different measures of gaze behavior. In line with prior research, our analyses show that the interruptions as well as word length and word frequency effects significantly impact eye movements during reading. We also explore individual differences within our dataset, shedding light on the potential for tailored educational solutions. InteRead is accessible from our datasets web-page: https://www.ife.uni-stuttgart.de/en/llis/research/datasets/.@inproceedings{zermiani24_coling, title = {InteRead: An Eye Tracking Dataset of Interrupted Reading}, author = {Zermiani, Francesca and Dhar, Prajit and Sood, Ekta and Kögel, Fabian and Bulling, Andreas and Wirzberger, Maria}, year = {2024}, booktitle = {Proc. 31st Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING)}, pages = {9154--9169}, doi = {}, url = {https://aclanthology.org/2024.lrec-main.802/} }
Technical Reports
-
PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation
Mayar Elfares, Pascal Reisert, Zhiming Hu, Wenwu Tang, Ralf Küsters, Andreas Bulling
arXiv:2402.18970, pp. 1–22, 2024.
Latest gaze estimation methods require large-scale training data but their collection and exchange pose significant privacy risks. We propose PrivatEyes - the first privacy-enhancing training approach for appearance-based gaze estimation based on federated learning (FL) and secure multi-party computation (MPC). PrivatEyes enables training gaze estimators on multiple local datasets across different users and server-based secure aggregation of the individual estimators’ updates. PrivatEyes guarantees that individual gaze data remains private even if a majority of the aggregating servers is malicious. We also introduce a new data leakage attack DualView that shows that PrivatEyes limits the leakage of private training data more effectively than previous approaches. Evaluations on the MPIIGaze, MPIIFaceGaze, GazeCapture, and NVGaze datasets further show that the improved privacy does not lead to a lower gaze estimation accuracy or substantially higher computational costs - both of which are on par with its non-secure counterparts.@techreport{elfares24_arxiv, title = {PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation}, author = {Elfares, Mayar and Reisert, Pascal and Hu, Zhiming and Tang, Wenwu and Küsters, Ralf and Bulling, Andreas}, year = {2024}, doi = {10.48550/arXiv.2402.18970}, pages = {1--22} } -
OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog
Adnen Abdessaied, Manuel Hochmeister, Andreas Bulling
arXiv:2402.13146, pp. 1–11, 2024.
We present the Object Language Video Transformer (OLViT) – a novel model for video dialog operating over a multi-modal attention-based dialog state tracker. Existing video dialog models struggle with questions requiring both spatial and temporal localization within videos, long-term temporal reasoning, and accurate object tracking across multiple dialog turns. OLViT addresses these challenges by maintaining a global dialog state based on the output of an Object State Tracker (OST) and a Language State Tracker (LST): while the OST attends to the most important objects within the video, the LST keeps track of the most important linguistic co-references to previous dialog turns. In stark contrast to previous works, our approach is generic by nature and is therefore capable of learning continuous multi-modal dialog state representations of the most relevant objects and rounds. As a result, they can be seamlessly integrated into Large Language Models (LLMs) and offer high flexibility in dealing with different datasets and tasks. Evaluations on the challenging DVD (response classification) and SIMMC 2.1 (response generation) datasets show that OLViT achieves new state-of-the-art performance across both datasets.@techreport{abdessaied24_arxiv, author = {Abdessaied, Adnen and von Hochmeister, Manuel and Bulling, Andreas}, title = {OLViT: Multi-Modal State Tracking via Attention-Based Embeddings for Video-Grounded Dialog}, year = {2024}, pages = {1--11}, url = {https://arxiv.org/abs/2402.13146} } -
Limits of Theory of Mind Modelling in Dialogue-Based Collaborative Plan Acquisition
Matteo Bortoletto, Constantin Ruhdorfer, Adnen Abdessaied, Lei Shi, Andreas Bulling
arXiv:2405.12621, pp. 1–16, 2024.
Recent work on dialogue-based collaborative plan acquisition (CPA) has suggested that Theory of Mind (ToM) modelling can improve missing knowledge prediction in settings with asymmetric skill-sets and knowledge. Although ToM was claimed to be important for effective collaboration, its real impact on this novel task remains under-explored. By representing plans as graphs and by exploiting task-specific constraints we show that, as performance on CPA nearly doubles when predicting one’s own missing knowledge, the improvements due to ToM modelling diminish. This phenomenon persists even when evaluating existing baseline methods. To better understand the relevance of ToM for CPA, we report a principled performance comparison of models with and without ToM features. Results across different models and ablations consistently suggest that learned ToM features are indeed more likely to reflect latent patterns in the data with no perceivable link to ToM. This finding calls for a deeper understanding of the role of ToM in CPA and beyond, as well as new methods for modelling and evaluating mental states in computational collaborative agents.@techreport{bortoletto24_arxiv, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Abdessaied, Adnen and Shi, Lei and Bulling, Andreas}, title = {Limits of Theory of Mind Modelling in Dialogue-Based Collaborative Plan Acquisition}, year = {2024}, pages = {1--16}, url = {https://arxiv.org/abs/2405.12621} } -
Benchmarking Mental State Representations in Language Models
Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling
arXiv:2406.17513, pp. 1–21, 2024.
While numerous works have assessed the generative performance of language models (LMs) on tasks requiring Theory of Mind reasoning, research into the models’ internal representation of mental states remains limited. Recent work has used probing to demonstrate that LMs can represent beliefs of themselves and others. However, these claims are accompanied by limited evaluation, making it difficult to assess how mental state representations are affected by model design and training choices. We report an extensive benchmark with various LM types with different model sizes, fine-tuning approaches, and prompt designs to study the robustness of mental state representations and memorisation issues within the probes. Our results show that the quality of models’ internal representations of the beliefs of others increases with model size and, more crucially, with fine-tuning. We are the first to study how prompt variations impact probing performance on theory of mind tasks. We demonstrate that models’ representations are sensitive to prompt variations, even when such variations should be beneficial. Finally, we complement previous activation editing experiments on Theory of Mind tasks and show that it is possible to improve models’ reasoning performance by steering their activations without the need to train any probe.@techreport{bortoletto24_arxiv_2, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Shi, Lei and Bulling, Andreas}, title = {Benchmarking Mental State Representations in Language Models}, year = {2024}, pages = {1--21}, url = {https://arxiv.org/abs/2406.17513} } -
Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions
Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling
arXiv:2407.06762, pp. 1–11, 2024.
We propose MToMnet - a Theory of Mind (ToM) neural network for predicting beliefs and their dynamics during human social interactions from multimodal input. ToM is key for effective nonverbal human communication and collaboration, yet, existing methods for belief modelling have not included explicit ToM modelling or have typically been limited to one or two modalities. MToMnet encodes contextual cues (scene videos and object locations) and integrates them with person-specific cues (human gaze and body language) in a separate MindNet for each person. Inspired by prior research on social cognition and computational ToM, we propose three different MToMnet variants: two involving fusion of latent representations and one involving re-ranking of classification scores. We evaluate our approach on two challenging real-world datasets, one focusing on belief prediction, while the other examining belief dynamics prediction. Our results demonstrate that MToMnet surpasses existing methods by a large margin while at the same time requiring a significantly smaller number of parameters. Taken together, our method opens up a highly promising direction for future work on artificial intelligent systems that can robustly predict human beliefs from their non-verbal behaviour and, as such, more effectively collaborate with humans. Paper Access: https://arxiv.org/abs/2407.06762 @techreport{bortoletto24_arxiv_3, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Shi, Lei and Bulling, Andreas}, title = {Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions}, year = {2024}, pages = {1--11}, url = {https://arxiv.org/abs/2407.06762} } -
Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements
Jhon Paul Feliciano Charaja Casas, Isabell Wochner, Pierre Schumacher, Winfried Ilg, Martin Giese, Christophe Maufroy, Andreas Bulling, Syn Schmitt, Daniel F.B. Haeufle
arXiv:2402.13949, pp. 1–6, 2024.
The mimicking of human-like arm movement characteristics involves the consideration of three factors during control policy synthesis: (a) chosen task requirements, (b) inclusion of noise during movement execution and (c) chosen optimality principles. Previous studies showed that when considering these factors (a-c) individually, it is possible to synthesize arm movements that either kinematically match the experimental data or reproduce the stereotypical triphasic muscle activation pattern. However, to date no quantitative comparison has been made on how realistic the arm movement generated by each factor is; as well as whether a partial or total combination of all factors results in arm movements with human-like kinematic characteristics and a triphasic muscle pattern. To investigate this, we used reinforcement learning to learn a control policy for a musculoskeletal arm model, aiming to discern which combination of factors (a-c) results in realistic arm movements according to four frequently reported stereotypical characteristics. Our findings indicate that incorporating velocity and acceleration requirements into the reaching task, employing reward terms that encourage minimization of mechanical work, hand jerk, and control effort, along with the inclusion of noise during movement, leads to the emergence of realistic human arm movements in reinforcement learning. We expect that the gained insights will help in the future to better predict desired arm movements and corrective forces in wearable assistive devices. Paper Access: https://arxiv.org/abs/2402.13949 @techreport{casas24_arxiv, title = {Generating Realistic Arm Movements in Reinforcement Learning: A Quantitative Comparison of Reward Terms and Task Requirements}, author = {Casas, Jhon Paul Feliciano Charaja and Wochner, Isabell and Schumacher, Pierre and Ilg, Winfried and Giese, Martin and Maufroy, Christophe and Bulling, Andreas and Schmitt, Syn and Haeufle, Daniel F.B.}, year = {2024}, pages = {1--6}, url = {https://arxiv.org/abs/2402.13949} } -
GazeMotion: Gaze-guided Human Motion Forecasting
Zhiming Hu, Syn Schmitt, Daniel Häufle, Andreas Bulling
arXiv:2403.09885, pp. 1–6, 2024.
We present GazeMotion, a novel method for human motion forecasting that combines information on past human poses with human eye gaze. Inspired by evidence from behavioural sciences showing that human eye and body movements are closely coordinated, GazeMotion first predicts future eye gaze from past gaze, then fuses predicted future gaze and past poses into a gaze-pose graph, and finally uses a residual graph convolutional network to forecast body motion. We extensively evaluate our method on the MoGaze, ADT, and GIMO benchmark datasets and show that it outperforms state-of-the-art methods by up to 7.4% improvement in mean per joint position error. Using head direction as a proxy to gaze, our method still achieves an average improvement of 5.5%. We finally report an online user study showing that our method also outperforms prior methods in terms of perceived realism. These results show the significant information content available in eye gaze for human motion forecasting as well as the effectiveness of our method in exploiting this information. Paper Access: https://arxiv.org/abs/2403.09885 @techreport{hu24_arxiv, author = {Hu, Zhiming and Schmitt, Syn and Häufle, Daniel and Bulling, Andreas}, title = {GazeMotion: Gaze-guided Human Motion Forecasting}, year = {2024}, pages = {1--6}, url = {https://arxiv.org/abs/2403.09885} } -
DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images
Chuhan Jiao, Yao Wang, Guanhua Zhang, Mihai Bâce, Zhiming Hu, Andreas Bulling
arXiv:2403.17477, pp. 1–13, 2024.
We present DiffGaze, a novel method for generating realistic and diverse continuous human gaze sequences on 360° images based on a conditional score-based denoising diffusion model. Generating human gaze on 360° images is important for various human-computer interaction and computer graphics applications, e.g. for creating large-scale eye tracking datasets or for realistic animation of virtual humans. However, existing methods are limited to predicting discrete fixation sequences or aggregated saliency maps, thereby neglecting crucial parts of natural gaze behaviour. Our method uses features extracted from 360° images as condition and uses two transformers to model the temporal and spatial dependencies of continuous human gaze. We evaluate DiffGaze on two 360° image benchmarks for gaze sequence generation as well as scanpath prediction and saliency prediction. Our evaluations show that DiffGaze outperforms state-of-the-art methods on all tasks on both benchmarks. We also report a 21-participant user study showing that our method generates gaze sequences that are indistinguishable from real human sequences. Taken together, our evaluations not only demonstrate the effectiveness of DiffGaze but also point towards a new generation of methods that faithfully model the rich spatial and temporal nature of natural human gaze behaviour. Paper Access: https://arxiv.org/abs/2403.17477 @techreport{jiao24_arxiv, title = {DiffGaze: A Diffusion Model for Continuous Gaze Sequence Generation on 360° Images}, author = {Jiao, Chuhan and Wang, Yao and Zhang, Guanhua and B{\^a}ce, Mihai and Hu, Zhiming and Bulling, Andreas}, year = {2024}, pages = {1--13}, url = {https://arxiv.org/abs/2403.17477} } -
DiffEyeSyn: Diffusion-based User-specific Eye Movement Synthesis
Chuhan Jiao, Guanhua Zhang, Zhiming Hu, Andreas Bulling
arXiv:2409.01240, pp. 1–22, 2024.
High-frequency components in eye gaze data contain user-specific information promising for various applications, but existing gaze modelling methods focus on low frequencies of typically not more than 30 Hz. We present DiffEyeSyn – the first computational method to synthesise high-frequency gaze data, including eye movement characteristics specific to individual users. The key idea is to consider the high-frequency, user-specific information as a special type of noise in eye movement data. This perspective reshapes eye movement synthesis into the task of injecting this user-specific noise into any given eye movement sequence. We formulate this injection task as a conditional diffusion process in which the synthesis is conditioned on user-specific embeddings extracted from the gaze data using pre-trained models for user authentication. We propose user identity guidance – a novel loss function that allows our model to preserve user identity while generating human-like eye movements in the spatial domain. Experiment results on two public high-frequency eye movement biometric datasets show that our synthetic eye movements are indistinguishable from real human eye movements. Furthermore, we demonstrate that DiffEyeSyn can be used to synthesise eye gaze data at scale and for different downstream tasks, such as gaze data imputation and gaze data super-resolution. As such, our work lays the methodological foundations for personalised eye movement synthesis that has significant application potential, such as for character animation, eye movement biometrics, or gaze-based activity and context recognition.Paper Access: https://arxiv.org/abs/2409.01240@techreport{jiao24_arxiv_2, title = {DiffEyeSyn: Diffusion-based User-specific Eye Movement Synthesis}, author = {Jiao, Chuhan and Zhang, Guanhua and Hu, Zhiming and Bulling, Andreas}, year = {2024}, pages = {1--22}, url = {https://arxiv.org/abs/2409.01240} } -
MultiMediate’24: Multi-Domain Engagement Estimation
Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Anna Penzkofer, Dominik Schiller, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling
arXiv:2408.16625, pp. 1–6, 2024.
Estimating the momentary level of participant’s engagement is an important prerequisite for assistive systems that support human interactions. Previous work has addressed this task in within-domain evaluation scenarios, i.e. training and testing on the same dataset. This is in contrast to real-life scenarios where domain shifts between training and testing data frequently occur. With MultiMediate’24, we present the first challenge addressing multi-domain engagement estimation. As training data, we utilise the NOXI database of dyadic novice-expert interactions. In addition to within-domain test data, we add two new test domains. First, we introduce recordings following the NOXI protocol but covering languages that are not present in the NOXI training data. Second, we collected novel engagement annotations on the MPIIGroupInteraction dataset which consists of group discussions between three to four people. In this way, MultiMediate’24 evaluates the ability of approaches to generalise across factors such as language and cultural background, group size, task, and screen-mediated vs. face-to-face interaction. This paper describes the MultiMediate’24 challenge and presents baseline results. In addition, we discuss selected challenge solutions.Paper: mueller24_arxiv.pdfPaper Access: http://arxiv.org/abs/2408.16625@techreport{mueller24_arxiv, title = {MultiMediate'24: Multi-Domain Engagement Estimation}, author = {M{\"{u}}ller, Philipp and Balazia, Michal and Baur, Tobias and Dietz, Michael and Heimerl, Alexander and Penzkofer, Anna and Schiller, Dominik and Brémond, François and Alexandersson, Jan and André, Elisabeth and Bulling, Andreas}, year = {2024}, pages = {1--6}, doi = {10.48550/arXiv.2408.16625}, url = {http://arxiv.org/abs/2408.16625} } -
The Overcooked Generalisation Challenge
Constantin Ruhdorfer, Matteo Bortoletto, Anna Penzkofer, Andreas Bulling
arXiv:2406.17949, pp. 1–25, 2024.
We introduce the Overcooked Generalisation Challenge (OGC) - the first benchmark to study agents’ zero-shot cooperation abilities when faced with novel partners and levels in the Overcooked-AI environment. This perspective contrasts starkly with a large body of previous work that has trained and evaluated cooperating agents only on the same level, failing to capture generalisation abilities required for real-world human-AI cooperation. Our challenge interfaces with state-of-the-art dual curriculum design (DCD) methods to generate auto-curricula for training general agents in Overcooked. It is the first cooperative multi-agent environment specially designed for DCD methods and, consequently, the first benchmarked with state-of-the-art methods. It is fully GPU-accelerated, built on the DCD benchmark suite minimax, and freely available under an open-source license. We show that current DCD algorithms struggle to produce useful policies in this novel challenge, even if combined with recent network architectures that were designed for scalability and generalisability. The OGC pushes the boundaries of real-world human-AI cooperation by enabling the research community to study the impact of generalisation on cooperating agents. @techreport{ruhdorfer2024_arxiv, title = {The Overcooked Generalisation Challenge}, author = {Ruhdorfer, Constantin and Bortoletto, Matteo and Penzkofer, Anna and Bulling, Andreas}, year = {2024}, pages = {1--25}, url = {https://arxiv.org/abs/2406.17949} } -
ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos
Lei Shi, Paul-Christian Bürkner, Andreas Bulling
arXiv:2403.08591, pp. 1–6, 2024.
We present ActionDiffusion – a novel diffusion model for procedure planning in instructional videos that is the first to take temporal inter-dependencies between actions into account. This approach is in stark contrast to existing methods that fail to exploit the rich information content available in the particular order in which actions are performed. Our method unifies the learning of temporal dependencies between actions and denoising of the action plan in the diffusion process by projecting the action information into the noise space. This is achieved 1) by adding action embeddings in the noise masks in the noise-adding phase and 2) by introducing an attention mechanism in the noise prediction network to learn the correlations between different action steps. We report extensive experiments on three instructional video benchmark datasets (CrossTask, Coin, and NIV) and show that our method outperforms previous state-of-the-art methods on all metrics on CrossTask and NIV and all metrics except accuracy on the Coin dataset. We show that by adding action embeddings into the noise mask the diffusion model can better learn action temporal dependencies and increase the performance on procedure planning. Paper Access: https://arxiv.org/abs/2403.08591 @techreport{shi24_arxiv, title = {ActionDiffusion: An Action-aware Diffusion Model for Procedure Planning in Instructional Videos}, author = {Shi, Lei and Bürkner, Paul-Christian and Bulling, Andreas}, year = {2024}, pages = {1--6}, url = {https://arxiv.org/abs/2403.08591} } -
Learning User Embeddings from Human Gaze for Personalised Saliency Prediction
Florian Strohm, Mihai Bâce, Andreas Bulling
arXiv:2403.13653, pp. 1–15, 2024.
Reusable embeddings of user behaviour have shown significant performance improvements for the personalised saliency prediction task. However, prior works require explicit user characteristics and preferences as input, which are often difficult to obtain. We present a novel method to extract user embeddings from pairs of natural images and corresponding saliency maps generated from a small amount of user-specific eye tracking data. At the core of our method is a Siamese convolutional neural encoder that learns the user embeddings by contrasting the image and personal saliency map pairs of different users. Evaluations on two saliency datasets show that the generated embeddings have high discriminative power, are effective at refining universal saliency maps to the individual users, and generalise well across users and images. Finally, based on our model’s ability to encode individual user characteristics, our work points towards other applications that can benefit from reusable embeddings of gaze behaviour.Paper Access: https://arxiv.org/abs/2403.13653@techreport{strohm24_arxiv, title = {Learning User Embeddings from Human Gaze for Personalised Saliency Prediction}, author = {Strohm, Florian and Bâce, Mihai and Bulling, Andreas}, year = {2024}, pages = {1--15}, url = {https://arxiv.org/abs/2403.13653} } -
SeFFeC: Semantic Facial Feature Control for Fine-grained Face Editing
Florian Strohm, Mihai Bâce, Markus Kaltenecker, Andreas Bulling
arXiv:2403.13972, pp. 1–18, 2024.
We propose Semantic Facial Feature Control (SeFFeC) - a novel method for fine-grained face shape editing. Our method enables the manipulation of human-understandable, semantic face features, such as nose length or mouth width, which are defined by different groups of facial landmarks. In contrast to existing methods, the use of facial landmarks enables precise measurement of the facial features, which then enables training SeFFeC without any manually annotated labels. SeFFeC consists of a transformer-based encoder network that takes a latent vector of a pre-trained generative model and a facial feature embedding as input, and learns to modify the latent vector to perform the desired face edit operation. To ensure that the desired feature measurement is changed towards the target value without altering uncorrelated features, we introduced a novel semantic face feature loss. Qualitative and quantitative results show that SeFFeC enables precise and fine-grained control of 23 facial features, some of which could not previously be controlled by other methods, without requiring manual annotations. Unlike existing methods, SeFFeC also provides deterministic control over the exact values of the facial features and more localised and disentangled face edits.Paper Access: https://arxiv.org/abs/2403.13972@techreport{strohm24_arxiv_2, title = {SeFFeC: Semantic Facial Feature Control for Fine-grained Face Editing}, author = {Strohm, Florian and Bâce, Mihai and Kaltenecker, Markus and Bulling, Andreas}, year = {2024}, pages = {1--18}, url = {https://arxiv.org/abs/2403.13972} }
2023
Journal Articles
-
Privacy-Aware Eye Tracking: Challenges and Future Directions
Céline Gressel, Rebekah Overdorf, Inken Hagenstedt, Murat Karaboga, Helmut Lurtz, Michael Raschke, Andreas Bulling
IEEE Pervasive Computing, 22 (1), pp. 95–102, 2023.
What do you have to keep in mind when developing or using eye-tracking technologies regarding privacy? In this article, we discuss the main ethical, technical, and legal categories of privacy, which is much more than just data protection. We additionally provide recommendations about how such technologies might mitigate privacy risks and in which cases the risks are higher than the benefits of the technology. Paper: gressel23_pcm.pdf @article{gressel23_pcm, title = {Privacy-Aware Eye Tracking: Challenges and Future Directions}, author = {Gressel, Céline and Overdorf, Rebekah and Hagenstedt, Inken and Karaboga, Murat and Lurtz, Helmut and Raschke, Michael and Bulling, Andreas}, journal = {IEEE Pervasive Computing}, year = {2023}, volume = {22}, number = {1}, doi = {10.1109/MPRV.2022.3228660}, pages = {95--102} } -
Scanpath Prediction on Information Visualisations
Yao Wang, Mihai Bâce, Andreas Bulling
IEEE Transactions on Visualization and Computer Graphics (TVCG), 30 (7), pp. 3902–3914, 2023.
We propose Unified Model of Saliency and Scanpaths (UMSS) – a model that learns to predict multi-duration saliency and scanpaths (i.e. sequences of eye fixations) on information visualisations. Although scanpaths provide rich information about the importance of different visualisation elements during the visual exploration process, prior work has been limited to predicting aggregated attention statistics, such as visual saliency. We present in-depth analyses of gaze behaviour for different information visualisation elements (e.g. Title, Label, Data) on the popular MASSVIS dataset. We show that while, overall, gaze patterns are surprisingly consistent across visualisations and viewers, there are also structural differences in gaze dynamics for different elements. Informed by our analyses, UMSS first predicts multi-duration element-level saliency maps, then probabilistically samples scanpaths from them. Extensive experiments on MASSVIS show that our method consistently outperforms state-of-the-art methods with respect to several widely used scanpath and saliency evaluation metrics. Our method achieves a relative improvement in sequence score of 11.5% for scanpath prediction, and a relative improvement in Pearson correlation coefficient of up to 23.6% for saliency prediction. These results are auspicious and point towards richer user models and simulations of visual attention on visualisations without the need for any eye tracking equipment. Paper: wang23_tvcg.pdf Supplementary Material: wang23_tvcg_sup.pdf @article{wang23_tvcg, title = {Scanpath Prediction on Information Visualisations}, author = {Wang, Yao and Bâce, Mihai and Bulling, Andreas}, year = {2023}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, volume = {30}, number = {7}, pages = {3902--3914}, doi = {10.1109/TVCG.2023.3242293} }
Conference Papers
-
Exploring Natural Language Processing Methods for Interactive Behaviour Modelling
Guanhua Zhang, Matteo Bortoletto, Zhiming Hu, Lei Shi, Mihai Bâce, Andreas Bulling
Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT), pp. 1–22, 2023.
Best Student Paper Nomination
Analysing and modelling interactive behaviour is an important topic in human-computer interaction (HCI) and a key requirement for the development of intelligent interactive systems. Interactive behaviour has a sequential (actions happen one after another) and hierarchical (a sequence of actions forms an activity driven by interaction goals) structure, which may be similar to the structure of natural language. Designed based on such a structure, natural language processing (NLP) methods have achieved groundbreaking success in various downstream tasks. However, few works linked interactive behaviour with natural language. In this paper, we explore the similarity between interactive behaviour and natural language by applying an NLP method, byte pair encoding (BPE), to encode mouse and keyboard behaviour. We then analyse the vocabulary, i.e., the set of action sequences, learnt by BPE, as well as use the vocabulary to encode the input behaviour for interactive task recognition. An existing dataset collected in constrained lab settings and our novel out-of-the-lab dataset were used for evaluation. Results show that this natural language-inspired approach not only learns action sequences that reflect specific interaction goals, but also achieves higher F1 scores on task recognition than other methods. Our work reveals the similarity between interactive behaviour and natural language, and presents the potential of applying the new pack of methods that leverage insights from NLP to model interactive behaviour in HCI.Paper: zhang23_interact.pdf@inproceedings{zhang23_interact, title = {Exploring Natural Language Processing Methods for Interactive Behaviour Modelling}, author = {Zhang, Guanhua and Bortoletto, Matteo and Hu, Zhiming and Shi, Lei and B{\^a}ce, Mihai and Bulling, Andreas}, booktitle = {Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT)}, pages = {1--22}, year = {2023}, publisher = {Springer} } -
Improving Neural Saliency Prediction with a Cognitive Model of Human Visual Attention
Ekta Sood, Lei Shi, Matteo Bortoletto, Yao Wang, Philipp Müller, Andreas Bulling
Proc. the 45th Annual Meeting of the Cognitive Science Society (CogSci), pp. 3639–3646, 2023.
We present a novel method for saliency prediction that leverages a cognitive model of visual attention as an inductive bias. This approach is in stark contrast to recent purely data-driven saliency models that achieve performance improvements mainly by increased capacity, resulting in high computational costs and the need for large-scale training datasets. We demonstrate that by using a cognitive model, our method achieves competitive performance to the state of the art across several natural image datasets while only requiring a fraction of the parameters. Furthermore, we set the new state of the art for saliency prediction on information visualizations, demonstrating the effectiveness of our approach for cross-domain generalization. We further provide augmented versions of the full MSCOCO dataset with synthetic gaze data using the cognitive model, which we used to pre-train our method. Our results are highly promising and underline the significant potential of bridging between cognitive and data-driven models, potentially also beyond attention.@inproceedings{sood23_cogsci, author = {Sood, Ekta and Shi, Lei and Bortoletto, Matteo and Wang, Yao and Müller, Philipp and Bulling, Andreas}, title = {Improving Neural Saliency Prediction with a Cognitive Model of Human Visual Attention}, booktitle = {Proc. the 45th Annual Meeting of the Cognitive Science Society (CogSci)}, year = {2023}, pages = {3639--3646} } -
GazeCast: Using Mobile Devices to Allow Gaze-based Interaction on Public Displays
Omar Namnakani, Penpicha Sinrattanavong, Yasmeen Abdrabou, Andreas Bulling, Florian Alt, Mohamed Khamis
Proc. Communication by Gaze Interaction Symposium (COGAIN), pp. 1–8, 2023.
COGAIN Best Paper Award
Gaze is promising for natural and spontaneous interaction with public displays, but current gaze-enabled displays either require movement-hindering stationary eye trackers or cumbersome head-mounted eye trackers. We propose and evaluate GazeCast - a novel system that leverages users’ personal handheld mobile devices to allow gaze-based interaction with surrounding displays. GazeCast improves gaze interaction on such displays by neither setting limitations on where users have to position themselves nor on the number of concurrent users. In a user study (N = 20), we compared GazeCast to using a standard webcam for gaze-based interaction using Pursuits. We find that while selection using GazeCast requires more time and physical demand, participants value GazeCast’s high accuracy and the flexible positioning. We conclude by discussing how mobile computing can facilitate the adoption of gaze interaction with pervasive displays.Paper: namnakani23_cogain.pdf@inproceedings{namnakani23_cogain, title = {GazeCast: Using Mobile Devices to Allow Gaze-based Interaction on Public Displays}, author = {Namnakani, Omar and Sinrattanavong, Penpicha and Abdrabou, Yasmeen and Bulling, Andreas and Alt, Florian and Khamis, Mohamed}, year = {2023}, pages = {1--8}, booktitle = {Proc. Communication by Gaze Interaction Symposium (COGAIN)}, doi = {10.1145/3588015.3589663} } -
Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning
Anna Penzkofer, Simon Schaefer, Florian Strohm, Mihai Bâce, Stefan Leutenegger, Andreas Bulling
Proc. Adaptive and Learning Agents Workshop (ALA), pp. 1–7, 2023.
While deep reinforcement learning (RL) agents outperform humans on an increasing number of tasks, training them requires data equivalent to decades of human gameplay. Recent hierarchical RL methods have increased sample efficiency by incorporating information inherent to the structure of the decision problem but at the cost of having to discover or use human-annotated sub-goals that guide the learning process. We show that intentions of human players, i.e. the precursor of goal-oriented decisions, can be robustly predicted from eye gaze even for the long-horizon sparse rewards task of Montezuma’s Revenge – one of the most challenging RL tasks in the Atari2600 game suite. We propose Int-HRL: Hierarchical RL with intention-based sub-goals that are inferred from human eye gaze. Our novel sub-goal extraction pipeline is fully automatic and replaces the need for manual sub-goal annotation by human experts. Our evaluations show that replacing hand-crafted sub-goals with automatically extracted intentions leads to an HRL agent that is significantly more sample efficient than previous methods. Paper: penzkofer23_ala.pdf @inproceedings{penzkofer23_ala, author = {Penzkofer, Anna and Schaefer, Simon and Strohm, Florian and Bâce, Mihai and Leutenegger, Stefan and Bulling, Andreas}, title = {Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning}, booktitle = {Proc. Adaptive and Learning Agents Workshop (ALA)}, year = {2023}, pages = {1--7} } -
Impact of Privacy Protection Methods of Lifelogs on Remembered Memories
Passant Elagroudy, Mohamed Khamis, Florian Mathis, Diana Irmscher, Ekta Sood, Andreas Bulling, Albrecht Schmidt
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–10, 2023.
Lifelogging is traditionally used for memory augmentation. However, recent research shows that users’ trust in the completeness and accuracy of lifelogs might skew their memories. Privacy-protection alterations such as body blurring and content deletion are commonly applied to photos to circumvent capturing sensitive information. However, their impact on how users remember memories remains unclear. To this end, we conduct a white-hat memory attack and report on an iterative experiment (N=21) to compare the impact of viewing 1) unaltered lifelogs, 2) blurred lifelogs, and 3) a subset of the lifelogs after deleting private ones, on confidently remembering memories. Findings indicate that all the privacy methods impact memories’ quality similarly and that users tend to change their answers in recognition more than recall scenarios. Results also show that users have high confidence in their remembered content across all privacy methods. Our work raises awareness about the mindful design of technological interventions. Paper: elagroudy23_chi.pdf @inproceedings{elagroudy23_chi, author = {Elagroudy, Passant and Khamis, Mohamed and Mathis, Florian and Irmscher, Diana and Sood, Ekta and Bulling, Andreas and Schmidt, Albrecht}, title = {Impact of Privacy Protection Methods of Lifelogs on Remembered Memories}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2023}, doi = {10.1145/3544548.3581565}, pages = {1--10} } -
Federated Learning for Appearance-based Gaze Estimation in the Wild
Mayar Elfares, Zhiming Hu, Pascal Reisert, Andreas Bulling, Ralf Küsters
Proceedings of The 1st Gaze Meets ML workshop, PMLR, pp. 20–36, 2023.
Gaze estimation methods have significantly matured in recent years, but the large number of eye images required to train deep learning models poses significant privacy risks. In addition, the heterogeneous data distribution across different users can significantly hinder the training process. In this work, we propose the first federated learning approach for gaze estimation to preserve the privacy of gaze data. We further employ pseudo-gradient optimisation to adapt our federated learning approach to the divergent model updates to address the heterogeneous nature of in-the-wild gaze data in collaborative setups. We evaluate our approach on a real-world dataset (MPIIGaze) and show that our work enhances the privacy guarantees of conventional appearance-based gaze estimation methods, handles the convergence issues of gaze estimators, and significantly outperforms vanilla federated learning by 15.8% (from a mean error of 10.63 degrees to 8.95 degrees). As such, our work paves the way to develop privacy-aware collaborative learning setups for gaze estimation while maintaining the model’s performance.Paper: elfares23_gmml.pdfPaper Access: https://proceedings.mlr.press/v210/elfares23a.html@inproceedings{elfares23_gmml, title = {Federated Learning for Appearance-based Gaze Estimation in the Wild}, author = {Elfares, Mayar and Hu, Zhiming and Reisert, Pascal and Bulling, Andreas and K{\"u}sters, Ralf}, booktitle = {Proceedings of The 1st Gaze Meets ML workshop, PMLR}, pages = {20--36}, year = {2023}, editor = {Lourentzou, Ismini and Wu, Joy and Kashyap, Satyananda and Karargyris, Alexandros and Celi, Leo Anthony and Kawas, Ban and Talathi, Sachin}, volume = {210}, series = {Proceedings of Machine Learning Research}, publisher = {PMLR}, url = {https://proceedings.mlr.press/v210/elfares23a.html} } -
SUPREYES: SUPer Resolution for EYES Using Implicit Neural Representation Learning
Chuhan Jiao, Zhiming Hu, Mihai Bâce, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 1–13, 2023.
We introduce SUPREYES – a novel self-supervised method to increase the spatio-temporal resolution of gaze data recorded using low(er)-resolution eye trackers. Despite continuing advances in eye tracking technology, the vast majority of current eye trackers – particularly mobile ones and those integrated into mobile devices – suffer from low-resolution gaze data, thus fundamentally limiting their practical usefulness. SUPREYES learns a continuous implicit neural representation from low-resolution gaze data to up-sample the gaze data to arbitrary resolutions. We compare our method with commonly used interpolation methods on arbitrary scale super-resolution and demonstrate that SUPREYES outperforms these baselines by a significant margin. We also test on the sample downstream task of gaze-based user identification and show that our method improves the performance of original low-resolution gaze data and outperforms other baselines. These results are promising as they open up a new direction for increasing eye tracking fidelity as well as enabling new gaze-based applications without the need for new eye tracking equipment.Paper: jiao23_uist.pdf@inproceedings{jiao23_uist, author = {Jiao, Chuhan and Hu, Zhiming and B{\^a}ce, Mihai and Bulling, Andreas}, title = {SUPREYES: SUPer Resolution for EYES Using Implicit Neural Representation Learning}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, year = {2023}, pages = {1--13}, doi = {10.1145/3586183.3606780} } -
MultiMediate ’23: Engagement Estimation and Bodily Behaviour Recognition in Social Interactions
Philipp Müller, Michal Balazia, Tobias Baur, Michael Dietz, Alexander Heimerl, Dominik Schiller, Mohammed Guermal, Dominike Thomas, François Brémond, Jan Alexandersson, Elisabeth André, Andreas Bulling
Proceedings of the 31st ACM International Conference on Multimedia, pp. 9640–9645, 2023.
Automatic analysis of human behaviour is a fundamental prerequisite for the creation of machines that can effectively interact with and support humans in social interactions. In MultiMediate’23, we address two key human social behaviour analysis tasks for the first time in a controlled challenge: engagement estimation and bodily behaviour recognition in social interactions. This paper describes the MultiMediate’23 challenge and presents novel sets of annotations for both tasks. For engagement estimation, we collected novel annotations on the NOvice eXpert Interaction (NOXI) database. For bodily behaviour recognition, we annotated test recordings of the MPIIGroupInteraction corpus with the BBSI annotation scheme. In addition, we present baseline results for both challenge tasks. Paper: mueller23_mm.pdf Paper Access: https://doi.org/10.1145/3581783.3613851 @inproceedings{mueller23_mm, author = {M\"{u}ller, Philipp and Balazia, Michal and Baur, Tobias and Dietz, Michael and Heimerl, Alexander and Schiller, Dominik and Guermal, Mohammed and Thomas, Dominike and Br\'{e}mond, Fran\c{c}ois and Alexandersson, Jan and Andr\'{e}, Elisabeth and Bulling, Andreas}, title = {MultiMediate '23: Engagement Estimation and Bodily Behaviour Recognition in Social Interactions}, year = {2023}, isbn = {9798400701085}, publisher = {Association for Computing Machinery}, address = {New York, NY, USA}, url = {https://doi.org/10.1145/3581783.3613851}, doi = {10.1145/3581783.3613851}, booktitle = {Proceedings of the 31st ACM International Conference on Multimedia}, pages = {9640--9645}, numpages = {6}, keywords = {dataset, engagement, nonverbal behaviour, challenge}, location = {Ottawa ON, Canada}, series = {MM '23} } -
Multimodal Integration of Human-Like Attention in Visual Question Answering
Ekta Sood, Fabian Kögel, Philipp Müller, Dominike Thomas, Mihai Bâce, Andreas Bulling
Proc. Workshop on Gaze Estimation and Prediction in the Wild (GAZE), CVPRW, pp. 2647–2657, 2023.
Tobii Sponsor Award, Oral Presentation
Human-like attention as a supervisory signal to guide neural attention has shown significant promise but is currently limited to uni-modal integration – even for inherently multi-modal tasks such as visual question answering (VQA). We present the Multimodal Human-like Attention Network (MULAN) – the first method for multimodal integration of human-like attention on image and text during training of VQA models. MULAN integrates attention predictions from two state-of-the-art text and image saliency models into neural self-attention layers of a recent transformer-based VQA model. Through evaluations on the challenging VQAv2 dataset, we show that MULAN achieves a new state-of-the-art performance of 73.98% accuracy on test-std and 73.72% on test-dev and, at the same time, has approximately 80% fewer trainable parameters than prior work. Overall, our work underlines the potential of integrating multimodal human-like and neural attention for VQA.@inproceedings{sood23_gaze, author = {Sood, Ekta and Kögel, Fabian and Müller, Philipp and Thomas, Dominike and Bâce, Mihai and Bulling, Andreas}, title = {Multimodal Integration of Human-Like Attention in Visual Question Answering}, booktitle = {Proc. Workshop on Gaze Estimation and Prediction in the Wild (GAZE), CVPRW}, year = {2023}, pages = {2647--2657}, url = {https://openaccess.thecvf.com/content/CVPR2023W/GAZE/papers/Sood_Multimodal_Integration_of_Human-Like_Attention_in_Visual_Question_Answering_CVPRW_2023_paper.pdf} } -
Facial Composite Generation with Iterative Human Feedback
Florian Strohm, Ekta Sood, Dominike Thomas, Mihai Bâce, Andreas Bulling
Proc. The 1st Gaze Meets ML workshop, PMLR, pp. 165–183, 2023.
We propose the first method in which human and AI collaborate to iteratively reconstruct the human’s mental image of another person’s face only from their eye gaze. Current tools for generating digital human faces involve a tedious and time-consuming manual design process. While gaze-based mental image reconstruction represents a promising alternative, previous methods still assumed prior knowledge about the target face, thereby severely limiting their practical usefulness. The key novelty of our method is a collaborative, iterative query engine: Based on the user’s gaze behaviour in each iteration, our method predicts which images to show to the user in the next iteration. Results from two human studies (N=12 and N=22) show that our method can visually reconstruct digital faces that are more similar to the mental image, and is more usable compared to other methods. As such, our findings point at the significant potential of human-AI collaboration for reconstructing mental images, potentially also beyond faces, and of human gaze as a rich source of information and a powerful mediator in said collaboration. Paper: strohm23_gmml.pdf Paper Access: https://proceedings.mlr.press/v210/strohm23a.html @inproceedings{strohm23_gmml, title = {Facial Composite Generation with Iterative Human Feedback}, author = {Strohm, Florian and Sood, Ekta and Thomas, Dominike and B{\^a}ce, Mihai and Bulling, Andreas}, booktitle = {Proc. The 1st Gaze Meets ML workshop, PMLR}, pages = {165--183}, year = {2023}, editor = {Lourentzou, Ismini and Wu, Joy and Kashyap, Satyananda and Karargyris, Alexandros and Celi, Leo Anthony and Kawas, Ban and Talathi, Sachin}, volume = {210}, series = {Proceedings of Machine Learning Research}, publisher = {PMLR}, pdf = {https://proceedings.mlr.press/v210/strohm23a/strohm23a.pdf}, url = {https://proceedings.mlr.press/v210/strohm23a.html} } -
Usable and Fast Interactive Mental Face Reconstruction
Florian Strohm, Mihai Bâce, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 1–15, 2023.
We introduce an end-to-end interactive system for mental face reconstruction – the challenging task of visually reconstructing a face image a person only has in their mind. In contrast to existing methods that suffer from low usability and high mental load, our approach only requires the user to rank images over multiple iterations according to the perceived similarity with their mental image. Based on these rankings, our mental face reconstruction system extracts image features in each iteration, combines them into a joint feature vector, and then uses a generative model to visually reconstruct the mental image. To avoid the need for collecting large amounts of human training data, we further propose a computational user model that can simulate human ranking behaviour using data from an online crowd-sourcing study (N=215). Results from a 12-participant user study show that our method can reconstruct mental images that are visually similar to existing approaches but has significantly higher usability, lower perceived workload, and is 40% faster. In addition, results from a third 22-participant lineup study in which we validated our reconstructions on a face ranking task show an identification rate of 55.3%, which is in line with prior work. These results represent an important step towards new interactive intelligent systems that can robustly and effortlessly reconstruct a user’s mental image.@inproceedings{strohm23_uist, author = {Strohm, Florian and B{\^a}ce, Mihai and Bulling, Andreas}, title = {Usable and Fast Interactive Mental Face Reconstruction}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, year = {2023}, pages = {1--15}, doi = {10.1145/3586183.3606795} } -
Gaze Behaviour in Adolescents with Obsessive-compulsive Disorder During Exposure Within Cognitive-behavioural Therapy
Annika Thierfelder, Björn Severitt, Carolin Sarah Klein, Annika Kristin Alt, Karsten Hollmann, Andreas Bulling, Winfried Ilg
Proc. 17th EAI International Conference on Pervasive Computing Technologies for Healthcare (Pervasive Health), 2023.
Digital health interventions that involve monitoring patient behaviour increasingly benefit from improvements in sensor technology. Eye tracking in particular can provide useful information for psychotherapy but an effective method to extract this information is currently missing. We propose a method to analyse natural gaze behaviour during exposure exercises for obsessive-compulsive disorder (OCD). At the core of our method is a neural network to detect fixations based on gaze patch similarities. Detected fixations are clustered into exposure-relevant, therapist, and other locations and corresponding eye movement metrics are correlated with subjective stress reported during exposure. We evaluate our method on gaze and stress data recorded during video-based psychotherapy of four adolescents with OCD. We found that fixation duration onto exposure-relevant locations consistently increases with the perceived stress level as opposed to fixations onto other locations. Fixation behaviour towards the therapist varied largely between patients. Taken together, our results demonstrate the effectiveness of our method for analysing natural gaze behaviour during exposure sessions. The fixation analysis further shows that patients allocate more attention towards exposure-related objects under higher stress levels, suggesting higher mental load. As such, providing feedback on fixation behaviour holds significant promise to support therapists in monitoring intensity of exposure exercises.@inproceedings{thierfelder23_pervasiveh, title = {Gaze Behaviour in Adolescents with Obsessive-compulsive Disorder During Exposure Within Cognitive-behavioural Therapy}, author = {Thierfelder, Annika and Severitt, Björn and Klein, Carolin Sarah and Alt, Annika Kristin and Hollmann, Karsten and Bulling, Andreas and Ilg, Winfried}, year = {2023}, booktitle = {Proc. 17th EAI International Conference on Pervasive Computing Technologies for Healthcare (Pervasive Health)}, doi = {10.13140/RG.2.2.30047.02721} }
Technical Reports
-
Neural Reasoning About Agents’ Goals, Preferences, and Actions
Matteo Bortoletto, Lei Shi, Andreas Bulling
arXiv:2312.07122, pp. 1–13, 2023.
We propose the Intuitive Reasoning Network (IRENE) – a novel neural model for intuitive psychological reasoning about agents’ goals, preferences, and actions that can generalise previous experiences to new situations. IRENE combines a graph neural network for learning agent and world state representations with a transformer to encode the task context. When evaluated on the challenging Baby Intuitions Benchmark, IRENE achieves new state-of-the-art performance on three out of its five tasks – with up to 48.9 % improvement. In contrast to existing methods, IRENE is able to bind preferences to specific agents, to better distinguish between rational and irrational agents, and to better understand the role of blocking obstacles. We also investigate, for the first time, the influence of the training tasks on test performance. Our analyses demonstrate the effectiveness of IRENE in combining prior knowledge gained during training for unseen evaluation tasks.@techreport{bortoletto23_arxiv, author = {Bortoletto, Matteo and Shi, Lei and Bulling, Andreas}, title = {Neural Reasoning About Agents’ Goals, Preferences, and Actions}, year = {2023}, pages = {1--13}, doi = {10.48550/arXiv.2312.07122}, url = {https://arxiv.org/abs/2312.07122} } -
Pose2Gaze: Generating Realistic Human Gaze Behaviour from Full-body Poses using an Eye-body Coordination Model
Zhiming Hu, Jiahui Xu, Syn Schmitt, Andreas Bulling
arXiv:2312.12042, pp. 1–10, 2023.
While generating realistic body movements, e.g., for avatars in virtual reality, is widely studied in computer vision and graphics, the generation of eye movements that exhibit realistic coordination with the body remains under-explored. We first report a comprehensive analysis of the coordination of human eye and full-body movements during everyday activities based on data from the MoGaze and GIMO datasets. We show that eye gaze has strong correlations with head directions and also full-body motions and there exists a noticeable time delay between body and eye movements. Inspired by the analyses, we then present Pose2Gaze – a novel eye-body coordination model that first uses a convolutional neural network and a spatio-temporal graph convolutional neural network to extract features from head directions and full-body poses respectively and then applies a convolutional neural network to generate realistic eye movements. We compare our method with state-of-the-art methods that predict eye gaze only from head movements for three different generation tasks and demonstrate that Pose2Gaze significantly outperforms these baselines on both datasets with an average improvement of 26.4% and 21.6% in mean angular error, respectively. Our findings underline the significant potential of cross-modal human gaze behaviour analysis and modelling.Paper: hu23_arxiv.pdfPaper Access: https://arxiv.org/abs/2312.12042@techreport{hu23_arxiv, author = {Hu, Zhiming and Xu, Jiahui and Schmitt, Syn and Bulling, Andreas}, title = {Pose2Gaze: Generating Realistic Human Gaze Behaviour from Full-body Poses using an Eye-body Coordination Model}, year = {2023}, pages = {1--10}, url = {https://arxiv.org/abs/2312.12042} } -
Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning
Anna Penzkofer, Simon Schaefer, Florian Strohm, Mihai Bâce, Stefan Leutenegger, Andreas Bulling
arXiv:2306.11483, pp. 1–7, 2023.
While deep reinforcement learning (RL) agents outperform humans on an increasing number of tasks, training them requires data equivalent to decades of human gameplay. Recent hierarchical RL methods have increased sample efficiency by incorporating information inherent to the structure of the decision problem but at the cost of having to discover or use human-annotated sub-goals that guide the learning process. We show that intentions of human players, i.e. the precursor of goal-oriented decisions, can be robustly predicted from eye gaze even for the long-horizon sparse rewards task of Montezuma’s Revenge – one of the most challenging RL tasks in the Atari2600 game suite. We propose Int-HRL: Hierarchical RL with intention-based sub-goals that are inferred from human eye gaze. Our novel sub-goal extraction pipeline is fully automatic and replaces the need for manual sub-goal annotation by human experts. Our evaluations show that replacing hand-crafted sub-goals with automatically extracted intentions leads to an HRL agent that is significantly more sample efficient than previous methods.Paper Access: https://arxiv.org/abs/2306.11483@techreport{penzkofer23_arxiv, author = {Penzkofer, Anna and Schaefer, Simon and Strohm, Florian and Bâce, Mihai and Leutenegger, Stefan and Bulling, Andreas}, title = {Int-HRL: Towards Intention-based Hierarchical Reinforcement Learning}, year = {2023}, pages = {1--7}, url = {https://arxiv.org/abs/2306.11483} } -
Inferring Human Intentions from Predicted Action Probabilities
Lei Shi, Paul-Christian Bürkner, Andreas Bulling
arXiv:2308.12194, pp. 1–7, 2023.
Predicting the next action that a human is most likely to perform is key to human-AI collaboration and has consequently attracted increasing research interest in recent years. Human intentions are an important factor for next action prediction: If the AI agent knows the intention it can predict future actions and plan collaboration more effectively. Existing Bayesian methods for this task struggle with complex visual input while deep neural network (DNN) based methods do not provide uncertainty quantifications. In this work we combine both approaches for the first time and show that the predicted next action probabilities contain information that can be used to infer the underlying intention. We propose a two-step approach to human intention prediction: While a DNN predicts the probabilities of the next action, MCMC-based Bayesian inference is used to infer the underlying intention from these predictions. This approach allows not only for independent design of the DNN architecture but also for subsequent fast, design-independent inference of human intentions. We evaluate our method using a series of experiments on the Watch-And-Help (WAH) and a keyboard and mouse interaction dataset. Our results show that our approach can accurately predict human intentions from observed actions and the implicit information contained in next action probabilities. Furthermore, we show that our approach can predict the correct intention even if only few actions have been observed.Paper Access: https://arxiv.org/abs/2308.12194@techreport{shi23_arxiv, title = {Inferring Human Intentions from Predicted Action Probabilities}, author = {Shi, Lei and Bürkner, Paul-Christian and Bulling, Andreas}, year = {2023}, pages = {1--7}, url = {https://arxiv.org/abs/2308.12194} } -
GazeMoDiff: Gaze-guided Diffusion Model for Stochastic Human Motion Prediction
Haodong Yan, Zhiming Hu, Syn Schmitt, Andreas Bulling
arXiv:2312.12090, pp. 1–10, 2023.
Human motion prediction is important for virtual reality (VR) applications, e.g., for realistic avatar animation. Existing methods have synthesised body motion only from observed past motion, despite the fact that human gaze is known to correlate strongly with body movements and is readily available in recent VR headsets. We present GazeMoDiff – a novel gaze-guided denoising diffusion model to generate stochastic human motions. Our method first uses a graph attention network to learn the spatio-temporal correlations between eye gaze and human movements and to fuse them into cross-modal gaze-motion features. These cross-modal features are injected into a noise prediction network via a cross-attention mechanism and progressively denoised to generate realistic human full-body motions. Experimental results on the MoGaze and GIMO datasets demonstrate that our method outperforms the state-of-the-art methods by a large margin in terms of average displacement error (15.03% on MoGaze and 9.20% on GIMO). We further conducted an online user study to compare our method with state-of-the-art methods and the responses from 23 participants validate that the motions generated by our method are more realistic than those from other methods. Taken together, our work makes a first important step towards gaze-guided stochastic human motion prediction and guides future work on this important topic in VR research.Paper: yan23_arxiv.pdfPaper Access: https://arxiv.org/abs/2312.12090@techreport{yan23_arxiv, author = {Yan, Haodong and Hu, Zhiming and Schmitt, Syn and Bulling, Andreas}, title = {GazeMoDiff: Gaze-guided Diffusion Model for Stochastic Human Motion Prediction}, year = {2023}, pages = {1--10}, url = {https://arxiv.org/abs/2312.12090} }
2022
Journal Articles
-
Understanding, Addressing, and Analysing Digital Eye Strain in Virtual Reality Head-Mounted Displays
Teresa Hirzle, Fabian Fischbach, Julian Karlbauer, Pascal Jansen, Jan Gugenheimer, Enrico Rukzio, Andreas Bulling
ACM Transactions on Computer-Human Interaction (TOCHI), 29 (4), pp. 1–80, 2022.
Digital eye strain (DES), caused by prolonged exposure to digital screens, stresses the visual system and negatively affects users’ well-being and productivity. While DES is well-studied in computer displays, its impact on users of virtual reality (VR) head-mounted displays (HMDs) is largely unexplored—despite that some of their key properties (e.g., the vergence-accommodation conflict) make VR-HMDs particularly prone. This work provides the first comprehensive investigation into DES in VR HMDs. We present results from a survey with 68 experienced users to understand DES symptoms in VR-HMDs. To help address DES, we investigate eye exercises resulting from survey answers and blue light filtering in three user studies (N = 71). Results demonstrate that eye exercises, but not blue light filtering, can effectively reduce DES. We conclude with an extensive analysis of the user studies and condense our findings in 10 key challenges that guide future work in this emerging research area.doi: 10.1145/3492802Paper: hirzle22_tochi.pdf@article{hirzle22_tochi, title = {Understanding, Addressing, and Analysing Digital Eye Strain in Virtual Reality Head-Mounted Displays}, author = {Hirzle, Teresa and Fischbach, Fabian and Karlbauer, Julian and Jansen, Pascal and Gugenheimer, Jan and Rukzio, Enrico and Bulling, Andreas}, year = {2022}, pages = {1-80}, doi = {10.1145/3492802}, journal = {ACM Transactions on Computer-Human Interaction (TOCHI)}, volume = {29}, number = {4} } -
Anticipatory Human-Machine Interaction (Dagstuhl Seminar 22202)
Jelmer Borst, Andreas Bulling, Cleotilde Gonzalez, Nele Russwinkel
Dagstuhl Reports, 12 (5), pp. 131–169, 2022.
Paper: borst22_dagstuhl.pdfPaper Access: https://drops.dagstuhl.de/opus/volltexte/2022/17446@article{borst22_dagstuhl, author = {Borst, Jelmer and Bulling, Andreas and Gonzalez, Cleotilde and Russwinkel, Nele}, title = {{Anticipatory Human-Machine Interaction (Dagstuhl Seminar 22202)}}, pages = {131--169}, journal = {Dagstuhl Reports}, year = {2022}, volume = {12}, number = {5}, editor = {Borst, Jelmer and Bulling, Andreas and Gonzalez, Cleotilde and Russwinkel, Nele}, publisher = {Schloss Dagstuhl -- Leibniz-Zentrum f{\"u}r Informatik}, address = {Dagstuhl, Germany}, url = {https://drops.dagstuhl.de/opus/volltexte/2022/17446}, doi = {10.4230/DagRep.12.5.131} } -
Adapting visualizations and interfaces to the user
Francesco Chiossi, Johannes Zagermann, Jakob Karolus, Nils Rodrigues, Priscilla Balestrucci, Daniel Weiskopf, Benedikt Ehinger, Tiare Feuchtner, Harald Reiterer, Lewis L. Chuang, Marc Ernst, Andreas Bulling, Sven Mayer, Albrecht Schmidt
it - Information Technology, 64 (4-5), pp. 133–143, 2022.
Adaptive visualization and interfaces pervade our everyday tasks to improve interaction from the point of view of user performance and experience. This approach allows using several user inputs, whether physiological, behavioral, qualitative, or multimodal combinations, to enhance the interaction. Due to the multitude of approaches, we outline the current research trends of inputs used to adapt visualizations and user interfaces. Moreover, we discuss methodological approaches used in mixed reality, physiological computing, visual analytics, and proficiency-aware systems. With this work, we provide an overview of the current research in adaptive systems.Paper: chiossi22_it.pdf@article{chiossi22_it, title = {Adapting visualizations and interfaces to the user}, author = {Chiossi, Francesco and Zagermann, Johannes and Karolus, Jakob and Rodrigues, Nils and Balestrucci, Priscilla and Weiskopf, Daniel and Ehinger, Benedikt and Feuchtner, Tiare and Reiterer, Harald and Chuang, Lewis L. and Ernst, Marc and Bulling, Andreas and Mayer, Sven and Schmidt, Albrecht}, pages = {133--143}, volume = {64}, number = {4-5}, journal = {it - Information Technology}, doi = {10.1515/itit-2022-0035}, year = {2022} } -
User-centred multimodal authentication: securing handheld mobile devices using gaze and touch input
Mohamed Khamis, Karola Marky, Andreas Bulling, Florian Alt
Behaviour & Information Technology, 41 (10), pp. 2061–2083, 2022.
Handheld mobile devices store a plethora of sensitive data, such as private emails, personal messages, photos, and location data. Authentication is essential to protect access to sensitive data. However, the majority of mobile devices are currently secured by single-modal authentication schemes which are vulnerable to shoulder surfing, smudge attacks, and thermal attacks. While some authentication schemes protect against one of these attacks, only few schemes address all three of them. We propose multimodal authentication where touch and gaze input are combined to resist shoulder surfing, as well as smudge and thermal attacks. Based on a series of previously published works where we studied the usability of several user-centred multimodal authentication designs and their security against multiple threat models, we provide a comprehensive overview of multimodal authentication on handheld mobile devices. We further present guidelines on how to leverage multiple input modalities for enhancing the usability and security of user authentication on mobile devices.Paper: khamis22_bit.pdf@article{khamis22_bit, author = {Khamis, Mohamed and Marky, Karola and Bulling, Andreas and Alt, Florian}, title = {User-centred multimodal authentication: securing handheld mobile devices using gaze and touch input}, journal = {Behaviour \& Information Technology}, volume = {41}, number = {10}, pages = {2061--2083}, year = {2022}, publisher = {Taylor \& Francis}, doi = {10.1080/0144929X.2022.2069597} } -
VisRecall: Quantifying Information Visualisation Recallability via Question Answering
Yao Wang, Chuhan Jiao, Mihai Bâce, Andreas Bulling
IEEE Transactions on Visualization and Computer Graphics (TVCG), 28 (12), pp. 4995–5005, 2022.
Despite its importance for assessing the effectiveness of communicating information visually, fine-grained recallability of information visualisations has not been studied quantitatively so far. In this work, we propose a question-answering paradigm to study visualisation recallability and present VisRecall - a novel dataset consisting of 200 visualisations that are annotated with crowd-sourced human (N = 305) recallability scores obtained from 1,000 questions of five question types. Furthermore, we present the first computational method to predict recallability of different visualisation elements, such as the title or specific data values. We report detailed analyses of our method on VisRecall and demonstrate that it outperforms several baselines in overall recallability and FE-, F-, RV-, and U-question recallability. Our work makes fundamental contributions towards a new generation of methods to assist designers in optimising visualisations.Paper: wang22_tvcg.pdfSupplementary Material: wang22_tvcg_sup.pdf@article{wang22_tvcg, title = {VisRecall: Quantifying Information Visualisation Recallability via Question Answering}, author = {Wang, Yao and Jiao, Chuhan and Bâce, Mihai and Bulling, Andreas}, year = {2022}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, volume = {28}, number = {12}, pages = {4995-5005}, doi = {10.1109/TVCG.2022.3198163} }
Conference Papers
-
ThumbPitch: Enriching Thumb Interaction on Mobile Touchscreens using Deep Learning
Jamie Ullerich, Maximiliane Windl, Andreas Bulling, Sven Mayer
ACM Proceedings of the 34th Australian Conference on Human-Computer-Interaction (OzCHI), pp. 1–9, 2022.
Today touchscreens are one of the most common input devices for everyday ubiquitous interaction. Yet, capacitive touchscreens are limited in expressiveness; thus, a large body of work has focused on extending the input capabilities of touchscreens. One promising approach is to use index finger orientation; however, this requires a two-handed interaction and poses ergonomic constraints. We propose using the thumb’s pitch as an additional input dimension to counteract these limitations, enabling one-handed interaction scenarios. Our deep convolutional neural network detecting the thumb’s pitch is trained on more than 230,000 ground truth images recorded using a motion tracking system. We highlight the potential of ThumbPitch by proposing several use cases that exploit the higher expressiveness, especially for one-handed scenarios. We tested three use cases in a validation study and validated our model. Our model achieved a mean error of only 11.9°.Paper: ullerich22_ozchi.pdf@inproceedings{ullerich22_ozchi, author = {Ullerich, Jamie and Windl, Maximiliane and Bulling, Andreas and Mayer, Sven}, title = {ThumbPitch: Enriching Thumb Interaction on Mobile Touchscreens using Deep Learning}, booktitle = {ACM Proceedings of the 34th Australian Conference on Human-Computer-Interaction (OzCHI)}, year = {2022}, pages = {1--9}, doi = {10.1145/3572921.3572925} } -
Impact of Gaze Uncertainty on AOIs in Information Visualisations
Yao Wang, Maurice Koch, Mihai Bâce, Daniel Weiskopf, Andreas Bulling
ETRA Workshop on Eye Tracking and Visualization (ETVIS), pp. 1–6, 2022.
Gaze-based analysis of areas of interest (AOIs) is widely used in information visualisation research to understand how people explore visualisations or assess the quality of visualisations concerning key characteristics such as memorability. However, nearby AOIs in visualisations amplify the uncertainty caused by the gaze estimation error, which strongly influences the mapping between gaze samples or fixations and different AOIs. We contribute a novel investigation into gaze uncertainty and quantify its impact on AOI-based analysis on visualisations using two novel metrics: the Flipping Candidate Rate (FCR) and Hit Any AOI Rate (HAAR). Our analysis of 40 real-world visualisations, including human gaze and AOI annotations, shows that gaze uncertainty frequently and significantly impacts the analysis conducted in AOI-based studies. Moreover, we analysed four visualisation types and found that bar and scatter plots are usually designed in a way that causes more uncertainty than line and pie plots in gaze-based analysis.@inproceedings{wang22_etvis, title = {Impact of Gaze Uncertainty on AOIs in Information Visualisations}, author = {Wang, Yao and Koch, Maurice and B{\^a}ce, Mihai and Weiskopf, Daniel and Bulling, Andreas}, year = {2022}, pages = {1--6}, booktitle = {ETRA Workshop on Eye Tracking and Visualization (ETVIS)}, doi = {10.1145/3517031.3531166} } -
Mind Wandering Trait-level Tendencies During Lecture Viewing: A Pilot Study
Francesca Zermiani, Andreas Bulling, Maria Wirzberger
Proc. EduEye Workshop on Eye Tracking in Learning and Education (EduEye), pp. 1–7, 2022.
Mind wandering (MW) is defined as a shift of attention to task-unrelated internal thoughts that is pervasive and disruptive for learning performance. Current state-of-the-art gaze-based attention-aware intelligent systems are capable of detecting MW from eye movements and delivering interventions to mitigate its negative effects. However, the beneficial functions of MW and its trait-level tendency, defined as the content of MW experience, are still largely neglected by these systems. In this pilot study, we address the questions of whether different MW trait-level tendencies can be detected through off-screen fixations’ frequency and duration and blink rate during a lecture viewing task. We focus on prospective planning and creative problem-solving as two of the main MW trait-level tendencies. Despite the non-significance, the descriptive values show a higher frequency and duration of off-screen fixations, but lower blink rate, in the creative problem-solving MW condition. Interestingly, we do find a highly significant correlation between MW level and engagement scores in the prospective planning MW group. Potential explanations for the observed results are discussed. Overall, these findings represent a preliminary step towards the development of more accurate and adaptive learning technologies, and call for further studies on MW trait-level tendency detection.Paper: zermiani22_edueye.pdf@inproceedings{zermiani22_edueye, title = {Mind Wandering Trait-level Tendencies During Lecture Viewing: A Pilot Study}, author = {Zermiani, Francesca and Bulling, Andreas and Wirzberger, Maria}, year = {2022}, booktitle = {Proc. EduEye Workshop on Eye Tracking in Learning and Education (EduEye)}, doi = {10.1145/3517031.3529241}, pages = {1--7} } -
PrivacyScout: Assessing Vulnerability to Shoulder Surfing on Mobile Devices
Mihai Bâce, Alia Saad, Mohamed Khamis, Stefan Schneegass, Andreas Bulling
Proc. on Privacy Enhancing Technologies (PETs), pp. 650–669, 2022.
One approach to mitigate shoulder surfing attacks on mobile devices is to detect the presence of a bystander using the phone’s front-facing camera. However, a person’s face in the camera’s field of view does not always indicate an attack. To overcome this limitation, in a novel data collection study (N=16), we analysed the influence of three viewing angles and four distances on the success of shoulder surfing attacks. In contrast to prior works that mainly focused on user authentication, we investigated three common types of content susceptible to shoulder surfing: text, photos, and PIN authentications. We show that the vulnerability of text and photos depends on the observer’s location relative to the device, while PIN authentications are vulnerable independent of the observation location. We then present PrivacyScout - a novel method that predicts the shoulder-surfing risk based on visual features extracted from the observer’s face as captured by the front-facing camera. Finally, evaluations from our data collection study demonstrate our method’s feasibility to assess the risk of a shoulder surfing attack more accurately.Paper: bace22_pets.pdf@inproceedings{bace22_pets, title = {PrivacyScout: Assessing Vulnerability to Shoulder Surfing on Mobile Devices}, author = {B{\^a}ce, Mihai and Saad, Alia and Khamis, Mohamed and Schneegass, Stefan and Bulling, Andreas}, year = {2022}, booktitle = {Proc. on Privacy Enhancing Technologies (PETs)}, doi = {10.56553/popets-2022-0090}, pages = {650--669}, issue = {3} } -
Designing for Noticeability: The Impact of Visual Importance on Desktop Notifications
Philipp Müller, Sander Staal, Mihai Bâce, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–13, 2022.
Desktop notifications should be noticeable but are also subject to a number of design choices, e.g. concerning their size, placement, or opacity. It is currently unknown, however, how these choices interact with the desktop background and how they influence noticeability. To address this limitation, we introduce a software tool to automatically synthesize realistic-looking desktop images for major operating systems and applications. Using these images, we present a user study (N=34) to investigate the noticeability of notifications during a primary task. We are the first to show that visual importance of the background at the notification location significantly impacts whether users detect notifications. We analyse the utility of visual importance to compensate for suboptimal design choices with respect to noticeability, e.g. small notification size. Finally, we introduce noticeability maps - 2D maps encoding the predicted noticeability across the desktop and inform designers how to trade off notification design and noticeability.Paper: mueller22_chi.pdf@inproceedings{mueller22_chi, title = {Designing for Noticeability: The Impact of Visual Importance on Desktop Notifications}, author = {Müller, Philipp and Staal, Sander and B{\^a}ce, Mihai and Bulling, Andreas}, year = {2022}, pages = {1--13}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3491102.3501954} } -
Video Language Co-Attention with Multimodal Fast-Learning Feature Fusion for VideoQA
Adnen Abdessaied*, Ekta Sood*, Andreas Bulling
Proc. of the 7th Workshop on Representation Learning for NLP (Repl4NLP), pp. 1–12, 2022.
[Equal contribution by the first two authors.] We propose the Video Language Co-Attention Network (VLCN) – a novel memory-enhanced model for Video Question Answering (VideoQA). Our model combines two original contributions: A multimodal fast-learning feature fusion (FLF) block and a mechanism that uses self-attended language features to separately guide neural attention on both static and dynamic visual features extracted from individual video frames and short video clips. When trained from scratch, VLCN achieves competitive results with the state of the art on both MSVD-QA and MSRVTT-QA with 38.06% and 36.01% test accuracies, respectively. Through an ablation study, we further show that FLF improves generalization across different VideoQA datasets and performance for question types that are notoriously challenging in current datasets, such as long questions that require deeper reasoning as well as questions with rare answers.@inproceedings{abdessaied22_repl4NLP, author = {Abdessaied*, Adnen and Sood*, Ekta and Bulling, Andreas}, title = {Video Language Co-Attention with Multimodal Fast-Learning Feature Fusion for VideoQA}, booktitle = {Proc. of the 7th Workshop on Representation Learning for NLP (Repl4NLP)}, year = {2022}, pages = {1--12} } -
Neuro-Symbolic Visual Dialog
Adnen Abdessaied, Mihai Bâce, Andreas Bulling
Proc. 29th International Conference on Computational Linguistics (COLING), pp. 1–11, 2022.
We propose Neuro-Symbolic Visual Dialog (NSVD) —the first method to combine deep learning and symbolic program execution for multi-round visually-grounded reasoning. NSVD significantly outperforms existing purely-connectionist methods on two key challenges inherent to visual dialog: long-distance co-reference resolution as well as vanishing question-answering performance. We demonstrate the latter by proposing a more realistic and stricter evaluation scheme in which we use predicted answers for the full dialog history when calculating accuracy. We describe two variants of our model and show that using this new scheme, our best model achieves an accuracy of 99.72% on CLEVR-Dialog —a relative improvement of more than 10% over the state of the art —while only requiring a fraction of training data. Moreover, we demonstrate that our neuro-symbolic models have a higher mean first failure round, are more robust against incomplete dialog histories, and generalise better not only to dialogs that are up to three times longer than those seen during training but also to unseen question types and scenes.@inproceedings{abdessaied22_coling, author = {Abdessaied, Adnen and Bâce, Mihai and Bulling, Andreas}, title = {Neuro-Symbolic Visual Dialog}, booktitle = {Proc. 29th International Conference on Computational Linguistics (COLING)}, year = {2022}, pages = {1--11} } -
Gaze-enhanced Crossmodal Embeddings for Emotion Recognition
Ahmed Abdou, Ekta Sood, Philipp Müller, Andreas Bulling
Proc. International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–18, 2022.
Emotional expressions are inherently multimodal – integrating facial behavior, speech, and gaze – but their automatic recognition is often limited to a single modality, e.g. speech during a phone call. While previous work proposed crossmodal emotion embeddings to improve monomodal recognition performance, despite its importance, a representation of gaze was not included. We propose a new approach to emotion recognition that incorporates an explicit representation of gaze in a crossmodal emotion embedding framework. We show that our method outperforms the previous state of the art for both audio-only and video-only emotion classification on the popular One-Minute Gradual Emotion Recognition dataset. Furthermore, we report extensive ablation experiments and provide insights into the performance of different state-of-the-art gaze representations and integration strategies. Our results not only underline the importance of gaze for emotion recognition but also demonstrate a practical and highly effective approach to leveraging gaze information for this task.@inproceedings{abdou22_etra, title = {Gaze-enhanced Crossmodal Embeddings for Emotion Recognition}, author = {Abdou, Ahmed and Sood, Ekta and Müller, Philipp and Bulling, Andreas}, year = {2022}, booktitle = {Proc. International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3530879}, volume = {6}, pages = {1--18} } -
MultiMediate’22: Backchannel Detection and Agreement Estimation in Group Interactions
Philipp Müller, Dominik Schiller, Dominike Thomas, Michael Dietz, Hali Lindsay, Patrick Gebhard, Elisabeth André, Andreas Bulling
Proc. ACM Multimedia (MM), pp. 7109–7114, 2022.
Backchannels, i.e. short interjections of the listener, serve important meta-conversational purposes like signifying attention or indicating agreement. Despite their key role, automatic analysis of backchannels in group interactions has been largely neglected so far. The MultiMediate challenge addresses, for the first time, the tasks of backchannel detection and agreement estimation from backchannels in group conversations. This paper describes the MultiMediate challenge and presents a novel set of annotations consisting of 7234 backchannel instances for the MPIIGroupInteraction dataset. Each backchannel was additionally annotated with the extent to which it expresses agreement towards the current speaker. In addition to an analysis of the collected annotations, we present baseline results for both challenge tasks.Paper: mueller22_mm.pdf@inproceedings{mueller22_mm, title = {MultiMediate'22: Backchannel Detection and Agreement Estimation in Group Interactions}, author = {M{\"{u}}ller, Philipp and Schiller, Dominik and Thomas, Dominike and Dietz, Michael and Lindsay, Hali and Gebhard, Patrick and André, Elisabeth and Bulling, Andreas}, year = {2022}, pages = {7109--7114}, doi = {10.1145/3503161.3551589}, booktitle = {Proc. ACM Multimedia (MM)} } -
Multimodal Sensor-based Identification of Stress and Compulsive Actions in Children with Obsessive-compulsive Disorder for Telemedical Treatment
Annika Thierfelder, Jonas Primbs, Björn Severitt, Carolin Sarah Hohnecker, Jan Kühnhausen, Annika Kristin Alt, Anja Pascher, Ursula Wörz, Helene Passon, Jens Seemann, Christian Ernst, Heinrich Lautenbacher, Martin Holderried, Enkelejda Kasneci, Martin Giese, Andreas Bulling, Michael Menth, Gottfried Maria Barth, Winfried Ilg, Karsten Hollmann, Tobias Johann Renner
Proc. the 44th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 1–7, 2022.
In modern psychotherapy, digital health technology offers advanced and personalized therapy options, increasing availability as well as ecological validity. These aspects have proven to be highly relevant for children and adolescents with obsessive-compulsive disorder (OCD). Exposure and Response Prevention therapy, which is the state-of-the-art treatment for OCD, builds on the reconstruction of everyday life exposure to anxious situations. However, while compulsive behavior predominantly occurs in home environments, exposure situations during therapy are limited to clinical settings. Telemedical treatment allows a shift from this limited exposure reconstruction to exposure situations in real life. In the SSTeP KiZ study (smart sensor technology in telepsychotherapy for children and adolescents with OCD), we combine video therapy with wearable sensors delivering physiological and behavioral measures to objectively determine the stress level of patients. The setup makes it possible to gain information from exposure to stress in a realistic environment both during and outside of therapy sessions. In a first pilot study, we explored the sensitivity of individual sensor modalities to different levels of stress and anxiety. For this, we captured the obsessive-compulsive behavior of five adolescents with an ECG chest belt, inertial sensors capturing hand movements, and an eye tracker. Despite their prototypical nature, our results deliver strong evidence that the examined sensor modalities yield biomarkers allowing for personalized detection and quantification of stress and anxiety. 
This opens up future possibilities to evaluate the severity of individual compulsive behavior based on multivariate state classification in real-life situations.Paper: thierfelder22_embc.pdf@inproceedings{thierfelder22_embc, title = {Multimodal Sensor-based Identification of Stress and Compulsive Actions in Children with Obsessive-compulsive Disorder for Telemedical Treatment}, author = {Thierfelder, Annika and Primbs, Jonas and Severitt, Björn and Hohnecker, Carolin Sarah and Kühnhausen, Jan and Alt, Annika Kristin and Pascher, Anja and Wörz, Ursula and Passon, Helene and Seemann, Jens and Ernst, Christian and Lautenbacher, Heinrich and Holderried, Martin and Kasneci, Enkelejda and Giese, Martin and Bulling, Andreas and Menth, Michael and Barth, Gottfried Maria and Ilg, Winfried and Hollmann, Karsten and Renner, Tobias Johann}, year = {2022}, pages = {1--7}, booktitle = {Proc. the 44th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC)}, doi = {10.1109/EMBC48229.2022.9871899} } -
Predicting Next Actions and Latent Intents during Text Formatting
Guanhua Zhang, Susanne Hindennach, Jan Leusmann, Felix Bühler, Benedict Steuerlein, Sven Mayer, Mihai Bâce, Andreas Bulling
Proc. the CHI Workshop Computational Approaches for Understanding, Generating, and Adapting User Interfaces, pp. 1–6, 2022.
In this work we investigate the challenging task of predicting user intents from mouse and keyboard input as well as gaze behaviour. In contrast to prior work we study intent prediction at two different resolutions on the behavioural timeline: predicting future input actions as well as latent intents to achieve a high-level interaction goal. Results from a user study (N=15) on a sample text formatting task show that the sequence of prior actions is more informative for intent prediction than gaze. Only using the action sequence, we can predict the next action and the high-level intent with an accuracy of 66% and 96%, respectively. In contrast, accuracy when using features extracted from gaze behaviour was significantly lower, at 41% and 46%. This finding is important for the development of future anticipatory user interfaces that aim to proactively adapt to user intents and interaction goals.Paper: zhang22_caugaui.pdf@inproceedings{zhang22_caugaui, author = {Zhang, Guanhua and Hindennach, Susanne and Leusmann, Jan and Bühler, Felix and Steuerlein, Benedict and Mayer, Sven and Bâce, Mihai and Bulling, Andreas}, title = {Predicting Next Actions and Latent Intents during Text Formatting}, booktitle = {Proc. the CHI Workshop Computational Approaches for Understanding, Generating, and Adapting User Interfaces}, year = {2022}, pages = {1--6} }
Technical Reports
-
Federated Learning for Appearance-based Gaze Estimation in the Wild
Mayar Elfares, Zhiming Hu, Pascal Reisert, Andreas Bulling, Ralf Küsters
arXiv:2211.07330, pp. 1–17, 2022.
Gaze estimation methods have significantly matured in recent years but the large number of eye images required to train deep learning models poses significant privacy risks. In addition, the heterogeneous data distribution across different users can significantly hinder the training process. In this work, we propose the first federated learning approach for gaze estimation to preserve the privacy of gaze data. We further employ pseudo-gradients optimisation to adapt our federated learning approach to the divergent model updates to address the heterogeneous nature of in-the-wild gaze data in collaborative setups. We evaluate our approach on a real-world dataset (MPIIGaze dataset) and show that our work enhances the privacy guarantees of conventional appearance-based gaze estimation methods, handles the convergence issues of gaze estimators, and significantly outperforms vanilla federated learning by 15.8% (from a mean error of 10.63 degrees to 8.95 degrees). As such, our work paves the way to develop privacy-aware collaborative learning setups for gaze estimation while maintaining the model’s performance.Paper: elfares22_arxiv.pdf@techreport{elfares22_arxiv, title = {Federated Learning for Appearance-based Gaze Estimation in the Wild}, author = {Elfares, Mayar and Hu, Zhiming and Reisert, Pascal and Bulling, Andreas and Küsters, Ralf}, year = {2022}, doi = {10.48550/arXiv.2211.07330}, pages = {1--17} } -
MultiMediate’22: Backchannel Detection and Agreement Estimation in Group Interactions
Philipp Müller, Dominik Schiller, Dominike Thomas, Michael Dietz, Hali Lindsay, Patrick Gebhard, Elisabeth André, Andreas Bulling
arXiv:2209.09578, pp. 1–6, 2022.
Backchannels, i.e. short interjections of the listener, serve important meta-conversational purposes like signifying attention or indicating agreement. Despite their key role, automatic analysis of backchannels in group interactions has been largely neglected so far. The MultiMediate challenge addresses, for the first time, the tasks of backchannel detection and agreement estimation from backchannels in group conversations. This paper describes the MultiMediate challenge and presents a novel set of annotations consisting of 7234 backchannel instances for the MPIIGroupInteraction dataset. Each backchannel was additionally annotated with the extent to which it expresses agreement towards the current speaker. In addition to an analysis of the collected annotations, we present baseline results for both challenge tasks.Paper: mueller22_arxiv.pdfPaper Access: http://arxiv.org/abs/2209.09578@techreport{mueller22_arxiv, title = {MultiMediate'22: Backchannel Detection and Agreement Estimation in Group Interactions}, author = {M{\"{u}}ller, Philipp and Schiller, Dominik and Thomas, Dominike and Dietz, Michael and Lindsay, Hali and Gebhard, Patrick and André, Elisabeth and Bulling, Andreas}, year = {2022}, pages = {1--6}, doi = {10.48550/arXiv.2209.09578}, url = {http://arxiv.org/abs/2209.09578} }
2021
Journal Articles
-
EHTask: Recognizing User Tasks from Eye and Head Movements in Immersive Virtual Reality
Zhiming Hu, Andreas Bulling, Sheng Li, Guoping Wang
IEEE Transactions on Visualization and Computer Graphics (TVCG), 29 (4), pp. 1992–2004, 2021.
Understanding human visual attention in immersive virtual reality (VR) is crucial for many important applications, including gaze prediction, gaze guidance, and gaze-contingent rendering. However, previous works on visual attention analysis typically only explored one specific VR task and paid less attention to the differences between different tasks. Moreover, existing task recognition methods typically focused on 2D viewing conditions and only explored the effectiveness of human eye movements. We first collect eye and head movements of 30 participants performing four tasks, i.e. Free viewing, Visual search, Saliency, and Track, in 15 360-degree VR videos. Using this dataset, we analyze the patterns of human eye and head movements and reveal significant differences across different tasks in terms of fixation duration, saccade amplitude, head rotation velocity, and eye-head coordination. We then propose EHTask – a novel learning-based method that employs eye and head movements to recognize user tasks in VR. We show that our method significantly outperforms the state-of-the-art methods derived from 2D viewing conditions both on our dataset (accuracy of 84.4% vs. 62.8%) and on a real-world dataset (61.9% vs. 44.1%). As such, our work provides meaningful insights into human visual attention under different VR tasks and guides future work on recognizing user tasks in VR.Paper: hu21_tvcg_2.pdf@article{hu21_tvcg_2, author = {Hu, Zhiming and Bulling, Andreas and Li, Sheng and Wang, Guoping}, title = {EHTask: Recognizing User Tasks from Eye and Head Movements in Immersive Virtual Reality}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, year = {2021}, doi = {10.1109/TVCG.2021.3138902}, pages = {1992--2004}, volume = {29}, number = {4} } -
FixationNet: Forecasting Eye Fixations in Task-Oriented Virtual Environments
Zhiming Hu, Andreas Bulling, Sheng Li, Guoping Wang
IEEE Transactions on Visualization and Computer Graphics (TVCG), 27 (5), pp. 2681–2690, 2021.
TVCG Best Journal Nominees Award
Human visual attention in immersive virtual reality (VR) is key for many important applications, such as content design, gaze-contingent rendering, or gaze-based interaction. However, prior works typically focused on free-viewing conditions that have limited relevance for practical applications. We first collect eye tracking data of 27 participants performing a visual search task in four immersive VR environments. Based on this dataset, we provide a comprehensive analysis of the collected data and reveal correlations between users’ eye fixations and other factors, i.e. users’ historical gaze positions, task-related objects, saliency information of the VR content, and users’ head rotation velocities. Based on this analysis, we propose FixationNet – a novel learning-based model to forecast users’ eye fixations in the near future in VR. We evaluate the performance of our model for free-viewing and task-oriented settings and show that it outperforms the state of the art by a large margin of 19.8% (from a mean error of 2.93° to 2.35°) in free-viewing and of 15.1% (from 2.05° to 1.74°) in task-oriented situations. As such, our work provides new insights into task-oriented attention in virtual environments and guides future work on this important topic in VR research.@article{hu21_tvcg, author = {Hu, Zhiming and Bulling, Andreas and Li, Sheng and Wang, Guoping}, title = {FixationNet: Forecasting Eye Fixations in Task-Oriented Virtual Environments}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, year = {2021}, doi = {10.1109/TVCG.2021.3067779}, pages = {2681--2690}, volume = {27}, number = {5}, url = {https://cranehzm.github.io/FixationNet.html} } -
Decoding binary decisions under differential target probabilities from pupil dilation: A random forest approach
Christoph Strauch, Teresa Hirzle, Stefan Van der Stigchel, Andreas Bulling
Journal of Vision (JOV), 21 (7), pp. 1–13, 2021.
While our pupils slightly dilate when we look at an intended target, they do not when we look at irrelevant distractors. This suggests that it may be possible to decode the intention of an observer, understood as the outcome of implicit covert binary decisions, from the pupillary dynamics over time. However, only a few previous works have investigated the feasibility of this approach, and those that did failed to control for possible confounds such as motor-execution, changes in brightness, or target and distractor probability. We report on our efforts to decode intentions from pupil dilation obtained under strict experimental control on a single trial basis using a machine learning approach. Our analyses are based on data from n = 69 participants who looked at letters that needed to be selected with stimulus probabilities that varied systematically in a blockwise manner (n = 19,417 trials). We confirm earlier findings that pupil dilation is indicative of intentions and show that these can be decoded with a classification performance of up to 76% ROCAUC if targets are rarer than distractors. To better understand which characteristics of the pupillary signal are most informative, we finally compare relative feature importances. The first derivative of pupil size changes was found to be most relevant, allowing us to decode intention within only about 800 ms of trial onset. 
Taken together, our results provide credible insights into the potential of decoding intentions from pupil dilation and may soon form the basis for new applications in visual search, gaze-based interaction, or human-robot interaction.doi: 10.1167/jov.21.7.6Paper: strauch21_jov.pdf@article{strauch21_jov, author = {Strauch, Christoph and Hirzle, Teresa and Van der Stigchel, Stefan and Bulling, Andreas}, title = {Decoding binary decisions under differential target probabilities from pupil dilation: A random forest approach}, journal = {Journal of Vision (JOV)}, year = {2021}, volume = {21}, number = {7}, pages = {1--13}, doi = {10.1167/jov.21.7.6} }
Conference Papers
-
Altering Non-verbal Cues to Implicitly Direct Attention in Social VR
Radiah Rivu, Ken Pfeuffer, Philipp Müller, Yomna Abdelrahman, Andreas Bulling, Florian Alt
ACM Symposium on Spatial User Interaction (SSUI), pp. 1–2, 2021.
In this work we explore a concept system that alters the virtual eye movements without the user’s awareness, and whether this can affect social attention among others. Our concept augments the real movements with subtle redirected gazes to people, that occur in intervals to remain unnoticed. We present a user study with groups of people conversing on a topic, and measure the level of visual attention among users. Compared to a baseline of natural eye movements, we find that the method has indeed affected the overall attention in the group, but in unexpected ways. Our work points to a new way to exploit the inherent role of eyes in social virtual reality.Paper: rivu21_ssui.pdf@inproceedings{rivu21_ssui, title = {Altering Non-verbal Cues to Implicitly Direct Attention in Social VR}, author = {Rivu, Radiah and Pfeuffer, Ken and Müller, Philipp and Abdelrahman, Yomna and Bulling, Andreas and Alt, Florian}, year = {2021}, booktitle = {ACM Symposium on Spatial User Interaction (SSUI)}, pages = {1--2}, doi = {10.1145/3485279.3485309} } -
VQA-MHUG: A gaze dataset to study multimodal neural attention in VQA
Ekta Sood, Fabian Kögel, Florian Strohm, Prajit Dhar, Andreas Bulling
Proc. ACL SIGNLL Conference on Computational Natural Language Learning (CoNLL), pp. 27–43, 2021.
Oral Presentation
We present VQA-MHUG - a novel 49-participant dataset of multimodal human gaze on both images and questions during visual question answering (VQA) collected using a high-speed eye tracker. We use our dataset to analyze the similarity between human and neural attentive strategies learned by five state-of-the-art VQA models: Modulated Co-Attention Network (MCAN) with either grid or region features, Pythia, Bilinear Attention Network (BAN), and the Multimodal Factorized Bilinear Pooling Network (MFB). While prior work has focused on studying the image modality, our analyses show - for the first time - that for all models, higher correlation with human attention on text is a significant predictor of VQA performance. This finding points at a potential for improving VQA performance and, at the same time, calls for further research on neural text attention mechanisms and their integration into architectures for vision and language tasks, including but potentially also beyond VQA.@inproceedings{sood21_conll, title = {VQA-MHUG: A gaze dataset to study multimodal neural attention in VQA}, author = {Sood, Ekta and Kögel, Fabian and Strohm, Florian and Dhar, Prajit and Bulling, Andreas}, booktitle = {Proc. ACL SIGNLL Conference on Computational Natural Language Learning (CoNLL)}, year = {2021}, pages = {27--43}, doi = {10.18653/v1/2021.conll-1.3}, publisher = {Association for Computational Linguistics} } -
ConAn: A Usable Tool for Multimodal Conversation Analysis
Anna Penzkofer, Philipp Müller, Felix Bühler, Sven Mayer, Andreas Bulling
Proc. ACM International Conference on Multimodal Interaction (ICMI), pp. 341–351, 2021.
Multimodal analysis of group behavior is a key task in human-computer interaction, as well as the social and behavioral sciences, but is often limited to more easily controllable laboratory settings or requires elaborate multi-sensor setups and time-consuming manual data annotation. We present ConAn – a usable tool to explore and automatically analyze non-verbal behavior of multiple persons during natural group conversations. In contrast to traditional multi-sensor setups, our tool only requires a single 360° camera and uses state-of-the-art computer vision methods to automatically extract behavioral indicators, such as gaze direction, facial expressions, and speaking activity. Thus, our tool allows for easy and fast deployment supporting researchers in understanding both individual behavior and group interaction dynamics, but also in quantifying user-object interactions. We illustrate the benefits of our tool on three sample use cases: general conversation analysis, assessment of collaboration quality, and impact of technology on audience behavior. Taken together, ConAn represents an important step towards democratizing automatic conversation analysis in HCI and beyond.@inproceedings{penzkofer21_icmi, author = {Penzkofer, Anna and Müller, Philipp and Bühler, Felix and Mayer, Sven and Bulling, Andreas}, title = {ConAn: A Usable Tool for Multimodal Conversation Analysis}, booktitle = {Proc. ACM International Conference on Multimodal Interaction (ICMI)}, year = {2021}, doi = {10.1145/3462244.3479886}, pages = {341-351}, video = {https://www.youtube.com/watch?v=H2KfZNgx6CQ} } -
A Critical Assessment of the Use of SSQ as a Measure of General Discomfort in VR Head-Mounted Displays
Teresa Hirzle, Maurice Cordts, Enrico Rukzio, Jan Gugenheimer, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–14, 2021.
Based on a systematic literature review of more than 300 papers published over the last 10 years, we show that the simulator sickness questionnaire (SSQ) is extensively used and widely accepted as a general discomfort measure in virtual reality (VR) research - although it only accounts for one category of symptoms. This results in important other categories (digital eye strain (DES) and ergonomics) being largely neglected. To contribute to a more comprehensive picture of discomfort in VR head-mounted displays, we further conducted an online study (N=352) on the severity and relevance of all three symptom categories. Most importantly, our results reveal that symptoms of simulator sickness are significantly less severe and of lower prevalence than those of DES and ergonomics. In light of these findings, we critically discuss the current use of SSQ as the only discomfort measure and propose a more comprehensive factor model that also includes DES and ergonomics.Paper: hirzle21_chi.pdf@inproceedings{hirzle21_chi, title = {A Critical Assessment of the Use of {SSQ} as a Measure of General Discomfort in VR Head-Mounted Displays}, author = {Hirzle, Teresa and Cordts, Maurice and Rukzio, Enrico and Gugenheimer, Jan and Bulling, Andreas}, year = {2021}, pages = {1--14}, doi = {10.1145/3411764.3445361}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)} } -
MultiMediate: Multi-modal Group Behaviour Analysis for Artificial Mediation
Philipp Müller, Dominik Schiller, Dominike Thomas, Guanhua Zhang, Michael Dietz, Patrick Gebhard, Elisabeth André, Andreas Bulling
Proc. ACM Multimedia (MM), pp. 4878–4882, 2021.
Artificial mediators are promising to support human group conversations but at present their abilities are limited by insufficient progress in group behaviour analysis. The MultiMediate challenge addresses, for the first time, two fundamental group behaviour analysis tasks in well-defined conditions: eye contact detection and next speaker prediction. For training and evaluation, MultiMediate makes use of the MPIIGroupInteraction dataset consisting of 22 three- to four-person discussions as well as of an unpublished test set of six additional discussions. This paper describes the MultiMediate challenge and presents the challenge dataset including novel fine-grained speaking annotations that were collected for the purpose of MultiMediate. Furthermore, we present baseline approaches and ablation studies for both challenge tasks.Paper: mueller21_mm.pdf@inproceedings{mueller21_mm, title = {MultiMediate: Multi-modal Group Behaviour Analysis for Artificial Mediation}, author = {M{\"{u}}ller, Philipp and Schiller, Dominik and Thomas, Dominike and Zhang, Guanhua and Dietz, Michael and Gebhard, Patrick and André, Elisabeth and Bulling, Andreas}, year = {2021}, pages = {4878--4882}, doi = {10.1145/3474085.3479219}, booktitle = {Proc. ACM Multimedia (MM)} } -
Neural Photofit: Gaze-based Mental Image Reconstruction
Florian Strohm, Ekta Sood, Sven Mayer, Philipp Müller, Mihai Bâce, Andreas Bulling
Proc. IEEE International Conference on Computer Vision (ICCV), pp. 245–254, 2021.
We propose a novel method that leverages human fixations to visually decode the image a person has in mind into a photofit (facial composite). Our method combines three neural networks: An encoder, a scoring network, and a decoder. The encoder extracts image features and predicts a neural activation map for each face looked at by a human observer. A neural scoring network compares the human and neural attention and predicts a relevance score for each extracted image feature. Finally, image features are aggregated into a single feature vector as a linear combination of all features weighted by relevance which a decoder decodes into the final photofit. We train the neural scoring network on a novel dataset containing gaze data of 19 participants looking at collages of synthetic faces. We show that our method significantly outperforms a mean baseline predictor and report on a human study that shows that we can decode photofits that are visually plausible and close to the observer’s mental image. Code and dataset available upon request.@inproceedings{strohm21_iccv, title = {Neural Photofit: Gaze-based Mental Image Reconstruction}, author = {Strohm, Florian and Sood, Ekta and Mayer, Sven and Müller, Philipp and Bâce, Mihai and Bulling, Andreas}, year = {2021}, booktitle = {Proc. IEEE International Conference on Computer Vision (ICCV)}, doi = {10.1109/ICCV48922.2021.00031}, pages = {245-254} }
Technical Reports
-
Scanpath Prediction on Information Visualisations
Yao Wang, Mihai Bâce, Andreas Bulling
arXiv:2112.02340, pp. 1–14, 2021.
We propose Unified Model of Saliency and Scanpaths (UMSS) – a model that learns to predict visual saliency and scanpaths (i.e. sequences of eye fixations) on information visualisations. Although scanpaths provide rich information about the importance of different visualisation elements during the visual exploration process, prior work has been limited to predicting aggregated attention statistics, such as visual saliency. We present in-depth analyses of gaze behaviour for different information visualisation elements (e.g. Title, Label, Data) on the popular MASSVIS dataset. We show that while, overall, gaze patterns are surprisingly consistent across visualisations and viewers, there are also structural differences in gaze dynamics for different elements. Informed by our analyses, UMSS first predicts multi-duration element-level saliency maps, then probabilistically samples scanpaths from them. Extensive experiments on MASSVIS show that our method consistently outperforms state-of-the-art methods with respect to several, widely used scanpath and saliency evaluation metrics. Our method achieves a relative improvement in sequence score of 11.5 % for scanpath prediction, and a relative improvement in Pearson correlation coefficient of up to 23.6 % for saliency prediction. These results are auspicious and point towards richer user models and simulations of visual attention on visualisations without the need for any eye tracking equipment.@techreport{wang21_arxiv, title = {Scanpath Prediction on Information Visualisations}, author = {Wang, Yao and B{\^a}ce, Mihai and Bulling, Andreas}, year = {2021}, pages = {1--14}, url = {https://arxiv.org/abs/2112.02340} } -
VisRecall: Quantifying Information Visualisation Recallability via Question Answering
Yao Wang, Chuhan Jiao, Mihai Bâce, Andreas Bulling
arXiv:2112.15217, pp. 1–10, 2021.
Despite its importance for assessing the effectiveness of communicating information visually, fine-grained recallability of information visualisations has not been studied quantitatively so far. In this work we propose a visual question answering (VQA) paradigm to study visualisation recallability and present VisRecall — a novel dataset consisting of 200 visualisations that are annotated with crowd-sourced human (N = 305) recallability scores obtained from 1,000 questions from five question types. Furthermore, we present the first computational method to predict recallability of different visualisation elements, such as the title or specific data values. We report detailed analyses of our method on VisRecall and demonstrate that it outperforms several baselines in overall recallability and FE-, F-, RV-, and U-question recallability. We further demonstrate one possible application of our method: recommending the visualisation type that maximises user recallability for a given data source. Taken together, our work makes fundamental contributions towards a new generation of methods to assist designers in optimising visualisations.@techreport{wang21_arxiv_2, title = {VisRecall: Quantifying Information Visualisation Recallability via Question Answering}, author = {Wang, Yao and Jiao, Chuhan and B{\^a}ce, Mihai and Bulling, Andreas}, year = {2021}, pages = {1--10}, url = {https://arxiv.org/abs/2112.15217} } -
Multimodal Integration of Human-Like Attention in Visual Question Answering
Ekta Sood, Fabian Kögel, Philipp Müller, Dominike Thomas, Mihai Bâce, Andreas Bulling
arXiv:2109.13139, pp. 1–11, 2021.
Human-like attention as a supervisory signal to guide neural attention has shown significant promise but is currently limited to uni-modal integration – even for inherently multi-modal tasks such as visual question answering (VQA). We present the Multimodal Human-like Attention Network (MULAN) – the first method for multimodal integration of human-like attention on image and text during training of VQA models. MULAN integrates attention predictions from two state-of-the-art text and image saliency models into neural self-attention layers of a recent transformer-based VQA model. Through evaluations on the challenging VQAv2 dataset, we show that MULAN achieves a new state-of-the-art performance of 73.98% accuracy on test-std and 73.72% on test-dev and, at the same time, has approximately 80% fewer trainable parameters than prior work. Overall, our work underlines the potential of integrating multimodal human-like and neural attention for VQA.Paper: sood21_arxiv.pdfPaper Access: https://arxiv.org/pdf/2109.13139.pdf@techreport{sood21_arxiv, author = {Sood, Ekta and Kögel, Fabian and Müller, Philipp and Thomas, Dominike and Bâce, Mihai and Bulling, Andreas}, title = {Multimodal Integration of Human-Like Attention in Visual Question Answering}, year = {2021}, url = {https://arxiv.org/pdf/2109.13139.pdf}, pages = {1--11} } -
Neural Photofit: Gaze-based Mental Image Reconstruction
Florian Strohm, Ekta Sood, Sven Mayer, Philipp Müller, Mihai Bâce, Andreas Bulling
arXiv:2108.07524, pp. 1–10, 2021.
We propose a novel method that leverages human fixations to visually decode the image a person has in mind into a photofit (facial composite). Our method combines three neural networks: an encoder, a scoring network, and a decoder. The encoder extracts image features and predicts a neural activation map for each face looked at by a human observer. A neural scoring network compares the human and neural attention and predicts a relevance score for each extracted image feature. Finally, image features are aggregated into a single feature vector as a linear combination of all features weighted by relevance, which a decoder decodes into the final photofit. We train the neural scoring network on a novel dataset containing gaze data of 19 participants looking at collages of synthetic faces. We show that our method significantly outperforms a mean baseline predictor and report on a human study that shows that we can decode photofits that are visually plausible and close to the observer’s mental image. Code and dataset available upon request.
Paper: strohm21_arxiv.pdf
Code: Available upon request.
Paper Access: https://arxiv.org/abs/2108.07524
Dataset: Available upon request.
@techreport{strohm21_arxiv, title = {Neural Photofit: Gaze-based Mental Image Reconstruction}, author = {Strohm, Florian and Sood, Ekta and Mayer, Sven and Müller, Philipp and Bâce, Mihai and Bulling, Andreas}, year = {2021}, pages = {1--10}, url = {https://arxiv.org/abs/2108.07524} }
2020
Journal Articles
-
Deep Gaze Pooling: Inferring and Visually Decoding Search Intents From Human Gaze Fixations
Hosnieh Sattar, Mario Fritz, Andreas Bulling
Neurocomputing, 387, pp. 369–382, 2020.
Predicting the target of visual search from human eye fixations (gaze) is a difficult problem with many applications, e.g. in human-computer interaction. While previous work has focused on predicting specific search target instances, we propose the first approach to predict categories and attributes of search intents from gaze data and to visually reconstruct plausible targets. However, state-of-the-art models for categorical recognition, in general, require large amounts of training data, which is prohibitive for gaze data. To address this challenge, we further propose a novel Gaze Pooling Layer that combines gaze information with visual representations from Deep Learning approaches. Our scheme incorporates both spatial and temporal aspects of human gaze behavior as well as the appearance of the fixated locations. We propose an experimental setup and novel dataset and demonstrate the effectiveness of our method for gaze-based search target prediction and reconstruction. We highlight several practical advantages of our approach, such as compatibility with existing architectures, no need for gaze training data, and robustness to noise from common gaze sources.
Paper: sattar20_neurocomp.pdf
@article{sattar20_neurocomp, title = {Deep Gaze Pooling: Inferring and Visually Decoding Search Intents From Human Gaze Fixations}, author = {Sattar, Hosnieh and Fritz, Mario and Bulling, Andreas}, journal = {Neurocomputing}, year = {2020}, pages = {369--382}, volume = {387}, doi = {10.1016/j.neucom.2020.01.028} }
-
How far are we from quantifying visual attention in mobile HCI?
Mihai Bâce, Sander Staal, Andreas Bulling
IEEE Pervasive Computing, 19 (2), pp. 46–55, 2020.
With an ever-increasing number of mobile devices competing for attention, quantifying when, how often, or for how long users look at their devices has emerged as a key challenge in mobile human-computer interaction. Encouraged by recent advances in automatic eye contact detection using machine learning and device-integrated cameras, we provide a fundamental investigation into the feasibility of quantifying overt visual attention during everyday mobile interactions. We discuss the main challenges and sources of error associated with sensing visual attention on mobile devices in the wild, including the impact of face and eye visibility, the importance of robust head pose estimation, and the need for accurate gaze estimation. Our analysis informs future research on this emerging topic and underlines the potential of eye contact detection for exciting new applications towards next-generation pervasive attentive user interfaces.
Paper: bace20_pcm.pdf
@article{bace20_pcm, title = {How far are we from quantifying visual attention in mobile HCI?}, author = {B{\^a}ce, Mihai and Staal, Sander and Bulling, Andreas}, journal = {IEEE Pervasive Computing}, year = {2020}, volume = {19}, number = {2}, doi = {10.1109/MPRV.2020.2967736}, pages = {46--55} }
Conference Papers
-
Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention
Ekta Sood, Simon Tannert, Philipp Müller, Andreas Bulling
Advances in Neural Information Processing Systems (NeurIPS), pp. 1–15, 2020.
A lack of corpora has so far limited advances in integrating human gaze data as a supervisory signal in neural attention mechanisms for natural language processing (NLP). We propose a novel hybrid text saliency model (TSM) that, for the first time, combines a cognitive model of reading with explicit human gaze supervision in a single machine learning framework. We show on four different corpora that our hybrid TSM duration predictions are highly correlated with human gaze ground truth. We further propose a novel joint modelling approach to integrate the predictions of the TSM into the attention layer of a network designed for a specific upstream task without the need for task-specific human gaze data. We demonstrate that our joint model outperforms the state of the art in paraphrase generation on the Quora Question Pairs corpus by more than 10% in BLEU-4 and achieves state-of-the-art performance for sentence compression on the challenging Google Sentence Compression corpus. As such, our work introduces a practical approach for bridging between data-driven and cognitive models and demonstrates a new way to integrate human gaze-guided neural attention into NLP tasks.
@inproceedings{sood20_neurips, author = {Sood, Ekta and Tannert, Simon and Müller, Philipp and Bulling, Andreas}, title = {Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention}, year = {2020}, pages = {1--15}, booktitle = {Advances in Neural Information Processing Systems (NeurIPS)}, url = {https://proceedings.neurips.cc/paper/2020/hash/460191c72f67e90150a093b4585e7eb4-Abstract.html} }
-
Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension
Ekta Sood, Simon Tannert, Diego Frassinelli, Andreas Bulling, Ngoc Thang Vu
Proc. ACL SIGNLL Conference on Computational Natural Language Learning (CoNLL), pp. 12–25, 2020.
While neural networks with attention mechanisms have achieved superior performance on many natural language processing tasks, it remains unclear to what extent learned attention resembles human visual attention. In this paper, we propose a new method that leverages eye-tracking data to investigate the relationship between human visual attention and neural attention in machine reading comprehension. To this end, we introduce a novel 23-participant eye-tracking dataset, MQA-RC, in which participants read movie plots and answered pre-defined questions. We compare state-of-the-art networks based on long short-term memory (LSTM), convolutional neural models (CNN) and XLNet Transformer architectures. We find that higher similarity to human attention correlates significantly with performance for the LSTM and CNN models. However, we show that this relationship does not hold true for the XLNet models – despite the fact that XLNet performs best on this challenging task. Our results suggest that different architectures seem to learn rather different neural attention strategies and that similarity of neural to human attention does not guarantee best performance.
@inproceedings{sood20_conll, title = {Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension}, author = {Sood, Ekta and Tannert, Simon and Frassinelli, Diego and Bulling, Andreas and Vu, Ngoc Thang}, booktitle = {Proc. ACL SIGNLL Conference on Computational Natural Language Learning (CoNLL)}, year = {2020}, pages = {12--25}, doi = {10.18653/v1/P17}, publisher = {Association for Computational Linguistics} }
-
Combining Gaze Estimation and Optical Flow for Pursuits Interaction
Mihai Bâce, Vincent Becker, Chenyang Wang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–10, 2020.
Best Paper Award
Pursuit eye movements have become widely popular because they enable spontaneous eye-based interaction. However, existing methods to detect smooth pursuits require special-purpose eye trackers. We propose the first method to detect pursuits using a single off-the-shelf RGB camera in unconstrained remote settings. The key novelty of our method is that it combines appearance-based gaze estimation with optical flow in the eye region to jointly analyse eye movement dynamics in a single pipeline. We evaluate the performance and robustness of our method for different numbers of targets and trajectories in a 13-participant user study. We show that our method not only outperforms the current state of the art but also achieves competitive performance to a consumer eye tracker for a small number of targets. As such, our work points towards a new family of methods for pursuit interaction directly applicable to an ever-increasing number of devices readily equipped with cameras.
Paper: bace20_etra.pdf
@inproceedings{bace20_etra, title = {Combining Gaze Estimation and Optical Flow for Pursuits Interaction}, author = {B{\^a}ce, Mihai and Becker, Vincent and Wang, Chenyang and Bulling, Andreas}, year = {2020}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3379155.3391315}, pages = {1--10} }
-
Adversarial Attacks on Classifiers for Eye-based User Modelling
Inken Hagestedt, Michael Backes, Andreas Bulling
Adj. Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–3, 2020.
An ever-growing body of work has demonstrated the rich information content available in eye movements for user modelling, e.g. for predicting users’ activities, cognitive processes, or even personality traits. We show that state-of-the-art classifiers for eye-based user modelling are highly vulnerable to adversarial examples: small artificial perturbations in gaze input that can dramatically change a classifier’s predictions. On the sample task of eye-based document type recognition we study the success of adversarial attacks with and without targeting the attack to a specific class.
Paper: hagestedt20_etra.pdf
@inproceedings{hagestedt20_etra, title = {Adversarial Attacks on Classifiers for Eye-based User Modelling}, author = {Hagestedt, Inken and Backes, Michael and Bulling, Andreas}, year = {2020}, pages = {1--3}, doi = {10.1145/3379157.3390511}, booktitle = {Adj. Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)} }
-
A Survey of Digital Eye Strain in Gaze-Based Interactive Systems
Teresa Hirzle, Maurice Cordts, Enrico Rukzio, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–12, 2020.
Display-based interfaces pose high demands on users’ eyes that can cause severe vision and eye problems, also known as digital eye strain (DES). Although these problems can become even more severe if the eyes are actively used for interaction, prior work on gaze-based interfaces has largely neglected these risks. We offer the first comprehensive account of DES in gaze-based interactive systems that is specifically geared to gaze interaction designers. Through an extensive survey of more than 400 papers published over the last 46 years, we first discuss the current role of DES in interactive systems. One key finding is that DES is only rarely considered when evaluating novel gaze interfaces and neglected in discussions of usability. We identify the main causes and solutions to DES and derive recommendations for interaction designers on how to guide future research on evaluating and alleviating DES.
Paper: hirzle20_etra.pdf
@inproceedings{hirzle20_etra, title = {A Survey of Digital Eye Strain in Gaze-Based Interactive Systems}, author = {Hirzle, Teresa and Cordts, Maurice and Rukzio, Enrico and Bulling, Andreas}, year = {2020}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3379155.3391313}, pages = {1--12} }
-
Visual Analytics and Annotation of Pervasive Eye Tracking Video
Kuno Kurzhals, Nils Rodrigues, Maurice Koch, Michael Stoll, Andrés Bruhn, Andreas Bulling, Daniel Weiskopf
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2020.
We propose a new technique for visual analytics and annotation of long-term pervasive eye tracking data for which a combined analysis of gaze and egocentric video is necessary. Our approach enables two important tasks for such data for hour-long videos from individual participants: (1) efficient annotation and (2) direct interpretation of the results. Exemplary time spans can be selected by the user and are then used as a query that initiates a fuzzy search of similar time spans based on gaze and video features. In an iterative refinement loop, the query interface then provides suggestions for the importance of individual features to improve the search results. A multi-layered timeline visualization shows an overview of annotated time spans. We demonstrate the efficiency of our approach for analyzing activities in about seven hours of video in a case study and discuss feedback on our approach from novices and experts performing the annotation task.
Paper: kurzhals20_etra.pdf
@inproceedings{kurzhals20_etra, title = {Visual Analytics and Annotation of Pervasive Eye Tracking Video}, author = {Kurzhals, Kuno and Rodrigues, Nils and Koch, Maurice and Stoll, Michael and Bruhn, Andrés and Bulling, Andreas and Weiskopf, Daniel}, year = {2020}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3379155.3391326}, pages = {1--9} }
-
Anticipating Averted Gaze in Dyadic Interactions
Philipp Müller, Ekta Sood, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–10, 2020.
We present the first method to anticipate averted gaze in natural dyadic interactions. The task of anticipating averted gaze, i.e. that a person will not make eye contact in the near future, remains unsolved despite its importance for human social encounters as well as a number of applications, including human-robot interaction or conversational agents. Our multimodal method is based on a long short-term memory (LSTM) network that analyses non-verbal facial cues and speaking behaviour. We empirically evaluate our method for different future time horizons on a novel dataset of 121 YouTube videos of dyadic video conferences (74 hours in total). We investigate person-specific and person-independent performance and demonstrate that our method clearly outperforms baselines in both settings. As such, our work sheds light on the tight interplay between eye contact and other non-verbal signals and underlines the potential of computational modelling and anticipation of averted gaze for interactive applications.
Paper: mueller20_etra.pdf
@inproceedings{mueller20_etra, title = {Anticipating Averted Gaze in Dyadic Interactions}, author = {Müller, Philipp and Sood, Ekta and Bulling, Andreas}, year = {2020}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3379155.3391332}, pages = {1--10} }
-
Quantification of Users’ Visual Attention During Everyday Mobile Device Interactions
Mihai Bâce, Sander Staal, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–14, 2020.
We present the first real-world dataset and quantitative evaluation of visual attention of mobile device users in-situ, i.e. while using their devices during everyday routine. Understanding user attention is a core research challenge in mobile HCI but previous approaches relied on usage logs or self-reports that are only proxies and consequently reflect attention neither completely nor accurately. Our evaluations are based on Everyday Mobile Visual Attention (EMVA) – a new 32-participant dataset containing around 472 hours of video snippets recorded over more than two weeks in real life using the front-facing camera as well as associated usage logs, interaction events, and sensor data. Using an eye contact detection method, we are the first to quantify the highly dynamic nature of everyday visual attention across users, mobile applications, and usage contexts. We discuss key insights from our analyses that highlight the potential and inform the design of future mobile attentive user interfaces.
@inproceedings{bace20_chi, title = {Quantification of Users' Visual Attention During Everyday Mobile Device Interactions}, author = {B{\^a}ce, Mihai and Staal, Sander and Bulling, Andreas}, year = {2020}, pages = {1--14}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3313831.3376449}, news = {https://ethz.ch/en/news-and-events/eth-news/news/2020/09/our-actual-attention-is-now-measurable.html}, video = {https://www.youtube.com/watch?v=SzLn3LujIqw} }
-
Learning-based Region Selection for End-to-End Gaze Estimation
Xucong Zhang, Yusuke Sugano, Andreas Bulling, Otmar Hilliges
Proc. British Machine Vision Conference (BMVC), pp. 1–13, 2020.
Traditionally, appearance-based gaze estimation methods use statically defined face regions as input to the gaze estimator, such as eye patches, and therefore suffer from difficult lighting conditions and extreme head poses for which these regions are often not the most informative with respect to the gaze estimation task. We posit that facial regions should be selected dynamically based on the image content and propose a novel gaze estimation method that combines the task of region proposal and gaze estimation into a single end-to-end trainable framework. We introduce a novel loss that allows for unsupervised training of a region proposal network alongside the (supervised) training of the final gaze estimator. We show that our method can learn meaningful region selection strategies and outperforms fixed region approaches. We further show that our method performs particularly well for challenging cases, i.e., those with difficult lighting conditions such as directional lights, extreme head angles, or self-occlusion. Finally, we show that the proposed method achieves better results than the current state-of-the-art method in within and cross-dataset evaluations.
Paper: zhang20_bmvc.pdf
Supplementary Material: zhang20_bmvc_sup.pdf
@inproceedings{zhang20_bmvc, author = {Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas and Hilliges, Otmar}, title = {Learning-based Region Selection for End-to-End Gaze Estimation}, booktitle = {Proc. British Machine Vision Conference (BMVC)}, year = {2020}, pages = {1--13} }
Technical Reports
-
Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention
Ekta Sood, Simon Tannert, Philipp Müller, Andreas Bulling
arXiv:2010.07891, pp. 1–18, 2020.
A lack of corpora has so far limited advances in integrating human gaze data as a supervisory signal in neural attention mechanisms for natural language processing (NLP). We propose a novel hybrid text saliency model (TSM) that, for the first time, combines a cognitive model of reading with explicit human gaze supervision in a single machine learning framework. We show on four different corpora that our hybrid TSM duration predictions are highly correlated with human gaze ground truth. We further propose a novel joint modelling approach to integrate the predictions of the TSM into the attention layer of a network designed for a specific upstream task without the need for task-specific human gaze data. We demonstrate that our joint model outperforms the state of the art in paraphrase generation on the Quora Question Pairs corpus by more than 10% in BLEU-4 and achieves state-of-the-art performance for sentence compression on the challenging Google Sentence Compression corpus. As such, our work introduces a practical approach for bridging between data-driven and cognitive models and demonstrates a new way to integrate human gaze-guided neural attention into NLP tasks.
@techreport{sood20_arxiv, author = {Sood, Ekta and Tannert, Simon and Müller, Philipp and Bulling, Andreas}, title = {Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention}, year = {2020}, url = {https://arxiv.org/abs/2010.07891}, pages = {1--18} }
-
Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension
Ekta Sood, Simon Tannert, Diego Frassinelli, Andreas Bulling, Ngoc Thang Vu
arXiv:2010.06396, pp. 1–14, 2020.
While neural networks with attention mechanisms have achieved superior performance on many natural language processing tasks, it remains unclear to what extent learned attention resembles human visual attention. In this paper, we propose a new method that leverages eye-tracking data to investigate the relationship between human visual attention and neural attention in machine reading comprehension. To this end, we introduce a novel 23-participant eye-tracking dataset, MQA-RC, in which participants read movie plots and answered pre-defined questions. We compare state-of-the-art networks based on long short-term memory (LSTM), convolutional neural models (CNN) and XLNet Transformer architectures. We find that higher similarity to human attention correlates significantly with performance for the LSTM and CNN models. However, we show that this relationship does not hold true for the XLNet models – despite the fact that XLNet performs best on this challenging task. Our results suggest that different architectures seem to learn rather different neural attention strategies and that similarity of neural to human attention does not guarantee best performance.
@techreport{sood20_arxiv_2, title = {Interpreting Attention Models with Human Visual Attention in Machine Reading Comprehension}, author = {Sood, Ekta and Tannert, Simon and Frassinelli, Diego and Bulling, Andreas and Vu, Ngoc Thang}, year = {2020}, url = {https://arxiv.org/abs/2010.06396}, pages = {1--14} }
-
Adversarial Attacks on Classifiers for Eye-based User Modelling
Inken Hagestedt, Michael Backes, Andreas Bulling
arXiv:2006.00860, pp. 1–9, 2020.
An ever-growing body of work has demonstrated the rich information content available in eye movements for user modelling, e.g. for predicting users’ activities, cognitive processes, or even personality traits. We show that state-of-the-art classifiers for eye-based user modelling are highly vulnerable to adversarial examples: small artificial perturbations in gaze input that can dramatically change a classifier’s predictions. We generate these adversarial examples using the Fast Gradient Sign Method (FGSM) that linearises the gradient to find suitable perturbations. On the sample task of eye-based document type recognition we study the success of different adversarial attack scenarios: with and without knowledge about classifier gradients (white-box vs. black-box) as well as with and without targeting the attack to a specific class. In addition, we demonstrate the feasibility of defending against adversarial attacks by adding adversarial examples to a classifier’s training data.
Paper: hagestedt20_arxiv.pdf
Paper Access: https://arxiv.org/abs/2006.00860
@techreport{hagestedt20_arxiv, title = {Adversarial Attacks on Classifiers for Eye-based User Modelling}, author = {Hagestedt, Inken and Backes, Michael and Bulling, Andreas}, year = {2020}, pages = {1--9}, url = {https://arxiv.org/abs/2006.00860} }
2019
Journal Articles
-
Classifying Attention Types with Thermal Imaging and Eye Tracking
Yomna Abdelrahman, Anam Ahmad Khan, Joshua Newn, Eduardo Velloso, Sherine Ashraf Safwat, James Bailey, Andreas Bulling, Frank Vetere, Albrecht Schmidt
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), 3 (3), pp. 1–27, 2019.
Despite the importance of attention in user performance, current methods for attention classification cannot discriminate between different attention types. We propose a novel method that combines thermal imaging and eye tracking to unobtrusively classify four types of attention: sustained, alternating, selective, and divided. We collected a data set in which we stimulate these four attention types in a user study (N=22) using combinations of audio and visual stimuli while measuring users’ facial temperature and eye movement. Using logistic regression on features extracted from both sensing technologies, we can classify the four attention types with high AUC scores of up to 75.7% for user-independent/condition-independent, 87% for user-independent/condition-dependent, and 77.4% for user-dependent prediction. Our findings not only demonstrate the potential of thermal imaging and eye tracking for unobtrusive classification of different attention types but also pave the way for novel applications for attentive user interfaces and attention-aware computing.
doi: 10.1145/3351227
Paper: abdelrahman19_imwut.pdf
@article{abdelrahman19_imwut, author = {Abdelrahman, Yomna and Khan, Anam Ahmad and Newn, Joshua and Velloso, Eduardo and Safwat, Sherine Ashraf and Bailey, James and Bulling, Andreas and Vetere, Frank and Schmidt, Albrecht}, title = {Classifying Attention Types with Thermal Imaging and Eye Tracking}, journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)}, year = {2019}, volume = {3}, number = {3}, pages = {1--27}, doi = {10.1145/3351227} }
-
InvisibleEye: Fully Embedded Mobile Eye Tracking Using Appearance-Based Gaze Estimation
Julian Steil, Marc Tonsen, Yusuke Sugano, Andreas Bulling
ACM SIGMOBILE Mobile Computing and Communications Review, 23 (2), pp. 30–34, 2019.
Despite their potential for a range of exciting new applications, mobile eye trackers suffer from several fundamental usability problems. InvisibleEye is an innovative approach for mobile eye tracking that uses millimetre-size RGB cameras that can be fully embedded into normal glasses frames, as well as appearance-based gaze estimation to directly estimate gaze from the eye images. Through evaluation on three large-scale, increasingly realistic datasets, we show that InvisibleEye can achieve a person-specific gaze estimation accuracy of up to 2.04° using three camera pairs with a resolution of only 3x3 pixels.
Paper: steil19_sigmobile.pdf
@article{steil19_sigmobile, author = {Steil, Julian and Tonsen, Marc and Sugano, Yusuke and Bulling, Andreas}, title = {InvisibleEye: Fully Embedded Mobile Eye Tracking Using Appearance-Based Gaze Estimation}, journal = {ACM SIGMOBILE Mobile Computing and Communications Review}, year = {2019}, volume = {23}, number = {2}, pages = {30--34} }
-
MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling
IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41 (1), pp. 162–175, 2019.
Learning-based methods are believed to work well for unconstrained gaze estimation, i.e. gaze estimation from a monocular RGB camera without assumptions regarding user, environment, or camera. However, current gaze datasets were collected under laboratory conditions and methods were not evaluated across multiple datasets. Our work makes three contributions towards addressing these limitations. First, we present the MPIIGaze dataset, which contains 213,659 full face images and corresponding ground-truth gaze positions collected from 15 users during everyday laptop use over several months. An experience sampling approach ensured continuous gaze and head poses and realistic variation in eye appearance and illumination. To facilitate cross-dataset evaluations, 37,667 images were manually annotated with eye corners, mouth corners, and pupil centres. Second, we present an extensive evaluation of state-of-the-art gaze estimation methods on three current datasets, including MPIIGaze. We study key challenges including target gaze range, illumination conditions, and facial appearance variation. We show that image resolution and the use of both eyes affect gaze estimation performance, while head pose and pupil centre information are less informative. Finally, we propose GazeNet, the first deep appearance-based gaze estimation method. GazeNet improves on the state of the art by 22% (from a mean error of 13.9 degrees to 10.8 degrees) for the most challenging cross-dataset evaluation.
Paper: zhang19_pami.pdf
@article{zhang19_pami, title = {MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation}, author = {Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas}, year = {2019}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, doi = {10.1109/TPAMI.2017.2778103}, pages = {162--175}, volume = {41}, number = {1} }
Conference Papers
-
Towards High-Frequency SSVEP-Based Target Discrimination with an Extended Alphanumeric Keyboard
Sahar Abdelnabi, Michael Xuelin Huang, Andreas Bulling
Proc. IEEE International Conference on Systems, Man, and Cybernetics (SMC), pp. 1–6, 2019.
Despite significant advances in using Steady-State Visually Evoked Potentials (SSVEP) for on-screen target discrimination, existing methods either require intrusive, low-frequency visual stimulation or only support a small number of targets. We propose SSVEPNet: a convolutional long short-term memory (LSTM) recurrent neural network for high-frequency stimulation (≥30Hz) using a large number of visual targets. We evaluate our method for discriminating between 43 targets on an extended alphanumeric virtual keyboard and compare three different frequency assignment strategies. Our experimental results show that SSVEPNet significantly outperforms state-of-the-art correlation-based methods and convolutional neural networks. As such, our work opens up an exciting new direction of research towards a new class of unobtrusive and highly expressive SSVEP-based interfaces for text entry and beyond.
Paper: abdelnabi19_smc.pdf
@inproceedings{abdelnabi19_smc, author = {Abdelnabi, Sahar and Huang, Michael Xuelin and Bulling, Andreas}, title = {Towards High-Frequency SSVEP-Based Target Discrimination with an Extended Alphanumeric Keyboard}, booktitle = {Proc. IEEE International Conference on Systems, Man, and Cybernetics (SMC)}, year = {2019}, pages = {1--6} }
-
A fast approach to refraction-aware 3D eye-model fitting and gaze prediction
Kai Dierkes, Moritz Kassner, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2019.
By temporally integrating information about pupil contours extracted from eye images, model-based methods for glint-free gaze estimation can mitigate pupil detection noise. However, current approaches require time-consuming iterative solving of a nonlinear minimization problem to estimate key parameters, such as eyeball position. Based on the method presented by [Swirski and Dodgson 2013], we propose a novel approach to glint-free 3D eye-model fitting and gaze prediction using a single near-eye camera. By recasting model optimization as a least-squares intersection of lines, we make it amenable to a fast non-iterative solution. We further present a method for estimating deterministic refraction-correction functions from synthetic eye images and validate them on both synthetic and real eye images. We demonstrate the robustness of our method in the presence of pupil detection noise and show the benefit of temporal integration of pupil contour information on eyeball position and gaze estimation accuracy.
Paper: dierkes19_etra.pdf
@inproceedings{dierkes19_etra, title = {A fast approach to refraction-aware 3D eye-model fitting and gaze prediction}, author = {Dierkes, Kai and Kassner, Moritz and Bulling, Andreas}, year = {2019}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3314111.3319819}, pages = {1--9} }
-
Can Privacy-Aware Lifelogs Alter Our Memories?
Passant Elagroudy, Florian Mathis, Andreas Bulling, Mohamed Khamis, Diana Irmscher, Albrecht Schmidt
Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–6, 2019.
The abundance of automatically-triggered lifelogging cameras is a privacy threat to bystanders. Countering this by deleting photos limits relevant memory cues and the informative content of lifelogs. An alternative is to obfuscate bystanders, but it is not clear how this impacts the lifelogger’s recall of memories. We report on a study in which we compare viewing 1) unaltered photos, 2) photos with blurred people, and 3) a subset of the photos after deleting private ones, on memory recall. Findings show that obfuscated content helps users recall a lot of content, but it also results in recalling less accurate details, which can sometimes mislead the user. Our work informs the design of privacy-aware lifelogging systems that maximize recall and steers discussion about ubiquitous technologies that could alter human memories.Paper: elagroudy19_chi.pdf@inproceedings{elagroudy19_chi, author = {Elagroudy, Passant and Mathis, Florian and Bulling, Andreas and Khamis, Mohamed and Irmscher, Diana and Schmidt, Albrecht}, title = {Can Privacy-Aware Lifelogs Alter Our Memories?}, booktitle = {Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2019}, doi = {10.1145/3290607.3313052}, pages = {1--6} } -
A Design Space for Gaze Interaction on Head-mounted Displays
Teresa Hirzle, Jan Gugenheimer, Florian Geiselhart, Andreas Bulling, Enrico Rukzio
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–12, 2019.
Augmented and virtual reality (AR/VR) has entered the mass market and, with it, eye tracking will soon become a core technology for next-generation head-mounted displays (HMDs). In contrast to existing gaze interfaces, the 3D nature of AR and VR requires estimating a user’s gaze in 3D. While first applications, such as foveated rendering, hint at the compelling potential of combining HMDs and gaze, a systematic analysis is missing. To fill this gap, we present the first design space for gaze interaction on HMDs. Our design space covers human depth perception and technical requirements in two dimensions aiming to identify challenges and opportunities for interaction design. As such, our design space provides a comprehensive overview and serves as an important guideline for researchers and practitioners working on gaze interaction on HMDs. We further demonstrate how our design space is used in practice by presenting two interactive applications: EyeHealth and XRay-Vision.Paper: hirzle19_chi.pdf@inproceedings{hirzle19_chi, author = {Hirzle, Teresa and Gugenheimer, Jan and Geiselhart, Florian and Bulling, Andreas and Rukzio, Enrico}, title = {A Design Space for Gaze Interaction on Head-mounted Displays}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2019}, doi = {10.1145/3290605.3300855}, pages = {1--12} } -
SacCalib: Reducing Calibration Distortion for Stationary Eye Trackers Using Saccadic Eye Movements
Michael Xuelin Huang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–10, 2019.
Recent methods to automatically calibrate stationary eye trackers were shown to effectively reduce inherent calibration distortion. However, these methods require additional information, such as mouse clicks or on-screen content. We propose the first method that only requires users’ eye movements to reduce calibration distortion in the background while users naturally look at an interface. Our method exploits that calibration distortion makes straight saccade trajectories appear curved between the saccadic start and end points. We show that this curving effect is systematic and the result of a distorted gaze projection plane. To mitigate calibration distortion, our method undistorts this plane by straightening saccade trajectories using image warping. We show that this approach improves over the common six-point calibration and is promising for reducing distortion. As such, it provides a non-intrusive solution to alleviating accuracy decrease of eye tracker during long-term use.Paper: huang19_etra.pdf@inproceedings{huang19_etra, title = {SacCalib: Reducing Calibration Distortion for Stationary Eye Trackers Using Saccadic Eye Movements}, author = {Huang, Michael Xuelin and Bulling, Andreas}, year = {2019}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, pages = {1--10}, doi = {10.1145/3317956.3321553} } -
Moment-to-Moment Detection of Internal Thought during Video Viewing from Eye Vergence Behavior
Michael Xuelin Huang, Jiajia Li, Grace Ngai, Hong Va Leong, Andreas Bulling
Proc. ACM Multimedia (MM), pp. 1–9, 2019.
Internal thought refers to the process of directing attention away from a primary visual task to internal cognitive processing. Internal thought is a pervasive mental activity and closely related to primary task performance. As such, automatic detection of internal thought has significant potential for user modelling in intelligent interfaces, particularly for e-learning applications. Despite the close link between the eyes and the human mind, only a few studies have investigated vergence behaviour during internal thought and none has studied moment-to-moment detection of internal thought from gaze. While prior studies relied on long-term data analysis and required a large number of gaze characteristics, we describe a novel method that is computationally light-weight and that only requires eye vergence information that is readily available from binocular eye trackers. We further propose a novel paradigm to obtain ground truth internal thought annotations that exploits human blur perception. We evaluate our method for three increasingly challenging detection tasks: (1) during a controlled math-solving task, (2) during natural viewing of lecture videos, and (3) during daily activities, such as coding, browsing, and reading. Results from these evaluations demonstrate the performance and robustness of vergence-based detection of internal thought and, as such, open up new directions for research on interfaces that adapt to shifts of mental attention.Paper: huang19_mm.pdf@inproceedings{huang19_mm, title = {Moment-to-Moment Detection of Internal Thought during Video Viewing from Eye Vergence Behavior}, author = {Huang, Michael Xuelin and Li, Jiajia and Ngai, Grace and Leong, Hong Va and Bulling, Andreas}, booktitle = {Proc. ACM Multimedia (MM)}, year = {2019}, doi = {10.1145/3343031.3350573}, pages = {1--9} } -
Reducing Calibration Drift in Mobile Eye Trackers by Exploiting Mobile Phone Usage
Philipp Müller, Daniel Buschek, Michael Xuelin Huang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2019.
Automatic saliency-based recalibration is promising for addressing calibration drift in mobile eye trackers but existing bottom-up saliency methods neglect user’s goal-directed visual attention in natural behaviour. By inspecting real-life recordings of egocentric eye tracker cameras, we reveal that users are likely to look at their phones once these appear in view. We propose two novel automatic recalibration methods that exploit mobile phone usage: The first builds saliency maps using the phone location in the egocentric view to identify likely gaze locations. The second uses the occurrence of touch events to recalibrate the eye tracker, thereby enabling privacy-preserving recalibration. Through in-depth evaluations on a recent mobile eye tracking dataset (N=17, 65 hours) we show that our approaches outperform a state-of-the-art saliency approach for the automatic recalibration task. As such, our approach improves mobile eye tracking and gaze-based interaction, particularly for long-term use.Paper: mueller19_etra.pdf@inproceedings{mueller19_etra, title = {Reducing Calibration Drift in Mobile Eye Trackers by Exploiting Mobile Phone Usage}, author = {M{\"{u}}ller, Philipp and Buschek, Daniel and Huang, Michael Xuelin and Bulling, Andreas}, year = {2019}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3314111.3319918}, pages = {1--9} } -
Emergent Leadership Detection Across Datasets
Philipp Müller, Andreas Bulling
Proc. ACM International Conference on Multimodal Interaction (ICMI), pp. 274-278, 2019.
Automatic detection of emergent leaders in small groups from nonverbal behaviour is a growing research topic in social signal processing but existing methods were evaluated on single datasets – an unrealistic assumption for real-world applications in which systems are required to also work in settings unseen at training time. It therefore remains unclear whether current methods for emergent leadership detection generalise to similar but new settings and to which extent. To overcome this limitation, we are the first to study a cross-dataset evaluation setting for the emergent leadership detection task. We provide evaluations for within- and cross-dataset prediction using two current datasets (PAVIS and MPIIGroupInteraction), as well as an investigation on the robustness of commonly used feature channels and online prediction in the cross-dataset setting. Our evaluations show that using pose and eye contact based features, cross-dataset prediction is possible with an accuracy of 0.68, as such providing another important piece of the puzzle towards real-world emergent leadership detection.Paper: mueller19_icmi.pdf@inproceedings{mueller19_icmi, title = {Emergent Leadership Detection Across Datasets}, author = {M{\"{u}}ller, Philipp and Bulling, Andreas}, year = {2019}, pages = {274-278}, booktitle = {Proc. ACM International Conference on Multimodal Interaction (ICMI)}, doi = {10.1145/3340555.3353721} } -
KnuckleTouch: Enabling Knuckle Gestures on Capacitive Touchscreens using Deep Learning
Robin Schweigert, Jan Leusmann, Simon Hagenmayer, Maximilian Weiß, Huy Viet Le, Sven Mayer, Andreas Bulling
Proc. Mensch und Computer, pp. 387-397, 2019.
While mobile devices have become essential for social communication and have paved the way for work on the go, their interactive capabilities are still limited to simple touch input. A promising enhancement for touch interaction is knuckle input, but recognizing knuckle gestures robustly and accurately remains challenging. We present a method to differentiate between 17 finger and knuckle gestures based on a long short-term memory (LSTM) machine learning model. Furthermore, we introduce an open source approach that is ready-to-deploy on commodity touch-based devices. The model was trained on a new dataset that we collected in a mobile interaction study with 18 participants. We show that our method can achieve an accuracy of 86.8% on recognizing one of the 17 gestures and an accuracy of 94.6% to differentiate between finger and knuckle. In our evaluation study, we validated our models and found that the LSTM gesture recognizer achieved an accuracy of 88.6%. We show that KnuckleTouch can be used to improve the input expressiveness and to provide shortcuts to frequently used functions.@inproceedings{schweigert19_muc, title = {KnuckleTouch: Enabling Knuckle Gestures on Capacitive Touchscreens using Deep Learning}, author = {Schweigert, Robin and Leusmann, Jan and Hagenmayer, Simon and Weiß, Maximilian and Le, Huy Viet and Mayer, Sven and Bulling, Andreas}, year = {2019}, booktitle = {Proc. Mensch und Computer}, doi = {10.1145/3340764.3340767}, pages = {387-397}, video = {https://www.youtube.com/watch?v=akL3Ejx3bv8} } -
Predicting Gaze Patterns: Text Saliency for Integration into Machine Learning Tasks
Ekta Sood
Proc. International Workshop on Computational Cognition (ComCo), pp. 1–2, 2019.
Best Poster Award
Paper: sood19_comco.pdf@inproceedings{sood19_comco, author = {Sood, Ekta}, title = {Predicting Gaze Patterns: Text Saliency for Integration into Machine Learning Tasks}, year = {2019}, pages = {1--2}, booktitle = {Proc. International Workshop on Computational Cognition (ComCo)}, url = {https://perceptualui.org/publications/sood19_comco_poster.pdf} } -
PrivacEye: Privacy-Preserving Head-Mounted Eye Tracking Using Egocentric Scene Image and Eye Movement Features
Julian Steil, Marion Koelle, Wilko Heuten, Susanne Boll, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–10, 2019.
Best Video Award
Eyewear devices, such as augmented reality displays, increasingly integrate eye tracking but the first-person camera required to map a user’s gaze to the visual scene can pose a significant threat to user and bystander privacy. We present PrivacEye, a method to detect privacy-sensitive everyday situations and automatically enable and disable the eye tracker’s first-person camera using a mechanical shutter. To close the shutter in privacy-sensitive situations, the method uses a deep representation of the first-person video combined with rich features that encode users’ eye movements. To open the shutter without visual input, PrivacEye detects changes in users’ eye movements alone to gauge changes in the "privacy level" of the current situation. We evaluate our method on a first-person video dataset recorded in daily life situations of 17 participants, annotated by themselves for privacy sensitivity, and show that our method is effective in preserving privacy in this challenging setting.Paper: steil19_etra.pdfSupplementary Material: steil19_etra_sup.pdf@inproceedings{steil19_etra, title = {PrivacEye: Privacy-Preserving Head-Mounted Eye Tracking Using Egocentric Scene Image and Eye Movement Features}, author = {Steil, Julian and Koelle, Marion and Heuten, Wilko and Boll, Susanne and Bulling, Andreas}, year = {2019}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, pages = {1--10}, doi = {10.1145/3314111.3319913}, video = {https://www.youtube.com/watch?v=Gy61255F8T8} } -
Privacy-Aware Eye Tracking Using Differential Privacy
Julian Steil, Inken Hagestedt, Michael Xuelin Huang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2019.
Best Paper Award
With eye tracking being increasingly integrated into virtual and augmented reality (VR/AR) head-mounted displays, preserving users’ privacy is an ever more important, yet under-explored, topic in the eye tracking community. We report a large-scale online survey (N=124) on privacy aspects of eye tracking that provides the first comprehensive account of with whom, for which services, and to which extent users are willing to share their gaze data. Using these insights, we design a privacy-aware VR interface that uses differential privacy, which we evaluate on a new 20-participant dataset for two privacy sensitive tasks: We show that our method can prevent user re-identification and protect gender information while maintaining high performance for gaze-based document type classification. Our results highlight the privacy challenges particular to gaze data and demonstrate that differential privacy is a potential means to address them. Thus, this paper lays important foundations for future research on privacy-aware gaze interfaces.Paper: steil19_etra_2.pdf@inproceedings{steil19_etra_2, title = {Privacy-Aware Eye Tracking Using Differential Privacy}, author = {Steil, Julian and Hagestedt, Inken and Huang, Michael Xuelin and Bulling, Andreas}, year = {2019}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3314111.3319915}, pages = {1--9} } -
Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications
Xucong Zhang, Yusuke Sugano, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–13, 2019.
Appearance-based gaze estimation methods that only require an off-the-shelf camera have significantly improved but they are still not yet widely used in the human-computer interaction (HCI) community. This is partly because it remains unclear how they perform compared to model-based approaches as well as dominant, special-purpose eye tracking equipment. To address this limitation, we evaluate the performance of state-of-the-art appearance-based gaze estimation for interaction scenarios with and without personal calibration, indoors and outdoors, for different sensing distances, as well as for users with and without glasses. We discuss the obtained findings and their implications for the most important gaze-based applications, namely explicit eye input, attentive user interfaces, gaze-based user modelling, and passive eye monitoring. To democratise the use of appearance-based gaze estimation and interaction in HCI, we finally present OpenGaze (www.opengaze.org), the first software toolkit for appearance-based gaze estimation and interaction.@inproceedings{zhang19_chi, author = {Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas}, title = {Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2019}, doi = {10.1145/3290605.3300646}, pages = {1--13} }
Book Chapters
-
Pervasive Eye Tracking for Real-World Consumer Behavior Analysis
Andreas Bulling, Michel Wedel
Michael Schulte-Mecklenbeck, Anton Kühberger (Eds.): A Handbook of Process Tracing Methods for Decision Research: A Critical Review and User’s Guide, Taylor & Francis, pp. 27-44, 2019.
Eye tracking is the computational process of measuring the absolute point of gaze and/or the relative movement of the eyes over time using sensing systems placed in the environment (remote eye tracking) or worn on the head (mobile eye tracking). Eye tracking has a long history as a tool in psychology, human behaviour, and human-computer interaction research, and has also found its way into many commercial applications, such as marketing, web usability, virtual reality, or automotive engineering. Recent advances in mobile eye tracking as well as remote eye tracking using RGB cameras readily integrated into handheld devices and ambient displays pave the way for a whole new class of everyday eye tracking systems that allow researchers and practitioners to better understand and analyse gaze information in real-world settings. In this chapter, we first provide a history of eye tracking as both a measurement tool and research topic. Afterwards, we discuss the considerable potential but also remaining technical challenges for leveraging everyday eye tracking for real-world consumer behaviour analysis and decision making in retail.Paper: bulling19_tf.pdf@inbook{bulling19_tf, author = {Bulling, Andreas and Wedel, Michel}, title = {Pervasive Eye Tracking for Real-World Consumer Behavior Analysis}, booktitle = {A Handbook of Process Tracing Methods for Decision Research: A Critical Review and User's Guide}, year = {2019}, editor = {Schulte-Mecklenbeck, Michael and K{\"{u}}hberger, Anton}, publisher = {Taylor \& Francis}, pages = {27-44} }
Technical Reports
-
Accurate and Robust Eye Contact Detection During Everyday Mobile Device Interactions
Mihai Bâce, Sander Staal, Andreas Bulling
arXiv:1907.11115, pp. 1–12, 2019.
Quantification of human attention is key to several tasks in mobile human-computer interaction (HCI), such as predicting user interruptibility, estimating noticeability of user interface content, or measuring user engagement. Previous works to study mobile attentive behaviour required special-purpose eye tracking equipment or constrained users’ mobility. We propose a novel method to sense and analyse visual attention on mobile devices during everyday interactions. We demonstrate the capabilities of our method on the sample task of eye contact detection that has recently attracted increasing research interest in mobile HCI. Our method builds on a state-of-the-art method for unsupervised eye contact detection and extends it to address challenges specific to mobile interactive scenarios. Through evaluation on two current datasets, we demonstrate significant performance improvements for eye contact detection across mobile devices, users, or environmental conditions. Moreover, we discuss how our method enables the calculation of additional attention metrics that, for the first time, enable researchers from different domains to study and quantify attention allocation during mobile interactions in the wild.Paper: bace19_arxiv.pdfPaper Access: https://arxiv.org/abs/1907.11115@techreport{bace19_arxiv, title = {Accurate and Robust Eye Contact Detection During Everyday Mobile Device Interactions}, author = {B{\^a}ce, Mihai and Staal, Sander and Bulling, Andreas}, year = {2019}, pages = {1--12}, url = {https://arxiv.org/abs/1907.11115} } -
How far are we from quantifying visual attention in mobile HCI?
Mihai Bâce, Sander Staal, Andreas Bulling
arXiv:1907.11106, pp. 1–7, 2019.
With an ever-increasing number of mobile devices competing for our attention, quantifying when, how often, or for how long users visually attend to their devices has emerged as a core challenge in mobile human-computer interaction. Encouraged by recent advances in automatic eye contact detection using machine learning and device-integrated cameras, we provide a fundamental investigation into the feasibility of quantifying visual attention during everyday mobile interactions. We identify core challenges and sources of errors associated with sensing attention on mobile devices in the wild, including the impact of face and eye visibility, the importance of robust head pose estimation, and the need for accurate gaze estimation. Based on this analysis, we propose future research directions and discuss how eye contact detection represents the foundation for exciting new applications towards next-generation pervasive attentive user interfaces.Paper: bace19_arxiv_2.pdfPaper Access: https://arxiv.org/abs/1907.11106@techreport{bace19_arxiv_2, title = {How far are we from quantifying visual attention in mobile HCI?}, author = {B{\^a}ce, Mihai and Staal, Sander and Bulling, Andreas}, year = {2019}, pages = {1--7}, url = {https://arxiv.org/abs/1907.11106} } -
Moment-to-Moment Detection of Internal Thought from Eye Vergence Behaviour
Michael Xuelin Huang, Jiajia Li, Grace Ngai, Hong Va Leong, Andreas Bulling
arXiv:1901.06572, pp. 1–22, 2019.
Internal thought refers to the process of directing attention away from a primary visual task to internal cognitive processing. Internal thought is a pervasive mental activity and closely related to primary task performance. As such, automatic detection of internal thought has significant potential for user modelling in intelligent interfaces, particularly for e-learning applications. Despite the close link between the eyes and the human mind, only a few studies have investigated vergence behaviour during internal thought and none has studied moment-to-moment detection of internal thought from gaze. While prior studies relied on long-term data analysis and required a large number of gaze characteristics, we describe a novel method that is computationally light-weight and that only requires eye vergence information that is readily available from binocular eye trackers. We further propose a novel paradigm to obtain ground truth internal thought annotations that exploits human blur perception. We evaluate our method for three increasingly challenging detection tasks: (1) during a controlled math-solving task, (2) during natural viewing of lecture videos, and (3) during daily activities, such as coding, browsing, and reading. Results from these evaluations demonstrate the performance and robustness of vergence-based detection of internal thought and, as such, open up new directions for research on interfaces that adapt to shifts of mental attention.Paper: huang19_arxiv.pdfPaper Access: https://arxiv.org/abs/1901.06572@techreport{huang19_arxiv, title = {Moment-to-Moment Detection of Internal Thought from Eye Vergence Behaviour}, author = {Huang, Michael Xuelin and Li, Jiajia and Ngai, Grace and Leong, Hong Va and Bulling, Andreas}, year = {2019}, pages = {1--22}, url = {https://arxiv.org/abs/1901.06572} } -
SacCalib: Reducing Calibration Distortion for Stationary Eye Trackers Using Saccadic Eye Movements
Michael Xuelin Huang, Andreas Bulling
arXiv:1903.04047, pp. 1–10, 2019.
Recent methods to automatically calibrate stationary eye trackers were shown to effectively reduce inherent calibration distortion. However, these methods require additional information, such as mouse clicks or on-screen content. We propose the first method that only requires users’ eye movements to reduce calibration distortion in the background while users naturally look at an interface. Our method exploits that calibration distortion makes straight saccade trajectories appear curved between the saccadic start and end points. We show that this curving effect is systematic and the result of a distorted gaze projection plane. To mitigate calibration distortion, our method undistorts this plane by straightening saccade trajectories using image warping. We show that this approach improves over the common six-point calibration and is promising for reducing distortion. As such, it provides a non-intrusive solution to alleviating accuracy decrease of eye tracker during long-term use.Paper: huang19_arxiv_2.pdfPaper Access: https://arxiv.org/abs/1903.04047@techreport{huang19_arxiv_2, title = {SacCalib: Reducing Calibration Distortion for Stationary Eye Trackers Using Saccadic Eye Movements}, author = {Huang, Michael Xuelin and Bulling, Andreas}, year = {2019}, pages = {1--10}, url = {https://arxiv.org/abs/1903.04047} } -
Emergent Leadership Detection Across Datasets
Philipp Müller, Andreas Bulling
arXiv:1905.02058, pp. 1–5, 2019.
Automatic detection of emergent leaders in small groups from nonverbal behaviour is a growing research topic in social signal processing but existing methods were evaluated on single datasets – an unrealistic assumption for real-world applications in which systems are required to also work in settings unseen at training time. It therefore remains unclear whether current methods for emergent leadership detection generalise to similar but new settings and to which extent. To overcome this limitation, we are the first to study a cross-dataset evaluation setting for the emergent leadership detection task. We provide evaluations for within- and cross-dataset prediction using two current datasets (PAVIS and MPIIGroupInteraction), as well as an investigation on the robustness of commonly used feature channels (visual focus of attention, body pose, facial action units, speaking activity) and online prediction in the cross-dataset setting. Our evaluations show that using pose and eye contact based features, cross-dataset prediction is possible with an accuracy of 0.68, as such providing another important piece of the puzzle towards emergent leadership detection in the real world.Paper: mueller19_arxiv.pdfPaper Access: https://arxiv.org/abs/1905.02058@techreport{mueller19_arxiv, title = {Emergent Leadership Detection Across Datasets}, author = {M{\"{u}}ller, Philipp and Bulling, Andreas}, year = {2019}, pages = {1--5}, url = {https://arxiv.org/abs/1905.02058} } -
Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications
Xucong Zhang, Yusuke Sugano, Andreas Bulling
arXiv:1901.10906, pp. 1–13, 2019.
Appearance-based gaze estimation methods that only require an off-the-shelf camera have significantly improved but they are still not yet widely used in the human-computer interaction (HCI) community. This is partly because it remains unclear how they perform compared to model-based approaches as well as dominant, special-purpose eye tracking equipment. To address this limitation, we evaluate the performance of state-of-the-art appearance-based gaze estimation for interaction scenarios with and without personal calibration, indoors and outdoors, for different sensing distances, as well as for users with and without glasses. We discuss the obtained findings and their implications for the most important gaze-based applications, namely explicit eye input, attentive user interfaces, gaze-based user modelling, and passive eye monitoring. To democratise the use of appearance-based gaze estimation and interaction in HCI, we finally present OpenGaze (www.opengaze.org), the first software toolkit for appearance-based gaze estimation and interaction.Paper: zhang19_arxiv.pdfPaper Access: https://arxiv.org/abs/1901.10906@techreport{zhang19_arxiv, title = {Evaluation of Appearance-Based Methods and Implications for Gaze-Based Applications}, author = {Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas}, year = {2019}, pages = {1--13}, url = {https://arxiv.org/abs/1901.10906} }
2018
Journal Articles
-
Eye movements during everyday behavior predict personality traits
Sabrina Hoppe, Tobias Loetscher, Stephanie Morey, Andreas Bulling
Frontiers in Human Neuroscience, 12, pp. 1–8, 2018.
Besides allowing us to perceive our surroundings, eye movements are also a window into our mind and a rich source of information on who we are, how we feel, and what we do. Here we show that eye movements during an everyday task predict aspects of our personality. We tracked eye movements of 42 participants while they ran an errand on a university campus and subsequently assessed their personality traits using well-established questionnaires. Using a state-of-the-art machine learning method and a rich set of features encoding different eye movement characteristics, we were able to reliably predict four of the Big Five personality traits (neuroticism, extraversion, agreeableness, conscientiousness) as well as perceptual curiosity only from eye movements. Further analysis revealed new relations between previously neglected eye movement characteristics and personality. Our findings demonstrate a considerable influence of personality on everyday eye movement control, thereby complementing earlier studies in laboratory settings. Improving automatic recognition and interpretation of human social signals is an important endeavor, enabling innovative design of human–computer systems capable of sensing spontaneous natural user behavior to facilitate efficient interaction and personalization.Paper: hoppe18_fhns.pdf@article{hoppe18_fhns, title = {Eye movements during everyday behavior predict personality traits}, author = {Hoppe, Sabrina and Loetscher, Tobias and Morey, Stephanie and Bulling, Andreas}, doi = {10.3389/fnhum.2018.00105}, volume = {12}, pages = {1--8}, year = {2018}, journal = {Frontiers in Human Neuroscience} } -
CueAuth: Comparing Touch, Mid-Air Gestures, and Gaze for Cue-based Authentication on Situated Displays
Mohamed Khamis, Ludwig Trotter, Ville Mäkelä, Emanuel von Zezschwitz, Jens Le, Andreas Bulling, Florian Alt
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), 2 (7), pp. 1–22, 2018.
Secure authentication on situated displays (e.g., to access sensitive information or to make purchases) is becoming increasingly important. One promising approach is to use authentication schemes that employ cues that users respond to while authenticating; these schemes overwhelm observers by requiring them to observe the cue itself as well as users’ response to the cue. Although previous work proposed a variety of modalities, such as gaze and mid-air gestures, to further improve security, an understanding of how they compare with regard to usability and security is still missing. In this paper, we compare modalities for cue-based authentication on situated displays. We provide the first comparison between touch, mid-air gestures, and calibration-free gaze using a state-of-the-art authentication concept. In two user studies (N=37) we found that the choice of touch or gaze presents a clear trade-off between usability and security. For example, while gaze input is more secure, it is also more demanding and requires longer authentication times. Mid-air gestures are slightly slower and more secure than touch but users hesitate to use them in public. We conclude with design implications for authentication using touch, mid-air gestures, and gaze and discuss how the choice of modality creates opportunities and challenges for improved authentication in public.doi: 10.1145/3287052Paper: khamis18_imwut.pdf@article{khamis18_imwut, title = {CueAuth: Comparing Touch, Mid-Air Gestures, and Gaze for Cue-based Authentication on Situated Displays}, author = {Khamis, Mohamed and Trotter, Ludwig and Mäkelä, Ville and von Zezschwitz, Emanuel and Le, Jens and Bulling, Andreas and Alt, Florian}, year = {2018}, journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)}, volume = {2}, number = {7}, pages = {1--22}, doi = {10.1145/3287052} } -
GazeDirector: Fully Articulated Eye Gaze Redirection in Video
Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, Andreas Bulling
Computer Graphics Forum (CGF), 37 (2), pp. 217-225, 2018.
Best Paper Honourable Mention Award
We present GazeDirector, a new approach for eye gaze redirection that uses model-fitting. Our method first tracks the eyes by fitting a multi-part eye region model to video frames using analysis-by-synthesis, thereby recovering eye region shape, texture, pose, and gaze simultaneously. It then redirects gaze by 1) warping the eyelids from the original image using a model-derived flow field, and 2) rendering and compositing synthesized 3D eyeballs onto the output image in a photorealistic manner. GazeDirector allows us to change where people are looking without person-specific training data, and with full articulation, i.e. we can precisely specify new gaze directions in 3D. Quantitatively, we evaluate both model-fitting and gaze synthesis, with experiments for gaze estimation and redirection on the Columbia gaze dataset. Qualitatively, we compare GazeDirector against recent work on gaze redirection, showing better results especially for large redirection angles. Finally, we demonstrate gaze redirection on YouTube videos by introducing new 3D gaze targets and by manipulating visual behavior.
@article{wood18_cgf, title = {GazeDirector: Fully Articulated Eye Gaze Redirection in Video}, author = {Wood, Erroll and Baltru{\v{s}}aitis, Tadas and Morency, Louis-Philippe and Robinson, Peter and Bulling, Andreas}, year = {2018}, journal = {Computer Graphics Forum (CGF)}, volume = {37}, number = {2}, pages = {217-225}, doi = {10.1111/cgf.13355}, video = {https://www.youtube.com/watch?v=rSNUGciJH6A} }
Conference Papers
-
Error-Aware Gaze-Based Interfaces for Robust Mobile Gaze Interaction
Michael Barz, Florian Daiber, Daniel Sonntag, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–10, 2018.
Best Paper Award
Gaze estimation error is unavoidable in head-mounted eye trackers and can severely hamper usability and performance of mobile gaze-based interfaces given that the error varies constantly for different interaction positions. In this work, we explore error-aware gaze-based interfaces that estimate and adapt to gaze estimation error on-the-fly. We implement a sample error-aware user interface for gaze-based selection and different error compensation methods: a naïve approach that increases component size directly proportional to the absolute error, a recent model by Feit et al. (CHI’17) that is based on the 2-dimensional error distribution, and a novel predictive model that shifts gaze by a directional error estimate. We evaluate these models in a 12-participant user study and show that our predictive model outperforms the others significantly in terms of selection rate, particularly for small gaze targets. These results underline both the feasibility and potential of next generation error-aware gaze-based user interfaces.
Paper: barz18_etra.pdf
@inproceedings{barz18_etra, author = {Barz, Michael and Daiber, Florian and Sonntag, Daniel and Bulling, Andreas}, title = {Error-Aware Gaze-Based Interfaces for Robust Mobile Gaze Interaction}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--10}, doi = {10.1145/3204493.3204536} }
-
A novel approach to single camera, glint-free 3D eye model fitting including corneal refraction
Kai Dierkes, Moritz Kassner, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2018.
Model-based methods for glint-free gaze estimation typically infer eye pose using pupil contours extracted from eye images. Existing methods, however, either ignore or require complex hardware setups to deal with refraction effects occurring at the corneal interfaces. In this work we provide a detailed analysis of the effects of refraction in glint-free gaze estimation using a single near-eye camera, based on the method presented by [Swirski et al. 2013]. We demonstrate systematic deviations in inferred eyeball positions and gaze directions with respect to synthetic ground-truth data and show that ignoring corneal refraction can result in angular errors of several degrees. Furthermore, we quantify gaze direction dependent errors in pupil radius estimates. We propose a novel approach to account for corneal refraction in 3D eye model fitting and, by analyzing synthetic and real images, show that our new method successfully captures refraction effects and helps to overcome the shortcomings of the state-of-the-art approach.
Paper: dierkes18_etra.pdf
@inproceedings{dierkes18_etra, author = {Dierkes, Kai and Kassner, Moritz and Bulling, Andreas}, title = {A novel approach to single camera, glint-free 3D eye model fitting including corneal refraction}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--9}, doi = {10.1145/3204493.3204525} }
-
Towards a Symbiotic Human-Machine Depth Sensor: Exploring 3D Gaze for Object Reconstruction
Teresa Hirzle, Jan Gugenheimer, Florian Geiselhart, Andreas Bulling, Enrico Rukzio
Adj. Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 114-116, 2018.
Eye tracking is expected to become an integral part of future augmented reality (AR) head-mounted displays (HMDs) given that it can easily be integrated into existing hardware and provides a versatile interaction modality. To augment objects in the real world, AR HMDs require a three-dimensional understanding of the scene, which is currently solved using depth cameras. In this work we aim to explore how 3D gaze data can be used to enhance scene understanding for AR HMDs by envisioning a symbiotic human-machine depth camera, fusing depth data with 3D gaze information. We present a first proof of concept, exploring to what extent we are able to recognise what a user is looking at by plotting 3D gaze data. To measure 3D gaze, we implemented a vergence-based algorithm and built an eye tracking setup consisting of a Pupil Labs headset and an OptiTrack motion capture system, allowing us to measure 3D gaze inside a 50x50x50 cm volume. We show first 3D gaze plots of "gazed-at" objects and describe our vision of a symbiotic human-machine depth camera that combines a depth camera and human 3D gaze information.
Paper: hirzle18_uist.pdf
@inproceedings{hirzle18_uist, title = {Towards a Symbiotic Human-Machine Depth Sensor: Exploring 3D Gaze for Object Reconstruction}, author = {Hirzle, Teresa and Gugenheimer, Jan and Geiselhart, Florian and Bulling, Andreas and Rukzio, Enrico}, year = {2018}, pages = {114-116}, doi = {10.1145/3266037.3266119}, booktitle = {Adj. Proc. ACM Symposium on User Interface Software and Technology (UIST)} }
-
Understanding Face and Eye Visibility in Front-Facing Cameras of Smartphones used in the Wild
Mohamed Khamis, Anita Baier, Niels Henze, Florian Alt, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–12, 2018.
Commodity mobile devices are now equipped with high-resolution front-facing cameras, paving the way for applications in biometrics, facial expression analysis, or gaze interaction. However, it is unknown how often users hold devices in a way that allows capturing their face or eyes, and how this impacts detection accuracy. We collected 25,726 in-the-wild photos taken from the front-facing camera of smartphones and associated application usage logs. We found that the full face is visible about 29% of the time, and that in most cases the face is only partially visible. We further identified an influence of users’ current activity; for example, when watching videos, the eyes but not the entire face are visible 75% of the time in our dataset. We found that state-of-the-art face detection algorithms perform poorly against photos taken from front-facing cameras. We discuss how these findings impact mobile applications that leverage face and eye detection, and derive practical implications to address the limitations of the state of the art.
Paper: khamis18_chi.pdf
@inproceedings{khamis18_chi, title = {Understanding Face and Eye Visibility in Front-Facing Cameras of Smartphones used in the Wild}, author = {Khamis, Mohamed and Baier, Anita and Henze, Niels and Alt, Florian and Bulling, Andreas}, year = {2018}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3173574.3173854}, pages = {1--12}, video = {https://www.youtube.com/watch?v=_L6FyzTjFG0} }
-
Which one is me? Identifying Oneself on Public Displays
Mohamed Khamis, Christian Becker, Andreas Bulling, Florian Alt
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–12, 2018.
Best Paper Honourable Mention Award
While user representations are extensively used on public displays, it remains unclear how well users can recognize their own representation among those of surrounding users. We study the most widely used representations: abstract objects, skeletons, silhouettes and mirrors. In a prestudy (N=12), we identify five strategies that users follow to recognize themselves on public displays. In a second study (N=19), we quantify the users’ recognition time and accuracy with respect to each representation type. Our findings suggest that there is a significant effect of (1) the representation type, (2) the strategies performed by users, and (3) the combination of both on recognition time and accuracy. We discuss the suitability of each representation for different settings and provide specific recommendations as to how user representations should be applied in multi-user scenarios. These recommendations guide practitioners and researchers in selecting the representation that best suits the deployment’s requirements and the user strategies that are feasible in that environment.
Paper: khamis18_chi_2.pdf
@inproceedings{khamis18_chi_2, title = {Which one is me? Identifying Oneself on Public Displays}, author = {Khamis, Mohamed and Becker, Christian and Bulling, Andreas and Alt, Florian}, year = {2018}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3173574.3173861}, pages = {1--12}, video = {https://www.youtube.com/watch?v=yG5_RBrnRx0} }
-
VRPursuits: Interaction in Virtual Reality using Smooth Pursuit Eye Movements
Mohamed Khamis, Carl Oechsner, Florian Alt, Andreas Bulling
Proc. International Conference on Advanced Visual Interfaces (AVI), pp. 1–8, 2018.
Gaze-based interaction using smooth pursuit eye movements (Pursuits) is attractive given that it is intuitive and overcomes the Midas touch problem. At the same time, eye tracking is becoming increasingly popular for VR applications. While Pursuits was shown to be effective in several interaction contexts, it has not been explored in depth for VR. In a user study (N=26), we investigated how parameters that are specific to VR settings influence the performance of Pursuits. We found that Pursuits is robust against different sizes of virtual 3D targets and distances to them. However, Pursuits’ performance improves when the trajectory size is larger, particularly if the user is walking while interacting. While walking, selecting moving targets via Pursuits is generally feasible albeit less accurate than when stationary. Finally, we discuss the implications of these findings and the potential of smooth pursuits for interaction in VR by demonstrating two sample use cases: 1) gaze-based authentication in VR, and 2) a space meteors shooting game.
Paper: khamis18_avi.pdf
@inproceedings{khamis18_avi, title = {VRPursuits: Interaction in Virtual Reality using Smooth Pursuit Eye Movements}, author = {Khamis, Mohamed and Oechsner, Carl and Alt, Florian and Bulling, Andreas}, year = {2018}, pages = {1--8}, booktitle = {Proc. International Conference on Advanced Visual Interfaces (AVI)}, doi = {10.1145/3206505.3206522} }
-
The Past, Present, and Future of Gaze-enabled Handheld Mobile Devices: Survey and Lessons Learned
Mohamed Khamis, Florian Alt, Andreas Bulling
Proc. ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI), pp. 1–17, 2018.
Best Paper Honourable Mention Award
While first-generation mobile gaze interfaces required special-purpose hardware, recent advances in computational gaze estimation and the availability of sensor-rich and powerful devices are finally fulfilling the promise of pervasive eye tracking and eye-based interaction on off-the-shelf mobile devices. This work provides the first holistic view on the past, present, and future of eye tracking on handheld mobile devices. To this end, we discuss how research developed from building hardware prototypes to accurate gaze estimation on unmodified smartphones and tablets. We then discuss implications by laying out 1) novel opportunities, which include pervasive advertising and conducting in-the-wild eye tracking studies on handhelds, as well as 2) new challenges that require further research, such as the visibility of the user’s eyes, lighting conditions, and privacy implications. We discuss how these developments shape MobileHCI research in the future, possibly the “next 20 years”, as the overarching theme of MobileHCI 2018 suggests.
Paper: khamis18_mobilehci.pdf
@inproceedings{khamis18_mobilehci, author = {Khamis, Mohamed and Alt, Florian and Bulling, Andreas}, title = {The Past, Present, and Future of Gaze-enabled Handheld Mobile Devices: Survey and Lessons Learned}, booktitle = {Proc. ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI)}, year = {2018}, doi = {10.1145/3229434.3229452}, pages = {1--17} }
-
GazeDrone: Mobile Eye-Based Interaction in Public Space Without Augmenting the User
Mohamed Khamis, Anna Kienle, Florian Alt, Andreas Bulling
Proc. ACM Workshop on Micro Aerial Vehicle Networks, Systems, and Applications (DroNet), pp. 66-71, 2018.
Gaze interaction holds a lot of promise for seamless human-computer interaction. At the same time, current wearable mobile eye trackers require user augmentation that negatively impacts natural user behavior while remote trackers require users to position themselves within a confined tracking range. We present GazeDrone, the first system that combines a camera-equipped aerial drone with a computational method to detect sidelong glances for spontaneous (calibration-free) gaze-based interaction with surrounding pervasive systems (e.g., public displays). GazeDrone does not require augmenting each user with on-body sensors and allows interaction from arbitrary positions, even while moving. We demonstrate that drone-supported gaze interaction is feasible and accurate for certain movement types. It is well-perceived by users, in particular while interacting from a fixed position as well as while moving orthogonally or diagonally to a display. We present design implications and discuss opportunities and challenges for drone-supported gaze interaction in public.
Paper: khamis18_dronet.pdf
@inproceedings{khamis18_dronet, title = {GazeDrone: Mobile Eye-Based Interaction in Public Space Without Augmenting the User}, author = {Khamis, Mohamed and Kienle, Anna and Alt, Florian and Bulling, Andreas}, doi = {10.1145/3213526.3213539}, year = {2018}, booktitle = {Proc. ACM Workshop on Micro Aerial Vehicle Networks, Systems, and Applications (DroNet)}, pages = {66-71} }
-
A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks
Arif Khan, Ingmar Steiner, Yusuke Sugano, Andreas Bulling, Ross Macdonald
Proc. Language Resources and Evaluation Conference (LREC), pp. 4277–4281, 2018.
Phonetic segmentation is the process of splitting speech into distinct phonetic units. Human experts routinely perform this task manually by analyzing auditory and visual cues using analysis software, which is an extremely time-consuming process. Methods exist for automatic segmentation, but these are not always accurate enough. In order to improve automatic segmentation, we need to model it as closely on manual segmentation as possible. This corpus is an effort to capture human segmentation behavior by recording experts performing a segmentation task. We believe that this data will enable us to highlight the important aspects of manual segmentation, which can be used in automatic segmentation to improve its accuracy.
Paper: khan18_lrec.pdf
@inproceedings{khan18_lrec, title = {A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks}, author = {Khan, Arif and Steiner, Ingmar and Sugano, Yusuke and Bulling, Andreas and Macdonald, Ross}, year = {2018}, pages = {4277--4281}, booktitle = {Proc. Language Resources and Evaluation Conference (LREC)} }
-
Hidden Pursuits: Evaluating Gaze-selection via Pursuits when the Stimulus Trajectory is Partially Hidden
Thomas Mattusch, Mahsa Mirzamohammad, Mohamed Khamis, Andreas Bulling, Florian Alt
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–5, 2018.
The idea behind gaze interaction using Pursuits is to leverage the smooth pursuit eye movements that humans perform when following moving targets. However, humans can also anticipate where a moving target would reappear if it temporarily hides from their view. In this work, we investigate how well users can select targets using Pursuits in cases where the target’s trajectory is partially invisible (HiddenPursuits): e.g., can users select a moving target that temporarily hides behind another object? Although HiddenPursuits was not studied in the context of interaction before, understanding how well users can perform HiddenPursuits presents numerous opportunities, particularly for small interfaces where a target’s trajectory can cover an area outside of the screen. We found that users can still select targets quickly via Pursuits even if their trajectory is up to 50% hidden, and at the expense of longer selection times when the hidden portion is larger. We discuss how gaze-based interfaces can leverage HiddenPursuits for an improved user experience.
Paper: mattusch18_etra.pdf
@inproceedings{mattusch18_etra, author = {Mattusch, Thomas and Mirzamohammad, Mahsa and Khamis, Mohamed and Bulling, Andreas and Alt, Florian}, title = {Hidden Pursuits: Evaluating Gaze-selection via Pursuits when the Stimulus Trajectory is Partially Hidden}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--5}, doi = {10.1145/3204493.3204569} }
-
Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behavior
Philipp Müller, Michael Xuelin Huang, Andreas Bulling
Proc. ACM International Conference on Intelligent User Interfaces (IUI), pp. 153-164, 2018.
Rapport, the close and harmonious relationship in which interaction partners are "in sync" with each other, was shown to result in smoother social interactions, improved collaboration, and improved interpersonal outcomes. In this work, we are the first to investigate automatic prediction of low rapport during natural interactions within small groups. This task is challenging given that rapport only manifests in subtle non-verbal signals that are, in addition, subject to influences of group dynamics as well as inter-personal idiosyncrasies. We record videos of unscripted discussions of three to four people using a multi-view camera system and microphones. We analyse a rich set of non-verbal signals for rapport detection, namely facial expressions, hand motion, gaze, speaker turns, and speech prosody. Using facial features, we can detect low rapport with an average precision of 0.7 (chance level at 0.25), while incorporating prior knowledge of participants’ personalities can even achieve early prediction without a drop in performance. We further provide a detailed analysis of different feature sets and the amount of information contained in different temporal segments of the interactions.
Paper: mueller18_iui.pdf
@inproceedings{mueller18_iui, title = {Detecting Low Rapport During Natural Interactions in Small Groups from Non-Verbal Behavior}, author = {M{\"{u}}ller, Philipp and Huang, Michael Xuelin and Bulling, Andreas}, year = {2018}, pages = {153-164}, booktitle = {Proc. ACM International Conference on Intelligent User Interfaces (IUI)}, doi = {10.1145/3172944.3172969} }
-
Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour
Philipp Müller, Michael Xuelin Huang, Xucong Zhang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–10, 2018.
Eye contact is one of the most important non-verbal social cues and fundamental to human interactions. However, detecting eye contact without specialized eye tracking equipment poses significant challenges, particularly for multiple people in real-world settings. We present a novel method to robustly detect eye contact in natural three- and four-person interactions using off-the-shelf ambient cameras. Our method exploits the fact that, during conversations, people tend to look at the person who is currently speaking. Harnessing the correlation between people’s gaze and speaking behaviour therefore allows our method to automatically acquire training data during deployment and adaptively train eye contact detectors for each target user. We empirically evaluate the performance of our method on a recent dataset of natural group interactions and demonstrate that it achieves a relative improvement over the state-of-the-art method of more than 60%, and also improves over a head pose based baseline.
Paper: mueller18_etra.pdf
@inproceedings{mueller18_etra, author = {M{\"{u}}ller, Philipp and Huang, Michael Xuelin and Zhang, Xucong and Bulling, Andreas}, title = {Robust Eye Contact Detection in Natural Multi-Person Interactions Using Gaze and Speaking Behaviour}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--10}, doi = {10.1145/3204493.3204549} }
-
Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings
Seonwook Park, Xucong Zhang, Andreas Bulling, Otmar Hilliges
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–10, 2018.
Best Presentation Award
Conventional feature-based and model-based gaze estimation methods have proven to perform well in settings with controlled illumination and specialized cameras. In unconstrained real-world settings, however, such methods are surpassed by recent appearance-based methods due to difficulties in modeling factors such as illumination changes and other visual artifacts. We present a novel learning-based method for eye region landmark localization that enables conventional methods to be competitive with the latest appearance-based methods. Despite having been trained exclusively on synthetic data, our method exceeds the state of the art for iris localization and eye shape registration on real-world imagery. We then use the detected landmarks as input to iterative model-fitting and lightweight learning-based gaze estimation methods. Our approach outperforms existing model-fitting and appearance-based methods in the context of person-independent and personalized gaze estimation.
Paper: park18_etra.pdf
@inproceedings{park18_etra, author = {Park, Seonwook and Zhang, Xucong and Bulling, Andreas and Hilliges, Otmar}, title = {Learning to Find Eye Region Landmarks for Remote Gaze Estimation in Unconstrained Settings}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--10}, doi = {10.1145/3204493.3204545}, video = {https://www.youtube.com/watch?v=I8WlEHgDBV4} }
-
Fixation Detection for Head-Mounted Eye Tracking Based on Visual Similarity of Gaze Targets
Julian Steil, Michael Xuelin Huang, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2018.
Fixations are widely analysed in human vision, gaze-based interaction, and experimental psychology research. However, robust fixation detection in mobile settings is profoundly challenging given the prevalence of user and gaze target motion. These movements feign a shift in gaze estimates in the frame of reference defined by the eye tracker’s scene camera. To address this challenge, we present a novel fixation detection method for head-mounted eye trackers. Our method exploits the fact that, independent of user or gaze target motion, target appearance remains about the same during a fixation. It extracts image information from small regions around the current gaze position and analyses the appearance similarity of these gaze patches across video frames to detect fixations. We evaluate our method using fine-grained fixation annotations on a five-participant indoor dataset (MPIIEgoFixation) with more than 2,300 fixations in total. Our method outperforms commonly used velocity- and dispersion-based algorithms, which highlights its significant potential to analyse scene image information for eye movement detection.
Paper: steil18_etra.pdf
@inproceedings{steil18_etra, author = {Steil, Julian and Huang, Michael Xuelin and Bulling, Andreas}, title = {Fixation Detection for Head-Mounted Eye Tracking Based on Visual Similarity of Gaze Targets}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--9}, doi = {10.1145/3204493.3204538} }
-
Forecasting User Attention During Everyday Mobile Interactions Using Device-Integrated and Wearable Sensors
Julian Steil, Philipp Müller, Yusuke Sugano, Andreas Bulling
Proc. ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI), pp. 1–13, 2018.
Best Paper Award
Visual attention is highly fragmented during mobile interactions but the erratic nature of attention shifts currently limits attentive user interfaces to adapt after the fact, i.e. after shifts have already happened. We instead study attention forecasting – the challenging task of predicting users’ gaze behavior (overt visual attention) in the near future. We present a novel long-term dataset of everyday mobile phone interactions, continuously recorded from 20 participants engaged in common activities on a university campus over 4.5 hours each (more than 90 hours in total). We propose a proof-of-concept method that uses device-integrated sensors and body-worn cameras to encode rich information on device usage and users’ visual scene. We demonstrate that our method can forecast bidirectional attention shifts and whether the primary attentional focus is on the handheld mobile device. We study the impact of different feature sets on performance and discuss the significant potential but also remaining challenges of forecasting user attention during mobile interactions.
Paper: steil18_mobilehci.pdf
@inproceedings{steil18_mobilehci, author = {Steil, Julian and M{\"{u}}ller, Philipp and Sugano, Yusuke and Bulling, Andreas}, title = {Forecasting User Attention During Everyday Mobile Interactions Using Device-Integrated and Wearable Sensors}, booktitle = {Proc. ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI)}, year = {2018}, doi = {10.1145/3229434.3229439}, pages = {1--13} }
-
Training Person-Specific Gaze Estimators from Interactions with Multiple Devices
Xucong Zhang, Michael Xuelin Huang, Yusuke Sugano, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–12, 2018.
Learning-based gaze estimation has significant potential to enable attentive user interfaces and gaze-based interaction on the billions of camera-equipped handheld devices and ambient displays. While training accurate person- and device-independent gaze estimators remains challenging, person-specific training is feasible but requires tedious data collection for each target device. To address these limitations, we present the first method to train person-specific gaze estimators across multiple devices. At the core of our method is a single convolutional neural network with shared feature extraction layers and device-specific branches that we train from face images and corresponding on-screen gaze locations. Detailed evaluations on a new dataset of interactions with five common devices (mobile phone, tablet, laptop, desktop computer, smart TV) and three common applications (mobile game, text editing, media center) demonstrate the significant potential of cross-device training. We further explore training with gaze locations derived from natural interactions, such as mouse or touch input.
Paper: zhang18_chi.pdf
@inproceedings{zhang18_chi, title = {Training Person-Specific Gaze Estimators from Interactions with Multiple Devices}, author = {Zhang, Xucong and Huang, Michael Xuelin and Sugano, Yusuke and Bulling, Andreas}, year = {2018}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3173574.3174198}, pages = {1--12} }
-
Revisiting Data Normalization for Appearance-Based Gaze Estimation
Xucong Zhang, Yusuke Sugano, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2018.
Appearance-based gaze estimation is promising for unconstrained real-world settings, but the large variability in head pose and user-camera distance poses significant challenges for training generic gaze estimators. Data normalization was proposed to cancel out this geometric variability by mapping input images and gaze labels to a normalized space. Although used successfully in prior works, the role and importance of data normalization remain unclear. To fill this gap, we study data normalization for the first time using principled evaluations on both simulated and real data. We propose a modification to the current data normalization formulation by removing the scaling factor and show that our new formulation performs significantly better (between 9.5% and 32.7%) in the different evaluation settings. Using images synthesized from a 3D face model, we demonstrate the benefit of data normalization for the efficiency of the model training. Experiments on real-world images confirm the advantages of data normalization in terms of gaze estimation performance.
Paper: zhang18_etra.pdf
@inproceedings{zhang18_etra, author = {Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas}, title = {Revisiting Data Normalization for Appearance-Based Gaze Estimation}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2018}, pages = {1--9}, doi = {10.1145/3204493.3204548} }
Technical Reports
-
Privacy-Aware Eye Tracking Using Differential Privacy
Julian Steil, Inken Hagestedt, Michael Xuelin Huang, Andreas Bulling
arXiv:1812.08000, pp. 1–22, 2018.
With eye tracking being increasingly integrated into virtual and augmented reality (VR/AR) head-mounted displays, preserving users’ privacy is an ever more important, yet under-explored, topic in the eye tracking community. We report a large-scale online survey (N=124) on privacy aspects of eye tracking that provides the first comprehensive account of with whom, for which services, and to what extent users are willing to share their gaze data. Using these insights, we design a privacy-aware VR interface that uses differential privacy, which we evaluate on a new 20-participant dataset for two privacy-sensitive tasks. We show that our method can prevent user re-identification and protect gender information while maintaining high performance for gaze-based document type classification. Our results highlight the privacy challenges particular to gaze data and demonstrate that differential privacy is a potential means to address them. Thus, this paper lays important foundations for future research on privacy-aware gaze interfaces.
Paper: steil18_arxiv.pdf
Paper Access: https://arxiv.org/abs/1812.08000
@techreport{steil18_arxiv, author = {Steil, Julian and Hagestedt, Inken and Huang, Michael Xuelin and Bulling, Andreas}, title = {Privacy-Aware Eye Tracking Using Differential Privacy}, year = {2018}, pages = {1--22}, url = {https://arxiv.org/abs/1812.08000} }
-
PrivacEye: Privacy-Preserving First-Person Vision Using Image Features and Eye Movement Analysis
Julian Steil, Marion Koelle, Wilko Heuten, Susanne Boll, Andreas Bulling
arXiv:1801.04457, pp. 1–14, 2018.
As first-person cameras in head-mounted displays become increasingly prevalent, so does the problem of infringing user and bystander privacy. To address this challenge, we present PrivacEye, a proof-of-concept system that detects privacy-sensitive everyday situations and automatically enables and disables the first-person camera using a mechanical shutter. To close the shutter, PrivacEye detects sensitive situations from first-person camera videos using an end-to-end deep-learning model. To open the shutter without visual input, PrivacEye uses a separate, smaller eye camera to detect changes in users’ eye movements to gauge changes in the "privacy level" of the current situation. We evaluate PrivacEye on a dataset of first-person videos recorded in the daily life of 17 participants that they annotated with privacy sensitivity levels. We discuss the strengths and weaknesses of our proof-of-concept system based on a quantitative technical evaluation as well as qualitative insights from semi-structured interviews.Paper: steil18_arxiv_2.pdfPaper Access: https://arxiv.org/abs/1801.04457@techreport{steil18_arxiv_2, title = {PrivacEye: Privacy-Preserving First-Person Vision Using Image Features and Eye Movement Analysis}, author = {Steil, Julian and Koelle, Marion and Heuten, Wilko and Boll, Susanne and Bulling, Andreas}, year = {2018}, pages = {1--14}, url = {https://arxiv.org/abs/1801.04457} }
2017
Journal Articles
-
EyePACT: Eye-Based Parallax Correction on Touch-Enabled Interactive Displays
Mohamed Khamis, Daniel Buschek, Tobias Thieron, Florian Alt, Andreas Bulling
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), 1 (4), pp. 1–18, 2017.
The parallax effect describes the displacement between the perceived and detected touch locations on a touch-enabled surface. Parallax is a key usability challenge for interactive displays, particularly for those that require thick layers of glass between the screen and the touch surface to protect them from vandalism. To address this challenge, we present EyePACT, a method that compensates for input error caused by parallax on public displays. Our method uses a display-mounted depth camera to detect the user’s 3D eye position in front of the display and the detected touch location to predict the perceived touch location on the surface. We evaluate our method in two user studies in terms of parallax correction performance as well as multi-user support. Our evaluations demonstrate that EyePACT (1) significantly improves accuracy even with varying gap distances between the touch surface and the display, (2) adapts to different levels of parallax by resulting in significantly larger corrections with larger gap distances, and (3) maintains a significantly large distance between two users’ fingers when interacting with the same object. Our results provide implications for the development of future touch-enabled public displays.doi: 10.1145/3161168Paper: khamis17_imwut.pdf@article{khamis17_imwut, author = {Khamis, Mohamed and Buschek, Daniel and Thieron, Tobias and Alt, Florian and Bulling, Andreas}, title = {EyePACT: Eye-Based Parallax Correction on Touch-Enabled Interactive Displays}, journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)}, year = {2017}, volume = {1}, number = {4}, pages = {1--18}, doi = {10.1145/3161168} } -
InvisibleEye: Mobile Eye Tracking Using Multiple Low-Resolution Cameras and Learning-Based Gaze Estimation
Marc Tonsen, Julian Steil, Yusuke Sugano, Andreas Bulling
Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT), 1 (3), pp. 1–21, 2017.
Distinguished Paper Award
Analysis of everyday human gaze behaviour has significant potential for ubiquitous computing, as evidenced by a large body of work in gaze-based human-computer interaction, attentive user interfaces, and eye-based user modelling. However, current mobile eye trackers are still obtrusive, which not only makes them uncomfortable to wear and socially unacceptable in daily life, but also prevents them from being widely adopted in the social and behavioural sciences. To address these challenges we present InvisibleEye, a novel approach for mobile eye tracking that uses millimetre-size RGB cameras that can be fully embedded into normal glasses frames. To compensate for the cameras’ low image resolution of only a few pixels, our approach uses multiple cameras to capture different views of the eye, as well as learning-based gaze estimation to directly regress from eye images to gaze directions. We prototypically implement our system and characterise its performance on three large-scale, increasingly realistic, and thus challenging datasets: 1) eye images synthesised using a recent computer graphics eye region model, 2) real eye images recorded of 17 participants under controlled lighting, and 3) eye images recorded of four participants over the course of four recording sessions in a mobile setting. We show that InvisibleEye achieves a top person-specific gaze estimation accuracy of 1.79° using four cameras with a resolution of only 5×5 pixels. 
Our evaluations not only demonstrate the feasibility of this novel approach but, more importantly, underline its significant potential for finally realising the vision of invisible mobile eye tracking and pervasive attentive user interfaces.doi: 10.1145/3130971Paper: tonsen17_imwut.pdf@article{tonsen17_imwut, author = {Tonsen, Marc and Steil, Julian and Sugano, Yusuke and Bulling, Andreas}, title = {InvisibleEye: Mobile Eye Tracking Using Multiple Low-Resolution Cameras and Learning-Based Gaze Estimation}, journal = {Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies (IMWUT)}, year = {2017}, doi = {10.1145/3130971}, volume = {1}, number = {3}, pages = {1--21} } -
Look together: using gaze for assisting co-located collaborative search
Yanxia Zhang, Ken Pfeuffer, Ming Ki Chong, Jason Alexander, Andreas Bulling, Hans Gellersen
Personal and Ubiquitous Computing, 21 (1), pp. 173-186, 2017.
Gaze information provides an indication of users’ focus, which complements remote collaboration tasks, as distant users can see their partner’s focus. In this paper, we apply gaze for co-located collaboration, where users’ gaze locations are presented on the same display, to help collaboration between partners. We integrated various types of gaze indicators on the user interface of a collaborative search system, and we conducted two user studies to understand how gaze enhances coordination and communication between co-located users. Our results show that gaze indeed enhances co-located collaboration, but with a trade-off between visibility of gaze indicators and user distraction. Users acknowledged that seeing gaze indicators eases communication, because it lets them be aware of their partner’s interests and attention. However, users can be reluctant to share their gaze information due to trust and privacy, as gaze potentially divulges their interests.Paper: zhang17_puc.pdf@article{zhang17_puc, title = {Look together: using gaze for assisting co-located collaborative search}, author = {Zhang, Yanxia and Pfeuffer, Ken and Chong, Ming Ki and Alexander, Jason and Bulling, Andreas and Gellersen, Hans}, year = {2017}, journal = {Personal and Ubiquitous Computing}, publisher = {Springer}, volume = {21}, number = {1}, pages = {173-186}, doi = {10.1007/s00779-016-0969-x} }
Conference Papers
-
Gaze Embeddings for Zero-Shot Image Classification
Nour Karessli, Zeynep Akata, Bernt Schiele, Andreas Bulling
Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6412-6421, 2017.
Spotlight Presentation
Zero-shot image classification using auxiliary information, such as attributes describing discriminative object properties, requires time-consuming annotation by domain experts. We instead propose a method that relies on human gaze as auxiliary information, exploiting that even non-expert users have a natural ability to judge class membership. We present a data collection paradigm that involves a discrimination task to increase the information content obtained from gaze data. Our method extracts discriminative descriptors from the data and learns a compatibility function between image and gaze using three novel gaze embeddings: Gaze Histograms (GH), Gaze Features with Grid (GFG) and Gaze Features with Sequence (GFS). We introduce two new gaze-annotated datasets for fine-grained image classification and show that human gaze data is indeed class discriminative, provides a competitive alternative to expert-annotated attributes, and outperforms other baselines for zero-shot image classification.Paper: karessli17_cvpr.pdf@inproceedings{karessli17_cvpr, title = {Gaze Embeddings for Zero-Shot Image Classification}, author = {Karessli, Nour and Akata, Zeynep and Schiele, Bernt and Bulling, Andreas}, year = {2017}, booktitle = {Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}, pages = {6412-6421}, doi = {10.1109/CVPR.2017.679} } -
GTmoPass: Two-factor Authentication on Public Displays Using GazeTouch passwords and Personal Mobile Devices
Mohamed Khamis, Regina Hasholzner, Andreas Bulling, Florian Alt
Proc. ACM International Symposium on Pervasive Displays (PerDis), pp. 1–9, 2017.
As public displays continue to deliver increasingly private and personalized content, there is a need to ensure that only the legitimate users can access private information in sensitive contexts. While public displays can adopt authentication concepts similar to those used on public terminals (e.g., ATMs), authentication in public is subject to a number of risks. Namely, adversaries can uncover a user’s password through (1) shoulder surfing, (2) thermal attacks, or (3) smudge attacks. To address this problem we propose GTmoPass, an authentication architecture that enables multi-factor user authentication on public displays. The first factor is a knowledge-factor: we employ a shoulder-surfing resilient multimodal scheme that combines gaze and touch input for password entry. The second factor is a possession-factor: users utilize their personal mobile devices, on which they enter the password. Credentials are securely transmitted to a server via Bluetooth beacons. We describe the implementation of GTmoPass and report on an evaluation of its usability and security, which shows that although authentication using GTmoPass is slightly slower than traditional methods, it protects against the three aforementioned threats.Paper: khamis17_perdis.pdf@inproceedings{khamis17_perdis, title = {GTmoPass: Two-factor Authentication on Public Displays Using GazeTouch passwords and Personal Mobile Devices}, author = {Khamis, Mohamed and Hasholzner, Regina and Bulling, Andreas and Alt, Florian}, doi = {10.1145/3078810.3078815}, year = {2017}, pages = {1--9}, booktitle = {Proc. ACM International Symposium on Pervasive Displays (PerDis)} } -
EyeScout: Active Eye Tracking for Position and Movement Independent Gaze Interaction with Large Public Displays
Mohamed Khamis, Axel Hoesl, Alexander Klimczak, Martin Reiss, Florian Alt, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 155-166, 2017.
While gaze holds a lot of promise for hands-free interaction with public displays, remote eye trackers with their confined tracking box restrict users to a single stationary position in front of the display. We present EyeScout, an active eye tracking system that combines an eye tracker mounted on a rail system with a computational method to automatically detect and align the tracker with the user’s lateral movement. EyeScout addresses key limitations of current gaze-enabled large public displays by offering two novel gaze-interaction modes for a single user: In "Walk then Interact" the user can walk up to an arbitrary position in front of the display and interact, while in "Walk and Interact" the user can interact even while on the move. We report on a user study that shows that EyeScout is well perceived by users, extends a public display’s sweet spot into a sweet line, and reduces gaze interaction kick-off time to 3.5 seconds - a 62% improvement over state of the art solutions. We discuss sample applications that demonstrate how EyeScout can enable position and movement-independent gaze interaction with large public displays.Paper: khamis17_uist.pdf@inproceedings{khamis17_uist, title = {EyeScout: Active Eye Tracking for Position and Movement Independent Gaze Interaction with Large Public Displays}, author = {Khamis, Mohamed and Hoesl, Axel and Klimczak, Alexander and Reiss, Martin and Alt, Florian and Bulling, Andreas}, year = {2017}, pages = {155-166}, doi = {10.1145/3126594.3126630}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, video = {https://www.youtube.com/watch?v=J7_OiRqsmdM} } -
GazeTouchPIN: Protecting Sensitive Data on Mobile Devices using Secure Multimodal Authentication
Mohamed Khamis, Mariam Hassib, Emanuel von Zezschwitz, Andreas Bulling, Florian Alt
Proc. ACM International Conference on Multimodal Interaction (ICMI), pp. 446-450, 2017.
Although mobile devices provide access to a plethora of sensitive data, most users still only protect them with PINs or patterns, which are vulnerable to side-channel attacks (e.g., shoulder surfing). However, prior research has shown that privacy-aware users are willing to take further steps to protect their private data. We propose GazeTouchPIN, a novel secure authentication scheme for mobile devices that combines gaze and touch input. Our multimodal approach complicates shoulder-surfing attacks by requiring attackers to observe the screen as well as the user’s eyes to find the password. We evaluate the security and usability of GazeTouchPIN in two user studies (N=30). We found that while GazeTouchPIN requires longer entry times, privacy aware users would use it on-demand when feeling observed or when accessing sensitive data. The results show that successful shoulder surfing attack rate drops from 68% to 10.4% when using GazeTouchPIN.Paper: khamis17_icmi.pdf@inproceedings{khamis17_icmi, title = {GazeTouchPIN: Protecting Sensitive Data on Mobile Devices using Secure Multimodal Authentication}, author = {Khamis, Mohamed and Hassib, Mariam and von Zezschwitz, Emanuel and Bulling, Andreas and Alt, Florian}, year = {2017}, pages = {446-450}, doi = {10.1145/3136755.3136809}, booktitle = {Proc. ACM International Conference on Multimodal Interaction (ICMI)}, video = {https://www.youtube.com/watch?v=gs2YO0gP4kI} } -
They are all after you: Investigating the Viability of a Threat Model that involves Multiple Shoulder Surfers
Mohamed Khamis, Linda Bandelow, Stina Schick, Dario Casadevall, Andreas Bulling, Florian Alt
Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM), pp. 1–5, 2017.
Best Paper Honourable Mention Award
Many of the authentication schemes for mobile devices that were proposed lately complicate shoulder surfing by splitting the attacker’s attention into two or more entities. For example, multimodal authentication schemes such as GazeTouchPIN and GazeTouchPass require attackers to observe the user’s gaze input and the touch input performed on the phone’s screen. These schemes have always been evaluated against single observers, while multiple observers could potentially attack these schemes with greater ease, since each of them can focus exclusively on one part of the password. In this work, we study the effectiveness of a novel threat model against authentication schemes that split the attacker’s attention. As a case study, we report on a security evaluation of two state of the art authentication schemes in the case of a team of two observers. Our results show that although multiple observers perform better against these schemes than single observers, multimodal schemes are significantly more secure against multiple observers compared to schemes that employ a single modality. We discuss how this threat model impacts the design of authentication schemes.Paper: khamis17_mum.pdf@inproceedings{khamis17_mum, title = {They are all after you: Investigating the Viability of a Threat Model that involves Multiple Shoulder Surfers}, author = {Khamis, Mohamed and Bandelow, Linda and Schick, Stina and Casadevall, Dario and Bulling, Andreas and Alt, Florian}, year = {2017}, doi = {10.1145/3152832.3152851}, pages = {1--5}, booktitle = {Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM)} } -
Noticeable or Distractive? A Design Space for Gaze-Contingent User Interface Notifications
Michaela Klauck, Yusuke Sugano, Andreas Bulling
Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1779-1786, 2017.
Users are interrupted by an ever-increasing number of notifications, ranging from error messages, over new email or chat alerts, to advertisement pop-ups. We explore gaze-contingent user interface notifications that are shown depending on users’ current gaze location. Specifically, we evaluate how different design properties influence notification noticeability and distractiveness. We measure noticeability quantitatively by analyzing participants’ performance in confirming notifications and distractiveness using a questionnaire. Based on a 12-participant user study on a public display, we show that each of these properties affects noticeability and distractiveness differently and that the properties, in turn, allow for fine-grained optimization of notification display. These findings inform the design of future attentive user interfaces that could optimize the trade-off between, for example, the notification importance and the cost of interruption.Paper: klauck17_chi.pdf@inproceedings{klauck17_chi, author = {Klauck, Michaela and Sugano, Yusuke and Bulling, Andreas}, title = {Noticeable or Distractive? A Design Space for Gaze-Contingent User Interface Notifications}, booktitle = {Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2017}, pages = {1779-1786}, doi = {10.1145/3027063.3053085} } -
EyeMirror: Mobile Calibration-Free Gaze Approximation using Corneal Imaging
Christian Lander, Sven Gehring, Markus Löchtefeld, Andreas Bulling, Antonio Krüger
Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM), pp. 1–13, 2017.
Gaze is a powerful measure of people’s attracted attention and reveals where we are looking within our current field of view. Hence, gaze-based interfaces are gaining in importance. However, gaze estimation usually requires extensive hardware and depends on a calibration that has to be renewed regularly. We present EyeMirror, a mobile device for calibration-free gaze approximation on surfaces (e.g. displays). It consists of a head-mounted camera, connected to a wearable mini-computer, capturing the environment reflected on the human cornea. The corneal images are analyzed using natural feature tracking for gaze estimation on surfaces. In two lab studies we compared variations of EyeMirror against established methods for gaze estimation in a display scenario, and investigated the effect of display content (i.e. number of features). EyeMirror achieved 4.03° gaze estimation error, while we found no significant effect of display content.Paper: lander17_mum.pdf@inproceedings{lander17_mum, title = {EyeMirror: Mobile Calibration-Free Gaze Approximation using Corneal Imaging}, author = {Lander, Christian and Gehring, Sven and L{\"{o}}chtefeld, Markus and Bulling, Andreas and Kr{\"{u}}ger, Antonio}, year = {2017}, pages = {1--13}, doi = {10.1145/3152832.3152839}, booktitle = {Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM)} } -
Predicting the Category and Attributes of Visual Search Targets Using Deep Gaze Pooling
Hosnieh Sattar, Andreas Bulling, Mario Fritz
Proc. IEEE International Conference on Computer Vision Workshops (ICCVW), pp. 2740-2748, 2017.
Predicting the target of visual search from eye fixation (gaze) data is a challenging problem with many applications in human-computer interaction. In contrast to previous work that has focused on individual instances as search target, we propose the first approach to predict categories and attributes of search targets based on gaze data. However, state of the art models for categorical recognition in general require large amounts of training data, which is prohibitive for gaze data. To address this challenge, we propose a novel Gaze Pooling Layer that integrates gaze information into CNN-based architectures as an attention mechanism - incorporating both spatial and temporal aspects of human gaze behavior. We show that our approach is effective even when the gaze pooling layer is added to an already trained CNN, thus eliminating the need for expensive joint data collection of visual and gaze data. We propose an experimental setup and data set and demonstrate the effectiveness of our method for search target prediction based on gaze behavior. We further study how to integrate temporal and spatial gaze information most effectively, and indicate directions for future research in gaze-based prediction of mental states.Paper: sattar17_iccvw.pdf@inproceedings{sattar17_iccvw, title = {Predicting the Category and Attributes of Visual Search Targets Using Deep Gaze Pooling}, author = {Sattar, Hosnieh and Bulling, Andreas and Fritz, Mario}, year = {2017}, pages = {2740-2748}, doi = {10.1109/ICCVW.2017.322}, booktitle = {Proc. IEEE International Conference on Computer Vision Workshops (ICCVW)} } -
It’s Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling
Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 2299-2308, 2017.
Eye gaze is an important non-verbal cue for human affect analysis. Recent gaze estimation work indicated that information from the full face region can benefit performance. Pushing this idea further, we propose an appearance-based method that, in contrast to a long-standing line of work in computer vision, only takes the full face image as input. Our method encodes the face image using a convolutional neural network with spatial weights applied on the feature maps to flexibly suppress or enhance information in different facial regions. Through extensive evaluation, we show that our full-face method significantly outperforms the state of the art for both 2D and 3D gaze estimation, achieving improvements of up to 14.3% on MPIIGaze and 27.7% on EYEDIAP for person-independent 3D gaze estimation. We further show that this improvement is consistent across different illumination conditions and gaze directions and particularly pronounced for the most challenging extreme head poses.Paper: zhang17_cvprw.pdf@inproceedings{zhang17_cvprw, title = {It's Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation}, author = {Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas}, booktitle = {Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW)}, year = {2017}, doi = {10.1109/CVPRW.2017.284}, pages = {2299-2308} } -
Everyday Eye Contact Detection Using Unsupervised Gaze Target Discovery
Xucong Zhang, Yusuke Sugano, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 193-203, 2017.
Best Paper Honourable Mention Award
Eye contact is an important non-verbal cue in social signal processing and promising as a measure of overt attention in human-object interactions and attentive user interfaces. However, robust detection of eye contact across different users, gaze targets, camera positions, and illumination conditions is notoriously challenging. We present a novel method for eye contact detection that combines a state-of-the-art appearance-based gaze estimator with a novel approach for unsupervised gaze target discovery, i.e. without the need for tedious and time-consuming manual data annotation. We evaluate our method in two real-world scenarios: detecting eye contact at the workplace, including on the main work display, from cameras mounted to target objects, as well as during everyday social interactions with the wearer of a head-mounted egocentric camera. We empirically evaluate the performance of our method in both scenarios and demonstrate its effectiveness for detecting eye contact independent of target object type and size, camera position, and user and recording environment.Paper: zhang17_uist.pdf@inproceedings{zhang17_uist, title = {Everyday Eye Contact Detection Using Unsupervised Gaze Target Discovery}, author = {Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas}, year = {2017}, pages = {193-203}, doi = {10.1145/3126594.3126614}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, video = {https://www.youtube.com/watch?v=ccrS5XuhQpk} }
Technical Reports
-
A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks
Arif Khan, Ingmar Steiner, Yusuke Sugano, Andreas Bulling, Ross Macdonald
arXiv:1712.04798, pp. 1–4, 2017.
Phonetic segmentation is the process of splitting speech into distinct phonetic units. Human experts routinely perform this task manually by analyzing auditory and visual cues using analysis software, which is an extremely time-consuming process. Methods exist for automatic segmentation, but these are not always accurate enough. In order to improve automatic segmentation, we need to model it as close to the manual segmentation as possible. This corpus is an effort to capture the human segmentation behavior by recording experts performing a segmentation task. We believe that this data will enable us to highlight the important aspects of manual segmentation, which can be used in automatic segmentation to improve its accuracy.Paper: khan17_arxiv.pdfPaper Access: https://arxiv.org/abs/1712.04798@techreport{khan17_arxiv, title = {A Multimodal Corpus of Expert Gaze and Behavior during Phonetic Segmentation Tasks}, author = {Khan, Arif and Steiner, Ingmar and Sugano, Yusuke and Bulling, Andreas and Macdonald, Ross}, year = {2017}, pages = {1--4}, url = {https://arxiv.org/abs/1712.04798} } -
Visual Decoding of Targets During Visual Search From Human Eye Fixations
Hosnieh Sattar, Mario Fritz, Andreas Bulling
arXiv:1706.05993, pp. 1–9, 2017.
What does human gaze reveal about a user’s intents and to which extent can these intents be inferred or even visualized? Gaze was proposed as an implicit source of information to predict the target of visual search and, more recently, to predict the object class and attributes of the search target. In this work, we go one step further and investigate the feasibility of combining recent advances in encoding human gaze information using deep convolutional neural networks with the power of generative image models to visually decode, i.e. create a visual representation of, the search target. Such visual decoding is challenging for two reasons: 1) the search target only resides in the user’s mind as a subjective visual pattern, and can most often not even be described verbally by the person, and 2) it is, as of yet, unclear if gaze fixations contain sufficient information for this task at all. We show, for the first time, that visual representations of search targets can indeed be decoded only from human gaze fixations. We propose to first encode fixations into a semantic representation and then decode this representation into an image. We evaluate our method on a recent gaze dataset of 14 participants searching for clothing in image collages and validate the model’s predictions using two human studies. Our results show that 62% of the time (chance level = 10%) users were able to correctly select the categories of the decoded image. In our second study we show the importance of a local gaze encoding for decoding visual search targets of users.Paper: sattar17_arxiv.pdfPaper Access: https://arxiv.org/abs/1706.05993@techreport{sattar17_arxiv, title = {Visual Decoding of Targets During Visual Search From Human Eye Fixations}, author = {Sattar, Hosnieh and Fritz, Mario and Bulling, Andreas}, year = {2017}, pages = {1--9}, url = {https://arxiv.org/abs/1706.05993} } -
GazeDirector: Fully Articulated Eye Gaze Redirection in Video
Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, Andreas Bulling
arXiv:1704.08763, pp. 1–10, 2017.
We present GazeDirector, a new approach for eye gaze redirection that uses model-fitting. Our method first tracks the eyes by fitting a multi-part eye region model to video frames using analysis-by-synthesis, thereby recovering eye region shape, texture, pose, and gaze simultaneously. It then redirects gaze by 1) warping the eyelids from the original image using a model-derived flow field, and 2) rendering and compositing synthesized 3D eyeballs onto the output image in a photorealistic manner. GazeDirector allows us to change where people are looking without person-specific training data, and with full articulation, i.e. we can precisely specify new gaze directions in 3D. Quantitatively, we evaluate both model-fitting and gaze synthesis, with experiments for gaze estimation and redirection on the Columbia gaze dataset. Qualitatively, we compare GazeDirector against recent work on gaze redirection, showing better results especially for large redirection angles. Finally, we demonstrate gaze redirection on YouTube videos by introducing new 3D gaze targets and by manipulating visual behavior.Paper: wood17_arxiv.pdfPaper Access: https://arxiv.org/abs/1704.08763@techreport{wood17_arxiv, title = {GazeDirector: Fully Articulated Eye Gaze Redirection in Video}, author = {Wood, Erroll and Baltru{\v{s}}aitis, Tadas and Morency, Louis-Philippe and Robinson, Peter and Bulling, Andreas}, year = {2017}, pages = {1--10}, url = {https://arxiv.org/abs/1704.08763} } -
MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling
arXiv:1711.09017, pp. 1–14, 2017.
Learning-based methods are believed to work well for unconstrained gaze estimation, i.e. gaze estimation from a monocular RGB camera without assumptions regarding user, environment, or camera. However, current gaze datasets were collected under laboratory conditions and methods were not evaluated across multiple datasets. Our work makes three contributions towards addressing these limitations. First, we present the MPIIGaze dataset, which contains 213,659 full face images and corresponding ground-truth gaze positions collected from 15 users during everyday laptop use over several months. An experience sampling approach ensured continuous gaze and head poses and realistic variation in eye appearance and illumination. To facilitate cross-dataset evaluations, 37,667 images were manually annotated with eye corners, mouth corners, and pupil centres. Second, we present an extensive evaluation of state-of-the-art gaze estimation methods on three current datasets, including MPIIGaze. We study key challenges including target gaze range, illumination conditions, and facial appearance variation. We show that image resolution and the use of both eyes affect gaze estimation performance, while head pose and pupil centre information are less informative. Finally, we propose GazeNet, the first deep appearance-based gaze estimation method. GazeNet improves on the state of the art by 22% (from a mean error of 13.9 degrees to 10.8 degrees) for the most challenging cross-dataset evaluation.Paper: zhang17_arxiv.pdfPaper Access: https://arxiv.org/abs/1711.09017@techreport{zhang17_arxiv, title = {MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation}, author = {Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas}, year = {2017}, pages = {1--14}, url = {https://arxiv.org/abs/1711.09017} }
2016
Journal Articles
-
Pervasive Attentive User Interfaces
Andreas Bulling
IEEE Computer, 49 (1), pp. 94-98, 2016.
As the number of displays we interact with rapidly increases, managing user attention has emerged as a critical challenge for next-generation human-computer interfaces.doi: 10.1109/MC.2016.32Paper: bulling16_computer.pdf@article{bulling16_computer, title = {Pervasive Attentive User Interfaces}, author = {Bulling, Andreas}, doi = {10.1109/MC.2016.32}, year = {2016}, journal = {IEEE Computer}, volume = {49}, number = {1}, pages = {94-98} } -
EyeWear Computers for Human-Computer Interaction
Andreas Bulling, Kai Kunze
ACM Interactions, 23 (3), pp. 70-73, 2016.
Head-worn displays and eye trackers, augmented and virtual reality glasses, egocentric cameras, and other "smart eyewear" have recently emerged as a research platform in fields such as ubiquitous computing, computer vision, and cognitive and social science. While earlier generations of devices were too bulky to be worn regularly, recent technological advances have made eyewear unobtrusive and lightweight, and therefore more suitable for daily use. Given that many human senses are located on the head, smart eyewear provides opportunities for types of interaction that were impossible before now. In this article, we highlight the potential of eyewear computing for HCI, discuss available input and output modalities, and suggest the most promising future directions for eyewear computing research, namely multimodal user modeling, lifelong learning, and large-scale (collective) human-behavior sensing and analysis.doi: 10.1145/2912886Paper: bulling16_interactions.pdf@article{bulling16_interactions, title = {EyeWear Computers for Human-Computer Interaction}, author = {Bulling, Andreas and Kunze, Kai}, year = {2016}, journal = {ACM Interactions}, volume = {23}, number = {3}, doi = {10.1145/2912886}, pages = {70-73} } -
Eyewear Computing – Augmenting the Human with Head-mounted Wearable Assistants (Dagstuhl Seminar 16042)
Andreas Bulling, Ozan Cakmakci, Kai Kunze, James M. Rehg
Dagstuhl Reports, 6 (1), pp. 160–206, 2016.
Paper: bulling16_dagstuhl.pdfPaper Access: http://drops.dagstuhl.de/opus/volltexte/2016/5820@article{bulling16_dagstuhl, author = {Bulling, Andreas and Cakmakci, Ozan and Kunze, Kai and Rehg, James M.}, title = {{Eyewear Computing – Augmenting the Human with Head-mounted Wearable Assistants (Dagstuhl Seminar 16042)}}, pages = {160--206}, journal = {Dagstuhl Reports}, year = {2016}, volume = {6}, number = {1}, editor = {Bulling, Andreas and Cakmakci, Ozan and Kunze, Kai and Rehg, James M.}, publisher = {Schloss Dagstuhl--Leibniz-Zentrum fuer Informatik}, address = {Dagstuhl, Germany}, url = {http://drops.dagstuhl.de/opus/volltexte/2016/5820}, doi = {10.4230/DagRep.6.1.160} } -
Pupil detection for head-mounted eye tracking in the wild: an evaluation of the state of the art
Wolfgang Fuhl, Marc Tonsen, Andreas Bulling, Enkelejda Kasneci
Springer Machine Vision and Applications, 27, pp. 1275-1288, 2016.
Robust and accurate detection of the pupil position is a key building block for head-mounted eye tracking and a prerequisite for applications built on top of it, such as gaze-based human-computer interaction or attention analysis. Despite a large body of work, detecting the pupil in images recorded under real-world conditions is challenging given significant variability in eye appearance (e.g., illumination, reflections, occlusions), individual differences in eye physiology, as well as other sources of noise, such as contact lenses or make-up. In this paper we review six state-of-the-art pupil detection methods, namely ElSe, ExCuSe, Pupil Labs, SET, Starburst, and Swirski. We compare their performance on a large-scale dataset consisting of 225,569 annotated eye images taken from four publicly available datasets. Our experimental results show that the algorithm ElSe outperforms other pupil detection methods by a large margin, thus offering robust and accurate pupil positions on challenging everyday eye images.Paper: fuhl16_mvap.pdf@article{fuhl16_mvap, title = {Pupil detection for head-mounted eye tracking in the wild: an evaluation of the state of the art}, author = {Fuhl, Wolfgang and Tonsen, Marc and Bulling, Andreas and Kasneci, Enkelejda}, year = {2016}, pages = {1275-1288}, doi = {10.1007/s00138-016-0776-4}, journal = {Springer Machine Vision and Applications}, volume = {27} }
Conference Papers
-
Attention, please! Comparing Features for Measuring Audience Attention Towards Pervasive Displays
Florian Alt, Andreas Bulling, Lukas Mecke, Daniel Buschek
Proc. ACM SIGCHI Conference on Designing Interactive Systems (DIS), pp. 823-828, 2016.
Measuring audience attention towards pervasive displays is important but accurate measurement in real time remains a significant sensing challenge. Consequently, researchers and practitioners typically use other features, such as face presence, as a proxy. We provide a principled comparison of the performance of six features and their combinations for measuring attention: face presence, movement trajectory, walking speed, shoulder orientation, head pose, and gaze direction. We implemented a prototype that is capable of capturing this rich set of features from video and depth camera data. Using a controlled lab experiment (N=18) we show that as a single feature, face presence is indeed among the most accurate. We further show that accuracy can be increased through a combination of features (+10.3%), knowledge about the audience (+63.8%), as well as user identities (+69.0%). Our findings are valuable for display providers who want to collect data on display effectiveness or build interactive, responsive apps.Paper: alt16_dis.pdf@inproceedings{alt16_dis, author = {Alt, Florian and Bulling, Andreas and Mecke, Lukas and Buschek, Daniel}, title = {Attention, please! Comparing Features for Measuring Audience Attention Towards Pervasive Displays}, booktitle = {Proc. ACM SIGCHI Conference on Designing Interactive Systems (DIS)}, year = {2016}, doi = {10.1145/2901790.2901897}, pages = {823-828} } -
Memorability of Cued-Recall Graphical Passwords with Saliency Masks
Florian Alt, Mateusz Mikusz, Stefan Schneegass, Andreas Bulling
Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM), pp. 191-200, 2016.
Cued-recall graphical passwords have a lot of potential for secure user authentication, particularly if combined with saliency masks to prevent users from selecting weak passwords. Saliency masks exclude those areas of the image that are most likely to lead to hotspots and were shown to significantly improve password security. In this paper we investigate the impact of such saliency masks on the memorability of cued-recall graphical passwords. We first conduct two pre-studies with 52 participants to obtain a set of images with three different image complexities as well as real passwords. Based on a month-long user study with 26 participants we then show that cued-recall graphical passwords defined on a single image with a saliency mask are not more difficult to remember than those without saliency mask, and that the complexity of the password images does not have any influence on password memorability. These results complement prior work on the security of such passwords and underline the potential of saliency masks as both a secure and usable improvement to cued-recall gaze-based graphical passwords.Paper: alt16_mum.pdf@inproceedings{alt16_mum, title = {Memorability of Cued-Recall Graphical Passwords with Saliency Masks}, author = {Alt, Florian and Mikusz, Mateusz and Schneegass, Stefan and Bulling, Andreas}, year = {2016}, doi = {10.1145/3012709.3012730}, pages = {191-200}, booktitle = {Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM)} } -
Prediction of Gaze Estimation Error for Error-Aware Gaze-Based Interfaces
Michael Barz, Florian Daiber, Andreas Bulling
Proc. International ACM Symposium on Eye Tracking Research and Applications (ETRA), pp. 275-278, 2016.
Gaze estimation error is inherent in head-mounted eye trackers and seriously impacts performance, usability, and user experience of gaze-based interfaces. Particularly in mobile settings, this error varies constantly as users move in front and look at different parts of a display. We envision a new class of gaze-based interfaces that are aware of the gaze estimation error and adapt to it in real time. As a first step towards this vision we introduce an error model that is able to predict the gaze estimation error. Our method covers major building blocks of mobile gaze estimation, specifically mapping of pupil positions to scene camera coordinates, marker-based display detection, and mapping of gaze from scene camera to on-screen coordinates. We develop our model through a series of principled measurements of a state-of-the-art head-mounted eye tracker.Paper: barz16_etra.pdf@inproceedings{barz16_etra, author = {Barz, Michael and Daiber, Florian and Bulling, Andreas}, title = {Prediction of Gaze Estimation Error for Error-Aware Gaze-Based Interfaces}, booktitle = {Proc. International ACM Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2016}, pages = {275-278}, doi = {10.1145/2857491.2857493} } -
Xplore-M-Ego: Contextual Media Retrieval Using Natural Language Queries
Sreyasi Nag Chowdhury, Mateusz Malinowski, Andreas Bulling, Mario Fritz
Proc. ACM International Conference on Multimedia Retrieval (ICMR), pp. 243-247, 2016.
The widespread integration of cameras in hand-held and head-worn devices and the ability to share content online enables a large and diverse visual capture of the world that millions of users build up collectively every day. We envision these images as well as associated meta information, such as GPS coordinates and timestamps, to form a collective visual memory that can be queried while automatically taking the ever-changing context of mobile users into account. As a first step towards this vision, in this work we present Xplore-M-Ego: a novel media retrieval system that allows users to query a dynamic database of images using spatio-temporal natural language queries. We evaluate our system using a new dataset of real image queries as well as through a usability study. One key finding is that there is a considerable amount of inter-user variability in the resolution of spatial relations in natural language utterances. We show that our system can cope with this variability using personalisation through an online learning-based retrieval formulation.Paper: chowdhury16_icmr.pdf@inproceedings{chowdhury16_icmr, title = {Xplore-M-Ego: Contextual Media Retrieval Using Natural Language Queries}, author = {Chowdhury, Sreyasi Nag and Malinowski, Mateusz and Bulling, Andreas and Fritz, Mario}, year = {2016}, booktitle = {Proc. ACM International Conference on Multimedia Retrieval (ICMR)}, doi = {10.1145/2911996.2912044}, pages = {243-247} } -
Smooth Eye Movement Interaction Using EOG Glasses
Murtaza Dhuliawala, Juyoung Lee, Junichi Shimizu, Andreas Bulling, Kai Kunze, Thad Starner, Woontack Woo
Proc. ACM International Conference on Multimodal Interaction (ICMI), pp. 307-311, 2016.
Orbits combines a visual display and an eye motion sensor to allow a user to select between options by tracking a cursor with the eyes as the cursor travels in a circular path around each option. Using an off-the-shelf J!NS MEME pair of eyeglasses, we present a pilot study that suggests that the eye movement required for Orbits can be sensed using three electrodes: one in the nose bridge and one in each nose pad. For forced choice binary selection, we achieve a 2.6 bits per second (bps) input rate at 250ms per input. We also introduce Head Orbits, where the user fixates the eyes on a target and moves the head in synchrony with the orbiting target. Measuring only the relative movement of the eyes in relation to the head, this method achieves a maximum rate of 2.0 bps at 500ms per input. Finally, we combine the two techniques together with a gyro to create an interface with a maximum input rate of 5.0 bps.Paper: dhuliawala16_icmi.pdf@inproceedings{dhuliawala16_icmi, title = {Smooth Eye Movement Interaction Using EOG Glasses}, author = {Dhuliawala, Murtaza and Lee, Juyoung and Shimizu, Junichi and Bulling, Andreas and Kunze, Kai and Starner, Thad and Woo, Woontack}, year = {2016}, doi = {10.1145/2993148.2993181}, pages = {307-311}, booktitle = {Proc. ACM International Conference on Multimodal Interaction (ICMI)} } -
GazeTouchPass: Multimodal Authentication Using Gaze and Touch on Mobile Devices
Mohamed Khamis, Florian Alt, Mariam Hassib, Emanuel von Zezschwitz, Regina Hasholzner, Andreas Bulling
Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 2156-2164, 2016.
We propose a multimodal scheme, GazeTouchPass, that combines gaze and touch for shoulder-surfing resistant user authentication on mobile devices. GazeTouchPass allows passwords with multiple switches between input modalities during authentication. This requires attackers to simultaneously observe the device screen and the user’s eyes to find the password. We evaluate the security and usability of GazeTouchPass in two user studies. Our findings show that GazeTouchPass is usable and significantly more secure than single-modal authentication against basic and even advanced shoulder-surfing attacks.Paper: khamis16_chi.pdf@inproceedings{khamis16_chi, author = {Khamis, Mohamed and Alt, Florian and Hassib, Mariam and von Zezschwitz, Emanuel and Hasholzner, Regina and Bulling, Andreas}, title = {GazeTouchPass: Multimodal Authentication Using Gaze and Touch on Mobile Devices}, booktitle = {Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2016}, pages = {2156-2164}, doi = {10.1145/2851581.2892314} } -
TextPursuits: Using Text for Pursuits-Based Interaction and Calibration on Public Displays
Mohamed Khamis, Ozan Saltuk, Alina Hang, Katharina Stolz, Andreas Bulling, Florian Alt
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 274-285, 2016.
Pursuits, a technique that correlates users’ eye movements with moving on-screen targets, was recently introduced for calibration-free interaction with public displays. While prior work used abstract objects or dots as targets, we explore the use of Pursuits with text (read-and-pursue). Given that much of the content on public displays includes text, designers could greatly benefit from users being able to spontaneously interact and implicitly calibrate an eye tracker while simply reading text on a display. At the same time, using Pursuits with textual content is challenging given that the eye movements performed while reading interfere with the pursuit movements. We present two systems, EyeVote and Read2Calibrate, that enable spontaneous gaze interaction and implicit calibration by reading text. Results from two user studies (N=37) show that Pursuits with text is feasible and can achieve similar accuracy to non-text-based pursuit approaches. While calibration is less accurate, it integrates smoothly with reading and allows identifying areas of the display the user is looking at.Paper: khamis16_ubicomp.pdf@inproceedings{khamis16_ubicomp, author = {Khamis, Mohamed and Saltuk, Ozan and Hang, Alina and Stolz, Katharina and Bulling, Andreas and Alt, Florian}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, title = {TextPursuits: Using Text for Pursuits-Based Interaction and Calibration on Public Displays}, year = {2016}, doi = {10.1145/2971648.2971679}, pages = {274-285} } -
Challenges and Design Space of Gaze-enabled Public Displays
Mohamed Khamis, Florian Alt, Andreas Bulling
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 1736-1745, 2016.
Gaze is an attractive modality for public displays, and recent years have seen an increase in deployments of gaze-enabled public displays. Although gaze has been thoroughly investigated for desktop scenarios, gaze-enabled public displays present new challenges that are unique to this setup. In contrast to desktop settings, public displays (1) cannot afford to require eye tracker calibration, (2) expect users to interact from different positions, and (3) expect multiple users to interact simultaneously. In this work we discuss these challenges and explore the design space of gaze-enabled public displays. We conclude by discussing how the current state of research stands with respect to the identified challenges, and highlight directions for future work.Paper: khamis16_petmei.pdf@inproceedings{khamis16_petmei, title = {Challenges and Design Space of Gaze-enabled Public Displays}, author = {Khamis, Mohamed and Alt, Florian and Bulling, Andreas}, year = {2016}, pages = {1736-1745}, doi = {10.1145/2968219.2968342}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)} } -
EyeVote in the Wild: Do Users bother Correcting System Errors on Public Displays?
Mohamed Khamis, Ludwig Trotter, Markus Tessman, Christina Dannhart, Andreas Bulling, Florian Alt
Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM), pp. 57-62, 2016.
Although recovering from errors is straightforward on most interfaces, public display systems pose unique design challenges. Namely, public display users interact for very short amounts of time and are believed to abandon the display when interrupted or forced to deviate from the main task. To date, it is not well understood whether public display designers should enable users to correct errors (e.g. by asking users to confirm or giving them a chance to correct their input), or aim for faster interaction and rely on other types of feedback to estimate errors. To close this gap, we conducted a field study where we investigated the users' willingness to correct their input on public displays. We report on our findings from an in-the-wild deployment of a public gaze-based voting system where we intentionally evoked system errors to see if users correct them. We found that public display users are willing to correct system errors provided that the correction is fast and straightforward. We discuss how our findings influence the choice of interaction methods for public displays; interaction methods that are highly usable but suffer from low accuracy can still be effective if users can "undo" their interactions.Paper: khamis16_mum.pdf@inproceedings{khamis16_mum, title = {EyeVote in the Wild: Do Users bother Correcting System Errors on Public Displays?}, author = {Khamis, Mohamed and Trotter, Ludwig and Tessman, Markus and Dannhart, Christina and Bulling, Andreas and Alt, Florian}, year = {2016}, doi = {10.1145/3012709.3012743}, pages = {57-62}, booktitle = {Proc. International Conference on Mobile and Ubiquitous Multimedia (MUM)} } -
On the Verge: Voluntary Convergences for Accurate and Precise Timing of Gaze Input
Dominik Kirst, Andreas Bulling
Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1519-1525, 2016.
The problem of triggering input accurately (with a small temporal offset) and precisely (with high repeatability) at a specific point in time has so far been largely ignored in gaze interaction research. We explore voluntary eye convergences as a novel interaction technique for precise and accurate timing of gaze input and a solution to the "Midas touch" problem, i.e. the accidental triggering of input when looking at an interface. We introduce a novel clock paradigm to study input timing and demonstrate that voluntary convergences are significantly more accurate and precise than common gaze dwelling. Our findings suggest that voluntary convergences are well-suited for applications in which timing of user input is important, thereby complementing existing gaze techniques that focus on speed and spatial precision.Paper: kirst16_chi.pdf@inproceedings{kirst16_chi, author = {Kirst, Dominik and Bulling, Andreas}, title = {On the Verge: Voluntary Convergences for Accurate and Precise Timing of Gaze Input}, booktitle = {Ext. Abstr. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2016}, pages = {1519-1525}, doi = {10.1145/2851581.2892307} } -
3D Gaze Estimation from 2D Pupil Positions on Monocular Head-Mounted Eye Trackers
Mohsen Mansouryar, Julian Steil, Yusuke Sugano, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 197-200, 2016.
3D gaze information is important for scene-centric attention analysis, but accurate estimation and analysis of 3D gaze in real-world environments remains challenging. We present a novel 3D gaze estimation method for monocular head-mounted eye trackers. In contrast to previous work, our method does not aim to infer 3D eyeball poses, but directly maps 2D pupil positions to 3D gaze directions in scene camera coordinate space. We first provide a detailed discussion of the 3D gaze estimation task and summarize different methods, including our own. We then evaluate the performance of different 3D gaze estimation approaches using both simulated and real data. Through experimental validation, we demonstrate the effectiveness of our method in reducing parallax error, and we identify research challenges for the design of 3D calibration procedures.Paper: mansouryar16_etra.pdf@inproceedings{mansouryar16_etra, author = {Mansouryar, Mohsen and Steil, Julian and Sugano, Yusuke and Bulling, Andreas}, title = {3D Gaze Estimation from 2D Pupil Positions on Monocular Head-Mounted Eye Trackers}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2016}, pages = {197-200}, doi = {10.1145/2857491.2857530} } -
Concept for Using Eye Tracking in a Head-mounted Display to Adapt Rendering to the User’s Current Visual Field
Daniel Pohl, Xucong Zhang, Andreas Bulling, Oliver Grau
Proc. of the 22nd ACM Conference on Virtual Reality Software and Technology (VRST), pp. 323-324, 2016.
With increasing spatial and temporal resolution in head-mounted displays (HMDs), using eye trackers to adapt rendering to the user is becoming important for handling the rendering workload. Besides using methods like foveated rendering, we propose to use the current visual field for rendering, depending on the eye gaze. We use two effects for performance optimizations. First, we noticed a lens defect in HMDs, where depending on the distance of the eye gaze to the center, certain parts of the screen towards the edges are not visible anymore. Second, if the user looks up, they cannot see the lower parts of the screen anymore. For the invisible areas, we propose to skip rendering and to reuse the pixel colors from the previous frame. We provide a calibration routine to measure these two effects. We apply the current visual field to a renderer and get up to 2x speed-ups.Paper: pohl16_vrst.pdf@inproceedings{pohl16_vrst, title = {Concept for Using Eye Tracking in a Head-mounted Display to Adapt Rendering to the User's Current Visual Field}, author = {Pohl, Daniel and Zhang, Xucong and Bulling, Andreas and Grau, Oliver}, doi = {10.1145/2993369.2996300}, year = {2016}, booktitle = {Proc. of the 22nd ACM Conference on Virtual Reality Software and Technology (VRST)}, pages = {323-324} } -
Combining Eye Tracking with Optimizations for Lens Astigmatism in Modern Wide-Angle HMDs
Daniel Pohl, Xucong Zhang, Andreas Bulling
Proc. IEEE Virtual Reality (VR), pp. 269-270, 2016.
Virtual Reality has hit the consumer market with affordable head-mounted displays. When using these, it quickly becomes apparent that the resolution of the built-in display panels still needs to be increased substantially. To overcome the resulting higher performance demands, eye tracking can be used for foveated rendering. However, as there are lens distortions in HMDs, there are more possibilities to increase the performance with smarter rendering approaches. We present a new system using optimizations for rendering considering lens astigmatism and combining this with foveated rendering through eye tracking. Depending on the current eye gaze, this delivers a rendering speed-up of up to 20%.Paper: pohl16_vr.pdf@inproceedings{pohl16_vr, title = {Combining Eye Tracking with Optimizations for Lens Astigmatism in Modern Wide-Angle HMDs}, author = {Pohl, Daniel and Zhang, Xucong and Bulling, Andreas}, year = {2016}, pages = {269-270}, doi = {10.1109/VR.2016.7504757}, booktitle = {Proc. IEEE Virtual Reality (VR)} } -
SkullConduct: Biometric User Identification on Eyewear Computers Using Bone Conduction Through the Skull
Stefan Schneegass, Youssef Oualil, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1379-1384, 2016.
Secure user identification is important for the increasing number of eyewear computers but limited input capabilities pose significant usability challenges for established knowledge-based schemes, such as passwords or PINs. We present SkullConduct, a biometric system that uses bone conduction of sound through the user’s skull as well as a microphone readily integrated into many of these devices, such as Google Glass. At the core of SkullConduct is a method to analyze the characteristic frequency response created by the user’s skull using a combination of Mel Frequency Cepstral Coefficient (MFCC) features as well as a computationally light-weight 1NN classifier. We report on a controlled experiment with 10 participants that shows that this frequency response is person-specific and stable - even when taking off and putting on the device multiple times - and thus serves as a robust biometric. We show that our method can identify users with 97.0% accuracy and authenticate them with an equal error rate of 6.9%, thereby bringing biometric user identification to eyewear computers equipped with bone conduction technology.@inproceedings{schneegass16_chi, title = {SkullConduct: Biometric User Identification on Eyewear Computers Using Bone Conduction Through the Skull}, author = {Schneegass, Stefan and Oualil, Youssef and Bulling, Andreas}, year = {2016}, pages = {1379-1384}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/2858036.2858152}, video = {https://www.youtube.com/watch?v=A4BCnsQmo6c} } -
Gaussian Processes as an Alternative to Polynomial Gaze Estimation Functions
Laura Sesma-Sanchez, Yanxia Zhang, Andreas Bulling, Hans Gellersen
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 229-232, 2016.
Interpolation-based methods are widely used for gaze estimation due to their simplicity. In particular, feature-based methods that map the image eye features to gaze are very popular. The most widely used regression function in this kind of method is polynomial regression. In this paper, we present an alternative regression function to estimate gaze: Gaussian regression. We show how Gaussian processes can better adapt to the non-linear behavior of the eye movement, providing higher gaze estimation accuracies. Gaussian regression is compared, in a simulated environment, to polynomial regression when using the same mapping features, the normalized pupil center-corneal reflection and pupil center-eye corners vectors. This comparison is done for three different screen sizes. The results show that for larger screens, where wider gaze angles are required, i.e., the non-linear behavior of the eye is more present, the advantage of Gaussian regression is more evident. Furthermore, we can conclude that, for both types of regression, the gaze estimation accuracy increases for smaller screens, where the eye movements are more linear.Paper: sesma16_etra.pdf@inproceedings{sesma16_etra, author = {Sesma-Sanchez, Laura and Zhang, Yanxia and Bulling, Andreas and Gellersen, Hans}, title = {Gaussian Processes as an Alternative to Polynomial Gaze Estimation Functions}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2016}, pages = {229-232}, doi = {10.1145/2857491.2857509} } -
Solar System: Smooth Pursuit Interactions Using EOG Glasses
Junichi Shimizu, Juyoung Lee, Murtaza Dhuliawala, Andreas Bulling, Thad Starner, Woontack Woo, Kai Kunze
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 369-372, 2016.
Solar System implements smooth pursuit eye movement interactions on commercial smart glasses using electrooculography. The system requires no calibration and little to no training. We present a prototype implementation, describe initial user tests and show several application scenarios for hands-free eye gaze interactions.Paper: shimizu16_ubicomp.pdf@inproceedings{shimizu16_ubicomp, author = {Shimizu, Junichi and Lee, Juyoung and Dhuliawala, Murtaza and Bulling, Andreas and Starner, Thad and Woo, Woontack and Kunze, Kai}, title = {Solar System: Smooth Pursuit Interactions Using EOG Glasses}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2016}, pages = {369-372}, doi = {10.1145/2968219.2971376}, keywords = {eye tracking, gaze interaction, wearable computing} } -
Three-Point Interaction: Combining Bi-manual Direct Touch with Gaze
Adalberto Simeone, Andreas Bulling, Jason Alexander, Hans Gellersen
Proc. International Conference on Advanced Visual Interfaces (AVI), pp. 168-175, 2016.
The benefits of two-point interaction for tasks that require users to simultaneously manipulate multiple entities or dimensions are widely known. Two-point interaction has become common, e.g., when zooming or pinching using two fingers on a smartphone. We propose a novel interaction technique that implements three-point interaction by augmenting two-finger direct touch with gaze as a third input channel. We evaluate two key characteristics of our technique in two multi-participant user studies. In the first, we used the technique for object selection. In the second, we evaluate it in a 3D matching task that requires simultaneous continuous input from fingers and the eyes. Our results show that in both cases participants learned to interact with three input channels without cognitive or mental overload. Participants’ performance tended towards fast selection times in the first study and exhibited parallel interaction in the second. These results are promising and show that there is scope for additional input channels beyond two-point interaction.Paper: simeone16_avi.pdf@inproceedings{simeone16_avi, author = {Simeone, Adalberto and Bulling, Andreas and Alexander, Jason and Gellersen, Hans}, title = {Three-Point Interaction: Combining Bi-manual Direct Touch with Gaze}, booktitle = {Proc. International Conference on Advanced Visual Interfaces (AVI)}, year = {2016}, pages = {168-175}, doi = {10.1145/2909132.2909251} } -
AggreGaze: Collective Estimation of Audience Attention on Public Displays
Yusuke Sugano, Xucong Zhang, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 821-831, 2016.
Best Paper Honourable Mention Award
Gaze is frequently explored in public display research given its importance for monitoring and analysing audience attention. However, current gaze-enabled public display interfaces require either special-purpose eye tracking equipment or explicit personal calibration for each individual user. We present AggreGaze, a novel method for estimating spatio-temporal audience attention on public displays. Our method requires only a single off-the-shelf camera attached to the display, does not require any personal calibration, and provides visual attention estimates across the full display. We achieve this by 1) compensating for errors of state-of-the-art appearance-based gaze estimation methods through on-site training data collection, and by 2) aggregating uncalibrated and thus inaccurate gaze estimates of multiple users into joint attention estimates. We propose different visual stimuli for this compensation: a standard 9-point calibration, moving targets, text and visual stimuli embedded into the display content, as well as normal video content. Based on a two-week deployment in a public space, we demonstrate the effectiveness of our method for estimating attention maps that closely resemble ground-truth audience gaze distributions.Paper: sugano16_uist.pdf@inproceedings{sugano16_uist, title = {AggreGaze: Collective Estimation of Audience Attention on Public Displays}, author = {Sugano, Yusuke and Zhang, Xucong and Bulling, Andreas}, year = {2016}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, doi = {10.1145/2984511.2984536}, pages = {821-831}, video = {https://www.youtube.com/watch?v=eFK39S_lgdg} } -
Labeled pupils in the wild: A dataset for studying pupil detection in unconstrained environments
Marc Tonsen, Xucong Zhang, Yusuke Sugano, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 139-142, 2016.
We present labelled pupils in the wild (LPW), a novel dataset of 66 high-quality, high-speed eye region videos for the development and evaluation of pupil detection algorithms. The videos in our dataset were recorded from 22 participants in everyday locations at about 95 FPS using a state-of-the-art dark-pupil head-mounted eye tracker. They cover people of different ethnicities and a diverse set of everyday indoor and outdoor illumination environments, as well as natural gaze direction distributions. The dataset also includes participants wearing glasses, contact lenses, and make-up. We benchmark five state-of-the-art pupil detection algorithms on our dataset with respect to robustness and accuracy. We further study the influence of image resolution and vision aids as well as recording location (indoor, outdoor) on pupil detection performance. Our evaluations provide valuable insights into the general pupil detection problem and allow us to identify key challenges for robust pupil detection on head-mounted eye trackers.Paper: tonsen16_etra.pdf@inproceedings{tonsen16_etra, author = {Tonsen, Marc and Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas}, title = {Labeled pupils in the wild: A dataset for studying pupil detection in unconstrained environments}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2016}, pages = {139-142}, doi = {10.1145/2857491.2857520} } -
Learning an appearance-based gaze estimator from one million synthesised images
Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, Andreas Bulling
Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 131–138, 2016.
Emerging Investigator Award
Learning-based methods for appearance-based gaze estimation achieve state-of-the-art performance in challenging real-world settings but require large amounts of labelled training data. Learning-by-synthesis was proposed as a promising solution to this problem but current methods are limited with respect to speed, the appearance variability as well as the head pose and gaze angle distribution they can synthesize. We present UnityEyes, a novel method to rapidly synthesize large amounts of variable eye region images as training data. Our method combines a novel generative 3D model of the human eye region with a real-time rendering framework. The model is based on high-resolution 3D face scans and uses real-time approximations for complex eyeball materials and structures as well as novel anatomically inspired procedural geometry methods for eyelid animation. We show that these synthesized images can be used to estimate gaze in difficult in-the-wild scenarios, even for extreme gaze angles or in cases in which the pupil is fully occluded. We also demonstrate competitive gaze estimation results on a benchmark in-the-wild dataset, despite only using a light-weight nearest-neighbor algorithm. We are making our UnityEyes synthesis framework freely available online for the benefit of the research community.Paper: wood16_etra.pdf@inproceedings{wood16_etra, author = {Wood, Erroll and Baltru{\v{s}}aitis, Tadas and Morency, Louis-Philippe and Robinson, Peter and Bulling, Andreas}, title = {Learning an appearance-based gaze estimator from one million synthesised images}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, year = {2016}, pages = {131--138}, doi = {10.1145/2857491.2857492} } -
A 3D Morphable Eye Region Model for Gaze Estimation
Erroll Wood, Tadas Baltrušaitis, Louis-Philippe Morency, Peter Robinson, Andreas Bulling
Proc. European Conference on Computer Vision (ECCV), pp. 297-313, 2016.
Morphable face models are a powerful tool, but have previously failed to model the eye accurately due to complexities in its material and motion. We present a new multi-part model of the eye that includes a morphable model of the facial eye region, as well as an anatomy-based eyeball model. It is the first morphable model that accurately captures eye region shape, since it was built from high-quality head scans. It is also the first to allow independent eyeball movement, since we treat it as a separate part. To showcase our model we present a new method for illumination- and head-pose–invariant gaze estimation from a single RGB image. We fit our model to an image through analysis-by-synthesis, solving for eye region shape, texture, eyeball pose, and illumination simultaneously. The fitted eyeball pose parameters are then used to estimate gaze direction. Through evaluation on two standard datasets we show that our method generalizes to both webcam and high-quality camera images, and outperforms a state-of-the-art CNN method achieving a gaze estimation accuracy of 9.44° in a challenging user-independent scenario.Paper: wood16_eccv.pdf@inproceedings{wood16_eccv, author = {Wood, Erroll and Baltru{\v{s}}aitis, Tadas and Morency, Louis-Philippe and Robinson, Peter and Bulling, Andreas}, title = {A 3D Morphable Eye Region Model for Gaze Estimation}, booktitle = {Proc. European Conference on Computer Vision (ECCV)}, year = {2016}, pages = {297-313}, doi = {10.1007/978-3-319-46448-0_18} } -
Spatio-Temporal Modeling and Prediction of Visual Attention in Graphical User Interfaces
Pingmei Xu, Yusuke Sugano, Andreas Bulling
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 3299-3310, 2016.
Best Paper Honourable Mention Award
We present a computational model to predict users’ spatio-temporal visual attention for WIMP-style (windows, icons, mouse, pointer) graphical user interfaces. Like existing models of bottom-up visual attention in computer vision, our model does not require any eye tracking equipment. Instead, it predicts attention solely using information available to the interface, specifically users’ mouse and keyboard input as well as the UI components they interact with. To study our model in a principled way we further introduce a method to synthesize user interface layouts that are functionally equivalent to real-world interfaces, such as from Gmail, Facebook, or GitHub. We first quantitatively analyze attention allocation and its correlation with user input and UI components using ground-truth gaze, mouse, and keyboard data of 18 participants performing a text editing task. We then show that our model predicts attention maps more accurately than state-of-the-art methods. Our results underline the significant potential of spatio-temporal attention modeling for user interface evaluation, optimization, or even simulation.Paper: xu16_chi.pdf@inproceedings{xu16_chi, title = {Spatio-Temporal Modeling and Prediction of Visual Attention in Graphical User Interfaces}, author = {Xu, Pingmei and Sugano, Yusuke and Bulling, Andreas}, year = {2016}, pages = {3299-3310}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/2858036.2858479} }
Book Chapters
-
Proc. 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)
Paul Lukowicz, Antonio Krüger, Andreas Bulling, Youn-Kyung Lim, Shwetak N. Patel
Heidelberg, Germany, ACM, 2016.
Paper Access: https://dl.acm.org/citation.cfm?id=2971648@inbook{lukowicz16_ubicomp, title = {Proc. 2016 ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2016}, author = {Lukowicz, Paul and Kr{\"{u}}ger, Antonio and Bulling, Andreas and Lim, Youn-Kyung and Patel, Shwetak N.}, isbn = {978-1-4503-4461-6}, location = {Heidelberg, Germany}, publisher = {ACM}, address = {New York, NY, USA}, url = {https://dl.acm.org/citation.cfm?id=2971648} }
Technical Reports
-
Computational Modelling and Prediction of Gaze Estimation Error for Head-mounted Eye Trackers
Michael Barz, Andreas Bulling, Florian Daiber
DFKI Research Reports, pp. 1–6, 2016.
Head-mounted eye tracking has significant potential for mobile gaze-based interaction with ambient displays but current interfaces lack information about the tracker’s gaze estimation error. Consequently, current interfaces do not exploit the full potential of gaze input as the inherent estimation error cannot be dealt with. The error depends on the physical properties of the display and constantly varies with changes in position and distance of the user to the display. In this work we present a computational model of gaze estimation error for head-mounted eye trackers. Our model covers the full processing pipeline for mobile gaze estimation, namely mapping of pupil positions to scene camera coordinates, marker-based display detection, and display mapping. We build the model based on a series of controlled measurements of a sample state-of-the-art monocular head-mounted eye tracker. Results show that our model can predict gaze estimation error with a root mean squared error of 17.99 px (1.96°).Paper: barz16_techrep.pdf@techreport{barz16_techrep, author = {Barz, Michael and Bulling, Andreas and Daiber, Florian}, title = {Computational Modelling and Prediction of Gaze Estimation Error for Head-mounted Eye Trackers}, volume = {1}, year = {2016}, pages = {1--6}, institution = {German Research Center for Artificial Intelligence (DFKI)} } -
Contextual Media Retrieval Using Natural Language Queries
Sreyasi Nag Chowdhury, Mateusz Malinowski, Andreas Bulling, Mario Fritz
arXiv:1602.04983, pp. 1–8, 2016.
The widespread integration of cameras in hand-held and head-worn devices as well as the ability to share content online enables a large and diverse visual capture of the world that millions of users build up collectively every day. We envision these images as well as associated meta information, such as GPS coordinates and timestamps, to form a collective visual memory that can be queried while automatically taking the ever-changing context of mobile users into account. As a first step towards this vision, in this work we present Xplore-M-Ego: a novel media retrieval system that allows users to query a dynamic database of images and videos using spatio-temporal natural language queries. We evaluate our system using a new dataset of real user queries as well as through a usability study. One key finding is that there is a considerable amount of inter-user variability, for example in the resolution of spatial relations in natural language utterances. We show that our retrieval system can cope with this variability using personalisation through an online learning-based retrieval formulation.Paper: chowdhury16_arxiv.pdfPaper Access: https://arxiv.org/abs/1602.04983@techreport{chowdhury16_arxiv, title = {Contextual Media Retrieval Using Natural Language Queries}, author = {Chowdhury, Sreyasi Nag and Malinowski, Mateusz and Bulling, Andreas and Fritz, Mario}, year = {2016}, pages = {1--8}, url = {https://arxiv.org/abs/1602.04983} } -
End-to-End Eye Movement Detection Using Convolutional Neural Networks
Sabrina Hoppe, Andreas Bulling
arXiv:1609.02452, pp. 1–15, 2016.
Common computational methods for automated eye movement detection - i.e. the task of detecting different types of eye movement in a continuous stream of gaze data - are limited in that they either involve thresholding on hand-crafted signal features, require individual detectors each only detecting a single movement, or require pre-segmented data. We propose a novel approach for eye movement detection that only involves learning a single detector end-to-end, i.e. directly from the continuous gaze data stream and simultaneously for different eye movements without any manual feature crafting or segmentation. Our method is based on convolutional neural networks (CNN) that recently demonstrated superior performance in a variety of tasks in computer vision, signal processing, and machine learning. We further introduce a novel multi-participant dataset that contains scripted and free-viewing sequences of ground-truth annotated saccades, fixations, and smooth pursuits. We show that our CNN-based method outperforms state-of-the-art baselines by a large margin on this challenging dataset, thereby underlining the significant potential of this approach for holistic, robust, and accurate eye movement protocol analysis.Paper: hoppe16_arxiv.pdfPaper Access: https://arxiv.org/abs/1609.02452@techreport{hoppe16_arxiv, title = {End-to-End Eye Movement Detection Using Convolutional Neural Networks}, author = {Hoppe, Sabrina and Bulling, Andreas}, year = {2016}, pages = {1--15}, url = {https://arxiv.org/abs/1609.02452} } -
Gaze Embeddings for Zero-Shot Image Classification
Nour Karessli, Zeynep Akata, Bernt Schiele, Andreas Bulling
arXiv:1611.09309, pp. 1–10, 2016.
Zero-shot image classification using auxiliary information, such as attributes describing discriminative object properties, requires time-consuming annotation by domain experts. We instead propose a method that relies on human gaze as auxiliary information, exploiting that even non-expert users have a natural ability to judge class membership. We present a data collection paradigm that involves a discrimination task to increase the information content obtained from gaze data. Our method extracts discriminative descriptors from the data and learns a compatibility function between image and gaze using three novel gaze embeddings: Gaze Histograms (GH), Gaze Features with Grid (GFG) and Gaze Features with Sequence (GFS). We introduce two new gaze-annotated datasets for fine-grained image classification and show that human gaze data is indeed class discriminative, provides a competitive alternative to expert-annotated attributes, and outperforms other baselines for zero-shot image classification.Paper: karessli16_arxiv.pdfPaper Access: https://arxiv.org/abs/1611.09309@techreport{karessli16_arxiv, title = {Gaze Embeddings for Zero-Shot Image Classification}, author = {Karessli, Nour and Akata, Zeynep and Schiele, Bernt and Bulling, Andreas}, year = {2016}, pages = {1--10}, url = {https://arxiv.org/abs/1611.09309} } -
3D Gaze Estimation from 2D Pupil Positions on Monocular Head-Mounted Eye Trackers
Mohsen Mansouryar, Julian Steil, Yusuke Sugano, Andreas Bulling
arXiv:1601.02644, pp. 1–6, 2016.
3D gaze information is important for scene-centric attention analysis but accurate estimation and analysis of 3D gaze in real-world environments remains challenging. We present a novel 3D gaze estimation method for monocular head-mounted eye trackers. In contrast to previous work, our method does not aim to infer 3D eyeball poses but directly maps 2D pupil positions to 3D gaze directions in scene camera coordinate space. We first provide a detailed discussion of the 3D gaze estimation task and summarize different methods, including our own. We then evaluate the performance of different 3D gaze estimation approaches using both simulated and real data. Through experimental validation, we demonstrate the effectiveness of our method in reducing parallax error, and we identify research challenges for the design of 3D calibration procedures.Paper: mansouryar16_arxiv.pdfPaper Access: https://arxiv.org/abs/1601.02644@techreport{mansouryar16_arxiv, title = {3D Gaze Estimation from 2D Pupil Positions on Monocular Head-Mounted Eye Trackers}, author = {Mansouryar, Mohsen and Steil, Julian and Sugano, Yusuke and Bulling, Andreas}, year = {2016}, pages = {1--6}, url = {https://arxiv.org/abs/1601.02644} } -
Predicting the Category and Attributes of Mental Pictures Using Deep Gaze Pooling
Hosnieh Sattar, Andreas Bulling, Mario Fritz
arXiv:1611.10162, pp. 1–14, 2016.
Predicting the target of visual search from eye fixation (gaze) data is a challenging problem with many applications in human-computer interaction. In contrast to previous work that has focused on individual instances as a search target, we propose the first approach to predict categories and attributes of search targets based on gaze data. However, state of the art models for categorical recognition, in general, require large amounts of training data, which is prohibitive for gaze data. To address this challenge, we propose a novel Gaze Pooling Layer that integrates gaze information into CNN-based architectures as an attention mechanism - incorporating both spatial and temporal aspects of human gaze behavior. We show that our approach is effective even when the gaze pooling layer is added to an already trained CNN, thus eliminating the need for expensive joint data collection of visual and gaze data. We propose an experimental setup and data set and demonstrate the effectiveness of our method for search target prediction based on gaze behavior. We further study how to integrate temporal and spatial gaze information most effectively, and indicate directions for future research in the gaze-based prediction of mental states.Paper: sattar16_arxiv.pdfPaper Access: https://arxiv.org/abs/1611.10162@techreport{sattar16_arxiv, title = {Predicting the Category and Attributes of Mental Pictures Using Deep Gaze Pooling}, author = {Sattar, Hosnieh and Bulling, Andreas and Fritz, Mario}, year = {2016}, pages = {1--14}, url = {https://arxiv.org/abs/1611.10162} } -
Seeing with Humans: Gaze-Assisted Neural Image Captioning
Yusuke Sugano, Andreas Bulling
arXiv:1608.05203, pp. 1–8, 2016.
Gaze reflects how humans process visual scenes and is therefore increasingly used in computer vision systems. Previous works demonstrated the potential of gaze for object-centric tasks, such as object localization and recognition, but it remains unclear if gaze can also be beneficial for scene-centric tasks, such as image captioning. We present a new perspective on gaze-assisted image captioning by studying the interplay between human gaze and the attention mechanism of deep neural networks. Using a public large-scale gaze dataset, we first assess the relationship between state-of-the-art object and scene recognition models, bottom-up visual saliency, and human gaze. We then propose a novel split attention model for image captioning. Our model integrates human gaze information into an attention-based long short-term memory architecture, and allows the algorithm to allocate attention selectively to both fixated and non-fixated image regions. Through evaluation on the COCO/SALICON datasets we show that our method improves image captioning performance and that gaze can complement machine attention for semantic scene understanding tasks.Paper: sugano16_tr.pdfPaper Access: https://arxiv.org/abs/1608.05203@techreport{sugano16_tr, title = {Seeing with Humans: Gaze-Assisted Neural Image Captioning}, author = {Sugano, Yusuke and Bulling, Andreas}, year = {2016}, pages = {1--8}, url = {https://arxiv.org/abs/1608.05203} } -
It’s Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling
arXiv:1611.08860, pp. 1–10, 2016.
Eye gaze is an important non-verbal cue for human affect analysis. Recent gaze estimation work indicated that information from the full face region can benefit performance. Pushing this idea further, we propose an appearance-based method that, in contrast to a long-standing line of work in computer vision, only takes the full face image as input. Our method encodes the face image using a convolutional neural network with spatial weights applied on the feature maps to flexibly suppress or enhance information in different facial regions. Through extensive evaluation, we show that our full-face method significantly outperforms the state of the art for both 2D and 3D gaze estimation, achieving improvements of up to 14.3% on MPIIGaze and 27.7% on EYEDIAP for person-independent 3D gaze estimation. We further show that this improvement is consistent across different illumination conditions and gaze directions and particularly pronounced for the most challenging extreme head poses.Paper: zhang16_arxiv.pdfPaper Access: https://arxiv.org/abs/1611.08860@techreport{zhang16_arxiv, title = {It's Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation}, author = {Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas}, year = {2016}, pages = {1--10}, url = {https://arxiv.org/abs/1611.08860} }
2015
Journal Articles
-
Introduction to the Special Issue on Activity Recognition for Interaction
Andreas Bulling, Ulf Blanke, Desney Tan, Jun Rekimoto, Gregory Abowd
ACM Transactions on Interactive Intelligent Systems (TiiS), 4 (16e), pp. 1–3, 2015.
This editorial introduction describes the aims and scope of the ACM Transactions on Interactive Intelligent Systems special issue on Activity Recognition for Interaction. It explains why activity recognition is becoming crucial as part of the cycle of interaction between users and computing systems, and it shows how the five articles selected for this special issue reflect this theme.doi: 10.1145/2694858Paper: bulling15_tiis.pdf@article{bulling15_tiis, author = {Bulling, Andreas and Blanke, Ulf and Tan, Desney and Rekimoto, Jun and Abowd, Gregory}, title = {Introduction to the Special Issue on Activity Recognition for Interaction}, journal = {ACM Transactions on Interactive Intelligent Systems (TiiS)}, volume = {4}, number = {16e}, year = {2015}, pages = {1--3}, doi = {10.1145/2694858} } -
A study on the natural history of scanning behaviour in patients with visual field defects after stroke
Tobias Loetscher, Celia Chen, Sophie Wignall, Andreas Bulling, Sabrina Hoppe, Owen Churches, Nicole Thomas
BMC Neurology, 15 (64), pp. 1–4, 2015.
A visual field defect (VFD) is a common consequence of stroke with a detrimental effect upon the survivors’ functional ability and quality of life. The identification of effective treatments for VFD is a key priority relating to life post-stroke. Understanding the natural evolution of scanning compensation over time may have important ramifications for the development of efficacious therapies. The study aims to unravel the natural history of visual scanning behaviour in patients with VFD. The assessment of scanning patterns in the acute to chronic stages of stroke will reveal who does and does not learn to compensate for vision loss. Methods/Design: Eye-tracking glasses are used to delineate eye movements in a cohort of 100 stroke patients immediately after stroke, and additionally at 6 and 12 months post-stroke. The longitudinal study will assess eye movements in static (sitting) and dynamic (walking) conditions. The primary outcome constitutes the change of lateral eye movements from the acute to chronic stages of stroke. Secondary outcomes include changes of lateral eye movements over time as a function of subgroup characteristics, such as side of VFD, stroke location, stroke severity and cognitive functioning. Discussion: The longitudinal comparison of patients who do and do not learn compensatory scanning techniques may reveal important prognostic markers of natural recovery. Importantly, it may also help to determine the most effective treatment window for visual rehabilitation.Paper: loetscher15_neurology.pdf@article{loetscher15_neurology, title = {A study on the natural history of scanning behaviour in patients with visual field defects after stroke}, author = {Loetscher, Tobias and Chen, Celia and Wignall, Sophie and Bulling, Andreas and Hoppe, Sabrina and Churches, Owen and Thomas, Nicole}, doi = {10.1186/s12883-015-0321-5}, year = {2015}, journal = {BMC Neurology}, volume = {15}, number = {64}, pages = {1--4} } -
Walking reduces spatial neglect
Tobias Loetscher, Celia Chen, Sabrina Hoppe, Andreas Bulling, Sophie Wignall, Owen Churches, Nicole Thomas, Andrew Lee
Journal of the International Neuropsychological Society, 21 (S2), pp. 120-121, 2015.
Spatial neglect is a common consequence of stroke. Neglect behaviour is typically exacerbated by increased task demands. It was thus anticipated that the addition of a secondary task requiring general attention (walking) would worsen performance on tests of spatial neglect. Here, however, we report a patient in whom neglect was considerably reduced when performing a visual search task while walking.Paper: loetscher15_ins.pdf@article{loetscher15_ins, title = {Walking reduces spatial neglect}, author = {Loetscher, Tobias and Chen, Celia and Hoppe, Sabrina and Bulling, Andreas and Wignall, Sophie and Churches, Owen and Thomas, Nicole and Lee, Andrew}, year = {2015}, volume = {21}, number = {S2}, pages = {120-121}, journal = {Journal of the International Neuropsychological Society} } -
The Feet in HCI: A Survey of Foot-Based Interaction
Eduardo Velloso, Dominik Schmidt, Jason Alexander, Hans Gellersen, Andreas Bulling
ACM Computing Surveys, 48 (2), pp. 1–36, 2015.
Foot-operated computer interfaces have been studied since the inception of Human-Computer Interaction. Thanks to the miniaturisation and decreasing cost of sensing technology, there is an increasing interest in exploring this alternative input modality, but no comprehensive overview of its research landscape. In this survey, we review the literature on interfaces operated by the lower limbs. We investigate the characteristics of users and how they affect the design of such interfaces. Next, we describe and analyse foot-based research prototypes and commercial systems in how they capture input and provide feedback. We then analyse the interactions between users and systems from the perspective of the actions performed in these interactions. Finally, we discuss our findings and use them to identify open questions and directions for future research.doi: 10.1145/2816455Paper: velloso15_csur.pdf@article{velloso15_csur, author = {Velloso, Eduardo and Schmidt, Dominik and Alexander, Jason and Gellersen, Hans and Bulling, Andreas}, title = {{The Feet in HCI: A Survey of Foot-Based Interaction}}, journal = {ACM Computing Surveys}, year = {2015}, volume = {48}, number = {2}, pages = {1--36}, doi = {10.1145/2816455} } -
Pursuits: Spontaneous Eye-Based Interaction for Dynamic Interfaces
Mélodie Vidal, Andreas Bulling, Hans Gellersen
ACM SIGMOBILE Mobile Computing and Communications Review, 18 (4), pp. 8-10, 2015.
Although gaze is an attractive modality for pervasive interaction, real-world implementation of eye-based interfaces poses significant challenges. In particular, user calibration is tedious and time consuming. Pursuits is an innovative interaction technique that enables truly spontaneous interaction with eye-based interfaces. A user can simply walk up to the screen and readily interact with moving targets. Instead of being based on gaze location, Pursuits correlates eye pursuit movements with objects dynamically moving on the interface.Paper: vidal15_sigmobile.pdf@article{vidal15_sigmobile, author = {Vidal, M{\'{e}}lodie and Bulling, Andreas and Gellersen, Hans}, title = {Pursuits: Spontaneous Eye-Based Interaction for Dynamic Interfaces}, journal = {ACM SIGMOBILE Mobile Computing and Communications Review}, volume = {18}, number = {4}, year = {2015}, pages = {8-10}, doi = {10.1145/2721914.2721917} } -
Eye tracking for public displays in the wild
Yanxia Zhang, Ming Ki Chong, Jörg Müller, Andreas Bulling, Hans Gellersen
Springer Personal and Ubiquitous Computing, 19 (5), pp. 967-981, 2015.
In public display contexts, interactions are spontaneous and have to work without preparation. We propose gaze as a modality for such contexts, as gaze is always at the ready, and a natural indicator of the user’s interest. We present GazeHorizon, a system that demonstrates spontaneous gaze interaction, enabling users to walk up to a display and navigate content using their eyes only. GazeHorizon is extemporaneous and optimised for instantaneous usability by any user without prior configuration, calibration or training. The system provides interactive assistance to bootstrap gaze interaction with unaware users, employs a single off-the-shelf web camera and computer vision for person-independent tracking of the horizontal gaze direction, and maps this input to rate-controlled navigation of horizontally arranged content. We have evaluated GazeHorizon through a series of field studies, culminating in a four-day deployment in a public environment during which over a hundred passers-by interacted with it, unprompted and unassisted. We realised that since eye movements are subtle, users cannot learn gaze interaction from only observing others, and as a result guidance is required.Paper: zhang15_puc.pdf@article{zhang15_puc, title = {Eye tracking for public displays in the wild}, author = {Zhang, Yanxia and Chong, Ming Ki and M\"uller, J\"org and Bulling, Andreas and Gellersen, Hans}, year = {2015}, doi = {10.1007/s00779-015-0866-8}, pages = {967-981}, volume = {19}, number = {5}, journal = {Springer Personal and Ubiquitous Computing}, keywords = {Eye tracking; Gaze interaction; Public displays; Scrolling; Calibration-free; In-the-wild study; Deployment} }
Conference Papers
-
GravitySpot: Guiding Users in Front of Public Displays Using On-Screen Visual Cues
Florian Alt, Andreas Bulling, Gino Gravanis, Daniel Buschek
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 47-56, 2015.
Users tend to position themselves in front of interactive public displays in such a way as to best perceive its content. Currently, this sweet spot is implicitly defined by display properties, content, the input modality, as well as space constraints in front of the display. We present GravitySpot – an approach that makes sweet spots flexible by actively guiding users to arbitrary target positions in front of displays using visual cues. Such guidance is beneficial, for example, if a particular input technology only works at a specific distance or if users should be guided towards a non-crowded area of a large display. In two controlled lab studies (n=29) we evaluate different visual cues based on color, shape, and motion, as well as position-to-cue mapping functions. We show that both the visual cues and mapping functions allow for fine-grained control over positioning speed and accuracy. Findings are complemented by observations from a 3-month real-world deployment.Paper: alt15_uist.pdf@inproceedings{alt15_uist, title = {GravitySpot: Guiding Users in Front of Public Displays Using On-Screen Visual Cues}, author = {Alt, Florian and Bulling, Andreas and Gravanis, Gino and Buschek, Daniel}, year = {2015}, pages = {47-56}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, doi = {10.1145/2807442.2807490}, video = {https://www.youtube.com/watch?v=laWfbOpQQ8A} } -
Graphical Passwords in the Wild – Understanding How Users Choose Pictures and Passwords in Image-based Authentication Schemes
Florian Alt, Stefan Schneegass, Alireza Sahami, Mariam Hassib, Andreas Bulling
Proc. ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI), pp. 316-322, 2015.
Common user authentication methods on smartphones, such as lock patterns, PINs, or passwords, impose a trade-off between security and password memorability. Image-based passwords were proposed as a secure and usable alternative. As of today, however, it remains unclear how such schemes are used in the wild. We present the first study to investigate how image-based passwords are used over long periods of time in the real world. Our analyses are based on data from 2318 unique devices collected over more than one year using a custom application released in the Android Play store. We present an in-depth analysis of what kind of images users select, how they define their passwords, and how secure these passwords are. Our findings provide valuable insights into real-world use of image-based passwords and inform the design of future graphical authentication schemes.Paper: alt15_mobilehci.pdf@inproceedings{alt15_mobilehci, title = {Graphical Passwords in the Wild -- Understanding How Users Choose Pictures and Passwords in Image-based Authentication Schemes}, author = {Alt, Florian and Schneegass, Stefan and Sahami, Alireza and Hassib, Mariam and Bulling, Andreas}, year = {2015}, pages = {316-322}, booktitle = {Proc. ACM International Conference on Human-Computer Interaction with Mobile Devices and Services (MobileHCI)}, doi = {10.1145/2785830.2785882} } -
Human Visual Behaviour for Collaborative Human-Machine Interaction
Andreas Bulling
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 903-907, 2015.
Non-verbal behavioural cues are fundamental to human communication and interaction. Despite significant advances in recent years, state-of-the-art human-machine systems still fall short in sensing, analysing, and fully "understanding" cues naturally expressed in everyday settings. Two of the most important non-verbal cues, as evidenced by a large body of work in experimental psychology and behavioural sciences, are visual (gaze) behaviour and body language. We envision a new class of collaborative human-machine systems that fully exploit the information content available in non-verbal human behaviour in everyday settings through joint analysis of human gaze and physical behaviour.Paper: bulling15_ubicomp.pdf@inproceedings{bulling15_ubicomp, title = {Human Visual Behaviour for Collaborative Human-Machine Interaction}, author = {Bulling, Andreas}, doi = {10.1145/2800835.2815378}, pages = {903-907}, year = {2015}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)} } -
Orbits: Enabling Gaze Interaction in Smart Watches using Moving Targets
Augusto Esteves, Eduardo Velloso, Andreas Bulling, Hans Gellersen
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 419-422, 2015.
In this paper we demonstrate Orbits, a novel gaze interaction technique that accounts for both the reduced size of smart watch displays and the hands-free nature of conventional watches. Orbits combines graphical controls that display one or multiple targets moving on a circular path, with input that is provided by users as they follow any of the targets briefly with their eyes. This gaze input triggers the functionality associated with the followed target – be it answering a call, playing a song or managing multiple notifications.@inproceedings{esteves15_ubicomp, title = {Orbits: Enabling Gaze Interaction in Smart Watches using Moving Targets}, author = {Esteves, Augusto and Velloso, Eduardo and Bulling, Andreas and Gellersen, Hans}, year = {2015}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, doi = {10.1145/2800835.2800942}, pages = {419-422}, video = {https://www.youtube.com/watch?v=KEIgw5A0yfI} } -
Orbits: Enabling Gaze Interaction in Smart Watches using Moving Targets
Augusto Esteves, Eduardo Velloso, Andreas Bulling, Hans Gellersen
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 457-466, 2015.
Best Paper Award
We introduce Orbits, a novel gaze interaction technique that enables hands-free input on smart watches. The technique relies on moving controls to leverage the smooth pursuit movements of the eyes and detect whether and at which control the user is looking. In Orbits, controls include targets that move in a circular trajectory on the face of the watch, and can be selected by following the desired one for a small amount of time. We conducted two user studies to assess the technique’s recognition and robustness, which demonstrated how Orbits is robust against false positives triggered by natural eye movements and how it presents a hands-free, high-accuracy way of interacting with smart watches using off-the-shelf devices. Finally, we developed three example interfaces built with Orbits: a music player, a notifications face plate and a missed call menu. Despite relying on moving controls – very unusual in current HCI interfaces – these were generally well received by participants in a third and final study.Paper: esteves15_uist.pdf@inproceedings{esteves15_uist, title = {Orbits: Enabling Gaze Interaction in Smart Watches using Moving Targets}, author = {Esteves, Augusto and Velloso, Eduardo and Bulling, Andreas and Gellersen, Hans}, year = {2015}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, doi = {10.1145/2807442.2807499}, pages = {457-466} } -
Recognition of Curiosity Using Eye Movement Analysis
Sabrina Hoppe, Tobias Loetscher, Stephanie Morey, Andreas Bulling
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 185-188, 2015.
Among the different personality traits that guide our behaviour, curiosity is particularly interesting for context-aware assistive systems as it is closely linked to our well-being and the way we learn. This work proposes eye movement analysis for automatic recognition of different levels of curiosity. We present a 26-participant gaze dataset recorded during a real-world shopping task with empirically validated curiosity questionnaires as ground truth. Using a support vector machine classifier and a leave-one-person-out evaluation scheme we can discriminate between two to four classes of standard curiosity scales well above chance. These results are promising and point towards a new class of context-aware systems that take the user’s curiosity into account, thereby enabling new types of interaction and user adaptation.Paper: hoppe15_ubicomp.pdf@inproceedings{hoppe15_ubicomp, title = {Recognition of Curiosity Using Eye Movement Analysis}, author = {Hoppe, Sabrina and Loetscher, Tobias and Morey, Stephanie and Bulling, Andreas}, year = {2015}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, doi = {10.1145/2800835.2800910}, pages = {185-188} } -
Tackling Challenges of Interactive Public Displays using Gaze
Mohamed Khamis, Andreas Bulling, Florian Alt
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 763-766, 2015.
Falling hardware prices led to a widespread use of public displays. Common interaction techniques for such displays currently include touch, mid-air, or smartphone-based interaction. While these techniques are well understood from a technical perspective, several remaining challenges hinder the uptake of interactive displays among passersby. In this paper we propose addressing major public display challenges through gaze as a novel interaction modality. We discuss why gaze-based interaction can tackle these challenges effectively and discuss how solutions can be technically realized. Furthermore, we summarize state-of-the-art eye tracking techniques that show particular promise in the area of public displays.Paper: khamis15_ubicomp.pdf@inproceedings{khamis15_ubicomp, title = {Tackling Challenges of Interactive Public Displays using Gaze}, author = {Khamis, Mohamed and Bulling, Andreas and Alt, Florian}, year = {2015}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, doi = {10.1145/2800835.2807951}, pages = {763-766} } -
A Field Study on Spontaneous Gaze-based Interaction with a Public Display using Pursuits
Mohamed Khamis, Florian Alt, Andreas Bulling
Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 865-874, 2015.
Smooth pursuit eye movements were recently introduced as a promising technique for calibration-free and thus spontaneous and natural gaze interaction. While pursuits have been evaluated in controlled laboratory studies, the technique has not yet been evaluated with respect to usability in the wild. We report on a field study in which we deployed a game on a public display where participants used pursuits to select fish moving in linear and circular trajectories at different speeds. The study ran for two days in a busy computer lab resulting in a total of 56 interactions. Results from our study show that linear trajectories are statistically faster to select via pursuits than circular trajectories. We also found that pursuits is well perceived by users who find it fast and responsive.Paper: khamis15_ubicomp_2.pdf@inproceedings{khamis15_ubicomp_2, title = {A Field Study on Spontaneous Gaze-based Interaction with a Public Display using Pursuits}, author = {Khamis, Mohamed and Alt, Florian and Bulling, Andreas}, year = {2015}, booktitle = {Adj. Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, doi = {10.1145/2800835.2804335}, pages = {865-874} } -
Scene viewing and gaze analysis during phonetic segmentation tasks
Arif Khan, Ingmar Steiner, Ross Macdonald, Yusuke Sugano, Andreas Bulling
Proc. European Conference on Eye Movements (ECEM), pp. 1–2, 2015.
Paper: khan15_ecem.pdf@inproceedings{khan15_ecem, title = {Scene viewing and gaze analysis during phonetic segmentation tasks}, author = {Khan, Arif and Steiner, Ingmar and Macdonald, Ross and Sugano, Yusuke and Bulling, Andreas}, year = {2015}, pages = {1--2}, booktitle = {Proc. European Conference on Eye Movements (ECEM)} } -
On the interplay between spontaneous spoken instructions and human visual behaviour in an indoor guidance task
Nikolina Koleva, Sabrina Hoppe, Mohammed Mehdi Moniri, Maria Staudte, Andreas Bulling
Proc. Annual Meeting of the Cognitive Science Society (CogSci), pp. 1–6, 2015.
We report on an indoor guidance study to explore the interplay between spontaneous spoken instructions and listeners’ eye movement behaviour. The study involves a remote speaker (the instructor) to verbally guide a listener (the walker) to complete nine everyday tasks in different locations inside a room. We collect a multi-modal dataset of 12 pairs of users consisting of egocentric videos from the listener’s perspective, their gaze data, and instructors’ verbal instructions. We analyse the impact on instructions and listener gaze when the speaker can see 1) only the egocentric video, 2) the video and the point of gaze, or 3) the video and gaze with artificial noise. Our results show that gaze behaviour varies significantly after (but hardly before) instructions and that speakers give more negative feedback when listener gaze is available. These findings suggest that although speakers use gaze information as an indication of what referent the listener is effectively considering, this does not lead listeners to deliberately use their gaze as a pointer even when this is potentially beneficial for the task.Paper: koleva15_cogsci.pdf@inproceedings{koleva15_cogsci, title = {On the interplay between spontaneous spoken instructions and human visual behaviour in an indoor guidance task}, author = {Koleva, Nikolina and Hoppe, Sabrina and Moniri, Mohammed Mehdi and Staudte, Maria and Bulling, Andreas}, year = {2015}, pages = {1--6}, booktitle = {Proc. Annual Meeting of the Cognitive Science Society (CogSci)} } -
GazeProjector: Accurate Gaze Estimation and Seamless Gaze Interaction Across Multiple Displays
Christian Lander, Sven Gehring, Antonio Krüger, Sebastian Boring, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 395-404, 2015.
Mobile gaze-based interaction with multiple displays may occur from arbitrary positions and orientations. However, maintaining high gaze estimation accuracy in such situations remains a significant challenge. In this paper, we present GazeProjector, a system that combines (1) natural feature tracking on displays to determine the mobile eye tracker’s position relative to a display with (2) accurate point-of-gaze estimation. GazeProjector allows for seamless gaze estimation and interaction on multiple displays of arbitrary sizes independently of the user’s position and orientation to the display. In a user study with 12 participants we compare GazeProjector to established methods (here: visual on-screen markers and a state-of-the-art video-based motion capture system). We show that our approach is robust to varying head poses, orientations, and distances to the display, while still providing high gaze estimation accuracy across multiple displays without re-calibration for each variation. Our system represents an important step towards the vision of pervasive gaze-based interfaces.Paper: lander15_uist.pdf@inproceedings{lander15_uist, title = {GazeProjector: Accurate Gaze Estimation and Seamless Gaze Interaction Across Multiple Displays}, author = {Lander, Christian and Gehring, Sven and Kr{\"{u}}ger, Antonio and Boring, Sebastian and Bulling, Andreas}, year = {2015}, pages = {395-404}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, doi = {10.1145/2807442.2807479} } -
Emotion recognition from embedded bodily expressions and speech during dyadic interactions
Philipp Müller, Sikandar Amin, Prateek Verma, Mykhaylo Andriluka, Andreas Bulling
Proc. International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 663-669, 2015.
Previous work on emotion recognition from bodily expressions focused on analysing such expressions in isolation, of individuals or in controlled settings, from a single camera view, or required intrusive motion tracking equipment. We study the problem of emotion recognition from bodily expressions and speech during dyadic (person-person) interactions in a real kitchen instrumented with ambient cameras and microphones. We specifically focus on bodily expressions that are embedded in regular interactions and background activities and recorded without human augmentation to increase naturalness of the expressions. We present a human-validated dataset that contains 224 high-resolution, multi-view video clips and audio recordings of emotionally charged interactions between eight couples of actors. The dataset is fully annotated with categorical labels for four basic emotions (anger, happiness, sadness, and surprise) and continuous labels for valence, activation, power, and anticipation provided by five annotators for each actor. We evaluate vision and audio-based emotion recognition using dense trajectories and a standard audio pipeline and provide insights into the importance of different body parts and audio features for emotion recognition.Paper: mueller15_acii.pdf@inproceedings{mueller15_acii, title = {Emotion recognition from embedded bodily expressions and speech during dyadic interactions}, author = {M{\"{u}}ller, Philipp and Amin, Sikandar and Verma, Prateek and Andriluka, Mykhaylo and Bulling, Andreas}, year = {2015}, pages = {663-669}, doi = {10.1109/ACII.2015.7344640}, booktitle = {Proc. International Conference on Affective Computing and Intelligent Interaction (ACII)} } -
Prediction of Search Targets From Fixations in Open-world Settings
Hosnieh Sattar, Sabine Müller, Mario Fritz, Andreas Bulling
Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 981-990, 2015.
Previous work on predicting the target of visual search from human fixations only considered closed-world settings in which training labels are available and predictions are performed for a known set of potential targets. In this work we go beyond the state of the art by studying search target prediction in an open-world setting in which we no longer assume that we have fixation data to train for the search targets. We present a dataset containing fixation data of 18 users searching for natural images from three image categories within synthesised image collages of about 80 images. In a closed-world baseline experiment we show that we can predict the correct target image out of a candidate set of five images. We then present a new problem formulation for search target prediction in the open-world setting that is based on learning compatibilities between fixations and potential targets.Paper: sattar15_cvpr.pdf@inproceedings{sattar15_cvpr, author = {Sattar, Hosnieh and M{\"{u}}ller, Sabine and Fritz, Mario and Bulling, Andreas}, title = {Prediction of Search Targets From Fixations in Open-world Settings}, booktitle = {Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}, year = {2015}, pages = {981-990}, doi = {10.1109/CVPR.2015.7298700} } -
Discovery of Everyday Human Activities From Long-term Visual Behaviour Using Topic Models
Julian Steil, Andreas Bulling
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 75-85, 2015.
Human visual behaviour has significant potential for activity recognition and computational behaviour analysis, but previous works focused on supervised methods and recognition of predefined activity classes based on short-term eye movement recordings. We propose a fully unsupervised method to discover users’ everyday activities from their long-term visual behaviour. Our method combines a bag-of-words representation of visual behaviour that encodes saccades, fixations, and blinks with a latent Dirichlet allocation (LDA) topic model. We further propose different methods to encode saccades for their use in the topic model. We evaluate our method on a novel long-term gaze dataset that contains full-day recordings of natural visual behaviour of 10 participants (more than 80 hours in total). We also provide annotations for eight sample activity classes (outdoor, social interaction, focused work, travel, reading, computer work, watching media, eating) and periods with no specific activity. We show the ability of our method to discover these activities with performance competitive with that of previously published supervised methods.Paper: steil15_ubicomp.pdf@inproceedings{steil15_ubicomp, author = {Steil, Julian and Bulling, Andreas}, title = {Discovery of Everyday Human Activities From Long-term Visual Behaviour Using Topic Models}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2015}, doi = {10.1145/2750858.2807520}, pages = {75-85} } -
Self-Calibrating Head-Mounted Eye Trackers Using Egocentric Visual Saliency
Yusuke Sugano, Andreas Bulling
Proc. ACM Symposium on User Interface Software and Technology (UIST), pp. 363-372, 2015.
Head-mounted eye tracking has significant potential for gaze-based applications such as life logging, mental health monitoring, or quantified self. However, a neglected challenge for such applications is that drift in the initial person-specific eye tracker calibration, for example caused by physical activity, can severely impact gaze estimation accuracy and, thus, system performance and user experience. We first analyse calibration drift on a new dataset of natural gaze data recorded using synchronised video-based and Electrooculography-based eye trackers of 20 users performing everyday activities in a mobile setting. Based on this analysis we present a method to automatically self-calibrate head-mounted eye trackers based on a computational model of bottom-up visual saliency. Through evaluations on the dataset we show that our method is 1) effective in reducing calibration drift in calibrated eye trackers and 2) given sufficient data, can achieve competitive gaze estimation accuracy to a calibrated eye tracker without any manual calibration.Paper: sugano15_uist.pdf@inproceedings{sugano15_uist, title = {Self-Calibrating Head-Mounted Eye Trackers Using Egocentric Visual Saliency}, author = {Sugano, Yusuke and Bulling, Andreas}, year = {2015}, booktitle = {Proc. ACM Symposium on User Interface Software and Technology (UIST)}, doi = {10.1145/2807442.2807445}, pages = {363-372}, video = {https://www.youtube.com/watch?v=CvsZ3YCWFPk} } -
Gaze+RST: Integrating Gaze and Multitouch for Remote Rotate-Scale-Translate Tasks
Jayson Turner, Jason Alexander, Andreas Bulling, Hans Gellersen
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 4179-4188, 2015.
Our work investigates the use of gaze and multitouch to fluidly perform rotate-scale-translate (RST) tasks on large displays. The work specifically aims to understand if gaze can provide benefit in such a task, how task complexity affects performance, and how gaze and multitouch can be combined to create an integral input structure suited to the task of RST. We present four techniques that individually strike a different balance between gaze-based and touch-based translation while maintaining concurrent rotation and scaling operations. A 16-participant empirical evaluation revealed that three of our four techniques present viable options for this scenario, and that larger distances and rotation/scaling operations can significantly affect a gaze-based translation configuration. Furthermore we uncover new insights regarding multimodal integrality, finding that gaze and touch can be combined into configurations that pertain to integral or separable input structures.Paper: turner15_chi.pdf@inproceedings{turner15_chi, author = {Turner, Jayson and Alexander, Jason and Bulling, Andreas and Gellersen, Hans}, title = {Gaze+RST: Integrating Gaze and Multitouch for Remote Rotate-Scale-Translate Tasks}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2015}, pages = {4179-4188}, doi = {10.1145/2702123.2702355} } -
An Empirical Investigation of Gaze Selection in Mid-Air Gestural 3D Manipulation
Eduardo Velloso, Jayson Turner, Jason Alexander, Andreas Bulling, Hans Gellersen
Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT), pp. 315-330, 2015.
In this work, we investigate gaze selection in the context of mid-air hand gestural manipulation of 3D rigid bodies in monoscopic displays. We present the results of a user study with 12 participants in which we compared the performance of Gaze, a Raycasting technique (2D Cursor) and a Virtual Hand technique (3D Cursor) to select objects in two 3D mid-air interaction tasks. Also, we compared selection confirmation times for Gaze selection when selection is followed by manipulation to when it is not. Our results show that gaze selection is faster and more preferred than 2D and 3D mid-air-controlled cursors, and is particularly well suited for tasks in which users constantly switch between several objects during the manipulation. Further, selection confirmation times are longer when selection is followed by manipulation than when it is not.Paper: velloso15_interact.pdf@inproceedings{velloso15_interact, title = {{An Empirical Investigation of Gaze Selection in Mid-Air Gestural 3D Manipulation}}, author = {Velloso, Eduardo and Turner, Jayson and Alexander, Jason and Bulling, Andreas and Gellersen, Hans}, year = {2015}, pages = {315-330}, doi = {10.1007/978-3-319-22668-2_25}, booktitle = {Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT)} } -
Interactions Under the Desk: A Characterisation of Foot Movements for Input in a Seated Position
Eduardo Velloso, Jason Alexander, Andreas Bulling, Hans Gellersen
Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT), pp. 384-401, 2015.
This paper takes a bottom-up approach to characterising foot movements as input for users seated at computing systems. We conducted four user studies to characterise various aspects of foot-based interaction. First, we built unconstrained foot pointing performance models for 16 participants in a seated desktop setting using 1D and 2D ISO 9241-9-compliant Fitts’s Law tasks. Second, we evaluated the effect of the foot and direction in one-direction tasks, finding no effect of the foot used, but a significant effect of the direction in which targets are distributed. Third, we compared the use of one foot against two feet to control two independent variables, finding that while one foot is better suited for tasks with a spatial representation that matches its movement, there is little difference between the two feet techniques when it does not. Fourth, we analysed the overhead caused by introducing a feet-controlled variable in a mouse-based task, finding the feet to be comparable to the scroll wheel. The results of our studies show the feet are an effective method of enhancing our interaction with desktop systems; we use our findings to inform a series of design guidelines for such systems.Paper: velloso15_interact_2.pdf@inproceedings{velloso15_interact_2, title = {Interactions Under the Desk: A Characterisation of Foot Movements for Input in a Seated Position}, author = {Velloso, Eduardo and Alexander, Jason and Bulling, Andreas and Gellersen, Hans}, year = {2015}, pages = {384-401}, doi = {10.1007/978-3-319-22701-6_29}, booktitle = {Proc. IFIP TC13 Conference on Human-Computer Interaction (INTERACT)} } -
The Royal Corgi: Exploring Social Gaze Interaction for Immersive Gameplay
Mélodie Vidal, Remi Bismuth, Andreas Bulling, Hans Gellersen
Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 115-124, 2015.
The eyes are a rich channel for non-verbal communication in our daily interactions. We propose social gaze interaction as a game mechanic to enhance user interactions with virtual characters. We develop a game from the ground up in which characters are designed to be reactive to the player’s gaze in social ways, such as getting annoyed when the player seems distracted or changing their dialogue depending on the player’s apparent focus of attention. Results from a qualitative user study provide insights about how social gaze interaction is intuitive for users, elicits deep feelings of immersion, and highlight the players’ self-consciousness of their own eye movements through their strong reactions to the characters.Paper: vidal15_chi.pdf@inproceedings{vidal15_chi, author = {Vidal, M{\'{e}}lodie and Bismuth, Remi and Bulling, Andreas and Gellersen, Hans}, title = {{The Royal Corgi: Exploring Social Gaze Interaction for Immersive Gameplay}}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2015}, pages = {115-124}, doi = {10.1145/2702123.2702163} } -
Analyzing Visual Attention During Whole Body Interaction with Public Displays
Robert Walter, Andreas Bulling, David Lindlbauer, Martin Schüssler, Hans Jörg Müller
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 1263-1267, 2015.
While whole body interaction can enrich user experience on public displays, it remains unclear how common visualizations of user representations impact users’ ability to perceive content on the display. In this work we use a head-mounted eye tracker to record visual behavior of 25 users interacting with a public display game that uses a silhouette user representation, mirroring the users’ movements. Results from visual attention analysis as well as post-hoc recall and recognition tasks on display contents reveal that visual attention is mostly on users’ silhouette while peripheral screen elements remain largely unattended. In our experiment, content attached to the user representation attracted significantly more attention than other screen contents, while content placed at the top and bottom of the screen attracted significantly less. Screen contents attached to the user representation were also significantly better remembered than those at the top and bottom of the screen.@inproceedings{walter15_ubicomp, author = {Walter, Robert and Bulling, Andreas and Lindlbauer, David and Sch{\"{u}}ssler, Martin and M{\"{u}}ller, Hans J{\"{o}}rg}, title = {Analyzing Visual Attention During Whole Body Interaction with Public Displays}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)}, year = {2015}, doi = {10.1145/2750858.280425}, pages = {1263-1267}, video = {https://www.youtube.com/watch?v=JlEnUyhQ1cY} } -
Rendering of Eyes for Eye-Shape Registration and Gaze Estimation
Erroll Wood, Tadas Baltrušaitis, Xucong Zhang, Yusuke Sugano, Peter Robinson, Andreas Bulling
Proc. IEEE International Conference on Computer Vision (ICCV), pp. 3756-3764, 2015.
Images of the eye are key in several computer vision problems, such as shape registration and gaze estimation. Recent large-scale supervised methods for these problems require time-consuming data collection and manual annotation, which can be unreliable. We propose synthesizing perfectly labelled photo-realistic training data in a fraction of the time. We used computer graphics techniques to build a collection of dynamic eye-region models from head scan geometry. These were randomly posed to synthesize close-up eye images for a wide range of head poses, gaze directions, and illumination conditions. We used our model’s controllability to verify the importance of realistic illumination and shape variations in eye-region training data. Finally, we demonstrate the benefits of our synthesized training data (SynthesEyes) by out-performing state-of-the-art methods for eye-shape registration as well as cross-dataset appearance-based gaze estimation in the wild.Paper: wood15_iccv.pdf@inproceedings{wood15_iccv, title = {Rendering of Eyes for Eye-Shape Registration and Gaze Estimation}, author = {Wood, Erroll and Baltru{\v{s}}aitis, Tadas and Zhang, Xucong and Sugano, Yusuke and Robinson, Peter and Bulling, Andreas}, doi = {10.1109/ICCV.2015.428}, year = {2015}, pages = {3756-3764}, booktitle = {Proc. IEEE International Conference on Computer Vision (ICCV)} } -
Appearance-based Gaze Estimation in the Wild
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling
Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4511-4520, 2015.
Appearance-based gaze estimation is believed to work well in real-world settings but existing datasets were collected under controlled laboratory conditions and methods were not evaluated across multiple datasets. In this work we study appearance-based gaze estimation in the wild. We present the MPIIGaze dataset that contains 213,659 images we collected from 15 participants during natural everyday laptop use over more than three months. Our dataset is significantly more variable than existing datasets with respect to appearance and illumination. We also present a method for in-the-wild appearance-based gaze estimation using multimodal convolutional neural networks, which significantly outperforms state-of-the-art methods in the most challenging cross-dataset evaluation setting. We present an extensive evaluation of several state-of-the-art image-based gaze estimation algorithms on three current datasets, including our own. This evaluation provides clear insights and allows us to identify key research challenges of gaze estimation in the wild.Paper: zhang15_cvpr.pdf@inproceedings{zhang15_cvpr, author = {Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas}, title = {Appearance-based Gaze Estimation in the Wild}, booktitle = {Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR)}, year = {2015}, pages = {4511-4520}, doi = {10.1109/CVPR.2015.7299081}, video = {https://www.youtube.com/watch?v=rw6LZA1USG8} }
Book Chapters
-
Proc. International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI)
Peter Kiefer, Yanxia Zhang, Andreas Bulling
Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp), pp. 825–828, 2015.
Previous work on eye tracking and eye-based human-computer interfaces mainly concentrated on making use of the eyes in traditional desktop settings. With the recent growth of interest in smart eyewear and low-cost mobile eye trackers, gaze-based techniques for mobile computing are becoming increasingly important. PETMEI 2015 focuses on the pervasive eye tracking paradigm as a trailblazer for mobile eye-based interaction and eye-based context-awareness. We want to stimulate and explore the creativity of these communities with respect to the implications, key research challenges, and new applications for pervasive eye tracking in ubiquitous computing. The long-term goal is to create a strong interdisciplinary research community linking these fields together and to establish the workshop as the premier forum for research on pervasive eye tracking.Paper: kiefer15_petmei.pdf@inbook{kiefer15_petmei, author = {Kiefer, Peter and Zhang, Yanxia and Bulling, Andreas}, title = {Proc. International Workshop on Pervasive Eye Tracking and Mobile Eye-Based Interaction (PETMEI)}, year = {2015}, doi = {10.1145/2800835.2807960}, pages = {825--828}, booktitle = {Proc. ACM International Joint Conference on Pervasive and Ubiquitous Computing (UbiComp)} }
Technical Reports
-
GazeProjector: Location-independent gaze interaction on and across multiple displays
Christian Lander, Sven Gehring, Antonio Krüger, Sebastian Boring, Andreas Bulling
DFKI Research Reports, pp. 1–10, 2015.
Mobile gaze-based interaction with multiple displays may occur from arbitrary positions and orientations. However, maintaining high gaze estimation accuracy still represents a significant challenge. To address this, we present GazeProjector, a system that combines accurate point-of-gaze estimation with natural feature tracking on displays to determine the mobile eye tracker’s position relative to a display. The detected eye positions are transformed onto that display allowing for gaze-based interaction. This allows for seamless gaze estimation and interaction on (1) multiple displays of arbitrary sizes, (2) independently of the user’s position and orientation to the display. In a user study with 12 participants we compared GazeProjector to existing well-established methods such as visual on-screen markers and a state-of-the-art motion capture system. Our results show that our approach is robust to varying head poses, orientations, and distances to the display, while still providing high gaze estimation accuracy across multiple displays without re-calibration. The system represents an important step towards the vision of pervasive gaze-based interfaces.@techreport{lander15_techrep, author = {Lander, Christian and Gehring, Sven and Kr{\"{u}}ger, Antonio and Boring, Sebastian and Bulling, Andreas}, title = {GazeProjector: Location-independent gaze interaction on and across multiple displays}, volume = {1}, year = {2015}, pages = {1--10}, institution = {German Research Center for Artificial Intelligence (DFKI)}, video = {https://www.youtube.com/watch?v=peuL4WRfrRM} } -
Prediction of Search Targets From Fixations in Open-world Settings
Hosnieh Sattar, Sabine Müller, Mario Fritz, Andreas Bulling
arXiv:1502.05137, pp. 1–10, 2015.
Previous work on predicting the target of visual search from human fixations only considered closed-world settings in which training labels are available and predictions are performed for a known set of potential targets. In this work we go beyond the state of the art by studying search target prediction in an open-world setting in which we no longer assume that we have fixation data to train for the search targets. We present a dataset containing fixation data of 18 users searching for natural images from three image categories within synthesised image collages of about 80 images. In a closed-world baseline experiment we show that we can predict the correct target image out of a candidate set of five images. We then present a new problem formulation for search target prediction in the open-world setting that is based on learning compatibilities between fixations and potential targets.Paper: sattar15_arxiv.pdfPaper Access: https://arxiv.org/abs/1502.05137@techreport{sattar15_arxiv, author = {Sattar, Hosnieh and M{\"{u}}ller, Sabine and Fritz, Mario and Bulling, Andreas}, title = {Prediction of Search Targets From Fixations in Open-world Settings}, year = {2015}, pages = {1--10}, url = {https://arxiv.org/abs/1502.05137} } -
GazeDPM: Early Integration of Gaze Information in Deformable Part Models
Iaroslav Shcherbatyi, Andreas Bulling, Mario Fritz
arXiv:1505.05753, pp. 1–14, 2015.
An increasing number of works explore collaborative human-computer systems in which human gaze is used to enhance computer vision systems. For object detection these efforts were so far restricted to late integration approaches that have inherent limitations, such as increased precision without increase in recall. We propose an early integration approach in a deformable part model, which constitutes a joint formulation over gaze and visual data. We show that our GazeDPM method improves over the state-of-the-art DPM baseline by 4% and a recent method for gaze-supported object detection by 3% on the public POET dataset. Our approach additionally provides introspection of the learnt models, can reveal salient image structures, and allows us to investigate the interplay between gaze attracting and repelling areas, the importance of view-specific models, as well as viewers’ personal biases in gaze patterns. We finally study important practical aspects of our approach, such as the impact of using saliency maps instead of real fixations, the impact of the number of fixations, as well as robustness to gaze estimation error.Paper: shcherbatyi15_arxiv.pdfPaper Access: https://arxiv.org/abs/1505.05753@techreport{shcherbatyi15_arxiv, title = {GazeDPM: Early Integration of Gaze Information in Deformable Part Models}, author = {Shcherbatyi, Iaroslav and Bulling, Andreas and Fritz, Mario}, year = {2015}, pages = {1--14}, url = {https://arxiv.org/abs/1505.05753} } -
Labeled pupils in the wild: A dataset for studying pupil detection in unconstrained environments
Marc Tonsen, Xucong Zhang, Yusuke Sugano, Andreas Bulling
arXiv:1511.05768, pp. 1–4, 2015.
We present labelled pupils in the wild (LPW), a novel dataset of 66 high-quality, high-speed eye region videos for the development and evaluation of pupil detection algorithms. The videos in our dataset were recorded from 22 participants in everyday locations at about 95 FPS using a state-of-the-art dark-pupil head-mounted eye tracker. They cover people with different ethnicities, a diverse set of everyday indoor and outdoor illumination environments, as well as natural gaze direction distributions. The dataset also includes participants wearing glasses, contact lenses, as well as make-up. We benchmark five state-of-the-art pupil detection algorithms on our dataset with respect to robustness and accuracy. We further study the influence of image resolution, vision aids, as well as recording location (indoor, outdoor) on pupil detection performance. Our evaluations provide valuable insights into the general pupil detection problem and allow us to identify key challenges for robust pupil detection on head-mounted eye trackers.Paper: tonsen15_arxiv.pdfPaper Access: https://arxiv.org/abs/1511.05768@techreport{tonsen15_arxiv, title = {Labeled pupils in the wild: A dataset for studying pupil detection in unconstrained environments}, author = {Tonsen, Marc and Zhang, Xucong and Sugano, Yusuke and Bulling, Andreas}, year = {2015}, pages = {1--4}, url = {https://arxiv.org/abs/1511.05768} } -
Rendering of Eyes for Eye-Shape Registration and Gaze Estimation
Erroll Wood, Tadas Baltrušaitis, Xucong Zhang, Yusuke Sugano, Peter Robinson, Andreas Bulling
arXiv:1505.05916, pp. 1–9, 2015.
Images of the eye are key in several computer vision problems, such as shape registration and gaze estimation. Recent large-scale supervised methods for these problems require time-consuming data collection and manual annotation, which can be unreliable. We propose synthesizing perfectly labelled photo-realistic training data in a fraction of the time. We used computer graphics techniques to build a collection of dynamic eye-region models from head scan geometry. These were randomly posed to synthesize close-up eye images for a wide range of head poses, gaze directions, and illumination conditions. We used our model’s controllability to verify the importance of realistic illumination and shape variations in eye-region training data. Finally, we demonstrate the benefits of our synthesized training data (SynthesEyes) by out-performing state-of-the-art methods for eye-shape registration as well as cross-dataset appearance-based gaze estimation in the wild.Paper: wood15_arxiv.pdfPaper Access: https://arxiv.org/abs/1505.05916@techreport{wood15_arxiv, title = {Rendering of Eyes for Eye-Shape Registration and Gaze Estimation}, author = {Wood, Erroll and Baltru{\v{s}}aitis, Tadas and Zhang, Xucong and Sugano, Yusuke and Robinson, Peter and Bulling, Andreas}, year = {2015}, pages = {1--9}, url = {https://arxiv.org/abs/1505.05916} } -
Appearance-Based Gaze Estimation in the Wild
Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling
arXiv:1504.02863, pp. 1–10, 2015.
Appearance-based gaze estimation is believed to work well in real-world settings, but existing datasets have been collected under controlled laboratory conditions and methods have not been evaluated across multiple datasets. In this work we study appearance-based gaze estimation in the wild. We present the MPIIGaze dataset that contains 213,659 images we collected from 15 participants during natural everyday laptop use over more than three months. Our dataset is significantly more variable than existing ones with respect to appearance and illumination. We also present a method for in-the-wild appearance-based gaze estimation using multimodal convolutional neural networks that significantly outperforms state-of-the-art methods in the most challenging cross-dataset evaluation. We present an extensive evaluation of several state-of-the-art image-based gaze estimation algorithms on three current datasets, including our own. This evaluation provides clear insights and allows us to identify key research challenges of gaze estimation in the wild.Paper: zhang15_arxiv.pdfPaper Access: https://arxiv.org/abs/1504.02863@techreport{zhang15_arxiv, title = {Appearance-Based Gaze Estimation in the Wild}, author = {Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas}, year = {2015}, pages = {1--10}, url = {https://arxiv.org/abs/1504.02863} }
2014
Journal Articles
-
A Tutorial on Human Activity Recognition Using Body-worn Inertial Sensors
Andreas Bulling, Ulf Blanke, Bernt Schiele
ACM Computing Surveys, 46 (3), pp. 1–33, 2014.
The last 20 years have seen an ever increasing research activity in the field of human activity recognition. With activity recognition having considerably matured so did the number of challenges in designing, implementing and evaluating activity recognition systems. This tutorial aims to provide a comprehensive hands-on introduction for newcomers to the field of human activity recognition. It specifically focuses on activity recognition using on-body inertial sensors. We first discuss the key research challenges that human activity recognition shares with general pattern recognition and identify those challenges that are specific to human activity recognition. We then describe the concept of an activity recognition chain (ARC) as a general-purpose framework for designing and evaluating activity recognition systems. We detail each component of the framework, provide references to related research and introduce the best practise methods developed by the activity recognition research community. We conclude with the educational example problem of recognising different hand gestures from inertial sensors attached to the upper and lower arm. We illustrate how each component of this framework can be implemented for this specific activity recognition problem and demonstrate how different implementations compare and how they impact overall recognition performance.doi: 10.1145/2499621Paper: bulling14_csur.pdf@article{bulling14_csur, author = {Bulling, Andreas and Blanke, Ulf and Schiele, Bernt}, title = {A Tutorial on Human Activity Recognition Using Body-worn Inertial Sensors}, journal = {ACM Computing Surveys}, volume = {46}, number = {3}, year = {2014}, pages = {1--33}, doi = {10.1145/2499621} } -
Cognition-Aware Computing
Andreas Bulling, Thorsten O. Zander
IEEE Pervasive Computing, 13 (3), pp. 80–83, 2014.
Despite significant advances in context sensing and inference since its inception in the late 1990s, context-aware computing still doesn’t implement a holistic view of all covert aspects of the user state. Here, the authors introduce the concept of cognitive context as an extension to the current notion of context with a cognitive dimension. They argue that visual behavior and brain activity are two promising sensing modalities for assessing the cognitive context and thus the development of cognition-aware computing systems.doi: 10.1109/mprv.2014.42Paper: bulling14_pcm.pdf@article{bulling14_pcm, author = {Bulling, Andreas and Zander, Thorsten O.}, keywords = {bioinformatics, cognition, cognition-aware computing, Context modeling, Context-aware computing, electroencephalography, intelligent systems, Pervasive computing, Sensors, tracking, Visualization}, title = {Cognition-Aware Computing}, journal = {IEEE Pervasive Computing}, volume = {13}, number = {3}, year = {2014}, pages = {80-83}, doi = {10.1109/mprv.2014.42} } -
On the potential of human visual behaviour for memory augmentation and life logging
Dagstuhl Reports/14362, 2014.
@article{bulling14_dagstuhl, title = {On the potential of human visual behaviour for memory augmentation and life logging}, author = {Bulling, Andreas}, journal = {Dagstuhl Reports/14362}, year = {2014} }
Conference Papers
-
Test-time Adaptation for 3D Human Pose Estimation
Sikandar Amin, Mykhaylo Andriluka, Philipp Müller, Andreas Bulling
Proc. of the DAGM German Conference on Pattern Recognition (GCPR), pp. 253–264, 2014.
In this paper we consider the task of articulated 3D human pose estimation in challenging scenes with dynamic background and multiple people. Initial progress on this task has been achieved building on discriminatively trained part-based models that deliver a set of 2D body pose candidates that are then subsequently refined by reasoning in 3D [1, 4, 5]. The performance of such methods is limited by the performance of the underlying 2D pose estimation approaches. In this paper we explore a way to boost the performance of 2D pose estimation based on the output of the 3D pose reconstruction process, thus closing the loop in the pose estimation pipeline. We build our approach around a component that is able to identify true positive pose estimation hypotheses with high confidence. We then either retrain 2D pose estimation models using such highly confident hypotheses as additional training examples, or we use similarity to these hypotheses as a cue for 2D pose estimation. We consider a number of features that can be used for assessing the confidence of the pose estimation results. The strongest feature in our comparison corresponds to the ensemble agreement on the 3D pose output. We evaluate our approach on two publicly available datasets improving over state of the art in each case.Paper: amin14_gcpr.pdf