Collaborative Artificial Intelligence

Realising the vision and exploiting the full potential of collaborative artificial intelligence requires significant advances on four primary research challenges: (1) Computational sensing and modelling of everyday verbal and non-verbal human behaviour; (2) Integration of these behaviour models with data-driven and theoretical models of human cognition and perception; (3) Analysing, evaluating, and facilitating the fundamental mechanisms of effective and natural human-AI collaboration; (4) Integrating social and ethical aspects in all method developments. Successfully addressing these challenges requires an interdisciplinary approach and advancing methods in multimodal machine learning, computational cognitive modelling, computer vision, and human-machine interaction.

Below is a summary of a selection of research projects that our group has been working on towards addressing these challenges. A full list of publications is available here.

Human Behaviour Sensing and Modelling

The human ability to collaborate effectively is grounded in verbal and non-verbal behaviour. It is through our behaviour that we successfully perform tasks with different collaboration partners, for diverse purposes, and in different everyday contexts. Alongside the rich information content available in human language, non-verbal behaviour involving body language, facial expressions, or gaze is an essential, complementary communication channel for seamless coordination, negotiation, and social signalling. It is therefore of utmost importance for collaborative artificially intelligent systems to have similar abilities, i.e., to be able to (make) sense of human behaviour.

HOIMotion: Forecasting Human Motion During Human-Object Interactions Using Egocentric 3D Object Bounding Boxes

Zhiming Hu, Zheming Yin, Daniel Haeufle, Syn Schmitt, Andreas Bulling

IEEE Transactions on Visualization and Computer Graphics (TVCG), pp. 1–11, 2024.

Best Journal Paper Award

We present HOIMotion – a novel approach for human motion forecasting during human-object interactions that integrates information about past body poses and egocentric 3D object bounding boxes. Human motion forecasting is important in many augmented reality applications but most existing methods have only used past body poses to predict future motion. HOIMotion first uses an encoder-residual graph convolutional network (GCN) and multi-layer perceptrons to extract features from body poses and egocentric 3D object bounding boxes, respectively. Our method then fuses pose and object features into a novel pose-object graph and uses a residual-decoder GCN to forecast future body motion. We extensively evaluate our method on the Aria digital twin (ADT) and MoGaze datasets and show that HOIMotion consistently outperforms state-of-the-art methods by a large margin of up to 8.7% on ADT and 7.2% on MoGaze in terms of mean per joint position error. Complementing these evaluations, we report a human study (N=20) that shows that the improvements achieved by our method result in forecasted poses being perceived as both more precise and more realistic than those of existing methods. Taken together, these results reveal the significant information content available in egocentric 3D object bounding boxes for human motion forecasting and the effectiveness of our method in exploiting this information.

doi: 10.1109/TVCG.2024.3456161

Paper: hu24_ismar.pdf

@article{hu24_ismar, author = {Hu, Zhiming and Yin, Zheming and Haeufle, Daniel and Schmitt, Syn and Bulling, Andreas}, title = {HOIMotion: Forecasting Human Motion During Human-Object Interactions Using Egocentric 3D Object Bounding Boxes}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, year = {2024}, pages = {1--11}, doi = {10.1109/TVCG.2024.3456161} }
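
The HOIMotion results above are reported as mean per joint position error. As a minimal sketch of how that metric is commonly computed for a single pose, assuming joints are given as 3D coordinates in a shared unit (the function name and the example poses are hypothetical and not taken from the paper's code):

```python
import math


def mpjpe(predicted, target):
    """Mean per joint position error between two poses.

    Each pose is a list of (x, y, z) joint positions in the same unit.
    """
    if len(predicted) != len(target) or not predicted:
        raise ValueError("poses must be non-empty and have the same number of joints")
    # Euclidean distance per joint, averaged over all joints.
    distances = [math.dist(p, t) for p, t in zip(predicted, target)]
    return sum(distances) / len(distances)


# Hypothetical two-joint poses, for illustration only.
predicted_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
target_pose = [(0.0, 3.0, 4.0), (1.0, 0.0, 0.0)]
error = mpjpe(predicted_pose, target_pose)  # (5.0 + 0.0) / 2 = 2.5
```

For a forecast sequence, this per-pose value would typically be averaged over the predicted frames as well.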

Mouse2Vec: Learning Reusable Semantic Representations of Mouse Behaviour

Guanhua Zhang, Zhiming Hu, Mihai Bâce, Andreas Bulling

Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–17, 2024.

The mouse is a pervasive input device used for a wide range of interactive applications. However, computational modelling of mouse behaviour typically requires time-consuming design and extraction of handcrafted features, or approaches that are application-specific. We instead propose Mouse2Vec – a novel self-supervised method designed to learn semantic representations of mouse behaviour that are reusable across users and applications. Mouse2Vec uses a Transformer-based encoder-decoder architecture, which is specifically geared for mouse data: During pretraining, the encoder learns an embedding of input mouse trajectories while the decoder reconstructs the input and simultaneously detects mouse click events. We show that the representations learned by our method can identify interpretable mouse behaviour clusters and retrieve similar mouse trajectories. We also demonstrate on three sample downstream tasks that the representations can be practically used to augment mouse data for training supervised methods and serve as an effective feature extractor.

doi: 10.1145/3613904.3642141

Paper: zhang24_chi.pdf

Code: https://git.hcics.simtech.uni-stuttgart.de/public-projects/Mouse2Vec

@inproceedings{zhang24_chi, title = {Mouse2Vec: Learning Reusable Semantic Representations of Mouse Behaviour}, author = {Zhang, Guanhua and Hu, Zhiming and B{\^a}ce, Mihai and Bulling, Andreas}, year = {2024}, pages = {1--17}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, doi = {10.1145/3613904.3642141} }

Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention

Ekta Sood, Simon Tannert, Philipp Müller, Andreas Bulling

Advances in Neural Information Processing Systems (NeurIPS), pp. 1–15, 2020.

A lack of corpora has so far limited advances in integrating human gaze data as a supervisory signal in neural attention mechanisms for natural language processing (NLP). We propose a novel hybrid text saliency model (TSM) that, for the first time, combines a cognitive model of reading with explicit human gaze supervision in a single machine learning framework. We show on four different corpora that our hybrid TSM duration predictions are highly correlated with human gaze ground truth. We further propose a novel joint modelling approach to integrate the predictions of the TSM into the attention layer of a network designed for a specific upstream task without the need for task-specific human gaze data. We demonstrate that our joint model outperforms the state of the art in paraphrase generation on the Quora Question Pairs corpus by more than 10% in BLEU-4 and achieves state-of-the-art performance for sentence compression on the challenging Google Sentence Compression corpus. As such, our work introduces a practical approach for bridging between data-driven and cognitive models and demonstrates a new way to integrate human gaze-guided neural attention into NLP tasks.

Paper: sood20_neurips.pdf

Code: https://git.hcics.simtech.uni-stuttgart.de/public-projects/human-gaze-guided-neural-attention-for-nlp

Supplementary Material: sood20_neurips_sup.pdf

Paper Access: https://proceedings.neurips.cc/paper/2020/hash/460191c72f67e90150a093b4585e7eb4-Abstract.html

@inproceedings{sood20_neurips, author = {Sood, Ekta and Tannert, Simon and Müller, Philipp and Bulling, Andreas}, title = {Improving Natural Language Processing Tasks with Human Gaze-Guided Neural Attention}, year = {2020}, pages = {1--15}, booktitle = {Advances in Neural Information Processing Systems (NeurIPS)}, url = {https://proceedings.neurips.cc/paper/2020/hash/460191c72f67e90150a093b4585e7eb4-Abstract.html} }

MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation

Xucong Zhang, Yusuke Sugano, Mario Fritz, Andreas Bulling

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 41 (1), pp. 162-175, 2019.

Learning-based methods are believed to work well for unconstrained gaze estimation, i.e. gaze estimation from a monocular RGB camera without assumptions regarding user, environment, or camera. However, current gaze datasets were collected under laboratory conditions and methods were not evaluated across multiple datasets. Our work makes three contributions towards addressing these limitations. First, we present the MPIIGaze dataset, which contains 213,659 full face images and corresponding ground-truth gaze positions collected from 15 users during everyday laptop use over several months. An experience sampling approach ensured continuous gaze and head poses and realistic variation in eye appearance and illumination. To facilitate cross-dataset evaluations, 37,667 images were manually annotated with eye corners, mouth corners, and pupil centres. Second, we present an extensive evaluation of state-of-the-art gaze estimation methods on three current datasets, including MPIIGaze. We study key challenges including target gaze range, illumination conditions, and facial appearance variation. We show that image resolution and the use of both eyes affect gaze estimation performance, while head pose and pupil centre information are less informative. Finally, we propose GazeNet, the first deep appearance-based gaze estimation method. GazeNet improves on the state of the art by 22% (from a mean error of 13.9 degrees to 10.8 degrees) for the most challenging cross-dataset evaluation.

doi: 10.1109/TPAMI.2017.2778103

Paper: zhang19_pami.pdf

@article{zhang19_pami, title = {MPIIGaze: Real-World Dataset and Deep Appearance-Based Gaze Estimation}, author = {Zhang, Xucong and Sugano, Yusuke and Fritz, Mario and Bulling, Andreas}, year = {2019}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, doi = {10.1109/TPAMI.2017.2778103}, pages = {162-175}, volume = {41}, number = {1} }
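
The 22% figure quoted for GazeNet can be checked directly from the two mean errors in the abstract. The sketch below also shows one common way to measure gaze error, assuming it is the angle between a predicted and a ground-truth 3D gaze direction; both function names are hypothetical and not part of the MPIIGaze release.

```python
import math


def angular_error_deg(gaze_a, gaze_b):
    """Angle in degrees between two 3D gaze direction vectors."""
    dot = sum(a * b for a, b in zip(gaze_a, gaze_b))
    norm_a = math.sqrt(sum(a * a for a in gaze_a))
    norm_b = math.sqrt(sum(b * b for b in gaze_b))
    # Clamp to [-1, 1] to guard against floating-point drift before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_a * norm_b)))
    return math.degrees(math.acos(cosine))


def relative_improvement(baseline_error, new_error):
    """Fractional error reduction relative to the baseline."""
    return (baseline_error - new_error) / baseline_error


# Mean errors quoted in the abstract: 13.9 degrees down to 10.8 degrees.
improvement = relative_improvement(13.9, 10.8)
print(f"{improvement:.1%}")  # prints 22.3%
```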

Eye Movement Analysis for Activity Recognition Using Electrooculography

Andreas Bulling, Jamie A. Ward, Hans Gellersen, Gerhard Tröster

IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 33 (4), pp. 741-753, 2011.

In this work we investigate eye movement analysis as a new sensing modality for activity recognition. Eye movement data was recorded using an electrooculography (EOG) system. We first describe and evaluate algorithms for detecting three eye movement characteristics from EOG signals - saccades, fixations, and blinks - and propose a method for assessing repetitive patterns of eye movements. We then devise 90 different features based on these characteristics and select a subset of them using minimum redundancy maximum relevance feature selection (mRMR). We validate the method using an eight participant study in an office environment using an example set of five activity classes: copying a text, reading a printed paper, taking hand-written notes, watching a video, and browsing the web. We also include periods with no specific activity (the NULL class). Using a support vector machine (SVM) classifier and a person-independent (leave-one-out) training scheme, we obtain an average precision of 76.1% and recall of 70.5% over all classes and participants. The work demonstrates the promise of eye-based activity recognition (EAR) and opens up discussion on the wider applicability of EAR to other activities that are difficult, or even impossible, to detect using common sensing modalities.

doi: 10.1109/TPAMI.2010.86

Paper: bulling11_pami.pdf

@article{bulling11_pami, author = {Bulling, Andreas and Ward, Jamie A. and Gellersen, Hans and Tr{\"{o}}ster, Gerhard}, keywords = {Feature evaluation and selection, signal processing, Ubiquitous computing}, title = {Eye {M}ovement {A}nalysis for {A}ctivity {R}ecognition {U}sing {E}lectrooculography}, journal = {IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI)}, volume = {33}, number = {4}, year = {2011}, pages = {741-753}, doi = {10.1109/TPAMI.2010.86} }
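
The person-independent evaluation described above can be illustrated with a small sketch. It shows a leave-one-participant-out split and per-class precision and recall; the classifier itself is omitted, and the function names and example labels are hypothetical rather than taken from the paper.

```python
def leave_one_participant_out(participant_ids):
    """Yield (held_out, train_indices, test_indices) for each participant."""
    for held_out in sorted(set(participant_ids)):
        train = [i for i, p in enumerate(participant_ids) if p != held_out]
        test = [i for i, p in enumerate(participant_ids) if p == held_out]
        yield held_out, train, test


def precision_recall(true_labels, predicted_labels, positive_class):
    """Precision and recall for one class from paired label lists."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive_class and p == positive_class)
    fp = sum(1 for t, p in pairs if t != positive_class and p == positive_class)
    fn = sum(1 for t, p in pairs if t == positive_class and p != positive_class)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Hypothetical labels, for illustration only.
truth = ["read", "read", "browse", "NULL"]
predicted = ["read", "browse", "browse", "NULL"]
precision, recall = precision_recall(truth, predicted, "read")  # (1.0, 0.5)
```

In such a scheme, a model would be trained on the `train` indices of each split and scored on the held-out participant, with the per-class scores then averaged over classes and participants.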

Computational Cognitive Modelling

Successful collaboration between humans is deeply grounded in our ability to form and maintain mental models of our interaction partners, and to robustly predict their goals, intentions, and beliefs (also known as Theory of Mind). We further rely on our ability to predict what others are attending to, and what they are likely to know or remember. These abilities not only allow us to anticipate others’ behaviour, but also to behave pro-actively ourselves, greatly improving the robustness, efficiency, and seamlessness of our interactions with others. Despite their importance, however, research on instilling similar cognitive modelling capabilities in machines is still in its infancy.

Neural Reasoning About Agents’ Goals, Preferences, and Actions

Matteo Bortoletto, Lei Shi, Andreas Bulling

Proc. 38th AAAI Conference on Artificial Intelligence (AAAI), pp. 456–464, 2024.

We propose the Intuitive Reasoning Network (IRENE) – a novel neural model for intuitive psychological reasoning about agents’ goals, preferences, and actions that can generalise previous experiences to new situations. IRENE combines a graph neural network for learning agent and world state representations with a transformer to encode the task context. When evaluated on the challenging Baby Intuitions Benchmark, IRENE achieves new state-of-the-art performance on three out of its five tasks – with up to 48.9 % improvement. In contrast to existing methods, IRENE is able to bind preferences to specific agents, to better distinguish between rational and irrational agents, and to better understand the role of blocking obstacles. We also investigate, for the first time, the influence of the training tasks on test performance. Our analyses demonstrate the effectiveness of IRENE in combining prior knowledge gained during training for unseen evaluation tasks.

doi: 10.1609/aaai.v38i1.27800

Paper: bortoletto24_aaai.pdf

Code: https://git.hcics.simtech.uni-stuttgart.de/public-projects/IRENE

@inproceedings{bortoletto24_aaai, author = {Bortoletto, Matteo and Shi, Lei and Bulling, Andreas}, title = {Neural Reasoning About Agents’ Goals, Preferences, and Actions}, booktitle = {Proc. 38th AAAI Conference on Artificial Intelligence (AAAI)}, year = {2024}, volume = {38}, number = {1}, pages = {456--464}, doi = {10.1609/aaai.v38i1.27800} }

VSA4VQA: Scaling A Vector Symbolic Architecture To Visual Question Answering on Natural Images

Anna Penzkofer, Lei Shi, Andreas Bulling

Proc. 46th Annual Meeting of the Cognitive Science Society (CogSci), 2024.

Oral Presentation

While Vector Symbolic Architectures (VSAs) are promising for modelling spatial cognition, their application is currently limited to artificially generated images and simple spatial queries. We propose VSA4VQA – a novel 4D implementation of VSAs that implements a mental representation of natural images for the challenging task of Visual Question Answering (VQA). VSA4VQA is the first model to scale a VSA to complex spatial queries. Our method is based on the Semantic Pointer Architecture (SPA) to encode objects in a hyper-dimensional vector space. To encode natural images, we extend the SPA to include dimensions for objects’ width and height in addition to their spatial location. To perform spatial queries, we further introduce learned spatial query masks and integrate a pre-trained vision-language model for answering attribute-related questions. We evaluate our method on the GQA benchmark dataset and show that it can effectively encode natural images, achieving performance competitive with state-of-the-art deep learning methods for zero-shot VQA.

Paper: penzkofer24_cogsci.pdf

Code: https://git.hcics.simtech.uni-stuttgart.de/public-projects/VSA4VQA

Paper Access: https://escholarship.org/uc/item/26j7v1nf

@inproceedings{penzkofer24_cogsci, author = {Penzkofer, Anna and Shi, Lei and Bulling, Andreas}, title = {{VSA4VQA}: {Scaling} {A} {Vector} {Symbolic} {Architecture} {To} {Visual} {Question} {Answering} on {Natural} {Images}}, booktitle = {Proc. 46th Annual Meeting of the Cognitive Science Society (CogSci)}, year = {2024}, volume = {46}, url = {https://escholarship.org/uc/item/26j7v1nf} }

Scanpath Prediction on Information Visualisations

Yao Wang, Mihai Bâce, Andreas Bulling

IEEE Transactions on Visualization and Computer Graphics (TVCG), 30 (7), pp. 3902–3914, 2023.

We propose Unified Model of Saliency and Scanpaths (UMSS) – a model that learns to predict multi-duration saliency and scanpaths (i.e. sequences of eye fixations) on information visualisations. Although scanpaths provide rich information about the importance of different visualisation elements during the visual exploration process, prior work has been limited to predicting aggregated attention statistics, such as visual saliency. We present in-depth analyses of gaze behaviour for different information visualisation elements (e.g. Title, Label, Data) on the popular MASSVIS dataset. We show that while, overall, gaze patterns are surprisingly consistent across visualisations and viewers, there are also structural differences in gaze dynamics for different elements. Informed by our analyses, UMSS first predicts multi-duration element-level saliency maps, then probabilistically samples scanpaths from them. Extensive experiments on MASSVIS show that our method consistently outperforms state-of-the-art methods with respect to several widely used scanpath and saliency evaluation metrics. Our method achieves a relative improvement in sequence score of 11.5 % for scanpath prediction, and a relative improvement in Pearson correlation coefficient of up to 23.6 % for saliency prediction. These results are auspicious and point towards richer user models and simulations of visual attention on visualisations without the need for any eye tracking equipment.

doi: 10.1109/TVCG.2023.3242293

Paper: wang23_tvcg.pdf

Supplementary Material: wang23_tvcg_sup.pdf

Dataset: https://darus.uni-stuttgart.de/dataset.xhtml?persistentId=doi:10.18419/darus-3361

@article{wang23_tvcg, title = {Scanpath Prediction on Information Visualisations}, author = {Wang, Yao and Bâce, Mihai and Bulling, Andreas}, year = {2023}, journal = {IEEE Transactions on Visualization and Computer Graphics (TVCG)}, volume = {30}, number = {7}, pages = {3902--3914}, doi = {10.1109/TVCG.2023.3242293} }
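
The idea of probabilistically sampling scanpaths from a saliency map, as described for UMSS above, can be sketched in a much simplified form. This is not the UMSS sampling procedure: it assumes a single 2D saliency map given as nested lists, draws fixations independently in proportion to saliency, and ignores element types and fixation durations. The function name and parameters are hypothetical.

```python
import random


def sample_scanpath(saliency_map, num_fixations, seed=None):
    """Sample fixation positions (row, col) in proportion to saliency."""
    rng = random.Random(seed)
    # Flatten the map into parallel lists of cell coordinates and weights.
    cells = [(r, c) for r, row in enumerate(saliency_map) for c, _ in enumerate(row)]
    weights = [value for row in saliency_map for value in row]
    if sum(weights) <= 0:
        raise ValueError("saliency map must contain positive mass")
    # Weighted sampling with replacement; order of draws forms the scanpath.
    return rng.choices(cells, weights=weights, k=num_fixations)


# Hypothetical 2x3 saliency map, for illustration only.
toy_map = [
    [0.0, 0.1, 0.0],
    [0.0, 0.8, 0.1],
]
scanpath = sample_scanpath(toy_map, num_fixations=4, seed=0)
```

Cells with zero saliency are never selected, and cells with higher saliency are visited more often on average.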

Improving Neural Saliency Prediction with a Cognitive Model of Human Visual Attention

Ekta Sood, Lei Shi, Matteo Bortoletto, Yao Wang, Philipp Müller, Andreas Bulling

Proc. 45th Annual Meeting of the Cognitive Science Society (CogSci), pp. 3639–3646, 2023.

We present a novel method for saliency prediction that leverages a cognitive model of visual attention as an inductive bias. This approach is in stark contrast to recent purely data-driven saliency models that achieve performance improvements mainly by increased capacity, resulting in high computational costs and the need for large-scale training datasets. We demonstrate that by using a cognitive model, our method achieves competitive performance to the state of the art across several natural image datasets while only requiring a fraction of the parameters. Furthermore, we set the new state of the art for saliency prediction on information visualizations, demonstrating the effectiveness of our approach for cross-domain generalization. We further provide augmented versions of the full MSCOCO dataset with synthetic gaze data using the cognitive model, which we used to pre-train our method. Our results are highly promising and underline the significant potential of bridging between cognitive and data-driven models, potentially also beyond attention.

Paper: sood23_cogsci.pdf

Code: https://git.hcics.simtech.uni-stuttgart.de/public-projects/neural-saliency-prediction-with-a-cognitive-model/

Supplementary Material: sood23_cogsci_sup.pdf

Dataset: https://collaborative-ai.org/research/datasets/MSCOCOEMMAFigureQAEMMA/

@inproceedings{sood23_cogsci, author = {Sood, Ekta and Shi, Lei and Bortoletto, Matteo and Wang, Yao and Müller, Philipp and Bulling, Andreas}, title = {Improving Neural Saliency Prediction with a Cognitive Model of Human Visual Attention}, booktitle = {Proc. 45th Annual Meeting of the Cognitive Science Society (CogSci)}, year = {2023}, pages = {3639--3646} }

Neural Photofit: Gaze-based Mental Image Reconstruction

Florian Strohm, Ekta Sood, Sven Mayer, Philipp Müller, Mihai Bâce, Andreas Bulling

Proc. IEEE International Conference on Computer Vision (ICCV), pp. 245-254, 2021.

We propose a novel method that leverages human fixations to visually decode the image a person has in mind into a photofit (facial composite). Our method combines three neural networks: An encoder, a scoring network, and a decoder. The encoder extracts image features and predicts a neural activation map for each face looked at by a human observer. A neural scoring network compares the human and neural attention and predicts a relevance score for each extracted image feature. Finally, image features are aggregated into a single feature vector as a linear combination of all features weighted by relevance which a decoder decodes into the final photofit. We train the neural scoring network on a novel dataset containing gaze data of 19 participants looking at collages of synthetic faces. We show that our method significantly outperforms a mean baseline predictor and report on a human study that shows that we can decode photofits that are visually plausible and close to the observer’s mental image. Code and dataset available upon request.

doi: 10.1109/ICCV48922.2021.00031

Paper: strohm21_iccv.pdf

Code: Available upon request.

Dataset: Available upon request.

@inproceedings{strohm21_iccv, title = {Neural Photofit: Gaze-based Mental Image Reconstruction}, author = {Strohm, Florian and Sood, Ekta and Mayer, Sven and Müller, Philipp and Bâce, Mihai and Bulling, Andreas}, year = {2021}, booktitle = {Proc. IEEE International Conference on Computer Vision (ICCV)}, doi = {10.1109/ICCV48922.2021.00031}, pages = {245-254} }
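
The aggregation step in the Neural Photofit abstract, a linear combination of image features weighted by relevance, can be sketched as follows. This is an illustration only: it assumes features are plain lists of floats and that relevance scores are used as given, since the abstract does not say whether they are normalised. The function name is hypothetical.

```python
def aggregate_features(features, relevance_scores):
    """Combine feature vectors into one vector, weighted by relevance."""
    if len(features) != len(relevance_scores) or not features:
        raise ValueError("need one relevance score per feature vector")
    dim = len(features[0])
    aggregated = [0.0] * dim
    # Accumulate score * feature for every face the observer looked at.
    for vector, score in zip(features, relevance_scores):
        for i, value in enumerate(vector):
            aggregated[i] += score * value
    return aggregated


# Hypothetical two-dimensional features for two faces.
face_features = [[1.0, 0.0], [0.0, 2.0]]
scores = [0.5, 0.25]
combined = aggregate_features(face_features, scores)  # [0.5, 0.5]
```

In the described method, a vector like `combined` would then be passed to the decoder to produce the photofit.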

Mechanisms of Human-AI Collaboration

Collaboration is a complex, iterative process in which two or more interaction partners engage in a multi-round interactive dialogue, aiming to achieve a shared goal and in which the outcomes typically go beyond what each partner could achieve on their own. Enabling machines to successfully participate in such interactions as equal partners requires understanding and computationally replicating the mechanisms of successful collaboration. This involves, for example, keeping track of the dialogue state by integrating multimodal information, or the ability to adapt and generalise to arbitrary interaction partners.

Limits of Theory of Mind Modelling in Dialogue-Based Collaborative Plan Acquisition

Matteo Bortoletto, Constantin Ruhdorfer, Adnen Abdessaied, Lei Shi, Andreas Bulling

Proc. 62nd Annual Meeting of the Association for Computational Linguistics (ACL), pp. 1–16, 2024.

Recent work on dialogue-based collaborative plan acquisition (CPA) has suggested that Theory of Mind (ToM) modelling can improve missing knowledge prediction in settings with asymmetric skill-sets and knowledge. Although ToM was claimed to be important for effective collaboration, its real impact on this novel task remains under-explored. By representing plans as graphs and by exploiting task-specific constraints we show that, as performance on CPA nearly doubles when predicting one’s own missing knowledge, the improvements due to ToM modelling diminish. This phenomenon persists even when evaluating existing baseline methods. To better understand the relevance of ToM for CPA, we report a principled performance comparison of models with and without ToM features. Results across different models and ablations consistently suggest that learned ToM features are indeed more likely to reflect latent patterns in the data with no perceivable link to ToM. This finding calls for a deeper understanding of the role of ToM in CPA and beyond, as well as new methods for modelling and evaluating mental states in computational collaborative agents.

Paper: bortoletto24_acl.pdf

Code: https://git.hcics.simtech.uni-stuttgart.de/public-projects/limits-of-tom

@inproceedings{bortoletto24_acl, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Abdessaied, Adnen and Shi, Lei and Bulling, Andreas}, title = {Limits of Theory of Mind Modelling in Dialogue-Based Collaborative Plan Acquisition}, booktitle = {Proc. 62nd Annual Meeting of the Association for Computational Linguistics (ACL)}, year = {2024}, pages = {1--16} }

Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions

Matteo Bortoletto, Constantin Ruhdorfer, Lei Shi, Andreas Bulling

Proc. 27th European Conference on Artificial Intelligence (ECAI), pp. 866–873, 2024.

We propose MToMnet - a Theory of Mind (ToM) neural network for predicting beliefs and their dynamics during human social interactions from multimodal input. ToM is key for effective nonverbal human communication and collaboration, yet existing methods for belief modelling have not included explicit ToM modelling or have typically been limited to one or two modalities. MToMnet encodes contextual cues (scene videos and object locations) and integrates them with person-specific cues (human gaze and body language) in a separate MindNet for each person. Inspired by prior research on social cognition and computational ToM, we propose three different MToMnet variants: two involving fusion of latent representations and one involving re-ranking of classification scores. We evaluate our approach on two challenging real-world datasets, one focusing on belief prediction and the other on belief dynamics prediction. Our results demonstrate that MToMnet surpasses existing methods by a large margin while at the same time requiring a significantly smaller number of parameters. Taken together, our method opens up a highly promising direction for future work on artificially intelligent systems that can robustly predict human beliefs from their non-verbal behaviour and, as such, more effectively collaborate with humans.

doi: 10.3233/FAIA240573

Paper: bortoletto24_ecai.pdf

@inproceedings{bortoletto24_ecai, author = {Bortoletto, Matteo and Ruhdorfer, Constantin and Shi, Lei and Bulling, Andreas}, title = {Explicit Modelling of Theory of Mind for Belief Prediction in Nonverbal Social Interactions}, booktitle = {Proc. 27th European Conference on Artificial Intelligence (ECAI)}, year = {2024}, pages = {866--873}, doi = {10.3233/FAIA240573} }

Multi-Modal Video Dialog State Tracking in the Wild

Adnen Abdessaied, Lei Shi, Andreas Bulling

Proc. 18th European Conference on Computer Vision (ECCV), pp. 1–25, 2024.

We present MST-MIXER – a novel video dialog model operating over a generic multi-modal state tracking scheme. Current models that claim to perform multi-modal state tracking fall short in two major respects: they either (1) track only one modality (mostly the visual input) or (2) target synthetic datasets that do not reflect the complexity of real-world, in-the-wild scenarios. Our model addresses these two limitations in an attempt to close this crucial research gap. Specifically, MST-MIXER first tracks the most important constituents of each input modality. Then, it predicts the missing underlying structure of the selected constituents of each modality by learning local latent graphs using a novel multi-modal graph structure learning method. Subsequently, the learned local graphs and features are parsed together to form a global graph operating on the mix of all modalities, which further refines its structure and node embeddings. Finally, the fine-grained graph node features are used to enhance the hidden states of the backbone Vision-Language Model (VLM). MST-MIXER achieves new state-of-the-art results on five challenging benchmarks.

Paper: abdessaied24_eccv.pdf

Code: https://git.hcics.simtech.uni-stuttgart.de/public-projects/MST-MIXER

@inproceedings{abdessaied24_eccv, author = {Abdessaied, Adnen and Shi, Lei and Bulling, Andreas}, title = {Multi-Modal Video Dialog State Tracking in the Wild}, booktitle = {Proc. 18th European Conference on Computer Vision (ECCV)}, year = {2024}, pages = {1--25} }

VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal GRaphs

Adnen Abdessaied, Lei Shi, Andreas Bulling

Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 5805–5814, 2024.

We propose VD-GR – a novel visual dialog model that combines pre-trained language models (LMs) with graph neural networks (GNNs). Prior works mainly focused on one class of models at the expense of the other, thus missing out on the opportunity of combining their respective benefits. At the core of VD-GR is a novel integration mechanism that alternates between spatial-temporal multi-modal GNNs and BERT layers, and that covers three distinct contributions: First, we use multi-modal GNNs to process the features of each modality (image, question, and dialog history) and exploit their local structures before performing BERT global attention. Second, we propose hub-nodes that link to all other nodes within one modality graph, allowing the model to propagate information from one GNN (modality) to the other in a cascaded manner. Third, we augment the BERT hidden states with fine-grained multi-modal GNN features before passing them to the next VD-GR layer. Evaluations on VisDial v1.0, VisDial v0.9, VisDialConv, and VisPro show that VD-GR achieves new state-of-the-art results across all four datasets.

Paper: abdessaied24_wacv.pdf

Code: https://git.hcics.simtech.uni-stuttgart.de/public-projects/VDGR

@inproceedings{abdessaied24_wacv, author = {Abdessaied, Adnen and Shi, Lei and Bulling, Andreas}, title = {VD-GR: Boosting Visual Dialog with Cascaded Spatial-Temporal Multi-Modal GRaphs}, booktitle = {Proc. IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)}, year = {2024}, pages = {5805--5814} }

The Overcooked Generalisation Challenge

Constantin Ruhdorfer, Matteo Bortoletto, Anna Penzkofer, Andreas Bulling

arXiv:2406.17949, pp. 1–25, 2024.

We introduce the Overcooked Generalisation Challenge (OGC) - the first benchmark to study agents’ zero-shot cooperation abilities when faced with novel partners and levels in the Overcooked-AI environment. This perspective starkly contrasts a large body of previous work that has trained and evaluated cooperating agents only on the same level, failing to capture generalisation abilities required for real-world human-AI cooperation. Our challenge interfaces with state-of-the-art dual curriculum design (DCD) methods to generate auto-curricula for training general agents in Overcooked. It is the first cooperative multi-agent environment specially designed for DCD methods and, consequently, the first benchmarked with state-of-the-art methods. It is fully GPU-accelerated, built on the DCD benchmark suite minimax, and freely available under an open-source license. We show that current DCD algorithms struggle to produce useful policies in this novel challenge, even if combined with recent network architectures that were designed for scalability and generalisability. The OGC pushes the boundaries of real-world human-AI cooperation by enabling the research community to study the impact of generalisation on cooperating agents.

Code: https://git.hcics.simtech.uni-stuttgart.de/public-projects/OGC

Paper Access: https://arxiv.org/abs/2406.17949

@techreport{ruhdorfer2024_arxiv, title = {The Overcooked Generalisation Challenge}, author = {Ruhdorfer, Constantin and Bortoletto, Matteo and Penzkofer, Anna and Bulling, Andreas}, year = {2024}, pages = {1--25}, url = {https://arxiv.org/abs/2406.17949} }

Social/societal aspects of CAI

As artificial intelligent systems become more deeply embedded in our lives and collaborate with us in an increasing number of everyday situations, the social and ethical implications of their actions, the explainability of their behaviour, and the privacy of the information they obtain in interactions will become crucial. Privacy-preserving methods are essential to protect users' sensitive information, particularly as collaboration may reveal or require sharing personal information. Ethical frameworks ensure that collaborative AI systems align with societal values, addressing biases, fairness, and accountability in decision-making.

Mindful Explanations: Prevalence and Impact of Mind Attribution in XAI Research

Susanne Hindennach, Lei Shi, Filip Miletic, Andreas Bulling

Proc. ACM on Human-Computer Interaction (PACM HCI), 8 (CSCW), pp. 1–42, 2024.

Best Paper Honourable Mention Award

When users perceive AI systems as mindful, independent agents, they hold them responsible instead of the AI experts who created and designed these systems. So far, it has not been studied whether explanations support this shift in responsibility through the use of mind-attributing verbs like "to think". To better understand the prevalence of mind-attributing explanations, we analyse AI explanations in 3,533 explainable AI (XAI) research articles from the Semantic Scholar Open Research Corpus (S2ORC). Using methods from semantic shift detection, we identify three dominant types of mind attribution: (1) metaphorical (e.g. "to learn" or "to predict"), (2) awareness (e.g. "to consider"), and (3) agency (e.g. "to make decisions"). We then analyse the impact of mind-attributing explanations on awareness and responsibility in a vignette-based experiment with 199 participants. We find that participants who were given a mind-attributing explanation were more likely to rate the AI system as aware of the harm it caused. Moreover, the mind-attributing explanation had a responsibility-concealing effect: Considering the AI experts’ involvement led to reduced ratings of AI responsibility for participants who were given a non-mind-attributing or no explanation. In contrast, participants who read the mind-attributing explanation still held the AI system responsible despite considering the AI experts’ involvement. Taken together, our work underlines the need to carefully phrase explanations about AI systems in scientific writing to reduce mind attribution and clearly communicate human responsibility.

doi: 10.1145/3641009

Paper: hindennach24_pacm.pdf

Paper Access: https://medium.com/acm-cscw/be-mindful-when-using-mindful-descriptions-in-explanations-about-ai-bfc7666885c6

@article{hindennach24_pacm, title = {Mindful Explanations: Prevalence and Impact of Mind Attribution in XAI Research}, author = {Hindennach, Susanne and Shi, Lei and Miletic, Filip and Bulling, Andreas}, year = {2024}, pages = {1--42}, volume = {8}, number = {CSCW}, doi = {10.1145/3641009}, journal = {Proc. ACM on Human-Computer Interaction (PACM HCI)}, url = {https://medium.com/acm-cscw/be-mindful-when-using-mindful-descriptions-in-explanations-about-ai-bfc7666885c6} }

PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation

Mayar Elfares, Pascal Reisert, Zhiming Hu, Wenwu Tang, Ralf Küsters, Andreas Bulling

Proc. ACM on Human-Computer Interaction (PACM HCI), 8 (ETRA), pp. 1–23, 2024.

The latest gaze estimation methods require large-scale training data, but their collection and exchange pose significant privacy risks. We propose PrivatEyes - the first privacy-enhancing training approach for appearance-based gaze estimation based on federated learning (FL) and secure multi-party computation (MPC). PrivatEyes enables training gaze estimators on multiple local datasets across different users and server-based secure aggregation of the individual estimators’ updates. PrivatEyes guarantees that individual gaze data remains private even if a majority of the aggregating servers is malicious. We also introduce a new data leakage attack, DualView, that shows that PrivatEyes limits the leakage of private training data more effectively than previous approaches. Evaluations on the MPIIGaze, MPIIFaceGaze, GazeCapture, and NVGaze datasets further show that the improved privacy does not lead to a lower gaze estimation accuracy or substantially higher computational costs - both of which are on par with those of its non-secure counterparts.
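
To make the notion of server-based secure aggregation more concrete, the following is a minimal sketch using additive secret sharing, one standard MPC building block. It is not the PrivatEyes protocol: it only protects against servers that follow the protocol but are curious, whereas the guarantee against malicious servers stated in the abstract needs additional machinery. `MODULUS`, `SCALE`, and all function names are hypothetical choices made for illustration.

```python
import secrets

MODULUS = 2**61 - 1  # assumption: size of the finite ring used for shares
SCALE = 10**6        # assumption: fixed-point precision for real-valued updates


def encode(x):
    """Map a real number to a fixed-point element of the ring."""
    return round(x * SCALE) % MODULUS


def decode(v):
    """Map a ring element back to a real number (upper half encodes negatives)."""
    if v > MODULUS // 2:
        v -= MODULUS
    return v / SCALE


def share(value, n_servers):
    """Split a ring element into n random-looking shares that sum to the value."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_average(client_updates, n_servers=3):
    """Average client update vectors without any server seeing a single update."""
    dim = len(client_updates[0])
    # Each server only ever accumulates shares, never plaintext updates.
    server_sums = [[0] * dim for _ in range(n_servers)]
    for update in client_updates:
        for d, x in enumerate(update):
            for s, sh in enumerate(share(encode(x), n_servers)):
                server_sums[s][d] = (server_sums[s][d] + sh) % MODULUS
    # Combining the per-server sums reveals only the aggregate.
    totals = [sum(server_sums[s][d] for s in range(n_servers)) % MODULUS
              for d in range(dim)]
    return [decode(t) / len(client_updates) for t in totals]


# Two hypothetical clients, each holding a two-dimensional model update.
average = secure_average([[0.5, -1.0], [1.5, 3.0]])
```

Because every individual share is uniformly random, a single server learns nothing about one client's update; only the sum over all servers is meaningful.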

doi: 10.1145/3655606

Paper: elfares24_etra.pdf

@article{elfares24_etra, title = {PrivatEyes: Appearance-based Gaze Estimation Using Federated Secure Multi-Party Computation}, author = {Elfares, Mayar and Reisert, Pascal and Hu, Zhiming and Tang, Wenwu and Küsters, Ralf and Bulling, Andreas}, year = {2024}, journal = {Proc. ACM on Human-Computer Interaction (PACM HCI)}, pages = {1--23}, volume = {8}, number = {ETRA}, doi = {10.1145/3655606} }

Impact of Privacy Protection Methods of Lifelogs on Remembered Memories

Passant Elagroudy, Mohamed Khamis, Florian Mathis, Diana Irmscher, Ekta Sood, Andreas Bulling, Albrecht Schmidt

Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI), pp. 1–10, 2023.

Lifelogging is traditionally used for memory augmentation. However, recent research shows that users’ trust in the completeness and accuracy of lifelogs might skew their memories. Privacy-protection alterations such as body blurring and content deletion are commonly applied to photos to circumvent capturing sensitive information. However, their impact on how users remember memories remains unclear. To this end, we conduct a white-hat memory attack and report on an iterative experiment (N=21) to compare the impact of viewing 1) unaltered lifelogs, 2) blurred lifelogs, and 3) a subset of the lifelogs after deleting private ones, on confidently remembering memories. Findings indicate that all the privacy methods impact memories’ quality similarly and that users tend to change their answers more in recognition than in recall scenarios. Results also show that users have high confidence in their remembered content across all privacy methods. Our work raises awareness about the mindful designing of technological interventions.

doi: 10.1145/3544548.3581565

Paper: elagroudy23_chi.pdf

@inproceedings{elagroudy23_chi, author = {Elagroudy, Passant and Khamis, Mohamed and Mathis, Florian and Irmscher, Diana and Sood, Ekta and Bulling, Andreas and Schmidt, Albrecht}, title = {Impact of Privacy Protection Methods of Lifelogs on Remembered Memories}, booktitle = {Proc. ACM SIGCHI Conference on Human Factors in Computing Systems (CHI)}, year = {2023}, doi = {10.1145/3544548.3581565}, pages = {1--10} }

PrivacEye: Privacy-Preserving Head-Mounted Eye Tracking Using Egocentric Scene Image and Eye Movement Features

Julian Steil, Marion Koelle, Wilko Heuten, Susanne Boll, Andreas Bulling

Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–10, 2019.

Best Video Award

Eyewear devices, such as augmented reality displays, increasingly integrate eye tracking but the first-person camera required to map a user’s gaze to the visual scene can pose a significant threat to user and bystander privacy. We present PrivacEye, a method to detect privacy-sensitive everyday situations and automatically enable and disable the eye tracker’s first-person camera using a mechanical shutter. To close the shutter in privacy-sensitive situations, the method uses a deep representation of the first-person video combined with rich features that encode users’ eye movements. To open the shutter without visual input, PrivacEye detects changes in users’ eye movements alone to gauge changes in the "privacy level" of the current situation. We evaluate our method on a first-person video dataset recorded in daily life situations of 17 participants, annotated by themselves for privacy sensitivity, and show that our method is effective in preserving privacy in this challenging setting.

doi: 10.1145/3314111.3319913

Paper: steil19_etra.pdf

Supplementary Material: steil19_etra_sup.pdf

Video: https://www.youtube.com/watch?v=Gy61255F8T8

@inproceedings{steil19_etra, title = {PrivacEye: Privacy-Preserving Head-Mounted Eye Tracking Using Egocentric Scene Image and Eye Movement Features}, author = {Steil, Julian and Koelle, Marion and Heuten, Wilko and Boll, Susanne and Bulling, Andreas}, year = {2019}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, pages = {1--10}, doi = {10.1145/3314111.3319913}, video = {https://www.youtube.com/watch?v=Gy61255F8T8} }

Privacy-Aware Eye Tracking Using Differential Privacy

Julian Steil, Inken Hagestedt, Michael Xuelin Huang, Andreas Bulling

Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA), pp. 1–9, 2019.

Best Paper Award

With eye tracking being increasingly integrated into virtual and augmented reality (VR/AR) head-mounted displays, preserving users’ privacy is an ever more important, yet under-explored, topic in the eye tracking community. We report a large-scale online survey (N=124) on privacy aspects of eye tracking that provides the first comprehensive account of with whom, for which services, and to what extent users are willing to share their gaze data. Using these insights, we design a privacy-aware VR interface that uses differential privacy, which we evaluate on a new 20-participant dataset for two privacy-sensitive tasks: We show that our method can prevent user re-identification and protect gender information while maintaining high performance for gaze-based document type classification. Our results highlight the privacy challenges particular to gaze data and demonstrate that differential privacy is a potential means to address them. Thus, this paper lays important foundations for future research on privacy-aware gaze interfaces.
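
As a rough illustration of how differential privacy can be applied to gaze features, the sketch below uses the Laplace mechanism. The abstract does not state which mechanism the paper uses, so the Laplace mechanism, the clipping bounds, and the function name `privatise` are assumptions made for this example only.

```python
import random


def privatise(features, lower, upper, epsilon, rng):
    """Clip each feature to [lower, upper] and add Laplace noise.

    Assumption: the privacy budget epsilon is split evenly across features,
    so releasing the whole vector satisfies epsilon-differential privacy.
    """
    sensitivity = upper - lower          # max change of one clipped feature
    per_feature_epsilon = epsilon / len(features)
    scale = sensitivity / per_feature_epsilon
    noisy = []
    for x in features:
        clipped = min(max(x, lower), upper)
        # The difference of two i.i.d. exponentials is Laplace(0, scale).
        noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
        noisy.append(clipped + noise)
    return noisy


# Hypothetical normalised gaze features (e.g. fixation or saccade statistics).
rng = random.Random(0)
gaze_features = [0.2, 1.5, -3.0]
released = privatise(gaze_features, lower=0.0, upper=1.0, epsilon=1.0, rng=rng)
```

A smaller `epsilon` means more noise and stronger privacy, at the cost of utility for downstream tasks such as the document type classification mentioned in the abstract.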

doi: 10.1145/3314111.3319915

Paper: steil19_etra_2.pdf

@inproceedings{steil19_etra_2, title = {Privacy-Aware Eye Tracking Using Differential Privacy}, author = {Steil, Julian and Hagestedt, Inken and Huang, Michael Xuelin and Bulling, Andreas}, year = {2019}, booktitle = {Proc. ACM International Symposium on Eye Tracking Research and Applications (ETRA)}, doi = {10.1145/3314111.3319915}, pages = {1--9} }
