Introduction to the Special Issue on Human-Centered Machine Learning
In gesture recognition, one challenge that researchers and developers face is the need for recognition strategies that mediate between false positives and false negatives. In this paper, we examine bi-level thresholding, a recognition strategy that uses two thresholds: a tighter threshold limits false positives and recognition errors, and a looser threshold prevents repeated errors (false negatives) by analyzing movements in sequence. We first describe early observations that led to the development of the bi-level thresholding algorithm. Next, using a Wizard-of-Oz recognizer, we hold recognition rates constant and adjust for fixed versus bi-level thresholding; we show that systems using bi-level thresholding result in significantly lower workload scores on the NASA-TLX and significantly lower accelerometer variance when performing gesture input. Finally, we examine the effect that bi-level thresholding has on a real-world data set of wrist and finger gestures, showing an ability to significantly improve measures of precision and recall. Overall, these results argue for the viability of bi-level thresholding as an effective technique for balancing between false positives, recognition errors, and false negatives.
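As a rough illustration of the two-threshold idea in this abstract, the following Python sketch accepts an attempt outright when its match score clears a tight threshold, and accepts a near miss only when the immediately preceding attempt was also a near miss for the same gesture. The function name, the score scale, the threshold values, and this particular sequencing rule are assumptions made for illustration; the abstract does not specify the algorithm at this level of detail.

```python
from typing import Iterable, List, Optional, Tuple


def bilevel_recognize(
    attempts: Iterable[Tuple[str, float]],
    tight: float = 0.8,   # assumed value: score needed for immediate acceptance
    loose: float = 0.6,   # assumed value: score needed to count as a near miss
) -> List[Optional[str]]:
    """Return the recognized label (or None) for each attempt in order.

    Each attempt is (best_matching_label, match_score), where a higher
    score means a closer match. This is a hypothetical interface.
    """
    results: List[Optional[str]] = []
    pending: Optional[str] = None  # label of the previous near miss, if any
    for label, score in attempts:
        if score >= tight:
            # Clear match: accept immediately.
            results.append(label)
            pending = None
        elif score >= loose:
            if pending == label:
                # Second consecutive near miss for the same gesture: accept.
                results.append(label)
                pending = None
            else:
                # First near miss: reject, but remember it.
                results.append(None)
                pending = label
        else:
            # Too far from any template: reject and forget the history.
            results.append(None)
            pending = None
    return results


if __name__ == "__main__":
    stream = [("swipe", 0.9), ("circle", 0.7), ("circle", 0.65)]
    print(bilevel_recognize(stream))
```

In this sketch the tight threshold governs single attempts, while the loose threshold only takes effect across two attempts in sequence, which is one way to limit repeated rejections without loosening recognition for isolated movements.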
Machine learning (ML) has become increasingly influential to human society, yet the primary advancements and applications of ML are driven by research in only a few computational disciplines. Even applications that affect or analyze human behaviors and social structures are often developed with limited input from experts outside of computational fields. Social scientists, experts trained to examine and explain the complexity of human behavior and interactions in the world, have considerable expertise to contribute to the development of ML applications for human-generated data, and their analytic practices could benefit from more human-centered ML methods. In this work, we highlight some of the gaps in applying ML to social science research. Building upon content analysis of social media papers, a survey study, and interviews, we summarize the current use and challenges of ML in social sciences. Additionally, we utilize our experience designing a visual analytics tool for collaborative qualitative coding as a case study to illustrate how we might re-imagine the way ML could support workflows for social scientists. Finally, we propose three research directions to ground ML applications for social science with the ultimate goal of achieving truly human-centered machine learning.
When searching on the web, results are often returned as lists of hundreds to thousands of items, making it difficult for users to understand or navigate the space of results. Research has demonstrated that using clustering to partition search results into coherent, topical clusters can aid in both exploration and discovery. Yet clusters generated by an algorithm for this purpose are often of poor quality and do not satisfy users. As a result, experts must manually evaluate and refine the clustered results for each search query, a process that does not scale to large numbers of search queries. In this work, we investigate using crowd-based human evaluation to inspect, evaluate, and improve clusters to create high-quality clustered search results at scale. We introduce a workflow that begins by using a collection of well-known clustering algorithms to produce a set of clustered search results for a given query. Then, we use crowd workers to holistically assess the quality of each clustered search result in order to find the best one. Finally, the workflow has the crowd spot and fix problems in the best result in order to produce a final output. We evaluate this workflow on 120 top search queries from the Google Play Store, some of which have clustered search results produced through evaluation and refinement by experts. Our evaluations demonstrate that the workflow is effective at reproducing the evaluation of expert judges and also improves clusters in a way that agrees with experts and crowds alike.
Sleep is the most important aspect of healthy and active living. The right amount of sleep at the right time helps individuals protect their physical, mental, and cognitive health and maintain their quality of life. The most durative of the Activities of Daily Living (ADL), sleep has a major synergistic influence on a person's functional, behavioral, and cognitive health. A deep understanding of sleep behavior and its relationship with physiological signals and contexts (such as eye or body movements) is necessary to design and develop a robust intelligent sleep monitoring system. In this paper, we propose an intelligent algorithm to detect the microscopic states of sleep, which fundamentally constitute the components of good and bad sleeping behavior and thus help shape the formative assessment of sleep quality. Our initial analysis includes the investigation of several classification techniques to identify and correlate the relationship of microscopic sleep states with the overall sleep behavior. Subsequently, we propose an online algorithm based on change point detection to process and classify the microscopic sleep states. We also develop a lightweight version of the proposed algorithm for real-time sleep monitoring, recognition, and assessment at scale. For a larger deployment of our proposed model across a community of individuals, we propose an active learning based methodology to reduce the effort of ground truth data collection and labeling. Finally, we evaluate the performance of our proposed algorithms on real data traces, and demonstrate the efficacy of our models for detecting and assessing fine-grained sleep states beyond an individual.
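To give a sense of what online change point detection can look like, the sketch below applies a simple two-sided CUSUM test to a one-dimensional signal, processing one sample at a time. CUSUM is swapped in here purely as a familiar example; the abstract does not say which change point method the authors use, and the function name, `drift`, and `threshold` values are illustrative assumptions.

```python
from typing import Iterable, List


def cusum_change_points(
    signal: Iterable[float],
    drift: float = 0.5,      # assumed slack: deviations below this are ignored
    threshold: float = 4.0,  # assumed alarm level for the cumulative sums
) -> List[int]:
    """Return the indices at which the signal level appears to shift."""
    changes: List[int] = []
    mean = 0.0        # running mean of the current segment
    count = 0         # number of samples in the current segment
    pos = neg = 0.0   # cumulative evidence for an upward / downward shift
    for i, x in enumerate(signal):
        if count == 0:
            # First sample of the stream: start the first segment.
            mean, count = x, 1
            continue
        deviation = x - mean
        pos = max(0.0, pos + deviation - drift)
        neg = max(0.0, neg - deviation - drift)
        if pos > threshold or neg > threshold:
            # Enough evidence of a shift: record it and start a new segment.
            changes.append(i)
            mean, count = x, 1
            pos = neg = 0.0
        else:
            # No alarm: fold the sample into the running mean.
            count += 1
            mean += deviation / count
    return changes


if __name__ == "__main__":
    # A flat stretch followed by a jump, e.g. a movement-intensity trace.
    trace = [0.0] * 10 + [5.0] * 10
    print(cusum_change_points(trace))
```

Because each sample is handled in constant time with a few running values, a detector of this general shape is the kind of computation that suits lightweight, real-time monitoring.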
Interactive Machine Learning (IML) seeks to complement human perception and intelligence by tightly integrating these strengths with the computational power and speed of computers. The interactive process is designed to involve input from the user but does not require the background knowledge or experience that might be necessary to work with more traditional machine learning techniques. Under the IML process, non-experts can apply their domain knowledge and insight over otherwise unwieldy datasets to find patterns of interest or develop complex data-driven applications. This process is co-adaptive in nature and relies on careful management of the interaction between human and machine. Design of the interface is fundamental to the success of this approach, yet there is a lack of consolidated principles on how such an interface should be implemented. This article presents a detailed review and characterization of Interactive Machine Learning from an interactive systems perspective. We propose and describe a structural and behavioural model of a generalized IML system and identify solution principles for building effective interfaces for IML. Where possible, these emergent solution principles are contextualized by reference to the broader human-computer interaction literature. Finally, we identify strands of user interface research key to unlocking more efficient and productive non-expert interactive machine learning applications.
Design sketching is an important tool for designers and creative professionals to express their ideas and concepts in a visual medium. Because it is a critical and versatile skill for many different disciplines, courses on design sketching are sometimes taught in universities. Courses today predominantly rely on pen and paper; however, this traditional pedagogy is limited by the availability of human instructors who can provide personalized feedback. Using a stylus-based intelligent tutoring system called PerSketchTivity, we aim to mimic the feedback given by an instructor and assess student-drawn sketches to give students insight into the areas they need to improve on. In order to provide effective feedback to users, it is important to identify what features of their sketches they should work on to improve their sketching ability. After consulting with several domain experts in sketching, we came up with an initial list of 22 different metrics that could potentially differentiate expert and novice sketches. We gathered over 2000 sketches from 20 novices and four experts for analysis. Seven metrics were shown to significantly correlate with the quality of expert sketches and offered insight into how to provide intelligent user feedback as well as an overall model of expert sketching ability.
Gone are the days of robots solely operating in isolation, without direct interaction with people. Rather, robots are increasingly being deployed in environments and roles that require complex social interaction with humans. The implementation of human-robot teams continues to increase as technology develops in tandem with the state of human-robot interaction (HRI) research. Trust, a major component of much human interaction, is an important facet of HRI. However, the ideas of trust repair and trust violations are understudied in the HRI literature. Trust repair is the activity of rebuilding trust after one party breaks the trust of another. These trust breaks are referred to as trust violations. As HRI becomes widespread, so will trust violations; as a result, a clear understanding of the process of HRI trust repair must be developed in order to ensure that a human-robot team can continue to perform well after trust is violated. Previous research on human-automation trust and human-human trust can serve as starting places for exploring trust repair in HRI. Although existing models of human-automation and human-human trust are helpful, they do not account for some of the complexities of building and maintaining trust in unique relationships between humans and robots. As such, the purpose of this paper is to provide a foundation for exploring human-robot trust repair by drawing upon prior work in the human-robot and human-human trust literature, concluding with recommendations for advancing this body of work.
People are not infallible, consistent "oracles": their confidence in decision-making may vary significantly between tasks and over time. We have previously reported the benefits of using an interface and algorithms that explicitly captured and exploited users' confidence: error rates were reduced by up to 50% for an industrial multi-class learning problem, and the number of interactions required in a design optimisation context was reduced by 33%. Having access to users' confidence judgements could significantly benefit intelligent interactive systems in industry, in areas such as intelligent tutoring systems, and in healthcare. There are many reasons for wanting to capture information about confidence implicitly. Some are ergonomic, but others are more 'social', such as wishing to understand (and possibly take account of) users' cognitive state without interrupting them. We investigate the hypothesis that users' confidence can be accurately predicted from measurements of their behaviour. Eye-tracking systems were used to capture users' gaze patterns as they undertook a series of visual decision tasks, after each of which they reported their confidence on a 5-point Likert scale. Subsequently, predictive models were built using "conventional" Machine Learning approaches for numerical summary features derived from users' behaviour. We also investigate the extent to which the deep learning paradigm can reduce the need to design features specific to each application, by creating "gazemaps" (visual representations of the trajectories and durations of users' gaze fixations) and then training deep convolutional networks on these images. Treating the prediction of user confidence as a two-class problem (confident/not confident), we attained classification accuracy of 88% for the scenario of new users on known tasks, and 87% for known users on new tasks. Considering the confidence as an ordinal variable, we produced regression models with a mean absolute error (MAE) of approximately 0.7 in both cases. Capturing just a simple subset of non-task-specific numerical features gave slightly worse, but still quite high, accuracy (e.g., MAE of approximately 1.0). Results obtained with gazemaps and convolutional networks are competitive, despite not having access to longer-term information about users and tasks, which was vital for the 'summary' feature sets. This suggests that the gazemap-based approach forms a viable, transferable alternative to hand-crafting features for each different application. These results provide significant evidence to confirm our hypothesis, and offer a way of substantially improving many interactive artificial intelligence applications via the addition of cheap non-intrusive hardware and computationally cheap prediction algorithms.
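The gazemap idea in this abstract can be pictured with a small sketch that bins gaze fixations into a coarse grid and accumulates their durations, yielding an image-like array that could in principle be fed to a convolutional network. This is a simplified assumption about what such a representation might contain: it captures fixation durations only, not trajectories, and the function name, grid size, and input format are hypothetical rather than taken from the paper.

```python
from typing import Iterable, List, Tuple


def gazemap(
    fixations: Iterable[Tuple[float, float, float]],
    width: float,
    height: float,
    grid: int = 4,  # assumed resolution; a real image would be much finer
) -> List[List[float]]:
    """Rasterize (x, y, duration) fixations into a grid of values in [0, 1]."""
    cells = [[0.0] * grid for _ in range(grid)]
    for x, y, duration in fixations:
        # Map screen coordinates to a cell, clamping points on or past the edges.
        col = min(max(int(x / width * grid), 0), grid - 1)
        row = min(max(int(y / height * grid), 0), grid - 1)
        cells[row][col] += duration
    peak = max(max(row_values) for row_values in cells)
    if peak > 0:
        # Scale so the most-fixated cell has intensity 1.0.
        cells = [[value / peak for value in row_values] for row_values in cells]
    return cells


if __name__ == "__main__":
    # Two fixations near the top-left corner and one near the bottom-right.
    sample = [(10, 10, 0.2), (15, 12, 0.2), (90, 90, 0.1)]
    for row_values in gazemap(sample, width=100, height=100):
        print(row_values)
```

The appeal of a representation along these lines is that it depends only on where and for how long a user looked, not on features designed for a particular task.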
Exploring high-dimensional data is challenging. Dimension reduction algorithms, such as weighted multidimensional scaling, support data exploration by projecting datasets to two dimensions for visualization. These projections can be explored through parametric interaction (tweaking the underlying parameterizations) and observation-level interaction (directly interacting with the points within the projection). In this paper, we present the results of a controlled usability study determining the differences, advantages, and drawbacks among parametric interaction, observation-level interaction, and their combination. The study assesses both interaction techniques' effects on domain-specific high-dimensional data analyses performed by non-experts in statistical algorithms. This study is performed using Andromeda, a tool that enables both parametric and observation-level interaction to provide in-depth data exploration. The results indicate that the two forms of interaction serve different, but complementary, purposes in gaining insight through steerable dimension reduction algorithms.
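To make parametric interaction concrete, the sketch below computes the weighted Euclidean distance that weighted multidimensional scaling typically tries to preserve in the projection, and shows how changing one dimension's weight alters which points count as close. The function name and the toy data are assumptions for illustration; this is not a description of how Andromeda is implemented.

```python
import math
from typing import Sequence


def weighted_distance(
    a: Sequence[float], b: Sequence[float], weights: Sequence[float]
) -> float:
    """Weighted Euclidean distance between two high-dimensional points."""
    if not (len(a) == len(b) == len(weights)):
        raise ValueError("points and weights must have the same length")
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


if __name__ == "__main__":
    origin, p, q = (0.0, 0.0), (3.0, 0.0), (0.0, 4.0)

    # Equal weights: p is closer to the origin than q.
    print(weighted_distance(origin, p, (1.0, 1.0)),
          weighted_distance(origin, q, (1.0, 1.0)))

    # Down-weighting the second dimension pulls q closer than p.
    print(weighted_distance(origin, p, (1.0, 0.25)),
          weighted_distance(origin, q, (1.0, 0.25)))
```

Parametric interaction corresponds to adjusting `weights` directly; observation-level interaction works in the other direction, with the user moving points in the projection and the system inferring weights consistent with the new layout.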
Convergent lines of evidence indicate that anthropomorphic robots are represented using neurocognitive mechanisms typically employed in social reasoning about other people. Relatedly, a growing literature documents that contexts of threat can exacerbate coalitional biases in social perceptions. Integrating these research programs, the present studies test whether cues of violent intergroup conflict modulate perceptions of the intelligence, emotional experience, or overall personhood of robots. In Studies 1 and 2, participants evaluated a large, bipedal all-terrain robot; in Study 3, participants evaluated a small, social robot with humanlike facial and vocal characteristics. Across all studies, cues of violent conflict caused significant decreases in perceived robotic personhood, and this shift was mediated by parallel reductions in emotional sympathy with the robot (with no significant effects of threat on attributions of intelligence). In addition, in Study 2, participants in the conflict condition estimated the large bipedal robot to be less effective in military combat, and this difference was mediated by the reduction in perceived robotic personhood. These results are discussed as they motivate future investigation into the links between threat, coalitional bias and human-robot interaction.
Sophisticated ubiquitous sensing systems are being used to measure motor ability in clinical settings. Because these systems are intended to augment clinical decision-making, the interpretability of the underlying machine learning measurements becomes critical to their use. We explore how visualization can support the interpretability of machine learning measures through the case of Assess MS, a system to support the clinical assessment of Multiple Sclerosis. A substantial design challenge is to make visible the algorithm's decision-making process in a way that allows clinicians to integrate the algorithm's result into their own decision process. To this end, we present an iterative design research study that draws out challenges of supporting interpretability in a real-world system. The key contribution of this paper is to illustrate that simply making visible the algorithmic decision-making process is not helpful in supporting clinicians in their own decision-making process, because it disregards that people and algorithms make decisions in different ways. Instead, we propose that visualization can provide context to algorithmic decision-making, rendering observable a range of internal workings of the algorithm, from data quality issues to the web of relationships generated in the machine learning process.