Advancing the Value of Ethnography

Hybrid Methodology: Combining Ethnography, Cognitive Science, and Machine Learning to Inform the Development of Context-Aware Personal Computing and Assistive Technology


Cite this article:

2019 EPIC Proceedings pp 254–281, ISSN 1559-8918,


Reconstructed Narratives with Video Playback

How does a researcher break down, in moment-by-moment sequence, another person’s experience? The researcher can observe someone in real time, but that does not explore interiority (e.g. what is our participant Marcus deciding between as he’s stirring the pot of noodles? What caused him to pause for so long by the window?). Researchers could interrupt that person at a steady cadence to probe deeply at interiority beyond the “Paas+why” experience sampling described above, but that would introduce an “observer effect” distortion. The in-depth questioning could prevent the participant from entering important and common subjective states such as “flow” states (Csikszentmihalyi 2008) or mind-wandering (Smallwood and Schooler 2015) that would otherwise typically occur when the participant is in the everyday context and that would be helpful for the researcher’s understanding of what assistance, if any at all, a person might need in that context.

As our team puzzled over this problem, we began to look to what was, in retrospect, one of the techniques of narrative journalism: the reconstructed narrative interview (Menkedick 2018). Journalists who specialize in telling narrative stories deeply rooted in one “character’s” experience rapidly learn the value of revisiting a particular event with an interview subject again and again; each visit adds a new layer of depth, and helps the journalist to recapture what it was like to live through that event. In designing our research, we settled on a version of this technique as a method to probe participants’ experience of cooking in a way that was both deep and unobtrusive. We would allow the participant to perform his or her task with no questioning beyond the mental effort scores asked every two to four minutes. Only after the cooking was complete would we engage the participant in an interview of approximately 60 minutes (sometimes longer) to immediately reconstruct, with as much fidelity as possible, what the interior experience of the just-completed task had been, particularly during a few moments of interest informed by steep changes in the mental effort scores they reported. For each participant we did this process twice, once after each of the two activities. Crucially, we scrolled through the just-captured first-person video of the participant doing the activity during the interview to guide the questioning.

With in-situ fieldwork, researchers have an advantage over the journalist, as well as a disadvantage. The advantage is presence. Journalists are rarely physically present during the “scenes” or moments they later seek to reconstruct in their subjects’ lives. By contrast, researchers in-situ are able to quietly observe and take notes about the scenes they will shortly try to reconstruct. Research can be set up to have the further advantage of being able to conduct the debrief interview immediately following the task; a narrative journalist often is piecing together events that date back years or even decades. The disadvantage is that researchers are seeking to reconstruct the experience of essentially banal events (e.g. doing laundry), and on a more minute time scale than a journalist would try to explore (e.g. returning the shirt to the ironing board just when it seemed like the shirt was done getting ironed). Very seldom does a journalist attempt to reconstruct how a person’s experience shifted across the course of a second, and never would a journalist expect a subject to remember with any fidelity the precise order in which the subject executed essentially banal tasks, like whether salt was added to a broth before pepper, and why.

Video footage can be used to overcome this challenge. In our study, we decided to play back to participants the video that had been recorded of them performing the cooking task using a head-mounted camera. (During the second activity of the participants’ choosing, there was only an in-room camera recording the activity. We decided on this approach in case the head-mounted camera proved to be too disruptive for the participants’ experience, but participants reflected that for the most part they forgot about the head-mounted camera after a few minutes of cooking.) This video, if instantly replay-able, serves as a kind of memory prosthetic to assist reconstructive narrative interviewing; the first-person perspective of the camera view further helps the participant relive the experience of the hour before. For instance, vision darting from one ingredient to another could help the participant viscerally remember a moment’s indecision over how to proceed with a recipe. (We also realized that participants were much more comfortable watching first-person video of themselves than room-camera video of themselves that often made participants feel self-conscious.)

The reconstructed narrative with video playback can take longer than doing the activity itself, but it is this time investment that allows for deep probing into what would otherwise remain unseen or untranslatable to the researcher: a furrowed brow, a pause, a chuckle. Moments that are apt for deep discussion can be selected by both the researchers, looking back on their notes, and the participants, recalling something they had thought about but didn’t say aloud at the time. Following the cooking task, we sat down with the participant and spent about an hour reviewing moments of special interest. Moments of interest were chosen at the researcher’s discretion, but often involved spikes or significant fluctuations of mental effort as recorded from the mental effort score self-reports, moments of clear task-switching, moments of interruption, or moments the researchers had trouble deciphering. The researchers also allowed the participants to highlight moments that seemed uneventful to the researchers but during which the participant was experiencing a lot of internal activity. For instance, one participant, Haley, noted that when she was waiting for the tofu to brown she was reminded of a reply she was waiting on from a love interest. The researchers soon discovered that thoroughly explaining everything that influenced the participant’s experience during a moment of high complexity, even if that moment lasted only 30 seconds, could easily take 20 minutes of exhaustive probing through repeated playback of the video clip.

To give one example: one researcher witnessed a participant, Daryl: 1) have a dialogue with his wife about a task related to their young daughter’s pajamas, 2) make a note about this task on a nearby whiteboard, 3) rapidly decide to execute the task immediately instead, thereby abandoning his borscht recipe for the moment, 4) quickly visit different drawers in his daughter’s bedroom (captured for the ethnographer only due to Daryl’s wearing a head-mounted camera, as he had darted away from the kitchen at this point), 5) visit a drying machine to grab a pair of pajamas, then 6) finally return to his borscht. Puzzling out all of these decisions, and the sub-decisions within these decisions, was a laborious (if joyful) task for the researcher, necessitating digressive interviews about the state of Daryl’s relationship with his wife, his young daughter’s aversion to wearing pajamas, and a history of the participant’s forgetting to execute tasks placed on the family chore-board. The entire video clip lasted perhaps just 30 seconds, but the exhaustive and fully explanatory account of its meaning ran for several hundred words.

This method of narrative reconstruction using first-person point of view video playback builds on participatory ethnographic video practices (see for example Pink 2007, 103-115), and places emphasis on the research participant’s role in interpreting and making sense of their own experiences, rather than leaving the interpretation and sensemaking to the researcher alone upon return from the field (as may often be the case for the ethnographer) or from the lab (as may often be the case for the cognitive scientist). As anthropologist João Biehl writes, “How can the lives of our informants and collaborators, and the counter-knowledges that they fashion, become alternative figures of thought that might animate comparative work […]? […] As anthropologists, […] we are challenged to listen to people — their self-understandings, their storytelling, their own concept work — with deliberate openness to life in all its refractions” (Biehl 2013, pp 574-6). This is perhaps another way in which hybrid methodology seeks to push the boundaries of research — by bringing participants more actively into the sensemaking process — and future work might benefit from developing this aspect further. Providing research participants more opportunities to articulate their internal states, including what they need and what they don’t need, rather than assuming or inferring from observations alone, seems particularly important for determining the relevance, helpfulness, and boundaries of an assistive technology in everyday contexts.


Because of the mix of methods combined in research, hybrid methodology generates a substantial amount of data of different types (e.g. numerical scores, observational field notes, images, video recordings). Given the wealth of data collected, many analysis strategies are possible in order to make sense of that data. The interdisciplinary team needs to choose which means of analysis to prioritize and combine in ways that best serve the research question (rather than in ways that best serve each discipline). In the case of complex research questions (e.g. what is the human experience of context?), conducting complementary analyses that make simultaneous entry points into the data allows the team to explore the research question from different angles and to revisit the data later on, as distinct disciplines follow particular tracks to explore a sub-component of the research question in more depth.

In this section we present a selection of complementary analyses that we conducted, which combined qualitative and quantitative approaches. These analyses are part of a larger pattern recognition or “Sensemaking” process (Madsbjerg and Rasmussen 2014; also described in Hou and Holme 2015), in which teams use “bottom-up” data-driven approaches (i.e. based on what we see in the field) alongside “top-down” theme-driven approaches (i.e. based on the themes we sought to explore at the outset and questions we needed answers to). In our case, we wanted the results of the analyses to help inform the early design of new assistance experiences, the research agenda for further studies (in a lab and in the field) based on new questions emerging from the work, and the early development of infrastructure for new assistive personal computing technology.

Structured Storytelling and Qualitative Data Clustering

How do teams ensure that all researchers are familiar with the details of the raw data and have a shared starting point, particularly when each researcher met with only a subset of participants? How do we enable researchers to discern themes across distinct moments in the field? We took what we informally called “structured storytelling” as our starting point in analysis: a discussion centered on each of the research participants, led by the researchers who met with that participant, and structured around key questions and instances from the field that the team wants to systematically and consistently probe for details. This ensures that human voices and experiences are top of mind — the participants are not abstracted as “Subject A” or as data points on a graph, but instead as individuals with names (we used pseudonyms to protect identity). It also ensures all team members have a shared grasp of the details and particularities of the fieldwork, from which (when those details are compared, connected, and abstracted) insights tend to emerge.

In the discussions, the team focused on concrete moments observed in the field — Dina doing laundry, or Mitchell tending to his indoor garden. This involved re-watching video footage around moments that were quantitatively interesting because the participant reported a very high or very low mental effort score, and moments that were qualitatively interesting because of an ethnographically rich observation (e.g. a moment the participant identified as meaningful upon reflection after the activity was done or a moment the researcher noticed as having many contextual dimensions at play). The purpose of structured storytelling is to interrogate the raw data with pertinent lines of questioning that help the team to interpret what happened in the field. Some of the questions we asked as a team included, “What dimensions of the context were especially relevant for the individual in this moment?” “What type of information was the individual engaged with?” “What other moments from the field, from this participant or other participants, might be similar to this one, and why?”

Structured storytelling stems from grounded theory, a methodology used in sociology and anthropology to generate theories based on systematic analysis of qualitative data rather than using data to confirm or refute a hypothesis, or building research around an existing theory (Glaser and Strauss 2017). Structured storytelling, as described above, generates interpretive descriptions or reflections that the team members then write down individually (e.g. on post-it notes or note cards) and aggregate collectively. This content, in turn, leads the team to do qualitative data clustering, which entails making further sense of the interpretive descriptions by grouping them into thematic buckets based on commonalities. These buckets are then analyzed, connected, and compared to develop working theories or insights. The development of these theories requires a constant “zooming out and in” — once there is a potential insight (i.e. a working theory that explains observations from the field), it is necessary to go back to the raw data itself to collect other moments (e.g. moments that corresponded with similar mental effort score, or moments that were ethnographically rich) that support, nuance, or challenge the proto-insight, for its refinement.

A team can tell whether or not the structured storytelling and qualitative data clustering are going in the right direction if there is a certain productiveness to the hypotheses or proto-insights — these are helping to reframe or give new meaning to moments in the field not otherwise considered, are leading to other proto-insights, or are providing structure and groupings in an otherwise fragmented data set of moments from the field. The purpose is to develop high-level insights that address the project’s research questions and ambitions — in our case, about the role of different dimensions of context on a person’s experience that then informed the abstractions we developed for a data labelling protocol, described in the Impact section. The abstractions we developed (which we refer to in this paper as Abstraction Set A and Set B and which can be thought of as an early framework that informs the later framework the assistive technology itself might eventually use) were based on the strongest patterns in our qualitative data clustering exercises and the relationships those patterns had to the quantitative analysis we will now describe.

Quantitative Analysis of Ethnographic Data

To allow machine learning models and cognitive science research to benefit from insights derived from qualitative analysis, we also need to find complementary quantitative methods for data analysis. How do teams work quantitatively with data captured in ethnographic research? Quantitative analysis of ethnographic data entails developing an approach to data processing and graphical representation to best serve the team’s goals. We drew three lessons that could be useful for teams doing this type of work: First, if in doubt about what type of quantitative analysis will prove useful, the team should develop multiple initial representations of the same data to enable a variety of early insights. Second, the team should seek ways to compare data points consistently and systematically even when individual research participants’ experiences or real-world contexts and interpretations of the tested concept are highly variable. Third, the team should explore connections between the quantitative and qualitative data to better understand the results of the quantitative analyses and address project goals (e.g. in our case going back to the thick descriptions associated with extreme mental effort scores to find other patterns in this data).

One of our goals was to obtain generalizable patterns about mental effort from the mental effort scores. The challenge is that, given the uncontrolled situations we were studying, the mental effort scores were generally not comparable across participants because of variable real-world contexts and because of individual differences in how participants interpreted the mental effort scale. This is a common problem with all self-report scales. For example, one participant never gave a maximum score of 9 (always hovering around 6s or 7s at the extreme), but her qualitative description of a moment was very similar to another participant’s description for a 9 score. This left us with an interesting question: Can mental effort scores be compared across different activities for the same participant, and across participants?

We first plotted the mental effort scores for each participant’s two activities in box-and-whisker plots, which allowed us to visualize the median mental effort score the participant gave for that activity, as well as the upper and lower quartiles around that median and the upper and lower extremes (moments when the participant gave a really high score or a really low score, outside of the norm of scores they were otherwise providing). We were able to contextualize these plots with what we knew qualitatively about each participant, to identify patterns in how each participant “typically” scores mental effort (e.g. Marcus loves cooking and it’s easy for him, whereas he doesn’t enjoy studying and finds the material difficult, but there are relative “extremes” in each activity, with distinct needs, and those might have similarities to another participant’s, when we begin to abstract out through the qualitative data clustering).
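As a minimal sketch of the numbers behind such a plot, the snippet below computes the five values a box-and-whisker plot draws for each participant and activity. It is not the team's analysis code: the participant ID, the scores, and the function names are hypothetical, a 1 to 9 scale is assumed, and the "inclusive" quartile method is only one of several conventions a plotting tool might use.

```python
from statistics import median, quantiles

# Hypothetical self-reports: (participant, activity, mental effort score, assumed 1-9 scale).
reports = [
    ("P1", "cooking", 2), ("P1", "cooking", 3), ("P1", "cooking", 3),
    ("P1", "cooking", 4), ("P1", "cooking", 8),
    ("P1", "studying", 5), ("P1", "studying", 6), ("P1", "studying", 7),
    ("P1", "studying", 7), ("P1", "studying", 9),
]


def box_summary(scores):
    """Return the five values a box-and-whisker plot draws for one activity."""
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = quantiles(scores, n=4, method="inclusive")
    return {
        "min": min(scores),
        "q1": q1,
        "median": median(scores),
        "q3": q3,
        "max": max(scores),
    }


def summarize(reports):
    """Group scores by (participant, activity) and summarize each group."""
    grouped = {}
    for participant, activity, score in reports:
        grouped.setdefault((participant, activity), []).append(score)
    return {key: box_summary(values) for key, values in grouped.items()}


summaries = summarize(reports)
for key, summary in summaries.items():
    print(key, summary)
```

In this made-up data the single score of 8 during cooking sits well above that activity's upper quartile, which is the kind of relative extreme the qualitative notes would then be used to explain.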

In order to paint a picture of how each participant’s mental effort reports shifted over time, we made another set of mental effort score plots with score values on the y-axis and time on the x-axis. This provided a “story arc” of how an activity unfolded in terms of mental effort from start to finish, which we could then contextualize with qualitative data (e.g. Dina did laundry late in the day feeling rushed to get it done while the food was cooking, so perhaps that’s why the “arc” of the activity looks the way it does). We could also compare the mental effort score arc with what we knew from the reconstructed narratives in the field (e.g. when Haley’s scores were low during a banal moment in cooking, we knew she was thinking about her romantic interest and about her work responsibilities). We were able to assess where our ethnographic observations differed from or aligned with the mental effort scores, and understand how two participants’ needs, when compared, were distinct even when they each gave a score of 9 during a moment when they were each cooking.

To better visualize the set of high mental effort “outliers” (the particularly rich moments from a cognitive science point of view) and identify clusters (similar patterns) between participants, we calculated the mean and standard deviation of the mental effort scores across both activities for each participant, plotted in temporal sequence (how the mental effort scores changed over time for each participant). High outliers were defined as those that fell in the top 10% of the distribution for a participant. Because we had qualitative notes accompanying each score, we were able to interpret and theorize about why a moment was an extreme high or low score, for that individual, and find patterns among the “why’s” behind the relative extreme scores. This data informed subsequent analyses conducted by the cognitive scientists on our team (Jonker et al. in review).
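The within-participant logic described above can be sketched as follows. This is an illustration under stated assumptions rather than the analysis actually run: the scores are invented, the function names are hypothetical, and "top 10%" is read here as the top tenth of a participant's reports by rank, with ties going to the earlier report.

```python
from statistics import mean, pstdev

# Hypothetical score sequences in temporal order, pooled across both activities.
scores_by_participant = {
    "A": [3, 4, 3, 5, 6, 4, 3, 7, 4, 5],  # never reports a 9
    "B": [6, 7, 8, 9, 7, 6, 8, 7, 6, 7],  # uses the top of the scale
}


def standardize(scores):
    """Express each score in standard deviations from the participant's own mean."""
    mu = mean(scores)
    sigma = pstdev(scores)
    if sigma == 0:
        return [0.0 for _ in scores]
    return [(s - mu) / sigma for s in scores]


def high_outliers(scores, top_fraction=0.10):
    """Return the time indices of reports in the top fraction for this participant."""
    k = max(1, round(len(scores) * top_fraction))
    # sorted() is stable, so tied scores keep their temporal order.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


z_scores = {p: standardize(s) for p, s in scores_by_participant.items()}
outliers = {p: high_outliers(s) for p, s in scores_by_participant.items()}

for p in scores_by_participant:
    for i in outliers[p]:
        print(p, "report", i, "raw", scores_by_participant[p][i], "z", round(z_scores[p][i], 2))
```

With these invented numbers, participant A's 7 and participant B's 9 end up at nearly the same distance from each person's own mean, which mirrors the observation that one participant's 6s and 7s could describe moments similar to another participant's 9.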

Multiple forms of analyses are possible on, and can enrich our understanding of, a hybridized data set, to provide more directional outcomes. Together these approaches set our team up to explore further cognitive science questions around mental effort, and to explore further questions around helpful abstractions to inform machine learning (some of which is described in the Impact section that follows). Hybrid Methodology is amenable to subsequent analyses that build on or depart from the initial analysis of the data, both because there are many “kinds” of data (e.g. quantitative, qualitative, self-reports, interpretation) to work with and because there are disciplinary experts who are already familiar with that data from the interdisciplinary work.


Having multiple analytical entry points into a hybrid methodology data set can give a team the opportunity to make an impact in a variety of ways and for different intellectual communities (both company-internal and external). The richness and variety of the hybrid methodology data set, and the analyses described above, left our team poised to develop work products (i.e. outputs, deliverables) that generated impactful early outcomes for context-aware assistive technology, including: (1) shaping early user experience design, (2) informing the research agenda for future studies in cognitive science, and (3) developing nascent research on infrastructure for assistive technologies. Together these follow-on projects represent a portfolio approach to delivering impact from a hybrid methodology data set, leveraging and extending the data and analysis in different ways.

Each of these three follow-on projects had distinct ambitions for how to deliver relevant findings to the “home discipline” intellectual communities that came together at the outset of our hybrid methodology project. The follow-on projects offer contrasting approaches to extending the analysis and application of a hybrid methodology data set, and suggest ways that qualitative data could be used in machine learning and cognitive science. The first two of our listed outcomes were in a sense more straightforward or familiar. One involved envisioning a series of end-user design concepts based on the insights (broadly, means of intervention that users might find helpful). The other involved addressing a single cognitive science research question emerging from the analysis of outlier mental effort scores (Jonker et al. in review).

This section focuses on the third on the list — a follow-on project focused on technology infrastructure development — to illustrate a form of impact that can be created through work products that may be novel in applied ethnography. This project involved developing and partially implementing two data labelling protocols based on abstractions deemed potentially useful for context-aware assistive technology. The abstractions, protocols, and resulting labelled data set each served an early informative role in infrastructure development.

Building frameworks or abstractions that make sense of the human, social world should feel familiar to applied ethnographers. Abstractions are also the foundation for making any machine learning possible. Without abstractions, machine learning models would have to cope with an infinite number of categories with one data point each. For example, we might use the abstraction of a dog to build machine learning models that are able to detect dogs across breed, age, size, and so on. In our setting, the most useful level of abstraction would allow a machine learning model to reduce the inherent complexity of context and to home in on what is most relevant for the human in a given moment.

To guide the development of useful abstractions, we studied the literature of conversational agents, or chatbots, an area where researchers have encountered similar challenges in terms of complexity. Our task involved attempting to “read” and interpret a context for meaning. Similarly, chatbot development involves seeking to extract “meaning” embedded in the syntax of language, treating a text as more than a sequence of words. Recent work has shown how hierarchies of abstractions can improve the performance of chatbots. In particular, research scientists Khatri et al. (2018) find that incorporating dialogue acts, inspired by philosopher John Searle (1969), can improve the performance of their contextual topic model for dialogue systems.

Inspired by recent advances in the field of conversational agents, we developed two sets of abstractions, Set A and Set B, that repackaged and represented the strongest patterns around the experience of context emerging from our hybrid analysis processes. Abstraction Set A was more holistic (more “zoomed out” in its representation of aspects of context) whereas Abstraction Set B was more granular, and broke down context into several components. Each abstraction set was mutually constitutive of the other (i.e. each abstraction set represented and reframed the content of the other), but each was also independent of the other (i.e. one set did not need the other set in order to be legible).

Our abstractions served as the foundation for the development of several data labelling protocols, which consisted of a set of instructions for how to generate a labelled data set of human experience of context. Ultimately, this labelled data set is needed to train a machine learning model to detect Abstraction Set A and Set B. In order for annotators to be able to label a piece of data as a given abstraction, they need to know what the abstraction is, which in our case was not as straightforward as, for instance, labelling whether or not an image contains a dog. Most of us share an understanding of the abstraction of a dog, and we have no difficulty pointing at examples. In comparison to the abstraction of a dog, Abstraction Set A and Set B were more ambiguous, and closer to concepts such as “freedom” or “democracy.” There is a rich tradition in the social sciences for how to reliably encode data with abstract concepts. Political science, in particular, contains several examples such as the Polity data, which rates countries on a numeric scale from democratic to authoritarian; the Comparative Manifesto Project, which contains coded summaries of political party manifestos and is an often-used source for placing parties on a left-right scale; or Transparency International’s yearly corruption perceptions index.

Traditionally, and in all three examples mentioned above, data that involves more abstract concepts is generated by experts, often academics with deep subject matter expertise. However, to generate data at sufficient scale to train a machine learning model, we need to be able to move beyond experts, who are generally costly and in short supply. Thus, we needed to ensure annotators had a sufficiently nuanced understanding of our abstractions to be able to label data as if they were experts without requiring them to be trained ethnographers or have deep knowledge of our project: the abstractions needed to be teachable. Further, annotators needed to be able to detect an abstraction from video and audio alone, without access to our field notes. Despite the lower expertise of naive annotators, recent research indicates that deploying crowd-sourcing can generate results that are indistinguishable from expert approaches (see Benoit et al. 2016 for an example in the context of political texts).

Teaching new abstract concepts is hard. We took an examples-based approach, in which the abstractions were primarily taught through instructive examples in the form of brief video clips from our field recordings. We first provided the annotators a brief description of the abstraction. Afterwards, the annotators were shown three examples that highlighted various aspects of the abstraction. The first example was a prototype of the given abstraction: the clearest illustration of the abstraction we had in our data. However, a clear example is not enough to be able to meaningfully label data from long-form video. It is equally important that annotators understand that moments can vary along important dimensions and still belong to the same abstraction. For this reason, we provided two additional examples that highlighted meaningful variation within the abstraction. These examples helped annotators understand the different dimensions of an abstraction, which in turn helped them set boundaries and differentiate between various abstractions. Ideally we would have had a fuller training set, with several varied examples for each abstraction, but at the time we had three training examples for each. Ultimately, the labelling protocol needs to strike a balance between sharing enough information to learn the abstraction but not so much information that the protocol starts to resemble an exhaustive catalog of variants.

Testing the annotators required a validation strategy. We were looking to test the degree to which the naive annotators were able to replicate our “expert” labels. To develop a benchmark, we took a piece of data and labelled it based on consensus among us as researchers over whether or not a given abstraction was present in the data. We did this for Abstraction Set A and Abstraction Set B, because we wanted to compare results and see which abstraction set was more easily learnable for naive annotators. Establishing this benchmark collaboratively, as an interdisciplinary team, meant an iterative discussion and refinement of what the definitions of the abstractions themselves were.

Training and testing annotators on the abstractions took a full day, and the data set available was modest (50 hours of video footage), as it came directly from the fieldwork. The team provided the training and assessed annotators as a group and individually against the benchmark data set, labelled by us. At this phase of the project the ambition was not to develop training data for a machine learning model, but to explore whether it was possible for a group of naive annotators to learn and apply our abstractions. Going forward, we envision a process where naive annotators are initially screened based on their ability to replicate our “expert” labels. After this screening, new data should be labelled based on majority voting among selected annotators, as is common in the literature (e.g., Fridman et al. 2019).
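The envisioned two-step process (screen annotators against the benchmark, then label new data by majority vote) can be sketched roughly as below. Everything specific here is an assumption made for illustration: the label names "A1" and "A2", the annotator IDs, the 0.8 accuracy threshold, and the choice to return no label on a tie.

```python
from collections import Counter

# Hypothetical labels: "A1"/"A2" stand in for abstractions in Set A, "none" for no abstraction.
benchmark = ["A1", "A2", "A1", "none", "A2"]
screening_labels = {
    "ann1": ["A1", "A2", "A1", "none", "A2"],
    "ann2": ["A1", "A2", "A1", "none", "A1"],
    "ann3": ["A2", "A2", "none", "none", "A1"],
    "ann4": ["A1", "A2", "A2", "none", "A2"],
}
# Labels the same annotators gave to two new, not yet benchmarked clips.
new_clip_labels = {
    "ann1": ["A1", "none"],
    "ann2": ["A1", "A2"],
    "ann3": ["A2", "A2"],
    "ann4": ["A2", "none"],
}


def accuracy(labels, reference):
    """Share of items on which an annotator matches the researchers' benchmark."""
    matches = sum(1 for a, b in zip(labels, reference) if a == b)
    return matches / len(reference)


def screen(annotators, reference, threshold=0.8):
    """Keep annotators whose benchmark accuracy meets the (assumed) threshold."""
    return [name for name, labels in annotators.items()
            if accuracy(labels, reference) >= threshold]


def majority_vote(votes):
    """Return the most common label, or None when the top two labels tie."""
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]


selected = screen(screening_labels, benchmark)
consensus = [
    majority_vote([new_clip_labels[name][i] for name in selected])
    for i in range(2)
]
print(selected, consensus)
```

Returning no label on a tie leaves the decision to a later adjudication step; a real protocol might instead add an annotator or defer to the researchers.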

Our initial results are mainly positive. Overall, annotators had above-chance ability to agree with our labels, with the best-performing annotators missing the benchmark by only 5%. For the most part, there was relatively high inter-rater agreement among the naive annotators, indicating that different naive annotators would be able to independently and approximately reproduce our abstraction labels. The team found that the best-performing naive annotators understood and implemented the “rules” for coding (e.g. all abstractions in Abstraction Set A are mutually exclusive) and shared an understanding of the granularity of the labelling task. Difficulties in this area invited errors of two kinds: either parsing the long-form video data too granularly (applying labels to less than salient evidence of an abstraction) or not granularly enough (failing to apply labels to salient evidence of an abstraction).
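One common way to express "above-chance" agreement is Cohen's kappa, which discounts the agreement two labellers would reach by chance given how often each uses every label. The sketch below is offered only as an illustration; the paper does not state which agreement statistic was used, and the labels and the function name are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two label sequences of equal length."""
    n = len(labels_a)
    observed = sum(1 for a, b in zip(labels_a, labels_b) if a == b) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both labellers assigned labels independently
    # at their own observed rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical benchmark and one annotator's labels for eight video segments.
benchmark = ["A1", "A1", "A2", "A2", "none", "none", "A1", "A2"]
annotator = ["A1", "A1", "A2", "none", "none", "none", "A1", "A1"]

kappa = cohens_kappa(annotator, benchmark)
print(round(kappa, 3))
```

The same function can compare two naive annotators with each other, which gives a pairwise view of inter-rater agreement. A value near 0 means agreement no better than chance, and 1 means perfect agreement.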

In our testing phase, we also captured, but have not yet analysed, annotators’ certainty when labelling a given piece of data, as well as both point estimates and ranges of start and end times for a given abstraction. These data allow us to analyse accuracy across different levels of (un)certainty, and understand the degree to which annotators disagree about when an abstraction starts and ends. We also allowed annotators to tag and suggest new potential abstraction labels within Abstraction Set A or B (whichever one they were coding), as a way to generate new potential abstractions that could be refined in further analysis going forward.
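Since the timing data had not yet been analysed, the following is only one possible way to quantify disagreement about when an abstraction starts and ends: the intersection over union of two annotators' time intervals. The metric choice, the function name, and the example times are assumptions, not part of the study.

```python
def interval_iou(a, b):
    """Intersection over union of two (start, end) intervals, in seconds."""
    start_a, end_a = a
    start_b, end_b = b
    intersection = max(0.0, min(end_a, end_b) - max(start_a, start_b))
    union = (end_a - start_a) + (end_b - start_b) - intersection
    return intersection / union if union > 0 else 0.0


# Hypothetical start/end estimates from two annotators for the same abstraction.
overlap = interval_iou((10.0, 40.0), (20.0, 50.0))
no_overlap = interval_iou((10.0, 20.0), (30.0, 40.0))
print(overlap, no_overlap)
```

A score of 1 means the two annotators marked identical boundaries, and 0 means their intervals do not overlap at all. The reported ranges of start and end times could be compared the same way by treating each range's outer bounds as an interval.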

This process of developing a data labelling protocol led to some overall lessons on making work products from hybrid ethnography that are relevant to specific intellectual communities. Applied ethnographers should have the ability to recognize the limits of work products and their utility — for instance, abstractions alone are not useful in building technology infrastructure. Indeed, applied ethnography often deals in abstractions or frameworks but does not often go a step further and apply them to machine learning problems — to do this ethnographers need to build another kind of work product, namely data labelling protocols. Developing data labelling protocols requires developing training material that links abstractions back to very concrete, detectable, recognizable examples in the data. This requires, at a certain point in a project, moving away from the nuance and complexity of ethnographic thinking and being quite firm and mutually exclusive about what something is or isn’t in order for labelling to be possible. Establishing benchmarks (for the annotators to learn) requires consensus among the researchers, and iteration and refinement of the abstractions themselves in the process, as researchers are forced to be very clear about what something is or isn’t. We have found that this process helps to sharpen the precision of the original concepts themselves. Labels must then be tested with annotators, who look at raw data and label based on the learned abstractions — and here what we discovered is that inter-rater reliability when dealing with such complex topics (human experience of context) might be lower than what is organizationally common or acceptable, and that annotator training for such complex topics is time-consuming and research-intensive.

Overall, if applied ethnographers want to influence infrastructures like those that support context-aware assistive technologies, these teams are greatly helped by a willingness to extend their frameworks and use them to form new work products. In our case of hybrid methodology we did so by extending our abstractions into a tested data labelling protocol, in addition to informing experience design and the research agenda for cognitive science teams.