You too can collect big data!



EPIC2014 Workshop by Anna Avrekh, Kathy Baxter, & Bob Evans

At the EPIC 2013 Keynote, Tricia Wang observed that, if you are not working with “Big Data,” the implication is that your data are “small.” Although the number of data points or participants may not be in the millions or even thousands, the data we gather is actually far richer. As our community knows, web analytics or logs can tell us WHAT people are doing but never WHY. We may attempt to infer it based on what we see but unless we ask our users why they are doing something that we have recorded (with or without their knowledge), we can never know for sure.

Later in the conference, I hosted a Salon on “Big Data” with discussants Jens Riegelsberger (Google) and Todd Cherkasky (SapientNitro). The interest in the salon far exceeded the space available. One key theme that emerged was participants’ desire to learn how to incorporate “Big Data” into their work. Few of the participants had the means to pull logs and do deep statistical analysis on them. This makes it extremely difficult to pair up the WHAT others are collecting with the WHY they are observing. I realized the community might be interested in a methodology we have been using for the last three years at Google called “Experience Sampling Methodology” (ESM), which combines the best of both worlds in a scalable manner. With the help of a mobile app called the “Personal Analytics Companion” (PACO) created by Bob Evans, we have been able to conduct large-scale ESM studies that have the richness of diary studies, the frequency of measurement and context of field studies, and the scale of small online experiments.

In our workshop, we discussed the background of ESM research; issues of validity, reliability, and bias; best practices for conducting a large-scale ESM study; and how to analyze the data. We finished with a hands-on exercise where participants built an experiment themselves.

OK, so what is an ESM study anyway? ESM asks participants at random points throughout their day about their experiences in the moment. Using a tool like PACO, participants are pinged randomly on their Android or iPhone 5-8 times per day to record what they are doing at that moment and describe their experience (e.g., satisfaction, where they are, what tools they are using). They can also share photos with us, if they feel it will help us understand what they are reporting. If participants give us permission, we can even connect the logs from their phone with the qualitative data they are reporting. Responding to each ping takes participants a few seconds to a couple of minutes, making this a very lightweight data collection method. It is not as intrusive as following a participant around all day for several days, and we know from past research that participants acclimate to the pings after only a couple of days.
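To make the idea of random pinging concrete, here is a minimal Python sketch of how a day's worth of pings could be scheduled. This is not how PACO actually schedules its pings; the function name `schedule_pings`, the 9:00 to 21:00 waking window, and the choice to draw one random time from each equal slice of the day are all assumptions made for illustration.

```python
import random
from datetime import datetime, timedelta


def schedule_pings(day, start_hour=9, end_hour=21,
                   min_pings=5, max_pings=8, rng=None):
    """Return ping times for one day, in chronological order.

    Hypothetical helper: the window and the slot-based sampling are
    illustrative assumptions, not PACO's actual scheduling logic.
    """
    rng = rng or random.Random()
    # Pick how many pings this participant gets today (5-8 in the post).
    n_pings = rng.randint(min_pings, max_pings)

    window_start = day.replace(hour=start_hour, minute=0,
                               second=0, microsecond=0)
    window_seconds = (end_hour - start_hour) * 3600
    slot_seconds = window_seconds / n_pings

    pings = []
    for slot in range(n_pings):
        # One uniformly random moment inside each equal-length slot, so
        # pings are unpredictable but still spread across the day.
        offset = (slot + rng.random()) * slot_seconds
        pings.append(window_start + timedelta(seconds=offset))
    return pings


if __name__ == "__main__":
    for ping in schedule_pings(datetime(2014, 9, 8), rng=random.Random(42)):
        print(ping.strftime("%H:%M"))
```

Drawing one time per slot is only one possible design. It keeps pings from clustering in a single hour, at the cost of being slightly less random than drawing every time independently from the whole window.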

At the end of each day, we provide participants with a form that shows everything they reported that day, including any photos. We then ask some additional questions to further understand their experience (e.g., Did you complete your task?  What are all the ways you looked for that information today?).

By doing this for 5-7 days, we can get a detailed look into the lives of hundreds, if not thousands, of participants. In the last ESM study the Search research team conducted, we collected data from 1200 participants across 47 US states over a three-month period. We measured 186 variables resulting in 4.756 million cells! It is important to note (if it wasn’t clear already), all of this is done with the participant’s consent. They can see a dashboard of their own data and download it, which participants have told us is fascinating because it makes them aware of behaviors or patterns they hadn’t been cognizant of before.

If you’d like to try out a study for yourself as a participant, you can download PACO for free from Google Play and the Apple App Store. Do a search for experiments in the app and you will find many public experiments going on at any given time. You can also use PACO for your own research. PACO is open source, so if you’re a developer, you can have a blast customizing the tool for your needs!


  4 comments for “You too can collect big data!”

  1. This is not a comment on the substantive content of this post – PACO is great. I’ve used it before as well as other forms of ESM. I encourage others to experiment with it. Bob has done a terrific job!

    But, in the context of both my membership in a community of ethnographic inquiry that tends to take a range of positions towards the presence and utility of online analytics, and as someone who works with distributed computational systems on a regular basis, I find the term “big data” to be both frustrating and incredibly useless.

    It is, at best, meaningless, since there is no consensus on what “big data” even means, and, at worst, a misnomer, since the “big” presumably implied in “big” data refers to data which an individual computer cannot effectively store or process and therefore must be processed in shards across multiple computers. What is unique is the performance of operations on tera-, peta-, and exabytes of data. Otherwise, there is nothing novel about the phenomena referred to above as “big data” nor is it different from other quantitative datasets over the last fifty years.

    But even worse are the consequences of discourse which objectifies “big data” as a category of practice and thereby grants it both status and power. It is precisely this mistake that our clients make when they seek value comparisons between the scale of big data and the depth of anthropological inquiry, resulting in the scenario adumbrated by Todd Cherkasky himself in his paper with Adrian Slobin at EPIC 2010.

    This community of practice should find itself in a position to understand, assess, critique, comprehend, and deconstruct the layers of meaning assumed by this term, rather than operate within the confines of a symbolic order which implicitly grants it privilege. That does not preclude its usage as a method – but an informed specification of *what it is* and *how we know.*

    Further discussion, see also

  2. Interesting points, Neal. It seems at the moment the term can really mean lots of things. In the workshop it was agreed to use the term as shorthand for the collection of this data. Data Science, at least at Berkeley, is allowing for a range of usage. What they point out, that I appreciate, is it is not the size of the data (or any of the 3 V’s: volume, velocity or variety) but the tools used to analyze data. They, however, asked 40 leaders to respond. Here is what they said:

  3. Moore’s Law of Big Data:
    The amount of nonsense packed into the term “big data” doubles every two years
    – Mike Pluta Aug 10, 2014
    I remember when we used to try to define what ethnography was . . .
    BTW Neal – nice reading list from your EPIC2014 salon at AskEPIC.
