Today at Harvard Law School’s weekly Berkman Center lunch, Aaron Shaw presented research into the potential that Amazon’s Mechanical Turk (AMT) holds for social science, and into the culture that surrounds it. His talk drew upon research in progress from the Berkman Center’s Online Cooperation group, in collaboration with Daniel Chen and John Horton.
Although the presentation itself, cheekily entitled “HIT me baby one more time, Or: How I learned to stop worrying & love Amazon Mechanical Turk,” was a bit light on statistics, the conversation within Berkman’s community around the issues of labor laws, privacy, methodology and technological potential was fascinating, as always.
Aaron Shaw at Berkman
As Shaw noted, the name of Amazon’s Mechanical Turk comes from a chess-playing “automaton” that was no mechanical creation at all, but instead a clever contraption that hid a chess master inside. Amazon’s version farms out small tasks, called “HITs” (Human Intelligence Tasks), that require a human to accomplish.
As an aside, I have to note that, as Peggy Rouse pointed out in “Mechanical Turk, Powerset and enterprise search,” there may be considerably more to Amazon’s strategy than the creation of a crowdsourcing market for simple tasks. She thinks Mechanical Turk may play a role in enterprise search down the road. She’s a canny observer; I’d recommend reading her thoughts.
Early in his presentation, Shaw offered up a shoutout to Andy Baio (@waxpancake), who asked two questions late last year in “Faces of Mechanical Turk”: “What do [Amazon Turk users] look like, and how much does it cost for someone to reveal their face?”
Credit: Andy Baio, Faces of Mechanical Turk
The aggregated image is shown on the right. $0.50 was the magic price, apparently.
As Shaw noted, however, when it comes to the Turk, no public, trustworthy, aggregate data is available. What evidence is available derives from self-selecting surveys and experiments. Those samples showed a large number of women, from many countries of residence (although mostly in the US & India). Speculatively, he noted that the age of users appears to be low, while education and income are high.
Shaw posited that the geographic component is likely correlated with Amazon’s requirement that users hold a US bank account. Given the lack of public data, Shaw’s research relied upon whatever his team could collect on the Turk or through interviews with users and Amazon executives.
So, does the Mechanical Turk work for its users? Sometimes. Shaw noted that once you get a few people performing a given task, the accuracy rate for completion goes up overall, providing the example of machine-learning algorithms.
As he noted wryly, it’s “Not all bots, cheaters and scripts.”
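To make the point about accuracy rising with a few workers concrete, here is a minimal sketch of majority voting over redundant answers to the same HIT. It is not drawn from Shaw's talk: the function names are hypothetical, and the calculation assumes a binary task where each worker is independently correct with the same probability `p`, which real workers will only approximate.

```python
from collections import Counter
from math import comb


def majority_label(labels):
    """Return the most common answer among several workers' answers.

    Ties go to the answer that appeared first (Counter ordering, Python 3.7+).
    """
    return Counter(labels).most_common(1)[0][0]


def majority_accuracy(p, n):
    """Probability that a strict majority of n workers is correct.

    Assumes a two-choice task and n independent workers, each correct
    with probability p. n must be odd so that ties cannot occur.
    """
    if n % 2 == 0:
        raise ValueError("use an odd number of workers to avoid ties")
    k_min = n // 2 + 1  # smallest number of correct answers that wins
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(k_min, n + 1))


if __name__ == "__main__":
    print(majority_label(["cat", "dog", "cat"]))
    for n in (1, 3, 5, 7):
        print(n, round(majority_accuracy(0.7, n), 4))
```

Under those assumptions, a worker who is right 70% of the time becomes a three-worker panel that is right about 78% of the time, and the figure keeps climbing as workers are added. If workers tend to make the same mistakes, the gain will be smaller.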
Task selection and design are important to that success rate: skill matters, on both sides. It’s not just the skill of users and their ability to follow instructions; success also relies upon the skill of the creators of the HITs. Social scientists (scientists of any stripe, really) will recognize the issue here as one of experimental design.
The uses of the Turk cover a broad spectrum, though by nature each represents some form of crowdsourcing. Amazon itself used the Turk to generate product descriptions, questions and answers, thereby “spamming itself,” as Shaw put it.
Spectrum of users of Amazon Mechanical Turk
How else is the Mechanical Turk being put to use?
- The Extraordinaries: “micro-volunteer opportunities to mobile phones that can be done on-demand and on-the-spot”
- CastingWords.com is using it for transcription
- Aaron Koblin (AaronKoblin.com) uses the Turk to create art. For $0.02, he pays users to draw a sheep facing left. He then sells sheets of them for $20, some portion of which is donated to charity.
- Also noted: oDesk, reCAPTCHA, Threadless, Aardvark, liveops
Aside from commercial, artistic or volunteer uses, Shaw believes that Mechanical Turk has considerable potential to enhance social science.
- As a pool of subjects for randomized experiments
- As a pool of inexpert raters for distributed observation, or “coding”
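For the first use, the core mechanic is random assignment of workers to experimental conditions. The sketch below is only an illustration of that idea, not the procedure Shaw's group used: the function name, the worker IDs and the condition labels are all hypothetical, and it assumes the full list of participating workers is known up front.

```python
import random


def assign_conditions(worker_ids, conditions=("control", "treatment"), seed=0):
    """Randomly assign each worker to one condition, keeping groups balanced.

    The worker order is shuffled with a seeded generator (so the assignment
    can be reproduced), then conditions are dealt out in rotation.
    """
    rng = random.Random(seed)
    shuffled = list(worker_ids)  # copy so the caller's list is not modified
    rng.shuffle(shuffled)
    return {
        worker: conditions[i % len(conditions)]
        for i, worker in enumerate(shuffled)
    }


if __name__ == "__main__":
    workers = [f"worker-{i}" for i in range(6)]  # hypothetical IDs
    for worker, condition in sorted(assign_conditions(workers, seed=42).items()):
        print(worker, condition)
```

Because each worker appears once in the mapping, this also mirrors the “one HIT = one person” property: nobody ends up in both groups.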
Advantages to labs?
Low cost of use, ease of paying subjects, speed, diverse subjects (potentially), one HIT = one person, workers do not (usually) interact.
Experiments can consist of contextualized real-effort tasks. Because the Turk has created a real labor market, for text transcription among other things, it is useful in many areas, such as canonical games in economics and paired surveys.
In other words, it’s reducible neither to a manifestation of the “Internet hivemind” nor to some sort of “latter day child labor,” at least in Shaw’s view. The online conversation around the presentation, which included Esther Dyson, was more skeptical on the latter point, noting that the potential for skirting labor laws was not inconsiderable. Shaw readily conceded that the issue is salient. Although he sees such labor issues as “downstream,” he expects to see more of them, given that the “tension is so clear, so stark.”
While at Berkman, Shaw has been advised by Yochai Benkler, who evidently considers the Turk useful for content analysis through distributed observation. In this context, the ability of researchers to randomly assign HITs, so that raters code objects, is helpful. Shaw brought up Klaus Krippendorff, of UPenn, in the context of understanding some of the theory here; I’ll need to go do my due diligence in understanding Krippendorff’s work.
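For context, Krippendorff is known in content analysis for a reliability coefficient, Krippendorff's alpha, which measures how far a set of raters agree beyond what chance would produce. The talk did not say whether Shaw's group uses it, so treat the following as a minimal sketch of the nominal-data version only, with a hypothetical function name and made-up labels. Each inner list holds the codes that different raters gave to one object.

```python
from collections import Counter


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: iterable of lists, one list per coded object, holding the labels
    that the raters who saw that object assigned. Objects with fewer than
    two labels are skipped because they contain no pairs to compare.
    """
    observed = 0.0      # weighted count of disagreeing label pairs
    totals = Counter()  # how often each label occurs in pairable objects
    for labels in units:
        m = len(labels)
        if m < 2:
            continue
        counts = Counter(labels)
        totals.update(counts)
        # Ordered pairs of differing labels within this object,
        # weighted by 1 / (m - 1) as in the coincidence matrix.
        differing = m * m - sum(c * c for c in counts.values())
        observed += differing / (m - 1)

    n = sum(totals.values())
    # Ordered pairs of differing labels expected if labels were shuffled.
    expected = n * n - sum(c * c for c in totals.values())
    if expected == 0:
        raise ValueError("alpha is undefined when only one label is used")
    return 1.0 - (n - 1) * observed / expected


if __name__ == "__main__":
    codes = [["spam", "spam"], ["spam", "ok"], ["ok", "ok"]]  # made-up data
    print(round(krippendorff_alpha_nominal(codes), 3))
```

An alpha of 1 means perfect agreement and values near 0 mean agreement no better than chance. Ordinal or interval codes would need a different distance function than the simple same-or-different comparison used here.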
Yochai has noted that specific groups involved in distributed computing projects, like SETI, have performed admirably. According to Shaw, in fact, “The Knights who say ‘Nee’ perform quite well when measured against other countries with distributed computing.”
I also heard about the “Turkopticon,” a Firefox extension that allows users to submit feedback about HIT creators. Although Shaw said that it is not widely installed, it’s clearly a step towards community self-policing.
When asked about the utility of using the Turk for searching for missing computer scientist Jim Gray or searching for Steve Fossett’s plane, Shaw immediately recognized the value but hadn’t examined the data sets in question at length.
The question itself begged for a follow-up, given the release of Chris Anderson’s “Free” this week: How and why are users motivated to complete HITs when altruism is involved? Is work of higher quality when there is money involved?
Shaw offered a cautious affirmation, though with reservations: Payment vs free is “such a loaded issue in society. The symbolic value of money or donation is humongous.”
A Berkman Fellow in attendance, Chris Soghoian, noted that his advisor pays 5-10x the market rate and gets email about when the next task is coming, along with decent results.
In Shaw’s view, there needs to be “a more serious examination of the question. Experimental evidence of research suggest sub-populations of people who would respond differently. Some people will be motivated by doing good, others don’t care, want the .05. We need better ways to test. It’s situation-specific.”
As he wryly noted, “We’re not all homo economicus.”
As usual, this was an excellent lunch. You can view the archived video of the presentation as a .mov.
Following the presentation, Aaron wrote me to add the following:
“Daniel and John’s contributions to the field of experimental research on online labor markets include
- recognizing that AMT could serve as a venue for experimental studies;
- conducting the earliest labor market experiments on AMT;
- solving a bunch of difficult problems so that they could make valid causal inference based on the results of these experiments.”
I have to note one other organization I learned about today: “TxtEagle.” TxtEagle is an innovative concept for active “mobile crowdsourcing,” distributing small-scale jobs via SMS and delivering payment by the same method.
In other words, microjobs with micropayments. The mobile platform’s founders recognize that there are more than 2 billion mobile phone users in the developing world who could potentially be leveraged to perform tasks. The BBC wrote that “txteagle is changing the dynamics of outsourcing labour.” Hard to disagree with that.