Research Papers: Design Theory and Methodology

When Crowdsourcing Fails: A Study of Expertise on Crowdsourced Design Evaluation

Alex Burnap

Design Science,
University of Michigan,
Ann Arbor, MI 48109
e-mail: aburnap@umich.edu

Yi Ren

Research Fellow
Department of Mechanical Engineering,
University of Michigan,
Ann Arbor, MI 48109
e-mail: yiren@umich.edu

Richard Gerth

Research Scientist
National Automotive Center,
Warren, MI 48397
e-mail: richard.j.gerth.civ@mail.mil

Giannis Papazoglou

Department of Mechanical Engineering,
Cyprus University of Technology,
Limassol 3036, Cyprus
e-mail: papazogl@umich.edu

Richard Gonzalez

Department of Psychology,
University of Michigan,
Ann Arbor, MI 48109
e-mail: gonzo@umich.edu

Panos Y. Papalambros

Fellow ASME
Department of Mechanical Engineering,
University of Michigan,
Ann Arbor, MI 48109
e-mail: pyp@umich.edu

1Corresponding author.

Contributed by the Design Theory and Methodology Committee of ASME for publication in the JOURNAL OF MECHANICAL DESIGN. Manuscript received April 29, 2014; final manuscript received November 6, 2014; published online January 9, 2015. Assoc. Editor: Jonathan Cagan.

This material is declared a work of the U.S. Government and is not subject to copyright protection in the United States. Approved for public release; distribution is unlimited.

J. Mech. Des. 137(3), 031101 (Mar 01, 2015) (9 pages) Paper No: MD-14-1264; doi: 10.1115/1.4029065

Crowdsourced evaluation is a promising method of evaluating engineering design attributes that require human input. The challenge is to correctly estimate scores using a massive and diverse crowd, particularly when only a small subset of evaluators has the expertise to give correct evaluations. Since averaging evaluations across all evaluators will result in an inaccurate crowd evaluation, this paper benchmarks a crowd consensus model that aims to identify experts such that their evaluations may be given more weight. Simulation results indicate this crowd consensus model outperforms averaging when it correctly identifies experts in the crowd, under the assumption that only experts have consistent evaluations. However, empirical results from a real human crowd indicate this assumption may not hold even on a simple engineering design evaluation task, as clusters of consistently wrong evaluators are shown to exist along with the cluster of experts. This suggests that both averaging evaluations and a crowd consensus model that relies only on evaluations may not be adequate for engineering design tasks, accordingly calling for further research into methods of finding experts within the crowd.
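As a rough illustration of the contrast drawn in the abstract, the sketch below compares plain averaging with an expertise-weighted average on a small synthetic crowd. The scores, weights, and function names are all hypothetical, and the weighting is a simple stand-in rather than the paper's Bayesian network model. It only shows why weights that correctly favor experts help, and why weights inferred from consistency alone can hurt when a cluster of consistently wrong evaluators exists.

```python
from statistics import fmean

def average_score(evaluations):
    """Unweighted crowd score: every evaluator counts equally."""
    return fmean(evaluations)

def weighted_score(evaluations, weights):
    """Crowd score in which each evaluation is scaled by an expertise weight."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * r for w, r in zip(weights, evaluations)) / total

# Hypothetical crowd for one design whose true score is 0.8:
# two experts near the truth, six non-experts who agree on a wrong value.
true_score = 0.8
evaluations = [0.80, 0.78, 0.30, 0.30, 0.32, 0.28, 0.30, 0.31]

# Assumed weights that correctly single out the two experts.
expert_weights = [1.0, 1.0, 0.05, 0.05, 0.05, 0.05, 0.05, 0.05]
# Assumed weights that reward consistency only, favoring the wrong cluster.
consistency_weights = [0.05, 0.05, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]

err_avg = abs(average_score(evaluations) - true_score)
err_expert = abs(weighted_score(evaluations, expert_weights) - true_score)
err_consistency = abs(weighted_score(evaluations, consistency_weights) - true_score)

print(f"averaging error:            {err_avg:.3f}")
print(f"expert-weighted error:      {err_expert:.3f}")
print(f"consistency-weighted error: {err_consistency:.3f}")
```

In this toy setting the expert-weighted error is the smallest and the consistency-weighted error is the largest, which mirrors the failure mode the abstract describes.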

Copyright © 2015 by ASME


Fig. 1

Graphical representation of the Bayesian network crowd consensus model. The model describes a crowd of evaluators making evaluations $r_{pd}$ that deviate from the true score $\Phi_d$. Each evaluator has an expertise $a_p$ and each design has a difficulty $d_d$. The gray shading on the evaluation $r_{pd}$ denotes that it is the only observed data in this model.

Fig. 2

(a) Low evaluation expertise (dashed) relative to the design evaluation difficulty results in an almost uniform distribution of an evaluator's evaluation response, while high evaluation expertise (dotted) results in evaluations closer to the true score. (b) An evaluator's evaluation error variance $\sigma_{pd}^2$ as a function of that evaluator's expertise $a_p$, given some fixed design difficulty $d_d$ and crowd-level parameters $\theta$ and $\gamma$.
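The qualitative behavior in this caption can be sketched in a few lines. The caption does not give the functional form of the variance, so the logistic expression below is an assumed placeholder, as are the default values of theta and gamma, the Gaussian-with-rejection sampling, and every function name. It is a sketch of the idea only: evaluators whose expertise is low relative to difficulty respond almost uniformly, while high-expertise evaluators land near the true score.

```python
import math
import random

def error_variance(expertise, difficulty, theta=10.0, gamma=0.0):
    """Assumed placeholder form, not the paper's equation:
    variance shrinks as expertise rises above difficulty."""
    return 1.0 / (1.0 + math.exp(theta * (expertise - difficulty) - gamma))

def sample_evaluation(true_score, variance, rng):
    """Draw one evaluation in [0, 1] from a Gaussian centered on the
    true score, rejecting draws that fall outside the score range."""
    sigma = math.sqrt(variance)
    while True:
        r = rng.gauss(true_score, sigma)
        if 0.0 <= r <= 1.0:
            return r

def mean_abs_error(expertise, difficulty, true_score, n, rng):
    """Average absolute deviation from the true score over n sampled evaluations."""
    var = error_variance(expertise, difficulty)
    draws = [sample_evaluation(true_score, var, rng) for _ in range(n)]
    return sum(abs(r - true_score) for r in draws) / n

rng = random.Random(0)  # fixed seed so the sketch is repeatable
low_err = mean_abs_error(0.1, 0.5, 0.7, 2000, rng)   # expertise below difficulty
high_err = mean_abs_error(0.9, 0.5, 0.7, 2000, rng)  # expertise above difficulty
print(f"low-expertise mean error:  {low_err:.3f}")
print(f"high-expertise mean error: {high_err:.3f}")
```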

Fig. 5

Case II: Design evaluation error over a set of designs for a mixed crowd with low average evaluation expertise. As the crowd's variance of expertise increases, a higher proportion of high-expertise evaluators is present within the crowd. Eventually the Bayesian network is able to identify the cluster of high-expertise evaluators, whereupon the evaluation error drops to zero.

Fig. 7

Clustering of evaluators based on how similar their evaluations are across all eight designs. Each point represents an individual evaluator: colored points represent evaluators who were similar to at least three other evaluators, and black points represent evaluators who tended to give more distinctive evaluations.
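The "similar to at least three other evaluators" criterion can be sketched as a neighbor count. The Euclidean distance, the radius eps, the synthetic evaluation vectors, and the function names below are all assumptions for illustration, not the clustering procedure or data used in the study.

```python
import math

def similar_counts(evaluations, eps):
    """For each evaluator, count the other evaluators whose evaluation
    vectors lie within Euclidean distance eps."""
    counts = []
    for i, a in enumerate(evaluations):
        n = sum(1 for j, b in enumerate(evaluations)
                if j != i and math.dist(a, b) <= eps)
        counts.append(n)
    return counts

def in_cluster(evaluations, eps=0.3, min_similar=3):
    """Flag evaluators who are similar to at least min_similar others."""
    return [c >= min_similar for c in similar_counts(evaluations, eps)]

# Hypothetical evaluations of eight designs by ten evaluators.
base_a = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]  # one consistent group
base_b = list(reversed(base_a))                     # a second, opposite group
offsets = [0.00, 0.01, 0.02, 0.03]
group_a = [[v + o for v in base_a] for o in offsets]
group_b = [[v + o for v in base_b] for o in offsets]
loners = [[0.5] * 8, [0.1, 0.9] * 4]                # evaluators unlike anyone else

crowd = group_a + group_b + loners
flags = in_cluster(crowd)
print(flags)
```

Both groups are flagged as clusters even though at most one of them can be correct, which is why consistency alone cannot separate experts from consistently wrong evaluators.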

Fig. 4

Case I: Design evaluation error from the averaging and Bayesian network methods as a function of average evaluator expertise for homogeneous crowds. For homogeneous crowds, using the Bayesian network to aggregate the set of evaluations into the crowd's consensus score yields only marginal benefits, seen around the 0.4–0.7 range of evaluator expertise.

Fig. 3

Crowd expertise distributions for Cases I and II, which test how the expertise of evaluators within the crowd affects evaluation error for homogeneous and heterogeneous crowds, respectively. Three possible sample crowds are shown for both cases.

Fig. 6

(a) Boundary conditions for bracket strength evaluation and (b) the set of all eight bracket designs

Fig. 8

Design evaluation error with respect to additional experts

Fig. 9

Design evaluation error with respect to the proportion of the expert group
