Research Papers: Design Automation

A Bayesian Sampling Method for Product Feature Extraction From Large-Scale Textual Data

Sunghoon Lim

Industrial and Manufacturing Engineering,
The Pennsylvania State University,
University Park, PA 16802
e-mail: slim@psu.edu

Conrad S. Tucker

Engineering Design and Industrial
and Manufacturing Engineering,
The Pennsylvania State University,
University Park, PA 16802
e-mail: ctucker4@psu.edu

1Corresponding author.

Contributed by the Design Automation Committee of ASME for publication in the JOURNAL OF MECHANICAL DESIGN. Manuscript received June 29, 2015; final manuscript received March 24, 2016; published online April 20, 2016. Assoc. Editor: Gary Wang.

J. Mech. Des. 138(6), 061403 (Apr 20, 2016) (9 pages) Paper No: MD-15-1456; doi: 10.1115/1.4033238

This work proposes an algorithm that determines optimal search keyword combinations for querying online product data sources, with the goal of minimizing identification errors during the product feature extraction process. Data-driven product design methodologies that acquire and mine online product-feature-related data face two fundamental challenges: (1) determining optimal search keywords that return relevant product-related data and (2) determining how many search keywords are sufficient to minimize identification errors during the product feature extraction process. These challenges exist because online data, which are primarily textual, may violate the statistical assumptions of independence and identical distribution of the samples returned by a query. Existing design methodologies use predetermined search terms to acquire textual data online, which makes the quality of the acquired data a function of the quality of the search terms themselves. Furthermore, the lack of independence and identical distribution of text data from online sources affects the quality of the acquired data. For example, a designer may search for a product feature using the term “screen,” which may return relevant results such as “the screen size is just perfect,” but may also return irrelevant noise such as “researchers should really screen for this type of error.” A text mining algorithm is introduced that determines, without labeled training data, the optimal search terms that maximize the veracity of the acquired data so that valid conclusions can be drawn. A case study involving real-world smartphones is used to validate the proposed methodology.
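The “screen” example can be made concrete with a small sketch. The code below is not the authors' Bayesian sampling method; it is a brute-force illustration, under stated assumptions, of why the choice and number of search keywords changes identification error. It scores every combination of candidate keywords by precision, recall, and F1 on a toy corpus. The first two sentences are quoted from the abstract; the other sentences, the relevance labels, and all function names (`matches`, `f1_for_query`, `rank_queries`) are hypothetical. The labels are used here only to evaluate queries, whereas the paper's algorithm is described as working without labeled training data.

```python
from itertools import combinations

# Toy corpus: (text, is_relevant_to_the_screen_feature).
# The first two sentences are quoted from the abstract; the rest are hypothetical.
CORPUS = [
    ("the screen size is just perfect", True),
    ("researchers should really screen for this type of error", False),
    ("the screen resolution on this phone is sharp", True),
    ("the display size feels too small", True),
    ("please screen all calls this afternoon", False),
]


def matches(text, keywords):
    """Return True when every keyword appears as a whole word in the text."""
    words = set(text.lower().split())
    return all(k in words for k in keywords)


def f1_for_query(keywords, corpus=CORPUS):
    """Return (precision, recall, F1) of the documents a keyword query returns."""
    returned = [rel for text, rel in corpus if matches(text, keywords)]
    true_positives = sum(returned)
    total_relevant = sum(rel for _, rel in corpus)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / total_relevant if total_relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def rank_queries(candidates, max_terms=2, corpus=CORPUS):
    """Score every combination of up to max_terms keywords, best F1 first."""
    scored = []
    for n in range(1, max_terms + 1):
        for combo in combinations(candidates, n):
            scored.append((combo, f1_for_query(combo, corpus)))
    return sorted(scored, key=lambda item: item[1][2], reverse=True)


if __name__ == "__main__":
    for combo, (p, r, f1) in rank_queries(["screen", "size"]):
        print(combo, f"precision={p:.2f} recall={r:.2f} F1={f1:.2f}")
```

On this toy corpus, the single term “screen” retrieves the two noise sentences along with the relevant ones, while adding “size” removes the noise at the cost of recall. This is the trade-off behind the two challenges named in the abstract: which keywords to use, and how many.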

Copyright © 2016 by ASME




Fig. 1 Overview of the proposed methodology

Fig. 2 N, R, and data containing “siri” or “ios”

Fig. 3 Term disambiguation problem and keyword recognition problem

Fig. 4 The process of the proposed methodology

Fig. 5 Average values of the F1 scores


