Quantitative preference models are used to predict customer choices among design alternatives by collecting prior purchase data or survey answers. This paper examines how to improve the prediction accuracy of such models without collecting more data or changing the model. We propose to use features as an intermediary between the original customer-linked design variables and the preference model, transforming the original variables into a feature representation that captures the underlying design preference task more effectively. We apply this idea to automobile purchase decisions using three feature learning methods (principal component analysis (PCA), low rank and sparse matrix decomposition (LSD), and exponential sparse restricted Boltzmann machine (RBM)) and show that the use of features offers improvement in prediction accuracy using over 1 million real passenger vehicle purchase data. We then show that the interpretation and visualization of these feature representations may be used to help augment data-driven design decisions.

# Improving Design Preference Prediction Accuracy Using Feature Learning OPEN ACCESS

**Alex Burnap**

**Yanxin Pan**

**Ye Liu**

**Yi Ren**

**Honglak Lee**

**Richard Gonzalez**

**Panos Y. Papalambros**

^{1}A. Burnap and Y. Pan contributed equally to this work.

Contributed by the Design Automation Committee of ASME for publication in the JOURNAL OF MECHANICAL DESIGN. Manuscript received October 15, 2015; final manuscript received April 15, 2016; published online May 18, 2016. Assoc. Editor: Carolyn Seepersad.

*J. Mech. Des*138(7), 071404 (May 18, 2016) (12 pages) Paper No: MD-15-1708; doi: 10.1115/1.4033427 History: Received October 15, 2015; Revised April 15, 2016

Much research has been devoted to develop design preference models that predict customer design choices. A common approach is to (i) collect a large database of previous purchases that includes customer data, e.g., age, gender, income, and purchased product design data, e.g., no. of cylinders, length, and curb weight—for an automobile, and (ii) statistically infer a design preference model that links customer and product variables, using conjoint analysis or discrete choice analysis such as logit, mixed logit, and nested logit models [1,2].

However, a customer may not purchase a vehicle solely due to interactions between these two sets of variables, e.g., a 50-yr old male prefers six-cylinder engines. Instead, a customer may purchase a product for more “meaningful” design attributes that are functions of the original variables, such as environmental sustainability or sportiness [3,4]. These meaningful intermediate functions of the original variables, both of the customer and of the design, are hereafter termed *features*. We posit that using customer and product features, instead of just the original customer and product variables, may increase the prediction accuracy of the design preference model.

Our goal then is to find features that improve this preference prediction accuracy. To this end, one common approach is to ask design and marketing domain experts to choose these features intuitively, such as a design's social context [5] and visual design interactions [6]. For example, eco-friendly vehicles may be a function of miles per gallon (MPG) and emissions, whereas environmentally active customers may be a function of age, income, and geographic region. An alternative explored in this paper is to find features “automatically” using feature learning methods studied in computer science and statistics. As shown in Fig. 1, feature learning methods create an intermediate step between the original data and the design preference model by forming a more efficient “feature representation” of the original data. Certain well-known methods such as PCA may be viewed similarly, but more recent feature learning methods have shown impressive results in 1D waveform prediction [7] and 2D image object recognition [8].

We conduct an experiment on automobile purchasing preferences to assess whether three feature learning methods increase design preference prediction accuracy: (1) principle component analysis, (2) LSD, and (3) exponential family sparse RBMs [9,10]. We cast preference prediction as a binary classification task by asking the question: “Given customer **x**, do they purchase vehicle *p* or vehicle *q*?” Our dataset is comprised of 1,161,056 data points generated from 5582 real passenger vehicle purchases in the United States during model year 2006 (MY2006).

The first contribution of this work is an increase of preference prediction accuracy by 2–7% just using simple “single-layer” feature learning methods, as compared with the original data representation. These results suggest features indeed better represent the customer's underlying design preferences, thus offering deeper insight to inform decisions during the design process. Moreover, this finding is complementary to recent work in crowdsourced data gathering [11,12] and nonlinear preference modeling [13,14] since they do not affect the preference model or dataset itself.

The second contribution of this work is to show how features may be used in the design process. We show that feature interpretation and feature visualization offer designers additional tools for augmenting design decisions. First, we interpret the most influential pairings of vehicle features and customer features to the preference task, and contrast this with the same analysis using the original variable representation. Second, we visualize the theoretically optimal vehicle for a given customer within the learned feature representation and show how this optimal vehicle, which does not exist, may be used to suggest design improvements upon current models of vehicles that do exist in the market.

Methodological contributions include being the first to use recent feature learning methods on heterogeneous design and marketing data. Recent feature learning research has focused on homogeneous data, in which all variables are real-valued numbers such as pixel values for image recognition [8,15]; in contrast, we explicitly model the heterogeneous distribution of the input variables, for example, “age” being a real-valued variable and “General Motors” being a categorical variable. Subsequently, we give a number of theoretical extensions: First, we use exponential family generalizations for the sparse RBMs, enabling explicit modeling of statistical distributions for heterogeneous data. Second, we derive theoretical bounds on the reconstruction error of the LSD feature learning method.

This paper is structured as follows: Section 2 discusses efforts to increase prediction accuracy by the design community, as well as feature learning advances in the machine learning community. Section 3 sets up the preference prediction task as a binary classification problem. Section 4 details three feature learning methods and their extension to suit heterogeneous design and market data. Section 5 details the experimental setup of the preference prediction task, followed by results showing improvement of preference prediction accuracy. Section 6 details how features may be used to inform design decisions through feature interpretation and feature visualization. Section 7 concludes this work.

Design preference modeling has been investigated in design for market systems, where quantitative engineering and marketing models are linked to improve enterprise-wide decision making [16–18]. In such frameworks, the design preference model is used to aggregate input across multiple stakeholders, with special importance on the eventual customer within the targeted market segment [19].

These design preference models have been shown to be especially useful for the design of passenger vehicles, as demonstrated across a variety of applications such as engine design [20], vehicle packaging [21], brand recognition [22], and vehicle styling [3,6,23]. Connecting many of these research efforts is the desire for improved prediction accuracy of the underlying design preference model. With increased prediction accuracy, measured using “held out” portions of the data, greater confidence may be placed in the fidelity of the resulting design conclusions.

Efforts to improve prediction accuracy involve (i) developing more complex statistical models to capture the heterogeneous and stochastic nature of customer preferences; examples include mixed and nested logit models [1,2], consideration sets [24], and kernel-based methods [13,14,25], and (ii) creating adaptive questionnaires to obtain stated information more efficiently using a variety of active learning methods [26,27].

This work is different from (i) above in that the set of features learned is agnostic of the particular preference model used. One can just as easily switch out the *l*^{2} logit design preference model used in this paper for another model, whether it be mixed logit or a kernel machine. This work is also different from (ii) in that we are working with a set of revealed data on actual vehicle purchases, rather than eliciting this data through a survey. Accordingly, this work is among recent efforts toward data-driven approaches in design [28], including design analytics [29] and design informatics [30], in that we are directly using data to augment existing modeling techniques and ultimately suggest actionable design decisions.

Feature learning methods capture statistical dependencies implicit in the original variables by “encoding” the original variables in a new feature representation. This representation keeps the number of data the same while changing the length of each data point from *M* variables to *K* features. The idea is to minimize an objective function defining the reconstruction error between the original variables and their new feature representation. If this representation is more meaningful for the discriminative design preference prediction task, we can use the same supervised model (e.g., logit model) as before to achieve higher predictive performance. More details are given in Sec. 4.

The first feature learning method we examined is PCA. While not conventionally referred to as a feature learning method, PCA is chosen for its ubiquitous use and its qualitative difference from the other two methods. In particular, PCA makes the strong assumption that the data are Gaussian noise distributed around a linear subspace of the original variables, with the goal of learning the eigenvectors spanning this subspace [31]. The features in our case are the coefficients of the original variables when projected onto this subspace or, equivalently, the inner product with the learned eigenvectors.

The second feature learning method is LSD. This method is chosen as it defines the features implicitly within the preference model. In particular, LSD decomposes the “part-worth” coefficients contained in the design preference model (e.g., conjoint analysis or discrete choice analysis) into a low-rank matrix plus a sparse matrix. This additive decomposition is motivated by results from the marketing literature suggesting that certain purchase consideration is linearly additive [32], and thus well captured by decomposed matrices [33]. An additional motivation for a linear decomposition model is the desire for interpretability [34]. Predictive consumer marketing oftentimes uses these learned coefficients to work hand-in-hand with engineering design to generate competitive products or services [35]. Such advantages are bolstered by separation of factors captured by matrix decomposition, as separation may lead to better capture of heterogeneity among market segments [36]. Readers are referred to Ref. [37] for further in-depth discussion.

The third feature learning method is the exponential family sparse RBM [9,38]. This method is chosen as it explicitly represents the features, in contrast with the LSD. The method is a special case of a Boltzmann machine, an undirected graph model in which the energy associated within an energy state space defines the probability of finding the system in that state [9]. In the RBM, each state is determined by both visible and hidden nodes, where each node corresponds to a random variable. The visible nodes are the original variables, while the hidden nodes are the feature representation. The “restricted” portion of the RBM refers to the restriction on visible–visible connections and hidden–hidden connections, later detailed and depicted in Sec. 4 and Fig. 4, respectively.

All three feature learning methods are considered “simple” in that they are single-layer models. The aforementioned results in 1D waveform speech recognition and 2D image object recognition have been achieved using hierarchical models, built by stacking multiple single-layer models. We chose single-layer feature learning methods here as an initial effort and to explore parameter settings more easily; as earlier noted, there is limited work on feature learning methods for heterogeneous data (e.g., categorical variables) and most advances are currently only on homogeneous data (e.g., real-valued 2D image pixels).

We cast the task of predicting a customer's design preferences as a binary classification problem: Given customer *j*, represented by a vector of heterogeneous customer variables $xc(j)$, as well as two passenger vehicle designs *p* and *q*, each represented by a vector of heterogeneous vehicle design variables $xd(p)$ and $xd(q)$, which passenger vehicle will the customer purchase? We use a real dataset of customers and their passenger vehicle purchase decisions as detailed below [39].

The data used in this work combine the Maritz vehicle purchase survey from 2006 [39], the Chrome vehicle variable database [40], and the 2006 estimated U.S. state income and living cost data from the U.S. Census Bureau [41] to create a dataset with both customer and passenger vehicle variables. These combined data result in a matrix of purchase records, with each row corresponding to a separate customer and purchased vehicle pair, and each column corresponding to a variable describing the customer (e.g., age, gender, and income) or the purchased vehicle (e.g., no. of cylinders, length, and curb weight).

From this original dataset, we focus only on the customer group who bought passenger vehicles of size classes between minicompact and large vehicles, thus excluding data for station wagons, trucks, minivans, and utility vehicles. In addition, purchase data for customers who did not consider other vehicles before their purchases were removed, as well data for customers who purchased vehicles for another party.

The resulting database contained 209 unique passenger vehicle models bought by 5582 unique customers. The full list of customer variables and passenger vehicle variables can be found in Tables 1 and 2. The variables in these tables are grouped into three unit types: real, binary, and categorical, based on the nature of the variables.

We converted the dataset of 5582 passenger vehicle purchases into a binary choice set by generating all pairwise comparisons between the purchased vehicle and the other 208 vehicles in the dataset for all 5582 customers. This resulted in *N* = 1,161,056 data points, where each datum indexed by *n* consisted of a triplet (*j*, *p*, *q*) of a customer indexed by *j* and two passenger vehicles indexed by *p* and *q*, as well as a corresponding indicator variable $y(n)\u2208{0,1}$ describing which of the two vehicles was purchased.

These full data were then randomly shuffled and split into training, validation, and testing sets. As previous studies have shown the impact on prediction performance given different generations of choice sets [42], we created ten random shufflings and subsequent data splits of our dataset, and run the design preference prediction experimental procedure of Sec. 5 on each one independently. This work is therefore complementary to studies on developing appropriate choice set generation schemes such as Ref. [43]. Full details into the data processing procedure are given in Sec. 5.

We adopt the conventions of utility theory for the measure of customer preference over a given product [44]. Formally, each data point consists of a pairwise comparison between vehicles *p* and *q* for customer *j*, with corresponding customer variables $xc(j)$ for $j\u2208{1,\u2026,5582}$ and original variables of the two vehicle designs, $xd(p)$ and $xd(q)$ for $p,q\u2208{1,\u2026,209}$. We assume a bilinear utility model for customer *j* and vehicle *p*

where ⊗ is an outer product for vectors, $vec(\xb7)$ is vectorization of a matrix, $[\xb7,\xb7]$ is concatenation of vectors, and *ω* is the part-worth vector.

The preference model refers to the assumed relationship between the bilinear utility model described in Sec. 3.3 and a label indicating which of the two vehicles the customer actually purchased. While the choice of preference model is not the focus of this paper, we pilot tested popularly used models including *l*^{1} and *l*^{2} logit model, Naïve Bayes, *l*^{1} and *l*^{2} linear as well as kernelized support vector machine, and random forests.

Based on these pilot results, we chose the *l*^{2} logit model due to its widespread use in the design and marketing communities [37,45]; in particular, we used the primal form of the logit model. Equation (2) captures how the logit model describes the probabilistic relationship between customer *j*'s preference for either vehicle *p* or vehicle *q* as a function of their associated utilities given by Eq. (1). Note that *ϵ* are Gumbel-distributed random variables accounting for noise over the underlying utility of the customer *j*'s preference for either vehicle *p* or vehicle *q*

We estimate the parameters of the logit model in Eq. (2) using conventional convex loss function minimization using the log-loss regularized with the *l*^{2} norm

where $y(n)=y(jpq)$ is 1 if customer *j* chose vehicle *p* to purchase and 0 if vehicle *q* was purchased, and *α* is the *l*^{2} regularization hyperparameter. The optimization algorithm used to minimize this regularized loss function was stochastic gradient descent, with details of hyperparameter settings given in Sec. 5.

PCA maps the original data representation $x=[x1,x2,\u2026,xM]T\u2208\mathbb{R}M\xd71$ to a new feature representation $h=[h1,h2,\u2026,hK]T\u2208\mathbb{R}K\xd71,\u2009K\u2264M$, with an orthogonal transformation $W\u2208\mathbb{R}M\xd7K$ (Fig. 2). Assume that the original data representation **x** has zero empirical mean (otherwise, we simply subtract the empirical mean from **x**). The mapping is given by

The PCA representation has the following properties: (1) *h*_{1} has the largest variance, and the variance of *h _{i}* is not smaller than the variance of

*h*for all

_{j}*j*<

*i*, (2) the columns of

**W**are orthogonal unit vectors, and (3)

**h**and

**W**minimize the reconstruction error

*ε*

When the *q* columns of **W** consist of the first *q* eigenvectors of $xTx$, the above properties are all satisfied, and the PCA feature representation can be calculated by Eq. (4). Since PCA is a projection onto a subspace, the features **h** in this case are not “higher order” functions of the original variables, but rather a linear mapping from original variables to a strictly smaller number of linear coefficients over the eigenvectors.

The utility model *U _{rp}* given in Eq. (1) can be rewritten into matrix form, in which Ω is a matrix reshaped from the part-worth coefficients vector

*ω*

The decomposition of the original part-worth coefficients into a low-rank matrix and a sparse matrix may better represent customer purchase decisions than the large coefficient matrix of all pairwise interactions given in Eq. (1) and as detailed in Sec. 2. Accordingly, we decompose Ω into a low-rank matrix **L** of rank *r* superimposed with a sparse matrix **S**, i.e., $\Omega =L+S$ (Fig. 3). This problem may be solved in the general case exactly with the following optimization problem:

where $Xu$ and $Xc$ are the full set of customer and vehicle data, **y** is the vector of whether customer *j* chose vehicle *p* or vehicle *q*, $l(\xb7)$ is the log-loss without the *l*^{2} norm

and $C$ is a convex set corresponding to the sparse matrix **S**. As this problem is intractable (NP-hard), we instead learn this decomposition of matrices using an approximation obtained via regularized loss function minimization

where $||\xb7||*$ is the nuclear norm to promote low-rank structure, and $||\xb7||1$ is the *l*_{1} norm.

In particular, while a number of low-rank regularizations may be used to solve Eq. (9), e.g., trace norm and log-determinant norm [46]. We choose the nuclear norm as it may be applied to any general matrix, while the trace norm and log-determinant regularization are limited to positive semidefinite matrices. Moreover, the nuclear norm is often considered optimal as $||L||*$ is the convex envelop of $Rank(L)$, implying that $||L||*$ is the largest convex function smaller than $Rank(L)$ [46].

The nondifferentiability of the convex low-rank + sparse approximation given in Eq. (9) necessitates optimizations techniques such as augmented Lagrangian [47], semidefinite programing [48], and proximal methods [49]. Due to theoretical guarantees on convergence, we choose to train our model using proximal methods which are defined as follows.

Definition 2. *Let *$f:\mathbb{R}n\u2192\mathbb{R}\u222a{+\u221e}$* be a closed proper convex function. The proximal operator of f is defined as*

With these preliminaries, we now detail the proximal gradient algorithm used to solve Eq. (9) using low-rank and *l*^{1} proximal operators. Denote $f(\xb7)=||\xb7||*$, and its proximal operator as prox* _{f}*. Similarly, denote the proximal operator for the

*l*

^{1}regularization term by prox

*, $i=1,\u2026n$. Details of calculating prox*

_{S}*and prox*

_{f}*may be found in the Appendix.*

_{S}Lemma 1. *Convergence Property*. *When *$\u2207l$* is Lipschitz continuous with constant ρ, this method can be shown to converge with rate *$O(1/k)$* when a fixed step size *$\eta t=\eta \u2208(0,1/\rho ]$* is used. If ρ is not known, the step sizes η _{t} can be found by a line search; that is, their values are chosen in each step*.

^{Algorithm 1}Low-Rank + Sparse Matrix Decomposition

**Input:**Data $Xc,\u2009Xd$,

**y**Initialize $L0=0,S0=0$

**repeat**$Lt+1=proxf(Lt\u2212\eta t\u2207Ltl(L,S;Xc,Xd,y))$ $St+1=proxS(St\u2212\eta t\u2207Stl(L,S;Xc,Xd,y))$

**until**$Lt,\u2009Sit$ are converged

We additionally prove a variational bound that guarantees this parameter estimation method converges to a unique solution with bounded error as given by the following theorem.

*where *$L*$* is the optima of problem (9) and *$L0$* is the matrix minimizing the loss function *$l(\xb7)$.

The RBM is an energy-based model in which an energy state is defined by a layer of *M* visible nodes corresponding to the original variables **x** and a layer of *K* features denoted as **h**. The energy for a given pair of original variables and features determines the probability associated with finding the system in that state; like nature, systems tend to states that minimize their energy and thus maximize their probability. Accordingly, maximizing the likelihood of the observed data $x(1)\u2026x(N)\u2208\mathbb{R}M$ and its corresponding feature representation $h(1)\u2026h(N)\u2208\mathbb{R}K$ is a matter of finding the set of parameters that minimize the energy for all observed data.

While traditionally this likelihood consists of binary variables and binary features, as described in Tables 1 and 2, our passenger vehicle purchase dataset consists of *M _{G}* Gaussian variables,

*M*binary variables, and

_{B}*M*categorical variables. We accordingly define three corresponding energy functions

_{C}*E*,

_{G}*E*, and

_{B}*E*, in which each energy function connects the original variables and features via a weight matrix

_{C}**W**, as well as biases for each original variable and feature,

**a**and

**b**, respectively.

Real-valued random variables (e.g., vehicle curb weight) are modeled using the Gaussian density. The energy function for Gaussian inputs and binary hidden nodes is

where the variance term is clamped to unity under the assumption that the input data are standardized.

Binary random variables (e.g., gender) are modeled using the Bernoulli density. The energy function for Bernoulli nodes in both the input layer and hidden layer is

Categorical random variables (e.g., vehicle manufacturer) are modeled using the categorical density. The energy function for categorical inputs with *Z _{m}* classes for

*m*th categorical input variable (e.g., Toyota, General Motors, etc.) is given by

Given these energy functions for the heterogeneous original variables, the probability of a state with energy $E(x,h;\theta )=EG(x,h;\theta )+EB(x,h;\theta )+EC(x,h;\theta )$, in which $\theta ={W,a,b}$ are the energy function weights and bias parameters, is defined by the Boltzmann distribution

The “restriction” on the RBM is to disallow visible–visible and hidden–hidden node connections. This restriction results in conditional independence of each individual hidden unit *h* given the vector of inputs **x**, and each visible unit *x* given the vector of hidden units **h**

The conditional density for a single binary hidden unit given the combined *K _{G}* Gaussian,

*K*binary, and

_{B}*K*categorical input variables is then

_{C}For an input data point $x(n)$, its corresponding feature representation $h(n)$ is given by sampling the “activations” of the hidden nodes

To train the model, we optimize the weight and bias parameters $\theta ={W,b,a}$ by minimizing the negative log-likelihood of the data ${x(1)\u2026x(N)}$ using gradient descent. The gradient of the log-likelihood is

The gradient is the difference of two expectations: the first of which is easy to compute since it is “clamped” at the input datum **x**, but the second of which requires the joint density over the entire **x** space for the model.

In practice, this second expectation is approximated using the contrastive divergence algorithm by Gibbs, sampling the hidden nodes given the visible nodes, then the visible nodes given the hidden nodes, and iterating a sufficient number of steps for the approximation [50]. During training, we induce sparsity of the hidden layer by setting a target activation *β _{k}*, fixed to 0.1, for each hidden unit

*h*[38]. The overall objective to be minimized is then the negative log-likelihood from Eq. (18) and a penalty on the deviation of the hidden layer from the target activation. Since the hidden layer is made up of sigmoid densities, the overall objective function is

_{k}The goal in this experiment was to assess how preference prediction accuracy changes when using the same preference model on three different representations of the same dataset. The preference model used, as discussed in Sec. 3.4 was the *l*^{2} logit, while the three representations were the original variables, low-rank + sparse features, and RBM features. The same experimental procedure was run on each of these three representations, where the first representation acts as a baseline for prediction accuracy, and the next two representations demonstrate the relative gain in preference prediction accuracy when using features.

In addition, we performed an analysis of how the hyperparameters affected design preference prediction accuracy for the hyperparameters used in the PCA, LSD, and RBM feature learning methods. For PCA, the hyperparameter was the dimensionality *K* of the subspace spanned by the eigenvectors of the PCA method. For LSD, the hyperparameters were the rank penalty *λ*_{1}, which affects the rank of the low-rank matrix *L*, and the sparsity penalty *λ*_{2}, which influences the number of nonzero elements in the sparse matrix *S*, both found in Eq. (9). For RBM, the hyperparameters were the sparsity penalty *λ*_{3}, which controls the number of features activated for a given input datum, and the overcompleteness factor *γ*, which defines by what factor the dimensionality of the feature space is larger than the dimensionality of the original variable space, both of which are found in Eq. (19).

^{(1)}The raw choice dataset of pairs of customers and purchased designs, described in Sec. 3.1, was randomly split ten times into 70% training, 10% validation, and 20% test sets. This was done in the beginning to ensure no customers in the training sets ever existed in the validation or test sets.^{(2)}Choice sets were generated for each training, validation, and test sets for all ten randomly shuffled splits as described in Sec. 3.2. This process created a training dataset of 832,000 data points, a validation dataset of 104,000 data points, and a testing dataset of 225,056 data points, for each of the ten shuffled splits.^{(3)}Feature learning was conducted on the training sets of customer variables and vehicle variables for a vector of five different values of*K*for PCA features, a grid of 25 different pairs of low-rank penalty*λ*_{1}and sparsity penalty*λ*_{2}for the LSD features, and a grid of 56 different pairs of sparsity*λ*_{3}and overcompleteness*γ*hyperparameters for RBM features. For PCA features, these hyperparameters were $K\u2208{30,50,70,100,150}$. For LSD features, these hyperparameters were $\lambda 1\u2208{0.005,0.01,0.05,0.1,0.5}$ and $\lambda 2\u2208{0.005,0.01,0.05,0.1,0.5}$. For RBM, these hyperparameters were $\lambda 3\u2208{4.0,5.0,6.0,7.0,8.0,9.0,10.0}$ and $\gamma \u2208{0.25,0.5,0.75,1.0,1.5,2.0,2.5,3.0}$. These hyperparameter settings were selected by pilot testing large ranges of parameter settings to find relevant regions for upper and lower hyperparameter bounds, with numbers of hyperparameters selected based on computational constraints.^{(4)}Each of the validation and testing datasets were encoded using the feature learning methods learned for each of the 5 PCA hyperparameters*K*, 25 $(\lambda 1,\lambda 2)$ LSD hyperparameter pairs, and 56 $(\lambda 3,\gamma )$ RBM hyperparameter pairs.^{(5)}The encoded feature data were combined with the original variable data in order to separate linear term effects of the original variables with higher order effects from the features. While this introduces a degree of information redundancy between features and original variables, the regularization term in Eq. (3) mitigates effects of collinearity. Each datum consists of the features concatenated with the original variables, then input into the bilinear utility model. Specifically, for some customer features $hu$ and customer variables $xu$, we used $hu\u2032T:=[xuT,huT]$ to define the new representation of the customer; likewise, for some vehicle features $hc$ and vehicle variables $xc$, we used $hc\u2032T:=[xcT,hcT]$ to define the new representation of the customer. Combined with Eq. (1), a single data point used for training is the difference in utilities between vehicle*p*and vehicle*q*for a given customer*r*Display Formula (20)$[hu\u2032(r)\u2297(hc\u2032(p)\u2212hc\u2032(q)),hc\u2032(p)\u2212hc\u2032(q)]$Note that the dimensionality of each datum could range above 100,000 dimensions for the largest values of*γ*.^{(6)}For each of these training sets, six logit models were trained in parallel over minibatches of the training data, corresponding to six different settings of the*l*^{2}regularization parameter $\alpha =0.00001,0.0001,0.001,0.01,0.1,1.0$. These logit models were optimized using stochastic gradient descent, with learning rates inversely related to the number of training examples seen [51].^{(7)}Each logit model was then scored according to its respective held-out validation dataset. The hyperparameter settings $(\alpha BASELINE)$ for the original variables, $(KPCA,\alpha PCA)$ for PCA feature learning, $(\lambda 1,\lambda 2)$ for LSD feature learning, and $(\lambda 3,\gamma ,\alpha RBM)$ for RBM feature learning with the best validation accuracy were saved. For each of these four sets of best hyperparameters, step (3) was repeated to obtain the set of corresponding features on each of the ten random shuffled training plus validation sets.^{(8)}Logit models corresponding to the baseline, PCA features, LSD features, and RBM features were retrained for each of the ten randomly shuffled and combined training and validation. The prediction accuracy for each of these ten logit models was assessed on the corresponding held-out test sets in order to give average and standard deviations of the design preference predictive accuracy for the baseline, PCA features, LSD features, and RBM features.

Table 3 shows the averaged test set prediction accuracy of the logit model using the original variables, PCA features, LSD features, and RBM features. Prediction accuracy averaged over ten random training and held-out testing data splits are given, both for the partial data *N* = 10,000 and the full data *N* = 1,161,056 cases. Furthermore, we include the standard deviation of the prediction accuracies and a two-sided *t*-test relative to the baseline accuracy for each feature representation.

The logit model trained with LSD features achieved the highest predictive accuracy on both the partial and full datasets, at 76.59% and 77.58%, respectively. This gives evidence that using features can improve design preference prediction accuracy as the logit model using the original variables achieved an averaged accuracy of 69.98% and 75.29%, respectively. The improvement in design preference prediction accuracy is greatest for the partial data case, as evidenced by both the LSD and RBM; yet, the improvement with the full data case shows that the LSD feature learning method is still able to improve prediction accuracy within the capacity of the logit model. The RBM results for the full data case do not show significant improvement in prediction accuracy. Finally, we note a relative loss in design preference prediction accuracy when using PCA as a feature learning method, both for the partial and full datasets, suggesting that the heavy assumptions built into PCA are overly restrictive.

The parameter settings for the LSD feature learning method give additional insight to the preference prediction task. In particular, the optimal settings of *λ*_{1} and *λ*_{2} obtained through cross validation on the ten random training sets were ranged from *r* = 29 to *r* = 31. This significantly reduced rank of the part-worth coefficient matrix given in Eq. (1) suggests that the vast majority of interactions between customer variables and design variables given in Tables 1 and 2 do not significantly contribute to overall design preferences. This insight allows us to introspect into important feature pairings on a per-customer basis to inform design decisions.

We have shown that even simple single-layer feature learning can significantly increase predictive accuracy for design preference modeling. This finding signifies that features more effectively capture the design preferences than the original variables, as features form functions of the original variables more representative of the customer's underlying preference task. This offers designers opportunity for new insights if these features can be successfully interpreted and translated to actionable design decisions; however, given the relatively recent advances in feature learning methods, interpretation and visualization of features remains an open challenge—see Sec. 6 for further discussion.

Further increases to prediction accuracy might be achieved by stacking multiple feature learning layers, often referred to as “deep learning.” Such techniques have recently shown impressive results by breaking previous records in image recognition by large margins [8]. Another possible direction for increasing prediction accuracy may be in developing novel architectures that explicitly capture the conditional statistical structure between customers and designs. These efforts may be further aided through better understanding of the limitations of using feature learning methods for design and marketing research. For example, the large number of parameters associated with feature learning methods results in greater computational cost when performing model selection; in addition to the cross-validation techniques used in this paper, model selection metrics such as Bayesian information criteria and Akaike information criteria may give further insight along these lines.

Using features can support the design process in at least two directions: (1) Features interpretation can offer deeper insights into customer preferences than the original variables, and (2) feature visualization can lead to a market segmentation with better clustering than with the original variables. These two directions are still open challenges given the relative nascence of feature learning methods. Further investigation is necessary to realize the above design opportunities and to justify the computational cost and implementation challenges associated with feature learning methods.

The interpretation and visualization methods may be used with conventional linear discrete choice modeling (e.g., logit models). However, deeper insights are possible through interpreting and visualizing features, assuming that features capture more effectively the underlying design preference prediction task of the customer as shown through improved prediction accuracy on held-out data. Since we are capturing “functions” of the original data, we are more likely to interpret and visualize feature pairings such as “eco-friendly” vehicle and “environmentally conscious” customer; such pairing may ultimately lead to actionable design decisions.

Similar to PCA, LSD provides an approach to interpret the learned features by looking at the linear combinations of original variables. The major difference between features learned using PCA versus LSD is their different linear combinations; in particular, features learned by LSD are more representative as they contain information from both the data distribution and the preference task, while PCA features only contain information from the data distribution.

As introduced in Sec. 4.2, the weight matrix Ω is decomposed into a low-rank matrix **L** and a sparse matrix **S**, i.e., $\Omega =L+S$. The nonzero elements in the sparse matrix **S** may be interpreted as the weight of the product of its corresponding original design variables and customer variables. As for the low-rank matrix **L**, the features can be extracted by linearly combining the original variable according to the singular value decomposition (SVD) for **L**. The SVD is a factorization of the $(m+1)\xd7n$ matrix **L** in the form $L=U\Sigma V$, where **U** is a $(m+1)\xd7(m+1)$ unitary matrix, Σ is an *m* × *n* rectangular diagonal matrix with non-negative real numbers $\sigma 1,\sigma 2,\u2026,\sigma min(m+1,n)$ on the diagonal, and **V** is a $(n)\xd7(n)$ unitary matrix. Rewriting Eq. (6) yields

where $ui$ is the *i*th column of matrix *U*, and $vi$ is the *i*th row of matrix *V*. The *i*th user feature $[(xc(j))T,1]ui$ is a linear combination of original user variables; the *i*th design feature $vixdp$ is a linear combination of original design variables; and *σ _{i}* represents the importance of this pair of features for the customer's design preferences.

Interpreting these features in the vehicle preference case study, we found that the most influential feature pairing (i.e., largest *σ _{i}*) corresponds to preference trends at the population level: Low price but luxury vehicles are preferred, and Japanese vehicles receive the highest preference while GM vehicles receive the lowest preference. The second most influential feature pairing represents a rich customer group, with preferred vehicle groups being both expensive and luxurious. The third most influential feature pairing represents an elder user group, with their preferred vehicles as large but with low net horsepower.

We now visualize features to understand what insights for design decision making. Specifically, we make early stage inroad to visual market segmentation performed in an estimated feature space, thus clustering customers in a representation that better captures their underlying design preference decisions.

We begin by looking at the utility model *U _{rp}* given in Eq. (1) and note that the inner product between Ω and the variables $xu(r)$ representing customer

*r*may be interpreted as customer

*r*'s optimal vehicle, denoted $xopt(r)$

where Ω_{out} is the matrix reshaped from the coefficients of Ω corresponding to the outer product given in Eq. (1), Ω_{main} is the matrix reshaped from the remaining coefficients, and $1$ is a vector consisting of 1's with the same dimension as $xu(r)$. We rewrite the utility model *U _{rp}* given in Eq. (1) in terms of the optimal vehicle $xopt(r)$

According to the geometric meaning of inner product, the smaller the angle between $xdp$ and $xopt(r)$ is, the larger will be the utility *U _{rp}*. In this way, we have an interpretable method of improving upon the actual purchased vehicle design in the form of an “optimal” vehicle vector. This optimal vehicle vector could be useful for a manufacturer developing a next-generation design from a current design, particularly as the manufacturer would target a specific market segment.

We now provide a visual demonstration of using an optimal vehicle derived from feature learning to suggest a design improvement direction. First, we calculate the optimal vehicle using Eq. (22) for every customer in the dataset. Then, we visualize these optimal vehicle points by reducing their dimension using *t*-distributed stochastic neighbor embedding, an advanced nonlinear dimension reduction technique that embeds similar objects into nearby points [52]. Finally, optimal vehicles from targeted market segments are marked in larger red points.

Figure 6 shows the optimal vehicles for the SCI-XA, MAZDA6, ACURA-TL, and INFINM35 customer groups using red points, respectively. We observe that the optimal vehicle moves from the left-top corner to the right-bottom corner as the purchased vehicles become more luxurious using the LSD features, while the optimal vehicles in the original variable representation show overlap, especially for MAZDA6 and ACURA-TL customers. In other words, we are visualizing what has been shown quantitatively through increased preference prediction accuracy; namely, the optimal vehicles represented using LSD features as opposed to the original variables result in a larger separation of various market segments' optimal vehicles.

The contribution of this demonstration is not the particular introspection on the chosen example with MAZDA6 and ACURA-TL customers. Instead, this demonstration is significant as it suggests that it is possible to perform feature-based market segmentation purely using visual analysis. Such visual analysis is likely to be more useful to practicing designers and marketers, as it abstracts away the underlying mathematical mechanics of feature learning.

Feature learning is a promising method to improve design preference prediction accuracy without changing the design preference model or the dataset. This improvement is obtained by transforming the original variables to a feature space acting as an intermediate step as shown in Fig. 1. Thus, feature learning complements advances in both data gathering and design preference modeling.

We presented three feature learning methods—PCA, LSD, and sparse exponential family RBMs—and applied them to a design preference dataset consisting of customer and passenger vehicle variables with heterogeneous unit types, e.g., gender, age, and no. of cylinders.

We then conducted an experiment to measure design preference prediction accuracy involving 1,161,056 data points generated from a real purchase dataset of 5582 customers. The experiment showed that feature learning methods improve preference prediction accuracy by 2–7% for a small and full dataset, respectively. This finding is significant, as it shows that features offer a better representation of the customer's underlying design preferences than the original variables. Moreover, the finding shows that feature learning methods may be successfully applied to design and marketing datasets made up of variables with heterogeneous data types; this is a new result as feature learning methods have primarily been applied on homogeneous datasets made up of variables of the same distribution.

Feature interpretation and visualization offer a promise for using features to support the design process. Specifically, interpreting features can give designers deeper insights of the more influential pairings of vehicle features and customer features, while visualization of the feature space can offer deeper insights when performing market segmentation. These new findings suggest opportunities to develop feature learning algorithms that are not only more representative of the customer preference task as measured by prediction accuracy but also easier to interpret and visualize by a domain expert. Methods allowing easier interpretation of features would be valuable when translating the results of more sophisticated feature learning and preference prediction models into actionable design decisions.

An earlier conference version of this work appeared at the 2014 International Design Engineering Technical Conference. This work has been supported by the National Science Foundation under Grant No. CMMI-1266184. This support is gratefully acknowledged. The authors would like to thank Bart Frischknecht and Kevin Bolon for their assistance in coordinating datasets, Clayton Scott for useful suggestions, and Maritz Research, Inc., for generously making use of their data possible.

*a*,_{k}*b*=_{m}bias parameter for RBM units

- $D(\xb7)$ =
singular value threshold shrinkage operator

- $E[\xb7]$ =
expectation

- $EG,EB,EC$ =
energy function for Gaussian, binary, and categorical variables

**h**=features

*j*=index for arbitrary customer

*K*=dimension of features

**L**=low-rank matrix

*M*=dimension of original variables

- $MG,MB,MC$ =
dimension of Gaussian, binary, and categorical original variables

*N*=number of data points

*n*,*m*,*k*=indices for

*N*,*M*,*K*- $P(\xb7)$ =
probability

*p*,*q*=indices for arbitrary design pair

*r*=rank of low-rank matrix

**S**=sparse matrix

*s*=_{i}singular value

- $Ss(\xb7)$ =
soft-threshold operator

*t*=step index for proximal gradient

*T*=transpose

- $u,v$ =
arbitrary vectors (used for proof)

- $U,\Sigma ,V$ =
matrices of SVD decomposition

*w*=_{mk}weight parameter for RBM

- $xc$ =
customer

- $xd$ =
design

- $Xc$ =
all customers

- $Xd$ =
all designs

**y**=all purchase design indicators

- $y(jpq)$ =
indicator variable of purchased design

*Z*=_{m}dimension of categorical variable

*α*=*l*^{2}norm regularization parameter*β*=sparsity activation target for RBM

*δ*=delta function

*γ*=feature/original variable size ratio

*η*=step size for proximal gradient

*ε*=_{jp}gumbel random variable

- $\omega ,\Omega $ =
part-worth coefficients of preference model: vector, matrix

*λ*_{1}=nuclear norm regularization parameter

*λ*_{2}=sparsity regularization parameter (LSD)

*λ*_{3}=sparsity regularization parameter (RBM)

*σ*=sigmoid function

*θ*=parameters (generic)

- $\Vert \xb7\Vert $ =
*l*^{2}norm - $\Vert \xb7\Vert 1$ =
*l*^{1}norm - $\Vert \xb7\Vert \u2009*$ =
nuclear norm

- $proxf(\xb7)$ =
proximal operator for

*f* - $[\xb7,\xb7]$ =
vector concatenation

- ⊗ =
outer product

Though the low-rank matrix is estimated jointly with the decomposed matrices as well as the loss function, an accurate estimation of the low-rank matrix can still be achieved as guaranteed by the bound as in this section. We subsequently provide a variational bound of the divergence of the estimated likelihood from the true likelihood.

Proposition 1. *The proximal operator for the nuclear norm *$f=\lambda 1||\xb7||*$* is the singular value shrinkage operator *$D\lambda 1$.

Consider the SVD of a matrix $X\u2208\mathbb{R}m\xd7n$ with rank *r*

where the soft-thresholding operator $S\lambda 1(\Sigma )=diag({max(si\u2212\lambda 1,0)}i=1,\u2026,min(m,n))$. Moreover, $S\lambda 2(\xb7)$ is also the proximal operator for the *l*^{1} norm.

Proposition 2. *Separable Sum Property. If f is separable across two variables x and y*, *i.e.,*$f(x,y)=f1(x)+f2(y)$, *then*

Our proof proceeds as follows. Let us denote the optima of problem (9) as $L*$, the gradient of the loss function $l(\xb7)$ with respect to $L*$ as $\u2207L*l$, and the matrix minimizing the loss function $l(\xb7)$ as $L0$.

We next prove the following theorems: Theorem 2 provides a tight bound on $\u2207L*l$. Corollary 1 bounds the estimation error for the learned matrix $L*$. Theorem 3 follows by bounding the divergence of likelihood from the true data distribution where $l(\xb7)$ is a likelihood function.

First, we make the weak assumption that the optimization problem given in Eq. (9) is strictly convex, since a necessary and sufficient condition is that the saddle points for $l(\xb7)$ and the regularization terms are not overlapping.

*Proof*. Under the strictly convex assumption, the stationary point (i.e., the optima $L*$ for the optimization problem (9)) is unique. By Lemma 1, iterations of the proximal gradient optimization method $Lk$ converge to this optima $L*$. According to the fixed point equation for **L** (Algorithm 1), we have

Denote $L*\u2212\eta \u2207L*l$ as **M**, representing the argument of the proximal operator at the optimal low-rank estimation. The SVD for $L*$, **M**, and $proxf(M)$ yields

where **U**, $Uprox\u2208\mathbb{R}m\xd7r$; $VT,\u2009VproxT\u2208\mathbb{R}r\xd7n$; and $\Sigma ,\Sigma prox\u2208\mathbb{R}r\xd7r$ with $\Sigma =diag({si}i=1,...,r)\u2009and\u2009\u2009\Sigma prox=diag({siprox}i=1,...,r)$. $UM\u2208\mathbb{R}m\xd7m,\u2009VMT\u2208\mathbb{R}n\xd7n$, and $\Sigma M$ is an *m* × *n* rectangular diagonal matrix.

Without loss of generality, assume that $s1>s2>\u2026>sr>0$, i.e., these singular values are distinct and positive, thus ensuring column orderings are unique. Thus, we may assert that $U=Uprox$, $V=Vprox$, and $\Sigma =\Sigma prox$ due to the uniqueness of SVD for distinct singular values in $L*=proxf(M)$.

Note that the dimensionality of $proxf(M)$ is less than that of the value of *M*. To bridge the gap between them, we define diagonal submatrices $\Sigma +M$ and $\Sigma \u2212M$. (In other words, we partition $\Sigma M$ into two submatrices $\Sigma +M$ and $\Sigma \u2212M$.) For all singular values $siM$ of **M**, $i=1,2,\u2026,min(m,n)$, if $siM\u2212\eta \u22650$, then $siM$ is a diagonal element of the submatrix $\Sigma +M$; otherwise, $siM$ is a diagonal element of the submatrix $\Sigma \u2212M$. Hence, $max(\Sigma +M\u2212\eta I,0)=\Sigma +M\u2212\eta I$ and $max(\Sigma \u2212M\u2212\eta I,0)=0$

where $UM+$$(VM+)$ are left-singular (right-singular) vectors corresponding to $\Sigma +M;\u2009UM\u2212$ and $VM\u2212$ are also defined, respectively. Again, due to the uniqueness of SVD, we have $UM+=U$ and $VM+=V$.

We now rewrite the SVD formula for $prox(M)$ and **M** as

Equations (A3) and (A8) indicate that

By Eqs. (A9), (A10), and (A12), we have

Note that every diagonal element $s\u2212iM$ in $\Sigma \u2212M$ satisfies $0\u2009\u2212iM\u2264\eta $. Hence

where $U:i$ or $V:i$ is the *i*th column in matrix **U** or **V**, and $[UM\u2212]:j$ or $[VM\u2212]:j$ is the *j*th column in matrix $UM\u2212$ or $VM\u2212$.

Summarizing the proof of Theorem 2, the gradient of the loss function at the estimated low-rank matrix is bounded by a unit ball within the original problem space that has radius of the low-rank regularization parameter *λ*_{1}. The relaxation of the bound partially comes from the second term in inequality (A14). This implies that the bound is tighter if the rank of $L*$ is increased.

Based on the gradient bound given in Theorem 2, we now bound the estimation error of the learned low-rank matrix $L*$. Although the value of the bound is not explicit in this proof, in some cases we are able to explicitly calculate its value.

Corollary 1. *Learned Low-Rank Matrix Estimation Error. The error *$||L*\u2212L0||2$* is bounded by the diameter of minimum-sized ball that includes the following set:*

Since the loss function $l(\xb7)$ is convex, the Euclidean norm of its gradient $\u2207Ll$ is nondecreasing as the Euclidean distance $||L\u2212L0||2$ is increasing.

When the loss function is sharp around its minima, then ${L:||\u2207Ll||2\u2264min(m,n)}$ is a small region which implies that $L*$ is a good estimation of $L0$.

We next bound the likelihood divergence when the loss function $l(\xb7)$ is a likelihood function. To do this, we use Theorem 2 and Corollary 1 to construct a variational bound.

*Proof*. By the Lagrangian mean value theorem, there exists $L1\u2208{L:Lij\u2208[Lij*,L0ij]}$ such that

where $\u2329A,B\u232a$ denotes inner product of $vec(A)$ and $vec(B)$, in which $vec(\xb7)$ is the matrix vectorization operator.

Because of the convexity of $l(\xb7),\u2009||\u2207L1l||2\u2264||\u2207L*l||2$. By Theorem 2

Summarizing the proof of Theorem 3, the variational bound of the estimated likelihood depends on both the bound of gradient of the likelihood function $l(\xb7)$ given in Theorem 2 and the property of the likelihood function in the neighborhood of its optima $L0$ as described in Corollary 1.

## References

## Figures

The concept of LSD using an example “part-worth coefficients” matrix of size 10 × 10 decomposed into two 10 × 10 matrices with low rank or sparse structure. Lighter colors represent larger values of elements in each decomposed matrix.

The concept of principle component analysis shown using an example with a data point represented by three original variables **x** projected to a two-dimensional subspace spanned by **w** to obtain features **h**

The concept of feature learning as an intermediate mapping between variables and a preference model. The diagram on top depicts conventional design preference modeling (e.g., conjoint analysis) where an inferred preference model discriminates between alternative design choices for a given customer. The diagram on bottom depicts the use of features as an intermediate modeling task.

The concept of the exponential family sparse RBM. The original data are represented by nodes in the visible layer by [x1,x2], while the feature representation of the same data is represented by nodes in the hidden layer [h1,h2,h3,h4]. Undirected edges are restricted to being only between the original layer and the hidden layer, thus enforcing conditional independence between nodes in the same layer.

Optimal vehicle distribution visualization. Every point represents the optimal vehicle for one consumer. In the left column, the optimal vehicle is inferred using the utility model with original variables. In the right column, LSD features are used to infer the optimal vehicle. In the first row, the optimal vehicles from SCI-XA customers are marked in big red points. Similarly, the optimal vehicles from MAZDA6, ACURA-TL, and INFINM35 customers are marked in big red points, respectively.

## Tables

Customer variable | Type | Customer variable | Type |
---|---|---|---|

Age | Real | U.S. state cost of living | Real |

Number of house members | Real | Gender | Binary |

Number of small children | Real | Income bracket | Categorical |

Number of med. children | Real | House region | Categorical |

Number of large children | Real | Education level | Categorical |

Number of children | Real | U.S. state | Categorical |

U.S. state average income | Real |

Design variable | Type | Design variable | Type |
---|---|---|---|

Invoice | Real | AWD/4WD | Binary |

MSRP | Real | Automatic transmission | Binary |

Curb weight | Real | Turbocharger | Binary |

Horsepower | Real | Supercharger | Binary |

MPG (combined) | Real | Hybrid | Binary |

Length | Real | Luxury | Binary |

Width | Real | Vehicle class | Categorical |

Height | Real | Manufacturer | Categorical |

Wheelbase | Real | Passenger capacity | Categorical |

Final drive | Real | Engine size | Categorical |

Diesel | Binary |

Design preference model | Feature representation | Prediction accuracy (SD) (ρ-value) N = 10,000 | Prediction accuracy (SD) (ρ-value) N = 1,161,056 |
---|---|---|---|

Logit model | Original variables (no features) | 69.98% (1.82%) | 75.29% (0.98%) |

(N/A) | (N/A) | ||

Logit model | Principle component analysis | 61.69% (1.24%) | 62.03% (0.89%) |

(1.081 × 10^{−7}) | (8.22 × 10^{−10}) | ||

Logit model | Low-rank + sparse matrix decomposition | 76.59% (0.89%) | 77.58% (0.81%) |

(3.276 × 10^{−}^{8}) | (4.286 × 10^{−8}) | ||

Logit model | Exponential family sparse RBM | 74.99% (0.64%) | 75.15% (0.81%) |

(2.3 × 10^{−5}) | (0.136) |

## Errata

## Discussions

Some tools below are only available to our subscribers or users with an online account.