Introduction and the Research Problem
The paper “Customer segmentation in a large database of an online customized fashion business” (2015) by Pedro Quelhas Brito, Carlos Soares, Sergio Almeida, Ana Monte and Michel Byvoet sets out to explore how data mining (DM) techniques can be used to drive marketing approaches in highly customized industries, such as the (online) fashion industry.
The difficulty in highly customized industries is establishing clear patterns that identify customer preferences: because the products are so diverse, the whole process can be time-consuming.
The focus is on two DM approaches, clustering and subgroup discovery, implemented with the K-medoids and CN2-SD algorithms respectively. The goal is to establish to what extent these two approaches complement one another in solving a segmentation problem.
Furthermore, the paper has a practical, managerial aim: to find out how to adjust the manufacturing process and product design so that they better match customers’ preferences. Ultimately, the results can give insights into how to enhance communication efforts and boost sales.
Research Problem Solution
The fashion manufacturer Bivolino was chosen because its product (unique shirts) is highly customized. It operates an e-business and has to keep constant track of its customers’ preferences. Its production process is very complex, and overweight consumers are its main target. All of this reveals the need to segment the market accordingly.
The following types of variables were used as segmentation criteria in this research project:
Figure 1: Segmentation criteria used in the study
The algorithm used in the first study is the K-medoids clustering algorithm, which seeks to find k clusters in n objects by:
- identifying a representative object (medoid) for each cluster
- assigning each remaining object to the medoid to which it is most similar
- selecting the most representative object in each cluster as its new medoid and
- repeating the second and third steps until a stopping criterion is met.
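The four steps above can be sketched in Python as follows. This is a minimal illustration, not the implementation the authors used: the mismatch distance for categorical attributes, the sample orders, and the names `mismatch_distance`, `assign`, `k_medoids` and `initial_medoids` are all assumptions made for the example.

```python
import random


def mismatch_distance(a, b):
    """Count the attributes on which two orders differ (assumed distance)."""
    return sum(x != y for x, y in zip(a, b))


def assign(points, medoids, distance):
    """Step 2: label each object with the index of its closest medoid."""
    return [
        min(range(len(medoids)), key=lambda c: distance(p, points[medoids[c]]))
        for p in points
    ]


def k_medoids(points, k, distance, initial_medoids=None, max_iter=100, seed=0):
    """Return (indices of the medoids, cluster label of every object)."""
    rng = random.Random(seed)
    # Step 1: pick one representative object per cluster (random unless given).
    if initial_medoids is not None:
        medoids = list(initial_medoids)
    else:
        medoids = rng.sample(range(len(points)), k)
    for _ in range(max_iter):
        labels = assign(points, medoids, distance)
        # Step 3: in each cluster, the new medoid is the member with the
        # smallest total distance to all other members.
        new_medoids = []
        for c in range(k):
            members = [i for i, lab in enumerate(labels) if lab == c]
            if not members:
                # Can happen with duplicate objects; keep the old medoid.
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members,
                    key=lambda m: sum(distance(points[m], points[j]) for j in members))
            )
        # Step 4: stop once the medoids no longer change.
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(points, medoids, distance)


# Hypothetical shirt orders: (fit, fabric pattern, pocket).
orders = [
    ("slim", "white", "no_pocket"),
    ("slim", "white", "pocket"),
    ("slim", "blue", "no_pocket"),
    ("comfort", "check", "pocket"),
    ("comfort", "check", "no_pocket"),
    ("comfort", "stripe", "pocket"),
]
medoids, labels = k_medoids(orders, k=2, distance=mismatch_distance,
                            initial_medoids=[0, 3])
print(medoids, labels)  # [0, 3] [0, 0, 0, 1, 1, 1]
```

Note that k has to be supplied up front, and that the result depends on the initial medoids when they are drawn at random.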
The second study relies on the subgroup discovery DM technique, specifically the CN2-SD algorithm. Subgroup discovery aims to identify subgroups of the population in which the statistical distribution of the target variable differs notably from its global distribution.
The rules take the form R: Cond → Class value, where “Class value” is the value of the target variable and “Cond” is a combination of attribute-value conditions that describes the subgroup.
Subgroup discovery algorithms fall into three groups: 1) algorithms based on classification (such as EXPLORA or CN2-SD), 2) algorithms based on association (such as Apriori, Apriori-SD, SD-Map etc.) and 3) evolutionary algorithms (including SDIGA, MESDIF etc.).
The authors opted for the CN2-SD algorithm, with WRAcc as the quality measure used to rank the subgroups by their unusualness. WRAcc stands for Weighted Relative Accuracy, a heuristic that combines two components: 1) the generality of the rule in question and 2) the unusualness of the class distribution it covers.
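As a small illustration of the two components, WRAcc is commonly defined as p(Cond) · (p(Class | Cond) − p(Class)). The sketch below computes it from raw counts; the function name, argument names and the numbers in the example are hypothetical and are not taken from the paper.

```python
def wracc(n_total, n_cond, n_class, n_cond_and_class):
    """Weighted Relative Accuracy of a rule Cond -> Class, from counts.

    n_total          -- number of examples (orders)
    n_cond           -- examples covered by Cond
    n_class          -- examples with the target class value
    n_cond_and_class -- covered examples that also have the class value
    """
    if n_cond == 0:
        return 0.0  # a rule that covers nothing carries no information
    generality = n_cond / n_total                       # p(Cond)
    unusualness = n_cond_and_class / n_cond - n_class / n_total
    return generality * unusualness


# Hypothetical counts: the rule covers 200 of 1000 orders; 60% of the covered
# orders are in the class, against 30% of all orders.
print(wracc(1000, 200, 300, 120))
```

A rule scores highly only if it is both general (covers many orders) and unusual (its class proportion is far from the overall proportion); a rule whose class proportion equals the overall proportion scores zero regardless of its size.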
The CN2-SD algorithm
The CN2-SD algorithm is a variant of the CN2 algorithm. It produces rules of the form Cond → Class value, where “Cond” is a set of attributes together with their values and “Class value” is the value of the target variable. The CN2 algorithm, as applied to subgroup discovery, operates by:
- performing a beam search to identify a single rule that has high discriminative power in relation to the training data and
- carrying out a control procedure which repeats the first step until a satisfactory set of rules is obtained.
A rule has high discriminative power if it can be said to cover many examples belonging to a single class and only a few of those belonging to the other class.
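The beam search for a single rule can be sketched as below. This is a simplified illustration under stated assumptions, not the CN2-SD implementation: rules are conjunctions of attribute = value tests, candidates are scored with WRAcc, and the beam width, rule length limit, function names and sample data are all hypothetical. The control procedure that repeats the search is left out.

```python
def covers(rule, example):
    """A rule is a tuple of (attribute, value) tests that must all hold."""
    return all(example[attr] == value for attr, value in rule)


def rule_quality(rule, examples, target, target_value):
    """WRAcc of the rule `rule -> target == target_value` (assumed score)."""
    covered = [ex for ex in examples if covers(rule, ex)]
    if not covered:
        return 0.0
    p_cond = len(covered) / len(examples)
    p_class = sum(ex[target] == target_value for ex in examples) / len(examples)
    p_class_given_cond = sum(ex[target] == target_value for ex in covered) / len(covered)
    return p_cond * (p_class_given_cond - p_class)


def beam_search_rule(examples, target, target_value, beam_width=3, max_conditions=2):
    """Return the best (rule, score) found by extending rules one test at a time."""
    # All attribute = value tests seen in the data (values assumed to be strings).
    selectors = sorted({(a, v) for ex in examples for a, v in ex.items() if a != target})
    beam = [()]  # start from the empty rule, which covers every example
    best_rule, best_score = (), 0.0
    for _ in range(max_conditions):
        candidates = set()
        for rule in beam:
            used = {attr for attr, _ in rule}
            for selector in selectors:
                if selector[0] not in used:
                    candidates.add(tuple(sorted(rule + (selector,))))
        if not candidates:
            break
        ranked = sorted(
            candidates,
            key=lambda r: (-rule_quality(r, examples, target, target_value), r),
        )
        beam = ranked[:beam_width]  # keep only the most promising rules
        top_score = rule_quality(beam[0], examples, target, target_value)
        if top_score > best_score:
            best_rule, best_score = beam[0], top_score
    return best_rule, best_score


# Hypothetical orders with a BMI class as the target variable.
sample = [
    {"fit": "regular", "country": "UK", "bmi": "overweight"},
    {"fit": "regular", "country": "UK", "bmi": "overweight"},
    {"fit": "regular", "country": "BE", "bmi": "overweight"},
    {"fit": "slim", "country": "UK", "bmi": "normal"},
    {"fit": "slim", "country": "BE", "bmi": "normal"},
    {"fit": "slim", "country": "BE", "bmi": "normal"},
]
print(beam_search_rule(sample, "bmi", "overweight"))
```

On this toy data the rule "fit = regular" covers all three overweight orders and none of the others, so it is the most discriminative rule the search can find.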
The Main Findings
Study 1
This study encompassed two steps: 1) clustering analysis based on characteristics of the product in question and 2) expanding the clustering analysis by including customer characteristics in the data observed. Under step 1), the goal was to identify the most relevant fashion trends based on what customers decided were their preferred characteristics of shirts. The table below shows clustering results based on shirt attributes:
Figure 2: Cluster results
Six clusters were identified, each with a medoid that distinguishes it from the other clusters. Clusters 1, 2 and 4 all belong to the group of work shirts, accounting for 65% of all orders. Clusters 3, 5 and 6 belong to the group of fashion shirts, accounting for 35% of all orders. As we can see, Cluster 2 represents the most common choice (45%).
Under step 2), the data was expanded by including the following input variables: demographic, biometric, psychographic, geographic and behavioral. The most important conclusion the authors reached here was that fashion manufacturers and retailers should segment their market by age. For example, younger customers preferred the “super slim fit” shirts, whereas older customers preferred the “comfort fit”. Age also determined whether customers wanted their shirts to have pockets.
Other important factors determining customer preferences were found to be: 1) geographic location, 2) psychographic profile (lifestyle and purpose, professional requirements, fashion interest) and 3) price sensitivity (with women being more price-sensitive than men).
Study 2
The second study characterized orders by 19 variables, using the Body Mass Index (BMI) as the target variable. The authors used two measures to evaluate their results: 1) the size of the subgroups and 2) the deviation of each subgroup’s distribution from that of the full data. They ran the CN2-SD algorithm in RapidMiner and obtained a model consisting of 54 rules.
Each rule identified a population subgroup whose distribution differed from that of the total population. Since the goal was to obtain unusual rules, the authors quantified, for each rule, the difference between the proportion of examples in the subgroup that belong to the rule’s class and the proportion of examples of that class among all orders.
Since their aim was to determine how useful these rules were for the design of new shirts, the authors analyzed and classified the subgroups according to the following criteria:
1) uninteresting subgroups, namely
i) those that are obvious from a common-sense point of view
ii) subgroups with a small deviation from the total distribution of orders and
iii) small subgroups that represent less than 0.5% of all orders;
2) subgroups that are interesting from the viewpoint of marketing
3) subgroups that are interesting for design, satisfying the following conditions:
i) being a large subgroup, corresponding to at least 30% of total orders
ii) accounting for at least 5% of total orders, but with a significantly high deviation in class probability relative to the total distribution of orders.
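The size thresholds above can be turned into a simple screening function. The sketch below is only an illustration: the text gives no numeric cut-offs for a "small" or "significantly high" deviation, so `min_deviation` and `high_deviation` are hypothetical values. The `obvious` flag stands in for a human judgement, the two design conditions are treated as alternatives, and, because the text lists no criteria for the marketing category, everything that is neither uninteresting nor interesting for design is simply flagged for marketing review. All of these are assumptions.

```python
def classify_subgroup(n_subgroup, n_subgroup_in_class, n_total, n_class,
                      obvious=False, min_deviation=0.05, high_deviation=0.20):
    """Screen one rule's subgroup using the size and deviation criteria."""
    if n_subgroup == 0:
        return "uninteresting"
    share = n_subgroup / n_total  # fraction of all orders in the subgroup
    # Difference between the class proportion in the subgroup and overall.
    deviation = n_subgroup_in_class / n_subgroup - n_class / n_total

    # 1) Uninteresting: obvious, tiny (< 0.5% of orders) or barely deviating.
    if obvious or share < 0.005 or abs(deviation) < min_deviation:
        return "uninteresting"
    # 3) Interesting for design: large (>= 30%), or >= 5% with high deviation.
    if share >= 0.30 or (share >= 0.05 and abs(deviation) >= high_deviation):
        return "interesting for design"
    # 2) Everything else is left for marketing review (assumed fallback).
    return "candidate for marketing review"


# Hypothetical counts: 10,000 orders, 3,000 of them in the target BMI class.
print(classify_subgroup(40, 30, 10_000, 3_000))      # 0.4% of orders
print(classify_subgroup(3_500, 1_400, 10_000, 3_000))  # 35% of orders
print(classify_subgroup(800, 480, 10_000, 3_000))    # 8% of orders, 60% vs 30%
```

Checking the uninteresting criteria first means that a large subgroup with almost no deviation is discarded rather than passed on to design; the opposite order would be an equally defensible reading of the criteria.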
The authors concluded that not all of their 54 rules provided interesting knowledge, even if they described unusual distributions. Some were obvious, such as those stating that overweight customers selected regular fit shirts. Another obvious rule was that average weight customers opted for the Super Slim Fit type of shirts. However, they obtained some very interesting results from the viewpoint of marketing.
This was particularly true of the rules that segmented the market by geographic location. For example, the authors found that UK customers were more likely to be obese. Another rule showed that customers aged 35-44 were mostly overweight.
The most interesting rules were obtained from the standpoint of shirt design. For example, customers in the subgroup that tended to be obese chose a specific type of fabric: FabricID12186. This was also the largest subgroup and the one with the most unusual statistical distribution.
Future Work Suggestions and Implications for Practitioners
Obtained results
The authors point out that highly customized industries make it difficult to identify clear patterns of customer preferences, because their products are so diverse. In this study, two DM approaches were used: 1) clustering (K-medoids) and 2) subgroup discovery (CN2-SD algorithm). The results offered detailed knowledge for both design and marketing, which enables the fashion manufacturer to respond to customer requests by segmenting the market efficiently.
The authors believe that the same approach can also be applied in other industries, such as banking and automotive, because these use the same types of variables as this case study. For this reason, they suggest developing a tool that could be applied across different domains. The fact that they combined a number of tools in their own study supports this suggestion.
Limitations of research
However, the authors also pointed out a number of limitations in their research:
- The K-medoids clustering algorithm was complicated to use, because the number of clusters had to be defined a priori, which was a difficult and time-consuming task.
- Many of the variables they used were categorical, and a number of them were mutually exclusive. Furthermore, the majority of algorithms available in RapidMiner could not process categorical data.
- Finally, they selected algorithms that required significant computational effort. This challenge is particularly pronounced with large datasets, so the approach might not be appropriate for all studies. The authors of this study had the benefit of working with a small amount of data, which is why they were able to overcome this challenge. One solution in future work could be to select other variations of the algorithms, many of which are available in RapidMiner.
The authors conclude by saying that they also encountered problems in developing their DM project. Above all, it required a trial-and-error strategy and was time-consuming. This reveals the need to involve domain experts in order to make the process more efficient.
Suggestions for future work
Finally, the authors suggest expanding research to cover these under-researched areas:
- New heuristic approaches for finding the ideal number of clusters
- Defining the appropriate distance measure by making it specific to the business problem
- Putting other subgroup discovery algorithms to the test
- Finding the optimal algorithm by comparing different results.