Automatic Product Name Recognition from Short Product Descriptions
by Elnaz Pazhouhi
This thesis studies the problem of product name recognition from short product descriptions. The problem is increasingly important given the growing use of ERP (Enterprise Resource Planning) software at the core of modern business management systems, where information about business transactions is stored in unstructured data stores. A solution to the problem of product name recognition is especially useful for intermediate businesses, as they are interested in finding potential matches between the items in product catalogs (produced by manufacturers or other intermediate businesses) and the items in product requests (given by end users or other intermediate businesses).
In this context, the problem of product name recognition is especially challenging because product descriptions are typically short, ungrammatical, incomplete, abbreviated and multilingual. In this thesis we investigate the application of supervised machine-learning techniques and gazetteer-based techniques to our problem. We define the problem as a classification problem in which the tokens of product descriptions are classified into I, O and B classes according to the standard IOB tagging scheme. We then investigate and compare the performance of a set of hybrid solutions that combine machine learning and gazetteer-based approaches. We study a solution space that uses four learning models: linear SVC, non-linear SVC, Random Forest, and AdaBoost. Every solution uses the same set of features, which we divide into four categories: token-level features, document-level features, gazetteer-based features and frequency-based features. Moreover, we use automatic feature selection to reduce the dimensionality of the data, which improves training efficiency and helps avoid over-fitting.
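To make the IOB formulation concrete, the following minimal sketch labels the tokens of one description, marking the first token of the product name as B, the remaining name tokens as I, and everything else as O. The function name `iob_tags`, the whitespace tokenization and the sample description are illustrative assumptions, not taken from the thesis or its dataset.

```python
def iob_tags(tokens, name_tokens):
    """Tag tokens as B (begin), I (inside) or O (outside) of the product name.

    Hypothetical helper: only the first exact occurrence of name_tokens
    is labeled; all other tokens stay O.
    """
    tags = ["O"] * len(tokens)
    n = len(name_tokens)
    if n == 0:
        return tags  # no product name given, so every token is outside
    for start in range(len(tokens) - n + 1):
        if tokens[start:start + n] == name_tokens:
            tags[start] = "B"
            for i in range(start + 1, start + n):
                tags[i] = "I"
            break
    return tags


# Made-up description, split on whitespace (an assumption for illustration).
description = "3x circuit breaker 16A 1P".split()
for token, tag in zip(description, iob_tags(description, ["circuit", "breaker"])):
    print(token, tag)
```

Under this scheme, a classifier predicts one of the three tags for each token, and the product name is recovered as the B token followed by its consecutive I tokens.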
To evaluate the solutions, we develop a machine learning framework that takes as its inputs a list of predefined solutions (i.e. our solution space) and a preprocessed labeled dataset (i.e. a feature vector X, and a corresponding class label vector Y). It automatically selects the optimal number of most relevant features, optimizes the hyper-parameters of the learning models, trains the learning models, and evaluates the solution set. We believe that our automated machine learning framework can effectively be used as an AutoML framework that automates most of the decisions that have to be made in the design process of a machine learning solution for a particular domain (e.g. for product name recognition).
We also conduct a set of experiments and, based on the results, answer the research questions of this thesis. In particular, we determine (1) which learning models are more effective for our task, (2) which feature groups contain the most relevant features, (3) what the contribution of the different feature groups is to the overall performance of the induced model, (4) how gazetteer-based features are incorporated into the machine learning solutions, (5) how effective gazetteer-based features are, (6) what the role of hyper-parameter optimization is and (7) which models are more sensitive to hyper-parameter optimization.
According to our results, the solutions with the highest and lowest performance are non-linear SVC, with an F1 measure of 65%, and AdaBoost, with an F1 measure of 59%, respectively. This suggests that the choice of classifier has only a modest effect on the final outcome of the learning model, at least on the studied dataset. Additionally, our results show that the most effective feature group is the document-level features, with a 14.8% contribution to the overall performance (i.e. F1 measure). In second position is the group of token-level features, with a 6.8% contribution. The other two groups, the gazetteer-based features and the frequency-based features, make small contributions of 1% and 0.5% respectively. However, further investigation attributes the poor performance of the gazetteer-based features to the low coverage of the gazetteer used (i.e. ETIM).
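For reference, the F1 measure is the harmonic mean of precision and recall. The sketch below computes it over IOB-tagged sequences, assuming that a predicted product name counts as correct only when its span exactly matches a gold span. This abstract does not state whether the thesis scores at the token level or the span level, so the matching rule, the helper names `spans` and `f1_score`, and the sample tags are all assumptions for illustration.

```python
def spans(tags):
    """Return the set of (start, end) spans encoded by a B/I/O tag sequence."""
    found, start = set(), None
    for i, tag in enumerate(tags):
        if tag != "I" and start is not None:
            found.add((start, i))  # the open span ends before position i
            start = None
        if tag == "B":
            start = i  # a new span begins here
    if start is not None:
        found.add((start, len(tags)))  # span runs to the end of the sequence
    return found


def f1_score(gold, predicted):
    """Span-level F1 with exact matching (an assumed scoring rule)."""
    gold_spans, pred_spans = spans(gold), spans(predicted)
    true_positives = len(gold_spans & pred_spans)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred_spans)
    recall = true_positives / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


# Made-up tags: the prediction finds one of the two gold product names.
gold = ["O", "B", "I", "O", "B"]
predicted = ["O", "B", "I", "O", "O"]
print(round(f1_score(gold, predicted), 3))
```

In this made-up case precision is 1 and recall is 0.5, so F1 is 2/3. Under this exact-match rule, a prediction that covers only part of a product name earns no credit.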
Our experiments also show that all learning models over-fit the training data when a large number of features is used; thus, the use of feature selection techniques is essential to the robustness of the proposed solutions. Among the studied learning models, the performance of the non-linear SVC and AdaBoost models depends strongly on the chosen hyper-parameters. For those models, therefore, the computational cost of hyper-parameter tuning is justifiable.