A domain adaptation method for text classification based on self-adjusted training approach
Open Access
Text analysis
Information retrieval
Information grows rapidly every day, and most of it is kept in digital text documents: web pages, posts on social networks, blogs, e-mails [39], electronic books [17], and scientific publications [28]. Organizing and categorizing all this text automatically is helpful for many tasks. Supervised learning is the most successful approach to automatic text classification, but it assumes that the training and test sets come from the same distribution. Sometimes no labeled data are available in the target domain; instead, we have a labeled data set from a similar or related domain that can be used as an auxiliary domain. Even though the domains are similar, their feature spaces and distributions differ, so the performance of a supervised classifier degrades. This situation is called the domain adaptation problem. Domain adaptation algorithms are designed to narrow the gap between the target domain distribution and the auxiliary domain distribution. The semi-supervised technique of self-training iteratively enriches the training set with data from the test set. Using self-training for domain adaptation presents some challenges in the text classification scenario. First, the feature space changes at each iteration because new vocabulary is transferred from the target domain to the training set. Second, a way to select the most confidently labeled instances is needed, because adding wrongly labeled instances to the training set harms the model. Many of the methods addressing this problem need user-defined parameters, such as the number of instances selected per iteration or the stopping criterion. Tuning these parameters for a real problem is a problem in itself. In this work we propose a self-adjusting training method that adapts itself to the new distributions obtained during a self-training process. The method integrates several strategies to adjust its own settings at each iteration.
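The plain self-training loop described above can be sketched as follows. This is a minimal illustration, not the method proposed in the thesis: the Naive Bayes classifier, the function names, the toy documents, and the fixed `per_iter` and `max_iter` values are all assumptions made for the example. The two fixed values are exactly the kind of user-defined parameters that the proposed method aims to adjust on its own.

```python
import math
from collections import Counter

def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model; the vocabulary is rebuilt from docs."""
    vocab = {w for d in docs for w in d.split()}
    classes = sorted(set(labels))
    prior = {c: math.log(labels.count(c) / len(labels)) for c in classes}
    counts = {c: Counter() for c in classes}
    for d, y in zip(docs, labels):
        counts[y].update(d.split())
    totals = {c: sum(counts[c].values()) for c in classes}
    return vocab, classes, prior, counts, totals

def predict_proba(model, doc):
    """Return class probabilities; words outside the vocabulary are ignored."""
    vocab, classes, prior, counts, totals = model
    scores = {}
    for c in classes:
        s = prior[c]
        for w in doc.split():
            if w in vocab:
                # Laplace smoothing over the current vocabulary size
                s += math.log((counts[c][w] + 1) / (totals[c] + len(vocab)))
        scores[c] = s
    top = max(scores.values())
    exp = {c: math.exp(s - top) for c, s in scores.items()}
    z = sum(exp.values())
    return {c: v / z for c, v in exp.items()}

def self_train(src_docs, src_labels, tgt_docs, per_iter=2, max_iter=10):
    """Plain self-training with a fixed selection size and iteration cap."""
    train_docs, train_labels = list(src_docs), list(src_labels)
    remaining = list(range(len(tgt_docs)))
    pseudo = {}
    for _ in range(max_iter):
        if not remaining:
            break
        # Retraining rebuilds the feature space, so target vocabulary
        # added in earlier iterations becomes usable here.
        model = train_nb(train_docs, train_labels)
        scored = []
        for i in remaining:
            p = predict_proba(model, tgt_docs[i])
            label = max(p, key=p.get)
            scored.append((p[label], i, label))
        scored.sort(reverse=True)  # most confident predictions first
        for _conf, i, label in scored[:per_iter]:
            train_docs.append(tgt_docs[i])
            train_labels.append(label)
            pseudo[i] = label
            remaining.remove(i)
    return train_nb(train_docs, train_labels), pseudo

# Toy auxiliary (movie reviews) and target (product reviews) domains.
src_docs = ["great plot great acting", "boring plot bad acting"]
src_labels = ["pos", "neg"]
tgt_docs = ["great battery sturdy", "bad battery flimsy",
            "sturdy reliable", "flimsy unreliable"]

baseline = train_nb(src_docs, src_labels)
adapted, pseudo = self_train(src_docs, src_labels, tgt_docs)
```

In this toy run the auxiliary-domain model has no evidence about "sturdy" or "flimsy", while the adapted model picks them up from target documents it labeled with confidence in the first iteration. The same mechanism can also propagate a wrong label, which is why the selection step matters.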
The proposed method obtains good results on the thematic cross-domain text classification task: it reduces the error rate by 65.13% on average relative to the supervised learning approach on the testing dataset. It was also tested on cross-domain sentiment analysis, where it reduces the error rate by 15.62% on average relative to the supervised learning approach on the testing dataset. The performance obtained in the evaluation of the proposed method is competitive with other state-of-the-art methods.
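One common way to read such figures is as a relative error reduction against the supervised baseline. The sketch below assumes that reading. The function name and the error rates are hypothetical and are not taken from the thesis.

```python
def relative_error_reduction(baseline_error, adapted_error):
    """Percentage of the baseline error removed by the adapted model."""
    return 100.0 * (baseline_error - adapted_error) / baseline_error

# Hypothetical error rates, chosen only to illustrate the arithmetic.
reduction = relative_error_reduction(0.20, 0.07)
```

With these made-up rates, a drop from 20% to 7% error corresponds to a reduction of roughly 65 percent of the baseline error.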
Instituto Nacional de Astrofísica, Óptica y Electrónica
Master's thesis
General public
Garrido-Marquez I.
Accepted version (acceptedVersion)
Appears in Collections: Maestría en Ciencias Computacionales


File: GarridoMI.pdf
Size: 898.89 kB
Format: Adobe PDF