MDLText Applied to the Automatic Filtering of SPIM and SMS Spam

Renato Moraes Silva, Tiago A. Almeida, Akebo Yamakami

Abstract


Automatically filtering spam in instant messages and SMS is a challenging problem because the messages are often short and full of noise, such as slang, idioms, symbols, emoticons, and abbreviations, which hampers knowledge extraction and prediction. To address this problem, this article evaluates a text classification method based on the minimum description length principle that is efficient, fast, scalable, multiclass, and capable of incremental learning. Experiments on a real, public dataset, in both online and offline learning scenarios, indicate that the proposed method is promising for detecting spam in instant messages and SMS.


Keywords


Online learning; Occam's razor; Text categorization; Machine learning








iSys - Revista Brasileira de Sistemas de Informação - CESI/SBC
Electronic ISSN: 1984-2902