Electrical and Computer Engineering ETDs

Publication Date

9-1-2015

Abstract

Given the continuous growth of illicit activities on the Internet, there is a need for intelligent systems to identify malicious web pages. It has been shown that URL analysis is an effective tool for detecting phishing, malware, and other attacks. Previous studies have performed URL classification using a combination of lexical features, network traffic, hosting information, and other strategies. These approaches require time-intensive lookups which introduce significant delay in real-time systems. This paper describes a lightweight approach for classifying malicious web pages using URL lexical analysis alone. The goal is to explore the upper bound of the classification accuracy of a purely lexical approach. Another aim is to develop an approach which could be used in a real-time system. These goals culminate in the development of a classification system based on lexical analysis of URLs. It correctly classifies URLs of malicious web pages with 99.1% accuracy, a 0.4% false positive rate, an F1-Score of 98.7, and requires 0.62 milliseconds on average. This method substantially outperforms previously published algorithms on out-of-sample data.
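
As a rough illustration of what "lexical analysis alone" can mean in practice, the sketch below derives a few features from the URL string itself, with no DNS, WHOIS, or network lookups. The function names (`lexical_features`, `is_ipv4`) and the particular features chosen are assumptions made for illustration; the abstract does not list the features the thesis uses, so this should not be read as the author's method.

```python
import math
from collections import Counter
from urllib.parse import urlsplit


def is_ipv4(host: str) -> bool:
    """Return True if host looks like a dotted-quad IPv4 address."""
    parts = host.split(".")
    return len(parts) == 4 and all(p.isdigit() and int(p) <= 255 for p in parts)


def lexical_features(url: str) -> dict:
    """Compute illustrative (hypothetical) lexical features from a URL string.

    Only the characters of the URL are inspected; nothing is fetched or resolved.
    """
    # urlsplit only recognizes the host when a scheme is present, so add one if missing.
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""

    # Shannon entropy of the character distribution of the original string.
    n = len(url)
    counts = Counter(url)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values()) if n else 0.0

    return {
        "url_length": n,
        "host_length": len(host),
        "path_length": len(parts.path),
        "num_dots_in_host": host.count("."),
        "num_digits": sum(ch.isdigit() for ch in url),
        "num_hyphens": url.count("-"),
        "has_at_sign": "@" in url,
        "host_is_ipv4": is_ipv4(host),
        "char_entropy": entropy,
    }


if __name__ == "__main__":
    for name, value in lexical_features("http://192.168.0.1/login-update/index.php").items():
        print(f"{name}: {value}")
```

A feature dictionary like this could be fed to any supervised classifier. Because each feature is a cheap string operation, per-URL cost stays small, which is the property a real-time system would depend on.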

Keywords

Machine Learning, Malware Detection, Classification, Malicious Web Pages, Supervised Learning, Natural Language Processing

Sponsors

Amrita Center for CyberSecurity

Document Type

Thesis

Language

English

Degree Name

Computer Engineering

Level of Degree

Masters

Department Name

Electrical and Computer Engineering

First Committee Member (Chair)

Jordan, Ramiro

Second Committee Member

Lamb, Chris
