Our Cascaded Learning Framework for Phish Detection and an Online Demo



Our online cascaded phish detector consists of a client-side component and a server-side component. The client side is implemented as a Chrome extension, which injects a content script into web pages and extracts their HTML DOMs. The server side is implemented as a Java web application that runs in the Java servlet environment provided by Google App Engine (GAE).



The system diagram of our online cascaded phish detector is shown above. Classifying a given web page takes four major steps, among which the second step extracts the HTML DOM via the Chrome extension and the third step handles the classification task in the backend server-side code.

  1. The user opens a web page in the Chrome browser and clicks the icon of our extension.
  2. The content script that our Chrome extension injects dynamically extracts the HTML DOM of the web page and sends it, together with the web page URL, to our server-side code hosted on GAE. For web pages with frames/iframes, we apply some heuristics to find the URL in the frame/iframe that is most likely to contain malicious content.
  3. Our online cascaded phish detector classifies the web page.
  4. The classification result and some diagnostic statistics are sent back and displayed in the browser on the client side.
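
The server-side part of steps 3 and 4 can be sketched roughly as follows. This is a minimal illustration, not our actual implementation: it assumes a generic cascade in which each stage either returns a verdict or defers to the next stage, and the names `CascadeSketch`, `Stage`, `Verdict`, and `Result`, as well as the two toy stages in `main`, are hypothetical placeholders that do not describe the real detector's stages or features.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CascadeSketch {
    enum Verdict { PHISH, LEGITIMATE, UNDECIDED }

    /** One cascade stage: returns a verdict, or UNDECIDED to defer to the next stage. */
    interface Stage {
        Verdict classify(String url, String html);
    }

    /** What step 4 sends back: the verdict plus simple per-stage diagnostics. */
    static final class Result {
        final Verdict verdict;
        final String decidedBy;
        final List<String> diagnostics;

        Result(Verdict verdict, String decidedBy, List<String> diagnostics) {
            this.verdict = verdict;
            this.decidedBy = decidedBy;
            this.diagnostics = diagnostics;
        }
    }

    /** Runs the stages in insertion order and stops at the first one that decides. */
    static Result classify(String url, String html, Map<String, Stage> stages) {
        List<String> diagnostics = new ArrayList<>();
        for (Map.Entry<String, Stage> entry : stages.entrySet()) {
            long start = System.nanoTime();
            Verdict verdict = entry.getValue().classify(url, html);
            long micros = (System.nanoTime() - start) / 1_000;
            diagnostics.add(entry.getKey() + ": " + verdict + " (" + micros + " us)");
            if (verdict != Verdict.UNDECIDED) {
                return new Result(verdict, entry.getKey(), diagnostics);
            }
        }
        // No stage was confident enough to decide.
        return new Result(Verdict.UNDECIDED, "none", diagnostics);
    }

    public static void main(String[] args) {
        // Toy stages for illustration only (assumptions, not the real detector's rules).
        Map<String, Stage> stages = new LinkedHashMap<>();
        stages.put("ip-host", (url, html) ->
                url.matches("https?://\\d+\\.\\d+\\.\\d+\\.\\d+.*") ? Verdict.PHISH : Verdict.UNDECIDED);
        stages.put("password-form", (url, html) ->
                html.contains("type=\"password\"") ? Verdict.PHISH : Verdict.LEGITIMATE);

        Result result = classify("http://example.test/login", "<input type=\"password\">", stages);
        System.out.println(result.verdict + " decided by " + result.decidedBy);
        for (String line : result.diagnostics) {
            System.out.println("  " + line);
        }
    }
}
```

Using an insertion-ordered map keeps the stage order explicit, and the early return means that later stages only run when earlier ones defer.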


The client-side Chrome extension can be found here. Installing it in Chrome takes a single click.

Here is a screenshot of an example phish that our cascaded detector successfully detected.


And here is the result of our classification.




One observation is that the server side sometimes spends a significant share of its computation time parsing the web page content string into an HTML DOM. We currently use JTidy as the server-side DOM parser; a faster alternative might reduce the runtime of this step.
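
One way to quantify this before swapping parsers is to time the parsing step separately from the rest of the classification. The sketch below is only an illustration under stated assumptions: `ParseShareTimer` and `parseShare` are hypothetical names, and the parser and classifier are passed in as plain functions so that any DOM parser can be plugged in for comparison.

```java
import java.util.function.Function;

final class ParseShareTimer {
    /**
     * Returns the fraction (0.0 to 1.0) of wall-clock time spent in the parse step,
     * accumulated over the given number of runs.
     */
    static <D, R> double parseShare(String html,
                                    Function<String, D> parser,
                                    Function<D, R> classifier,
                                    int runs) {
        if (runs <= 0) {
            throw new IllegalArgumentException("runs must be positive");
        }
        long parseNanos = 0;
        long totalNanos = 0;
        for (int i = 0; i < runs; i++) {
            long t0 = System.nanoTime();
            D dom = parser.apply(html);      // step under suspicion: string to DOM
            long t1 = System.nanoTime();
            classifier.apply(dom);           // everything after parsing
            long t2 = System.nanoTime();
            parseNanos += t1 - t0;
            totalNanos += t2 - t0;
        }
        // Guard against a zero total on very coarse clocks.
        return totalNanos == 0 ? 0.0 : (double) parseNanos / totalNanos;
    }
}
```

Running the same measurement with two candidate parsers on the same set of pages gives a direct comparison. Early runs are affected by JIT warm-up, so it may be worth discarding them.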