Malware is one of the biggest threats in IT security, with millions of malicious applications released every year. Static analysis techniques may fail on obfuscated binaries, encrypted payloads, and similar evasion measures. An alternative is classical dynamic analysis, which executes a program in a safe environment and extracts information about its observed behavior. A limitation of this approach is that it is passive: it does not interact with the malware during execution. However, malicious behavior is often performed only when triggered by specific actions on the system.
Active Malware Analysis (AMA) focuses on acquiring knowledge about dangerous software by executing actions that trigger a response in the malware. Recently, Reinforcement Learning (RL) techniques have been proposed as a viable tool for AMA. In this context, RL techniques can select the most informative triggering action to execute on the infected system, so as to build an accurate model of the malware's behavior.
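To make the idea of selecting the most informative triggering action concrete, the following is a minimal sketch and not the project's actual method. It assumes the simplest RL setting, a multi-armed bandit with epsilon-greedy selection, and takes the reward to be the number of behaviors a trigger elicits. The names `toy_sample`, `analyze`, and the action and behavior strings are hypothetical, and the malware is replaced by a deterministic stand-in function.

```python
import random


def toy_sample(action):
    """Hypothetical stand-in for an instrumented malware sample.

    Returns the set of behaviors observed after a triggering action.
    """
    if action == "receive_sms":
        return {"send_sms", "read_contacts"}
    if action == "boot_completed":
        return {"start_service"}
    return set()


def select_action(q_values, epsilon, rng):
    """Epsilon-greedy: explore with probability epsilon, otherwise exploit."""
    if rng.random() < epsilon:
        return rng.choice(list(q_values))
    return max(q_values, key=q_values.get)


def analyze(sample, actions, episodes=50, epsilon=0.1, seed=0):
    """Estimate how informative each triggering action is for one sample."""
    rng = random.Random(seed)
    q_values = {a: 0.0 for a in actions}  # running mean reward per action
    counts = {a: 0 for a in actions}
    observed = set()                      # behavior model built so far
    for _ in range(episodes):
        # Try every action once before relying on the estimates.
        untried = [a for a in actions if counts[a] == 0]
        action = untried[0] if untried else select_action(q_values, epsilon, rng)
        responses = sample(action)
        reward = len(responses)           # assumed reward: behaviors elicited
        counts[action] += 1
        q_values[action] += (reward - q_values[action]) / counts[action]
        observed |= responses
    return q_values, observed


actions = ["boot_completed", "open_app", "receive_sms"]
q_values, observed = analyze(toy_sample, actions)
best_action = max(q_values, key=q_values.get)  # "receive_sms" for this toy sample
```

A realistic system would need a richer state (for example, the behaviors observed so far) and a reward that favors previously unseen behavior; the bandit above only illustrates the selection loop.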
In this project we focus on RL techniques for AMA, specifically on Android systems. Android is the most widely used mobile operating system, with an ever-growing number of malware samples released every day. Our aim is to design a system that uses dynamic analysis to classify and group real Android malware samples on the basis of similar behaviors and characteristic patterns (i.e., what the malware does), and not only static features (i.e., what the malware code looks like). In more detail, the system will use static analysis to pre-analyze malware, so as to generate a set of triggering actions. The system will then automatically select among these actions to analyze the malware samples.
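As an illustration of grouping samples by what they do, the sketch below clusters behavior profiles by Jaccard similarity. It is only one possible approach under stated assumptions: each sample is summarized as a set of observed behaviors, and a sample joins the first group whose representative is similar enough. The function names, the threshold of 0.5, and the sample and behavior names are all hypothetical.

```python
def jaccard(a, b):
    """Jaccard similarity between two behavior sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def group_by_behavior(profiles, threshold=0.5):
    """Greedily group samples whose behavior sets are similar.

    profiles maps a sample name to the set of behaviors observed for it.
    Each group is represented by its first member.
    """
    groups = []
    for name in sorted(profiles):
        for group in groups:
            representative = group[0]
            if jaccard(profiles[name], profiles[representative]) >= threshold:
                group.append(name)
                break
        else:
            # No existing group is close enough: start a new one.
            groups.append([name])
    return groups


# Hypothetical behavior profiles collected during dynamic analysis.
profiles = {
    "sample_a": {"send_sms", "read_contacts"},
    "sample_b": {"send_sms", "read_contacts", "start_service"},
    "sample_c": {"encrypt_files", "show_ransom_note"},
}
groups = group_by_behavior(profiles)
# groups == [["sample_a", "sample_b"], ["sample_c"]]
```

The first two samples share two of three behaviors (similarity 2/3) and end up together, while the third shares none and forms its own group.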
The project will bring together academics with a strong background in Artificial Intelligence and Cyber Security and experts from a highly innovative enterprise whose core business is the static and dynamic analysis of malware. The proposed approach will be implemented and empirically evaluated on real malware samples. Both its effectiveness (i.e., classification accuracy) and its efficiency (i.e., run time and computational requirements) will be carefully evaluated.