Code similarity is an important component of program analysis, with applications in many areas of computer science, such as reverse engineering of large collections of code fragments, clone detection, identification of intellectual property violations in software, malware detection, software maintenance, software attribution and software forensics. In these applications, comparing two code fragments requires taking into account changes due to code evolution, compiler optimisation and post-compile obfuscation. Such changes give rise to code fragments that are syntactically different but have the same intended behaviour. In recent years, researchers have developed many static and dynamic techniques for automatically identifying similar code fragments. Identifying similar code fragments that result from code obfuscation is particularly challenging, since obfuscation is explicitly designed to produce code variants that defeat program analysis. Indeed, static analysis of obfuscated code is known to be intractable in the general case. Researchers therefore usually resort to dynamic analysis or hybrid static/dynamic approaches when dealing with semantically equivalent code samples produced by obfuscation.
State-of-the-art generic deobfuscation methodologies, however, are based on dynamic symbolic execution, which suffers from limited code coverage and poor scalability.
The BinTrace project aims to design a new dynamic analysis technique for identifying similar code samples in binaries. The idea is to develop a methodology for the similarity analysis of fragments of execution traces that is both fast and precise. BinTrace then lifts the similarity between trace fragments to the similarity of whole binaries. Focusing on trace fragments is expected to make the identification of similarities among obfuscated code samples efficient.
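To make the two-step idea concrete, the following is a minimal sketch of comparing trace fragments and then lifting the fragment scores to a score for the whole trace. It is not BinTrace's actual method: the representation of a trace as a list of instruction mnemonics, the fixed-size fragmentation, the n-gram Jaccard measure, and all function names (`fragments`, `fragment_similarity`, `binary_similarity`) are assumptions chosen only for illustration.

```python
from typing import List, Sequence, Set, Tuple

# Assumption: an execution trace is modelled as a sequence of instruction
# mnemonics. Real traces would carry far more information.
Trace = Sequence[str]


def fragments(trace: Trace, size: int) -> List[Tuple[str, ...]]:
    """Split a trace into consecutive, non-overlapping fragments of at most `size` events."""
    return [tuple(trace[i:i + size]) for i in range(0, len(trace), size)]


def ngrams(fragment: Sequence[str], n: int) -> Set[Tuple[str, ...]]:
    """Return the set of n-grams of a fragment (the whole fragment if it is shorter than n)."""
    if len(fragment) < n:
        return {tuple(fragment)} if fragment else set()
    return {tuple(fragment[i:i + n]) for i in range(len(fragment) - n + 1)}


def fragment_similarity(a: Sequence[str], b: Sequence[str], n: int = 2) -> float:
    """Jaccard similarity of the n-gram sets of two fragments (hypothetical measure)."""
    ga, gb = ngrams(a, n), ngrams(b, n)
    if not ga and not gb:
        return 1.0
    return len(ga & gb) / len(ga | gb)


def binary_similarity(trace_a: Trace, trace_b: Trace, size: int = 4, n: int = 2) -> float:
    """Lift fragment scores to a whole-trace score.

    Each fragment of trace_a is matched with its most similar fragment of
    trace_b, and the best scores are averaged. Note that this simple lifting
    is not symmetric in its two arguments.
    """
    frags_a, frags_b = fragments(trace_a, size), fragments(trace_b, size)
    if not frags_a or not frags_b:
        return 0.0
    best = [max(fragment_similarity(fa, fb, n) for fb in frags_b) for fa in frags_a]
    return sum(best) / len(best)
```

Under these assumptions, two identical traces score 1.0 and two traces with no shared n-grams score 0.0. A measure meant to survive obfuscation would need a semantic notion of fragment equivalence rather than this purely syntactic one, which is used here only to show the fragment-then-lift structure.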