Abstract:
The Android mobile operating system runs on roughly 95% of mobile phones worldwide, making it a major target for malware. To counter this threat, significant research effort continues to go into mobile malware detection, and supervised machine learning is one of the key tools used in that work. Detection accuracy has improved significantly, but one necessary step in the process is often overlooked or given less attention than it deserves: feature selection, i.e., determining which attributes of Android are most effective as inputs to machine learning models. To tackle this research problem, we tested 11 feature selection techniques and analyzed their performance on three large primary Android feature sets used by researchers: Permissions, Intents, and API Calls. The generated feature sets were validated using three popular machine learning classifiers. As part of an ongoing research effort, this paper answers the following research questions: 1) How does machine learning model accuracy vary across feature selection algorithms? 2) How does feature set size affect model accuracy across feature selection methods? and 3) What are the primary guidelines for building a feature set in Android malware detection? The answers to these research questions provide significant insights into Android malware detection and shed light on other malware detection methods.
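To make the notion of feature selection concrete, the following is a minimal sketch of one common filter-style technique: ranking binary features by information gain. It is an illustration only. The abstract does not name the 11 techniques evaluated, so the choice of information gain is an assumption, and the permission names, the toy app matrix, and the helper functions `entropy`, `information_gain`, and `select_top_k` are hypothetical and not taken from the study.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def information_gain(column, labels):
    """Reduction in label entropy after splitting on one binary feature."""
    total = len(labels)
    gain = entropy(labels)
    for value in (0, 1):
        subset = [y for x, y in zip(column, labels) if x == value]
        gain -= (len(subset) / total) * entropy(subset)
    return gain


def select_top_k(rows, labels, names, k):
    """Rank features by information gain and keep the k best names."""
    scores = {
        name: information_gain([row[i] for row in rows], labels)
        for i, name in enumerate(names)
    }
    ranked = sorted(names, key=lambda n: scores[n], reverse=True)
    return ranked[:k], scores


# Hypothetical toy data (an assumption, not from the paper): each row is
# one app, each column records whether the app requests a permission.
FEATURES = ["SEND_SMS", "INTERNET", "READ_CONTACTS", "CAMERA"]
APPS = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 1],
    [0, 1, 1, 0],
    [0, 1, 0, 0],
]
LABELS = [1, 1, 1, 0, 0, 0]  # 1 = malware, 0 = benign

selected, scores = select_top_k(APPS, LABELS, FEATURES, k=2)
print(selected)
```

In this toy data, a permission requested by every app (`INTERNET`) carries no information about the label, while one requested only by the malicious samples (`SEND_SMS`) scores highest. Varying `k` corresponds to the feature set size examined in the second research question.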