Mnemocrypt is a tool, based on a random forest classifier, for the detection and partial identification of cryptographic functions in x86 executables. The machine learning model bases its predictions on general metrics describing the structure of functions, as well as on statistics over metrics describing their content at different levels of granularity; the building blocks of these metrics are essentially the mnemonics of assembly instructions. Mnemocrypt can be seen as a generalisation of approaches based on the Caballero heuristics, and it incorporates some of them. The Mnemocrypt IDA plugin can provide partial cryptographic identification information when combined with a slightly modified version of Findcrypt3 (included in this repository), an IDA plugin for cryptography detection and identification based on yara rules. The tool has been tested on IDA Pro 8.3 and 9.0, both with Python 3.9.2, in a Windows environment with WSL.
The primary role of the repository is to support the research paper Mnemocrypt: A Machine Learning Approach for Cryptographic Function Detection in x86 Executables, while the release version contains a ready-to-use version of the plugin.
Coloring convention:
- yellow: confidence score 0.5-0.75
- orange: confidence score 0.75-0.95
- red: confidence score 0.95-1.0
The minimal confidence score and the coloring convention can be changed in the plugin script mnemocrypt.py.
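The coloring convention above can be read as a simple threshold lookup. The following is a minimal sketch of that mapping, not the code used in mnemocrypt.py: the names `MIN_CONFIDENCE`, `COLOR_BANDS` and `color_for_score` are hypothetical, and treating each lower bound as inclusive (so that 0.75 is orange and 0.95 is red) is an assumption, since the listed ranges share their endpoints.

```python
from typing import Optional

# Thresholds come from the coloring convention above; all names here are
# illustrative and are NOT taken from mnemocrypt.py.
MIN_CONFIDENCE = 0.5
COLOR_BANDS = [  # (inclusive lower bound, color), ordered from highest to lowest
    (0.95, "red"),
    (0.75, "orange"),
    (MIN_CONFIDENCE, "yellow"),
]


def color_for_score(score: float) -> Optional[str]:
    """Return the highlight color for a confidence score, or None below the minimum."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"confidence score must be in [0, 1], got {score}")
    for lower_bound, color in COLOR_BANDS:
        if score >= lower_bound:
            return color
    # Assumption: functions scoring below the minimum are left uncolored.
    return None
```

Raising `MIN_CONFIDENCE` or editing `COLOR_BANDS` in such a table is the kind of change the sentence above refers to.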
The higher the confidence score, the more likely a given function is, according to Mnemocrypt, to perform cryptographic operations.
The most frequent kinds of false positives with a high confidence score (greater than 0.9) are compression or encoding functions, as well as functions performing complex computations or data processing unrelated to cryptography.
What to do just after having downloaded the repository to quickly test Mnemocrypt on the provided malware dataset
- If IDA is installed under Linux, or under Windows with WSL available, run
./prepare_environment.sh
and answer the prompts. This should automatically initialize some essential variables in the scripts to be run later.
- If the user is working in some other environment, or if ./prepare_environment.sh does not work, then the user has to set the idat_path variable (the absolute path to idat.exe) in ./common/building_wrapper.sh and ./tool/plugin_batch.sh, and set the repository_dirpath variable (the path to the location of this repository) in findcrypt3.py and mnemocrypt.py; finally, the user has to move the files from move_to_ida_plugins/ to the IDA plugins directory (C:\Users\john\Programs\IDA_Pro_9.0\plugins, for example).
- Disable the antivirus, because it can interfere with the generation of IDA databases; it can be reactivated once the IDA database files of the executables from the malware dataset have been generated.
- Install the necessary Python modules by running
pip install -r requirements.txt
(Python 3.9.2 recommended).
- Run
./quick_start.sh
and expect the process to take several hours (e.g. 2 hours on a machine with a 2.7GHz processor, 16GB of RAM and a 250GB SSD under Windows).
- Use Mnemocrypt in the IDA GUI (by opening the IDA database files of the provided malware samples) as you wish (shortcut Ctrl-Shift-M; name of the plugin in the IDA GUI: Mnemocrypt), or run Mnemocrypt in batch mode on all analyzed binaries with
./tool/plugin_batch.sh mnemocrypt
and access the exported results in ./tool/mnemocrypt_results.csv.
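Once the batch run has written its CSV export, the flagged functions can be triaged outside IDA. The sketch below keeps only the rows at or above a chosen confidence and sorts them from most to least confident. The column layout of the export is not documented in this README, so the column name `confidence` is a hypothetical placeholder that has to be replaced by the actual header; `high_confidence_rows` is likewise an illustrative name.

```python
import csv
from typing import Dict, Iterable, List


def high_confidence_rows(
    lines: Iterable[str],
    score_column: str = "confidence",  # hypothetical column name, adjust to the real header
    threshold: float = 0.9,
) -> List[Dict[str, str]]:
    """Return the CSV rows whose score is >= threshold, highest score first."""
    selected = []
    for row in csv.DictReader(lines):
        try:
            score = float(row[score_column])
        except (KeyError, TypeError, ValueError):
            continue  # skip rows without a usable score
        if score >= threshold:
            selected.append(row)
    selected.sort(key=lambda r: float(r[score_column]), reverse=True)
    return selected
```

The function accepts any iterable of text lines, so an open file works directly, for example `with open("./tool/mnemocrypt_results.csv", newline="") as f: rows = high_confidence_rows(f)`.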
How to use the provided pre-trained Mnemocrypt model to classify functions from your set of binaries
- Perform steps 1, 2, 3 and 4 from the previous section, unless already done.
- Ensure that the pre-trained Mnemocrypt model trained_mnemocrypt.pkl is present in the ./common/ folder (unzip ./data.zip with the provided password, if necessary).
- If ./tool/raw_executables/, ./tool/ida_databases/ or ./tool/computed_features/ already exist, remove them, after backing them up together with the previously generated files (./tool/immediate_crypto_functions.json, ./tool/immediate_non_crypto_functions.json, ./tool/unrecognized_mnemonics.json, ./tool/findcrypt_matches.csv, ./tool/findcrypt_tags.json and ./tool/mnemocrypt_predictions.csv) if necessary.
- Place the raw binaries from your set in ./tool/raw_executables/.
- Run
./common/building_wrapper.sh databases && ./common/building_wrapper.sh features && ./tool/plugin_batch.sh findcrypt
- See step 6 from the section above.
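The chained command above runs three stages in order and stops at the first failure. For users who prefer to drive it from Python, a minimal sketch with the same stop-on-error behaviour could look as follows. The script paths and arguments are the ones given above; `PIPELINE` and `run_pipeline` are illustrative names, and the sketch assumes it is started from an environment where the shell scripts are directly executable (Linux or WSL).

```python
import subprocess
from typing import List, Sequence

# The three stages of the chained command above, in order.
PIPELINE: List[List[str]] = [
    ["./common/building_wrapper.sh", "databases"],
    ["./common/building_wrapper.sh", "features"],
    ["./tool/plugin_batch.sh", "findcrypt"],
]


def run_pipeline(commands: Sequence[Sequence[str]], repo_root: str = ".") -> None:
    """Run each command from repo_root, stopping at the first non-zero exit code."""
    for command in commands:
        print("running:", " ".join(command))
        # check=True raises CalledProcessError on failure, which mirrors `&&`.
        subprocess.run(list(command), cwd=repo_root, check=True)
```

Calling `run_pipeline(PIPELINE)` from the root of the repository is then equivalent to the one-line shell command.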
- In this README, "./" stands for the root directory of the repository.
- The zip archive ./data.zip (stored in the repo via git LFS), which contains the executables and the trained model, is protected by a password (hardcoded in quick_start.sh).
- The pre-trained model corresponds to the default training; the user can customize it by modifying the hyperparameters in ./training/train_mnemocrypt.py, the features in ./common/internal_compute_features.py, or any data used to generate the features.
- The zip archive contains malware samples, so if the user wants to run Mnemocrypt on them, it is recommended to work in an isolated sandbox environment, and it is mandatory to deactivate the antivirus at least until all IDA databases of the malware samples have been generated.
- Mnemocrypt cannot be used on a binary (in practice, on its IDA database) unless the corresponding .csv file with computed features is present in computed_features.
- Relation between the Mnemocrypt and Findcrypt IDA plugins: Mnemocrypt is fully independent of Findcrypt in its approach to cryptography detection, so it can be run without the Findcrypt plugin even being installed. However, the cryptographic byte patterns matched by the Findcrypt rules make it possible to display more cryptographic identification information than Mnemocrypt alone provides (it natively supports only the AES-NI and Intel SHA extensions at the moment).
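Because Mnemocrypt needs a features file for every binary it is run on, it can be useful to check which IDA databases do not have one yet. The sketch below compares two file listings by basename. It rests on two assumptions that should be verified against the actual contents of the directories: that IDA databases carry a `.i64` or `.idb` extension, and that the features file of a binary shares the basename of its database. The name `binaries_missing_features` is illustrative.

```python
from pathlib import PurePath
from typing import Iterable, List

# Assumption: IDA database files end in one of these extensions.
IDA_DATABASE_SUFFIXES = {".i64", ".idb"}


def binaries_missing_features(
    database_files: Iterable[str], feature_files: Iterable[str]
) -> List[str]:
    """Return the basenames of IDA databases that have no matching .csv features file."""
    with_features = {
        PurePath(name).stem
        for name in feature_files
        if PurePath(name).suffix.lower() == ".csv"
    }
    databases = {
        PurePath(name).stem
        for name in database_files
        if PurePath(name).suffix.lower() in IDA_DATABASE_SUFFIXES
    }
    # Assumption: database and features file share the same basename.
    return sorted(databases - with_features)
```

A typical call would pass `os.listdir("./tool/ida_databases")` and `os.listdir("./tool/computed_features")`; an empty result means every database has its features file.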
mnemocrypt/
├─ common/ // Groups files related to both tool and training modes
│ ├─ building_wrapper.sh // Generic script to either create IDA databases from raw executables or run features computation for already built IDA databases
│ ├─ categories.json // Categories of mnemonics and their associated roots
│ ├─ internal_compute_features.py // Features computation
│ ├─ prepare_roots.py // Combines information from categories.json and root_prefixes.json to build prepared_roots.json
│ ├─ prepared_roots.json // Used in internal_compute_features.py for time-efficient mnemonics-roots matching
│ ├─ root_prefixes.json // Mnemonics prefixes appended to roots for mnemonics matching during features computation
│ ├─ training_set_basenames_listing.txt // Lists the basenames (i.e. filenames without extensions) of the executables belonging to the training set
│
├─ data.zip // Large files are initially stored in zip format in order to minimize the size of the repository; the zip contains the OpenSSL and Libsodium cryptographic libraries built with different configurations (training set), some real-world malware samples and the pre-trained Mnemocrypt model
│
├─ doc/ // Additional explanatory information
│ ├─ crypto_functions_labels.txt // Convention on the cryptographic labels used in the files of the crypto_functions_names/ directory
│ ├─ malware_samples_name_mapping.json // Stores information on the original names of the provided malware samples, given by their hashes
│ ├─ merged_roots.txt // Indicates what some of the mnemonics roots whose statistics appear among the features actually correspond to
│ ├─ prefixes_documentation.txt // Origin of each root prefix
│ ├─ roots_documentation.txt // Explanation of the semantics behind each declared mnemonics root
│
├─ LICENSE
├─ mnemocrypt.png
│
├─ move_to_ida_plugins/ // Files from this directory are to be moved to the plugins/ directory of IDA
│ ├─ findcrypt3.py // Slightly modified version of Findcrypt3 (adds to the output the Xrefs to functions matching crypto-signatures by their code or referencing data matching crypto-signatures)
│ ├─ findcrypt3.rules // Updated rules (crypto-signatures) used by Findcrypt (merge of the yara rules from the original repository https://github.com/polymorf/findcrypt-yara and the ones from https://github.com/packmad/findcrypt-PYara)
│ ├─ mnemocrypt.py // Mnemocrypt plugin
│
├─ quick_start.sh // Unzips data, builds IDA databases of binaries from the provided malware dataset, computes their features, then runs Findcrypt and Mnemocrypt on them in batch mode with export of results
├─ requirements.txt // Using Python 3.9.2 is highly recommended!
├─ README.md
│
├─ tool/ // Groups information related to the binaries to use Mnemocrypt with (a dataset of real-world malware samples is provided as an example), together with scripts related to the Mnemocrypt and modified Findcrypt plugins
│ ├─ internal_findcrypt_batch.py // Automatically runs the modified Findcrypt plugin on a given binary and exports results
│ ├─ internal_mnemocrypt_batch.py // Automatically runs the Mnemocrypt plugin on a given binary and exports results
│ ├─ plugin_batch.sh // Generic script to run either the modified Findcrypt or the Mnemocrypt plugin on all binaries under study (based on the content of computed_features/) with results export
│
├─ training/ // Includes information on the training set and the model training script
│ ├─ crypto_functions_names/ // Lists the functions from the training set labeled as cryptographic (obtained by a manual labelling process); the training process is heavily based on this information
│ ├─ train_mnemocrypt.py // Trains the random forest classifier of Mnemocrypt based on features from the computed_features/ directory; script not to be run unless the user wants to customize Mnemocrypt
- ./[training|tool]/computed_features/: Contains the computed features (as .csv files, one per binary)
- ./tool/findcrypt_matches.csv: Results of the Findcrypt run on all the executables from the ./tool/ directory
- ./tool/findcrypt_tags.json: Part of the information from ./tool/findcrypt_matches.csv that the Mnemocrypt plugin can use to display cryptographic identification information on flagged functions
- ./[training|tool]/ida_databases/: IDA databases associated with the raw executables used
- ./[training|tool]/immediate_crypto_functions.json: Functions containing mnemonics from the AES-NI or Intel SHA extensions instruction sets; Mnemocrypt directly considers such functions as cryptographic, without passing them through the random forest classifier
- ./[training|tool]/immediate_non_crypto_functions.json: Functions containing floating-point related mnemonics, or consisting of a single basic block with few instructions; the content of the file is not used and mainly serves tracking purposes
- ./tool/mnemocrypt_predictions.csv: Results of the Mnemocrypt run on all the executables from the ./tool/ directory
- ./[training|tool]/raw_executables/: Contains the executables to analyze
- ./common/trained_mnemocrypt.pkl: Pre-trained Mnemocrypt model, provided so that the user does not have to train the model when there is no need to customize it; can be generated with the ./training/train_mnemocrypt.py script
- ./[training|tool]/unrecognized.json: Mnemonics not recognized during features computation; the content of the file is not used and is only relevant if the user is interested in customizing Mnemocrypt
- ./common/weights_trained_mnemocrypt.txt: Weights of the features of ./common/trained_mnemocrypt.pkl, sorted in decreasing order of importance; purely informative for the user (to get insight into what Mnemocrypt essentially bases its classification decisions on); generated along with ./common/trained_mnemocrypt.pkl by ./training/train_mnemocrypt.py
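The immediate_crypto_functions.json entry above describes a shortcut: a function that contains AES-NI or Intel SHA extension mnemonics is treated as cryptographic without consulting the classifier. A minimal sketch of such a check is given below. The mnemonic sets list the base instructions of the two extensions; whether Mnemocrypt also matches related forms (for example VEX-encoded `vaes*` variants) is not stated here, so the sets and the name `immediate_crypto_tag` should be read as assumptions rather than as the tool's actual implementation.

```python
from typing import Iterable, Optional

# Base mnemonics of the AES-NI and Intel SHA extensions (illustrative sets,
# not copied from the Mnemocrypt sources).
AES_NI_MNEMONICS = frozenset(
    {"aesenc", "aesenclast", "aesdec", "aesdeclast", "aesimc", "aeskeygenassist"}
)
INTEL_SHA_MNEMONICS = frozenset(
    {
        "sha1rnds4", "sha1nexte", "sha1msg1", "sha1msg2",
        "sha256rnds2", "sha256msg1", "sha256msg2",
    }
)


def immediate_crypto_tag(mnemonics: Iterable[str]) -> Optional[str]:
    """Return an identification tag if the function uses a crypto extension, else None."""
    seen = {mnemonic.lower() for mnemonic in mnemonics}
    if seen & AES_NI_MNEMONICS:
        return "AES-NI"
    if seen & INTEL_SHA_MNEMONICS:
        return "Intel SHA"
    return None  # the function would go through the classifier instead
```

In this sketch a function using both extensions is tagged "AES-NI" only, because that check comes first; returning a set of tags would be an easy variation.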
- ./common/building_wrapper.sh: the first argument is either "databases" (generate IDA databases) or "features" (compute features on already built IDA databases). The second argument is optional: if it is omitted, the script ignores the binaries from the training set; "training" processes only the training set; "all" processes both training and non-training data. Unless the user wants to customize the Mnemocrypt model, there is normally no need to set the second argument.
- ./tool/plugin_batch.sh: the first (and only) argument is either "findcrypt" or "mnemocrypt", to run respectively the modified version of Findcrypt or Mnemocrypt in batch mode on all binaries (except the ones from the training set), with export of the results.
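The accepted argument values of the two scripts can be summarised in a small helper that builds a command line and rejects anything else. This is only a sketch of the rules stated above: the helper names `building_wrapper_argv` and `plugin_batch_argv` are hypothetical, and the scripts themselves may validate their arguments differently.

```python
from typing import List, Optional


def building_wrapper_argv(action: str, scope: Optional[str] = None) -> List[str]:
    """Build the command line for ./common/building_wrapper.sh."""
    if action not in ("databases", "features"):
        raise ValueError(f"first argument must be 'databases' or 'features', got {action!r}")
    if scope not in (None, "training", "all"):
        raise ValueError(f"second argument must be omitted, 'training' or 'all', got {scope!r}")
    argv = ["./common/building_wrapper.sh", action]
    if scope is not None:
        argv.append(scope)  # omitted scope means: skip the training set
    return argv


def plugin_batch_argv(plugin: str) -> List[str]:
    """Build the command line for ./tool/plugin_batch.sh."""
    if plugin not in ("findcrypt", "mnemocrypt"):
        raise ValueError(f"argument must be 'findcrypt' or 'mnemocrypt', got {plugin!r}")
    return ["./tool/plugin_batch.sh", plugin]
```

For instance, `building_wrapper_argv("features", "all")` yields the command that computes features for both training and non-training binaries.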