This dataset contains the samples and the models trained and tested in this article. It also contains the directories used for generating the samples. It consists of five separate compressed files:
- samplesAndModels.tar.xz (41.5MB)
  - samples/: The samples used for training and testing all the models in the article.
    - ransomwareSamples_train: The ransomware samples for each binary, after the inactive ones were removed (32 hours).
    - ransomwareSamples_test: All ransomware samples for each binary, including those gathered before the ransomware had started or after it had finished the encryption (122 hours).
    - userSamples: All user samples split by day (0 to 6). The 'day0' samples were used for training the models (except in the training-day comparison).
    - scaler.scaler: The StandardScaler object used for normalizing the sample values.
  - NN_CNN_LSTM_Comparison/: NN, CNN and LSTM models generated and compared in Section 3.3 of the article. These models were trained with all ransomware samples and the 'day0' users' traffic trace.
  - chronologicalModel/: The models generated for the chronological evaluation.
  - MODEL-ALT/: The models generated for the comparison between training with different days of non-infected traffic.
- Standard1Directory.tar.xz (MB)
  - inputfile_Standard1: Configuration file for generating the Standard1 directory.
  - Standard1/: Directory Standard1 generated with the impressions software (link here).
- Standard2Directory.tar.xz (MB)
  - inputfile_Standard2: Configuration file for generating the Standard2 directory.
  - Standard2/: Directory Standard2 generated with the impressions software (link here).
- SmallFilesDirectory.tar.xz (MB)
  - inputfile_SmallFiles: Configuration file for generating the SmallFiles directory.
  - SmallFiles/: Directory SmallFiles generated with the impressions software (link here).
- LargeFilesDirectory.tar.xz (MB)
  - inputfile_LargeFiles: Configuration file for generating the LargeFiles directory.
  - LargeFiles/: Directory LargeFiles generated with the impressions software (link here).
The files containing samples are structured as follows:
- Each line is one sample.
- Each sample has 30 features, the label (1 for an 'infected' sample and 0 otherwise) and the timestamp of the last sample interval in the trace (in seconds since the beginning of the trace).
- The values are separated by ',' (CSV format).
- The values are not normalized, but the StandardScaler object used for normalizing them is in the file scaler.scaler.
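As a minimal sketch of the layout described above, the following Python function splits each line into its 30 features, its label and its timestamp. The column order (features first, then label, then timestamp), the absence of a header row and the name parse_samples are assumptions made for illustration; they are not confirmed by this description.

```python
import csv

N_FEATURES = 30  # number of features per sample, as stated above


def parse_samples(lines):
    """Split CSV sample lines into (features, labels, timestamps).

    Assumption: each row holds 30 feature values, then the label,
    then the timestamp, and the file has no header row.
    """
    features, labels, timestamps = [], [], []
    for row in csv.reader(lines):
        if not row:
            continue  # skip blank lines
        if len(row) != N_FEATURES + 2:
            raise ValueError(f"expected {N_FEATURES + 2} columns, got {len(row)}")
        features.append([float(v) for v in row[:N_FEATURES]])
        labels.append(int(float(row[N_FEATURES])))  # 1 = infected, 0 = not infected
        timestamps.append(float(row[N_FEATURES + 1]))  # seconds since trace start
    return features, labels, timestamps
```

An open file object can be passed directly, for example `with open(path, newline="") as f: parse_samples(f)`. The returned feature values are still unnormalized; how the StandardScaler in scaler.scaler was serialized is not stated here, so loading it is left out of this sketch.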
The models can be loaded in a Python script using Keras. Some important considerations about them are explained in the following lines.
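One possible way to rebuild a model from an architecture file produced by to_json() is sketched below. It uses Keras's model_from_json through the tensorflow.keras import path, which is an assumption about the installed environment. The helper names and the optional weights_path argument are hypothetical: this description does not say how the trained weights are stored.

```python
import json


def read_architecture(json_path):
    """Return the architecture JSON string written by to_json()."""
    with open(json_path, "r", encoding="utf-8") as f:
        text = f.read()
    json.loads(text)  # fail early if the file is not valid JSON
    return text


def load_model(json_path, weights_path=None):
    """Rebuild a Keras model from its JSON architecture.

    weights_path is a hypothetical argument: a to_json() file holds
    only the architecture, so trained weights must come from elsewhere.
    """
    # Imported lazily so read_architecture works without TensorFlow installed.
    from tensorflow.keras.models import model_from_json

    model = model_from_json(read_architecture(json_path))
    if weights_path is not None:
        model.load_weights(weights_path)
    return model
```

Called as `load_model("NN.json")`, this would return the untrained architecture only; predictions are meaningful only once the trained weights have been loaded as well.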
Neural Network model (NN)
The Neural Network model is composed of three hidden layers with 512, 256 and 128 cells. The input layer has 30 cells, and the output layer has only 1 (binary classification).
The complete information about its structure is in NN.json, in the main repository's directory. The file was obtained by calling to_json() on the Keras model.
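To give a sense of the model's size, the layer widths above determine the number of trainable parameters if every layer is a standard fully connected layer with a bias term. That is an assumption made for this sketch; NN.json remains the authoritative description of the structure.

```python
def dense_param_count(layer_sizes):
    """Count weights and biases of a fully connected network.

    Each pair of consecutive layers contributes n_in * n_out weights
    plus n_out biases.
    """
    return sum(
        n_in * n_out + n_out
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Input, three hidden layers and output, as listed above.
NN_LAYER_SIZES = [30, 512, 256, 128, 1]
print(dense_param_count(NN_LAYER_SIZES))  # 180225
```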
Convolutional Neural Network model (CNN)
The Convolutional Neural Network model is composed of two convolutional layers followed by two pooling layers and a final one-unit dense layer that classifies the sample as one of the two classes.
The complete information about its structure is in CNN.json, in the main repository's directory. The file was obtained by calling to_json() on the Keras model.
Long Short Term Memory models (LSTM)
All the Long Short Term Memory models compiled in this article have the same structure. They have an input layer and one additional hidden layer, followed by an output layer that has only one cell.
As in the previous cases, the complete information is in LSTM.json, in the main repository's directory. The file was obtained by calling to_json() on the Keras model.
General considerations
In a prediction, each model outputs a value between 0 and 1 rather than a binary label. Because our classification problem is binary (two classes), a threshold must be set on the classifier output. After some experiments, we concluded that the best option is to set the threshold to 0.99, because false positives are much more problematic than false negatives. All the experiments in the article were performed with this threshold.
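The thresholding step can be sketched as follows. The function name classify is hypothetical, and treating a score exactly equal to the threshold as 'infected' (>= rather than >) is an assumption, since the comparison used in the article is not stated here.

```python
THRESHOLD = 0.99  # value used for all the experiments in the article


def classify(scores, threshold=THRESHOLD):
    """Map model outputs in [0, 1] to labels: 1 = infected, 0 = not infected."""
    # Assumption: a score equal to the threshold counts as infected.
    return [1 if score >= threshold else 0 for score in scores]


print(classify([0.50, 0.989, 0.995]))  # [0, 0, 1]
```

With such a high threshold, a sample is flagged as infected only when the model is almost certain, which trades some missed detections for fewer false alarms.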