In this article I present an analysis of a network intrusion detection system for computer networks. The goal of the supervised learning experiment is to distinguish between normal and abnormal traffic in a computer network. I apply the logistic regression (LR) algorithm for this two-class problem.
The analysis is based on a study mentioned in the references. Both the training dataset and the validation dataset are obtained from the Kaggle community.
After analyzing the data I observed that some features contain both exceptionally high and very low numerical values. To overcome the difficulties this causes, I applied a logarithmic transformation to the relevant variables, followed by a translation so that all values lie in the first quadrant.
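As a minimal sketch of this transformation (the actual helper `return_log` is defined in the sourced functions file and may differ; `epsilon` and the example values here are illustrative):

```r
# Log-transform a skewed, non-negative variable, then translate the result
# so that all values are >= 0 (first quadrant).
epsilon <- 1e-3
x <- c(0, 1, 10, 100000)           # widely spread example values
x_log <- log(x + epsilon)          # compress the range; epsilon handles zeros
x_shifted <- x_log - min(x_log)    # translate to >= 0
```

After the shift, the smallest value is exactly 0 and the ordering of the data points is preserved.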
The LR algorithm is applied to the dataset both with and without these operations, and tested against an available validation dataset. A classification report is created for both situations.
The dataset used contains $125945$ data points. A few services are removed from the dataset because they are not available in the validation dataset.
source("C:\\Anaconda\\R\\Blog - Intrusion Detection System Functions.r")
set.seed(777)
options(repr.matrix.max.rows = 600, repr.matrix.max.cols=50, scipen=999, repr.plot.width=15, repr.plot.height=15, warn = -1)
do_plot <- TRUE
df_nid <- read_data("NID_network_intrusion_detection_dataset.csv", azure_ml)
df_nid <- df_nid[df_nid$service %ni% c('http_8001', 'http_2784', 'aol', 'harvest', 'red_i', 'urh_i', 'tftp_u'), ]
df_nid$diff_level <- NULL
df_nid$is_host_login <- NULL
The dataset consists of numerical and categorical variables, and there are no missing values. The class variable represents the attack types [1]. The categorical variables take many distinct values.
Below, the structure of the dataset, the frequencies of the categorical values, and the class distribution are studied in turn:
str(df_nid)
df_categoricals = return_frequency_categoricals(df_nid)
names(df_categoricals) <- c('categorical', 'value', 'frequency' )
print(df_categoricals)
table(df_nid$class)
df_barplot <- as.data.frame(table(df_nid$class))
names(df_barplot) <- c('class', 'frequency')
df_barplot$class_normal <- ifelse(df_barplot$class == 'normal', 'Normal', 'Abnormal')
p <- ggbarplot(df_barplot, x = 'class', y= 'frequency' , ggtheme = theme_gray(base_size=20),
font.legend = c(20),
legend.title = "Networktraffic",
legend.position = "right",
x.text.angle = 90,
color='class_normal', palette = 'jco')
ggpar(p, main = "Frequencies of computer network types")
paste('Missing values:', sum(is.na(df_nid)))
The dataset consists of numerical and categorical variables. For the values of the categorical variables, binary features are constructed and the original categorical features are removed from the dataset. The same is applied to the class label.
The following variables caused singularities when applying LR and are therefore left out of the dataset:
df_nid$class_normal = ifelse(df_nid$class == 'normal', 1, 0)
df_nid$class <- NULL
df_nid <- dummy.data.frame(df_nid, names = c('protocol_type', 'service', 'flag'), sep = '.')
df_nid <- df_nid[, names(df_nid) %ni% c('protocol_type.udp', 'service.urp_i', 'service.Z39_50', 'flag.SH')]
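The `dummy.data.frame` call comes from the dummies package; base R's `model.matrix` produces a similar one-hot expansion. A small illustrative sketch (the column `protocol_type` and its values are taken from the article; the rest is just for demonstration):

```r
# One-hot encode a categorical column with base R: one binary column per level.
df <- data.frame(protocol_type = factor(c("tcp", "udp", "icmp")))
onehot <- model.matrix(~ protocol_type - 1, data = df)  # -1 drops the intercept
colnames(onehot)   # one indicator column per factor level
```

Each row of the resulting matrix contains exactly one 1, marking the level of that observation.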
datapoints <- !duplicated(df_nid)
df_nid <- df_nid[as.vector(datapoints),]
df_class_frequency <- as.data.frame(table(df_nid$class_normal))
names(df_class_frequency) <- c('class', 'frequency')
df_class_frequency
D <- return_X_y(df_nid)
X = D$X
y = D$y
return_summary(X)
sapply(X, var)
df_nid$num_outbound_cmds <- NULL
Here I consider the variables that are mutually correlated by more than $0.75$. Such correlations can negatively affect the coefficients in a regression model; Principal Component Analysis can be a good solution for this. Here, however, PCA is only used to present the computer network traffic in a 2-dimensional plane.
D <- return_X_y(df_nid)
highly_correlated <- return_indices_correlated_variables(D$X, 0.75)
variables <- names(df_nid[highly_correlated])
print(variables)
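The helper `return_indices_correlated_variables` is defined in the sourced functions file; as an assumption about what it does, one minimal way to find such indices is:

```r
# Flag columns whose absolute pairwise Pearson correlation exceeds a threshold
# (illustrative sketch; the article's helper may be implemented differently).
find_correlated <- function(X, threshold) {
  cm <- abs(cor(X))
  diag(cm) <- 0                          # ignore self-correlation
  which(apply(cm, 2, max) > threshold)   # indices of columns involved
}
X <- data.frame(a = 1:10,
                b = (1:10) * 2 + 0.01,               # perfectly correlated with a
                c = c(5, 1, 4, 2, 8, 3, 9, 2, 7, 1)) # essentially uncorrelated
find_correlated(X, 0.75)   # columns a and b
```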
return_plot_pca(df_nid)
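The helper `return_plot_pca` is also defined elsewhere; a minimal sketch of the underlying projection, assuming standard PCA on the scaled numeric features:

```r
# Project numeric features onto the first two principal components
# for a 2-D view (illustrative; the article's plotting helper may differ).
pca_2d <- function(X) {
  p <- prcomp(X, center = TRUE, scale. = TRUE)
  p$x[, 1:2]   # scores on PC1 and PC2
}
X <- data.frame(a = 1:10, b = (1:10)^2, c = log(1:10))
dim(pca_2d(X))   # 10 rows, 2 components
```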
The obtained dataset consists of continuous and discrete variables. The continuous variables need some attention.
The first thing that stands out is that the values of these variables are $\geq 0$. Furthermore, there are two variables related to a single connection, namely src_bytes (bytes from source to destination in a single connection) and dst_bytes (bytes from destination to source in a single connection). If src_bytes is plotted as a function of dst_bytes in the XY-plane, each point can be assigned a value such as its distance from the origin. So these two variables can be replaced by a single variable, bytes, with little loss of information for this purpose. I consider the variables whose values are much larger than $1$. It concerns the following variables:
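The replacement described above can be sketched as follows (the actual helper `return_bytes` may be implemented differently; the column names follow the article):

```r
# Replace src_bytes and dst_bytes by their Euclidean distance to the origin.
combine_bytes <- function(df) {
  df$bytes <- sqrt(df$src_bytes^2 + df$dst_bytes^2)
  df$src_bytes <- NULL
  df$dst_bytes <- NULL
  df
}
df <- data.frame(src_bytes = c(3, 0), dst_bytes = c(4, 0))
combine_bytes(df)$bytes   # 5, 0
```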
Three operations are now performed on these variables:
Transformations are performed on the dataset for visualization purposes.
df_nid_transformations <- return_log(df_nid, epsilon)
df_nid_transformations <- return_bytes(df_nid_transformations)
In this analysis the plots for the continuous variables 'count' and 'srv_count' are considered.
if (do_plot == TRUE){
    variables <- c('duration', 'bytes', 'num_root', 'num_compromised', 'count',
                   'hot', 'num_file_creations', 'srv_count', 'dst_host_count', 'dst_host_srv_count')
    df_plots <- df_nid_transformations[, names(df_nid_transformations)]
    df_plots$class_normal <- as.factor(ifelse(df_plots$class_normal == 0, 'Abnormal', 'Normal'))
    for (x in variables){
        I = which(x == variables)
        for (y in variables){
            J = which(y == variables)
            if (J > I){
                if ((x == 'count') & (y == 'srv_count')){
                    p <- ggscatter(df_plots, x = x, y = y,
                                   legend.title = "Networktraffic", legend.position = "top", color = 'class_normal',
                                   ggtheme = theme_gray(base_size=20), palette = "jco")
                    ggpar(p, main = paste(x, 'versus', y))
                    print(p)
                }
            }
        }
    }
}
Visualizations can help to spot possible outliers. In the figure above, probable outliers lie close to the origin; these will be removed from the dataset.
df_nid_transformations$outlier <-
ifelse((df_nid_transformations$count <= epsilon) & (df_nid_transformations$srv_count <= epsilon), 1, 0)
df_outliers <- as.data.frame(df_nid_transformations$outlier)
names(df_outliers) <- c('outlier')
df_nid <- cbind(df_nid, df_outliers)
df_nid <- df_nid[df_nid$outlier == 0,]
df_nid$outlier <- NULL
ind <- sample(2, nrow(df_nid), replace = TRUE, prob=c(0.7,0.3))
df_train <- df_nid[ind == 1,]
df_train_transformations <- return_log(df_train, epsilon)
df_train_transformations <- return_bytes(df_train_transformations)
model_lr <- glm(class_normal ~ ., data = df_train, family = binomial)
model_lr_transformations <- glm(class_normal ~ ., data = df_train_transformations, family = binomial)
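For logistic regression, `glm` needs `family = binomial`; without it, `glm` fits an ordinary linear model. A self-contained toy sketch of the pattern (the data and names here are illustrative, not the article's dataset):

```r
# Two-class logistic regression with glm: family = binomial gives the
# logistic link, and predict(..., type = "response") returns probabilities.
set.seed(1)
toy <- data.frame(x = c(rnorm(50, mean = 0), rnorm(50, mean = 3)),
                  label = rep(c(0, 1), each = 50))
m <- glm(label ~ x, data = toy, family = binomial)
probs <- predict(m, type = "response")
mean(ifelse(probs >= 0.5, 1, 0) == toy$label)   # training accuracy
```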
return_classification_report(model_lr_transformations, return_validation_dataset_transformed(), 'response')
return_classification_report(model_lr, return_validation_dataset(), 'response')
Transformations are performed on the dataset for some of the continuous variables under consideration. There are two advantages to doing so: the first is related to finding outliers in the dataset, and the second is observed from the classification reports, which show that the predictive power has improved significantly.
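The helper `return_classification_report` is defined in the sourced functions file; as an assumption about what such a report contains, a minimal sketch computing precision, recall, and F1 from predicted probabilities:

```r
# Build a small classification report from class probabilities
# (illustrative sketch; the article's helper may report different metrics).
classification_report <- function(actual, predicted_prob, cutoff = 0.5) {
  pred <- ifelse(predicted_prob >= cutoff, 1, 0)
  tp <- sum(pred == 1 & actual == 1)
  fp <- sum(pred == 1 & actual == 0)
  fn <- sum(pred == 0 & actual == 1)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  c(precision = precision, recall = recall,
    f1 = 2 * precision * recall / (precision + recall))
}
classification_report(c(1, 1, 0, 0), c(0.9, 0.4, 0.2, 0.8))
# precision 0.5, recall 0.5, f1 0.5
```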