Smart Transformations in a Network Intrusion Detection System

Summary

In this article I present my analysis of a network intrusion detection system for computer networks. The goal of the supervised learning experiment is to detect intrusions between normal and abnormal traffic in a computer network. I decided to apply the logistic regression algorithm for two class problems.
The analysis are based on a study, which is mentioned in de references. The train dataset as well as the validation dataset is obtained from the Kaggle Community.
After analyzing the data I observed for some features exceptional high numerical values and very low values. To overcome difficulties I applied logarithmic transformation on the relevant variables followed by a translation to the correct quadrant.
The LR algortithm is applied on the dataset with and without the applied operations and tested against an available validation dataset. For both situations, a classification report is created.

Read Dataset

The number of datapoints in the dataset used is $125945$. A few services are removed from the dataset because they were not available in the validation dataset.

In [1]:
source("C:\\Anaconda\\R\\Blog - Intrusion Detection System Functions.r")
set.seed(777)
options(repr.matrix.max.rows = 600, repr.matrix.max.cols=50, scipen=999, repr.plot.width=15, repr.plot.height=15, warn = -1)
do_plot <- TRUE
Warning message:
"package 'ggplot2' was built under R version 3.6.3"
Warning message:
"package 'caret' was built under R version 3.6.3"
Loading required package: lattice

Welcome! Related Books: `Practical Guide To Cluster Analysis in R` at https://goo.gl/13EFCZ

dummies-1.5.6 provided by Decision Patterns



Attaching package: 'kernlab'


The following object is masked from 'package:ggplot2':

    alpha


Warning message:
"package 'ggpubr' was built under R version 3.6.3"
Loading required package: magrittr

Warning message:
"package 'PCAmixdata' was built under R version 3.6.3"
In [2]:
df_nid <- read_data("NID_network_intrusion_detection_dataset.csv", azure_ml)
df_nid <- df_nid[df_nid$service %ni% c('http_8001', 'http_2784', 'aol', 'harvest', 'red_i', 'urh_i', 'tftp_u'), ]
df_nid$diff_level <- NULL
df_nid$is_host_login <- NULL

Stucture of the Dataset

The dataset consists of numerical and categorical variables and there are no missing values. The class variable reprents the attack types and its values [1]. The values of the categorical variables are many.
Below are successively studied:

  1. Number of Datapoints and the Variable Names and their Types
  2. Exploration of the Categorical Variables
  3. Exploration of the Class Variable
  4. Missing Values

Number of Datapoints en the Variable Names and their Types

In [3]:
str(df_nid)
'data.frame':	125945 obs. of  41 variables:
 $ duration                   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ protocol_type              : Factor w/ 3 levels "icmp","tcp","udp": 2 3 2 2 2 2 2 2 2 2 ...
 $ service                    : Factor w/ 70 levels "aol","auth","bgp",..: 18 43 48 22 22 48 48 48 50 48 ...
 $ flag                       : Factor w/ 11 levels "OTH","REJ","RSTO",..: 10 10 6 10 10 2 6 6 6 6 ...
 $ src_bytes                  : int  491 146 0 232 199 0 0 0 0 0 ...
 $ dst_bytes                  : int  0 0 0 8153 420 0 0 0 0 0 ...
 $ land                       : int  0 0 0 0 0 0 0 0 0 0 ...
 $ wrong_fragment             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ urgent                     : int  0 0 0 0 0 0 0 0 0 0 ...
 $ hot                        : int  0 0 0 0 0 0 0 0 0 0 ...
 $ num_failed_logins          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ logged_in                  : int  0 0 0 1 1 0 0 0 0 0 ...
 $ num_compromised            : int  0 0 0 0 0 0 0 0 0 0 ...
 $ root_shell                 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ su_attempted               : int  0 0 0 0 0 0 0 0 0 0 ...
 $ num_root                   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ num_file_creations         : int  0 0 0 0 0 0 0 0 0 0 ...
 $ num_shells                 : int  0 0 0 0 0 0 0 0 0 0 ...
 $ num_access_files           : int  0 0 0 0 0 0 0 0 0 0 ...
 $ num_outbound_cmds          : int  0 0 0 0 0 0 0 0 0 0 ...
 $ is_guest_login             : int  0 0 0 0 0 0 0 0 0 0 ...
 $ count                      : int  2 13 123 5 30 121 166 117 270 133 ...
 $ srv_count                  : int  2 1 6 5 32 19 9 16 23 8 ...
 $ serror_rate                : num  0 0 1 0.2 0 0 1 1 1 1 ...
 $ srv_serror_rate            : num  0 0 1 0.2 0 0 1 1 1 1 ...
 $ rerror_rate                : num  0 0 0 0 0 1 0 0 0 0 ...
 $ srv_rerror_rate            : num  0 0 0 0 0 1 0 0 0 0 ...
 $ same_srv_rate              : num  1 0.08 0.05 1 1 0.16 0.05 0.14 0.09 0.06 ...
 $ diff_srv_rate              : num  0 0.15 0.07 0 0 0.06 0.06 0.06 0.05 0.06 ...
 $ srv_diff_host_rate         : num  0 0 0 0 0.09 0 0 0 0 0 ...
 $ dst_host_count             : int  150 255 255 30 255 255 255 255 255 255 ...
 $ dst_host_srv_count         : int  25 1 26 255 255 19 9 15 23 13 ...
 $ dst_host_same_srv_rate     : num  0.17 0 0.1 1 1 0.07 0.04 0.06 0.09 0.05 ...
 $ dst_host_diff_srv_rate     : num  0.03 0.6 0.05 0 0 0.07 0.05 0.07 0.05 0.06 ...
 $ dst_host_same_src_port_rate: num  0.17 0.88 0 0.03 0 0 0 0 0 0 ...
 $ dst_host_srv_diff_host_rate: num  0 0 0 0.04 0 0 0 0 0 0 ...
 $ dst_host_serror_rate       : num  0 0 1 0.03 0 0 1 1 1 1 ...
 $ dst_host_srv_serror_rate   : num  0 0 1 0.01 0 0 1 1 1 1 ...
 $ dst_host_rerror_rate       : num  0.05 0 0 0 0 1 0 0 0 0 ...
 $ dst_host_srv_rerror_rate   : num  0 0 0 0.01 0 1 0 0 0 0 ...
 $ class                      : Factor w/ 23 levels "back","buffer_overflow",..: 12 12 10 12 12 10 10 10 10 10 ...

Exploration of the Categorical Variables

In [4]:
df_categoricals = return_frequency_categoricals(df_nid)
names(df_categoricals) <- c('categorical', 'value', 'frequency' )
print(df_categoricals)
     categorical       value frequency
1  protocol_type        icmp      8273
2  protocol_type         tcp    102682
3  protocol_type         udp     14990
4        service         aol         0
5        service        auth       955
6        service         bgp       710
7        service     courier       734
8        service    csnet_ns       545
9        service         ctf       563
10       service     daytime       521
11       service     discard       538
12       service      domain       569
13       service    domain_u      9043
14       service        echo       434
15       service       eco_i      4586
16       service       ecr_i      3077
17       service         efs       485
18       service        exec       474
19       service      finger      1767
20       service         ftp      1754
21       service    ftp_data      6860
22       service      gopher       518
23       service     harvest         0
24       service   hostnames       460
25       service        http     40338
26       service   http_2784         0
27       service    http_443       530
28       service   http_8001         0
29       service       imap4       647
30       service         IRC       187
31       service    iso_tsap       687
32       service      klogin       433
33       service      kshell       299
34       service        ldap       410
35       service        link       475
36       service       login       429
37       service         mtp       439
38       service        name       451
39       service netbios_dgm       405
40       service  netbios_ns       347
41       service netbios_ssn       362
42       service     netstat       360
43       service        nnsp       630
44       service        nntp       296
45       service       ntp_u       168
46       service       other      4359
47       service     pm_dump         5
48       service       pop_2        78
49       service       pop_3       264
50       service     printer        69
51       service     private     21853
52       service       red_i         0
53       service  remote_job        78
54       service         rje        86
55       service       shell        65
56       service        smtp      7313
57       service     sql_net       245
58       service         ssh       311
59       service      sunrpc       381
60       service      supdup       544
61       service      systat       477
62       service      telnet      2353
63       service      tftp_u         0
64       service       tim_i         8
65       service        time       654
66       service       urh_i         0
67       service       urp_i       602
68       service        uucp       780
69       service   uucp_path       689
70       service       vmnet       617
71       service       whois       693
72       service         X11        73
73       service      Z39_50       862
74          flag         OTH        46
75          flag         REJ     11228
76          flag        RSTO      1562
77          flag      RSTOS0       103
78          flag        RSTR      2421
79          flag          S0     34849
80          flag          S1       365
81          flag          S2       127
82          flag          S3        49
83          flag          SF     74924
84          flag          SH       271

Exploration of the Class Variable

In [5]:
table(df_nid$class)
           back buffer_overflow       ftp_write    guess_passwd            imap 
            956              30               8              53              11 
        ipsweep            land      loadmodule        multihop         neptune 
           3599              18               9               7           41214 
           nmap          normal            perl             phf             pod 
           1493           67322               3               4             201 
      portsweep         rootkit           satan           smurf             spy 
           2931              10            3626            2646               2 
       teardrop     warezclient     warezmaster 
            892             890              20 
In [6]:
df_barplot <- as.data.frame(table(df_nid$class))
names(df_barplot) <- c('class', 'frequency')
df_barplot$class_normal <- ifelse(df_barplot$class == 'normal', 'Normal', 'Abnormal')
p <- ggbarplot(df_barplot, x = 'class', y= 'frequency' , ggtheme = theme_gray(base_size=20),
               font.legend = c(20),
               legend.title = "Networktraffic", 
               legend.position = "right", 
               x.text.angle = 90,
               color='class_normal', palette = 'jco') 
ggpar(p, main = "Frequencies of computer network types")

Missing Values

In [7]:
paste('Missing values:', sum(is.na(df_nid)))
'Missing values: 0'

Construction of the Numerical Dataset

The dataset consists of numerical and categorical variables. For the values of the categorical variables binary features are constructed and the original categorical featues are removed from the dataset. The same is applied to the class label.
The following variables caused singularities when applying LR and, therefore left outside the dataset. It concerns then the following variables:

  1. protocol_type.udp
  2. service.urp_i
  3. service.Z39_50
  4. flag.SH

In [8]:
df_nid$class_normal = ifelse(df_nid$class == 'normal', 1, 0)
df_nid$class <- NULL
df_nid <- dummy.data.frame(df_nid, names = c('protocol_type', 'service', 'flag'), sep = '.')
df_nid <- df_nid[, names(df_nid) %ni% c('protocol_type.udp', 'service.urp_i', 'service.Z39_50', 'flag.SH')]
datapoints <- !duplicated(df_nid)
df_nid <- df_nid[as.vector(datapoints),]

Frequency Table for the new Class Label

In [9]:
df_class_frequency <- as.data.frame(table(df_nid$class))
names(df_class_frequency) <- c('class', 'frequency')
df_class_frequency
classfrequency
0 58614
1 67322

Calculate Statistics Numerical Variables

In [10]:
D <- return_X_y(df_nid)
X = D$X
y = D$y

Summary

In [11]:
return_summary(X)
minimum1st Qu.mediaangemiddelde3rd Qu.maximum
duration0 0.00 0.00 287.22901314954 0.00 42908
protocol_type.icmp0 0.00 0.00 0.06562063270 0.00 1
protocol_type.tcp0 1.00 1.00 0.81535065430 1.00 1
service.auth0 0.00 0.00 0.00758321687 0.00 1
service.bgp0 0.00 0.00 0.00563778427 0.00 1
service.courier0 0.00 0.00 0.00582835726 0.00 1
service.csnet_ns0 0.00 0.00 0.00432759497 0.00 1
service.ctf0 0.00 0.00 0.00447052471 0.00 1
service.daytime0 0.00 0.00 0.00413702198 0.00 1
service.discard0 0.00 0.00 0.00427201118 0.00 1
service.domain0 0.00 0.00 0.00451816796 0.00 1
service.domain_u0 0.00 0.00 0.07180631432 0.00 1
service.echo0 0.00 0.00 0.00344619489 0.00 1
service.eco_i0 0.00 0.00 0.03638355990 0.00 1
service.ecr_i0 0.00 0.00 0.02439334265 0.00 1
service.efs0 0.00 0.00 0.00385116250 0.00 1
service.exec0 0.00 0.00 0.00376381654 0.00 1
service.finger0 0.00 0.00 0.01403093635 0.00 1
service.ftp0 0.00 0.00 0.01392770931 0.00 1
service.ftp_data0 0.00 0.00 0.05447211282 0.00 1
service.gopher0 0.00 0.00 0.00411320036 0.00 1
service.hostnames0 0.00 0.00 0.00365264896 0.00 1
service.http0 0.00 0.00 0.32030555203 1.00 1
service.http_4430 0.00 0.00 0.00420848685 0.00 1
service.imap40 0.00 0.00 0.00513753017 0.00 1
service.IRC0 0.00 0.00 0.00148488121 0.00 1
service.iso_tsap0 0.00 0.00 0.00545515182 0.00 1
service.klogin0 0.00 0.00 0.00343825435 0.00 1
service.kshell0 0.00 0.00 0.00237422183 0.00 1
service.ldap0 0.00 0.00 0.00325562190 0.00 1
service.link0 0.00 0.00 0.00377175708 0.00 1
service.login0 0.00 0.00 0.00340649219 0.00 1
service.mtp0 0.00 0.00 0.00348589760 0.00 1
service.name0 0.00 0.00 0.00358118409 0.00 1
service.netbios_dgm0 0.00 0.00 0.00321591920 0.00 1
service.netbios_ns0 0.00 0.00 0.00275536781 0.00 1
service.netbios_ssn0 0.00 0.00 0.00287447592 0.00 1
service.netstat0 0.00 0.00 0.00285859484 0.00 1
service.nnsp0 0.00 0.00 0.00500254097 0.00 1
service.nntp0 0.00 0.00 0.00235040020 0.00 1
service.ntp_u0 0.00 0.00 0.00133401093 0.00 1
service.other0 0.00 0.00 0.03461281921 0.00 1
service.pm_dump0 0.00 0.00 0.00003970271 0.00 1
service.pop_20 0.00 0.00 0.00061936222 0.00 1
service.pop_30 0.00 0.00 0.00209630288 0.00 1
service.printer0 0.00 0.00 0.00054789734 0.00 1
service.private0 0.00 0.00 0.17352464744 0.00 1
service.remote_job0 0.00 0.00 0.00061936222 0.00 1
service.rje0 0.00 0.00 0.00068288655 0.00 1
service.shell0 0.00 0.00 0.00051613518 0.00 1
service.smtp0 0.00 0.00 0.05806917800 0.00 1
service.sql_net0 0.00 0.00 0.00194543260 0.00 1
service.ssh0 0.00 0.00 0.00246950832 0.00 1
service.sunrpc0 0.00 0.00 0.00302534621 0.00 1
service.supdup0 0.00 0.00 0.00431965443 0.00 1
service.systat0 0.00 0.00 0.00378763817 0.00 1
service.telnet0 0.00 0.00 0.01868409351 0.00 1
service.tim_i0 0.00 0.00 0.00006352433 0.00 1
service.time0 0.00 0.00 0.00519311396 0.00 1
service.uucp0 0.00 0.00 0.00619362216 0.00 1
service.uucp_path0 0.00 0.00 0.00547103291 0.00 1
service.vmnet0 0.00 0.00 0.00489931394 0.00 1
service.whois0 0.00 0.00 0.00550279507 0.00 1
service.X110 0.00 0.00 0.00057965951 0.00 1
flag.OTH0 0.00 0.00 0.00036526490 0.00 1
flag.REJ0 0.00 0.00 0.08915639690 0.00 1
flag.RSTO0 0.00 0.00 0.01240312540 0.00 1
flag.RSTOS00 0.00 0.00 0.00081787575 0.00 1
flag.RSTR0 0.00 0.00 0.01922405031 0.00 1
flag.S00 0.00 0.00 0.27671992123 1.00 1
flag.S10 0.00 0.00 0.00289829755 0.00 1
flag.S20 0.00 0.00 0.00100844874 0.00 1
flag.S30 0.00 0.00 0.00038908652 0.00 1
flag.SF0 0.00 1.00 0.59486564604 1.00 1
src_bytes0 0.00 44.00 45580.12090268073276.00 1379963888
dst_bytes0 0.00 0.00 19784.92552566383516.00 1309937401
land0 0.00 0.00 0.00019851353 0.00 1
wrong_fragment0 0.00 0.00 0.02269406683 0.00 3
urgent0 0.00 0.00 0.00011116758 0.00 3
hot0 0.00 0.00 0.20446893660 0.00 77
num_failed_logins0 0.00 0.00 0.00122284335 0.00 5
logged_in0 0.00 0.00 0.39585186126 1.00 1
num_compromised0 0.00 0.00 0.27933235929 0.00 7479
root_shell0 0.00 0.00 0.00134195147 0.00 1
su_attempted0 0.00 0.00 0.00110373523 0.00 2
num_root0 0.00 0.00 0.30228052344 0.00 7468
num_file_creations0 0.00 0.00 0.01267310380 0.00 43
num_shells0 0.00 0.00 0.00041290814 0.00 2
num_access_files0 0.00 0.00 0.00409731927 0.00 9
num_outbound_cmds0 0.00 0.00 0.00000000000 0.00 0
is_guest_login0 0.00 0.00 0.00942542244 0.00 1
count0 2.00 14.00 84.10231387371143.00 511
srv_count0 2.00 8.00 27.74502922119 18.00 511
serror_rate0 0.00 0.00 0.28456255558 1.00 1
srv_serror_rate0 0.00 0.00 0.28255248698 1.00 1
rerror_rate0 0.00 0.00 0.11994409859 0.00 1
srv_rerror_rate0 0.00 0.00 0.12117917037 0.00 1
same_srv_rate0 0.09 1.00 0.66091570321 1.00 1
diff_srv_rate0 0.00 0.00 0.06299707788 0.06 1
srv_diff_host_rate0 0.00 0.00 0.09732641977 0.00 1
dst_host_count0 82.00 255.00 182.16675930631255.00 255
dst_host_srv_count0 10.00 63.00 115.68368854021255.00 255
dst_host_same_srv_rate0 0.05 0.51 0.52130534557 1.00 1
dst_host_diff_srv_rate0 0.00 0.02 0.08290925549 0.07 1
dst_host_same_src_port_rate0 0.00 0.00 0.14833296277 0.06 1
dst_host_srv_diff_host_rate0 0.00 0.00 0.03255161352 0.02 1
dst_host_serror_rate0 0.00 0.00 0.28452912591 1.00 1
dst_host_srv_serror_rate0 0.00 0.00 0.27855045420 1.00 1
dst_host_rerror_rate0 0.00 0.00 0.11881177741 0.00 1
dst_host_srv_rerror_rate0 0.00 0.00 0.12023551645 0.00 1

Variance

In [12]:
sapply(X, var)
duration
6785468.77590846
protocol_type.icmp
0.061315052140771
protocol_type.tcp
0.150555160321643
service.auth
0.00752577145263177
service.bgp
0.00560604417491026
service.courier
0.0057944335234093
service.csnet_ns
0.00430890110566534
service.ctf
0.00445057445974294
service.daytime
0.00411993974311131
service.discard
0.0042537948781912
service.domain
0.00449778983151396
service.domain_u
0.066650696784999
service.echo
0.00343434590397104
service.eco_i
0.0350600748681655
service.ecr_i
0.0237984964575255
service.efs
0.00383636150545766
service.exec
0.00374968000126428
service.finger
0.0138341790246713
service.ftp
0.0137338372800697
service.ftp_data
0.0515053107242642
service.gopher
0.00409631446552371
service.hostnames
0.00363933601839286
service.http
0.217711634115679
service.http_443
0.00419080876618299
service.imap4
0.00511117654327538
service.IRC
0.00148268811064297
service.iso_tsap
0.00542543622263467
service.klogin
0.00342645996637819
service.kshell
0.00236860370567149
service.ldap
0.00324504859665456
service.link
0.00375756076853609
service.login
0.00339491495495286
service.mtp
0.00347377370035443
service.name
0.00356838754892467
service.netbios_dgm
0.00320560251498943
service.netbios_ns
0.00274779757312394
service.netbios_ssn
0.00286623607190543
service.netstat
0.00285044591143799
service.nnsp
0.0049775550814856
service.nntp
0.00234489444189327
service.ntp_u
0.00133224191975544
service.other
0.0334150372895912
service.pm_dump
0.0000397014450825371
service.pop_2
0.000618983521238521
service.pop_3
0.00209192500923987
service.printer
0.000547601501435001
service.private
0.143414982963389
service.remote_job
0.000618983521238521
service.rje
0.000682425630341708
service.shell
0.000515872880559929
service.smtp
0.0546975828905491
service.sql_net
0.00194166331053982
service.ssh
0.00246342941129913
service.sunrpc
0.00301621743832061
service.supdup
0.00430102916577092
service.systat
0.00377332192476354
service.telnet
0.0183351437485696
service.tim_i
0.0000635207988673638
service.time
0.00516618655233596
service.uucp
0.00615531007835246
service.uucp_path
0.00544114391017619
service.vmnet
0.00487534937309504
service.whois
0.00547255777199457
service.X11
0.000579328104623812
flag.OTH
0.000365134377375337
flag.REJ
0.0812081786287106
flag.RSTO
0.0122493851441604
flag.RSTOS0
0.000817213314790325
flag.RSTR
0.0188546359169129
flag.S0
0.200147595704622
flag.S1
0.00288992036681617
flag.S2
0.00100743976662997
flag.S3
0.000388938220197033
flag.SF
0.241002422890627
src_bytes
34470912236617.3
dst_bytes
16175356435419.3
land
0.000198475699064858
wrong_fragment
0.0642961873744926
urgent
0.00020644335295128
hot
4.62371004790153
num_failed_logins
0.00204718054650616
logged_in
0.239155064218237
num_compromised
573.389777633419
root_shell
0.00134016127527699
su_attempted
0.00203951705882358
num_root
595.516248739695
num_file_creations
0.234261910718523
num_shells
0.000492146970870063
num_access_files
0.00987720477432791
num_outbound_cmds
0
is_guest_login
0.00933665798679895
count
13104.5171770019
srv_count
5277.32608796834
serror_rate
0.199359922985565
srv_serror_rate
0.199857466177481
rerror_rate
0.102672365954294
srv_rerror_rate
0.104743908311428
same_srv_rate
0.193267768301944
diff_srv_rate
0.0324592558297904
srv_diff_host_rate
0.0675097550763183
dst_host_count
9841.18998600713
dst_host_srv_count
12255.3716340414
dst_host_same_srv_rate
0.201553273627199
dst_host_diff_srv_rate
0.0356531155514495
dst_host_same_src_port_rate
0.0954524256338979
dst_host_srv_diff_host_rate
0.0126740371184511
dst_host_serror_rate
0.197870132029747
dst_host_srv_serror_rate
0.198649497640958
dst_host_rerror_rate
0.0939715900613152
dst_host_srv_rerror_rate
0.102049882991877

Remove Variable for Variance 0

In [13]:
df_nid$num_outbound_cmds <- NULL

Correlations

Here I consider the variables that mutually correlate more than $0.75$. Such correlations can negatively affect the coefficients in a regression model. Principal Component Analysis can be a good solution for this. This method will only be used here to present the computer network traffic in a 2-dimensional plane.

In [14]:
D <- return_X_y(df_nid)
highly_correlated <- return_indices_correlated_variables(D$X, 0.75)
variables <- names(df_nid[highly_correlated])
print(variables)
 [1] "num_root"                 "is_guest_login"          
 [3] "srv_serror_rate"          "rerror_rate"             
 [5] "srv_rerror_rate"          "same_srv_rate"           
 [7] "dst_host_same_srv_rate"   "dst_host_srv_serror_rate"
 [9] "dst_host_rerror_rate"     "dst_host_srv_rerror_rate"
[11] "flag.S0"                  "flag.SF"                 
[13] "serror_rate"             

Visualization of Computer Network Traffic using Principal Component Analysis

In [15]:
return_plot_pca(df_nid)

Transformations

The obtained dataset consists of continuous and discrete variables. The continuous variables need some attention.
The first thing that pops up is that the values of these variables $\geq 0$. Furthermore, there are 2 variables related to a single connection, namely the src_bytes (bytes from source to destination in single connection) and dst_bytes (bytes from destination to source in single connection). If the src_bytes are displayed as a function of dst_bytes in the XY-plane, each point can be assigned to a value that is, for example, the distance from the origin to this point. So these variables can be replaced by 1 variable bytes without losing information. I consider the variables which are very much bigger than $1$. It concerns the following variables:

  1. duration
  2. bytes
  3. num_root
  4. num_compromised
  5. count
  6. hot
  7. num_file_creations
  8. srv_count
  9. dst_host_count
  10. dst_host_srv_count

3 operations are now being performed on these variables:

  1. a small value is added to the variable where its value equals zero.
  2. the logarithm function is computed for the variable.
  3. a translation of the variable is performed such that the values are positioned in the first quadrant again.

Transformations of the Continuous Variables under Consideration

Transformations are performed on the dataset for visualization purposes.

In [16]:
df_nid_transformations <- return_log(df_nid, epsilon)
df_nid_transformations <- return_bytes(df_nid_transformations)

Visualizations

In this analysis the plots for the continuous variables 'count' and 'srv_count' are considered.

In [17]:
if (do_plot == TRUE){
    variables <- c('duration', 'bytes', 'num_root', 'num_compromised', 'count',
                'hot', 'num_file_creations', 'srv_count', 'dst_host_count', 'dst_host_srv_count')
    df_plots <- df_nid_transformations[, names(df_nid_transformations)] 
    df_plots$class_normal <- as.factor(ifelse(df_plots$class_normal == 0, 'Abnormal', 'Normal'))
    for (x in variables){
        I = which(x == variables)
        for (y in variables){
            J = which(y == variables)
            if (J > I){
                if ((x == 'count') & (y == 'srv_count')){
                    p <- ggscatter(df_plots, x = x, y = y,  
                         legend.title = "Nerworktraffic", legend.position = "top", color='class_normal',
                         ggtheme = theme_gray(base_size=20), palette = "jco")
                    ggpar(p, main = paste(x,'versus',y))
                    print(p)
                }
            }
        }
    }
}

Outlier Detection

For the possible outliers at present, visualizations can be applied. In the above figure, probably outliers are close to the origin and will be removed from the dataset.

In [18]:
df_nid_transformations$outlier <-
    ifelse((df_nid_transformations$count <= epsilon) & (df_nid_transformations$srv_count <= epsilon), 1, 0) 
df_outliers <- as.data.frame(df_nid_transformations$outlier)
names(df_outliers) <- c('outlier')

Remove Outliers from the Dataset

In [19]:
df_nid <- cbind(df_nid, df_outliers)
df_nid <- df_nid[df_nid$outlier == 0,]
df_nid$outlier <- NULL

Create the Train Datasets

In [20]:
ind <- sample(2, nrow(df_nid), replace = TRUE, prob=c(0.7,0.3))
df_train <- df_nid[ind == 1,]
In [21]:
df_train_transformations <- return_log(df_train, epsilon)
df_train_transformations <- return_bytes(df_train_transformations)

Machine Learning with LR Model

In [22]:
model_lr <- glm(df_train$class_normal~., data = df_train)
model_lr_transformations <- glm(df_train_transformations$class_normal~., data = df_train_transformations)

Classification Report 1: Transformations applied

In [23]:
return_classification_report(model_lr_transformations, return_validation_dataset_transformed(), 'response')
Confusion Matrix and Statistics

   
        0     1
  0 10227  1223
  1  2603  8488
                                               
               Accuracy : 0.8303               
                 95% CI : (0.8253, 0.8351)     
    No Information Rate : 0.5692               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.6598               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022
                                               
            Sensitivity : 0.7971               
            Specificity : 0.8741               
         Pos Pred Value : 0.8932               
         Neg Pred Value : 0.7653               
             Prevalence : 0.5692               
         Detection Rate : 0.4537               
   Detection Prevalence : 0.5080               
      Balanced Accuracy : 0.8356               
                                               
       'Positive' Class : 0                    
                                               

Classification Report 2: No Transformations applied

In [24]:
return_classification_report(model_lr, return_validation_dataset(), 'response')
Confusion Matrix and Statistics

   
       0    1
  0 8118  658
  1 4712 9053
                                               
               Accuracy : 0.7618               
                 95% CI : (0.7562, 0.7673)     
    No Information Rate : 0.5692               
    P-Value [Acc > NIR] : < 0.00000000000000022
                                               
                  Kappa : 0.5377               
                                               
 Mcnemar's Test P-Value : < 0.00000000000000022
                                               
            Sensitivity : 0.6327               
            Specificity : 0.9322               
         Pos Pred Value : 0.9250               
         Neg Pred Value : 0.6577               
             Prevalence : 0.5692               
         Detection Rate : 0.3601               
   Detection Prevalence : 0.3893               
      Balanced Accuracy : 0.7825               
                                               
       'Positive' Class : 0                    
                                               

Conclusion

Transformations are performed on the dataset for some continuous variables under consideration. There are two advantages to do so. The first one is related to finding outliers in the dataset en the second one is observed from the classification reports, which shows that the predictive power has improved significantly.