Introduction

EdNet is a package for training deep neural network models with fully connected layers on a range of regression and classification tasks. It was built as a personal project to develop my coding skills and my understanding of neural networks and the algorithms used to train them. While there are many deep learning libraries that are faster and offer far greater functionality, this one is simple and easy to use in R. You may find it useful if you want a neural network package in R to play around with, to improve your understanding and hone your intuition about network architectures and hyperparameters.

Currently EdNet supports binary and multiclass classification, as well as regression for the following distributions: Normal, Poisson, Gamma and Tweedie. Drop-out and L1 and L2 regularisation are supported, as is training with weights and with offset models. Training can be run with or without mini-batches, using Gradient Descent, Momentum, RMSProp or Adam optimisation.

Installation

EdNet can be installed directly from GitHub using the devtools package:

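# if devtools is not already installed, run install.packages("devtools") first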
devtools::install_github("EdwinGraham/EdNet")

Binary classification example

EdNet comes with a number of dummy datasets for trying out different model architectures and hyperparameters to get a feel for them. For this example we load the dummy binary classification data into our environment.

library(EdNet)
data("dfBinary")
head(dfBinary)
##           x1         x2         x3 target
## 1  0.4092032  0.9873708 -0.1096844      0
## 2 -0.3230250 -1.2885071  1.6707485      1
## 3  0.6358523  2.6934964  0.7363947      1
## 4 -1.8461288  0.2513315 -0.6128222      1
## 5  0.9536474 -0.3293031  1.3687059      1
## 6  1.1884898  1.0886427 -0.7108311      1

There are 3 random normally distributed variables, x1, x2 and x3, and a binary target variable that depends in some probabilistic way on them. We will train a neural network with 5 hidden layers for 40 epochs, using the first 200 rows as holdout (dev) data.

modelFit <- EdNetTrain(X=as.matrix(dfBinary[, 1:3]),
                       Y=as.matrix(dfBinary[, 4]),
                       family="binary",
                       learning_rate=0.01,
                       num_epochs=40,
                       hidden_layer_dims=c(64, 64, 32, 32, 16),
                       hidden_layer_activations="relu",
                       optimiser="Adam",
                       mini_batch_size=32,
                       print_every_n=40L,
                       dev_set = 1:200)
## Cost after 0 epochs: train-2.075543, dev-1.962871
## Cost after 40 epochs: train-0.5137914, dev-0.6829946
## Best dev score achieved was 0.6292755 after 35 epochs.

The costs are calculated on the final mini-batch of each epoch, which explains the noisiness.

There may be a better set of hyperparameters; for example, we could add drop-out or regularisation.
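
A minimal sketch of what that might look like, reusing the keep_prob and lambda arguments that appear in the later examples (the values here are purely illustrative, not tuned):

modelFit2 <- EdNetTrain(X=as.matrix(dfBinary[, 1:3]),
                        Y=as.matrix(dfBinary[, 4]),
                        family="binary",
                        learning_rate=0.01,
                        num_epochs=40,
                        hidden_layer_dims=c(64, 64, 32, 32, 16),
                        hidden_layer_activations="relu",
                        optimiser="Adam",
                        lambda=0.001,   # regularisation strength (illustrative value)
                        keep_prob=0.8,  # drop-out keep probability (illustrative value)
                        mini_batch_size=32,
                        print_every_n=40L,
                        dev_set=1:200)

Now that we have built a model, we can use the generic predict function to score probabilities onto the dataset.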

dfBinary$prediction <- predict(modelFit, as.matrix(dfBinary[, 1:3]))
head(dfBinary)
##           x1         x2         x3 target prediction
## 1  0.4092032  0.9873708 -0.1096844      0 0.04878962
## 2 -0.3230250 -1.2885071  1.6707485      1 0.99316671
## 3  0.6358523  2.6934964  0.7363947      1 0.63875027
## 4 -1.8461288  0.2513315 -0.6128222      1 0.87940706
## 5  0.9536474 -0.3293031  1.3687059      1 0.40622267
## 6  1.1884898  1.0886427 -0.7108311      1 0.86937323
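
As a smoother check than the per-mini-batch cost printed during training, we can compute the cost over the whole dev set from these predictions. A minimal sketch, assuming the printed cost is the standard (unregularised) binary cross-entropy:

p <- pmin(pmax(dfBinary$prediction[1:200], 1e-15), 1-1e-15)  # clamp away from 0 and 1
y <- dfBinary$target[1:200]
# mean binary cross-entropy over the 200 hold-out rows
-mean(y*log(p) + (1-y)*log(1-p))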

Looking only at the rows we held out, we can calculate an accuracy metric.

devSetAccuracy <- mean((dfBinary$prediction[1:200]>.5)==dfBinary$target[1:200])
devSetAccuracy
## [1] 0.85

The model we’ve built achieves 85% accuracy on the hold-out data, so we are doing a lot better than random.

Multiclass classification with Iris data

EdNet also supports multiclass classification. The following example uses the famous Iris dataset (from the stats package) to train a neural network to predict iris species using sepal and petal lengths and widths.

First we bring the Iris data into the environment.

data("iris")
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa
# Plot sepal length v sepal width and colour by species
plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species, pch=16,
     xlab="Sepal Length", ylab="Sepal Width", main="Iris species by sepal length and width")
legend("topright",
       legend=c("setosa", "versicolor", "virginica"),
       col=1:3, pch=16)

Before fitting the model we need to prepare the data. For multiclass learning, EdNetTrain requires the target labels to be one-hot encoded. It’s also good practice to normalise input features before training.

X <- sapply(iris[, 1:4], normalise)
head(X)
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]   -0.8976739  1.01560199    -1.335752   -1.311052
## [2,]   -1.1392005 -0.13153881    -1.335752   -1.311052
## [3,]   -1.3807271  0.32731751    -1.392399   -1.311052
## [4,]   -1.5014904  0.09788935    -1.279104   -1.311052
## [5,]   -1.0184372  1.24503015    -1.335752   -1.311052
## [6,]   -0.5353840  1.93331463    -1.165809   -1.048667
Y <- onehotEncode(iris$Species)
head(Y)
##      setosa versicolor virginica
## [1,]      1          0         0
## [2,]      1          0         0
## [3,]      1          0         0
## [4,]      1          0         0
## [5,]      1          0         0
## [6,]      1          0         0
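
For reference, the same preprocessing can be done in base R. Judging by the output above, normalise centres each column and scales it to unit standard deviation, so the following sketch should give equivalent results (up to column names):

X2 <- scale(iris[, 1:4])                      # centre and scale each numeric column
Y2 <- model.matrix(~ Species - 1, data=iris)  # one-hot encode the factor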

We will hold out 20 rows at random (there are 150 altogether) to test that the model works as well as we expect.

set.seed(1984)
holdoutRows <- sample(1:150, 20)

We’re now ready to train a model. We will train a network with 3 fully connected hidden layers of dimensions 128, 64 and 32. Each of these layers will use a ReLU activation function and drop-out with a keep probability of 50%, and a small amount of regularisation is also applied (lambda = 0.001). We will train for 40 epochs with a mini-batch size of 16 using Adam optimisation.

modelFit <- EdNetTrain(X,
                       Y,
                       family="multiclass",
                       learning_rate=0.01,
                       num_epochs=40,
                       hidden_layer_dims=c(128, 64, 32),
                       hidden_layer_activations="relu",
                       optimiser="Adam",
                       lambda=0.001,
                       keep_prob=0.5,
                       mini_batch_size=16,
                       print_every_n=40L,
                       dev_set=holdoutRows)
## Cost after 0 epochs: train-2.607286, dev-2.721248
## Cost after 40 epochs: train-0.956158, dev-0.4660453
## Best dev score achieved was 0.3098807 after 30 epochs.

We can generate the predicted class probabilities by using the predict function.

classProbs <- predict(modelFit, X)
head(classProbs)
##           [,1]       [,2]       [,3]
## [1,] 0.9445057 0.03389728 0.02159702
## [2,] 0.9007412 0.06537104 0.03388779
## [3,] 0.9411574 0.03608371 0.02275888
## [4,] 0.9216281 0.04916146 0.02921045
## [5,] 0.9562985 0.02585889 0.01784260
## [6,] 0.9213687 0.04758339 0.03104795

Using the predictedClass function we can calculate the predicted class for each row (which is the class with the highest probability.)

iris$predictions <- predictedClass(classProbs, levels(iris$Species))
head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species predictions
## 1          5.1         3.5          1.4         0.2  setosa      setosa
## 2          4.9         3.0          1.4         0.2  setosa      setosa
## 3          4.7         3.2          1.3         0.2  setosa      setosa
## 4          4.6         3.1          1.5         0.2  setosa      setosa
## 5          5.0         3.6          1.4         0.2  setosa      setosa
## 6          5.4         3.9          1.7         0.4  setosa      setosa
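
Under the hood this is just an argmax over the columns; a base R equivalent would be:

# for each row, pick the column with the highest probability
manualPredictions <- levels(iris$Species)[max.col(classProbs, ties.method="first")]
head(manualPredictions)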

Plotting these predictions on the holdout only (it is easy to generate good predictions on the training data.)

dev <- iris[holdoutRows, ]

par(mfrow=c(1,2))
plot(dev$Sepal.Length, dev$Sepal.Width, col=dev$Species, pch=16,
     xlab="Sepal Length", ylab="Sepal Width", main="Actual")
plot(dev$Sepal.Length, dev$Sepal.Width, col=dev$predictions, pch=16,
     xlab="Sepal Length", ylab="Sepal Width", main="Predicted")
legend("topright", inset=0.07,
       legend=c("setosa", "versicolor", "virginica"),
       col=1:3, pch=16)
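
We can also compute the holdout accuracy directly:

mean(iris$predictions[holdoutRows] == iris$Species[holdoutRows])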

Predictions on the holdout data set have 100% accuracy!

Poisson regression example with offset model

For this example we use a dummy Poisson dataset with 3 features, an exposure column and a target column. The target column is integer count data and the exposure column represents the length of time that the Poisson process ran for. The underlying rate depends on the three feature columns in some way (that we want to model.)

data("dfPoisson")
head(dfPoisson)
##           x1         x2         x3 exposure target
## 1  0.4092032  0.9873708 -0.1096844       12      2
## 2 -0.3230250 -1.2885071  1.6707485       30     18
## 3  0.6358523  2.6934964  0.7363947       33      1
## 4 -1.8461288  0.2513315 -0.6128222       31     18
## 5  0.9536474 -0.3293031  1.3687059        9      2
## 6  1.1884898  1.0886427 -0.7108311       44     12

The target should depend on exposure in a well-understood linear way that we need to account for in building the model.

plot(dfPoisson$exposure, dfPoisson$target)

To take exposure into account we use an offset. We need to apply the log function to exposure because the offset is added before the final activation function (the exponential in this case): if z is the network output, the prediction becomes exp(z + log(exposure)) = exposure * exp(z), so the predicted count scales linearly with exposure.

modelFit <- EdNetTrain(X=as.matrix(dfPoisson[, 1:3]),
                       Y=as.matrix(dfPoisson[, 5]),
                       offset=log(as.matrix(dfPoisson[, 4])),
                       family="poisson",
                       learning_rate=0.01,
                       num_epochs=40,
                       hidden_layer_dims=c(128, 32, 8),
                       hidden_layer_activations="relu",
                       optimiser="Adam",
                       lambda = 0.004,
                       keep_prob=0.8,
                       mini_batch_size=8,
                       print_every_n=40L,
                       dev_set=1:200)
## Cost after 0 epochs: train-106.9975, dev-50.1871
## Cost after 40 epochs: train-5.787886, dev-2.501501
## Best dev score achieved was 2.216769 after 38 epochs.

Having built a model, we can use predict to put the predicted rate per unit of exposure onto the original data.

dfPoisson$predictedRate <- predict(modelFit, as.matrix(dfPoisson[, 1:3]))
dfPoisson$actualRate <- dfPoisson$target/dfPoisson$exposure
head(dfPoisson)
##           x1         x2         x3 exposure target predictedRate
## 1  0.4092032  0.9873708 -0.1096844       12      2     0.2685433
## 2 -0.3230250 -1.2885071  1.6707485       30     18     0.6815773
## 3  0.6358523  2.6934964  0.7363947       33      1     0.2545321
## 4 -1.8461288  0.2513315 -0.6128222       31     18     0.5573603
## 5  0.9536474 -0.3293031  1.3687059        9      2     0.4512897
## 6  1.1884898  1.0886427 -0.7108311       44     12     0.3327345
##   actualRate
## 1 0.16666667
## 2 0.60000000
## 3 0.03030303
## 4 0.58064516
## 5 0.22222222
## 6 0.27272727
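
Since predict returns a rate per unit of exposure here, the expected count for each row is simply the predicted rate multiplied by that row’s exposure:

# expected count = predicted rate * exposure
dfPoisson$predictedCount <- dfPoisson$predictedRate * dfPoisson$exposure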

Plotting the actual versus predicted rate on the hold-out data shows that the model is broadly picking up the correct trend.

plot(dfPoisson$predictedRate[1:200], dfPoisson$actualRate[1:200])
abline(0, 1)  # y = x reference line

Gamma distribution example and use of checkpoint model

This example uses dummy Gamma distributed data and makes use of the checkpoint feature to continue learning from an existing model. First we bring in the data.

data("dfGamma")
head(dfGamma)
##           x1         x2         x3    target
## 1  0.4092032  0.9873708 -0.1096844  6.242109
## 2 -0.3230250 -1.2885071  1.6707485 72.051513
## 3  0.6358523  2.6934964  0.7363947 16.776243
## 4 -1.8461288  0.2513315 -0.6128222 87.097421
## 5  0.9536474 -0.3293031  1.3687059 20.420865
## 6  1.1884898  1.0886427 -0.7108311 49.606084

Now we train a model using gradient descent for 15 epochs without using mini-batches.

modelFit <- EdNetTrain(X=as.matrix(dfGamma[, 1:3]),
                       Y=as.matrix(dfGamma[, 4]),
                       family="gamma",
                       learning_rate=0.002,
                       num_epochs=15,
                       hidden_layer_dims=c(128, 32, 8),
                       hidden_layer_activations="relu",
                       optimiser="GradientDescent",
                       print_every_n=15,
                       dev_set=1:200)
## Cost after 0 epochs: train-52.38222, dev-42.30829
## Cost after 15 epochs: train-2.823976, dev-2.409988
## Best dev score achieved was 2.409701 after 14 epochs.

The cost function on the training set decreases monotonically since we are not using drop-out or mini-batches. However, the learning seems to have stalled. We can use the first model as a checkpoint and continue training for another 25 epochs with Adam optimisation, mini-batch descent, drop-out and regularisation.

modelFit <- EdNetTrain(X=as.matrix(dfGamma[, 1:3]),
                       Y=as.matrix(dfGamma[, 4]),
                       family="gamma",
                       learning_rate=0.005,
                       num_epochs=25,
                       optimiser="Adam",
                       lambda=0.004,
                       keep_prob=0.8,
                       mini_batch_size=8,
                       print_every_n=25,
                       dev_set=1:200,
                       checkpoint=modelFit)
## Cost after 0 epochs: train-2.416403, dev-2.409988
## Cost after 25 epochs: train-0.9476705, dev-0.7255556
## Best dev score achieved was 0.7255556 after 25 epochs.

Changing the optimisation parameters has caused a significant improvement. We can use predict to put the predictions back on the original dataset.

dfGamma$predictions <- predict(modelFit, as.matrix(dfGamma[, 1:3]))
head(dfGamma)
##           x1         x2         x3    target predictions
## 1  0.4092032  0.9873708 -0.1096844  6.242109    29.02339
## 2 -0.3230250 -1.2885071  1.6707485 72.051513    64.97015
## 3  0.6358523  2.6934964  0.7363947 16.776243    33.14870
## 4 -1.8461288  0.2513315 -0.6128222 87.097421    63.74657
## 5  0.9536474 -0.3293031  1.3687059 20.420865    41.97481
## 6  1.1884898  1.0886427 -0.7108311 49.606084    37.77477

A plot on the hold-out data shows a reasonable correlation between actual values and predictions (the data is reasonably noisy.)

plot(dfGamma$predictions[1:200], dfGamma$target[1:200])
abline(0, 1)  # y = x reference line

Tweedie distribution example with weights

The final example uses dummy Tweedie distributed data (continuous with a positive mass at zero.) The target variable is drawn from a compound Poisson-Gamma distribution where the Poisson part of the distribution also has a variable exposure as in the Poisson example above. This is similar to an insurance claim cost example, where different policies may have different periods of exposure.
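
To make the data-generating process concrete, a single compound Poisson-Gamma observation can be simulated as below. This is only a sketch with illustrative parameter values, not the ones used to generate dfTweedie:

# simulate one compound Poisson-Gamma observation (illustrative parameters only)
simulateTweedie <- function(exposure, rate=0.2, shape=2, scale=10){
  n <- rpois(1, rate*exposure)              # number of claims over the exposure period
  sum(rgamma(n, shape=shape, scale=scale))  # total claim cost (zero if no claims)
}
set.seed(42)
simulateTweedie(exposure=12)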

data("dfTweedie")
head(dfTweedie)
##           x1         x2         x3 exposure     target
## 1  0.4092032  0.9873708 -0.1096844       12  10.560170
## 2 -0.3230250 -1.2885071  1.6707485       30 103.750467
## 3  0.6358523  2.6934964  0.7363947       33   7.509364
## 4 -1.8461288  0.2513315 -0.6128222       31 102.136073
## 5  0.9536474 -0.3293031  1.3687059        9  10.397050
## 6  1.1884898  1.0886427 -0.7108311       44  66.671503

We can see that in this example (unlike most insurance examples) only a small proportion of rows have a value of zero.

mean(dfTweedie$target==0)
## [1] 0.053

When modelling with the Tweedie distribution we need to specify a power parameter between 1 and 2 (this range corresponds to a compound Poisson-Gamma distribution). To adjust for exposure, we model the target per unit of exposure and use exposure as a weight.

X <- as.matrix(dfTweedie[, 1:3])
Y <- as.matrix(dfTweedie$target/dfTweedie$exposure)

modelFit <- EdNetTrain(X=X,
                       Y=Y,
                       weight=dfTweedie$exposure,
                       family="tweedie",
                       tweediePower=1.3,
                       learning_rate=0.001,
                       num_epochs=40,
                       hidden_layer_dims=c(128, 64, 32, 8),
                       hidden_layer_activations="relu",
                       optimiser="Adam",
                       lambda=0.01,
                       keep_prob=0.8,
                       mini_batch_size=8,
                       print_every_n=40,
                       dev_set=1:200)
## Cost after 0 epochs: train-3.011956, dev-1.148983
## Cost after 40 epochs: train-1.940185, dev-0.4530936
## Best dev score achieved was 0.4530936 after 40 epochs.

Using predict we can again put the model predictions back on the original data.

dfTweedie$predictedRate <- predict(modelFit, X)
dfTweedie$actualRate <- dfTweedie$target/dfTweedie$exposure
head(dfTweedie)
##           x1         x2         x3 exposure     target predictedRate
## 1  0.4092032  0.9873708 -0.1096844       12  10.560170      1.526473
## 2 -0.3230250 -1.2885071  1.6707485       30 103.750467      2.654835
## 3  0.6358523  2.6934964  0.7363947       33   7.509364      1.542838
## 4 -1.8461288  0.2513315 -0.6128222       31 102.136073      2.791624
## 5  0.9536474 -0.3293031  1.3687059        9  10.397050      1.875061
## 6  1.1884898  1.0886427 -0.7108311       44  66.671503      1.847696
##   actualRate
## 1  0.8800142
## 2  3.4583489
## 3  0.2275565
## 4  3.2947120
## 5  1.1552278
## 6  1.5152614

Plotting the actual versus predicted rate on the hold-out data:

plot(dfTweedie$predictedRate[1:200], dfTweedie$actualRate[1:200])
abline(0, 1)  # y = x reference line