EdNet is a package for training deep neural network models with fully connected layers on a number of different regression and classification tasks. This package was built as a personal project to develop my coding skills and my understanding of neural networks and the algorithms used to train them. While there are many deep learning libraries out there that are faster and contain far greater functionality, this one is simple and easy to use in R. You may find it useful if you want a neural network package in R to play around with in order to improve your understanding and hone your intuition on network architectures and hyperparameters.
Currently EdNet supports binary and multiclass classification as well as regression for the following distributions: Normal, Poisson, Gamma, Tweedie. Drop-out and L1 and L2 regularisation are also supported, as well as training with weights and offset models. Training can be with or without mini-batch learning and with Gradient Descent, Momentum, RMSProp, or Adam optimisation.
EdNet can be installed directly from GitHub using the devtools
package. The code is below.
devtools::install_github("EdwinGraham/EdNet")
EdNet comes with a number of dummy datasets with which to try out different model architectures and parameters to begin to get a feel for them. For this example we will load in the dummy binary data into our environment.
library(EdNet)
data("dfBinary")
head(dfBinary)
## x1 x2 x3 target
## 1 0.4092032 0.9873708 -0.1096844 0
## 2 -0.3230250 -1.2885071 1.6707485 1
## 3 0.6358523 2.6934964 0.7363947 1
## 4 -1.8461288 0.2513315 -0.6128222 1
## 5 0.9536474 -0.3293031 1.3687059 1
## 6 1.1884898 1.0886427 -0.7108311 1
There are 3 random normally distributed variables, x1, x2 and x3 and a binary target variable that depends in some probabilistic way on the other variables. We will train a neural network with 5 hidden layers for 10 epochs, using out the first 200 rows as holdout data.
modelFit <- EdNetTrain(X=as.matrix(dfBinary[, 1:3]),
Y=as.matrix(dfBinary[, 4]),
family="binary",
learning_rate=0.01,
num_epochs=40,
hidden_layer_dims=c(64, 64, 32, 32, 16),
hidden_layer_activations="relu",
optimiser="Adam",
mini_batch_size=32,
print_every_n=40L,
dev_set = 1:200)
## Cost after 0 epochs: train-2.075543, dev-1.962871
## Cost after 40 epochs: train-0.5137914, dev-0.6829946
## Best dev score achieved was 0.6292755 after 35 epochs.
The costs are calculated on the final mini-batch of an epoch which explains the noisiness.
There may be a better possible set of parameters, for example we could add drop-out or regularisation. Now we have build a model, We can use the generic predict
function to score probabilities onto the dataset.
dfBinary$prediction <- predict(modelFit, as.matrix(dfBinary[, 1:3]))
head(dfBinary)
## x1 x2 x3 target prediction
## 1 0.4092032 0.9873708 -0.1096844 0 0.04878962
## 2 -0.3230250 -1.2885071 1.6707485 1 0.99316671
## 3 0.6358523 2.6934964 0.7363947 1 0.63875027
## 4 -1.8461288 0.2513315 -0.6128222 1 0.87940706
## 5 0.9536474 -0.3293031 1.3687059 1 0.40622267
## 6 1.1884898 1.0886427 -0.7108311 1 0.86937323
Looking only at the rows we held out, we can calculate an accuracy metric.
devSetAccuracy <- mean((dfBinary$prediction[1:200]>.5)==dfBinary$target[1:200])
devSetAccuracy
## [1] 0.85
The model we’ve built acheives a 85% accuracy on the hold-out data, so we are doing a lot better than random.
EdNet also supports multiclass classification. The following example uses the famous Iris dataset (from the stats
package) to train a neural network to predict iris species using sepal and petal lengths and widths.
First we bring the Iris data into the environment.
data("iris")
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
# Plot sepal length v sepal width and colour by species
plot(iris$Sepal.Length, iris$Sepal.Width, col=iris$Species, pch=16,
xlab="Sepal Length", ylab="Sepal Width", main="Iris species by sepal length and width")
legend("topright",
legend=c("setosa", "versicolor", "virginica"),
col=1:3, pch=16)
Before fitting the model. For multi-class learning, EdNetTrain
requires the input labels to be one-hot encoded. It’s also good practice to normalise input features before training.
X <- sapply(iris[, 1:4], normalise)
head(X)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,] -0.8976739 1.01560199 -1.335752 -1.311052
## [2,] -1.1392005 -0.13153881 -1.335752 -1.311052
## [3,] -1.3807271 0.32731751 -1.392399 -1.311052
## [4,] -1.5014904 0.09788935 -1.279104 -1.311052
## [5,] -1.0184372 1.24503015 -1.335752 -1.311052
## [6,] -0.5353840 1.93331463 -1.165809 -1.048667
Y <- onehotEncode(iris$Species)
head(Y)
## setosa versicolor virginica
## [1,] 1 0 0
## [2,] 1 0 0
## [3,] 1 0 0
## [4,] 1 0 0
## [5,] 1 0 0
## [6,] 1 0 0
We will hold out 20 rows at random (there are 150 altogether) in order to test the model is working as well as we expect.
set.seed(1984)
holdoutRows <- sample(1:150, 20)
We’re now ready to train a model. We will train a network with 3 fully-connected hidden layers with dimensions 128, 64 and 32. Each of these layers will use a relu
activiation function and a drop-out percentage of 50%. We will train for 40 epochs with a mini-batch size of 16 using Adam optimisation.
modelFit <- EdNetTrain(X,
Y,
family="multiclass",
learning_rate=0.01,
num_epochs=40,
hidden_layer_dims=c(128, 64, 32),
hidden_layer_activations="relu",
optimiser="Adam",
lambda=0.001,
keep_prob=0.5,
mini_batch_size=16,
print_every_n=40L,
dev_set=holdoutRows)
## Cost after 0 epochs: train-2.607286, dev-2.721248
## Cost after 40 epochs: train-0.956158, dev-0.4660453
## Best dev score achieved was 0.3098807 after 30 epochs.
We can generate the predicted class probabilities by using the predict
function.
classProbs <- predict(modelFit, X)
head(classProbs)
## [,1] [,2] [,3]
## [1,] 0.9445057 0.03389728 0.02159702
## [2,] 0.9007412 0.06537104 0.03388779
## [3,] 0.9411574 0.03608371 0.02275888
## [4,] 0.9216281 0.04916146 0.02921045
## [5,] 0.9562985 0.02585889 0.01784260
## [6,] 0.9213687 0.04758339 0.03104795
Using the predictedClass
function we can calculate the predicted class for each row (which is the class with the highest probability.)
iris$predictions <- predictedClass(classProbs, levels(iris$Species))
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species predictions
## 1 5.1 3.5 1.4 0.2 setosa setosa
## 2 4.9 3.0 1.4 0.2 setosa setosa
## 3 4.7 3.2 1.3 0.2 setosa setosa
## 4 4.6 3.1 1.5 0.2 setosa setosa
## 5 5.0 3.6 1.4 0.2 setosa setosa
## 6 5.4 3.9 1.7 0.4 setosa setosa
Plotting these predictions on the holdout only (it is easy to generate good predictions on the training data.)
dev <- iris[holdoutRows, ]
par(mfrow=c(1,2))
plot(dev$Sepal.Length, dev$Sepal.Width, col=dev$Species, pch=16,
xlab="Sepal Length", ylab="Sepal Width", main="Actual")
plot(dev$Sepal.Length, dev$Sepal.Width, col=dev$predictions, pch=16,
xlab="Sepal Length", ylab="Sepal Width", main="Predicted")
legend("topright", inset=0.07,
legend=c("setosa", "versicolor", "virginica"),
col=1:3, pch=16)
Predictions on the holdout data set have 100% accuracy!
For this example we use a dummy Poisson dataset with 3 features, an exposure column and a target column. The target column is integer count data and the exposure column represents the length of time that the Poisson process ran for. The underlying rate depends on the three feature columns in some way (that we want to model.)
data("dfPoisson")
head(dfPoisson)
## x1 x2 x3 exposure target
## 1 0.4092032 0.9873708 -0.1096844 12 2
## 2 -0.3230250 -1.2885071 1.6707485 30 18
## 3 0.6358523 2.6934964 0.7363947 33 1
## 4 -1.8461288 0.2513315 -0.6128222 31 18
## 5 0.9536474 -0.3293031 1.3687059 9 2
## 6 1.1884898 1.0886427 -0.7108311 44 12
The target should depend on exposure in a well-understood linear way that we need to account for in building the model.
plot(dfPoisson$exposure, dfPoisson$target)
To take exposure into account we use an offset. We need to apply the log function to exposure as the offset is applied in training before the final activation function (which is the exponential function in this case.)
modelFit <- EdNetTrain(X=as.matrix(dfPoisson[, 1:3]),
Y=as.matrix(dfPoisson[, 5]),
offset=log(as.matrix(dfPoisson[, 4])),
family="poisson",
learning_rate=0.01,
num_epochs=40,
hidden_layer_dims=c(128, 32, 8),
hidden_layer_activations="relu",
optimiser="Adam",
lambda = 0.004,
keep_prob=0.8,
mini_batch_size=8,
print_every_n=40L,
dev_set=1:200)
## Cost after 0 epochs: train-106.9975, dev-50.1871
## Cost after 40 epochs: train-5.787886, dev-2.501501
## Best dev score achieved was 2.216769 after 38 epochs.
Having built a model we can use predict
to put predicted rate per exposure onto the original data.
dfPoisson$predictedRate <- predict(modelFit, as.matrix(dfPoisson[, 1:3]))
dfPoisson$actualRate <- dfPoisson$target/dfPoisson$exposure
head(dfPoisson)
## x1 x2 x3 exposure target predictedRate
## 1 0.4092032 0.9873708 -0.1096844 12 2 0.2685433
## 2 -0.3230250 -1.2885071 1.6707485 30 18 0.6815773
## 3 0.6358523 2.6934964 0.7363947 33 1 0.2545321
## 4 -1.8461288 0.2513315 -0.6128222 31 18 0.5573603
## 5 0.9536474 -0.3293031 1.3687059 9 2 0.4512897
## 6 1.1884898 1.0886427 -0.7108311 44 12 0.3327345
## actualRate
## 1 0.16666667
## 2 0.60000000
## 3 0.03030303
## 4 0.58064516
## 5 0.22222222
## 6 0.27272727
Plotting the actual v predicted rate on the hold-out data shows that the model is broadly picking up the correct trend.
plot(dfPoisson$predictedRate[1:200], dfPoisson$actualRate[1:200])
abline(c(0, 0), 1)
This example uses dummy Gamma distributed data and makes use of the checkpoint feature to continue learning from an existing model. First we bring in the data.
data("dfGamma")
head(dfGamma)
## x1 x2 x3 target
## 1 0.4092032 0.9873708 -0.1096844 6.242109
## 2 -0.3230250 -1.2885071 1.6707485 72.051513
## 3 0.6358523 2.6934964 0.7363947 16.776243
## 4 -1.8461288 0.2513315 -0.6128222 87.097421
## 5 0.9536474 -0.3293031 1.3687059 20.420865
## 6 1.1884898 1.0886427 -0.7108311 49.606084
Now we train a model using gradient descent for 15 epochs without using mini-batches.
modelFit <- EdNetTrain(X=as.matrix(dfGamma[, 1:3]),
Y=as.matrix(dfGamma[, 4]),
family="gamma",
learning_rate=0.002,
num_epochs=15,
hidden_layer_dims=c(128, 32, 8),
hidden_layer_activations="relu",
optimiser="GradientDescent",
print_every_n=15,
dev_set=1:200)
## Cost after 0 epochs: train-52.38222, dev-42.30829
## Cost after 15 epochs: train-2.823976, dev-2.409988
## Best dev score achieved was 2.409701 after 14 epochs.
The cost function on the training set is decreasing monotonically since we are not using drop-out or mini-batches. However, the learning seems to have stalled. We can Use first model as a checkpoint and continue training for another 25 epochs with Adam optimisation, mini-batch descent, drop-out and regularisation.
modelFit <- EdNetTrain(X=as.matrix(dfGamma[, 1:3]),
Y=as.matrix(dfGamma[, 4]),
family="gamma",
learning_rate=0.005,
num_epochs=25,
optimiser="Adam",
lambda=0.004,
keep_prob=0.8,
mini_batch_size=8,
print_every_n=25,
dev_set=1:200,
checkpoint=modelFit)
## Cost after 0 epochs: train-2.416403, dev-2.409988
## Cost after 25 epochs: train-0.9476705, dev-0.7255556
## Best dev score achieved was 0.7255556 after 25 epochs.
Changing the optimisation parameters has caused a significant improvement. We can use predict
to put the predictions back on the original dataset.
dfGamma$predictions <- predict(modelFit, as.matrix(dfPoisson[, 1:3]))
head(dfGamma)
## x1 x2 x3 target predictions
## 1 0.4092032 0.9873708 -0.1096844 6.242109 29.02339
## 2 -0.3230250 -1.2885071 1.6707485 72.051513 64.97015
## 3 0.6358523 2.6934964 0.7363947 16.776243 33.14870
## 4 -1.8461288 0.2513315 -0.6128222 87.097421 63.74657
## 5 0.9536474 -0.3293031 1.3687059 20.420865 41.97481
## 6 1.1884898 1.0886427 -0.7108311 49.606084 37.77477
A plot on the hold-out data shows a reasonable correlation between actual values and predictions (the data is reasonably noisy.)
plot(dfGamma$predictions[1:200], dfGamma$target[1:200])
abline(c(0, 0), 1)
The final example uses dummy Tweedie distributed data (continuous with a positive mass at zero.) The target variable is drawn from a compound Poisson-Gamma distribution where the Poisson part of the distribution also has a variable exposure as in the Poisson example above. This is similar to an insurance claim cost example, where different policies may have different periods of exposure.
data("dfTweedie")
head(dfTweedie)
## x1 x2 x3 exposure target
## 1 0.4092032 0.9873708 -0.1096844 12 10.560170
## 2 -0.3230250 -1.2885071 1.6707485 30 103.750467
## 3 0.6358523 2.6934964 0.7363947 33 7.509364
## 4 -1.8461288 0.2513315 -0.6128222 31 102.136073
## 5 0.9536474 -0.3293031 1.3687059 9 10.397050
## 6 1.1884898 1.0886427 -0.7108311 44 66.671503
We can see that in this example (unlike most insurance examples) only a small proportion of rows have a value of zero.
mean(dfTweedie$target==0)
## [1] 0.053
When modelling using the Tweedie distribution we need to specify a power parameter between 1 and 2. In order to adjust for the exposure We’re modelling the target per exposure and using exposure as a weight.
X <- as.matrix(dfTweedie[, 1:3])
Y <- as.matrix(dfTweedie$target/dfTweedie$exposure)
modelFit <- EdNetTrain(X=X,
Y=Y,
weight=dfTweedie$exposure,
family="tweedie",
tweediePower=1.3,
learning_rate=0.001,
num_epochs=40,
hidden_layer_dims=c(128, 64, 32, 8),
hidden_layer_activations="relu",
optimiser="Adam",
lambda=0.01,
keep_prob=0.8,
mini_batch_size=8,
print_every_n=40,
dev_set=1:200)
## Cost after 0 epochs: train-3.011956, dev-1.148983
## Cost after 40 epochs: train-1.940185, dev-0.4530936
## Best dev score achieved was 0.4530936 after 40 epochs.
Using predict
we can again put the model predictions back on the original data.
dfTweedie$predictedRate <- predict(modelFit, X)
dfTweedie$actualRate <- dfTweedie$target/dfTweedie$exposure
head(dfTweedie)
## x1 x2 x3 exposure target predictedRate
## 1 0.4092032 0.9873708 -0.1096844 12 10.560170 1.526473
## 2 -0.3230250 -1.2885071 1.6707485 30 103.750467 2.654835
## 3 0.6358523 2.6934964 0.7363947 33 7.509364 1.542838
## 4 -1.8461288 0.2513315 -0.6128222 31 102.136073 2.791624
## 5 0.9536474 -0.3293031 1.3687059 9 10.397050 1.875061
## 6 1.1884898 1.0886427 -0.7108311 44 66.671503 1.847696
## actualRate
## 1 0.8800142
## 2 3.4583489
## 3 0.2275565
## 4 3.2947120
## 5 1.1552278
## 6 1.5152614
Plotting the actual v predicted rate on the hold-out data.
plot(dfTweedie$predictedRate[1:200], dfTweedie$actualRate[1:200])
abline(c(0, 0), 1)