Background
Predictive process models are essential engineering tools that enable better process design and more efficient operations. Predictive modeling is becoming increasingly prevalent and is most useful whenever we want to use data from the past to predict what will happen in the future.
In the current context of low oil prices and increased regulatory oversight, operators need to find new ways to improve operational efficiency, improve reliability, and overall, reduce costs. Machine Learning is a set of technologies that has the potential to make a significant impact on all these dimensions.
Process Ecology has been investigating the use of some of the most widely used Machine Learning algorithms for process predictive modeling. Our experience with these algorithms includes active participation in the Kaggle community, where a number of competitions are posted.
Recent conversations with operators have identified a series of potential applications for Machine Learning models that can better predict the future performance of complex processing steps where the available first-principles engineering models have failed to support optimal operation of these facilities.
In this brief article, we describe our recent experience with a review of Machine Learning algorithms aimed at prediction of water volumes for hydraulic fracturing. The data source is the FracFocus.ca site that reports on water volumes for multiple locations.
We note that, in general, predictive modeling problems can be defined as having the following characteristics:
 There exists data in which a number of variables with predictive power (called features) correspond to one or more variables to be predicted (the response).
 The “accuracy” or “quality” of the prediction can be measured using a metric.
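As one concrete example of such a metric, the root-mean-square error (RMSE), which is the metric reported for the kNN model later in this article, can be sketched as follows. The function name and the sample volumes are ours and purely illustrative.

```python
import math

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical water volumes, for illustration only (not from the dataset)
actual_volumes = [100.0, 200.0, 300.0]
predicted_volumes = [110.0, 190.0, 300.0]
print(rmse(actual_volumes, predicted_volumes))
```

A lower RMSE means predictions are closer to the measured response, and the value is expressed in the same units as the response itself.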
As noted above, in numerous engineering applications the processes involved may be too complex to analyze using traditional equations based on first-principles models (i.e. physics, chemistry). However, there may be enough measured data points that a data-driven approach can be used. Instead of learning all the individual engineering equations that define a process, we can focus solely on making the most accurate prediction of the response from the features available.
Case Study: Predicting fracking water volume
Engineering data often comes in tabular format, where each row/record corresponds to one event, one run, etc. As an illustration, this case study looks at a dataset of hydraulic fracturing where location, date and water volume are recorded for each instance. The goal is to predict the volume of water based on the features.
Figure 1: Sample dataset for hydraulic fracturing. In this study, over 100 000 records were used for the models.
The full dataset must be split into a training set and a test set (in some cases, a validation set too). Why is this necessary?
Although the goal is to make the best predictions on future data, this can be viewed as two related but separate objectives:
 Minimize bias (fit the training data as well as possible)
 Minimize variance (fit the test data as well as possible)
Initially, improvements to the model will meet both objectives concurrently. However, there typically comes a point where improving the model will only decrease bias, but variance will plateau or even increase. At this point, the model is “overfitting”, which usually suggests the model is too complex and will perform worse on future predictions.
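The gap between training and test performance can be illustrated with a small sketch. It uses synthetic data (not the FracFocus data) and a deliberately over-complex model, a 1-nearest-neighbour predictor that simply memorizes the training set. All names here are our own, and the 75/25 split is an arbitrary choice.

```python
import math
import random

def train_test_split(records, test_fraction=0.2, seed=0):
    """Shuffle a copy of the records and split off a held-out test set."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def predict_1nn(train, x):
    """Return the response of the single closest training point (pure memorization)."""
    return min(train, key=lambda rec: abs(rec[0] - x))[1]

# Synthetic data: y = 2x plus random noise
rng = random.Random(42)
data = []
for _ in range(200):
    x = rng.uniform(0.0, 10.0)
    data.append((x, 2.0 * x + rng.gauss(0.0, 1.0)))

train, test = train_test_split(data, test_fraction=0.25)
train_rmse = rmse([y for _, y in train], [predict_1nn(train, x) for x, _ in train])
test_rmse = rmse([y for _, y in test], [predict_1nn(train, x) for x, _ in test])
print(f"train RMSE = {train_rmse:.3f}, test RMSE = {test_rmse:.3f}")
```

The memorizing model reproduces every training point exactly, so its training error is zero, yet it still makes errors on the held-out records. Only the test set reveals this.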
Many algorithms exist for learning a model to fit the data. However, selecting a good algorithm requires an understanding of the real-world problem, the nature of different algorithms, and experience. Although in machine learning we want models to learn as autonomously as possible, the modeler must still select a set of hyperparameters for each model. These hyperparameters determine factors such as how quickly the model converges and how much the model overfits.
Algorithm selection/testing:
Based on our experience and understanding of the problem, a few algorithms were tested. These algorithms and their pros and cons are presented in Table 1.
Table 1: Comparison of algorithms
Algorithm: k-Nearest Neighbours (kNN)
 Pros: Runs very fast; model easily interpretable
 Cons: Features should be scaled to consistent “scales”; not clear how to incorporate time
 Hyperparameters: Number of neighbours (k); weighting of neighbours (uniform or distance)

Algorithm: Recurrent neural network (RNN) + Long Short-Term Memory (LSTM)
 Pros: Has the ability to learn sequential nature of data; in theory, scales well with large dataset
 Cons: Takes a long time to run; testing required to discover optimal network architecture
 Hyperparameters: Number of previous points to take; typical NN parameters (learning rate, regularization, optimizer, etc.)

Algorithm: Gradient boosting machine (LightGBM)
 Pros: Does not require data to be scaled; robust to missing data; can consider time as a feature
 Cons: Takes a long time to run; models not very interpretable
 Hyperparameters: (none listed)
Model descriptions and applications:
 k-Nearest Neighbours (kNN)
 This is a relatively easy-to-interpret model which takes the ‘k’ nearest points in the dataset to predict the response for a future sample. The predictors should all be scaled to a consistent scale so that “nearest” has a useful meaning.
 In this case, latitude, longitude and depth (TVD) are scaled to be in miles. Therefore, “nearest” literally means closest in actual physical space. The model can be interpreted as: a new sample’s TotalBaseWaterVolume will be predicted as the average of the TotalBaseWaterVolume of the ‘k’ nearest locations from the training set.
 Figure 2: RMSE vs. # of neighbours, k
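The kNN prediction described above can be sketched in a few lines. This is a minimal illustration, not the code used in the study. The conversion to miles assumes TVD is recorded in metres and uses a flat-earth approximation around a reference latitude, since the article does not state how the scaling was done. The well records are made up.

```python
import math

MILES_PER_DEGREE_LAT = 69.0   # approximate length of one degree of latitude
METRES_PER_MILE = 1609.344

def to_miles(lat_deg, lon_deg, tvd_m, ref_lat_deg):
    """Map (latitude, longitude, depth) to approximate x, y, z coordinates in miles."""
    x = lon_deg * MILES_PER_DEGREE_LAT * math.cos(math.radians(ref_lat_deg))
    y = lat_deg * MILES_PER_DEGREE_LAT
    z = tvd_m / METRES_PER_MILE   # assumption: TVD is given in metres
    return (x, y, z)

def knn_predict(train_points, train_volumes, query_point, k):
    """Average the responses of the k training points closest to query_point."""
    by_distance = sorted(
        zip(train_points, train_volumes),
        key=lambda pair: math.dist(pair[0], query_point),
    )
    nearest = by_distance[:k]
    return sum(volume for _, volume in nearest) / len(nearest)

# Hypothetical wells: (latitude, longitude, TVD in metres, TotalBaseWaterVolume)
wells = [
    (55.0, -120.0, 2000.0, 10000.0),
    (55.1, -120.1, 2100.0, 12000.0),
    (55.2, -119.9, 1900.0, 14000.0),
    (57.0, -122.0, 2500.0, 30000.0),
]
REF_LAT = 55.0
train_points = [to_miles(lat, lon, tvd, REF_LAT) for lat, lon, tvd, _ in wells]
train_volumes = [volume for *_, volume in wells]

query = to_miles(55.05, -120.05, 2050.0, REF_LAT)
prediction = knn_predict(train_points, train_volumes, query, k=2)
print(prediction)
```

With k = 2, the query well sits between the first two wells, so the prediction is simply the average of their two volumes.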
 The most important tuning parameter in kNN is the number of neighbours. A low number can result in the model overfitting to too few nearby points. A high number can result in the model underfitting by not capturing trends in the data. We can test the model using different values of ‘k’ to determine which value works best. Figure 2 shows that k = 7 gave the lowest RMSE of 3.87E6.
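Testing different values of ‘k’ amounts to a simple loop over candidate values, scoring each on held-out data. The sketch below uses synthetic data and our own helper names. The candidate values and the data-generating formula are arbitrary, so the best ‘k’ it reports has no bearing on the k = 7 result above.

```python
import math
import random

def rmse(actual, predicted):
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def knn_predict(train, query, k):
    """train is a list of (point, response) pairs; average the k nearest responses."""
    nearest = sorted(train, key=lambda rec: math.dist(rec[0], query))[:k]
    return sum(response for _, response in nearest) / len(nearest)

# Synthetic stand-in data: 3-D points (in miles) with a noisy response
rng = random.Random(0)
records = []
for _ in range(300):
    point = (rng.uniform(0, 50), rng.uniform(0, 50), rng.uniform(0, 2))
    response = 1000.0 * point[0] + 500.0 * point[1] + rng.gauss(0.0, 5000.0)
    records.append((point, response))

train, held_out = records[:240], records[240:]

# Score each candidate k on the held-out records
rmse_by_k = {}
for k in (1, 3, 5, 7, 9, 15, 25):
    predictions = [knn_predict(train, point, k) for point, _ in held_out]
    rmse_by_k[k] = rmse([response for _, response in held_out], predictions)

best_k = min(rmse_by_k, key=rmse_by_k.get)
for k, score in rmse_by_k.items():
    print(f"k = {k:2d}  RMSE = {score:,.0f}")
print("best k on this synthetic data:", best_k)
```

Plotting `rmse_by_k` against ‘k’ gives a curve of the same kind as Figure 2.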
 Recurrent Neural Network (RNN) with Long Short-Term Memory (LSTM)
 Recurrent neural networks have been shown to perform well on tasks where the time dimension or sequential nature of the data is important (i.e. predictions rely on knowing about the past). RNNs combine the ability of neural networks to learn complex functions with the “recurrent” ability to gather information from past outputs.
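Before a sequence model can be trained, the data has to be arranged into sequences. The article does not say how this was done for the LSTM, so the sketch below shows only one common framing, offered as an assumption: each target value is paired with a fixed number of values that precede it (the “number of previous points to take” hyperparameter from Table 1). The function name and sample values are ours.

```python
def make_windows(series, n_previous):
    """Pair each value with the n_previous values that come immediately before it."""
    if n_previous < 1 or n_previous >= len(series):
        raise ValueError("n_previous must be between 1 and len(series) - 1")
    inputs, targets = [], []
    for end in range(n_previous, len(series)):
        inputs.append(series[end - n_previous:end])
        targets.append(series[end])
    return inputs, targets

# Hypothetical water volumes ordered by date
volumes_by_date = [10.0, 12.0, 11.0, 15.0, 14.0]
inputs, targets = make_windows(volumes_by_date, n_previous=3)
for window, target in zip(inputs, targets):
    print(window, "->", target)
```

Each (window, target) pair then becomes one training example for the network. A larger window gives the model more history to work with but leaves fewer examples.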
 Gradient boosting machine (LightGBM implementation)
 This model has been shown to be very effective for tabular data in general. It creates an ensemble of decision trees, each of which alone is a weak predictor of the data, and combines them into a strong predictor. The details of the algorithm and implementation can be found in the documentation: https://lightgbm.readthedocs.io/en/latest/
 In our case, the predictor variables are latitude, longitude, TVD and date.
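To show the idea of combining weak trees into a strong predictor, here is a toy gradient boosting loop for squared error, written with the standard library only. It is not LightGBM's algorithm or API: it uses one-split trees (stumps), an exhaustive threshold search, and synthetic data. The function names, learning rate and number of rounds are illustrative assumptions. The date is turned into a numeric feature with `date.toordinal()`.

```python
import random
from datetime import date

def fit_stump(rows, residuals):
    """Find the one-split tree (stump) that best fits the residuals.

    Assumes at least one feature takes two or more distinct values.
    """
    best = None
    for feature in range(len(rows[0])):
        # Every distinct value except the largest is a candidate threshold,
        # which guarantees that both sides of the split are non-empty.
        for threshold in sorted({row[feature] for row in rows})[:-1]:
            left = [r for row, r in zip(rows, residuals) if row[feature] <= threshold]
            right = [r for row, r in zip(rows, residuals) if row[feature] > threshold]
            left_mean = sum(left) / len(left)
            right_mean = sum(right) / len(right)
            error = (sum((r - left_mean) ** 2 for r in left)
                     + sum((r - right_mean) ** 2 for r in right))
            if best is None or error < best[0]:
                best = (error, feature, threshold, left_mean, right_mean)
    return best[1:]

def stump_predict(stump, row):
    feature, threshold, left_value, right_value = stump
    return left_value if row[feature] <= threshold else right_value

def fit_boosted_stumps(rows, targets, n_rounds=30, learning_rate=0.3):
    """Start from the mean, then repeatedly fit a stump to the remaining residuals."""
    base = sum(targets) / len(targets)
    predictions = [base] * len(rows)
    stumps = []
    for _ in range(n_rounds):
        residuals = [t - p for t, p in zip(targets, predictions)]
        stump = fit_stump(rows, residuals)
        stumps.append(stump)
        predictions = [p + learning_rate * stump_predict(stump, row)
                       for p, row in zip(predictions, rows)]
    return base, learning_rate, stumps

def boosted_predict(model, row):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(stump_predict(s, row) for s in stumps)

def mean_squared_error(actual, predicted):
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

# Synthetic records with features (latitude, longitude, TVD, date as ordinal day)
START = date(2015, 1, 1).toordinal()
rng = random.Random(1)
rows, targets = [], []
for _ in range(60):
    lat = rng.uniform(54.0, 58.0)
    lon = rng.uniform(-122.0, -118.0)
    tvd = rng.uniform(1500.0, 3000.0)
    day = START + rng.randrange(1000)
    rows.append((lat, lon, tvd, day))
    targets.append(5.0 * tvd + 10.0 * (day - START) + rng.gauss(0.0, 500.0))

model = fit_boosted_stumps(rows, targets)
# Both errors are measured on the training data, purely to show the ensemble effect
baseline_mse = mean_squared_error(targets, [model[0]] * len(rows))
boosted_mse = mean_squared_error(targets, [boosted_predict(model, row) for row in rows])
print(f"training MSE: mean only = {baseline_mse:,.0f}, boosted stumps = {boosted_mse:,.0f}")
```

Each stump is a poor model on its own, but because every round fits whatever error the previous rounds left behind, the sum improves steadily. Because the splits depend only on the ordering of feature values, no scaling of the features is needed.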
Comparison of models:
Figure 3: Comparison of actual vs. predicted water volumes for various tested models
In Figure 3, a model with perfect accuracy would have all points exactly on the diagonal running from bottom left to top right (i.e. actual response is equal to predicted response). Of the models tested, LightGBM showed the most predictive power, followed by kNN and then LSTM. In general, tabular data is handled well by LightGBM. LSTM may have performed better if the problem were formulated differently or with more careful tuning of parameters; these are challenges generally encountered when designing complex models such as neural networks. kNN, while having some predictive power, fails to incorporate the information in the time variable.
General tips on dealing with data:
 Is the data accurate (i.e. if a certain variable in the data has value X, can we be sure that the variable’s actual value was X)? Is the data precise (i.e. if a reading of X corresponds to a true value of Y, can we be sure that future readings of X will also correspond to a true value of Y)? These are the most important questions when building data models. If there is no reliable assurance that the answer to both is yes most of the time, then the data is unreliable. “Garbage in, garbage out.”
 Is there missing data? Missing data, while not ideal, can still be handled reasonably if it comprises only a small percentage of the dataset. Let’s assume we’re dealing with tabular data. If a row is missing data for many columns or for a particularly important column, the entire row can be removed. If a row is missing data for only a few columns, imputation can be used, which means filling in the missing values with reasonable estimates derived from the other columns.
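The two rules for missing data can be sketched as a small cleaning step. Note the simplification: the sketch fills gaps with the column median taken over the remaining rows, which is one of the simplest forms of imputation, rather than estimating each gap from the row's other columns. The column names Latitude and Longitude, the threshold and the records are hypothetical.

```python
from statistics import median

def clean_records(records, key_column, max_missing):
    """Drop rows with a missing key column or too many gaps, then impute the rest."""
    kept = [
        row for row in records
        if row[key_column] is not None
        and sum(value is None for value in row.values()) <= max_missing
    ]
    columns = list(kept[0].keys()) if kept else []
    # Simplification: fill each gap with the median of that column's observed values
    fill_values = {}
    for column in columns:
        observed = [row[column] for row in kept if row[column] is not None]
        fill_values[column] = median(observed) if observed else None
    return [
        {c: (row[c] if row[c] is not None else fill_values[c]) for c in columns}
        for row in kept
    ]

# Hypothetical records; None marks a missing value
records = [
    {"Latitude": 55.0, "Longitude": -120.0, "TVD": 2000.0, "TotalBaseWaterVolume": 10000.0},
    {"Latitude": 55.1, "Longitude": -120.1, "TVD": None, "TotalBaseWaterVolume": 12000.0},
    {"Latitude": None, "Longitude": None, "TVD": None, "TotalBaseWaterVolume": 9000.0},
    {"Latitude": 55.3, "Longitude": -120.3, "TVD": 2400.0, "TotalBaseWaterVolume": None},
    {"Latitude": 55.4, "Longitude": -120.4, "TVD": 2200.0, "TotalBaseWaterVolume": 15000.0},
]
cleaned = clean_records(records, key_column="TotalBaseWaterVolume", max_missing=1)
for row in cleaned:
    print(row)
```

Here the third record is dropped for having too many gaps and the fourth for missing the response itself. The single missing TVD in the second record is filled in from the rows that remain.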