Building a harvest model for cucumbers
Modelling the ordinal nature of the data
WARNING: This is quite an extensive read, but I believe it is necessary to give you a good idea about the case, and about why modelling is mostly a matter of tries and failures. Quite a lot of them, in fact, and I want to showcase them all, because that is what modelling is about: not the mere application of techniques for the sake of the technique, but tackling a business question. The business question here was whether the cucumber harvest can be predicted from historical production and environmental data.
Like agriculture, horticulture has seen a steep rise in sensors and data collection, so the greenhouses in which various plants are placed can be monitored continuously and in great detail. That means detailed tracking of temperature, humidity, and light at various locations in the greenhouse; even the weight of the plants can be tracked. The true worth of such extensive tracking remains to be seen, though. Just having data is not enough: it needs to become information.
These variables are a means to an end, and that end is understanding which factors contribute most to production, in this case of cucumbers. Biologically, there is already quite some knowledge on how best to grow these plants, but in an ever-changing environment (and with an energy transition to boot) it becomes ever more important to optimize the production-cost-sustainability triangle. That is not an easy thing to do.
So what I tried, and will describe in this blog, is to use a dataset from a single farmer containing a single year of data, covering two harvests. I will connect the production data of the first harvest with the environmental variables of that harvest, and use the result to predict the second harvest. It is not perfect data, but for a proof-of-concept it will suffice.
Now, before I begin, it is good to understand that cucumbers are classified into weight classes, and that this happens after harvesting. Let's get started!
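To make that concrete: classification is just binning by measured weight after the fact. A minimal sketch with base R's cut(), using made-up weights and boundaries that mimic the 2530 to 5060 class labels appearing later in this post:

```r
# Hypothetical cucumber weights in grams, measured after harvest.
set.seed(1)
weights <- round(runif(10, min = 250, max = 600))

# Bin each cucumber into a weight class; the labels mimic the
# 2530 ... 5060 categories used in the production data below.
classes <- cut(weights,
               breaks = c(250, 300, 350, 400, 500, 600),
               labels = c("2530", "3035", "3540", "4050", "5060"),
               include.lowest = TRUE)
table(classes)
```

The point is that the class is a consequence of the measured weight, not a property decided at picking time.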
library(readr)
library(readxl)
library(ggplot2)
theme_set(theme_bw())
library(dplyr)
library(lubridate)
ggplot(df_temp, aes(x=Tijd, y=Temperature, col=bron))+
geom_line(alpha=0.5)+
labs(x="Time")
ggplot(df_temp, aes(x=Tijd, y=Temperature, col=bron))+
geom_line(alpha=0.5) + facet_grid(~bron)+
labs(x="Time")
ggplot(df_humidity, aes(x=Tijd, y=Humidity, col=bron))+
geom_line(alpha=0.5) + facet_grid(~bron)
ggplot(df_weight, aes(x=Tijd, y=TotalWeight))+geom_line()+
labs(x="Date")
One of the many ways to get a grip on highly volatile data is to transform it. The log transformation is the usual first choice, but the cumulative sum is not bad either. In fact, it tells you a lot about how patterns shift over time and when events happened, which the raw series tends to obscure through its volatility. So let's apply the cumulative sum to the daily weight data.
df_weight%>%
timetk::summarise_by_time(.,
Tijd,
"day",
mean=mean(TotalWeight))%>%
mutate(Datum = as.Date((Tijd)),
cum = cumsum(mean))%>%
select(-Tijd)%>%
ggplot(.)+
geom_line(aes(x=Datum, y=cum, col="Cumulative Sum"))+
labs(col="Outcome",
x="Date",
y="Value")
Let's add temperature as well, but stick with the daily data for now.
df_weight_day<-df_weight%>%
timetk::summarise_by_time(.,
Tijd,
"day",
mean=mean(TotalWeight),
var = var(TotalWeight))%>%
mutate(Datum = as.Date((Tijd)))%>%
select(-Tijd)
df_temp%>%
group_by(bron)%>%
timetk::summarise_by_time(.,
Tijd,
"day",
temp_mean=mean(Temperature),
temp_var = var(Temperature))%>%
mutate(Datum = as.Date((Tijd)))%>%
select(-Tijd)%>%
ggplot(.)+
geom_line(aes(x=Datum, y=temp_mean, col="Temp"))+
geom_line(data=df_weight_day, aes(x=Datum, y=mean, col="Weight"))+
facet_grid(~bron)+
labs(col="Outcome",
x="Date",
y="Value")
I want to move towards the production data now. It is good to realize that production is measured as the number of cucumbers, and their total weight, per square meter: production expressed as a ratio of the size of the greenhouse. The gvg is the average weight per cucumber, i.e. the total harvested weight in kilograms divided by the total number of cucumbers. The m2 metrics scale the weight and the counts by the land available for harvest.
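A toy computation (invented numbers) to pin down the gvg definition:

```r
# Hypothetical totals for one day of harvest.
stuks_total <- 1200                 # number of cucumbers picked
kg_total    <- 480                  # total harvested weight in kilograms

gvg <- kg_total / stuks_total       # average weight per cucumber, in kg
gvg * 1000                          # the same value in grams
```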
production_long%>%
filter(category%in%c("gvg", "m2"))%>%
ggplot(., aes(x=Datum, y=mean, col=category))+geom_line()+
labs(col="Outcome",
x="Date",
y="Value")
On to the ratios. I already mentioned that the weight classes are a result of harvesting, not something determined up front. So, what happens is that cucumbers are harvested, counted, and then classified. The classes are always the same, so the ratios of the classes sum to one.
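A quick sketch of that bookkeeping, with invented piece counts for a single day: dividing each class count by the day's total yields ratios that sum to one by construction.

```r
# Hypothetical piece counts per weight class on one day.
stuks <- c(`2530` = 120, `3035` = 340, `3540` = 280, `4050` = 200, `5060` = 60)

ratio <- stuks / sum(stuks)  # share of each class in the day's harvest
round(ratio, 2)
sum(ratio)                   # equals 1 by construction
```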
production%>%
select(-c(gvg, kg_per_m2))%>%
mutate(kg_st_2530 = kg_2530/stuks_2530,
kg_st_3035 = kg_3035 /stuks_3035,
kg_st_3540 = kg_3540/stuks_3540,
kg_st_4050 = kg_4050/stuks_4050,
kg_st_5060 = kg_5060/stuks_5060)%>%
select(Datum, kg_st_2530,kg_st_3035,kg_st_3540,kg_st_4050,kg_st_5060)%>%
tidyr::pivot_longer(!Datum, names_to = "Outcome",values_to = "value")%>%
mutate(value = value*100)%>%
filter(value>20 & value<75)%>%
ggplot(., aes(x=Datum, y=value, col=Outcome))+
geom_line()+
labs(x="Date")
If we zoom in, we can slice and dice the data to look for specifics. Here, I am asking the data for the total weight of the cucumbers harvested in the 250–300 gram weight class.
production_long%>%
filter(metric%in%c("kg") & category=="2530")%>%
ggplot(., aes(x=Datum, y=mean, col=category))+geom_line()+
labs(x="Date")
And I can also ask for the gvg metric which, if you remember correctly, gives me the average weight per cucumber: the total harvested weight divided by the total number of pieces.
production_long%>%
filter(category%in%c("gvg"))%>%
ggplot(.)+
geom_line(aes(x=Datum,y=mean))+
theme(legend.position = "bottom")+
labs(x="Date")
I want to look closer at the gvg metric and will compare it to the 400–500 gram weight class.
df_gvg<-production_long%>%
filter(category%in%c("gvg"))
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
filter(category =="4050")%>%
ggplot(.)+
geom_point(aes(x=Datum,y=Ratio,col="4050"))+
geom_line(aes(x=Datum,y=Ratio, col="4050"), alpha=0.5)+
geom_point(data=df_gvg, aes(x=Datum, y=mean, col="gvg"))+
geom_line(data=df_gvg, aes(x=Datum, y=mean, col="gvg"), alpha=0.5)+
theme(legend.position = "bottom")+
labs(x="Date",
y="Ratio",
fill="Pieces",
col="GVG")
And what if we stack all the ratios on top of each other (to sum up to one) and then plot the gvg on top of that? What pattern does it reveal? Which of the weight classes will best mimic the gvg metric?
df_gvg<-production_long%>%
filter(category%in%c("gvg"))
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
ggplot(.)+
geom_bar(aes(x=Datum,y=Ratio,fill=category), stat="identity")+
geom_line(data=df_gvg, aes(x=Datum, y=mean), col="black")+
theme(legend.position = "bottom")+
labs(x="Date",
y="Ratio")
Still, I want to dig deeper. So what I am going to do is average the gvg metric to weekly values, create ratios for all weight classes, average them per week, and then see if I can spot changes that cointegrate. The reason for all of this is to discern what cucumber production looks like, and whether it has any additional worth for modelling beyond the more straightforward connection with environmental variables. Please note that I am interested in both the mean and the variance, since the latter can often tell you more about what is going on than the former.
df_gvg_week<-production_long%>%
filter(category%in%c("gvg"))%>%
timetk::summarise_by_time(.,
Datum,
"week",
value=mean(mean),
var = var(mean))
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("stuks"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
group_by(category)%>%
timetk::summarise_by_time(.,
Datum,
"week",
value=mean(Ratio),
var = var(Ratio))%>%
ggplot(.)+
geom_point(aes(x=Datum,y=value,col=category, size=var))+
geom_point(data=df_gvg_week, aes(x=Datum, y=value, size=var), col="black", alpha=0.1)+
theme(legend.position = "bottom",
plot.title = element_text(color = "black", size = 12, face = "bold"))+
facet_grid(~category)+
labs(x="Date",
y="Mean",
size="variance",
col="Category",
title="Mean kg ratio per category per week over time",
subtitle="in black the average weight per cucumber")
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
ggplot(.)+
geom_bar(aes(x=Datum,y=Ratio,fill=category), stat="identity")+
geom_point(aes(x=Datum,y=Ratio), size=0.5)+
facet_grid(~category)+
theme(legend.position = "bottom")+
labs(x="Date",
y="Ratio")
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
mutate(Weekday = factor(weekdays(Datum),
levels = c("Monday", "Tuesday","Wednesday",
"Thursday","Friday","Saturday",
"Sunday")))%>%
ggplot(.)+
geom_line(aes(x=Datum,y=Ratio,col=category), stat="identity")+
facet_grid(~Weekday)+
theme(legend.position = "bottom")+
labs(x="Date",
y="Ratio")
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
mutate(Weekday = factor(weekdays(Datum),
levels = c("Monday", "Tuesday","Wednesday",
"Thursday","Friday","Saturday",
"Sunday")),
Month = lubridate::month(Datum, label=TRUE))%>%
ggplot(.)+
geom_tile(aes(x=category,y=Weekday,fill=Ratio))+
scale_fill_viridis_c(option="magma")+
facet_grid(~Month)+
theme(legend.position = "bottom")+
labs(y="Weekday",
x="Category")
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
mutate(Weekday = factor(weekdays(Datum),
levels = c("Monday", "Tuesday","Wednesday",
"Thursday","Friday","Saturday",
"Sunday")),
Month = lubridate::month(Datum, label=TRUE))%>%
ggplot(.)+
geom_tile(aes(x=Month,y=Weekday,fill=Ratio))+
scale_fill_viridis_c(option="magma")+
facet_grid(~category)+
theme(legend.position = "bottom")+
labs(y="Weekday",
x="Month")
Since we are dealing with time-series data, it is a good approach to analyze it as a time-series. Sometimes, however, it is better to strip as many time-series elements away as possible and look for more distinctive patterns, which allows for more straightforward modelling. For a time-series, the cumulative sum offers such a view. We already used it in previous plots, but I want to take another look and broaden my perspective.
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
select(-metric)%>%
group_by(category)%>%
arrange(category, Datum)%>%
mutate(cum = cumsum(mean))%>%
ggplot(.)+
geom_line(aes(x=Datum, y=cum, col=category))+
labs(col="Outcome",
x="Date",
y="Value")
production_long%>%
filter(category%in%c("gvg", "m2"))%>%
group_by(category, metric)%>%
arrange(category, Datum)%>%
mutate(cum = cumsum(mean))%>%
ggplot(.)+
geom_line(aes(x=Datum, y=cum, col=category))+
facet_grid(~metric)+
labs(col="Outcome",
x="Date",
y="Value")
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
ggplot(.)+
geom_tile(aes(x=Datum,y=category, fill=Ratio))+
scale_fill_viridis_c(option="magma")+
labs(x="Date",
y="Outcome",
fill="Ratio")
Or, they represent just the cucumbers that were left to develop longer, which makes them heavier as they grow over time. Like I said before, it is not easy to discern whether the cucumbers are picked by category, or whether the category is simply assigned afterwards based on an assessment of the collected harvest. To me, the latter is clearly the case, also knowing that this farmer does not have the capability to screen each and every cucumber and select it based on an estimated weight class.
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
ungroup()%>%
select(-c(mean, metric))%>%
arrange(category, Datum)%>%
group_by(category)%>%
mutate(cum=cumsum(Ratio))%>%
ggplot(.)+
geom_line(aes(x=Datum,y=cum, col=category))+
theme(legend.position = "bottom")+
labs(x="Date",
y="Cumulative Sum of the Ratio")
df_temp_day_all<-df_temp%>%
timetk::summarise_by_time(.,
Tijd,
"day",
mean=mean(Temperature),
var = var(Temperature))%>%
mutate(Datum = as.Date((Tijd)))%>%
select(-Tijd)%>%
mutate(Outcome="Temperature")
df_humidity_day_all<-df_humidity%>%
timetk::summarise_by_time(.,
Tijd,
"day",
mean=mean(Humidity),
var = var(Humidity))%>%
mutate(Datum = as.Date((Tijd)))%>%
select(-Tijd)%>%
mutate(Outcome="Humidity")
ggplot()+
geom_line(data=df_temp_day_all, aes(x=Datum, y=log(mean), col="Temp"))+
geom_line(data=df_weight_day, aes(x=Datum, y=log(mean), col="Weight"))+
geom_line(data=df_humidity_day_all, aes(x=Datum, y=log(mean), col="Humidity"))+
geom_line(data=df_light_day, aes(x=Datum, y=log(mean), col="Light"))+
geom_bar(data=production, aes(x=Datum, y=log(stuks_2530+0.00001), fill="Pieces of 2530"), stat="identity", alpha=0.5)+
labs(col="Outcome",
x="Date",
y="Value",
fill="")
Something that I believed to be of interest (but which later turned out not to be, though I will show it anyhow) is a calculated metric showing the growth of a weight category. It is the total weight of the harvested weight class divided by the number of pieces, which is then again set as a ratio over time. In other words, I want to see if there is a change in the ratio of weight to pieces for a given category. If a difference is detected, that means there is a bit of a shift in the weight class itself, as the weight class is determined afterwards. The key metric here is time, which represents the number of days since the start (the first day of production data collected). Do note that I am not splitting the data per harvest yet, so the ratio will be divided by an ever increasing denominator, one day at a time. This means that the cucumbers not picked at day x, but at day y, had a longer time to reach the same weight class (which is determined afterwards), and so their growth is slower.
growth<-production%>%
select(-c(gvg, kg_per_m2))%>%
mutate(kg_st_2530 = kg_2530 / stuks_2530)%>%
select(Datum, kg_st_2530,kg_2530,stuks_2530)%>%
mutate(time = row_number())%>%
mutate(gpd = (kg_st_2530/time)*100)%>%
tidyr::drop_na()
ggplot(growth, aes(x=time, y=gpd, size=stuks_2530), alpha=0.7)+
geom_point()+
labs(x="Time",
y="Growth per day",
size="Number of 25_30")
growth<-production%>%
select(-c(gvg, kg_per_m2,stuks_per_m2))%>%
mutate(kg_st_2530 = kg_2530 / stuks_2530,
kg_st_3035 = kg_3035 / stuks_3035,
kg_st_3540 = kg_3540 / stuks_3540,
kg_st_4050 = kg_4050 / stuks_4050,
kg_st_5060 = kg_5060 / stuks_5060)%>%
mutate(time = row_number())%>%
mutate(gpd_25_30 = (kg_st_2530/time)*100,
gpd_30_35 = (kg_st_3035/time)*100,
gpd_35_40 = (kg_st_3540/time)*100,
gpd_40_50 = (kg_st_4050/time)*100,
gpd_50_60 = (kg_st_5060/time)*100)%>%
tidyr::pivot_longer(gpd_25_30:gpd_50_60,
names_to = "Outcome",values_to = "value")%>%
tidyr::drop_na()%>%
filter(value<1)
growth%>%
select(time, Outcome, value)%>%
print(n=30)
# A tibble: 368 x 3
time Outcome value
<int> <chr> <dbl>
1 65 gpd_25_30 0.457
2 65 gpd_30_35 0.533
3 65 gpd_35_40 0.592
4 65 gpd_40_50 0.668
5 65 gpd_50_60 0.840
6 93 gpd_25_30 0.312
7 93 gpd_30_35 0.370
8 93 gpd_35_40 0.359
9 93 gpd_40_50 0.472
10 93 gpd_50_60 0.588
11 94 gpd_25_30 0.312
12 94 gpd_30_35 0.365
13 94 gpd_35_40 0.408
14 94 gpd_40_50 0.463
15 94 gpd_50_60 0.587
16 95 gpd_25_30 0.311
17 95 gpd_30_35 0.361
18 95 gpd_35_40 0.465
19 95 gpd_40_50 0.458
20 95 gpd_50_60 0.577
21 101 gpd_25_30 0.290
22 101 gpd_30_35 0.340
23 101 gpd_35_40 0.380
24 101 gpd_40_50 0.432
25 101 gpd_50_60 0.566
26 114 gpd_25_30 0.256
27 114 gpd_30_35 0.303
28 114 gpd_35_40 0.339
29 114 gpd_40_50 0.390
30 114 gpd_50_60 0.496
# ... with 338 more rows
ggplot(growth, aes(x=time, y=value, col=Outcome), alpha=0.7)+
geom_point()+
labs(x="Time",
y="Growth per day",
col="Per outcome")
ggplot()+
geom_point(data=growth, aes(x=Datum, y=value, col=Outcome), alpha=0.7)+
geom_line(data=growth, aes(x=Datum, y=value, col=Outcome), alpha=0.7)+
geom_bar(data = df_gvg, aes(Datum, y=mean), stat="identity", fill="black", alpha=0.1)+
labs(x="Time",
y="Growth per day",
col="Per outcome")
GPD<-production%>%
select(-c(gvg, kg_per_m2,stuks_per_m2))%>%
mutate(kg_st_2530 = kg_2530 / stuks_2530,
kg_st_3035 = kg_3035 / stuks_3035,
kg_st_3540 = kg_3540 / stuks_3540,
kg_st_4050 = kg_4050 / stuks_4050,
kg_st_5060 = kg_5060 / stuks_5060)%>%
mutate(time = row_number())%>%
mutate(gpd_25_30 = (kg_st_2530/time)*100,
gpd_30_35 = (kg_st_3035/time)*100,
gpd_35_40 = (kg_st_3540/time)*100,
gpd_40_50 = (kg_st_4050/time)*100,
gpd_50_60 = (kg_st_5060/time)*100)%>%
tidyr::pivot_longer(gpd_25_30:gpd_50_60,
names_to = "GPD",values_to = "GPD_value")%>%
select(Datum,GPD,GPD_value)
Pieces<-production%>%
select(c(Datum, stuks_2530,stuks_3035,stuks_3540,stuks_4050,stuks_5060))%>%
tidyr::pivot_longer(stuks_2530:stuks_5060,
names_to = "Pieces",values_to = "Pieces_value")%>%
select(Datum,Pieces,Pieces_value)
WeightKG<-production%>%
select(c(Datum, kg_2530,kg_3035,kg_3540,kg_4050,kg_5060))%>%
tidyr::pivot_longer(kg_2530:kg_5060,
names_to = "Weight",values_to = "Weight_value")%>%
select(Datum,Weight,Weight_value)
combined<-cbind(GPD, Pieces[,2:3], WeightKG[,2:3])
combined%>%
filter(GPD_value<=1)%>%
ggplot(., aes(x=Datum, y=GPD_value, col=GPD, size=Pieces_value))+
geom_point()+
labs(x="Date",
y="Growth per day",
col="Size category",
size ="Number of pieces")
combined%>%
filter(GPD_value<=1)%>%
ggplot(., aes(x=GPD_value,
size=Weight_value,
y=Pieces_value,
col=GPD))+
geom_point()+
labs(x="Growth per day",
y="Pieces per category harvested",
col="Size category",
size ="Total weight per category harvested")
combined%>%
filter(GPD_value<=1)%>%
mutate(month = lubridate::month(Datum))%>%
ggplot(., aes(x=GPD_value,
size=Weight_value,
y=Pieces_value,
col=GPD))+
geom_point()+
facet_grid(~month, scales="free", space="free")+
labs(x="Growth per day",
y="Pieces per category harvested",
col="Size category",
size ="Total weight per category harvested")+
theme(legend.position = "bottom")
Last, but not least for the graphical part: the mean and variance of (some of) the environmental variables.
combined%>%
filter(bron=="Vrucht")%>%
ungroup()%>%
select(Datum, mean, var, Outcome)%>%
tidyr::drop_na()%>%
ggplot(., aes(x=Datum,
y=mean,
size=var,
col=Outcome))+
geom_point()+
labs(x="Date",
y="Mean value",
size="Variance value",
col="Outcome")
combined%>%
filter(Outcome=="Temperature")%>%
ungroup()%>%
select(Datum, mean, var, bron)%>%
tidyr::drop_na()%>%
ggplot(., aes(x=Datum,
y=mean,
size=var,
col=bron))+
geom_point()+
labs(x="Date",
y="Mean value",
size="Variance value",
col="Source")
This concludes the graphical part, which was quite extensive (I warned you), since I did not have prior experience with this kind of data and I wanted to see if the data by itself holds sufficient information to support modelling.
In a previous blog post I showed how to use the Prophet algorithm to deploy Fourier transformations and model cyclical data. Some of the data in this dataset are quite cyclical as well, so let's see if this fast algorithm can show me some additional information. I do not expect it will be my end model, especially since the outcome is ordinal in nature.
For this first try, I will focus on the number of cucumbers per m² and apply Fourier transforms on month, week and day.
df<-production_long%>%
filter(metric%in%c("stuks_per"))%>%
filter(category%in%c("m2"))%>%
select(Datum, mean)
m1 <- prophet::prophet(weekly.seasonality=FALSE,
yearly.seasonality=FALSE,
daily.seasonality=FALSE,
growth="logistic",
seasonality.mode = 'additive',
changepoint.prior.scale = 0.8,
interval.width = 0.5)
m1 <- prophet::add_seasonality(m1, name='monthly',
period=30.5,
fourier.order=20)
m1 <- prophet::add_seasonality(m1, name='daily',
period=1,
fourier.order=12)
m1 <- prophet::add_seasonality(m1, name='weekly',
period=7,
fourier.order=52)
df$cap <- 2
df$floor <- 0
fit <- df%>%
rename(ds=Datum, y=mean)%>%
select(ds,y, cap, floor)%>%
prophet::fit.prophet(m1,.)
future <- prophet::make_future_dataframe(fit,
periods=31,
freq= "day",
include_history = TRUE)
future$cap <- 2
future$floor <- 0
forecast <- predict(fit, future)
plot(fit, forecast) + prophet::add_changepoints_to_plot(fit)
prophet::prophet_plot_components(fit, forecast)
forecast%>%
dplyr::select(ds,
yhat,
yhat_lower,
yhat_upper)%>%
ggplot(., aes(x=ds, y=yhat))+
geom_point()+
geom_ribbon(aes(x=ds, y=yhat, ymin=yhat_lower, ymax=yhat_upper),
fill="black",alpha=0.2)+
theme_bw()
To model time-series, we normally need to make sure that the series adheres to certain standards, like being stationary. Let's explore the ratios of each of the weight categories and see if they meet that assumption.
mymts<-production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("stuks"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
ungroup()%>%
select(Datum, mean, category)%>%
tidyr::spread(category, mean)%>%
arrange(Datum)%>%
ungroup()%>%
ts(.,frequency = 365.25,
start=c(lubridate::year(min(.$Datum)),
lubridate::month(min(.$Datum))))
plot(mymts[,2:6])
We can then apply the ADF (Augmented Dickey-Fuller) test, which will give you a p-value. A better way is to just take differences and then look at the autocorrelation: the plots will give you more information than the p-value, although the p-value of course feels safer.
apply(mymts[, 2:6], 2, tseries::adf.test)
stnry = MTS::diffM(mymts[,2:6], 2)
apply(stnry, 2, tseries::adf.test)
plot.ts(stnry)
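To illustrate the point about autocorrelation plots without relying on the blog's data, a small simulated sketch: a random walk (non-stationary) shows a slowly decaying ACF, while its first difference looks like white noise.

```r
# Simulated random walk versus its first difference.
set.seed(42)
rw <- cumsum(rnorm(200))

par(mfrow = c(1, 2))
acf(rw,       main = "Random walk")       # slow decay: non-stationary
acf(diff(rw), main = "First difference")  # drops to near zero: stationary
```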
The model that first came to mind, before the Prophet model, was the application of a Vector Autoregressive Model (VAR) which is a type of model I have used before. The greatest advantage is that it can model outcomes together and let the time-series cointegrate if such a co-integration exists.
vars::VARselect(stnry,
type = "none", # type of deterministic regressors to include; none, because the series was made stationary by differencing above
lag.max = 10)
Let's apply the selected lag order to the model and see where we end up.
var.a <- vars::VAR(stnry,
lag.max = 8, # highest lag order for lag length selection according to the chosen IC
ic = "AIC", # information criterion
type = "none", # type of deterministic regressors to include
season = NULL,
exogen=NULL)
summary(var.a)
plot(var.a)
There is a multitude of tests you can perform on time-series data, but the one I like most, apart from what I already showed, is the stability plot. It shows me if there are events (sudden changes) I missed; events have a tendency to mess up analyses.
Stability1 <- vars::stability(var.a, type = "OLS-CUSUM"); plot(Stability1)
And from that model, we can make daily forecasts.
fcast = predict(var.a, n.ahead = 25) # we forecast over a short horizon because beyond short horizon prediction becomes unreliable or uniform
par(mar = c(2.5,2.5,2.5,2.5))
plot(fcast)
fcast = predict(var.a, n.ahead = 365) # a full year ahead; beyond a short horizon the prediction becomes unreliable or uniform
par(mar = c(2.5,2.5,2.5,2.5))
plot(fcast)
So, I already mentioned that the data I used up until now was not split by harvest. But it should be! Especially since the aim of the proof-of-concept is to discern whether I can predict a harvest by connecting historical data of previous harvests with environmental data. So, let's first discern which harvests we have and how different they really are. We have two harvests. No difference between the two would make this a very easy exercise, but I doubt they are similar.
production_long%>%
filter(category=="gvg")%>%
select(-metric)%>%
print(n=365)
harvest1_start = as.Date("2022-03-14")
harvest1_stop = as.Date("2022-07-12")
harvest2_start = as.Date("2022-08-02")
harvest2_stop = as.Date("2022-10-31")
production_long%>%
filter(category=="gvg")%>%
select(-metric)%>%
ggplot(.)+
geom_line(aes(x=Datum,y=mean))+
geom_vline(xintercept = harvest1_start, col="red")+
geom_vline(xintercept = harvest1_stop, col="red")+
geom_vline(xintercept = harvest2_start, col="blue")+
geom_vline(xintercept = harvest2_stop, col="blue")+
theme(legend.position = "bottom")+
labs(x="Date")
The first plot is good, but it would be better to overlay the harvests. I will do that for the gvg outcome, as before.
production_long%>%
filter(category%in%c("gvg"))%>%
select(Datum, mean)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest>0)%>%
group_by(Harvest)%>%
mutate(Time = row_number())%>%
ggplot(.)+
geom_line(aes(x=Time,y=mean, col=factor(Harvest)))+
theme(legend.position = "bottom")+
labs(x="Days of harvest",
y="GVG",
col="Harvest")
Let's build a VAR model in which the data of the first harvest will be used to predict the second harvest. I will also include the exogenous outcomes I have at my disposal, which are temperature, weight, and light. Humidity I cannot use.
Environ_combined_Harvest_1<-rbind(df_temp_day,
df_weight_day,
df_light_day)%>%
ungroup()%>%
mutate(Exo = paste(Outcome,"_",bron))%>%
select(-c(bron, Outcome, var))%>%
tidyr::spread(Exo,mean)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest==1)%>%
select(-Harvest)
Harvest_1<-production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("stuks"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
mutate(Ratio = if_else(is.na(Ratio),0,Ratio))%>%
ungroup()%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest==1)%>%
select(Datum, Ratio, category)%>%
tidyr::spread(category, Ratio)
Production_Harvest_1<-Harvest_1%>%
cbind(Environ_combined_Harvest_1[,2:8])%>%
arrange(Datum)%>%
ungroup()%>%
ts(.,frequency = 365.25,
start=c(lubridate::year(min(.$Datum)),
lubridate::month(min(.$Datum))))
> class(Production_Harvest_1)
[1] "mts" "ts" "matrix"
> dim(Production_Harvest_1)
[1] 121 13
plot(Production_Harvest_1[,2:11])
apply(Production_Harvest_1[, 2:13], 2, tseries::adf.test)
stnry = MTS::diffM(Production_Harvest_1[, 2:13], 2)
apply(stnry, 2, tseries::adf.test)
plot.ts(stnry[,2:11])
It is quite tricky to run a VAR model with exogenous data in R. It took me a couple of tries to get the data and result matrices set up in such a way that it made sense for the model to do calculations on.
endo_old<-Production_Harvest_1[,2:6]
exo_old<-as.matrix(Production_Harvest_1[,7:12])
var.b <- vars::VAR(endo_old,
lag.max = 10,
ic = "AIC",
type = "none",
season = NULL,
exogen=cbind(exo_old))
Now that I have the model I can start looking for ways to predict. The data wrangling, however, is quite heavy.
Environ_combined_Harvest_2<-rbind(df_temp_day,
df_weight_day,
df_light_day)%>%
ungroup()%>%
mutate(Exo = paste(Outcome,"_",bron))%>%
select(-c(bron, Outcome, var))%>%
mutate(mean = if_else(is.na(mean),0,mean))%>%
tidyr::spread(Exo,mean)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest==2)%>%
select(-Harvest)
Harvest_2<-production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("stuks"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
ungroup()%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest==2)%>%
select(Datum, Ratio, category)%>%
tidyr::spread(category, Ratio)
Production_Harvest_2<-Harvest_2%>%
cbind(Environ_combined_Harvest_2[,2:8])%>%
arrange(Datum)%>%
ungroup()%>%
ts(.,frequency = 365.25,
start=c(lubridate::year(min(.$Datum)),
lubridate::month(min(.$Datum))))
endo_new<-Production_Harvest_2[,2:6]
exo_new1<-as.matrix(data.frame(Production_Harvest_2[,7:12]))
exo_new1[,6]<-0
fcast = predict(var.b,
dumvar=cbind(exo_new1),
n.ahead=6)
df_pred<-as.data.frame(do.call(rbind, fcast$fcst))%>%
mutate(category = rep(c("2530","3035","3540","4050","5060"),
each=dim(exo_new1)[1],
length.out=dim(exo_new1)[1]*5))%>%
mutate(Datum = rep(Environ_combined_Harvest_2$Datum,
length.out=dim(exo_new1)[1]*5))
df_production_ex<-production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("stuks"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
ungroup()%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest==2)%>%
select(Datum, Ratio, category)%>%
filter(Datum<"2022-08-15")
df_pred%>%
as_tibble()%>%
group_by(category)%>%
arrange(category, Datum)%>%
filter(Datum<"2022-08-15")%>%
ggplot(.)+
geom_point(aes(x=Datum, y=fcst, col=category))+
geom_ribbon(aes(x=Datum, y=fcst,ymin=lower, ymax=upper, fill=category), alpha=0.5)+
geom_point(data=df_production_ex,
aes(x=Datum,
y=Ratio,
group=category))+
facet_grid(~category, scales="free", space="free")+
labs(x="Date",
y="Ratio of weight category to total",
fill="95% Prediction interval",
col="Weight category")+
theme(legend.position="bottom")
For those of you who took a close look: I had missing data in the exogenous variables I needed to forecast with, and what I did was replace the NAs with zero. That is not harmless. Below you can see what happens when I instead impute them with a Kalman filter: the results change.
Production_Harvest_2<-Harvest_2%>%
cbind(Environ_combined_Harvest_2[,2:8])%>%
arrange(Datum)%>%
ungroup()%>%
imputeTS::na_kalman(.) %>%
ts(.,frequency = 365.25,
start=c(lubridate::year(min(.$Datum)),
lubridate::month(min(.$Datum))))
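To see what imputeTS::na_kalman actually does, here is a standalone toy comparison (made-up series) between zero replacement and Kalman smoothing: the Kalman filter fits a state-space model and interpolates along the underlying trend instead of yanking the series towards zero. This assumes the imputeTS package is installed.

```r
library(imputeTS)

# A toy upward-trending series with two gaps.
x <- c(10, 12, NA, 16, 18, NA, 22, 24, 26, 28)

zero_filled <- ifelse(is.na(x), 0, x)   # what I did at first
kalman      <- na_kalman(x)             # state-space interpolation

rbind(zero_filled = zero_filled, kalman = round(kalman, 1))
```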
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(metric%in%c("kg"))%>%
select(Datum, mean, category)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest>0)%>%
group_by(Harvest, category)%>%
mutate(Time = row_number())%>%
ggplot(.)+
geom_line(aes(x=Time,y=mean, col=factor(Harvest)))+
facet_grid(~category)+
theme(legend.position = "bottom")+
labs(x="Days of harvest",
y="KG",
col="Harvest")
What if we model the data in cumulative form? It is much easier to model. So easy, in fact, that we can even start with a simple linear regression.
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
ungroup()%>%
select(Datum, mean, category, Ratio)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest>0)%>%
mutate(Ratio = tidyr::replace_na(Ratio, 0))%>%
group_by(Harvest, category)%>%
arrange(category,Datum)%>%
mutate(Time = row_number(),
Cum = cumsum(Ratio))%>%
ggplot(.)+
geom_line(aes(x=Time,y=Cum, col=factor(Harvest)))+
facet_grid(~category)+
theme(legend.position = "bottom")+
labs(x="Days of harvest",
y="Cumulative ratio",
col="Harvest")
df_CumRatio<-production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
ungroup()%>%
select(Datum, mean, category, Ratio)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest>0)%>%
mutate(Ratio = tidyr::replace_na(Ratio, 0))%>%
group_by(Harvest, category)%>%
arrange(category,Datum)%>%
mutate(Time = row_number(),
Cum = cumsum(Ratio))
fit<-df_CumRatio%>%
filter(Harvest==1)%>%
lm(Cum~Time*category, data=.)
summary(fit)
Call:
lm(formula = Cum ~ Time * category, data = .)
Residuals:
Min 1Q Median 3Q Max
-2.5883 -0.2030 0.0034 0.3556 3.4782
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.020624 0.162263 0.127 0.89890
Time 0.005069 0.002308 2.196 0.02848 *
category3035 0.736145 0.229474 3.208 0.00141 **
category3540 0.543583 0.229474 2.369 0.01816 *
category4050 -3.238624 0.229474 -14.113 < 2e-16 ***
category5060 -0.103103 0.229474 -0.449 0.65338
Time:category3035 0.112518 0.003265 34.466 < 2e-16 ***
Time:category3540 0.359676 0.003265 110.176 < 2e-16 ***
Time:category4050 0.439812 0.003265 134.723 < 2e-16 ***
Time:category5060 0.003985 0.003265 1.221 0.22266
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.8869 on 595 degrees of freedom
Multiple R-squared: 0.996, Adjusted R-squared: 0.9959
F-statistic: 1.643e+04 on 9 and 595 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit)
pred<-broom.mixed::augment(fit)
ggplot(pred, aes(x=Cum, y=.fitted))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")
ggplot(pred, aes(x=Time, y=.resid))+
geom_point()+
geom_smooth(aes(col=category, fill=category))+
facet_wrap(~category, ncol=1, scale="free")
Let's see how far this model gets us when predicting the second harvest.
df_pred_harvest2<-df_CumRatio%>%
filter(Harvest==2)%>%
predict(fit, newdata=.)
df_CumRatio%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
ggplot(., aes(x=Cum, y=.fitted))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")
df_CumRatio%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
ggplot(., aes(x=Cum, y=.fitted, col=.resid))+
geom_point(alpha=0.8)+
scale_color_viridis_c(option="magma")+
geom_abline(intercept=0, slope=1, lty=2, col="black")+
facet_grid(~category, scales="free")
Let's include the environmental variables.
df_cumRatio_combined<-merge(df_CumRatio, Environ_combined_wide, by="Datum")%>%
group_by(category)%>%
arrange(category, Datum)%>%
ungroup()
colnames(df_cumRatio_combined)
colnames(df_cumRatio_combined)<-c("Datum", "mean", "category","Ratio","Harvest","Time",
"Cum","Light","Temp_bloem","Temp_doek","Temp_kop",
"Temp_nok","Temp_vrucht","Weight")
fit2<-df_cumRatio_combined%>%
filter(Harvest==1)%>%
lm(Cum~Time*category + Light + Temp_bloem + Temp_doek + Temp_kop + Temp_nok +
Temp_vrucht + Weight, data=.)
summary(fit2)
Call:
lm(formula = Cum ~ Time * category + Light + Temp_bloem + Temp_doek +
Temp_kop + Temp_nok + Temp_vrucht + Weight, data = .)
Residuals:
Min 1Q Median 3Q Max
-2.6512 -0.2745 0.0064 0.3464 3.3987
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.8882071 1.4334736 -1.317 0.18828
Time 0.0062967 0.0028644 2.198 0.02832 *
category3035 0.7361446 0.2294852 3.208 0.00141 **
category3540 0.5435826 0.2294852 2.369 0.01817 *
category4050 -3.2386240 0.2294852 -14.113 < 2e-16 ***
category5060 -0.1031033 0.2294852 -0.449 0.65339
Light -0.0015367 0.0008004 -1.920 0.05534 .
Temp_bloem 0.2226424 0.2191329 1.016 0.31004
Temp_doek -0.0873605 0.2434585 -0.359 0.71985
Temp_kop 0.0404817 0.2622962 0.154 0.87740
Temp_nok -0.0463816 0.0497856 -0.932 0.35191
Temp_vrucht -0.0349427 0.1956109 -0.179 0.85829
Weight 0.0078287 0.0222947 0.351 0.72560
Time:category3035 0.1125182 0.0032647 34.465 < 2e-16 ***
Time:category3540 0.3596757 0.0032647 110.170 < 2e-16 ***
Time:category4050 0.4398122 0.0032647 134.716 < 2e-16 ***
Time:category5060 0.0039853 0.0032647 1.221 0.22268
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.887 on 588 degrees of freedom
Multiple R-squared: 0.996, Adjusted R-squared: 0.9959
F-statistic: 9242 on 16 and 588 DF, p-value: < 2.2e-16
anova(fit2, fit)
Analysis of Variance Table
Model 1: Cum ~ Time * category + Light + Temp_bloem + Temp_doek + Temp_kop +
Temp_nok + Temp_vrucht + Weight
Model 2: Cum ~ Time * category
Res.Df RSS Df Sum of Sq F Pr(>F)
1 588 462.58
2 595 468.04 -7 -5.4613 0.9917 0.4361
fit3<-df_cumRatio_combined%>%
filter(Harvest==1)%>%
lm(Cum~Time + category + Time*category + Time*Light + Time*Temp_bloem +
Time*Temp_doek + Time*Temp_kop + Time*Temp_nok +
Time*Temp_vrucht + Time*Weight, data=.)
summary(fit3)
par(mfrow=c(2,2))
plot(fit3)
anova(fit3, fit2, fit)
Analysis of Variance Table
Model 1: Cum ~ Time + category + Time * category + Time * Light + Time *
Temp_bloem + Time * Temp_doek + Time * Temp_kop + Time *
Temp_nok + Time * Temp_vrucht + Time * Weight
Model 2: Cum ~ Time * category + Light + Temp_bloem + Temp_doek + Temp_kop +
Temp_nok + Temp_vrucht + Weight
Model 3: Cum ~ Time * category
Res.Df RSS Df Sum of Sq F Pr(>F)
1 581 448.97
2 588 462.58 -7 -13.6072 2.5155 0.01487 *
3 595 468.04 -7 -5.4613 1.0096 0.42320
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The ANOVA table above does not convincingly favor the more complex models: the environmental main effects in fit2 add nothing over the baseline time-by-category model, and the time-by-environment interactions in fit3 are only borderline significant.
pred<-broom.mixed::augment(fit3)
ggplot(pred, aes(x=Cum, y=.fitted))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")
ggplot(pred, aes(x=Cum, y=.fitted, col=.resid))+
geom_point(alpha=0.8)+
scale_color_viridis_c(option="magma")+
geom_abline(intercept=0, slope=1, lty=2, col="black")+
facet_grid(~category, scales="free")
df_pred_harvest2<-df_cumRatio_combined%>%
filter(Harvest==2)%>%
predict(fit3, newdata=.)
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
ggplot(., aes(x=Cum, y=.fitted))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
ggplot(., aes(x=Cum, y=.fitted, col=.resid))+
geom_point(alpha=0.8)+
scale_color_viridis_c(option="magma")+
geom_abline(intercept=0, slope=1, lty=2, col="black")+
facet_grid(~category, scales="free")
Let's see what predicted vs. observed looks like when we transform the data from cumulative sums back to ratios (for both the observed and the predicted values).
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
select(Datum, category, Cum, .fitted)%>%
ungroup()%>%
group_by(category)%>%
mutate(Ratio_obs = Cum - lag(Cum),
Ratio_pred = .fitted - lag(.fitted))%>%
ungroup()%>%
ggplot(., aes(x=Ratio_obs, y=Ratio_pred))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
select(Datum, category, Cum, .fitted)%>%
ungroup()%>%
group_by(category)%>%
mutate(Ratio_obs = Cum - lag(Cum),
Ratio_pred = .fitted - lag(.fitted))%>%
mutate(Possible_pred = if_else(Ratio_pred<0, "No", "Yes"))%>%
ungroup()%>%
ggplot(., aes(x=Ratio_obs, y=Ratio_pred, col=Possible_pred))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")+
labs(x="Observed ratios",
y="Predicted ratios",
col="Possible prediction?")+
theme(legend.position="bottom")
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
select(Datum, category, Cum, .fitted)%>%
ungroup()%>%
group_by(category)%>%
mutate(Ratio_obs = Cum - lag(Cum),
Ratio_pred = .fitted - lag(.fitted))%>%
mutate(Possible_pred = if_else(Ratio_pred<0, "No", "Yes"))%>%
ungroup()%>%
ggplot(.)+
geom_point(aes(x=Datum, y=Ratio_obs), col="grey", lty=2)+
geom_point(aes(x=Datum, y=Ratio_pred, col=Possible_pred))+
facet_wrap(~category, scales="free", ncol=1)+
labs(x="Date",
y="Ratios (observed and predicted)",
col="Possible prediction?")+
theme(legend.position="bottom")
So, let's first see if we can tackle the temporal issues by deploying a Generalized Additive Model (GAM), which is essentially built from splines. I really like these models, and in this case I am using the mixed-model variant (GAMM), which means including random components.
fit_gam <- df_cumRatio_combined%>%
filter(Harvest==1)%>%
mutate(category=factor(category))%>%
gamm4::gamm4(Cum~s(Time, by=category)+
category+
s(Light)+
s(Temp_bloem)+
s(Temp_doek)+
s(Temp_kop)+
s(Temp_nok)+
s(Temp_vrucht)+
s(Weight),
data=.,
random=~(1+Time|category))
fit_gam
$mer
Linear mixed model fit by REML ['lmerMod']
REML criterion at convergence: -479.8367
Random effects:
Groups Name Std.Dev. Corr
Xr s(Time):category2530 3.446e-01
Xr.0 s(Time):category3035 7.816e+00
Xr.1 s(Time):category3540 1.216e+01
Xr.2 s(Time):category4050 9.618e+00
Xr.3 s(Time):category5060 7.583e-01
Xr.4 s(Light) 4.878e-02
Xr.5 s(Temp_bloem) 0.000e+00
Xr.6 s(Temp_doek) 0.000e+00
Xr.7 s(Temp_kop) 0.000e+00
Xr.8 s(Temp_nok) 1.104e-01
Xr.9 s(Temp_vrucht) 4.406e-05
Xr.10 s(Weight) 8.040e-02
category (Intercept) 8.632e+00
Time 6.051e+00 -0.44
Residual 1.309e-01
Number of obs: 605, groups:
Xr, 8; Xr.0, 8; Xr.1, 8; Xr.2, 8; Xr.3, 8; Xr.4, 8; Xr.5, 8; Xr.6, 8; Xr.7, 8; Xr.8, 8; Xr.9, 8; Xr.10, 8; category, 5
Fixed Effects:
X(Intercept) Xcategory3035 Xcategory3540 Xcategory4050 Xcategory5060
0.3297748 7.5998190 22.4838782 23.5900030 0.1400601
Xs(Time):category2530Fx1 Xs(Time):category3035Fx1 Xs(Time):category3540Fx1 Xs(Time):category4050Fx1 Xs(Time):category5060Fx1
0.1535347 6.0360774 13.4338150 10.5357889 0.6199651
Xs(Light)Fx1 Xs(Temp_bloem)Fx1 Xs(Temp_doek)Fx1 Xs(Temp_kop)Fx1 Xs(Temp_nok)Fx1
0.0089804 -0.0086288 -0.0561580 0.0008591 -0.0173314
Xs(Temp_vrucht)Fx1 Xs(Weight)Fx1
0.0430675 -0.0223692
$gam
Family: gaussian
Link function: identity
Formula:
Cum ~ s(Time, by = category) + category + s(Light) + s(Temp_bloem) +
s(Temp_doek) + s(Temp_kop) + s(Temp_nok) + s(Temp_vrucht) +
s(Weight)
<environment: 0x00000198bd99fc60>
Estimated degrees of freedom:
3.81 8.86 8.94 8.91 5.47 1.91 1.00
1.00 1.00 2.90 1.00 2.80 total = 52.59
lmer.REML score: -479.8367
plot(fit_gam$mer)
par(mfrow = c(4, 2))
plot(fit_gam$gam)
fit_gam_aug<-broom.mixed::augment(fit_gam$gam, confint=TRUE)
fit_gam_mer_aug<-broom.mixed::augment(fit_gam$mer, confint=TRUE)
fit_gam_aug%>%
ggplot(., aes(x=Cum, y=.fitted, colour=.resid))+
geom_point()+
theme_bw()+
facet_grid(~category)+
geom_abline(intercept=0, slope=1, lty=2, col="black")+
scale_color_viridis_c(option="turbo")
fit_gam_mer_aug%>%
ggplot(., aes(x=y, y=.fitted, colour=.resid))+
geom_point()+
theme_bw()+
facet_grid(~category)+
geom_abline(intercept=0, slope=1, lty=2, col="black")+
scale_color_viridis_c(option="turbo")
And let's take a look at how the model performs on the second harvest.
df_pred_harvest2<-df_cumRatio_combined%>%
filter(Harvest==2)%>%
ungroup()%>%
select(-c(Datum, mean))%>%
predict(fit_gam$gam, newdata=.)
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
ggplot(., aes(x=Cum, y=.fitted, col=.resid))+
geom_point(alpha=0.8)+
scale_color_viridis_c(option="magma")+
geom_abline(intercept=0, slope=1, lty=2, col="black")+
facet_grid(~category, scales="free")
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
select(Datum, category, Cum, .fitted)%>%
ungroup()%>%
group_by(category)%>%
mutate(Ratio_obs = Cum - lag(Cum),
Ratio_pred = .fitted - lag(.fitted))%>%
mutate(Possible_pred = if_else(Ratio_pred<0, "No", "Yes"))%>%
ungroup()%>%
ggplot(., aes(x=Ratio_obs, y=Ratio_pred, col=Possible_pred))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")+
labs(x="Observed ratios",
y="Predicted ratios",
col="Possible prediction?")+
theme(legend.position="bottom")
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
select(Datum, category, Cum, .fitted)%>%
ungroup()%>%
group_by(category)%>%
mutate(Ratio_obs = Cum - lag(Cum),
Ratio_pred = .fitted - lag(.fitted))%>%
mutate(Possible_pred = if_else(Ratio_pred<0, "No", "Yes"))%>%
ungroup()%>%
ggplot(.)+
geom_point(aes(x=Datum, y=Ratio_obs), col="grey", lty=2)+
geom_point(aes(x=Datum, y=Ratio_pred, col=Possible_pred))+
facet_wrap(~category, scales="free", ncol=1)+
labs(x="Date",
y="Ratios (observed and predicted)",
col="Possible prediction?")+
theme(legend.position="bottom")
A final check is to add up the ratios per day. The observed ratios add up to one by construction, as they are calculated after the harvest. The predicted ratios must also add up to one to make economic sense; if they do not, the model loses value, since it becomes difficult to apply in the real world.
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
select(Datum, category, Cum, .fitted)%>%
ungroup()%>%
group_by(category)%>%
mutate(Ratio_obs = Cum - lag(Cum),
Ratio_pred = .fitted - lag(.fitted))%>%
group_by(Datum)%>%
mutate(TRD_obs = sum(Ratio_obs),
TRD_pred = sum(Ratio_pred))%>%
ungroup()%>%
select(Datum, TRD_obs, TRD_pred)%>%
distinct()%>%
arrange(Datum)%>%
ggplot()+
geom_point(aes(x=Datum, y=TRD_obs, col="Summed Ratio per Day - Observed"))+
geom_point(aes(x=Datum, y=TRD_pred, col="Summed Ratio per Day - Predicted"))+
labs(x="Date",
y="Summed Ratio per Day",
col="Observed - Predicted")+
theme(legend.position="bottom")
Up until now we have modeled the data using time-series techniques (Prophet, VAR) and linear regression (including mixed-model splines). But none of them can deal with the ordinal part of the data, which is what I will tackle now. Because no matter how bad the predictions are, they must at least sum to one. So let's step into the world of ordinal modelling.
df_Ratio<-production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
ungroup()%>%
dplyr::select(Datum, mean, category, Ratio)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest>0)%>%
mutate(Ratio = tidyr::replace_na(Ratio, 0))%>%
group_by(Harvest, category)%>%
arrange(category,Datum)%>%
mutate(Time = row_number(),
category = factor(category))
Above I showed the steps needed to get the data in order. For an ordinal analysis, the outcome is a single column containing the levels. For the model to ingest the data, it either needs one row per observation, or aggregated rows with a frequency variable. Luckily, I already have the latter structure: each row holds the harvested amount per weight category, which serves as the frequency of that level. Let's first fit a simple model, keeping both harvests in the data and letting the model split on harvest. I want to check whether the ratio predictions make sense and whether the model is set up properly.
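Before fitting, a quick toy sketch (with hypothetical data, not the cucumber set) of why the frequency-weight setup works: fitting MASS::polr() with a frequency weights column should give the same result as expanding the data to one row per observation.

```r
library(MASS)

# Hypothetical aggregated data: counts per ordered category and time point
agg <- data.frame(
  category = factor(rep(c("small", "medium", "large"), times = 2),
                    levels = c("small", "medium", "large")),
  Time     = rep(c(1, 2), each = 3),
  n        = c(5, 3, 2, 2, 3, 5)
)

# Weighted fit on the aggregated rows
fit_wt <- polr(category ~ Time, weights = n, data = agg)

# The same data expanded to one row per observation
long     <- agg[rep(seq_len(nrow(agg)), agg$n), c("category", "Time")]
fit_long <- polr(category ~ Time, data = long)

all.equal(coef(fit_wt), coef(fit_long))  # the two fits agree
```

The real data uses the harvested weight per class as the frequency weight, which relies on the same mechanism.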
model<-MASS::polr(category ~ Time*Harvest,
weights=mean,
data = df_Ratio,
Hess = TRUE)
summary(model)
Call:
MASS::polr(formula = category ~ Time * Harvest, data = df_Ratio,
weights = mean, Hess = TRUE)
Coefficients:
Value Std. Error t value
Time 0.009907 4.245e-05 233.41
Harvest2 0.417498 4.756e-03 87.79
Time:Harvest2 -0.008698 8.998e-05 -96.66
Intercepts:
Value Std. Error t value
2530|3035 -3.9807 0.0067 -594.6270
3035|3540 -1.2113 0.0034 -359.7544
3540|4050 0.5798 0.0032 178.9780
4050|5060 4.2924 0.0051 846.5510
Residual Deviance: 5879895.05
AIC: 5879909.05
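Since polr() fits a cumulative-logit (proportional odds) model, each category probability is the difference of two adjacent cumulative probabilities, so the predicted ratios sum to one by construction. A minimal sketch, plugging in the intercepts and the Time coefficient printed above (Harvest 1, so the Harvest2 terms drop out):

```r
# Intercepts (theta_j) and Time coefficient taken from summary(model)
theta  <- c(-3.9807, -1.2113, 0.5798, 4.2924)
b_time <- 0.009907

eta  <- b_time * 1            # linear predictor at Time = 1, Harvest 1
cum  <- plogis(theta - eta)   # cumulative probabilities P(category <= j)
prob <- diff(c(0, cum, 1))    # per-category probabilities P(category == j)
names(prob) <- c("2530", "3035", "3540", "4050", "5060")

prob
sum(prob)                     # telescoping sum: always 1
```

This is exactly why ordinal models are attractive here: the sum-to-one constraint we had to check by hand for the regression models is built in.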
Let's take a look at the effects inside the model.
effects::Effect(focal.predictors = "Time",model)
Time effect (probability) for 2530
Time
1 30 60 90 100
0.015276207 0.012803988 0.010662177 0.008875421 0.008348458
Time effect (probability) for 3035
Time
1 30 60 90 100
0.1830726 0.1586022 0.1360136 0.1160998 0.1100299
Time effect (probability) for 3540
Time
1 30 60 90 100
0.3989909 0.3822231 0.3608547 0.3363280 0.3276236
Time effect (probability) for 4050
Time
1 30 60 90 100
0.3864700 0.4270672 0.4693289 0.5109778 0.5245657
Time effect (probability) for 5060
Time
1 30 60 90 100
0.01619031 0.01930358 0.02314067 0.02771892 0.02943228
plot(effects::Effect(focal.predictors = "Time",model))
plot(effects::Effect(focal.predictors = c("Time", "Harvest"),model))
Let's split the data, fit on harvest 1, and then predict harvest 2.
model2<-df_Ratio%>%
filter(Harvest==1)%>%
MASS::polr(category ~ Time,
weights=mean,
data = .,
Hess = TRUE)
fitted(model2)%>%
as.data.frame()%>%
mutate(Time = row_number())%>%
tidyr::pivot_longer(cols=`2530`:`5060`,
names_to = "Pred_category",
values_to = "Pred_ratio")%>%
ggplot()+
geom_point(aes(x=Time, y=Pred_ratio, col=Pred_category))
I am pretty sure I know what will happen when I use this model to predict the second harvest.
df_Ratio%>%
filter(Harvest==2)%>%
predict(model2, newdata=., type="probs")%>%
as.data.frame()%>%
mutate(Time = row_number())%>%
tidyr::pivot_longer(cols=`2530`:`5060`,
names_to = "Pred_category",
values_to = "Pred_ratio")%>%
ggplot()+
geom_point(aes(x=Time, y=Pred_ratio, col=Pred_category ))
Let's build another model, this time including the environmental variables as well.
model3<-df_cumRatio_combined%>%
filter(Harvest==1)%>%
mutate(category=factor(category),
Harvest = factor(Harvest))%>%
MASS::polr(category ~ Time*Light + Time*Temp_bloem +
Time*Temp_doek + Time*Temp_kop + Time*Temp_nok +
Time*Temp_vrucht + Time*Weight,
weights=mean,
data = .,
Hess = TRUE)
fitted(model3)%>%
as.data.frame()%>%
mutate(Time = row_number())%>%
tidyr::pivot_longer(cols=`2530`:`5060`,
names_to = "Pred_category",
values_to = "Pred_ratio")%>%
group_by(Pred_category)%>%
arrange(Time)%>%
ggplot()+
geom_point(aes(x=Time, y=Pred_ratio, col=Pred_category ))
df_cumRatio_combined%>%
filter(Harvest==2)%>%
predict(model3, newdata=., type="probs")%>%
as.data.frame()%>%
mutate(Time = row_number())%>%
tidyr::pivot_longer(cols=`2530`:`5060`,
names_to = "Pred_category",
values_to = "Pred_ratio")%>%
ggplot()+
geom_point(aes(x=Time, y=Pred_ratio, col=Pred_category ))
df_Ratio_Test<-df_Ratio%>%
filter(Harvest==2)%>%
ungroup()%>%
dplyr::select(Datum, Ratio, category)%>%
arrange(Datum)
df_cumRatio_combined%>%
filter(Harvest==2)%>%
predict(model3, newdata=., type="probs")%>%
as.data.frame()%>%
mutate(Time = row_number())%>%
filter(Time<92)%>%
tidyr::pivot_longer(cols=`2530`:`5060`,
names_to = "Pred_category",
values_to = "Pred_ratio")%>%
cbind(.,df_Ratio_Test)%>%
ggplot()+
geom_point(aes(x=Datum, y=Ratio), col="black")+
geom_point(aes(x=Datum, y=Pred_ratio, col=Pred_category))+
facet_wrap(~category, scales="free", ncol=1)+
labs(x="Datum",
y="Ratio",
col="Predicted Category")
Although the model could certainly be fine-tuned using splines, it does show me that the proof of concept is attainable.
If this seems like an abrupt end, well, it is! That is because you have spent a lot of time looking at graphs and watching me try different approaches and different models. At a certain moment, all those try-outs lead to a more structured path, and then things can move quite quickly.
Of course, the final model is not there yet, but I do have a sense that I can do what I set out to do: predict the harvest.