Building a harvest model for cucumbers

Modelling the ordinal nature of the data

Dr. Marc Jacobs
27 min read · Feb 24, 2023


WARNING: This is quite an extensive read, but I believe it is necessary to give you a good idea about the case, and about why modelling is all about tries and failures. Quite a lot of tries and failures, actually, and I want to showcase all of them, because that is what modelling is about: not the mere application of techniques for the sake of the technique, but tackling a business question. The business question here was whether the cucumber harvest can be predicted from historical production and environmental data.

Like agriculture, horticulture has seen a steep rise in sensors and data collection, and so the greenhouses in which various plants are placed can be monitored all of the time and in great detail. That means detailed tracking of temperature, humidity, and light at various parts and locations of the greenhouse. The weight of the plant can be tracked as well. The true worth of such extensive tracking remains to be seen though. Just having data is not enough — it needs to become information.

These variables are a means to an end, and that end is understanding the factors that contribute the most to production — in this case the cucumber. Now, biologically, there is already quite some information on how to best grow these plants, but in an ever-changing environment (and an energy transition to boot) it becomes ever more important to optimize the production-cost-sustainability triangle. This is not an easy thing to do.

So what I tried, and will describe in this blog, is using the dataset from a single farmer containing a single year of data, including two harvests. I will connect the production data of the first harvest with the environmental variables of that harvest to predict the second harvest. It is not perfect data, but for a proof-of-concept it will suffice.

Now, before I begin, it is good to understand that cucumbers are classified into weight classes, and that this happens after harvesting. Let's get started!

library(readr)
library(readxl)
library(ggplot2)
theme_set(theme_bw())
library(dplyr)
library(lubridate)
ggplot(df_temp, aes(x=Tijd, y=Temperature, col=bron))+
geom_line(alpha=0.5)+
labs(x="Time")
The first plot shows all the temperature data that I have. As you can see, it represents a lot of measurements at different locations. The gap in the purple series means that there is no information on the temperature of the cucumber there; it marks the gap between the two harvests.
ggplot(df_temp, aes(x=Tijd, y=Temperature, col=bron))+
geom_line(alpha=0.5) + facet_grid(~bron)+
labs(x="Time")
And here we can see the same data, but split per source. As you can see, the temperature data is not equal across sources.
ggplot(df_humidity, aes(x=Tijd, y=Humidity, col=bron))+
geom_line(alpha=0.5) + facet_grid(~bron)
The humidity data per source.
ggplot(df_weight, aes(x=Tijd, y=TotalWeight))+geom_line()+
labs(x="Date")
The weight data of the plant over time. There are quite a few gaps. Unfortunately, we have more weight data than production data, which means that the usability of the weight data is bound by that of the production data. Also, if you look at the last gap, you can clearly see the separation between the first and the second harvest of 2022. The plant needs to grow before the cucumbers can grow, and these plants grow quite fast under optimal conditions. There is a lot a single time-series can show you once you start looking.

One of the many ways to get some grip on highly volatile data is to transform it. Straightforward transformations often employ the log, but the cumulative sum is not bad either. In fact, it will tell you a lot about how the patterns over time are shifting and when events happened. Highly volatile data have a tendency to obscure this because of their volatile nature. So let's apply the cumulative sum to the daily weight data.
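As a minimal sketch of both options side by side on the daily weight means (mirroring the pipeline below), the log and the cumulative sum could be computed like this:

# Compare a log transform with the cumulative sum on the daily mean weight
df_weight%>%
timetk::summarise_by_time(., Tijd, "day", mean=mean(TotalWeight))%>%
mutate(log_mean = log(mean), # the log compresses the volatility
cum = cumsum(mean)) # the cumulative sum turns it into a smooth trend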

df_weight%>%
timetk::summarise_by_time(.,
Tijd,
"day",
mean=mean(TotalWeight))%>%
mutate(Datum = as.Date((Tijd)),
cum = cumsum(mean))%>%
select(-Tijd)%>%
ggplot(.)+
geom_line(aes(x=Datum, y=cum, col="Cumulative Sum"))+
labs(col="Outcome",
x="Date",
y="Value")
Clearly you can see a change in derivatives — the growth never stops, but the slope definitely differs across time.

Let's add temperature as well, but stick with the daily data for now.

df_weight_day<-df_weight%>%
timetk::summarise_by_time(.,
Tijd,
"day",
mean=mean(TotalWeight),
var = var(TotalWeight))%>%
mutate(Datum = as.Date((Tijd)))%>%
select(-Tijd)
df_temp%>%
group_by(bron)%>%
timetk::summarise_by_time(.,
Tijd,
"day",
temp_mean=mean(Temperature),
temp_var = var(Temperature))%>%
mutate(Datum = as.Date((Tijd)))%>%
select(-Tijd)%>%
ggplot(.)+
geom_line(aes(x=Datum, y=temp_mean, col="Temp"))+
geom_line(data=df_weight_day, aes(x=Datum, y=mean, col="Weight"))+
facet_grid(~bron)+
labs(col="Outcome",
x="Date",
y="Value")
As you can see, temperature and weight data follow the same cyclical patterns. This is good to know and it is information that can be used in later stages.

I want to move towards the production data now. It is good to realize that production is measured by the number of cucumbers, and their respective total weight, per square meter, which expresses production relative to the size of the greenhouse. The gvg is the average weight per cucumber: the total number of kilograms harvested divided by the total number of cucumbers. The m2 metrics express the harvest, in kilograms or pieces, per square meter of the land available for harvest.
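As a minimal sketch of how these two metrics could be computed (assuming hypothetical columns kg_total and stuks_total for the daily totals, and area_m2 for the greenhouse area; none of these names appear in the original data):

# Hypothetical computation of the production metrics
production%>%
mutate(gvg = kg_total / stuks_total, # average weight per cucumber (kg)
kg_per_m2 = kg_total / area_m2, # kilograms harvested per square meter
stuks_per_m2 = stuks_total / area_m2) # pieces harvested per square meter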

production_long%>%
filter(category%in%c("gvg", "m2"))%>%
ggplot(., aes(x=Datum, y=mean, col=category))+geom_line()+
labs(col="Outcome",
x="Date",
y="Value")
They are for sure on different scales. As you can see, the gvg is less volatile, which makes it a better metric to compare across harvests.

On to the ratios. I already mentioned that the weight classes are a result, not something that is determined up front. So, what happens is that cucumbers are harvested, counted, and then classified. The classes are always the same, so the ratios of the classes sum to one on any given harvest day.
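A quick sanity check of that claim, as a minimal sketch on the wide production table (assuming the stuks_* piece-count columns used later in this post):

# The per-class piece ratios should sum to exactly 1 on every day with a harvest
classes <- c("stuks_2530","stuks_3035","stuks_3540","stuks_4050","stuks_5060")
ratios <- production[classes] / rowSums(production[classes])
summary(rowSums(ratios)) # 1 on harvest days, NaN on days without a harvest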

production%>%
select(-c(gvg, kg_per_m2))%>%
mutate(kg_st_2530 = kg_2530/stuks_2530,
kg_st_3035 = kg_3035 /stuks_3035,
kg_st_3540 = kg_3540/stuks_3540,
kg_st_4050 = kg_4050/stuks_4050,
kg_st_5060 = kg_5060/stuks_5060)%>%
select(Datum, kg_st_2530,kg_st_3035,kg_st_3540,kg_st_4050,kg_st_5060)%>%
tidyr::pivot_longer(!Datum, names_to = "Outcome",values_to = "value")%>%
mutate(value = value*100)%>%
filter(value>20 & value<75)%>%
ggplot(., aes(x=Datum, y=value, col=Outcome))+
geom_line()+
labs(x="Date")

You can see the weight-to-pieces ratio of each class over time. Presented like this, it is actually not as informative as you might think, but I will show that later on.

If we zoom in, we can slice and dice the data to look for specifics. Here, I am asking the data for the total weight of the cucumbers harvested in the 250–300 gram weight class.

production_long%>%
filter(metric%in%c("kg") & category=="2530")%>%
ggplot(., aes(x=Datum, y=mean, col=category))+geom_line()+
labs(x="Date")
There is a heavy spike at the end of the first harvest, which reflects the clearing of the greenhouse. We see a similar spike at the end of the second harvest.

And I can also ask for the gvg metric which, if you remember correctly, gives me the average weight per cucumber: the total kilograms divided by the total number of cucumbers.

production_long%>%
filter(category%in%c("gvg"))%>%
ggplot(.)+
geom_line(aes(x=Datum,y=mean))+
theme(legend.position = "bottom")+
labs(x="Date")
Measured daily, the zero values represent days in which no cucumbers were harvested.

I want to look closer at the gvg metric and will compare it to the 400–500 gram weight class.


df_gvg<-production_long%>%
filter(category%in%c("gvg"))
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
filter(category =="4050")%>%
ggplot(.)+
geom_point(aes(x=Datum,y=Ratio,col="4050"))+
geom_line(aes(x=Datum,y=Ratio, col="4050"), alpha=0.5)+
geom_point(data=df_gvg, aes(x=Datum, y=mean, col="gvg"))+
geom_line(data=df_gvg, aes(x=Datum, y=mean, col="gvg"), alpha=0.5)+
theme(legend.position = "bottom")+
labs(x="Date",
y="Ratio",
fill="Pieces",
col="GVG")
As you can see, there is some cointegration going on, but not that much. I would not use it directly to connect the two metrics.

And what if we stack all the ratios on top of each other (to sum up to one) and then plot the gvg on top of that? What pattern does it reveal? Which of the weight classes will best mimic the gvg metric?

df_gvg<-production_long%>%
filter(category%in%c("gvg"))
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
ggplot(.)+
geom_bar(aes(x=Datum,y=Ratio,fill=category), stat="identity")+
geom_line(data=df_gvg, aes(x=Datum, y=mean), col="black")+
theme(legend.position = "bottom")+
labs(x="Date",
y="Ratio")
Actually, none of them do, except on the harvest days. The plot looks nice, but carries less information than one would first assume.

Still, I want to dig deeper. So what I am going to do is average the gvg metric to weekly values, create ratios for all weight classes, average them per week, and then see if I can spot changes that cointegrate. The reason for doing all of this is to discern what cucumber production looks like, and whether it has any additional worth in modelling besides the more straightforward connection with environmental variables. Please note that I am interested in both the mean and the variance, since the latter can often tell you more about what is going on than the former.

df_gvg_week<-production_long%>%
filter(category%in%c("gvg"))%>%
timetk::summarise_by_time(.,
Datum,
"week",
value=mean(mean),
var = var(mean))
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("stuks"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
group_by(category)%>%
timetk::summarise_by_time(.,
Datum,
"week",
value=mean(Ratio),
var = var(Ratio))%>%
ggplot(.)+
geom_point(aes(x=Datum,y=value,col=category, size=var))+
geom_point(data=df_gvg_week, aes(x=Datum, y=value, size=var), col="black", alpha=0.1)+
theme(legend.position = "bottom",
plot.title = element_text(color = "black", size = 12, face = "bold"))+
facet_grid(~category)+
labs(x="Date",
y="Mean",
size="variance",
col="Category",
title="Mean kg ratio per category per week over time",
subtitle="in black the average weight per cucumber")
In gray the gvg metric (mean and variance) and in color the ratios. I am looking for the weight class that best mimics the gvg, which would mean that the overall ratio is mirrored by the ratio of a single weight class. That does not seem to be the case, which also means that each weight class carries some specific information by itself. Something to take note of when modelling later on.
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
ggplot(.)+
geom_bar(aes(x=Datum,y=Ratio,fill=category), stat="identity")+
geom_point(aes(x=Datum,y=Ratio), size=0.5)+
facet_grid(~category)+
theme(legend.position = "bottom")+
labs(x="Date",
y="Ratio")
Showing the development of the ratios does highlight that the lowest and highest classes have a limited ratio, meaning that the majority of the data is made up of the other three weight classes. Their respective developments do not seem to have a distinct pattern, except that in the first harvest the 3035 and 3540 decline and the 4050 goes up. The second harvest has a different pattern than the first harvest. Which will increase the fun of modelling.
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
mutate(Weekday = factor(weekdays(Datum),
levels = c("Monday", "Tuesday","Wednesday",
"Thursday","Friday","Saturday",
"Sunday")))%>%
ggplot(.)+
geom_line(aes(x=Datum,y=Ratio,col=category), stat="identity")+
facet_grid(~Weekday)+
theme(legend.position = "bottom")+
labs(x="Date",
y="Ratio")

Trying to see if there is a distinctive pattern by weekday. Not really, except that on most Sundays there is not much going on for the majority of the time. Which also means that the harvest on Monday is a bit higher.
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
mutate(Weekday = factor(weekdays(Datum),
levels = c("Monday", "Tuesday","Wednesday",
"Thursday","Friday","Saturday",
"Sunday")),
Month = lubridate::month(Datum, label=TRUE))%>%
ggplot(.)+
geom_tile(aes(x=category,y=Weekday,fill=Ratio))+
scale_fill_viridis_c(option="magma")+
facet_grid(~Month)+
theme(legend.position = "bottom")+
labs(y="Weekday",
x="Category")
And the same data, but then in heatmaps, with month added. Nothing spectacular.
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
mutate(Weekday = factor(weekdays(Datum),
levels = c("Monday", "Tuesday","Wednesday",
"Thursday","Friday","Saturday",
"Sunday")),
Month = lubridate::month(Datum, label=TRUE))%>%
ggplot(.)+
geom_tile(aes(x=Month,y=Weekday,fill=Ratio))+
scale_fill_viridis_c(option="magma")+
facet_grid(~category)+
theme(legend.position = "bottom")+
labs(y="Weekday",
x="Month")
Changed axes. Once again, nothing spectacular. The heatmap shows the distinction already revealed using different plots.

Since we are dealing with time-series data, it is a good approach to analyze it as a time-series. However, sometimes it is just better to strip as many time-series elements away as possible and look for more distinctive patterns. This allows for more straightforward modelling. And in the case of a time-series, the cumulative sum offers such a way. We already used it in previous plots, but I want to take another look and broaden my perspective.

production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
select(-metric)%>%
group_by(category)%>%
arrange(category, Datum)%>%
mutate(cum = cumsum(mean))%>%
ggplot(.)+
geom_line(aes(x=Datum, y=cum, col=category))+
labs(col="Outcome",
x="Date",
y="Value")
This is data that is much clearer to look at. Easy to spot is the separation of the first and second harvest (where the slope is 0). We can now also see that the different weight classes follow their own respective pattern, and in parallel.
production_long%>%
filter(category%in%c("gvg", "m2"))%>%
group_by(category, metric)%>%
arrange(category, Datum)%>%
mutate(cum = cumsum(mean))%>%
ggplot(.)+
geom_line(aes(x=Datum, y=cum, col=category))+
facet_grid(~metric)+
labs(col="Outcome",
x="Date",
y="Value")
Applying the same function to the overall production metrics reveals a similarity between all three, meaning that they are interchangeable, although they do differ a bit in scale. In the end, the ratio of the weight classes is where I must focus, but it is always good to make sure you have a full view of the data you want to model.
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
ggplot(.)+
geom_tile(aes(x=Datum,y=category, fill=Ratio))+
scale_fill_viridis_c(option="magma")+
labs(x="Date",
y="Outcome",
fill="Ratio")
The ratio of the weight classes as a function of time. Clear to see are the gap between harvests and the days on which no harvest is collected. You can also see the shift in ratios per weight class, in which the heavier weight classes are harvested later.

Or, they represent just the cucumbers that have been left to develop longer, which makes them heavier as they grow over time. Like I said before, it is not that easy to discern whether the cucumbers are plucked by category or whether the category is simply assigned to them afterwards, based on an assessment once the collection has been made. To me, the latter is clearly the case, also knowing that this farmer does not have the capability to screen each and every cucumber and select them based on the estimated weight class.

production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
tidyr::drop_na()%>%
distinct()%>%
ungroup()%>%
select(-c(mean, metric))%>%
arrange(category, Datum)%>%
group_by(category)%>%
mutate(cum=cumsum(Ratio))%>%
ggplot(.)+
geom_line(aes(x=Datum,y=cum, col=category))+
theme(legend.position = "bottom")+
labs(x="Date",
y="Cumulative Sum of the Ratio")
And the cumulative sum of the ratio. Nothing stands out when compared to all the other previous plots.
df_temp_day_all<-df_temp%>%
timetk::summarise_by_time(.,
Tijd,
"day",
mean=mean(Temperature),
var = var(Temperature))%>%
mutate(Datum = as.Date((Tijd)))%>%
select(-Tijd)%>%
mutate(Outcome="Temperature")
df_humidity_day_all<-df_humidity%>%
timetk::summarise_by_time(.,
Tijd,
"day",
mean=mean(Humidity),
var = var(Humidity))%>%
mutate(Datum = as.Date((Tijd)))%>%
select(-Tijd)%>%
mutate(Outcome="Humidity")
ggplot()+
geom_line(data=df_temp_day_all, aes(x=Datum, y=log(mean), col="Temp"))+
geom_line(data=df_weight_day, aes(x=Datum, y=log(mean), col="Weight"))+
geom_line(data=df_humidity_day_all, aes(x=Datum, y=log(mean), col="Humidity"))+
geom_line(data=df_light_day, aes(x=Datum, y=log(mean), col="Light"))+
geom_bar(data=production, aes(x=Datum, y=log(stuks_2530+0.00001), fill="Pieces of 2530"), stat="identity", alpha=0.5)+
labs(col="Outcome",
x="Date",
y="Value",
fill="")
This graph tries to connect the production data of one weight category (2530) to the environmental variables available. Clear to see is that the humidity data stops before our production data starts. The weight data clearly shows when cucumbers are harvested and when not. This graph represents the entire range of variables we will be using to model, so it is good to get some graphical representation of it.

Something that I believed to be of interest (but which later turned out not to be, although I will show it anyhow) is a calculated metric showing the growth of a weight category. It is the total weight of the weight class harvested divided by the number of pieces, which is then again set as a ratio over time. In other words, I want to see if there is a change in the ratio of weight to pieces for a given category. If a difference is detected, then that means there is a bit of a shift in the weight class itself, as the weight class is determined afterwards. The key metric here is time, which represents the number of days from the start (the first day of production data collected). Do note that I am not splitting the data per harvest, yet. So the ratio will be divided by an increasing denominator — one day at a time. This means that the cucumbers not plucked at day x, but at day y, had a longer time to reach the same weight class (which is determined afterwards). And so their growth is slower.
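In formula form (my reading of the code below), the growth-per-day metric for a weight class on day $t$ is

$$\mathrm{gpd}_t = 100 \cdot \frac{kg_t / stuks_t}{t},$$

where $kg_t$ and $stuks_t$ are the kilograms and pieces harvested in that class on day $t$, and $t$ counts the days since the first day of production data.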

growth<-production%>%
select(-c(gvg, kg_per_m2))%>%
mutate(kg_st_2530 = kg_2530 / stuks_2530)%>%
select(Datum, kg_st_2530,kg_2530,stuks_2530)%>%
mutate(time = row_number())%>%
mutate(gpd = (kg_st_2530/time)*100)%>%
tidyr::drop_na()
ggplot(growth, aes(x=time, y=gpd, size=stuks_2530))+
geom_point(alpha=0.7)+ # alpha belongs in the geom, not in ggplot()
labs(x="Time",
y="Growth per day",
size="Number of 25_30")
You can see that growth slows down, which is a function of time. The large black dot is at the end of harvest 1, and the size represents the number of pieces for the weight category 2530. However, the plot looks more informative than it is, because there is almost no deviation. It does represent growth, but the lack of variation shows that I am not really adding anything to the mix. I am probably just re-arranging data, and flipping the coordinates of the cumulative sum shown earlier.
growth<-production%>%
select(-c(gvg, kg_per_m2,stuks_per_m2))%>%
mutate(kg_st_2530 = kg_2530 / stuks_2530,
kg_st_3035 = kg_3035 / stuks_3035,
kg_st_3540 = kg_3540 / stuks_3540,
kg_st_4050 = kg_4050 / stuks_4050,
kg_st_5060 = kg_5060 / stuks_5060)%>%
mutate(time = row_number())%>%
mutate(gpd_25_30 = (kg_st_2530/time)*100,
gpd_30_35 = (kg_st_3035/time)*100,
gpd_35_40 = (kg_st_3540/time)*100,
gpd_40_50 = (kg_st_4050/time)*100,
gpd_50_60 = (kg_st_5060/time)*100)%>%
tidyr::pivot_longer(gpd_25_30:gpd_50_60,
names_to = "Outcome",values_to = "value")%>%
tidyr::drop_na()%>%
filter(value<1)
growth%>%
select(time, Outcome, value)%>%
print(n=30)

# A tibble: 368 x 3
time Outcome value
<int> <chr> <dbl>
1 65 gpd_25_30 0.457
2 65 gpd_30_35 0.533
3 65 gpd_35_40 0.592
4 65 gpd_40_50 0.668
5 65 gpd_50_60 0.840
6 93 gpd_25_30 0.312
7 93 gpd_30_35 0.370
8 93 gpd_35_40 0.359
9 93 gpd_40_50 0.472
10 93 gpd_50_60 0.588
11 94 gpd_25_30 0.312
12 94 gpd_30_35 0.365
13 94 gpd_35_40 0.408
14 94 gpd_40_50 0.463
15 94 gpd_50_60 0.587
16 95 gpd_25_30 0.311
17 95 gpd_30_35 0.361
18 95 gpd_35_40 0.465
19 95 gpd_40_50 0.458
20 95 gpd_50_60 0.577
21 101 gpd_25_30 0.290
22 101 gpd_30_35 0.340
23 101 gpd_35_40 0.380
24 101 gpd_40_50 0.432
25 101 gpd_50_60 0.566
26 114 gpd_25_30 0.256
27 114 gpd_30_35 0.303
28 114 gpd_35_40 0.339
29 114 gpd_40_50 0.390
30 114 gpd_50_60 0.496
# ... with 338 more rows
ggplot(growth, aes(x=time, y=value, col=Outcome))+
geom_point(alpha=0.7)+ # alpha belongs in the geom, not in ggplot()
labs(x="Time",
y="Growth per day",
col="Per outcome")
Same here — the plot looks more informative than it is. At the time, I really believed I had something interesting, but now I fully doubt that I do. It is just another metric that I can use to compare harvests, but if I want to do that, I can just stick with cumulative sums, which are easier to understand.
ggplot()+
geom_point(data=growth, aes(x=Datum, y=value, col=Outcome), alpha=0.7)+
geom_line(data=growth, aes(x=Datum, y=value, col=Outcome), alpha=0.7)+
geom_bar(data = df_gvg, aes(Datum, y=mean), stat="identity", fill="black", alpha=0.1)+
labs(x="Time",
y="Growth per day",
col="Per outcome")
Connecting the growth per day metric I created with the gvg data (which is a growth metric by itself) does not really reveal a pattern to me.
GPD<-production%>%
select(-c(gvg, kg_per_m2,stuks_per_m2))%>%
mutate(kg_st_2530 = kg_2530 / stuks_2530,
kg_st_3035 = kg_3035 / stuks_3035,
kg_st_3540 = kg_3540 / stuks_3540,
kg_st_4050 = kg_4050 / stuks_4050,
kg_st_5060 = kg_5060 / stuks_5060)%>%
mutate(time = row_number())%>%
mutate(gpd_25_30 = (kg_st_2530/time)*100,
gpd_30_35 = (kg_st_3035/time)*100,
gpd_35_40 = (kg_st_3540/time)*100,
gpd_40_50 = (kg_st_4050/time)*100,
gpd_50_60 = (kg_st_5060/time)*100)%>%
tidyr::pivot_longer(gpd_25_30:gpd_50_60,
names_to = "GPD",values_to = "GPD_value")%>%
select(Datum,GPD,GPD_value)
Pieces<-production%>%
select(c(Datum, stuks_2530,stuks_3035,stuks_3540,stuks_4050,stuks_5060))%>%
tidyr::pivot_longer(stuks_2530:stuks_5060,
names_to = "Pieces",values_to = "Pieces_value")%>%
select(Datum,Pieces,Pieces_value)
WeightKG<-production%>%
select(c(Datum, kg_2530,kg_3035,kg_3540,kg_4050,kg_5060))%>%
tidyr::pivot_longer(kg_2530:kg_5060,
names_to = "Weight",values_to = "Weight_value")%>%
select(Datum,Weight,Weight_value)
combined<-cbind(GPD, Pieces[,2:3], WeightKG[,2:3])
combined%>%
filter(GPD_value<=1)%>%
ggplot(., aes(x=Datum, y=GPD_value, col=GPD, size=Pieces_value))+
geom_point()+
labs(x="Date",
y="Growth per day",
col="Size category",
size ="Number of pieces")
The same data as two plots before, but now with the number of pieces mapped to point size. The only thing the plot shows is actually a lack of events.
combined%>%
filter(GPD_value<=1)%>%
ggplot(., aes(x=GPD_value,
size=Weight_value,
y=Pieces_value,
col=GPD))+
geom_point()+
labs(x="Growth per day",
y="Pieces per category harvested",
col="Size category",
size ="Total weight per category harvested")
And shown differently, now as a function of weight and pieces. Clear to see is the boundary to the left, which reflects the natural lower bound of a ratio, meaning that certain combinations cannot go below a certain threshold. For the rest, the picture looks nice but, once again, is not that informative. At least not to me.
combined%>%
filter(GPD_value<=1)%>%
mutate(month = lubridate::month(Datum))%>%
ggplot(., aes(x=GPD_value,
size=Weight_value,
y=Pieces_value,
col=GPD))+
geom_point()+
facet_grid(~month, scales="free", space="free")+
labs(x="Growth per day",
y="Pieces per category harvested",
col="Size category",
size ="Total weight per category harvested")+
theme(legend.position = "bottom")
And the previous plot, but now by month.

Last, but not least for the graphical part: the mean and variance of some of the environmental variables.

# Note: `combined` here is a daily per-source summary of the environmental data
# (mean and variance per Datum, bron, and Outcome), not the production `combined` above
combined%>%
filter(bron=="Vrucht")%>%
ungroup()%>%
select(Datum, mean, var, Outcome)%>%
tidyr::drop_na()%>%
ggplot(., aes(x=Datum,
y=mean,
size=var,
col=Outcome))+
geom_point()+
labs(x="Date",
y="Mean value",
size="Variance value",
col="Outcome")
Humidity is way more variable than temperature — a shame I do not have more data on that metric.
combined%>%
filter(Outcome=="Temperature")%>%
ungroup()%>%
select(Datum, mean, var, bron)%>%
tidyr::drop_na()%>%
ggplot(., aes(x=Datum,
y=mean,
size=var,
col=bron))+
geom_point()+
labs(x="Date",
y="Mean value",
size="Variance value",
col="Source")
And the variance across sources for temperature. Although a grand pattern is followed by all sources, it does make sense to analyze and use them separately. Note especially the gap in the measurements of the cucumber itself.

This concludes the graphical part, which was quite extensive (I warned you), since I did not have prior experience with this kind of data and I wanted to see if the data in itself has sufficient information to support modelling.

In a previous blog post I showed how to use the Prophet algorithm to deploy Fourier transformations and model cyclical data. Some of the data in this dataset are quite cyclical as well, so let's see if this fast algorithm can show me some additional information. I do not expect it will be my end model, especially since the outcome by itself is ordinal in nature.

For this first try, I will focus on the number of cucumbers per m² and add Fourier terms for monthly, weekly, and daily seasonality.

df<-production_long%>%
filter(metric%in%c("stuks_per"))%>%
filter(category%in%c("m2"))%>%
select(Datum, mean)
m1 <- prophet::prophet(weekly.seasonality=FALSE,
yearly.seasonality=FALSE,
daily.seasonality=FALSE,
growth="logistic",
seasonality.mode = 'additive',
changepoint.prior.scale = 0.8,
interval.width = 0.5)
m1 <- prophet::add_seasonality(m1, name='monthly',
period=30.5,
fourier.order=20)
m1 <- prophet::add_seasonality(m1, name='daily',
period=1,
fourier.order=12)
m1 <- prophet::add_seasonality(m1, name='weekly',
period=7,
fourier.order=52)
df$cap <- 2
df$floor <- 0
fit <- df%>%
rename(ds=Datum, y=mean)%>%
select(ds,y, cap, floor)%>%
prophet::fit.prophet(m1,.)
future <- prophet::make_future_dataframe(fit,
periods=31,
freq= "day",
include_history = TRUE)
future$cap <- 2
future$floor <- 0
forecast <- predict(fit, future)
plot(fit, forecast) + prophet::add_changepoints_to_plot(fit)
prophet::prophet_plot_components(fit, forecast)
forecast%>%
dplyr::select(ds,
yhat,
yhat_lower,
yhat_upper)%>%
ggplot(., aes(x=ds, y=yhat))+
geom_point()+
geom_ribbon(aes(x=ds, y=yhat, ymin=yhat_lower, ymax=yhat_upper),
fill="black",alpha=0.2)+
theme_bw()
It tries to model the data and actually does a fairly good job, but this is not the model I will want to use in the end. Also, it is not able to deal with the zero part of the data. (We will later see that no model can, since the zero representing no harvest does not fall on a fixed recurring day. Something to keep in mind now and to see if we can fix later.)

To model time-series, we normally need to make sure that the time-series adheres to certain assumptions, such as stationarity. Let's explore the ratios of each of the weight categories and see if they meet that assumption.

mymts<-production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("stuks"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
ungroup()%>%
select(Datum, mean, category)%>%
tidyr::spread(category, mean)%>%
arrange(Datum)%>%
ungroup()%>%
ts(.,frequency = 365.25,
start=c(lubridate::year(min(.$Datum)),
lubridate::month(min(.$Datum))))
plot(mymts[,2:6])
The original data — each time-series represents a weight category.

We can then apply the ADF test, which will give you a p-value. A better way would be to just take differences and then calculate the autocorrelation. Looking at the plots will provide you more information than the p-value, although the latter of course feels safer.

apply(mymts[, 2:6], 2, tseries::adf.test)
stnry = MTS::diffM(mymts[,2:6], 2)
apply(stnry, 2, tseries::adf.test)
plot.ts(stnry)
On the left, the ADF tests on the original scale; on the right, the ADF tests of the differenced time-series (shown in the middle plot). After differencing, the value is no longer a function of time.
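Since I mentioned the autocorrelation route, here is a minimal sketch of that check on the differenced series (assuming the stnry matrix from above keeps its column names after MTS::diffM):

# ACF of each differenced weight-class series; bars inside the dashed
# bounds suggest differencing removed most of the temporal structure
par(mfrow = c(2, 3))
for (i in 1:ncol(stnry)) acf(stnry[, i], main = colnames(stnry)[i])
par(mfrow = c(1, 1))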

The model that first came to mind, before the Prophet model, was a Vector Autoregressive Model (VAR), which is a type of model I have used before. Its greatest advantage is that it can model the outcomes together and let the time-series cointegrate, if such a cointegration exists.

vars::VARselect(stnry,
type = "none", # type of deterministic regressors to include; we use none because the time-series was made stationary using differencing above
lag.max = 10)
The automatic selection of the model, which carries only endogenous predictors (the outcomes themselves), shows that an autocorrelation of up to eight lags gives the best AIC score. This means that each of the five outcomes will be predicted by its own lag values and the lag values of the other outcomes for up to eight days back. These models become quite heavy quite quickly.
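In equation form (a sketch of what such a model estimates), a VAR(8) on the five differenced class series $y_t \in \mathbb{R}^5$ is

$$y_t = A_1 y_{t-1} + A_2 y_{t-2} + \dots + A_8 y_{t-8} + \varepsilon_t,$$

where each $A_i$ is a 5×5 coefficient matrix, so every outcome is regressed on its own lags and on the lags of the other four outcomes, and $\varepsilon_t$ is the error term. There is no intercept because of type = "none".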

Let's apply the selected lag order to the model and see where we end up.

var.a <- vars::VAR(stnry,
lag.max = 8, # highest lag order for lag length selection according to the chosen ic
ic = "AIC", # information criterion
type = "none", # type of deterministic regressors to include
season = NULL,
exog=NULL)
summary(var.a)
plot(var.a)
Like I said, the model becomes heavy quite quickly. Stay away from the p-values! (PS: this is just the first model; there are four more models in the same output — one for each of the other outcomes.)
The most interesting part of the model! The covariance / correlation plot shows how much sense it makes to model the outcomes together as endogenous. If I see a matrix like this, I would say that it makes a lot of sense. The correlation also gives me the feeling that there is indeed an ordinal nature to the data.
The plots for two of the outcomes coming from the model. The autocorrelation plots are the most interesting. You want them to stay within the boundaries, meaning that you modelled the data properly.

There is a multitude of tests you can perform on time-series data, but the one I like the most, apart from what I already showed, is the stability plot. This one shows me if there are some events (sudden changes) I missed. Events have a tendency of messing up analyses.

Stability1 <- vars::stability(var.a, type = "OLS-CUSUM"); plot(Stability1)
No events missed. Do take note that the data does not differentiate between first and second harvest — it is one long time-series. In a later phase, we will correct this.

And from that model, we can make daily forecasts.

fcast = predict(var.a, n.ahead = 25) # we forecast over a short horizon because beyond short horizon prediction becomes unreliable or uniform
par(mar = c(2.5,2.5,2.5,2.5))
plot(fcast)

fcast = predict(var.a, n.ahead = 365) # and now over a long horizon, to show how quickly the forecasts flatten out
par(mar = c(2.5,2.5,2.5,2.5))
plot(fcast)
Which offers me absolutely nothing! Wow, I did not see this coming.

So, I already mentioned that the data I used up until now was not split by harvest. But it should be! Especially since the aim of the proof-of-concept is to discern if I can predict a harvest by using historical data of previous harvests connected with environmental data. So, let's first discern which harvests we have and how different they really are. We have two harvests. No difference between the two harvests would make this a very easy exercise, but I doubt they are similar.

production_long%>%
filter(category=="gvg")%>%
select(-metric)%>%
print(n=365)

harvest1_start = as.Date("2022-03-14")
harvest1_stop = as.Date("2022-07-12")
harvest2_start = as.Date("2022-08-02")
harvest2_stop = as.Date("2022-10-31")

production_long%>%
filter(category=="gvg")%>%
select(-metric)%>%
ggplot(.)+
geom_line(aes(x=Datum,y=mean))+
geom_vline(xintercept = harvest1_start, col="red")+
geom_vline(xintercept = harvest1_stop, col="red")+
geom_vline(xintercept = harvest2_start, col="blue")+
geom_vline(xintercept = harvest2_stop, col="blue")+
theme(legend.position = "bottom")+
labs(x="Date")
The red lines show the beginning and the end of the first harvest. The blue lines the beginning and end of the second harvest.

The first plot is good, but it would be better to overlay them. I will do that again for the gvg outcome, as I have done before.

production_long%>%
filter(category%in%c("gvg"))%>%
select(Datum, mean)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest>0)%>%
group_by(Harvest)%>%
mutate(Time = row_number())%>%
ggplot(.)+
geom_line(aes(x=Time,y=mean, col=factor(Harvest)))+
theme(legend.position = "bottom")+
labs(x="Days of harvest",
y="GVG",
col="Harvest")
You can clearly see the differences between the harvests, not only in length but also in the days on which no harvest was conducted. And then there is of course the difference in gvg.

Let's build a VAR model in which the data of the first harvest will be used to predict the second harvest. I will also include the exogenous outcomes that I have at my disposal, which are temperature, weight, and light. Humidity I cannot use.

Environ_combined_Harvest_1<-rbind(df_temp_day, 
df_weight_day,
df_light_day)%>%
ungroup()%>%
mutate(Exo = paste(Outcome,"_",bron))%>%
select(-c(bron, Outcome, var))%>%
tidyr::spread(Exo,mean)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest==1)%>%
select(-Harvest)
Harvest_1<-production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("stuks"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
mutate(Ratio = if_else(is.na(Ratio),0,Ratio))%>%
ungroup()%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest==1)%>%
select(Datum, Ratio, category)%>%
tidyr::spread(category, Ratio)
Production_Harvest_1<-Harvest_1%>%
cbind(Environ_combined_Harvest_1[,2:8])%>%
arrange(Datum)%>%
ungroup()%>%
ts(.,frequency = 365.25,
start=c(lubridate::year(min(.$Datum)),
lubridate::month(min(.$Datum))))
> class(Production_Harvest_1)
[1] "mts" "ts" "matrix"
> dim(Production_Harvest_1)
[1] 121 13
plot(Production_Harvest_1[,2:11])  
apply(Production_Harvest_1[, 2:13], 2, tseries::adf.test)
stnry = MTS::diffM(Production_Harvest_1[, 2:13], 2)
apply(stnry, 2, tseries::adf.test)
plot.ts(stnry[,2:11])
Original data and differenced data.

It is quite tricky to run a VAR model with exogenous data in R. It took me a couple of tries to get the data and result matrices set up in such a way that it made sense for the model to do calculations on.

endo_old<-Production_Harvest_1[,2:6]
exo_old<-as.matrix(Production_Harvest_1[,7:12])
var.b <- vars::VAR(endo_old,
lag.max = 10,
ic = "AIC",
type = "none",
season = NULL,
exog=cbind(exo_old))

Now that I have the model I can start looking for ways to predict. The data wrangling, however, is quite heavy.

Environ_combined_Harvest_2<-rbind(df_temp_day, 
df_weight_day,
df_light_day)%>%
ungroup()%>%
mutate(Exo = paste(Outcome,"_",bron))%>%
select(-c(bron, Outcome, var))%>%
mutate(mean = if_else(is.na(mean),0,mean))%>%
tidyr::spread(Exo,mean)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest==2)%>%
select(-Harvest)

Harvest_2<-production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("stuks"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
ungroup()%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest==2)%>%
select(Datum, Ratio, category)%>%
tidyr::spread(category, Ratio)

Production_Harvest_2<-Harvest_2%>%
cbind(Environ_combined_Harvest_2[,2:8])%>%
arrange(Datum)%>%
ungroup()%>%
ts(.,frequency = 365.25,
start=c(lubridate::year(min(.$Datum)),
lubridate::month(min(.$Datum))))

endo_new<-Production_Harvest_2[,2:6]
exo_new1<-as.matrix(data.frame(Production_Harvest_2[,7:12]))
exo_new1[,6]<-0

fcast = predict(var.b,
dumvar=cbind(exo_new1),
n.ahead=6)

df_pred<-as.data.frame(do.call(rbind, fcast$fcst))%>%
mutate(category = rep(c("2530","3035","3540","4050","5060"),
each=dim(exo_new1)[1],
length.out=dim(exo_new1)[1]*5))%>%
mutate(Datum = rep(Environ_combined_Harvest_2$Datum,
length.out=dim(exo_new1)[1]*5))

df_production_ex<-production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(!metric%in%c("stuks"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
ungroup()%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest==2)%>%
select(Datum, Ratio, category)%>%
filter(Datum<"2022-08-15")

df_pred%>%
as_tibble()%>%
group_by(category)%>%
arrange(category, Datum)%>%
filter(Datum<"2022-08-15")%>%
ggplot(.)+
geom_point(aes(x=Datum, y=fcst, col=category))+
geom_ribbon(aes(x=Datum, y=fcst,ymin=lower, ymax=upper, fill=category), alpha=0.5)+
geom_point(data=df_production_ex,
aes(x=Datum,
y=Ratio,
group=category))+
facet_grid(~category, scales="free", space="free")+
labs(x="Date",
y="Ratio of weight catgeory to total",
fill="95% Prediction interval",
col="Weight category")+
theme(legend.position="bottom")
Predicted vs observed for the first six days of the second harvest for which I have data from the exogenous variables. The colored points and bands are the point predictions and prediction intervals of each weight category, respectively. The black points are the observed ratios. This model is performing POORLY.

For those of you who took a close look: I had missing data in the exogenous variables I wanted and needed for forecasting, and what I did was replace the NAs with zero. Which is not harmless. Below you can see the results after applying a Kalman filter instead. You will see the results change.

Production_Harvest_2<-Harvest_2%>%
cbind(Environ_combined_Harvest_2[,2:8])%>%
arrange(Datum)%>%
ungroup()%>%
imputeTS::na_kalman(.) %>%
ts(.,frequency = 365.25,
start=c(lubridate::year(min(.$Datum)),
lubridate::month(min(.$Datum))))
Still not good, but better for sure.
production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(metric%in%c("kg"))%>%
select(Datum, mean, category)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest>0)%>%
group_by(Harvest, category)%>%
mutate(Time = row_number())%>%
ggplot(.)+
geom_line(aes(x=Time,y=mean, col=factor(Harvest)))+
facet_grid(~category)+
theme(legend.position = "bottom")+
labs(x="Days of harvest",
y="KG",
col="Harvest")

What if we model the data in cumulative form? It is much easier to model. So easy, in fact, that we can even start with a simple linear regression.

production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
ungroup()%>%
select(Datum, mean, category, Ratio)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest>0)%>%
mutate(Ratio = tidyr::replace_na(Ratio, 0))%>%
group_by(Harvest, category)%>%
arrange(category,Datum)%>%
mutate(Time = row_number(),
Cum = cumsum(Ratio))%>%
ggplot(.)+
geom_line(aes(x=Time,y=Cum, col=factor(Harvest)))+
facet_grid(~category)+
theme(legend.position = "bottom")+
labs(x="Days of harvest",
y="Cumulative ratio",
col="Harvest")
The cumulative ratio per weight category per harvest. Plotting the data like this shows a whole range of possibilities that will take you away from the complexity of the time-series data.
df_CumRatio<-production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
ungroup()%>%
select(Datum, mean, category, Ratio)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest>0)%>%
mutate(Ratio = tidyr::replace_na(Ratio, 0))%>%
group_by(Harvest, category)%>%
arrange(category,Datum)%>%
mutate(Time = row_number(),
Cum = cumsum(Ratio))


fit<-df_CumRatio%>%
filter(Harvest==1)%>%
lm(Cum~Time*category, data=.)
summary(fit)

Call:
lm(formula = Cum ~ Time * category, data = .)

Residuals:
Min 1Q Median 3Q Max
-2.5883 -0.2030 0.0034 0.3556 3.4782

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.020624 0.162263 0.127 0.89890
Time 0.005069 0.002308 2.196 0.02848 *
category3035 0.736145 0.229474 3.208 0.00141 **
category3540 0.543583 0.229474 2.369 0.01816 *
category4050 -3.238624 0.229474 -14.113 < 2e-16 ***
category5060 -0.103103 0.229474 -0.449 0.65338
Time:category3035 0.112518 0.003265 34.466 < 2e-16 ***
Time:category3540 0.359676 0.003265 110.176 < 2e-16 ***
Time:category4050 0.439812 0.003265 134.723 < 2e-16 ***
Time:category5060 0.003985 0.003265 1.221 0.22266
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8869 on 595 degrees of freedom
Multiple R-squared: 0.996, Adjusted R-squared: 0.9959
F-statistic: 1.643e+04 on 9 and 595 DF, p-value: < 2.2e-16
par(mfrow=c(2,2))
plot(fit)

pred<-broom.mixed::augment(fit)
ggplot(pred, aes(x=Cum, y=.fitted))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")

ggplot(pred, aes(x=Time, y=.resid))+
geom_point()+
geom_smooth(aes(col=category, fill=category))+
facet_wrap(~category, ncol=1, scale="free")
And the results of the linear regression. Not bad, although the residuals are quite messed up. Still, for prediction purposes it is really not that bad. Do take note that NO environmental variables have been included.

Let's see how far this model gets us when predicting the second harvest.

df_pred_harvest2<-df_CumRatio%>%
filter(Harvest==2)%>%
predict(fit, newdata=.)
df_CumRatio%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
ggplot(., aes(x=Cum, y=.fitted))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")
Not bad either, although the graph does have a tendency to make the model look better than it really is. Because if you look at the deviations, and remember that we are looking at cumulative data here, it goes to show that the 3540 group deviates A LOT in its predictions compared to what has been observed.
df_CumRatio%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
ggplot(., aes(x=Cum, y=.fitted, col=.resid))+
geom_point(alpha=0.8)+
scale_color_viridis_c(option="magma")+
geom_abline(intercept=0, slope=1, lty=2, col="black")+
facet_grid(~category, scales="free")
Which is clearly shown here — these are serious deviations. And so the model had a nice start, but is nowhere near ready.

Let's include the environmental variables.

df_cumRatio_combined<-merge(df_CumRatio, Environ_combined_wide, by="Datum")%>%
group_by(category)%>%
arrange(category, Datum)%>%
ungroup()
colnames(df_cumRatio_combined)
colnames(df_cumRatio_combined)<-c("Datum", "mean", "category","Ratio","Harvest","Time",
"Cum","Light","Temp_bloem","Temp_doek","Temp_kop",
"Temp_nok","Temp_vrucht","Weight")
fit2<-df_cumRatio_combined%>%
filter(Harvest==1)%>%
lm(Cum~Time*category + Light + Temp_bloem + Temp_doek + Temp_kop + Temp_nok +
Temp_vrucht + Weight, data=.)

summary(fit2)
Call:
lm(formula = Cum ~ Time * category + Light + Temp_bloem + Temp_doek +
Temp_kop + Temp_nok + Temp_vrucht + Weight, data = .)

Residuals:
Min 1Q Median 3Q Max
-2.6512 -0.2745 0.0064 0.3464 3.3987

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.8882071 1.4334736 -1.317 0.18828
Time 0.0062967 0.0028644 2.198 0.02832 *
category3035 0.7361446 0.2294852 3.208 0.00141 **
category3540 0.5435826 0.2294852 2.369 0.01817 *
category4050 -3.2386240 0.2294852 -14.113 < 2e-16 ***
category5060 -0.1031033 0.2294852 -0.449 0.65339
Light -0.0015367 0.0008004 -1.920 0.05534 .
Temp_bloem 0.2226424 0.2191329 1.016 0.31004
Temp_doek -0.0873605 0.2434585 -0.359 0.71985
Temp_kop 0.0404817 0.2622962 0.154 0.87740
Temp_nok -0.0463816 0.0497856 -0.932 0.35191
Temp_vrucht -0.0349427 0.1956109 -0.179 0.85829
Weight 0.0078287 0.0222947 0.351 0.72560
Time:category3035 0.1125182 0.0032647 34.465 < 2e-16 ***
Time:category3540 0.3596757 0.0032647 110.170 < 2e-16 ***
Time:category4050 0.4398122 0.0032647 134.716 < 2e-16 ***
Time:category5060 0.0039853 0.0032647 1.221 0.22268
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.887 on 588 degrees of freedom
Multiple R-squared: 0.996, Adjusted R-squared: 0.9959
F-statistic: 9242 on 16 and 588 DF, p-value: < 2.2e-16

anova(fit2, fit)

Analysis of Variance Table

Model 1: Cum ~ Time * category + Light + Temp_bloem + Temp_doek + Temp_kop +
Temp_nok + Temp_vrucht + Weight
Model 2: Cum ~ Time * category
Res.Df RSS Df Sum of Sq F Pr(>F)
1 588 462.58
2 595 468.04 -7 -5.4613 0.9917 0.4361
Still messed-up residuals, showing me that I am missing out on temporal structure.
fit3<-df_cumRatio_combined%>%
filter(Harvest==1)%>%
lm(Cum~Time + category + Time*category + Time*Light + Time*Temp_bloem +
Time*Temp_doek + Time*Temp_kop + Time*Temp_nok +
Time*Temp_vrucht + Time*Weight, data=.)
summary(fit3)
par(mfrow=c(2,2))
plot(fit3)
A bit better now, when including the interactions with time, but it can be done better.
anova(fit3, fit2, fit)
Analysis of Variance Table

Model 1: Cum ~ Time + category + Time * category + Time * Light + Time *
Temp_bloem + Time * Temp_doek + Time * Temp_kop + Time *
Temp_nok + Time * Temp_vrucht + Time * Weight
Model 2: Cum ~ Time * category + Light + Temp_bloem + Temp_doek + Temp_kop +
Temp_nok + Temp_vrucht + Weight
Model 3: Cum ~ Time * category
Res.Df RSS Df Sum of Sq F Pr(>F)
1 581 448.97
2 588 462.58 -7 -13.6072 2.5155 0.01487 *
3 595 468.04 -7 -5.4613 1.0096 0.42320
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The output above shows that the ANOVA barely favors the more complex models: the time interactions in fit3 do edge out fit2, but the environmental main effects alone add nothing over the base model.

pred<-broom.mixed::augment(fit3)
ggplot(pred, aes(x=Cum, y=.fitted))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")

ggplot(pred, aes(x=Cum, y=.fitted, col=.resid))+
geom_point(alpha=0.8)+
scale_color_viridis_c(option="magma")+
geom_abline(intercept=0, slope=1, lty=2, col="black")+
facet_grid(~category, scales="free")
However, the fitted values are quite nice.
df_pred_harvest2<-df_cumRatio_combined%>%
filter(Harvest==2)%>%
predict(fit3, newdata=.)
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
ggplot(., aes(x=Cum, y=.fitted))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
ggplot(., aes(x=Cum, y=.fitted, col=.resid))+
geom_point(alpha=0.8)+
scale_color_viridis_c(option="magma")+
geom_abline(intercept=0, slope=1, lty=2, col="black")+
facet_grid(~category, scales="free")
But the predictions for the second harvest are, again, quite bad. Also, the missing environmental variables for temperature will make it impossible to predict the early stages.

Let's see what the predicted vs observed looks like if we transform the data from cumulative sum back to ratio (both observed and predicted).

df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
select(Datum, category, Cum, .fitted)%>%
ungroup()%>%
group_by(category)%>%
mutate(Ratio_obs = Cum - lag(Cum),
Ratio_pred = .fitted - lag(.fitted))%>%
ungroup()%>%
ggplot(., aes(x=Ratio_obs, y=Ratio_pred))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")
Now it becomes quite clear that the observed vs predicted leaves much to be desired.
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
select(Datum, category, Cum, .fitted)%>%
ungroup()%>%
group_by(category)%>%
mutate(Ratio_obs = Cum - lag(Cum),
Ratio_pred = .fitted - lag(.fitted))%>%
mutate(Possible_pred = if_else(Ratio_pred<0, "No", "Yes"))%>%
ungroup()%>%
ggplot(., aes(x=Ratio_obs, y=Ratio_pred, col=Possible_pred))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")+
labs(x="Observed ratios",
y="Predicted ratios",
col="Possible prediction?")+
theme(legend.position="bottom")

And predicting below zero is a definitive no-go. This is not a model that we can deploy.
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
select(Datum, category, Cum, .fitted)%>%
ungroup()%>%
group_by(category)%>%
mutate(Ratio_obs = Cum - lag(Cum),
Ratio_pred = .fitted - lag(.fitted))%>%
mutate(Possible_pred = if_else(Ratio_pred<0, "No", "Yes"))%>%
ungroup()%>%
ggplot(.)+
geom_point(aes(x=Datum, y=Ratio_obs), col="grey", lty=2)+
geom_point(aes(x=Datum, y=Ratio_pred, col=Possible_pred))+
facet_wrap(~category, scales="free", ncol=1)+
labs(x="Date",
y="Ratios (observed and predicted)",
col="Possible prediction?")+
theme(legend.position="bottom")
Which is again shown here.

So, let's first see if we can tackle the temporal issues by deploying a Generalized Additive Model, which is basically made up of splines. I really like these models, and in this case I am using the mixed-model variant, which means including random components.

fit_gam <- df_cumRatio_combined%>%
filter(Harvest==1)%>%
mutate(category=factor(category))%>%
gamm4::gamm4(Cum~s(Time, by=category)+
category+
s(Light)+
s(Temp_bloem)+
s(Temp_doek)+
s(Temp_kop)+
s(Temp_nok)+
s(Temp_vrucht)+
s(Weight),
data=.,
random=~(1+Time|category))
fit_gam

$mer
Linear mixed model fit by REML ['lmerMod']
REML criterion at convergence: -479.8367
Random effects:
Groups Name Std.Dev. Corr
Xr s(Time):category2530 3.446e-01
Xr.0 s(Time):category3035 7.816e+00
Xr.1 s(Time):category3540 1.216e+01
Xr.2 s(Time):category4050 9.618e+00
Xr.3 s(Time):category5060 7.583e-01
Xr.4 s(Light) 4.878e-02
Xr.5 s(Temp_bloem) 0.000e+00
Xr.6 s(Temp_doek) 0.000e+00
Xr.7 s(Temp_kop) 0.000e+00
Xr.8 s(Temp_nok) 1.104e-01
Xr.9 s(Temp_vrucht) 4.406e-05
Xr.10 s(Weight) 8.040e-02
category (Intercept) 8.632e+00
Time 6.051e+00 -0.44
Residual 1.309e-01
Number of obs: 605, groups:
Xr, 8; Xr.0, 8; Xr.1, 8; Xr.2, 8; Xr.3, 8; Xr.4, 8; Xr.5, 8; Xr.6, 8; Xr.7, 8; Xr.8, 8; Xr.9, 8; Xr.10, 8; category, 5
Fixed Effects:
X(Intercept) Xcategory3035 Xcategory3540 Xcategory4050 Xcategory5060
0.3297748 7.5998190 22.4838782 23.5900030 0.1400601
Xs(Time):category2530Fx1 Xs(Time):category3035Fx1 Xs(Time):category3540Fx1 Xs(Time):category4050Fx1 Xs(Time):category5060Fx1
0.1535347 6.0360774 13.4338150 10.5357889 0.6199651
Xs(Light)Fx1 Xs(Temp_bloem)Fx1 Xs(Temp_doek)Fx1 Xs(Temp_kop)Fx1 Xs(Temp_nok)Fx1
0.0089804 -0.0086288 -0.0561580 0.0008591 -0.0173314
Xs(Temp_vrucht)Fx1 Xs(Weight)Fx1
0.0430675 -0.0223692

$gam

Family: gaussian
Link function: identity

Formula:
Cum ~ s(Time, by = category) + category + s(Light) + s(Temp_bloem) +
s(Temp_doek) + s(Temp_kop) + s(Temp_nok) + s(Temp_vrucht) +
s(Weight)
<environment: 0x00000198bd99fc60>

Estimated degrees of freedom:
3.81 8.86 8.94 8.91 5.47 1.91 1.00
1.00 1.00 2.90 1.00 2.80 total = 52.59

lmer.REML score: -479.8367
plot(fit_gam$mer)
I am diving straight to this plot, which was messed up in all the previous regression plots (remember the tentacle-like structure). This means the splines tackled the temporal issues I missed in the simple linear regression.
par(mfrow = c(4, 2))
plot(fit_gam$gam)
You can see that the splines are not that informative, but that is because I am modelling cumulative data, so they behave quite linearly. The intersection at time 80 (80 days in) does startle me; I am not exactly sure why it is there.
fit_gam_aug<-broom.mixed::augment(fit_gam$gam, confint=TRUE)
fit_gam_mer_aug<-broom.mixed::augment(fit_gam$mer, confint=TRUE)
fit_gam_aug%>%
ggplot(., aes(x=Cum, y=.fitted, colour=.resid))+
geom_point()+
theme_bw()+
facet_grid(~category)+
geom_abline(intercept=0, slope=1, lty=2, col="black")+
scale_color_viridis_c(option="turbo")
fit_gam_mer_aug%>%
ggplot(., aes(x=y, y=.fitted, colour=.resid))+
geom_point()+
theme_bw()+
facet_grid(~category)+
geom_abline(intercept=0, slope=1, lty=2, col="black")+
scale_color_viridis_c(option="turbo")
The predictions coming from the GAM part of the model (no random components) and the MER part of the model (random components included). The latter clearly outperforms the former, but I do worry about overfitting.

And let's take a look at how the model performs on the second harvest.

df_pred_harvest2<-df_cumRatio_combined%>%
filter(Harvest==2)%>%
ungroup()%>%
select(-c(Datum, mean))%>%
predict(fit_gam$gam, newdata=.)
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
ggplot(., aes(x=Cum, y=.fitted, col=.resid))+
geom_point(alpha=0.8)+
scale_color_viridis_c(option="magma")+
geom_abline(intercept=0, slope=1, lty=2, col="black")+
facet_grid(~category, scales="free")
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
select(Datum, category, Cum, .fitted)%>%
ungroup()%>%
group_by(category)%>%
mutate(Ratio_obs = Cum - lag(Cum),
Ratio_pred = .fitted - lag(.fitted))%>%
mutate(Possible_pred = if_else(Ratio_pred<0, "No", "Yes"))%>%
ungroup()%>%
ggplot(., aes(x=Ratio_obs, y=Ratio_pred, col=Possible_pred))+
geom_point()+
geom_abline(intercept=0, slope=1, lty=2, col="red")+
facet_grid(~category, scales="free")+
labs(x="Observed ratios",
y="Predicted ratios",
col="Possible prediction?")+
theme(legend.position="bottom")
Not bad, but not really that good either. And then there is the issue of negative ratios, which cannot exist.
df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
select(Datum, category, Cum, .fitted)%>%
ungroup()%>%
group_by(category)%>%
mutate(Ratio_obs = Cum - lag(Cum),
Ratio_pred = .fitted - lag(.fitted))%>%
mutate(Possible_pred = if_else(Ratio_pred<0, "No", "Yes"))%>%
ungroup()%>%
ggplot(.)+
geom_point(aes(x=Datum, y=Ratio_obs), col="grey", lty=2)+
geom_point(aes(x=Datum, y=Ratio_pred, col=Possible_pred))+
facet_wrap(~category, scales="free", ncol=1)+
labs(x="Date",
y="Ratios (observed and predicted)",
col="Possible prediction?")+
theme(legend.position="bottom")
Here you can see the prediction for each time-point of the second harvest and when a negative ratio is predicted.

A final check is done by adding up the ratios. We know that the observed ratios add up to one, as they are calculated afterwards. The predicted ratios must also add up to one to make sense, economically. If the predictions do not, the model has less value, since it becomes difficult to apply in the real world.

df_cumRatio_combined%>%
filter(Harvest==2)%>%
cbind(., .fitted=df_pred_harvest2)%>%
mutate(.resid=.fitted-Cum)%>%
select(Datum, category, Cum, .fitted)%>%
ungroup()%>%
group_by(category)%>%
mutate(Ratio_obs = Cum - lag(Cum),
Ratio_pred = .fitted - lag(.fitted))%>%
group_by(Datum)%>%
mutate(TRD_obs = sum(Ratio_obs),
TRD_pred = sum(Ratio_pred))%>%
ungroup()%>%
select(Datum, TRD_obs, TRD_pred)%>%
distinct()%>%
arrange(Datum)%>%
ggplot()+
geom_point(aes(x=Datum, y=TRD_obs, col="Summed Ratio per Day - Observed"))+
geom_point(aes(x=Datum, y=TRD_pred, col="Summed Ratio per Day - Predicted"))+
labs(x="Date",
y="Summed Ratio per Day",
col="Observed - Predicted")+
theme(legend.position="bottom")
As we can see, the summed predicted ratios seldom approximate one. In fact, the plot almost looks like a residual plot (which makes perfect sense considering the nature of the data).
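A quick numeric summary makes the same point; this sketch assumes the summed ratios from the pipeline above are stored in a data frame called df_trd (a hypothetical name) with a TRD_pred column:

# Summarise how far the summed predicted ratios stray from one.
# df_trd is a hypothetical name for the result of the pipeline above.
df_trd %>%
  summarise(mean_pred = mean(TRD_pred, na.rm = TRUE),
            min_pred  = min(TRD_pred, na.rm = TRUE),
            max_pred  = max(TRD_pred, na.rm = TRUE))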

Up until now we have modelled the data using time-series techniques (Prophet, VAR) and linear regression (including mixed-model splines), but none of them can deal with the ordinal part of the data. That is what I will tackle now, because an ordinal model guarantees that the predicted ratios sum to one, no matter how good or bad they are. So let's step into the world of ordinal modelling.

df_Ratio<-production_long%>%
filter(!category%in%c("gvg", "m2"))%>%
filter(metric%in%c("kg"))%>%
group_by(Datum, metric)%>%
mutate(Ratio = round(mean / sum(mean), 2))%>%
ungroup()%>%
dplyr::select(Datum, mean, category, Ratio)%>%
mutate(Harvest = if_else(Datum %within% lubridate::interval(harvest1_start,
harvest1_stop),1,
if_else(Datum %within% lubridate::interval(harvest2_start,
harvest2_stop),2,0)))%>%
filter(Harvest>0)%>%
mutate(Ratio = tidyr::replace_na(Ratio, 0))%>%
group_by(Harvest, category)%>%
arrange(category,Datum)%>%
mutate(Time = row_number(),
category = factor(category))

Above I showed the steps necessary to get the data in order. For ordinal analysis, the outcome is a single column containing levels. For the model to ingest the data, it either needs one row per observation, or summed rows with a frequency variable. Luckily, I already have the data in the latter structure (each row shows the amount harvested per weight category, thereby representing the frequency of each level). Let's first fit a simple model, splitting the data by harvest inside the model, to check whether the ratio predictions make sense and the model is set up properly.
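To make the two equivalent layouts concrete, here is a toy sketch (all names and values are illustrative):

# Frequency layout: one row per level per day, with a weight column.
freq_layout <- data.frame(
  Datum    = as.Date("2022-05-01"),
  category = factor(c("2530", "3035", "3540"), ordered = TRUE),
  mean     = c(10, 40, 25)   # kg harvested per weight class
)
# Long layout: one row per observation, obtained by expanding the weights.
long_layout <- freq_layout[rep(seq_len(nrow(freq_layout)), freq_layout$mean), ]
# MASS::polr accepts either: the long layout directly, or the
# frequency layout via the `weights` argument, as done below.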

model<-MASS::polr(category ~ Time*Harvest, 
weights=mean,
data = df_Ratio,
Hess = TRUE)
summary(model)

Call:
MASS::polr(formula = category ~ Time * Harvest, data = df_Ratio,
weights = mean, Hess = TRUE)

Coefficients:
Value Std. Error t value
Time 0.009907 4.245e-05 233.41
Harvest2 0.417498 4.756e-03 87.79
Time:Harvest2 -0.008698 8.998e-05 -96.66

Intercepts:
Value Std. Error t value
2530|3035 -3.9807 0.0067 -594.6270
3035|3540 -1.2113 0.0034 -359.7544
3540|4050 0.5798 0.0032 178.9780
4050|5060 4.2924 0.0051 846.5510

Residual Deviance: 5879895.05
AIC: 5879909.05
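The coefficients are on the log cumulative-odds scale; exponentiating turns them into odds ratios, which are easier to communicate. A quick sketch (the profile-likelihood intervals may take a moment to compute):

# Coefficients are log cumulative odds; exponentiate for odds ratios.
exp(coef(model))
# Profile-likelihood confidence intervals, also on the odds-ratio scale.
exp(confint(model))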

Let's showcase the elements inside the model.

effects::Effect(focal.predictors = "Time",model)

Time effect (probability) for 2530
Time
1 30 60 90 100
0.015276207 0.012803988 0.010662177 0.008875421 0.008348458

Time effect (probability) for 3035
Time
1 30 60 90 100
0.1830726 0.1586022 0.1360136 0.1160998 0.1100299

Time effect (probability) for 3540
Time
1 30 60 90 100
0.3989909 0.3822231 0.3608547 0.3363280 0.3276236

Time effect (probability) for 4050
Time
1 30 60 90 100
0.3864700 0.4270672 0.4693289 0.5109778 0.5245657

Time effect (probability) for 5060
Time
1 30 60 90 100
0.01619031 0.01930358 0.02314067 0.02771892 0.02943228
plot(effects::Effect(focal.predictors = "Time",model))
plot(effects::Effect(focal.predictors = c("Time", "Harvest"),model))
Looking good. I am seeing patterns in the model that I recognize from the ratio plots made on the raw data.
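These probabilities follow directly from the proportional-odds formula P(Y <= j) = plogis(zeta_j - eta), with eta the linear predictor. A hand calculation for harvest 1 (Harvest2 = 0) at Time = 1; note that effects::Effect averages over the other predictors, so these numbers will differ slightly from the output above:

# Reproduce category probabilities from the fitted coefficients.
eta   <- 0.009907 * 1                           # Time coefficient x Time
zeta  <- c(-3.9807, -1.2113, 0.5798, 4.2924)    # fitted cut-points
cum_p <- plogis(zeta - eta)                     # P(Y <= j) per cut-point
diff(c(0, cum_p, 1))                            # category probabilities, sum to 1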

Let's split the data, fit the model on harvest 1, and then predict for harvest 2.

model2<-df_Ratio%>%
filter(Harvest==1)%>%
MASS::polr(category ~ Time,
weights=mean,
data = .,
Hess = TRUE)
fitted(model2)%>%
as.data.frame()%>%
mutate(Time = row_number())%>%
tidyr::pivot_longer(cols=`2530`:`5060`,
names_to = "Pred_category",
values_to = "Pred_ratio")%>%
ggplot()+
geom_point(aes(x=Time, y=Pred_ratio, col=Pred_category))
This IS strange at first sight: the exact same predictions, copied. Each set of predictions follows the length of the first harvest, and they are copied five times because df_Ratio contains five rows per day (one per weight category); fitted() returns a row of probabilities for every row of the data, and rows sharing the same Time value get identical probabilities.
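Deduplicating the fitted probabilities collapses the five identical copies into one prediction per time point; a sketch:

# fitted() gives one identical row per weight-class row of the data,
# so distinct() leaves one row of probabilities per time point.
fitted(model2)%>%
as.data.frame()%>%
distinct()%>%
mutate(Time = row_number())%>%
tidyr::pivot_longer(cols=`2530`:`5060`,
names_to = "Pred_category",
values_to = "Pred_ratio")%>%
ggplot()+
geom_point(aes(x=Time, y=Pred_ratio, col=Pred_category))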

I am pretty sure I know what will happen when I use this model to predict for the second harvest.

df_Ratio%>%
filter(Harvest==2)%>%
predict(model2, newdata=., type="probs")%>%
as.data.frame()%>%
mutate(Time = row_number())%>%
tidyr::pivot_longer(cols=`2530`:`5060`,
names_to = "Pred_category",
values_to = "Pred_ratio")%>%
ggplot()+
geom_point(aes(x=Time, y=Pred_ratio, col=Pred_category ))
No surprise here on the copying. The predictions themselves look good in the sense that they behave like proper ratios (meaning that they sum to one).
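That they sum to one is guaranteed by the model, since polr predicts a full probability distribution per row, but it is cheap to verify:

# Each row of the prediction matrix should sum to (numerically) one.
probs<-df_Ratio%>%
filter(Harvest==2)%>%
predict(model2, newdata=., type="probs")
summary(rowSums(probs))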

Let's build another model, this time including the environmental variables as well.

model3<-df_cumRatio_combined%>%
filter(Harvest==1)%>%
mutate(category=factor(category),
Harvest = factor(Harvest))%>%
MASS::polr(category ~ Time*Light + Time*Temp_bloem +
Time*Temp_doek + Time*Temp_kop + Time*Temp_nok +
Time*Temp_vrucht + Time*Weight,
weights=mean,
data = .,
Hess = TRUE)
fitted(model3)%>%
as.data.frame()%>%
mutate(Time = row_number())%>%
tidyr::pivot_longer(cols=`2530`:`5060`,
names_to = "Pred_category",
values_to = "Pred_ratio")%>%
group_by(Pred_category)%>%
arrange(Time)%>%
ggplot()+
geom_point(aes(x=Time, y=Pred_ratio, col=Pred_category ))
df_cumRatio_combined%>%
filter(Harvest==2)%>%
predict(model3, newdata=., type="probs")%>%
as.data.frame()%>%
mutate(Time = row_number())%>%
tidyr::pivot_longer(cols=`2530`:`5060`,
names_to = "Pred_category",
values_to = "Pred_ratio")%>%
ggplot()+
geom_point(aes(x=Time, y=Pred_ratio, col=Pred_category ))
Looks good. Plenty of variation, but within boundaries. Let's overlay it with the observed data.
df_Ratio_Test<-df_Ratio%>%
filter(Harvest==2)%>%
ungroup()%>%
dplyr::select(Datum, Ratio, category)%>%
arrange(Datum)
df_cumRatio_combined%>%
filter(Harvest==2)%>%
predict(model3, newdata=., type="probs")%>%
as.data.frame()%>%
mutate(Time = row_number())%>%
filter(Time<92)%>%
tidyr::pivot_longer(cols=`2530`:`5060`,
names_to = "Pred_category",
values_to = "Pred_ratio")%>%
cbind(.,df_Ratio_Test)%>%
ggplot()+
geom_point(aes(x=Datum, y=Ratio), col="black")+
geom_point(aes(x=Datum, y=Pred_ratio, col=Pred_category))+
facet_wrap(~category, scales="free", ncol=1)+
labs(x="Datum",
y="Ratio",
col="Predicted Category")
Looking good again! Not perfect, but much more in the range of what I would like to see. This is something we can use.
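"Looking good" can be quantified. A sketch of the mean absolute error per weight class, reusing the prediction pipeline above (so the same row-alignment assumptions apply):

# Mean absolute error between observed and predicted ratios per class.
df_cumRatio_combined%>%
filter(Harvest==2)%>%
predict(model3, newdata=., type="probs")%>%
as.data.frame()%>%
mutate(Time = row_number())%>%
filter(Time<92)%>%
tidyr::pivot_longer(cols=`2530`:`5060`,
names_to = "Pred_category",
values_to = "Pred_ratio")%>%
cbind(.,df_Ratio_Test)%>%
group_by(category)%>%
summarise(MAE = mean(abs(Ratio - Pred_ratio), na.rm=TRUE))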

Although the model could surely be fine-tuned using splines, it does show me that the proof-of-concept is attainable.
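For that spline fine-tuning, mgcv offers an ordered-categorical family, so one possible next step could look like the sketch below. This is an assumption-laden sketch, not something fitted to this data: df_model is a hypothetical harvest-1 subset of df_cumRatio_combined, and ocat expects the response as integer levels 1..R.

library(mgcv)
# Ordinal model with smooth (spline) effects via mgcv's ocat family.
# df_model is a hypothetical name; the response must be integer-coded.
fit_ocat <- gam(as.integer(category) ~ s(Time) + s(Light) + s(Weight),
                family = ocat(R = 5),
                weights = mean,
                data = df_model)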

If this seems like an abrupt end, well, it is! That is because you have spent a lot of time looking at graphs and watching me try different approaches and different models. At a certain moment, all those try-outs lead to a more structured path, and then things can move quite quickly.

Of course, the final model is not there yet, but I do have a sense that I can do what I set out to do: predict the harvest.

