MLOps Blog

Managing Dataset Versions in Long-Term ML Projects

7 min
8th November, 2023

A long-term ML project involves developing and sustaining applications or systems that leverage machine learning models, algorithms, and techniques. Over the lifespan of these apps and systems, the associated ML models need to be constantly updated, redeployed, and maintained, which means they require proper dataset version management.

An example of a long-term ML project is a bank fraud detection system powered by ML models and algorithms for pattern recognition. The ML models within fraud detection require constant updating to keep up with the evolving domain of fraud detection, significant pattern changes, and the changing behavior of customers and fraudsters. In such ML projects, you can expect to find large volumes of data accumulated over time, complex algorithms, and increasing compute resources; all of these are characteristics of a maturing ML project.

As ML projects mature, the datasets used may change due to new technical requirements or industry shifts. For an ML project to be robust against such unexpected changes, tracking and maintaining different dataset versions is required. However, dataset version management can be a pain for maturing ML teams, mainly due to the following:

  • 1 Managing large data volumes without utilizing data management platforms.
  • 2 Ensuring and maintaining high-quality data.
  • 3 Incorporating additional data sources.
  • 4 The time-consuming process of labeling new data points.
  • 5 Robust security and privacy policies and infrastructure, which are necessary for handling more sensitive data.

Failure to consider the severity of these problems can lead to issues like degraded model accuracy, data drift, security issues, and data inconsistencies. In the upcoming sections, we will delve into the specific challenges associated with dataset version management in long-term ML projects that ML teams often encounter, and the solutions that can be adopted to mitigate them.

Learn more

Version Control for Machine Learning and Data Science

Dataset version management challenges

Data storage and retrieval

As a machine learning project advances in its lifecycle, its demand for data also increases. For example, a machine learning algorithm designed to predict customer purchase behavior on an e-commerce site will require changing data points due to factors such as the evolving commerce industry, changing customer behavior, and unpredictable commercial markets. Additional data points will likely need to be captured over time to ensure data relevance and good prediction capability.

This increase in data requirements leads to the demand for data management solutions, one of them being data versioning. However, in scenarios where dataset versioning solutions are leveraged, there can still be various challenges experienced by ML/AI/Data teams.

  • Data aggregation: Data sources could increase as more data points are required to train ML models. Existing data pipelines will have to be modified to accommodate new data sources. New dataset versions might have structures that differ from previous versions, and keeping track of which data sources correspond to which dataset version can become a challenge.
  • Data preprocessing: As dataset structures evolve over time and across versions, data processing pipelines must be maintained to meet new data structure requirements and, at times, remain backward compatible with the data structures of previous dataset versions.
  • Data storage: In long-term projects, the horizontal and vertical scalability of storage becomes paramount as existing capacity is exhausted over the project’s lifespan.
  • Data retrieval: With several dataset versions in play, machine learning practitioners need to know which dataset version corresponds to a given model performance outcome.

Data drift and concept drift

Machine learning projects that handle large, evolving, and complex data may face two challenges: concept drift and data drift. 

Concept drift happens when the relationship between the input data and the target variable changes, while data drift occurs when the distribution of the data itself changes across the entire dataset or dataset partitions, such as the test and training data. Both can degrade the performance of machine learning models over time, leading to inaccurate results and decreased confidence in the model’s predictions.

More specifically, in long-term ML projects, data and concept drift, if allowed to persist, result in poor ML model performance and reduced effectiveness. For a commercial system dependent on ML models and algorithms, unaddressed concept and data drift can decrease user satisfaction and negatively impact the company’s revenue.

For example, imagine a company using a machine learning model to predict product demand based on historical sales data. Over time, the company’s sales data may change due to external factors such as economic trends, changes in consumer preferences, or new competitors entering the market. Suppose the machine learning model is not updated regularly to reflect these changes. In that case, it may begin to make inaccurate predictions, resulting in stock shortages or overproduction, ultimately leading to lost revenue.
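A lightweight drift check makes this concrete: compare a feature’s production distribution against the training-time reference and flag large gaps. The sketch below uses a hand-rolled two-sample Kolmogorov-Smirnov statistic on synthetic data; the feature values and the 0.1 threshold are purely illustrative.

```python
import random

def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    max_gap = 0.0
    for x in sorted(set(a) | set(b)):
        cdf_a = sum(1 for v in a if v <= x) / len(a)
        cdf_b = sum(1 for v in b if v <= x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap

random.seed(0)
reference = [random.gauss(0, 1) for _ in range(1000)]   # training-time distribution
current = [random.gauss(0.5, 1) for _ in range(1000)]   # shifted production data

drift_score = ks_statistic(reference, current)
# Flag drift when the gap exceeds a chosen threshold (e.g. 0.1);
# in practice the threshold is tuned per feature.
print("drift detected:", drift_score > 0.1)
```

In production, this comparison would run per feature on each new dataset version, with drift results logged next to the version tag so degradations can be traced back to the data.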

Data annotation and preprocessing

Common data preprocessing tasks for machine learning | Source: Author

Long-term machine learning projects require managing large amounts of data. However, manual data annotation can become challenging and lead to inconsistencies, making maintaining annotation quality and consistency difficult. 

Data preprocessing transforms raw data into a format suitable for analysis and modeling. As long-term projects progress, preprocessing techniques may change to better address the problem at hand and improve model performance. This can lead to multiple dataset versions, each with its own set of preprocessing techniques. Such challenges include, but are not limited to: 

  • 1 Complex data growth can result in time-consuming and resource-intensive preprocessing tasks, leading to longer training times and decreased efficiency.
  • 2 The lack of standardization in preprocessing steps can result in variations in the quality and accuracy of the data, which can negatively impact model training and performance.
  • 3 Maintaining consistency and quality of preprocessed data over time is also challenging, especially when the original raw data changes.
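One way to keep preprocessing and dataset versions in sync is to derive the version tag from both the raw data and the preprocessing configuration, so any change to either produces a new, traceable version. A minimal sketch, with hypothetical config fields:

```python
import hashlib
import json

def preprocessing_fingerprint(raw_rows, config):
    """Derive a deterministic version id from the raw data and the
    preprocessing configuration, so the same inputs always map to
    the same dataset version tag."""
    payload = json.dumps({"rows": raw_rows, "config": config}, sort_keys=True)
    return hashlib.sha256(payload.encode()).hexdigest()[:12]

rows = [[1.0, 2.0], [3.0, 4.0]]
config_v1 = {"scale": "minmax", "impute": "median"}
config_v2 = {"scale": "zscore", "impute": "median"}

tag_v1 = preprocessing_fingerprint(rows, config_v1)
tag_v2 = preprocessing_fingerprint(rows, config_v2)
print(tag_v1 != tag_v2)  # a changed pipeline yields a new dataset version
```

Because the tag is deterministic, re-running an unchanged pipeline on unchanged data reproduces the same version, which directly addresses the consistency problem above.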

Data security and privacy

Data security and privacy are essential considerations in managing and versioning datasets for machine learning projects. Data security measures protect data from unauthorized access, modification, or destruction during data storage, transmission, and processing. On the other hand, data privacy protects personal information and individuals’ rights to control how their data is collected, used, and shared.

However, data security and privacy pose a substantial challenge when managing and versioning datasets for machine learning projects. Securing sensitive data in machine learning pipelines is critical to avoid the severe outcomes of a security breach or unauthorized access to data, such as reputational damage, legal liability, and loss of trust from customers and stakeholders.

Data privacy is particularly crucial when dealing with personal data due to strict government requirements for collecting, processing, and storing such data. In data version management, the challenge is ensuring adequate personal and sensitive data protection throughout the versioning process.
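A common mitigation is to pseudonymize direct identifiers before a dataset enters the versioning system, so no stored version contains the raw values. A minimal sketch using a keyed hash; the field names and key handling are illustrative, and a real system would keep the key in a secrets manager:

```python
import hashlib
import hmac

SECRET_KEY = b"rotate-me"  # illustrative only; store in a secrets manager

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a keyed hash so records stay
    joinable across dataset versions without exposing the raw value."""
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

record = {"customer_email": "jane@example.com", "amount": 42.0}
safe_record = {**record, "customer_email": pseudonymize(record["customer_email"])}
print(safe_record["customer_email"] != record["customer_email"])
```

Using a keyed hash (rather than a plain hash) means the same identifier maps to the same token in every dataset version, while an attacker without the key cannot reverse or brute-force common values as easily.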

Best practices for managing dataset versions

As we just discussed, in long-term machine learning projects, managing the different versions of the dataset can become complex and time-consuming as the importance of data in driving decision-making and business outcomes increases. Machine learning teams must have a robust strategy for managing dataset versions in such projects. 

This section explores best practices that address these challenges. These practices offer data scientists, data engineers, and machine learning engineers a comprehensive guide to managing dataset versions in a project that is expected to run for a long time. Let’s go through them one by one.

Data storage and retrieval management

Storage

Data storage involves allocating datasets to physical, virtual, or hybrid storage devices like hard drives, cloud storage, or solid-state drives. These devices can be located on-premise or in data centers and aim to provide secure and accessible storage solutions for datasets used in machine learning projects.

ML teams can consider the following best practices associated with data storage:

  1. Centralized Storage: Use centralized storage for easy access and tracking of dataset versions by all team members. Tools like neptune.ai provide centralized storage for metadata about datasets, experiments, and models, enabling easy data traceability.

Read more

How to Version and Compare Datasets in neptune.ai

  2. Data Backup and Recovery: Have a data storage platform that supports a contingency plan for unexpected data loss and deletion, which can be quite common in a long-duration project. Tools such as AWS S3, Google Cloud Storage, and Microsoft Azure offer robust recovery solutions, allowing data snapshots to be restored from a specific point in time.
  3. Data Compression: Explore data compression techniques to optimize storage space, primarily as long-term ML projects collect more data. For example, the Parquet file format can efficiently store and retrieve tabular data.
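Parquet achieves this through columnar encodings and built-in compression codecs; the same principle can be illustrated with nothing but the standard library by compressing a CSV snapshot before storing it as a version (the column names are hypothetical):

```python
import csv
import gzip
import io

# Build a small CSV snapshot in memory (hypothetical feature columns).
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["user_id", "feature_a", "feature_b"])
for i in range(1000):
    writer.writerow([i, i % 7, i % 13])
raw_bytes = buffer.getvalue().encode()

# Store the versioned snapshot compressed; repetitive tabular data
# typically shrinks substantially.
compressed = gzip.compress(raw_bytes)
print(len(compressed) < len(raw_bytes))
```

For real workloads, a columnar format like Parquet will usually compress better than gzipped CSV and also allows reading individual columns without decompressing the whole file.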

Retrieval

Modern machine learning applications require efficient data retrieval by components such as model training, data analysis, and application interfaces. In long-term machine learning projects, efficient data retrieval is critical for timely and accurate model training, data-driven insights, and decision-making. Common data retrieval best practices for both datasets and dataset versions include the following:

  1. Data Indexing: Indexing occurs on several levels, from the indexing of versions to the data points in the dataset. Indexing enables the efficient retrieval of data objects based on an initial mapping corresponding to frequently executed search queries. Modern databases and storage solutions enable admins to configure or implement indexing techniques.
  2. Data Caching: Frequent data retrieval can consume significant computing resources, leading to increased infrastructure costs. Caching is a technique that temporarily stores frequently accessed data in a fast, low-latency storage solution, separate from the primary dataset storage, to reduce compute cost and improve data retrieval speed.
  3. Data Partitioning: Large and complex datasets are better managed and retrieved in smaller chunks called partitions, which can improve the efficiency of data retrieval.
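The caching idea in particular is cheap to adopt: an in-process cache in front of the version loader means repeated requests for the same dataset version skip the expensive fetch. A minimal sketch, where the loader body is a hypothetical stand-in for an S3 or database read:

```python
from functools import lru_cache

@lru_cache(maxsize=32)
def load_dataset_version(version_tag: str):
    """Load a dataset version from primary storage. The lru_cache
    keeps recently requested versions in memory so repeated reads
    skip the expensive fetch (the body is a hypothetical stand-in)."""
    return [row for row in range(10_000)]

load_dataset_version("v1.2.0")
load_dataset_version("v1.2.0")  # served from cache
print(load_dataset_version.cache_info().hits)  # → 1
```

For datasets too large for process memory, the same pattern applies with an external cache (e.g. a local disk cache or Redis) sitting between the training job and the dataset store.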

Data version control tools

Several factors contribute to success and longevity in a long-term ML project. Having a data version control system in place helps maintain these factors, which include:

  1. Software factors, such as the accuracy and consistency of the data and models used in the project. In addition, the reproducibility of models and datasets helps track issues and replicate successful experiments across an organization.
  2. Hardware factors, including the availability and performance of the computing resources used to run the machine learning models.
  3. Human factors, including collaboration, communication, and accountability of the team members working on the project.

ML teams involved in long-term machine learning projects have several options for incorporating data version control into their existing workflows and managing dataset versions. Widely used tools among ML teams include DVC, MLflow, Databricks, Weights & Biases, DagsHub, and neptune.ai. These platforms and tools provide a range of features to support each stage of the machine learning pipeline, with dataset versioning being just one of many features offered.

Below is a table of key features available in common tools and platforms. The choice of platform or tool depends on several criteria, such as:

  • 1 project life span
  • 2 team expertise
  • 3 organization maturity levels

Investing in a platform too early can result in overengineering and additional complexity; investing too late can result in insecure infrastructure and an error-prone development environment. The table below provides an opinionated overview of when ML/AI/Data teams should leverage different platforms.

|                                | DVC | MLflow | Databricks | Weights & Biases | DagsHub | neptune.ai |
|--------------------------------|-----|--------|------------|------------------|---------|------------|
| Integration                    | Git, AWS, GCP | Git, Conda, PyPI | AWS, GCP, Azure | Git, PyTorch, Keras | GitHub, GitLab, BitBucket | AWS, GCP, Azure |
| Organization ML Maturity Level | Intermediate | Intermediate / Advanced | Advanced | Intermediate / Advanced | Intermediate / Advanced | Intermediate / Advanced |
| Lifespan of Project            | Short-Term | Short-Term to Long-Term | Long-Term | Short-Term to Long-Term | Short-Term to Long-Term | Short-Term to Long-Term |
| Cost                           | Free and Open Source / Paid Plan | Free and Open Source / Paid Plan | Paid Plan | Free Plan / Paid Plan | Free Plan / Paid Plan | Free Plan / Paid Plan |
| When to Adopt                  | Early | Early/Mid | Mid/Late | Mid/Late | Early/Mid | Early/Mid |

Comparison of machine learning data version control tools | Source: Author
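Whichever tool is chosen, most of them identify dataset versions the same way: by hashing file contents, so unchanged data never produces a new version. A stripped-down sketch of that idea (not any specific tool’s implementation):

```python
import hashlib
import tempfile
from pathlib import Path

def dataset_version_id(path: Path) -> str:
    """Content-addressed version id: identical file contents always
    produce the same id, so re-adding unchanged data creates no new
    version (the core idea behind tools like DVC)."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

with tempfile.TemporaryDirectory() as tmp:
    data_file = Path(tmp) / "train.csv"
    data_file.write_text("id,label\n1,0\n2,1\n")
    v1 = dataset_version_id(data_file)
    data_file.write_text("id,label\n1,0\n2,1\n3,0\n")  # data changed
    v2 = dataset_version_id(data_file)
    print(v1 != v2)
```

In practice, the tool stores the small hash in Git while the actual data lives in remote storage, which keeps repositories lightweight and dataset versions reproducible.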

Check also

Best 7 Data Version Control Tools That Improve Your Workflow With Machine Learning Projects

Team collaboration and role definition

Collaboration and appropriate role delegation within the ML team can contribute to ensuring efficient data management practices and, consequently, overall project success. An ML team involved in a long-term project will consist of a varying number of team members and various disciplines. A clear understanding of roles and responsibilities within the team enables efficient collaboration and promotes accountability, leading to better communication and coordination among team members.

  • Role delegation to enforce dataset version management practices includes identifying who is responsible for tasks such as data acquisition, annotation, and processing, typically done by a Data Manager. 
  • Responsibilities involving data visualization, analysis, feature extraction, and exploration generally are assigned to a Data Analyst. 
  • Responsibilities such as maintaining accurate data versions, data documentation, and security should be clearly defined as activities involved within every ML/AI/Data project role.
  • Encourage collaboration by leveraging tools such as DVC and platforms with collaborative features (tagging team members, audit updates, and commenting) to improve the synergy between various roles within the team and across teams in the organization.
  • Standard communication protocols and change procedures should be documented and communicated to team members, enabling an efficient data management process. Communication protocols could include a code review process, meeting frequency, and identification of solution and platform owners to ensure issues are delegated appropriately.

Dataset update strategy

In long-term machine learning projects, adopting a more strategic and planned approach is critical to ensure the predictability of system changes. A well-defined dataset update strategy is an essential aspect of this approach. It outlines how changes to datasets are planned, scheduled, and executed according to agreed-upon rules and standards, ensuring data security, relevance, and accuracy over the long term.

Many dataset versioning platforms and systems have integrated dataset update and version management processes that can be configured by platform administrators. These configurations may include restrictions on dataset access based on user roles, update frequency, data sources, preprocessing jobs, and other factors. Here are three possible approaches that all ML teams should consider:

  1. Incremental Updates: Regularly updating the dataset with new data points is effective for real-time ML use cases, improving model performance and results over time.
  2. Data Refresh Intervals: Defining the frequency of dataset refreshes allows ML teams to plan and schedule model training and optimization activities.
  3. Data Archiving: A strategy for removing outdated datasets from active storage, transferring them to cold storage, and archiving them is crucial, especially in sensitive industries. Maintaining a historical record of a dataset’s journey is necessary for audit and regulatory purposes.
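The archiving approach can be automated with a simple retention policy: anything in active storage that has not been modified within the retention window is moved to an archive location. A minimal sketch, with a hypothetical 180-day window:

```python
import shutil
import tempfile
import time
from pathlib import Path

ARCHIVE_AFTER_DAYS = 180  # hypothetical retention policy

def archive_stale_versions(active_dir: Path, archive_dir: Path, now=None):
    """Move dataset versions not modified within the retention window
    from active storage into an archive directory."""
    now = now or time.time()
    archive_dir.mkdir(parents=True, exist_ok=True)
    moved = []
    for version in active_dir.iterdir():
        age_days = (now - version.stat().st_mtime) / 86400
        if age_days > ARCHIVE_AFTER_DAYS:
            shutil.move(str(version), archive_dir / version.name)
            moved.append(version.name)
    return moved

# Demo with a temporary layout and a simulated "now" one year ahead.
with tempfile.TemporaryDirectory() as tmp:
    active, archive = Path(tmp) / "active", Path(tmp) / "archive"
    active.mkdir()
    (active / "v1").mkdir()
    moved = archive_stale_versions(active, archive, now=time.time() + 365 * 86400)
    print(moved)  # → ['v1']
```

In regulated industries, the archive step would additionally record what was moved, when, and by which job, so the full history remains auditable.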

Dataset documentation

As machine learning teams scale, documentation of key aspects of the project becomes crucial, including software code, technical decisions, and installation procedures. A comprehensive project record captures the essential information, context, and knowledge needed for reproducibility and knowledge transfer.

Proper documentation includes recording details about the dataset, such as its: 

  • 1 Source
  • 2 Business relevance
  • 3 Creation and modification dates
  • 4 Author
  • 5 Format
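These fields can be captured in a small, machine-readable dataset card stored alongside each dataset version. A minimal sketch, with hypothetical field values:

```python
import json
from dataclasses import dataclass, asdict

@dataclass
class DatasetCard:
    """Minimal machine-readable record of the documentation fields above."""
    source: str
    business_relevance: str
    created: str
    modified: str
    author: str
    data_format: str

card = DatasetCard(
    source="CRM export (hypothetical)",
    business_relevance="Training data for churn prediction",
    created="2023-01-15",
    modified="2023-10-30",
    author="data-team",
    data_format="parquet",
)
# Store the card next to the dataset version, e.g. as card.json.
card_json = json.dumps(asdict(card), indent=2)
print("source" in card_json)
```

Keeping the card in version control next to the data means documentation updates ride along with dataset updates instead of drifting out of date in a separate wiki.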

Below are some key benefits of dataset documentation for ML teams:

  • Data Governance: Dataset documentation tools can improve data governance by documenting data policies, lineage, and privacy information, ensuring compliance with relevant regulations and best practices.
  • Data Understanding: Dataset documentation ensures stakeholders clearly understand the data, improving the accuracy and efficiency of data analysis and modelling.
  • Team Collaboration: By providing a centralized repository of data, dataset documentation tools facilitate collaboration among team members, ensure everyone has up-to-date information, and make it easier to reproduce results and base analyses on high-quality data.

Dataset documentation may require an initial financial investment in tools but can evolve with the project if executed correctly. There are two traditional dataset documentation approaches, each with advantages and disadvantages.

Approaches to documentation in dataset version management: manual and embedded documentation | Source: Author

Documentation, if done properly, ensures that the dataset remains accessible and understandable over time, even as the team and project evolve.

Conclusion

This article explored the importance of dataset version management for machine learning projects with long lifespans. We explored challenges that can arise in such ML projects and investigated best practices to aid ML teams in mitigating or preventing these challenges. The best practices evaluated included a mix of tools, platforms, and techniques, with some requiring no initial financial investment and others requiring significant cost consideration. 

The investment in implementing these best practices may seem daunting initially, but the benefits of minimizing the risk of data loss or corruption and ensuring models are trained on the most up-to-date data are significant. 

By adopting these practices, ML teams can ensure the dataset’s traceability, reproducibility, and security, allowing them to better manage its lifecycle and sustain the efficiency and relevance of machine learning models trained on such data.

