Optimizing MongoDB Usage in Data Science: Tips & Tricks

Vishank Shah
4 min read · Aug 1, 2023

As a Data Scientist, mastering database management is crucial for efficient data analysis and decision-making. MongoDB, a popular NoSQL database, offers great flexibility and scalability, making it a top choice for handling large and complex datasets. Over the past two years, MongoDB has been an integral part of my professional toolkit, and I’ve gathered valuable tips and tricks that can elevate your MongoDB experience as a Data Scientist.

1. Create a Useful CRUD Wrapper

Working with MongoDB often involves performing CRUD operations (Create, Read, Update, Delete) on documents. To streamline your database interactions, consider creating a reusable CRUD wrapper. This abstraction layer encapsulates common database tasks and provides a cleaner and more maintainable way to interact with MongoDB.

import pymongo

# Placeholder connection settings (replace with your own values)
mongo_connection_string = "mongodb://localhost:27017"
database_name = "my_database"

def get_mongo_client():
    return pymongo.MongoClient(mongo_connection_string)

# Create operation
def create_document(collection, document_data):
    with get_mongo_client() as client:
        db = client[database_name]
        col = db[collection]
        result = col.insert_one(document_data)
        return result.inserted_id

# Read operation by document name
def get_document_by_name(collection, record_name):
    with get_mongo_client() as client:
        db = client[database_name]
        col = db[collection]
        document = col.find_one({"name": record_name})
        return document

# Update operation by document name
def update_document_by_name(collection, record_name, updated_data):
    with get_mongo_client() as client:
        db = client[database_name]
        col = db[collection]
        result = col.update_one({"name": record_name}, {"$set": updated_data})
        return result.modified_count

# Delete operation by document name
def delete_document(collection, name):
    with get_mongo_client() as client:
        db = client[database_name]
        col = db[collection]
        result = col.delete_one({"name": name})
        return result.deleted_count
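
For example, the wrapper could be used like this (collection and field names here are hypothetical):

new_id = create_document("products", {"name": "Widget", "price": 9.99})
doc = get_document_by_name("products", "Widget")
update_document_by_name("products", "Widget", {"price": 10.99})
delete_document("products", "Widget")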

2. Master Aggregation Pipelines

MongoDB’s Aggregation Pipeline is a powerful tool for data transformation, aggregation, and analytics. It allows you to perform complex operations on large datasets efficiently. Let’s say we have a collection of sales data and want to calculate the total sales for each product category, sorted in descending order:

pipeline = [
    {"$group": {"_id": "$category", "total_sales": {"$sum": "$amount"}}},
    {"$sort": {"total_sales": -1}}
]
result = list(collection.aggregate(pipeline))
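
If only part of the data is relevant, it usually pays to filter as early as possible so less data flows through the pipeline. A hypothetical variant that restricts the aggregation to a single year (assuming the documents have a "year" field) might look like this:

pipeline = [
    {"$match": {"year": 2023}},  # filter first so later stages see fewer documents
    {"$group": {"_id": "$category", "total_sales": {"$sum": "$amount"}}},
    {"$sort": {"total_sales": -1}}
]
result = list(collection.aggregate(pipeline))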

3. Check for Duplicates Before Updating/Inserting

To maintain data integrity, it’s crucial to prevent duplicate records in your database. Before performing updates or inserts, check if the data already exists to avoid redundancy:

def update_or_insert_document(collection, data):
    existing_record = collection.find_one({"_id": data["_id"]})

    if existing_record:
        collection.update_one({"_id": data["_id"]}, {"$set": data})
    else:
        collection.insert_one(data)
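
Alternatively, MongoDB can do the check and the write in a single round trip with an upsert, which also avoids the race between the find and the write. A minimal sketch:

def update_or_insert_document(collection, data):
    # upsert=True updates the matching document, or inserts it if none exists
    collection.update_one({"_id": data["_id"]}, {"$set": data}, upsert=True)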

4. Utilize Indexes

Indexes play a vital role in optimizing query performance. Identify frequently queried fields and create indexes to speed up data retrieval:

collection.create_index("name")
collection.create_index("category")

5. Use Projections Wisely

When fetching data from the database, use projections to retrieve only the necessary fields, reducing network overhead and improving query performance:

projection = {"_id": 0, "name": 1, "price": 1}
result = list(collection.find({}, projection))

6. Create Document-Specific Unique Identifiers

While MongoDB automatically generates a unique _id for each document, consider using document-specific unique identifiers for certain entities; with stock market data, for example, the stock symbol can serve as the unique key. This can enhance query efficiency and ensure data uniqueness:
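
A minimal sketch, assuming a collection of stock records keyed by ticker symbol (the values below are purely illustrative): either use the symbol directly as the _id, or keep the default ObjectId and enforce uniqueness with a unique index.

# Option 1: use the symbol itself as the document _id
collection.insert_one({"_id": "AAPL", "name": "Apple Inc.", "price": 195.30})

# Option 2: keep the default _id but enforce uniqueness on the symbol field
collection.create_index("symbol", unique=True)
collection.insert_one({"symbol": "AAPL", "name": "Apple Inc.", "price": 195.30})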

7. Implement Caching Systems

Frequent database queries can impact application performance. Implement caching mechanisms to store frequently accessed data and reduce query load:

import cachetools

@cachetools.cached(cache=cachetools.TTLCache(maxsize=100, ttl=300))
def get_stock_data(symbol):
    # Fetch stock data from MongoDB
    return collection.find_one({"symbol": symbol})
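
Here the TTLCache holds at most 100 symbols for five minutes (300 seconds), so repeated look-ups of the same symbol within that window are served from memory instead of hitting MongoDB.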

8. Maintain Data Integrity

Data integrity checks should include validating data types to ensure seamless interactions with MongoDB. MongoDB has specific data type requirements, and it’s essential to preprocess the data before insertion to handle data types that MongoDB may not support directly.

For example, MongoDB’s Python driver cannot encode the numpy.int64 data type directly. When encountering such data types, it’s advisable to convert them to native Python types like int or float before inserting them into the database.

import numpy as np

def insert_document(data):
    # Check data types and convert numpy scalars to native Python types
    for key, value in data.items():
        if isinstance(value, np.int64):
            data[key] = int(value)
        elif isinstance(value, np.float64):
            data[key] = float(value)
    collection.insert_one(data)

9. Avoid Unnecessary Operations and Create a Master Dataframe

Minimizing unnecessary database queries is crucial for optimizing the performance of your data science applications. If your code requires data from the same database multiple times, consider querying the database once and creating a master dataframe to serve as a central data source.
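
A minimal sketch, assuming the downstream analysis uses pandas and a single collection holds the data of interest: query it once, build a dataframe, and reuse that dataframe instead of issuing repeated queries.

import pandas as pd

# Query the collection once and keep the result as a master dataframe
records = list(collection.find({}, {"_id": 0}))
master_df = pd.DataFrame(records)

# Downstream code filters the dataframe instead of querying MongoDB again
# ("category" and "Electronics" are hypothetical field/value names)
electronics = master_df[master_df["category"] == "Electronics"]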

10. Implement Rate Limiting

To prevent abuse and maintain resource fairness, apply rate limiting to API endpoints interacting with MongoDB:

from flask import Flask
from flask_limiter import Limiter
from flask_limiter.util import get_remote_address

app = Flask(__name__)
limiter = Limiter(get_remote_address, app=app)  # rate-limit per client IP

@app.route("/get_data")
@limiter.limit("10 per minute")
def get_data():
    # Query MongoDB and return data here
    return {"status": "ok"}

By adopting these tips and tricks, you can harness the full potential of MongoDB for your data science projects. Understanding these best practices will not only make your code more efficient but also empower you to extract valuable insights from vast datasets. Embrace the continuous learning journey and explore the endless possibilities that MongoDB offers for your data-driven success!

Happy coding and database exploration!
