What is Fabric Lakehouse?

Fabric Lakehouse is one of the items in Microsoft Fabric that can store and analyze data. It is a data architecture platform that scales to manage large amounts of structured and unstructured data.

What are the benefits of using notebooks in Microsoft Fabric?

Notebooks in Microsoft Fabric allow data professionals to develop machine learning experiments and facilitate their deployment. They offer a wide range of features for data exploration and experimentation.

How can you store files/data in a Lakehouse in Microsoft Fabric?

To store files/data in a Lakehouse in Microsoft Fabric, you can create a new Lakehouse, upload files from your local device, and connect the Lakehouse with your notebook.

What is the purpose of machine learning model training in Microsoft Fabric?

The purpose of machine learning model training in Microsoft Fabric is to develop accurate prediction models using algorithms and training data. These models can then be used for making predictions on new data.

What is the role of MLflow in model training in Microsoft Fabric?

MLflow is an open-source platform whose API is used for creating machine learning experiments, managing model training runs, and saving trained models in the MLflow format. It helps in organizing and tracking machine learning experiments.

How can you load and use a saved machine learning model for inference in Microsoft Fabric?

To load and use a saved machine learning model for inference in Microsoft Fabric, you can load the model using the MLFlowTransformer class and apply it to a dataset using the transform method.

Can the code provided in the blog be run on platforms other than Microsoft Fabric Notebook?

No, the code provided in the blog is specifically designed for Microsoft Fabric Notebook and may not run correctly on other platforms like Colab.

What is the purpose of the Diabetes-Prediction experiment in Microsoft Fabric?

The purpose of the Diabetes-Prediction experiment in Microsoft Fabric is to organize and manage machine learning runs for predicting diabetes using trained models.

Mastering Data Science with Microsoft Fabric: A Tutorial for Beginners - Pragnakalp Techlabs: AI, NLP, Chatbot, Python Development

What services does Microsoft Fabric's core offer?

Microsoft Fabric's core offers services such as Data Factory, Synapse Data Engineering, Synapse Data Science, Synapse Data Warehousing, Synapse Real-Time Analytics, and Power BI.

Introduction:

Microsoft Fabric is a cloud-based platform that offers a unified data science, data engineering, and business intelligence experience. It provides a variety of features and services, such as data preparation, machine learning, and visualization. Fabric’s comprehensive toolset enables data professionals and business users alike to unlock the full potential of their data and shape the future of AI.

Fabric’s core offers services such as Data Factory, Synapse Data Engineering, Synapse Data Science, Synapse Data Warehousing, Synapse Real-Time Analytics, and Power BI. Together they cover data science needs ranging from data integration and engineering to real-time analytics and visualization.

In this blog, we focus on Fabric’s data science services. We will show how to use Microsoft Fabric to build a diabetes prediction model and explore the notebook’s tools along the way.

To access Microsoft Fabric, create an account at app.fabric.microsoft.com for a free trial. If you are an existing Power BI customer, you can sign in with your Power BI account credentials.

Check out our blog “Mastering Data Science with Microsoft Fabric: Introduction to Fabric Notebook Features” to learn about the notebook capabilities that can improve your data exploration and experimentation process.

Fabric Lakehouse and Notebooks:

To start with our diabetes prediction, we will use the “pima-indians-diabetes” dataset from Kaggle, which contains records for 768 patients, each labeled with whether or not the patient has diabetes.
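
Before training, it can help to confirm the file’s shape and class balance. The following is a minimal sketch using only the Python standard library. It assumes the CSV has a header row with the column names used in the Kaggle file and a label column named `Outcome`; the three sample rows and the `summarize` helper are illustrative only and stand in for the uploaded file.

```python
import csv
import io
from collections import Counter

# Illustrative sample only: the header is assumed to match the Kaggle CSV,
# and these few rows stand in for the full file.
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
"""

def summarize(csv_text, label_col="Outcome"):
    """Return (row_count, feature_names, label_counts) for CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    # Every column except the label is treated as a feature
    features = [name for name in reader.fieldnames if name != label_col]
    label_counts = Counter(row[label_col] for row in rows)
    return len(rows), features, dict(label_counts)

row_count, features, label_counts = summarize(SAMPLE_CSV)
print(row_count, len(features), label_counts)
```

On the real file, the same function would report the total number of rows and how many fall into each class.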

Data can be structured or unstructured, and both kinds need a place to live. Fabric’s Lakehouse is one of the items that can store data; it is a data architecture platform for managing and analyzing data. It can scale and adapt to manage huge amounts of data, and it supports various kinds of data processing tools and frameworks. To learn more about the lakehouse, refer to “What is a lakehouse in Microsoft Fabric?”

Fabric uses the notebook artifact within the Data Science experience to demonstrate the framework’s diverse capabilities. Notebooks let you develop machine learning experiments and facilitate their deployment. The Data Science service and notebooks provide a wide range of features, which are discussed further below. To learn more about the Data Science services, refer to “How to use Microsoft Fabric notebooks”.

Follow the steps below to store files/data in a Lakehouse:

Go to the Microsoft Fabric home page and select Data Engineering from the menu.

Create a new Lakehouse.

Upload files from your local device. The uploaded files appear in the existing “Files” folder.

Now let’s see how we can train our model for Diabetes prediction.

You can either create a new notebook or import an existing notebook from the Data Engineering home page (shown in the image in step no. 2) or from the Data Science home page, as shown in the image below.

Connect a Lakehouse to your notebook: you can either create a new Lakehouse or connect an existing one.
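
Once a Lakehouse is attached, a quick check that the upload is visible from the notebook can save debugging later. The following is a minimal sketch, assuming the default Lakehouse is mounted at `/lakehouse/default/Files` inside the notebook session; the helper name `find_uploaded_csvs` is hypothetical.

```python
from pathlib import Path

# Assumption: the attached default Lakehouse is mounted at this path
# inside the notebook session.
LAKEHOUSE_FILES = Path("/lakehouse/default/Files")

def find_uploaded_csvs(files_root=LAKEHOUSE_FILES):
    """List CSV file names in the Files folder, or [] if the folder is not mounted."""
    root = Path(files_root)
    if not root.is_dir():
        return []
    return sorted(p.name for p in root.glob("*.csv"))

print(find_uploaded_csvs())
```

An empty list means either that no Lakehouse is attached or that no CSV files have been uploaded to the top level of “Files”.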

Please follow this notebook code to train the machine learning model for diabetes prediction.
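
The training code later in this post uses the variables X, y, X_train, X_test, y_train, and y_test without showing how they are created. The following is a minimal plain-Python sketch of the hold-out split behind those names. It assumes each row ends with the label, and the `split_rows` helper and sample rows are hypothetical; in a notebook, scikit-learn’s `train_test_split` is the more usual way to do this.

```python
import random

def split_rows(rows, test_fraction=0.2, seed=12345):
    """Shuffle rows reproducibly, then split into train/test features and labels.

    Each row is a list whose last element is the label.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed gives the same split every run
    n_test = max(1, round(len(shuffled) * test_fraction))
    test_rows, train_rows = shuffled[:n_test], shuffled[n_test:]
    X_train = [row[:-1] for row in train_rows]
    y_train = [row[-1] for row in train_rows]
    X_test = [row[:-1] for row in test_rows]
    y_test = [row[-1] for row in test_rows]
    return X_train, X_test, y_train, y_test

# Hypothetical rows: two feature values followed by a 0/1 label
rows = [[i, i * 2, i % 2] for i in range(10)]
X_train, X_test, y_train, y_test = split_rows(rows)
print(len(X_train), len(X_test))
```

Holding out part of the data is what makes the test accuracy reported during training a meaningful estimate of performance on unseen patients.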

Machine Learning Model Training and Prediction Scoring

This section walks through the steps involved in training a scikit-learn style model and saving the trained model. It then demonstrates how to use the saved model for predictions once training is complete. To learn more about models in Fabric, refer to “How to train models with scikit-learn in Microsoft Fabric”.

Please note that the code provided in this section is specifically designed for Microsoft Fabric Notebook. Attempting to run it on other platforms, such as Colab, may result in errors. This is because the PREDICT function used in the code runs on Spark in Fabric and requires the model to be saved in the MLflow format.

A machine learning experiment is the basic organizing and management unit for all related machine learning runs. To create an experiment for the model, run the code below.

import mlflow
mlflow.set_experiment("Diabetes-Prediction")

It will create a new experiment named “Diabetes-Prediction” in your workspace. To learn more about experiments, see “Machine learning experiments in Microsoft Fabric”.

Alternatively, you can create an experiment using the UI (from your workspace, select Experiment from the dropdown).

The following code shows how to use the MLflow API to create a machine learning experiment and launch an MLflow run for an LGBMClassifier model, which comes from the LightGBM library and follows the scikit-learn estimator interface. After that, the model’s version is saved and registered in the Microsoft Fabric workspace.

In the code below, set your model name in mlflow.sklearn.log_model().

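import mlflow
import mlflow.sklearn
from lightgbm import LGBMClassifier
from mlflow.models.signature import infer_signature
from sklearn.metrics import accuracy_score

# X, y and the X_train/X_test/y_train/y_test splits are assumed to be
# defined earlier in the notebook.
mlflow.set_experiment("Diabetes-Prediction")
with mlflow.start_run() as run:
    model = LGBMClassifier(random_state=12345)
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)  # accuracy on the held-out test data
    score = model.score(X_train, y_train)      # accuracy on the training data
    signature = infer_signature(X, y)          # input/output schema stored with the model

    print('score...:', score)
    print('Accuracy...:', accuracy)
    # Log the model to this run and register it under the given name
    mlflow.sklearn.log_model(
        model,
        "diabetes-model",
        signature=signature,
        registered_model_name="diabetes-model"
    )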
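
The two printed values are the same metric computed on different data. For a scikit-learn style classifier, `model.score` returns the mean accuracy on the data passed to it (here the training set), while `accuracy_score(y_test, y_pred)` is the accuracy on the held-out test set. The following is a minimal plain-Python sketch of the metric; the labels and predictions are made up purely for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    matches = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return matches / len(y_true)

# Hypothetical labels and predictions, for illustration only
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 0]
print(accuracy(y_true, y_pred))  # 4 of 5 predictions are correct
```

A training score much higher than the test accuracy is a common sign that the model is overfitting.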

Once the model has been saved, it can be loaded for inference. To do this, we load the model and run inference on a sample dataset. Please refer to the code below to make predictions on your test data.

from pyspark.sql import SparkSession
from synapse.ml.predict import MLFlowTransformer

spark = SparkSession.builder.getOrCreate()
test = spark.read.format("csv").option("header","true").load("Files/diabetes_test.csv")
# test is now a Spark DataFrame containing CSV data from "Files/diabetes_test.csv".
display(test)

# You can substitute values below for your own input columns,
# output column name, model name, and model version
model = MLFlowTransformer(
    inputCols=test.columns,
    outputCol='predictions',
    modelName='diabetes-model',
    modelVersion=1
)
# transform() returns a new DataFrame with the output column added.
# show() only prints and returns None, so keep the DataFrame in its own variable.
prediction = model.transform(test)
prediction.show()
pred_df = prediction.toPandas()

Replace inputCols, modelName, and modelVersion with the feature columns of your test dataset, your model name, and your model version.
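
Note that `inputCols=test.columns` passes every column to the model. If the test file still contains the label column, it would be sent as a feature too. The following is a small sketch for excluding it, assuming the label column is named `Outcome` as in the Kaggle CSV; the helper name `feature_columns` is hypothetical.

```python
def feature_columns(all_columns, label_col="Outcome"):
    """Return the column names to feed the model, leaving out the label column."""
    return [name for name in all_columns if name != label_col]

# Column names assumed from the Kaggle CSV header
columns = ["Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
           "Insulin", "BMI", "DiabetesPedigreeFunction", "Age", "Outcome"]
print(feature_columns(columns))
```

In the notebook, this would be used as `inputCols=feature_columns(test.columns)`.
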
Alternatively, you can generate the PREDICT code above from the UI, using the model’s item page, to run inference on your test data.

Open the model from the workspace where you saved it.

Select the model version from the sidebar, click the “Apply model” button, and select “Apply this model in the wizard”, as shown in the image below.

Follow the steps in the left sidebar of the wizard, as outlined in the image below and in “Generate PREDICT code from a model’s item page”, and enter the name of the notebook where you want to save the inference code.

You will see the generated code in the given notebook.

Running the generated code adds the prediction column (named by outputCol, “predictions” in the code above) to your test DataFrame.

This way you can use Fabric Notebook for your data science experiments.
