Image Visualization with Kangas

Applying built-in functions from Kangas UI to Hugging Face DataGrids

Felix Gutierrez
Heartbeat


Image from https://unsplash.com/photos/EXgCBYk4wCc

In a previous article, we explored the basic features of the Kangas API to construct our own DataGrids and visualize them in the Kangas server. For your reference, here is that tutorial:

This time, we will keep exploring features of the Kangas library by importing already-built image classification datasets into the Kangas UI as DataGrids. These datasets are available for public use on the Hugging Face Hub.

What Is Hugging Face?

According to its website, Hugging Face is a “platform where Users can build, benchmark, share, version and deploy Repositories, which may include Models, Datasets and Machine Learning Applications.”

All the Hugging Face open-source projects are available on their GitHub page, and they include Transformers, Datasets and Tokenizers.

In order to access Hugging Face datasets (which we will load into Kangas as DataGrids), we first need to install the datasets library in our Python environment. If you use conda, the recommendation is to create an isolated environment and install both kangas and the datasets library there:

pip install kangas

and

pip install datasets

Then we will perform all the analysis in a Jupyter Notebook with that environment activated. In my particular case, I created a conda env named flask-app with Python 3.9 installed.

Reading DataGrids with Kangas

In order to start exploring the DataGrid with Kangas, we'll import some basic packages that will load the dataset from Hugging Face and put Kangas into action:
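A minimal version of those imports might look like this (assuming the kangas and datasets packages installed above):

from datasets import load_dataset
import kangas as kg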

The datasets library's load_dataset() function accepts several parameters and can load a dataset from a local file, from in-memory data, or from "The Hub." We can get a complete list of all the datasets available on the Hugging Face Hub by calling the list_datasets() function or by reviewing them on the web page:
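For example, a quick sketch of listing them programmatically (note that in recent versions of the datasets library this helper has been moved to huggingface_hub.list_datasets()):

from datasets import list_datasets

available = list_datasets()
print(len(available))   # thousands of public datasets
print(available[:5])    # peek at the first few names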

Image from Author
Image from https://huggingface.co/datasets

After that, we can proceed to load the dataset in the notebook by passing some parameters to load_dataset(), such as split:

split (Split or str) — Which split of the data to load. If None, will return a dict with all splits (typically datasets.Split.TRAIN and datasets.Split.TEST)

You can also look up all the parameters that the datasets library accepts here:

We will be working with the train split of the beans dataset, and we will take all of its records, considering it is a relatively small one. The data and metadata of this dataset will be stored in your C:/Users/{user_name}/.cache/ folder.
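Loading the train split then looks like this; printing the dataset shows its features and number of rows:

dataset = load_dataset("beans", split="train")
print(dataset)   # shows the dataset's features and its row count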

Image from Author

You may be wondering why the dataset is stored in the .arrow format, and what exactly that is. Here is the answer to that.

Learn how to use Kangas with the Hugging Face Hub by watching this quick video.

Once the dataset is downloaded to your .cache folder, you can go ahead and start working with it in the notebook, for example with the info() function:
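A minimal sketch of wrapping the Hugging Face dataset in a Kangas DataGrid and calling info() could look like the following (the column names, and the use of the image and labels features, are illustrative assumptions):

# Build a DataGrid from the Hugging Face dataset (column names are illustrative)
dg = kg.DataGrid(name="beans-train", columns=["Image", "Label"])

for example in dataset:
    # kg.Image wraps the PIL image so Kangas can render it in the UI
    dg.append([kg.Image(example["image"]), example["labels"]])

dg.info()   # summary of the DataGrid: rows, columns, and data types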

Image from Author

We can also use some other functions of the DataGrid class before saving and visualizing it, such as get_columns(), dg.head(), dg.tail(), dg.shape(), and dg.info().
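For instance:

dg.get_columns()   # names of the DataGrid columns
dg.head()          # first few rows
dg.tail()          # last few rows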

Image from Author

Once we have explored the elements that comprise the DataGrid, we can save it. For the sake of simplicity, I will save it in my Temp folder, and then I will explore the schema of my DataGrid.

Image from Author

Through get_schema(), as shown in the above image, we can get information about how the data and metadata of our DataGrid are organized, as well as the data type of each column.
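In code, saving the DataGrid and inspecting its schema is a short sketch like this (save() writes a .datagrid file in the working directory):

dg.save()          # writes beans-train.datagrid to disk
dg.get_schema()    # column names, data types, and metadata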

You can also iterate over all the rows of the DataGrid and get, for example, the asset_id of each image:

Image from Author

Once the DataGrid is saved, we can start visualizing it in the Kangas server. To do so, go to the directory where the DataGrid was saved and start the server from there:
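From a terminal in that directory, the server can be started with the Kangas CLI (the UI is typically served at http://localhost:4000):

kangas server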

Image from Author

Open your browser and type the URL where the server is running: there you have your DataGrid, and you can start applying filters, sorting, and grouping by columns:

Image from Author

Of course, this is a very basic example of what you can do with Kangas, and I encourage you to look for more complex examples in the official Kangas GitHub repo. There are some good examples there to inspire you to create your own object detection models based on Kangas and PyTorch.

Summary

In this article, we have learned how to load datasets that we can use to start an analysis from scratch, selecting from a huge public repository of models and datasets oriented toward computer vision, NLP, and audio recognition. We have also explored some other features of the Kangas API for DataGrid analysis and classification.

Remember, you can follow me here on Medium and also on LinkedIn. The code used for this article is available on GitHub; feel free to clone the repo and use it for practical purposes, or even suggest improvements:

Editor’s Note: Heartbeat is a contributor-driven online publication and community dedicated to providing premier educational resources for data science, machine learning, and deep learning practitioners. We’re committed to supporting and inspiring developers and engineers from all walks of life.

Editorially independent, Heartbeat is sponsored and published by Comet, an MLOps platform that enables data scientists & ML teams to track, compare, explain, & optimize their experiments. We pay our contributors, and we don’t sell ads.

If you’d like to contribute, head on over to our call for contributors. You can also sign up to receive our weekly newsletter (Deep Learning Weekly), check out the Comet blog, join us on Slack, and follow Comet on Twitter and LinkedIn for resources, events, and much more that will help you build better ML models, faster.
