Welcome to our guide on seamlessly loading datasets from Kaggle to Google Colab notebooks! Aspiring data scientists often turn to Kaggle for its vast repository of datasets spanning various domains, from entertainment to astronomy. However, working with these datasets requires efficient tools, and Google Colab emerges as a promising solution with its cloud-based environment and GPU support. In this article, we’ll walk you through the process of accessing Kaggle datasets directly within Google Colab, streamlining your data exploration and analysis workflow. Let’s dive in and unlock the potential of these powerful platforms!
This article was published as a part of the Data Science Blogathon.
Kaggle serves as a treasure trove of diverse datasets catering to various data science needs. Here’s a breakdown to help you grasp the landscape:
Accessing Kaggle datasets via API requires API tokens, serving as authentication keys to interact with Kaggle’s services. Here’s how to obtain and manage these credentials securely:
The steps below walk you through generating an API token from your Kaggle profile, which you need in order to access Kaggle datasets programmatically and interact with Kaggle’s services via the API.
The first and foremost step is to choose your dataset from Kaggle. You can select datasets from competitions too. For this article, I am choosing two datasets: one random dataset and one from an active competition.
Screenshot from Google Smartphone Decimeter Challenge
Screenshot from The Complete Pokemon Images Data Set
To download data from Kaggle, you need to authenticate with the Kaggle services. For this purpose, you need an API token. This token can be easily generated from the profile section of your Kaggle account. Simply navigate to your Kaggle profile and then:
Click the Account tab and then scroll down to the API section (Screenshot from Kaggle profile)
A file named “kaggle.json” will be downloaded, containing your username and API key.
This is a one-time step and you don’t need to generate the credentials every time you download the dataset.
Fire up a Google Colab notebook and connect it to the cloud instance (basically start the notebook interface). Then, upload the “kaggle.json” file that you just downloaded from Kaggle.
Now you are all set to run the commands needed to load the dataset. Follow along with these commands:
Note: Here we will run all the Linux and installation commands starting with “!”. As Colab instances are Linux-based, you can run all the Linux commands in the code cells.
! pip install kaggle
! mkdir ~/.kaggle
! cp kaggle.json ~/.kaggle/
! chmod 600 ~/.kaggle/kaggle.json
The Colab notebook is now ready to download datasets from Kaggle.
All the commands needed to set up the Colab notebook.
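If you prefer to keep the setup in Python, the Kaggle CLI and API also accept credentials through the KAGGLE_USERNAME and KAGGLE_KEY environment variables instead of ~/.kaggle/kaggle.json. A minimal sketch (the helper name and the assumption that kaggle.json sits in the working directory are ours):

```python
import json
import os

def load_kaggle_credentials(path="kaggle.json"):
    """Read a kaggle.json token file and expose its contents through the
    KAGGLE_USERNAME / KAGGLE_KEY environment variables, which the Kaggle
    CLI and API accept as an alternative to ~/.kaggle/kaggle.json."""
    with open(path) as f:
        token = json.load(f)
    os.environ["KAGGLE_USERNAME"] = token["username"]
    os.environ["KAGGLE_KEY"] = token["key"]

# In Colab, after uploading kaggle.json to the working directory:
# load_kaggle_credentials()
```

Call it once per session, before running any kaggle command in that notebook.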
Kaggle hosts two types of datasets: competition datasets and standalone datasets. The procedure to download either type is the same, with just minor changes.
Downloading Competitions dataset:
! kaggle competitions download <name-of-competition>
Here, the name of the competition is not the bold title displayed on the page. It is the slug of the competition link, the part that follows “/c/”. Consider our example link.
“google-smartphone-decimeter-challenge” is the name of the competition to be passed to the Kaggle command:
! kaggle competitions download google-smartphone-decimeter-challenge
This will start downloading the data into the storage allocated to the instance:
Downloading Datasets:
These datasets are not part of any competition. You can download these datasets by:
! kaggle datasets download <name-of-dataset>
Here, the name of the dataset is “user-name/dataset-name”. You can simply copy the trailing text after “www.kaggle.com/”. In our case, it is “arenagrenade/the-complete-pokemon-images-data-set”:
! kaggle datasets download arenagrenade/the-complete-pokemon-images-data-set
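The identifier can also be pulled out of a copied browser URL programmatically. A small illustrative helper (not part of the Kaggle package) that covers both the “/c/” competition form and the “user-name/dataset-name” form:

```python
from urllib.parse import urlparse

def kaggle_slug(url):
    """Extract the identifier to pass to the kaggle CLI from a Kaggle URL.
    Competition pages live under /c/<slug> (or /competitions/<slug>);
    dataset pages are /<user-name>/<dataset-name>."""
    parts = [p for p in urlparse(url).path.split("/") if p]
    if parts and parts[0] in ("c", "competitions"):
        return parts[1]           # competition slug
    return "/".join(parts[:2])    # user-name/dataset-name

kaggle_slug("https://www.kaggle.com/c/google-smartphone-decimeter-challenge")
# -> "google-smartphone-decimeter-challenge"
kaggle_slug("https://www.kaggle.com/arenagrenade/the-complete-pokemon-images-data-set")
# -> "arenagrenade/the-complete-pokemon-images-data-set"
```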
The output of the command (Notebook screenshot)
In case you get a dataset with a zip extension, you can simply use the unzip command of Linux to extract the data:
! unzip <name-of-file>
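If you would rather stay in Python than shell out, the standard-library zipfile module does the same job (the function name here is our own):

```python
import zipfile

def extract_dataset(zip_path, dest="."):
    """Unpack a downloaded Kaggle archive into dest (current directory
    by default) and return the list of member names for a sanity check."""
    with zipfile.ZipFile(zip_path) as zf:
        zf.extractall(dest)
        return zf.namelist()
```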
Let us now look into some bonus tips that might help us.
You just saw how to download datasets from Kaggle in Google Colab. Sometimes you only need one specific file from a dataset. In that case, use the “-f” flag followed by the name of the file, and only that file will be downloaded. The “-f” flag works for both the competitions and datasets commands.
Example:
! kaggle competitions download google-smartphone-decimeter-challenge -f baseline_locations_train.csv
You can check out Kaggle API official documentation for more features and commands.
In step 3, you uploaded the “kaggle.json” when executing the notebook. All the files uploaded in the storage provided while running the notebook are not retained after the termination of the notebook.
It means that you need to upload the JSON file every time the notebook is reloaded or restarted. To avoid this manual work, save “kaggle.json” in your Google Drive once, then mount the Drive in Colab and copy the file from there:
! pip install kaggle
! mkdir ~/.kaggle
from google.colab import drive
drive.mount('/content/drive')
! cp /content/drive/MyDrive/kaggle.json ~/.kaggle/kaggle.json
! chmod 600 ~/.kaggle/kaggle.json
Now you can use the Kaggle competitions and datasets commands as usual. This method has the added advantage of not requiring you to upload the credentials file on every notebook re-run.
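Putting the Drive-based flow together, here is a sketch of a helper that copies the token into the directory the Kaggle CLI reads and restricts its permissions (the function name and source path are examples; adjust the path to wherever you keep kaggle.json in Drive):

```python
import os
import shutil
import stat

def install_kaggle_token(src, kaggle_dir=os.path.expanduser("~/.kaggle")):
    """Copy kaggle.json from src (e.g. a mounted Drive path) into the
    directory the Kaggle CLI reads, and restrict it to owner read/write
    (the equivalent of chmod 600)."""
    os.makedirs(kaggle_dir, exist_ok=True)
    dest = os.path.join(kaggle_dir, "kaggle.json")
    shutil.copy(src, dest)
    os.chmod(dest, stat.S_IRUSR | stat.S_IWUSR)
    return dest
```

In Colab you would first mount Drive (`from google.colab import drive; drive.mount('/content/drive')`) and then call `install_kaggle_token('/content/drive/MyDrive/kaggle.json')`.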
Google Colab is a great platform for practicing data science. One of its major benefits is free GPU support. Beginners in data science are often short of computational resources, and Google Colab solves that hardware problem. Colab notebooks run on Linux instances, so you can run all the usual Linux commands and interact with the kernel easily.
The RAM and disk allocations are more than enough for practice datasets, but if your research requires more compute power, you can opt for the paid “Colab Pro” plan.
Mastering the process of loading datasets from Kaggle to Google Colab offers numerous opportunities for data scientists. By understanding the diverse datasets available on Kaggle, managing Kaggle API credentials securely, and utilizing Google Colab’s capabilities, individuals can streamline their data science workflow and gain hands-on experience. The synergy between Kaggle and Google Colab provides a dynamic platform for innovation and growth in the field of data science, enabling practitioners to explore competition datasets and standalone datasets.
Q1. Can we import datasets from Kaggle to Google Colab?
A. Yes, you can seamlessly import datasets from Kaggle to Google Colab using the steps outlined in this article. By generating API tokens and setting up the necessary authentication, you can access Kaggle datasets directly within the Colab environment, facilitating efficient data analysis and experimentation.
Q2. What are the benefits of using Google Colab for data science?
A. Google Colab offers numerous advantages for data science practitioners, including free GPU support, a Linux-based environment for running commands, and ample computational resources. With Colab, users can execute code collaboratively, leverage pre-installed libraries, and access cloud storage, making it an ideal platform for practicing data science projects and experiments.
Q3. How should I handle Kaggle API credentials securely?
A. It’s crucial to handle API credentials securely to protect your Kaggle account and data. Best practices include storing your API token securely, such as in a hidden directory or encrypted file, and avoiding sharing credentials publicly. Additionally, consider rotating your API keys periodically and using secure methods like OAuth for authentication where possible.
The media shown in this article are not owned by Analytics Vidhya and are used at the Author’s discretion.