By Shivashish Thakur, Digital Marketing, DataFlair.
To build a good model, you need a large amount of data. But finding the right dataset for your machine learning or data science project can be a challenging task. Many organizations, researchers, and individuals have shared their work, and we can use their datasets to build our projects.
So in this article, we are going to discuss 20+ machine learning and data science datasets and project ideas that you can use for practicing and upgrading your skills.
1. Enron Email Dataset
The Enron dataset is popular in natural language processing. It has more than 500K emails from over 150 users, most of whom were senior management at Enron. The size of the data is around 432 MB.
Data Link: Enron email dataset
Project Idea: Using k-means clustering, you can group similar emails and look for clusters that point to fraudulent activity. K-means clustering is an unsupervised machine learning algorithm. It separates the observations into k clusters based on similar patterns in the data.
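As a rough illustration of how k-means groups observations, here is a minimal pure-Python sketch on 2-D points. The points, the kmeans helper, and the choice of k=2 are invented for illustration; a real Enron project would first turn each email into a numeric feature vector (for example, word counts) and would probably rely on a library implementation.

```python
import math
import random


def kmeans(points, k, iterations=20, seed=0):
    """Cluster points (tuples of floats) into k groups with plain k-means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct data points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignments, centroids


# Toy data (invented): two loose groups of 2-D points.
points = [(1.0, 1.1), (0.9, 1.0), (1.2, 0.8), (8.0, 8.1), (7.9, 8.3), (8.2, 7.9)]
labels, centers = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; what matters is which points end up sharing a label. Deciding whether a cluster is suspicious still needs human inspection, since k-means only groups similar items.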
2. Chatbot Intents Dataset
The dataset for a chatbot is a JSON file that has distinct tags like goodbye, greetings, pharmacy_search, hospital_search, etc. Every tag has a list of patterns that a user might type, and the chatbot responds according to the pattern it matches. The dataset is perfect for understanding how chatbot data works.
Data Link: Intents JSON Dataset
Project Idea: You can build a chatbot, or understand how one works, by modifying and expanding the data with your own examples. To build a chatbot of your own, you need a good knowledge of natural language processing concepts.
Source Code: Chatbot Project in Python
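To make the tag-and-pattern idea concrete, here is a small sketch that loads an intents structure and picks the tag whose patterns share the most words with a message. The JSON layout (an "intents" list with "tag", "patterns", and "responses" keys), the sample entries, and the match_intent helper are assumptions for illustration; check the real file for its actual keys. A trained model would replace the word-overlap rule in a real chatbot.

```python
import json

# Hypothetical miniature intents file; the real dataset's keys may differ.
INTENTS_JSON = """
{
  "intents": [
    {"tag": "greetings",
     "patterns": ["hi", "hello", "good morning"],
     "responses": ["Hello! How can I help?"]},
    {"tag": "goodbye",
     "patterns": ["bye", "see you later"],
     "responses": ["Goodbye!"]},
    {"tag": "pharmacy_search",
     "patterns": ["find a pharmacy", "pharmacy near me"],
     "responses": ["Searching for nearby pharmacies..."]}
  ]
}
"""


def match_intent(message, intents):
    """Return the tag whose patterns share the most words with the message."""
    words = set(message.lower().split())
    best_tag, best_overlap = None, 0
    for intent in intents:
        for pattern in intent["patterns"]:
            overlap = len(words & set(pattern.lower().split()))
            if overlap > best_overlap:
                best_tag, best_overlap = intent["tag"], overlap
    return best_tag  # None when no pattern shares a word with the message


intents = json.loads(INTENTS_JSON)["intents"]
tag = match_intent("is there a pharmacy near my house", intents)
print(tag)  # pharmacy_search
```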
3. Flickr 30k Dataset
The Flickr 30k dataset has over 30,000 images, and each image is labeled with several captions. This dataset is used to build image caption generators. It is a larger version of Flickr 8k and can be used to build more accurate models.
Data Link: Flickr image dataset
Project Idea: You can build an image caption generator. A CNN is well suited to analysing the image and extracting its features, and the model then uses those features to generate an English sentence that describes the image, called a caption.
4. Parkinson Dataset
Parkinson’s disease is a disorder of the nervous system that affects movement. The Parkinson dataset contains biomedical measurements: 195 records with 23 different attributes. This data is used to differentiate healthy people from people with Parkinson’s disease.
Data Link: Parkinson dataset
Project Idea: You can build a model that differentiates healthy people from people with Parkinson’s disease. A useful algorithm for this purpose is XGBoost, which stands for extreme gradient boosting and is based on decision trees.
Source Code: ML Project on Detecting Parkinson’s Disease
5. Iris Dataset
The Iris dataset is a beginner-friendly dataset that has information about flower petal and sepal sizes. It has 3 classes with 50 instances in each class, so it contains only 150 rows with 4 feature columns.
Data Link: Iris dataset
Project Idea: Classification is the task of separating items into their corresponding classes. You can implement a machine learning classification or regression model on the dataset.
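One of the simplest classifiers to try here is nearest neighbor: label a new flower with the class of the most similar known flower. The sketch below assumes the four columns are sepal length, sepal width, petal length, and petal width in centimeters, and uses only a handful of illustrative rows; the classify helper is a made-up name, and a real project would load all 150 rows and hold some out for testing.

```python
import math

# A few illustrative rows: (sepal length, sepal width, petal length,
# petal width) in cm, plus a class label. Load the full file in practice.
TRAIN = [
    ((5.1, 3.5, 1.4, 0.2), "setosa"),
    ((4.9, 3.0, 1.4, 0.2), "setosa"),
    ((7.0, 3.2, 4.7, 1.4), "versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "versicolor"),
    ((6.3, 3.3, 6.0, 2.5), "virginica"),
    ((5.8, 2.7, 5.1, 1.9), "virginica"),
]


def classify(sample, train=TRAIN):
    """1-nearest-neighbor: return the label of the closest training row."""
    _, label = min(train, key=lambda row: math.dist(sample, row[0]))
    return label


print(classify((5.0, 3.4, 1.5, 0.2)))  # setosa
print(classify((6.0, 2.9, 4.6, 1.4)))  # versicolor
```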
6. ImageNet dataset
ImageNet is a large image database that is organized according to the WordNet hierarchy. It covers over 100,000 phrases with an average of 1,000 images per phrase. The size exceeds 150 GB. It is suitable for image recognition, face recognition, object detection, etc. It also hosts a challenging competition named ILSVRC, in which people build increasingly accurate models.
Data Link: Imagenet Dataset
Project Idea: Implement image classification on this huge database and recognize objects. Convolutional neural networks (CNNs) are the natural choice for getting accurate results in this project.
7. Mall Customers Dataset
The Mall Customers dataset holds details about people visiting a mall. It has columns for customer ID, gender, age, annual income, and spending score. You can use it to gain insights from the data and divide the customers into different groups based on their behavior.
Data Link: mall customers dataset
Project Idea: Segment the customers based on their gender, age, and interests. Customer segmentation is the practice of dividing customers into groups of similar individuals, and it is useful in customized marketing.
Source Code: Customer segmentation with Machine learning.
8. Google Trends Data Portal
Google Trends data can be explored and analyzed visually. You can also download the data as CSV files with a simple click. It lets you find out what’s trending and what people are searching for.
Data Link: Google trends datasets
9. The Boston Housing Dataset
This is a popular dataset used in pattern recognition. It describes housing in the Boston area using features such as crime rate, tax, number of rooms, etc. It has 506 rows and 14 different variables in columns. You can use this dataset to predict house prices.
Data Link: Boston dataset
Project Idea: Predict the price of a new house using linear regression. Linear regression predicts the output for an unseen input when there is a roughly linear relationship between the input and output variables.
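For a single input variable, fitting a line by ordinary least squares comes down to two formulas: the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the means. The sketch below uses invented numbers (not values from the Boston data) and a hypothetical fit_line helper; with all 13 input variables you would use a multivariate solver from a library instead.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Invented toy values, constructed so that price = 4 * rooms - 2 exactly.
rooms = [4, 5, 6, 7, 8]
price = [14.0, 18.0, 22.0, 26.0, 30.0]

slope, intercept = fit_line(rooms, price)
predicted = slope * 6.5 + intercept  # prediction for an unseen input
print(slope, intercept, predicted)   # 4.0 -2.0 24.0
```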
10. Uber Pickups Dataset
The dataset has information about 4.5 million Uber pickups in New York City from April 2014 to September 2014 and 14 million more from January 2015 to June 2015. Users can perform data analysis and gather insights from the data.
Data Link: Uber pickups dataset
Project Idea: Analyze the ride data and visualize it to find insights that can help improve the business. Data analysis and visualization are important parts of data science: analysis gathers insights from the data, and visualization lets you take in that information quickly.
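A typical first question for this kind of data is when pickups happen. The sketch below counts pickups per hour of day and prints a crude text bar chart. The column names, the timestamp format, and the sample rows are assumptions made for illustration; check the header of the file you download and adjust the format string accordingly.

```python
import csv
import io
from collections import Counter
from datetime import datetime

# Stand-in for a few rows of a pickups CSV. Column names, timestamp
# format, and values are assumed for illustration only.
SAMPLE = """Date/Time,Lat,Lon,Base
4/1/2014 0:11:00,40.7690,-73.9549,B02512
4/1/2014 0:17:00,40.7267,-74.0345,B02512
4/1/2014 7:45:00,40.7316,-73.9873,B02512
4/1/2014 17:05:00,40.7588,-73.9776,B02598
4/1/2014 17:40:00,40.7594,-73.9722,B02598
"""


def pickups_per_hour(rows):
    """Count pickups by hour of day (0-23)."""
    counts = Counter()
    for row in rows:
        stamp = datetime.strptime(row["Date/Time"], "%m/%d/%Y %H:%M:%S")
        counts[stamp.hour] += 1
    return counts


counts = pickups_per_hour(csv.DictReader(io.StringIO(SAMPLE)))
for hour, n in sorted(counts.items()):
    print(f"{hour:02d}:00  {'#' * n}  ({n})")
```

To run this on a downloaded file, pass csv.DictReader(open(path, newline="")) instead of the in-memory sample.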
11. Recommender Systems Dataset
This is a portal to a collection of rich datasets that were used in lab research projects at UCSD. It contains various datasets from popular websites, such as Goodreads book reviews, Amazon product reviews, bartending data, and social media data, that are used in building recommender systems.
Data Link: Recommender systems dataset
Project Idea: Build a product recommendation system like Amazon’s. A recommendation system can suggest products, movies, etc. based on your interests and the things you have liked and used earlier.
Source Code: Movie Recommendation System Project
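One simple way to recommend from past ratings is user-based similarity: find the user whose ratings look most like yours and suggest what they rated that you have not seen. The sketch below measures similarity with the cosine of the two rating vectors. The users, items, ratings, and helper names are all invented; the UCSD datasets are far larger and each has its own schema.

```python
import math

# Invented ratings (user -> {item: rating on a 1-5 scale}).
ratings = {
    "ana":   {"book_a": 5, "book_b": 4, "book_c": 1},
    "ben":   {"book_a": 4, "book_b": 5, "book_d": 4},
    "chloe": {"book_c": 5, "book_e": 4},
}


def cosine(u, v):
    """Cosine similarity between two sparse rating dicts."""
    shared = set(u) & set(v)
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = math.sqrt(sum(r * r for r in u.values()))
    norm_v = math.sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v) if dot else 0.0


def recommend(user, ratings):
    """Suggest unseen items from the most similar other user, best first."""
    others = [name for name in ratings if name != user]
    nearest = max(others, key=lambda name: cosine(ratings[user], ratings[name]))
    unseen = {i: r for i, r in ratings[nearest].items() if i not in ratings[user]}
    return sorted(unseen, key=unseen.get, reverse=True)


print(recommend("ana", ratings))  # ['book_d']
```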
12. UCI Spambase Dataset
Classifying emails as spam or non-spam is a very common and useful task. The dataset contains 4,601 emails, each described by 57 attributes. You can build models to filter out the spam.
Data Link: UCI spambase dataset
Project Idea: You can build a model that can identify your emails as spam or non-spam.
13. GTSRB (German traffic sign recognition benchmark) Dataset
The GTSRB dataset contains around 50,000 images of traffic signs belonging to 43 different classes and contains information on the bounding box of each sign. The dataset is used for multiclass classification.
Data Link: GTSRB dataset
Project Idea: Build a model using a deep learning framework that classifies traffic signs and also locates the bounding box of each sign. Traffic sign classification is also useful in autonomous vehicles for identifying signs and then taking appropriate actions.
Source Code: Traffic Signs Recognition Python Project
14. Cityscapes Dataset
This is an open-source dataset for computer vision projects. It contains high-quality pixel-level annotations of video sequences recorded in street scenes from 50 different cities. The dataset is useful for semantic segmentation and for training deep neural networks to understand urban scenes.
Data Link: Cityscapes dataset
Project Idea: Perform image segmentation and detect different objects in road videos. Image segmentation is the process of partitioning an image into regions by assigning each pixel to a category such as car, bus, person, tree, or road.
15. Kinetics Dataset
There are three different Kinetics datasets: Kinetics 400, Kinetics 600, and Kinetics 700. This is a large-scale collection that contains URL links to around 650,000 high-quality video clips.
Data Link: Kinetics dataset
Project Idea: Build a human action recognition model that detects what a person is doing. An action is recognized from a series of observations, such as consecutive video frames.
16. IMDB-Wiki dataset
The IMDB-Wiki dataset is one of the largest open-source datasets of face images with gender and age labels. The images are collected from IMDB and Wikipedia. It has more than 500,000 labeled images.
Data Link: IMDB wiki dataset
Project Idea: Make a model that detects faces and predicts their gender and age. You can group ages into non-overlapping ranges like 0-9, 10-19, 20-29, 30-39, etc.
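If you treat age prediction as classification, each age needs to map to exactly one range, so the ranges should not overlap or leave gaps. Here is a minimal sketch of fixed-width bucketing; the age_bucket helper and the default width of 10 years are illustrative choices, not something prescribed by the dataset.

```python
def age_bucket(age, width=10):
    """Map an age to a non-overlapping range label such as '20-29'."""
    if age < 0:
        raise ValueError("age must be non-negative")
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


print(age_bucket(25))  # 20-29
print(age_bucket(30))  # 30-39
```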
17. Color Detection Dataset
The dataset contains a CSV file that has 865 color names with their corresponding RGB (red, green, and blue) values of the color. It also has the hexadecimal value of the color.
Data Link: Color Detection Dataset
Project Idea: The color dataset can be used to make a color detection app with an interface for picking a color from an image; the app then displays the name of that color.
Source Code: Color Detection Python Project
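The core of such an app is a nearest-match lookup: given the RGB value of the clicked pixel, find the named color whose RGB value is closest. The sketch below uses a tiny hard-coded table as a stand-in for the CSV and measures closeness as the sum of absolute channel differences; the table layout and the nearest_color helper are assumptions for illustration.

```python
# A tiny stand-in for the CSV: (name, R, G, B). The real file has 865 rows,
# and its exact column layout is not reproduced here.
COLORS = [
    ("Red", 255, 0, 0),
    ("Green", 0, 128, 0),
    ("Blue", 0, 0, 255),
    ("White", 255, 255, 255),
    ("Black", 0, 0, 0),
]


def nearest_color(r, g, b, colors=COLORS):
    """Return the name whose RGB value is closest to (r, g, b)."""
    def distance(entry):
        _, cr, cg, cb = entry
        return abs(r - cr) + abs(g - cg) + abs(b - cb)
    return min(colors, key=distance)[0]


print(nearest_color(250, 10, 5))   # Red
print(nearest_color(20, 20, 30))   # Black
```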
18. Urban Sound 8K dataset
The urban sound dataset contains 8732 urban sounds from 10 classes like an air conditioner, dog bark, drilling, siren, street music, etc. The dataset is popular for urban sound classification problems.
Data Link: Urban Sound 8K dataset
Project Idea: We can build a sound classification system to detect the type of urban sound playing in the background. This will help you get started with audio data and understand how to work with unstructured data.
19. Librispeech Dataset
This dataset contains a large amount of read English speech derived from the LibriVox project. It has 1,000 hours of speech in various accents. It is used for speech recognition projects.
Data Link: Librispeech dataset
Project Idea: Build a speech recognition model to detect what is being said and convert it into text. The objective of speech recognition is to automatically identify what is being said in the audio.
20. Breast Histopathology Images Dataset
This dataset contains 277,524 image patches of size 50×50 extracted from 162 whole mount slide images of breast cancer specimens scanned at 40x. Of these, 198,738 patches are negative and 78,786 are positive for IDC (invasive ductal carcinoma).
Data Link: Breast histopathology dataset
Project Idea: Build a model that can classify breast cancer. You can build an image classification model with convolutional neural networks.
Source Code: Breast Cancer Classification Python Project
21. Youtube 8M Dataset
The YouTube 8M dataset is a large-scale labeled video dataset that has 6.1 million YouTube video IDs, 350,000 hours of video, 2.6 billion audio/visual features, 3,862 classes, and an average of 3 labels per video. It is used for video classification.
Data Link: Youtube 8M
Project Idea: Use the dataset to build a video classification model that describes what a video is about. The model takes a series of inputs from the video and predicts which category the video belongs to.
In this article, we saw more than 20 machine learning datasets that you can use to practice machine learning or data science. Creating a dataset on your own is expensive, so using other people’s datasets is a practical way to get your work done. But read each dataset’s documentation carefully: some datasets are free to use, while others require you to give credit to the owner as stated in their terms.
Bio: Shivashish Thakur is an analyst and technical content writer. He is a technology enthusiast who loves to write about the latest cutting-edge technologies that are transforming the world. He is also a sports fan who loves to play and watch football.