Today I would like to share my experience with dataset selection. This short article may help students and beginners in data science.

Recently, I have noticed that many people apply machine learning algorithms to the wrong data. For example, a classmate of mine tried to train a Faster-RCNN on an unsuitable dataset. In another case, several students tried to classify leaf diseases from their own images, but the images were so poor that even an expert herbalist could not have classified them. Both groups ended up disappointed with deep learning, and these stories inspired me to write this article. Before using an image dataset, consider the following parameters:
Resolution: The most fundamental property of an image is arguably its resolution. Higher resolution reveals more detail, and more detail means that more features can be extracted, or that the features separate the classes better.
As you can see above, the resolution is enough for object recognition; by recognition I mean identifying a particular person. The image can also be used to train or test people detection or face detection. The image resolution in B is almost enough for both the detection and the recognition task.
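To get a feeling for how quickly detail disappears as resolution drops, here is a minimal sketch in plain Python. It assumes a toy grayscale image stored as a list of rows; the helper names `downsample` and `contrast` are made up for this illustration, and a real project would more likely use an image library.

```python
def downsample(image, factor):
    """Shrink a grayscale image (list of rows) by averaging factor x factor blocks."""
    height, width = len(image), len(image[0])
    out = []
    # Ignore any leftover rows/columns that do not fill a whole block.
    for top in range(0, height - height % factor, factor):
        row = []
        for left in range(0, width - width % factor, factor):
            block = [image[y][x]
                     for y in range(top, top + factor)
                     for x in range(left, left + factor)]
            row.append(sum(block) / len(block))
        out.append(row)
    return out


def contrast(image):
    """Difference between the brightest and the darkest pixel."""
    pixels = [p for row in image for p in row]
    return max(pixels) - min(pixels)


# Toy 8x8 image: black background with a one-pixel-wide bright vertical line.
image = [[255 if x == 3 else 0 for x in range(8)] for _ in range(8)]

for factor in (1, 2, 4, 8):
    small = downsample(image, factor)
    print(f"factor {factor}: {len(small[0])}x{len(small)} pixels, "
          f"contrast {contrast(small)}")
```

The thin line fades a little more with every step and vanishes completely once the whole image collapses into a single pixel. Fine details in a real dataset, such as small lesions or distant faces, are lost in the same way.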
Image channels: I say “channels” rather than “color or grayscale” because the term is more general. For example, some cameras provide RGBD (D stands for distance), and others deliver a temperature value for each pixel in addition to its color. Finding a person with a thermal camera is sometimes easier than with a traditional RGB camera. If you try to train your algorithm on RGB images taken at night (without a flash), it may fail.
Illumination: Many image datasets are intended for developing prototypes, not the end product. For example, your algorithm may eventually have to work under uneven illumination, even though you first try it under almost perfect illumination.
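If your dataset only contains well-lit images, one cheap way to probe this is to simulate uneven lighting yourself. The sketch below is only an illustration: it assumes a grayscale image stored as a list of rows with values from 0 to 255, and the function name `apply_illumination_gradient` and the gain values are made up for this example. A linear left-to-right gradient is a very rough model of real lighting.

```python
def apply_illumination_gradient(image, left_gain, right_gain):
    """Scale brightness linearly from left_gain (first column) to right_gain (last column)."""
    width = len(image[0])
    result = []
    for row in image:
        new_row = []
        for x, pixel in enumerate(row):
            # t runs from 0.0 at the left edge to 1.0 at the right edge.
            t = x / (width - 1) if width > 1 else 0.0
            gain = left_gain + t * (right_gain - left_gain)
            # Clamp to the valid 8-bit range after scaling.
            new_row.append(min(255, max(0, round(pixel * gain))))
        result.append(new_row)
    return result


# A perfectly evenly lit toy image, then the same image with a dark right side.
evenly_lit = [[200] * 5 for _ in range(3)]
uneven = apply_illumination_gradient(evenly_lit, left_gain=1.0, right_gain=0.2)
print(uneven[0])
```

Running your model on both versions gives a first hint of how sensitive it is to lighting, although it does not replace images captured under real conditions.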
Imaging device: Do not forget that even two 2MP color cameras with the same lens can deliver images of different quality. There are many aspects to watch out for, e.g., gain, shutter, pixel technology, and even lens technology.
Time line: By this I mean when the images were collected. Imagine you want to build a skin disease classifier. Do not forget that a disease may produce different types of abnormalities on the skin over time. For example, a disease may turn the skin red in the first week and then black in the following month. Take this into account when you collect your images.
Number of samples: Good machine learning depends on a good teacher. If you don’t provide good samples, you can’t expect good results. In other words, you should choose samples so that they cover the whole problem space, and you can’t achieve this with a small number of samples. For example, when you train a face detector, you should provide faces of different sizes, genders, ages, and skin colors, with and without glasses, and with and without mustaches. You can also use data augmentation: add pixel-level noise such as salt & pepper or Gaussian noise, or apply rotation and other distortions. Another point is the number of samples in each class. If you provide 1000 face images and 100 non-face images, your classifier can easily reach about 90% accuracy from the very start. A classifier that ignores the image and always answers “face” already scores 1000/1100, about 91%, whereas a fair coin flip would only reach about 50%. So be careful with accuracy, especially when you evaluate on imbalanced classes.
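Two ideas from this paragraph can be sketched in a few lines of plain Python: noise-based augmentation and the accuracy trap with imbalanced classes. This is only an illustration under simple assumptions. Images are toy grayscale lists of rows, all function names are made up for this example, and balanced accuracy (the mean of the per-class recalls) is one common alternative metric that I add here for comparison.

```python
import random


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise to every pixel and clamp to 0..255."""
    return [[min(255, max(0, round(p + rng.gauss(0.0, sigma)))) for p in row]
            for row in image]


def add_salt_and_pepper(image, amount, rng):
    """Set roughly a fraction `amount` of the pixels to pure black or pure white."""
    return [[rng.choice((0, 255)) if rng.random() < amount else p for p in row]
            for row in image]


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recalls, so every class counts equally."""
    recalls = []
    for cls in set(y_true):
        preds_for_cls = [p for t, p in zip(y_true, y_pred) if t == cls]
        recalls.append(sum(p == cls for p in preds_for_cls) / len(preds_for_cls))
    return sum(recalls) / len(recalls)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Augmentation: two noisy copies of one toy 4x4 image.
image = [[128] * 4 for _ in range(4)]
augmented = [add_gaussian_noise(image, 10.0, rng),
             add_salt_and_pepper(image, 0.1, rng)]

# Imbalance: 1000 faces and 100 non-faces, as in the example above.
y_true = ["face"] * 1000 + ["non-face"] * 100
always_face = ["face"] * len(y_true)
coin = [rng.choice(("face", "non-face")) for _ in y_true]

print("always 'face' accuracy:         ", accuracy(y_true, always_face))
print("always 'face' balanced accuracy:", balanced_accuracy(y_true, always_face))
print("fair coin accuracy:             ", accuracy(y_true, coin))
```

The classifier that always answers "face" looks strong on plain accuracy but drops to chance level on balanced accuracy, which is why a single accuracy number should not be trusted on imbalanced classes.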