First component of deep learning is DATA

Data is the soul of deep learning; without it, there is no deep learning. Data is the only thing you have to provide, and your deep learning model learns everything from it. There are two types of data: labeled data, which is created by humans or under human supervision, and unlabeled data, which is just raw data. Obtaining labeled data is difficult, so if you can get a good dataset for free, you are lucky. Below we will discuss all the steps you have to perform after obtaining labeled data, and the modifications you have to take care of before feeding it to the model.

First of all, let us look at the kinds of data you can use for deep learning. For practice, you can find datasets on Kaggle, and Google also offers a dataset search engine. There are mainly five types of data you will come across: image, tabular, audio, video, and text. In image data you get images along with their labels; the labels may be encoded as folder names or listed in a CSV file. Labeled audio and video data likewise comes with labels in folder names or a CSV file. Text data usually arrives in a CSV file with the label beside each example. Finally, tabular data is arranged so that each column represents a different feature.

Step 1: Your first task is to find a way to read your data from whatever format it is in into local variables: the inputs in x and the labels in y. You can use tools such as pandas and NumPy to read data from CSV files. Once the data is in the x and y variables, you need to divide it into three sets: one for training, one for validation, and one for testing on unseen data. Create three variables named train, valid, and test, and assign each of them its share of the inputs x and the outputs y. For the test set you only need the input variable x, since it is used to judge model performance on unseen data. The purpose of each set is given below, followed by a small sketch of the split in code.

 

Train set: You create this set to train your model together with its labels. This is the data from which your model learns.

 

Validation set: You create this set, carved out of the training data, to check your model's performance after each epoch. By holding out some training examples, we can monitor the model at every epoch; if the validation scores stay low and do not improve, it tells us the model needs work.

 

Test set: It is used to check the model's performance before going into the deployment phase. This is the first time the model sees truly unseen data, and it shows us the results we can expect in the real world.
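
Here is a minimal sketch of the split described above, assuming a hypothetical data.csv file with a "label" column (the file name and column names are just placeholders) and using scikit-learn's train_test_split, one common tool for this:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Read the dataset (hypothetical file and column names).
df = pd.read_csv("data.csv")
x = df.drop(columns=["label"])
y = df["label"]

# First carve out 20% as the test set, then split the remainder
# into train (80%) and validation (20%).
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, random_state=42)
x_train, x_valid, y_train, y_valid = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42)
```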

 

It is often said that if your validation scores are good, your model will generally work well in practice. So take precautions when dividing the data, and try to build a validation set that covers all the kinds of cases your model may face in the future.

 

Step 2: After creating the three sets, you need to understand your data using techniques such as graphs, charts, and histograms, and by checking summary statistics like the mean and standard deviation. If it is tabular data, you should examine each column and understand its importance, and when needed, create new columns or delete existing ones based on the other columns. This is called feature engineering, and it is well known that good feature engineering helps a deep learning model learn much faster.
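
A small sketch of this kind of exploration, reusing the hypothetical data.csv from Step 1 (the "distance" column here is just an illustrative numeric column):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("data.csv")  # same hypothetical file as in Step 1

# Summary statistics (count, mean, std, min, quartiles, max) per column.
print(df.describe())

# Histogram of one numeric column to inspect its distribution.
df["distance"].hist(bins=30)
plt.xlabel("distance")
plt.ylabel("count")
plt.show()
```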

Creating a new feature from one or more existing features in the dataset is called feature extraction. You also need to handle NULLs and empty strings in the dataset, a form of noise that is often introduced by humans at data-entry time.

Say we have a dataset from a car company with two columns, distance and time: distance is the distance covered by the car in the time t given in the other column. We can create a new velocity column by computing distance/time for each row, so that our model learns more quickly.
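
A small pandas sketch of this feature extraction, including one simple way to handle the missing values mentioned above (the file name is a placeholder; dropping incomplete rows is just one possible strategy):

```python
import pandas as pd

df = pd.read_csv("cars.csv")  # hypothetical file with distance and time columns

# Treat empty strings as missing, then drop rows where either value is absent,
# since velocity cannot be computed from incomplete entries.
df = df.replace("", pd.NA).dropna(subset=["distance", "time"])

# Feature extraction: derive velocity from the two existing columns.
df["velocity"] = df["distance"] / df["time"]
```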

Step 3: After preprocessing your data, you can create more of it using data augmentation techniques, which are mainly used in vision and audio. In augmentation we intentionally introduce noise or create new examples by applying small changes to existing data. This technique is mainly used to fight overfitting in deep learning: sometimes a model learns more than it should, becoming very good at memorizing the exact representation of its training inputs while performing badly on unseen or new data.

For data augmentation on images we can use flip, rotate, crop, padding, lighting changes, and affine transforms. We will cover each of these augmentation techniques in future posts.
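
As one possible sketch, assuming the torchvision library (the post does not name a specific tool), several of these augmentations can be composed into a single pipeline:

```python
from torchvision import transforms

# Each transform is applied randomly, so every epoch
# sees slightly different versions of the same images.
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                     # flip
    transforms.RandomRotation(degrees=15),                      # rotate
    transforms.RandomCrop(224, padding=4),                      # crop + padding
    transforms.ColorJitter(brightness=0.2),                     # lighting
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),   # affine
])

# augmented = augment(image)  # image: a PIL Image or tensor
```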

Step 4: Lastly, you need to convert your data into numerical form, because a deep learning model can only work with numbers. There are many techniques to numericalize text data, such as embeddings and sparse matrices, among others. It is also well established that a deep learning model learns faster and more efficiently when its input features are on a small, consistent scale. Two common approaches are scaling all inputs into the 0-to-1 (or -1-to-1) range, and normalizing them to have mean 0 and standard deviation 1. After following these four steps, we feed our data into the architecture so it can learn.
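
A minimal NumPy sketch of the second variant, standardizing each feature to mean 0 and standard deviation 1 (the toy array is just for illustration):

```python
import numpy as np

x = np.array([[10.0, 200.0], [20.0, 400.0], [30.0, 600.0]])  # toy features

# Standardize each feature column: subtract its mean, divide by its std.
x_norm = (x - x.mean(axis=0)) / x.std(axis=0)

print(x_norm.mean(axis=0))  # ~0 for each column
print(x_norm.std(axis=0))   # ~1 for each column
```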

 

Besides all these steps, you need a way to inspect your data as you go. Write a function to view the data you have; for example, if you have image data, write a function to display a few images using the matplotlib library.
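
A simple sketch of such a viewing function, assuming the images are NumPy arrays (the function name and defaults are just placeholders):

```python
import matplotlib.pyplot as plt

def show_images(images, labels, n=4):
    """Display the first n images with their labels in one row."""
    fig, axes = plt.subplots(1, n, figsize=(3 * n, 3))
    for ax, img, lbl in zip(axes, images, labels):
        ax.imshow(img)
        ax.set_title(str(lbl))
        ax.axis("off")
    plt.show()
```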

The nature of your data governs which architecture you will use for modeling. If you have image data, you will most probably use a convolutional architecture; if you have sequence-to-sequence data such as text, you will most probably use a recurrent neural network; and if you want to model tabular data, which is sometimes framed as a regression problem, you will use plain linear (fully connected) neural networks.

 

CONCLUSION: In this post we have seen the importance of data and the steps needed to prepare it before introducing it to an architecture. We have discussed data preparation and data augmentation. You also need to devise a method for feeding data to the architecture in batches; these are called data loaders. If you need any help, let me know. I will be happy to help you understand the concepts.
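
As a closing sketch of the data-loader idea, assuming PyTorch (which the post does not name explicitly; the toy tensors stand in for your prepared data):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

# Wrap the already numericalized inputs and labels into a dataset.
x_train = torch.randn(100, 2)          # toy inputs
y_train = torch.randint(0, 2, (100,))  # toy labels
dataset = TensorDataset(x_train, y_train)

# The loader yields shuffled mini-batches, ready for the training loop.
loader = DataLoader(dataset, batch_size=16, shuffle=True)
for xb, yb in loader:
    pass  # one training step per batch would go here
```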

 





