1. What is data science? Why data science?
a. What is data science?
Data science combines data inference, algorithm development, and technology to solve analytically complex real-life problems. For instance, Netflix uses data science to find patterns in its users’ viewing behavior and recommend suitable movies to them.
At the core is data: troves of raw information, streaming in and stored in enterprise data warehouses. There is much to learn by mining it, and there are advanced capabilities we can build with it. Retrieving data from different sources is the work of data engineering. A data engineer retrieves data from many sources and arranges it so that it is easy to use and study.
There are two main aspects of data science:
- Discovery of data insight: This aspect of data science is all about uncovering findings from data, diving in at a granular level to mine and understand complex behaviors, trends, and inferences. For example, Netflix finds out about trends, behaviors, and genres of movies that its users are into.
- Development of data product: This aspect of data science is all about taking the input and algorithmically predicting and producing an output. For instance, Netflix uses the findings from the above example to recommend movies to its users.
To be a data scientist, you need a few things: mathematics expertise, technology and hacking skills, and business or strategy acumen. It’s OK if you’re not familiar with these yet; in this tutorial we will go slowly to help you understand Data Science (DS) and how to implement some basic algorithms on a Raspberry Pi.
b. Why Data Science?
In the 2010s and the decades to come, data science is an essential skill, required not only in technology majors but also in majors that are not related to technology. For instance, if you are a web developer, even basic data analysis will let you see which webpage front ends and user interfaces users feel most comfortable with, so your webpages become more interactive and attractive to users. As an example from a non-technology field, if you are a business manager, data science will be a very helpful assistant because it helps you catch trends and decide on the selling strategy for your business.
As both examples show, data science is very useful and is needed in the present era. Beyond real-world applications, data science also has many applications in theoretical research, such as intense laser physics, molecular physics, and microbiology.
If you are a professional in your field, it’s good to know another tool that is useful for your work and your business. If you are a student deciding which major to commit to, Data Science is a trending major that is being studied and applied in many big firms and companies. Furthermore, salaries in this field are quite high. That’s why you should learn about data science.
2. Process of data science
In this section, we will go through the process of data science for solving a problem by working through an example: house price prediction.
Imagine you’re buying a house for your family and you want to estimate a fair price for it. To do so, you first go around and ask real estate sellers for their offers. This part is called “data collecting” in data science terms. In real-life data science projects, this part is done by data engineers, and it’s essential to do it carefully and broadly. If you only ask the real estate sellers of one area, such as one district of a city, you only have data for that district, which is not representative, because there might be houses with suitable prices in other districts. In brief, the first step is to collect data, and it has to be done carefully and broadly.
Secondly, after collecting the data, you should preprocess it, for example by sorting it into a different order or filling in missing values. For instance, when you collect the data, some listings will include information about the rooms in the house, or about the location and the social facilities around the area, and some won’t. This part is called “data cleaning”. At the end of this part, the data is clean and ready to be used.
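As a minimal sketch of data cleaning, the snippet below uses only the Python standard library, so it should run on a Raspberry Pi without extra packages. The listings, the field names (district, rooms, price), and the choice to fill missing room counts with the median are all assumptions made for illustration; a real project may clean its data differently.

```python
from statistics import median

# Hypothetical listings collected from sellers; None marks a missing value.
houses = [
    {"district": "A", "rooms": 3, "price": 210_000},
    {"district": "B", "rooms": None, "price": 180_000},
    {"district": "A", "rooms": 4, "price": 260_000},
    {"district": "C", "rooms": 2, "price": None},
]


def clean(listings):
    """Drop listings without a price, fill missing room counts, sort by price."""
    # A listing without a price cannot teach us anything about prices.
    priced = [dict(h) for h in listings if h["price"] is not None]
    # Fill missing room counts with the median of the known ones (an assumption).
    known_rooms = [h["rooms"] for h in priced if h["rooms"] is not None]
    fill_value = median(known_rooms)
    for h in priced:
        if h["rooms"] is None:
            h["rooms"] = fill_value
    # Sort the cleaned listings from cheapest to most expensive.
    return sorted(priced, key=lambda h: h["price"])


cleaned = clean(houses)
for h in cleaned:
    print(h)
```

The function works on copies of the listings, so the raw data stays untouched and the cleaning can be repeated with a different rule if needed.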
The next part is “feature engineering”. In this part, data scientists choose the important features that significantly affect the house’s price, or come up with new features that are more informative. For instance, you may know the length, the width, and the height of a house, which are important features for its price. From these, you can calculate the area and the volume of the house, which are new features.
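The area and volume example can be sketched in a few lines of Python. The dictionary keys and the measurements below are made up for illustration, and the house is assumed to be a simple box shape.

```python
def add_features(house):
    """Return a copy of the house with derived 'area' and 'volume' features."""
    enriched = dict(house)
    # New feature 1: floor area, from length and width.
    enriched["area"] = house["length"] * house["width"]
    # New feature 2: volume, from the area and the height.
    enriched["volume"] = enriched["area"] * house["height"]
    return enriched


# Hypothetical measurements in metres.
house = {"length": 10.0, "width": 8.0, "height": 3.0}
features = add_features(house)
print(features["area"], features["volume"])  # prints: 80.0 240.0
```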
Then comes the main part: “applying algorithms”. In this part, we choose a learning algorithm and use it to learn something from the data. This is like working out the patterns or rules of the prices based on the features. There are many algorithms, suited to large or small amounts of data, so choosing a suitable algorithm for the data affects how accurate the result is. For instance, tree-based algorithms are more suitable for fluctuating data, and logistic regression is suitable for problems with binary outcomes. After choosing an algorithm, we build a model that contains the parameters of the chosen algorithm, and this model learns from the data. This process is called training.
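To make “training” concrete, here is a minimal sketch that uses simple linear regression fitted by ordinary least squares. This algorithm is our own choice for illustration, picked because it has only two parameters; it is not one of the algorithms named above. The areas and prices are invented numbers.

```python
def train(areas, prices):
    """Fit price = slope * area + intercept by ordinary least squares."""
    n = len(areas)
    mean_x = sum(areas) / n
    mean_y = sum(prices) / n
    # Covariance of area and price, and variance of area (both unnormalised).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(areas, prices))
    sxx = sum((x - mean_x) ** 2 for x in areas)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The "model" is just the parameters the algorithm learned.
    return {"slope": slope, "intercept": intercept}


# Hypothetical training data: floor area in square metres and price.
areas = [50.0, 80.0, 100.0, 120.0]
prices = [110_000.0, 170_000.0, 210_000.0, 250_000.0]

model = train(areas, prices)
print(model)
```

Here the model is nothing more than a slope and an intercept. More powerful algorithms learn many more parameters, but the idea is the same: the parameters are adjusted until they fit the training data.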
Finally, we use the model that we trained in the previous part to make predictions on new data or test data.
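A minimal sketch of the prediction step is shown below. It assumes an already trained linear model with two parameters, slope and intercept; the parameter values, the test houses, and the use of mean absolute error as a quality measure are all assumptions for illustration.

```python
def predict(model, area):
    """Predict a price from the floor area using the model's parameters."""
    return model["slope"] * area + model["intercept"]


def mean_absolute_error(model, areas, prices):
    """Average distance between predicted and real prices on test data."""
    errors = [abs(predict(model, x) - y) for x, y in zip(areas, prices)]
    return sum(errors) / len(errors)


# Hypothetical parameters, standing in for the result of a training step.
model = {"slope": 2000.0, "intercept": 10000.0}

# New data: a house the model has never seen.
estimate = predict(model, 75.0)

# Test data: houses with known prices, kept aside to check the model.
test_areas = [60.0, 90.0]
test_prices = [135_000.0, 185_000.0]
mae = mean_absolute_error(model, test_areas, test_prices)

print(estimate, mae)
```

A small error on test data suggests the model has learned a useful rule; a large error suggests going back to the earlier steps, for example collecting more data or choosing a different algorithm.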
Overall, you can think of data science as collecting data, choosing the data that qualifies, and teaching a computer to learn parameters and information from that data, so that we end up with a model, or, by analogy, a method, to predict new data automatically.