New data is created every single second – a ton of it. This data comes from your smartphone that tracks your movements, your web browser that tracks your every click and keystroke, and even your smart fridge that is constantly capturing data about you – the consumer.
This is done so the companies that make your products can better understand your usage and improve their services/products. That’s the short version of it.
In this guide, we’ll take a deep look into:
- The world of data science and what it consists of
- Some basic data science terminology
- How data science works
- How data science translates into big data analytics
- Case Study: Southwest Airlines Saves over $100 Million with Big Data Analytics
- What a legacy business needs to get started with data science and big data analytics
What is Data Science?
Data science is the practice of deriving valuable information, such as actionable insights, by organizing and analyzing large datasets. It is a complex field that calls for expert knowledge of the industry as well as expertise in mathematics, statistics, and programming. In practice, that comes down to three skill sets:
- Hacking Skills: Hacking skills refer to computer programming knowledge, the ability to write programs, come up with complicated algorithms, and materialize those concepts into reality using computer languages.
- Math & Statistics Knowledge: Mathematical and statistical knowledge gives data scientists the ability to base their problem concepts and algorithms on existing principles as well as tweak their programs for different real-world scenarios.
- Domain Expertise: The real-world scenarios are almost always specific to an industry or market, which is why domain expertise is also required.
Data science brings all of these abilities together to create an entirely new stream of information. It uses computer programming to work through hundreds of gigabytes of data by automating the process of data mining, mathematics to understand the algorithms and data models, and domain expertise to put the resulting information into perspective and, ultimately, to use.
Basic Data Science Terminology: What the Words Mean
Below are definitions of some important words that will help you understand many of the topics explained ahead.
1. Structured data
Structured (organized) data refers to information sorted into rows and columns, where rows represent observations and columns represent characteristics.
2. Unstructured data
Unstructured (unorganized) data refers to raw datasets, including audio files, pictures, raw/unformatted text, etc.
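To make the difference concrete, here is a minimal Python sketch. The customer records, the column names, and the review text are all made up for illustration, and the word count is a deliberately naive stand-in for the much heavier processing that unstructured data usually needs.

```python
# Structured data: each row is an observation, each column a characteristic.
# The column names and values below are invented for illustration.
structured_rows = [
    {"customer_id": 1, "age": 34, "monthly_spend": 120.50},
    {"customer_id": 2, "age": 27, "monthly_spend": 89.99},
    {"customer_id": 3, "age": 45, "monthly_spend": 240.00},
]

# Because the layout is known in advance, a question like
# "what is the average spend?" takes a single expression.
average_spend = sum(row["monthly_spend"] for row in structured_rows) / len(structured_rows)

# Unstructured data: raw text with no predefined rows or columns.
unstructured_review = "Loved the new fridge, but the app keeps logging me out!"

# Before it can be analyzed, it has to be given some structure.
# Here we strip punctuation and count how often each word appears.
cleaned = unstructured_review.lower().replace(",", "").replace("!", "")
word_counts = {}
for word in cleaned.split():
    word_counts[word] = word_counts.get(word, 0) + 1

print("average spend:", round(average_spend, 2))
print("word counts:", word_counts)
```

The structured rows can be queried right away, while the review has to be converted into something countable first. That extra conversion step is a large part of why unstructured data is harder to work with.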
3. Artificial Intelligence and Machine Learning
Artificial intelligence is a core element of big data analytics and data science. Since big data involves extremely large datasets that cannot be processed (gathered, cleansed, and analyzed) manually, data scientists use artificial intelligence, particularly machine learning, to train machines (like Cloud AI) to process the data for them with a high degree of accuracy.
Machine learning is a sub-field of artificial intelligence (AI) that has grown into a very large field on its own. Put simply, machine learning refers to the process of training a computer to learn and act based on models and algorithms. As it finds new information, it can adjust its behavior and make accurate predictions without any human interference.
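As a rough illustration of "learning from data and adjusting," the Python sketch below fits a straight line to a handful of points, makes a prediction, and then refits when a new observation arrives. The ad-spend and sales figures are hypothetical, and a least-squares line is only one of the simplest possible models; real machine learning systems are far more elaborate.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) pair to a new input."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical history: ad spend (in thousands of dollars) vs. units sold.
ad_spend = [1, 2, 3, 4]
units_sold = [12, 14, 16, 18]

# "Training": the program works out the relationship from the examples.
model = fit_line(ad_spend, units_sold)
forecast = predict(model, 5)

# A new observation arrives that does not match the old pattern.
ad_spend.append(5)
units_sold.append(26)

# Refitting lets the model adjust its behavior to the new information.
updated_model = fit_line(ad_spend, units_sold)
updated_forecast = predict(updated_model, 6)

print("first model:", model, "forecast for 5:", forecast)
print("updated model:", updated_model, "forecast for 6:", updated_forecast)
```

Nobody tells the program what the relationship between spend and sales is. It estimates that relationship from the examples and revises the estimate as more examples come in.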
4. Data mining
Data mining refers to analyzing large datasets with the help of computers to find relationships between variables and derive insights.
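One common way to "find relationships between variables" is to measure how strongly two columns move together. The sketch below computes the Pearson correlation coefficient by hand in Python. The temperature and sales figures are invented for illustration, and correlation is just one of many data mining techniques.

```python
import math


def pearson(xs, ys):
    """Pearson correlation: +1 means the values rise together, -1 means one falls as the other rises."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical daily figures for a small shop.
temperature = [18, 21, 25, 28, 31]
ice_cream_sales = [20, 27, 33, 41, 45]
umbrella_sales = [30, 22, 15, 9, 6]

r_ice_cream = pearson(temperature, ice_cream_sales)
r_umbrella = pearson(temperature, umbrella_sales)

print(f"temperature vs. ice cream sales: {r_ice_cream:.2f}")
print(f"temperature vs. umbrella sales:  {r_umbrella:.2f}")
```

A value close to +1 or -1 suggests a strong relationship worth investigating, though it does not prove that one variable causes the other.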
5. Big data
Similar to machine learning, big data is a complex term that is often misunderstood and misused. A simple way of differentiating between big data and general datasets is to ask the following question:
“Can my home computer or laptop process and analyze this information on its own?”
If the answer is “no, it will probably crash” then the information likely belongs to the big data category.
6. Business Intelligence (BI)
Business Intelligence (BI) refers to adding business-centric metrics to computer algorithms and models in order to find insights and data that are relevant to your own company.
How it Works: The 5 Steps of Data Science
Now that you have a basic understanding of what data science is, you might think that it is just data analytics in disguise, and since you’re already analyzing your data, you’re involved in data science.
That assumption is wrong, and it is a common misconception. While there are similarities between data analytics and data science, the scope of the latter is far broader. More importantly, data science follows a very strict process.
So what exactly is data science? Data science can be defined as the culmination of these 7 steps: data wrangling, data cleansing, data preparation, model learning, model validation, model deployment, and data visualization.
But this isn’t an engineer’s guide to data science; it’s a business executive’s guide. So in this article, we’ll look at something more digestible: Ozdemir’s 5 Steps of Data Science.
In his book, Principles of Data Science, Sinan Ozdemir outlines five steps of data science that summarize the process in an easy-to-understand manner.
1. Asking an interesting question
As a business owner, your first step to data science should be a brainstorming session to come up with questions before even looking at your data. The main reason why you would want to do this before you do anything with the data is so you don’t limit yourself…
More often than not, data is not the limitation – it’s the analysis. And yet many entrepreneurs (and even data analysts) are guilty of thinking the opposite. Interesting questions go unanswered because the company decides that the data to answer that question may not exist so they don’t even try.
Do not fall for this trap.
2. Obtaining the data
The second step is, of course, obtaining the data. Depending on your requirements, you may have to look at private data or at data in the public domain, and the procedure for obtaining each is different. The type of data you obtain will also dictate the time and effort required. Data already packaged in databases is ideal, but chances are you’ll have to scrape the data yourself. Don’t worry, there are plenty of tools available just for this.
3. Exploring the data
After the data has been gathered and cleaned (organized), it’s ready for exploration. Exploration is meant to help you understand the data, the relationships between variables, and the various patterns in your dataset. If you’re running tests or making predictions, you will form your hypothesis during this step and check it against random samples of the data.
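A first pass at exploration often means nothing more than summarizing a column and looking for values that stand out. Here is a minimal Python sketch using the standard library's statistics module. The daily order counts are hypothetical, and the "two standard deviations" cutoff is a common rule of thumb, not a fixed standard.

```python
import statistics

# Hypothetical daily order counts for one week (made-up numbers).
daily_orders = [102, 98, 110, 95, 105, 300, 101]

mean_orders = statistics.mean(daily_orders)
median_orders = statistics.median(daily_orders)
spread = statistics.stdev(daily_orders)  # sample standard deviation

# Flag days that sit more than two standard deviations from the mean.
outliers = [x for x in daily_orders if abs(x - mean_orders) > 2 * spread]

print(f"mean={mean_orders:.1f} median={median_orders} stdev={spread:.1f}")
print("days worth a closer look:", outliers)
```

When the mean and the median are far apart, as they are here, a few unusual days are probably skewing the picture. That is exactly the kind of observation that turns into a hypothesis to test.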
4. Modeling the data
Modeling the data is a very broad term and involves most of the core practices of data science including creating algorithms and training machine learning models. You can begin modeling your data after the early analysis has been done and you have enough information about your dataset that you can use statistical and machine learning models.
5. Communicating and visualizing the results
Data visualization might seem like the easiest step in data science, but it’s actually quite difficult and arguably the most crucial. When communicating and visualizing the results, it’s important to take into consideration the numerous principles, psychological, artistic, and otherwise, that can alter the way data is perceived by decision-makers.
How Data Science Translates into Big Data Analytics
Big data analytics is a subfield of data science that focuses on using smart computer software to process extremely large chunks of data, usually through cloud computing. Even though these two terms share very similar definitions, the average business is more interested in big data analytics for one main reason: ease of use.
To get started with data science, you need to hire a team of data scientists who will, in short, obtain, explore, model, and communicate the data. However, data scientists are very sought-after and thus command a hefty salary. Google Cloud Platform suggests that companies have the following roles for an in-house data science department:
- Data analyst
- Data engineer
- Data scientist
- Statistician
- Applied ML Engineer
- Ethicist
- Social scientist
- Researcher
- Analytics manager
- Decision maker (Tech lead)
So instead, businesses turn to big data analytics delivered as SaaS (software as a service). Third-party software lets businesses analyze data with their existing software team. To make integration even simpler, many companies prefer to use their cloud service provider for big data analytics rather than a third-party vendor. For instance, Google Cloud Platform (GCP) has built-in big data and machine learning capabilities along with dozens of support services that manage your data all in one place.
Case Study: Southwest Airlines Saves over $100 Million with Big Data Analytics
Up until 2015, Southwest Airlines did not have a system powerful and accurate enough to map out its hundreds of scheduled flights each week. As a result, the company lost billions of dollars on fuel and airport fees as its gigantic fleet of airplanes idled on the tarmac waiting for clearance.
This wait time could be avoided by better planning and scheduling of trips. So in late 2015, Southwest Airlines became the first U.S. domestic airline to use a big data system to tackle this exact problem. The company started using General Electric’s Flight Efficiency Services (FES) unit, a big data analytics system that was able to map out hundreds of flights.
Without data science and a robust big data system, it would’ve been impossible to take into consideration variables like air humidity and the fuel load on each leg and accurately predict so many trips.
What You Need to Get Started
To summarize, data science is an incredibly powerful emerging field of analytics that helps businesses unlock valuable insights from large chunks of unused data. However, since data science is a time-consuming process and requires data scientists, many companies prefer to stick to cloud-based big data analytics software.
Companies like Google Cloud have an entire ecosystem dedicated to leveraging data for insights. Using your own cloud vendor’s dedicated service is one of the fastest and easiest methods of getting started with big data analytics.
For legacy businesses that are not on the cloud, the best thing would be to partner up with a cloud-solutions expert to help install data-capture points as well as set up a pipeline that automatically captures relevant data, processes it, and delivers it to decision-makers.
D3V Tech has several years of experience building similar pipelines and helping legacy businesses on their journey of taming and mastering big data analytics. If you would like to learn more about how big data analytics can help your business, reach out for a free consultation with one of our cloud-certified engineers.