Source: Image created by the author with Canva. This article contains a sample from our book “Descriptive stats for data-driven decisions making with Python”. Author(s): Pratik Shukla and Roberto Iriondo.

Data science and machine learning are both scientific fields driven by mathematics and programming.
Today, corporations around the world generate huge amounts of data, which professionals can analyze and visualize to identify trends and make forecasts.
Clean, easy-to-understand data allows us to produce accurate visualizations. Data science is difficult because organizations often have more data than they can readily work with, so it is crucial to find the patterns and structures in that data. Statistics provides the tools and methods to uncover these hidden patterns and structures so that specialists can draw conclusions. Statistics is therefore a foundation of data science and machine learning.
Statistics is necessary to convert observations into information. Machine learning uses a range of algorithms to predict, classify, and group data. Many libraries can perform the underlying mathematical calculations for us, but we still need to understand the mathematics behind the algorithms and statistical methods that we use.
This will give us insight into our actions and help us make data-driven decisions. The purpose of this work is to explain the fundamental concepts behind data science and machine learning. We want to demonstrate to our readers how calculations are performed and explain why such methods are necessary.
In this book, we have tried to present a handful of core statistical methods, along with Python code and examples for each. The output of some Python programs may differ from what we get by applying the theoretical concepts by hand. This happens because we rely on the output of Python libraries, and the logic a library uses to produce a result is sometimes different. It is therefore important to fully understand what the theoretical concepts mean.
Once you understand a concept, it becomes relatively simple to write pseudocode or code that carries it out. In this article, we look at descriptive statistics, but it is important to understand the basics of statistics before diving in. Statistics works with data; without data, there is nothing for statistics to work on.
To draw useful conclusions, we perform various operations on data. Sometimes, however, it is not possible to collect all the data needed for a study. For example, if we want to study people’s weight, it is impossible to obtain the weight of every person. Instead, we take samples of the data and perform our operations on them.
We will first examine the sample and the population, then discuss some sampling methods.

Sample and Population: The primary focus of statistical studies is data. Let us now look at two important types of data: the sample and the population. The main difference between a population and a sample is the number of observations in each dataset.
A population is the collection of all elements and observations relevant to the study. Numbers that we obtain from the population are called parameters. The population size is denoted by N. A sample is a collection of observations taken from the population.
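To make the distinction concrete, here is a minimal sketch that uses only Python's standard library. The weights are synthetic values generated purely for illustration (they are not real data), and names such as `population_weights` are hypothetical. It computes the mean of the whole population, which is a parameter, and the mean of a smaller sample drawn from that population.

```python
import random
import statistics

# Synthetic population: weights in kg for N = 10,000 people.
# These values are made up for illustration only.
rng = random.Random(42)  # fixed seed so the run is reproducible
population_weights = [rng.gauss(70, 12) for _ in range(10_000)]

N = len(population_weights)                             # population size
population_mean = statistics.mean(population_weights)   # a parameter

# Take a sample of 100 observations from the population.
n = 100
sample_weights = rng.sample(population_weights, n)
sample_mean = statistics.mean(sample_weights)           # computed from the sample only

print(f"N = {N}, population mean = {population_mean:.2f}")
print(f"n = {n}, sample mean     = {sample_mean:.2f}")
```

The two means will usually be close but not identical, because the sample contains only part of the population.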
Numbers derived from a sample are called statistics. The sample size is typically denoted by n. There are many ways to draw a sample from the population, and we look at a handful of these methods in this book.

Sampling techniques: (Image by Author)

Probability sampling techniques: A probability sampling technique is one that ensures each entity or subject has a chance of being selected for a particular sample.
Samples drawn this way are usually representative of the larger population. Because there is little chance of sampling bias, they tend to provide reliable results.
Simple random sampling: This method selects members at random, so every member of the population has the same chance of being selected. A simple random sample is meant to be an unbiased representation of the whole population.
Consider this example: We want to randomly select 10 students from a group of 10,000. We must first assign a label to each student. There are 10,000 students, so the labels start at 0 and end at 9999. Below is an illustration of the labeled students.
We will now use random sampling to choose 10 students from the whole group of 10,000. We need 4 boxes, each containing 10 balls numbered 0 to 9. All of the boxes are opaque, so the balls cannot be seen and every student has an equal chance of being selected.
Next, we call a child and ask for help. The child draws one ball out of each box, and we note the number on each ball. If the child draws 2 from the first box, 5 from the second, 9 from the third, and 0 from the fourth, then student number 2590 is chosen. After noting the numbers, we return the balls to their original boxes, so each box always contains 10 balls. The process is repeated until ten students have been chosen. Here is the final list that the child selected. This is how we choose randomly from among all the students.
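The four-box procedure can be simulated with a short sketch like the one below. The function names (`digits_to_label`, `draw_student_label`, `select_students`) are hypothetical, and the rule for a repeated label is an assumption: the text does not say what happens if the same student number comes up twice, so this sketch simply draws again.

```python
import random

def digits_to_label(digits):
    """Join four digits into a student label, e.g. [2, 5, 9, 0] -> 2590."""
    return int("".join(str(d) for d in digits))

def draw_student_label(rng):
    """Draw one ball (a digit 0-9) from each of the four boxes."""
    # Each box always holds all ten balls, because balls are returned after every draw.
    digits = [rng.randrange(10) for _ in range(4)]
    return digits_to_label(digits)

def select_students(count, rng):
    """Repeat the four-box draw until `count` distinct students are chosen."""
    chosen = []
    while len(chosen) < count:
        label = draw_student_label(rng)
        if label not in chosen:  # assumption: a repeated label is redrawn
            chosen.append(label)
    return chosen

rng = random.Random(0)  # fixed seed so the run is reproducible
selected = select_students(10, rng)
print(selected)
```

Because every box is restored to ten balls before each draw, each of the 10,000 labels is equally likely on every draw.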
Notice that in this example we put the balls back into their boxes after every draw. We will next look at another method of random sampling, in which we do not replace anything that has been drawn.
This is how it works: We first write every number on a chit (a small slip of paper), put all the chits in one large box, and draw 10 of them. We do not return the chits to the box after drawing them, so this is sampling without replacement.
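Drawing chits from a single box without putting them back corresponds to what Python's standard library calls sampling without replacement. The sketch below assumes the same labels 0 to 9999 as in the student example; the seed value is an arbitrary choice made only so the run is repeatable.

```python
import random

labels = range(10_000)   # one chit per student label, 0 to 9999
rng = random.Random(7)   # arbitrary fixed seed for reproducibility

# random.sample draws without replacement, so no label can appear twice.
chosen = rng.sample(labels, k=10)
print(sorted(chosen))
```

Unlike the four-box method, no check for duplicates is needed here, because a chit that has been drawn is no longer in the box.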
Advantages of Simple Random Sampling: It reduces sampling bias. It is easy to use. Disadvantages: Sampling errors can still occur. It is not suitable for large populations. We can take samples