Confident Data Skills | Kirill Eremenko

Summary of: Confident Data Skills: Master the Fundamentals of Working with Data and Supercharge Your Career (Confident Series)
By: Kirill Eremenko


Embark on a journey to discover the complex world of data and its significance in the age of technology with ‘Confident Data Skills’ by Kirill Eremenko. This book summary provides insights into the three Vs of big data – volume, velocity, and variety – and the power of data in our daily lives. Understand the historical context of data science, the groundbreaking work of Alan Turing, and the future of artificial intelligence. Master the five-step Data Science Process by Joe Blitzstein and Hanspeter Pfister, which covers defining problems, gathering data, cleaning and transforming it, analyzing it using algorithms, and effectively presenting the results. Learn the importance of data visualization and the art of presenting data in an engaging manner.

The Value of Data

Data science relies on technology, but at its core are stories built from nuggets of data. From social media posts to security-camera footage, data is a valuable commodity that businesses, governments, and researchers can collect in multiple ways. Scientists define big data by three "Vs": Volume, Velocity, and Variety. Museums, governments, and companies gathered data in hard copy for years; digital technology now lets that data be collected, stored, and analyzed at scale. Data has become so crucial a resource that it has surpassed oil as the world's most valuable commodity.

The Future of Computing

In the 1950s, Alan Turing devised a test of whether a computer could pass as human in conversation; no computer has yet passed it. Raymond Kurzweil predicts that by 2029, a computer will successfully fool the person administering the test. With technology rapidly evolving, users require a framework for successful data analysis projects. The five-step Data Science Process by Joe Blitzstein and Hanspeter Pfister is designed to help users navigate this complex landscape.

Refining Questions for Successful Data Analytics

Effective data analytics requires collaboration between business leaders and data scientists to define relevant questions and refine their investigation. The process involves understanding the company, its industry, and competitors through dialogue with colleagues, leaders, and subject area specialists. Business leaders must frame their concerns as questions, not problems, to provide a basis for data analysis. Data scientists must generate top-level visuals of the data set to identify trends and use efficient algorithms to analyze qualitative data. Collaboration helps identify missing sub-questions and refine the project’s scope, timeline, and milestones. Successful data analytics relies on thoughtful question-definition and constant dialogue throughout the process.

The Essential Guide to ETL Preparation Process

The ETL preparation process involves extracting, transforming, and loading data, and cleaning the data takes the most time. Analysts must ensure that data is in the correct format with no errors or anomalies. Fields formatted for dates or currency often cause issues, but these can be avoided by using a YYYY-MM-DD format for dates and by removing commas and symbols from currency values while retaining two decimal places. Analysts should copy data from its original source and format it in a language that a relational database can read. Outliers and anomalies must be removed or corrected. Plain-text editors such as Notepad++ or EditPad Lite can help view raw data. After loading the clean data, check the total number of rows and the accuracy of the top and bottom 100 rows, as well as any text, date, and balance fields. Due diligence at this stage of data preparation speeds up the Data Science Process as a whole.
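The date and currency cleanup described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's own code; the input date format (`%d/%m/%Y`) and the set of currency symbols are assumptions to adjust for a real source file.

```python
from datetime import datetime

def clean_date(raw, in_format="%d/%m/%Y"):
    """Normalize a date string to the unambiguous YYYY-MM-DD format.
    The input format is an assumption; match it to the source data."""
    return datetime.strptime(raw, in_format).strftime("%Y-%m-%d")

def clean_currency(raw):
    """Strip thousands separators and leading currency symbols,
    then keep exactly two decimal places."""
    stripped = raw.replace(",", "").lstrip("$€£").strip()
    return f"{float(stripped):.2f}"

print(clean_date("31/12/2023"))    # -> 2023-12-31
print(clean_currency("$1,234.5"))  # -> 1234.50
```

Normalizing every field to one canonical format before loading is what lets the later row-count and spot checks catch real anomalies rather than formatting noise.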

The Art of Algorithmic Data Analysis

Learn the different algorithms you can use to handle your data effectively, depending on your project goals and data sets’ characteristics.

Data analysis is a crucial factor in decision-making. However, choosing how to analyze data can be challenging, given the variety of algorithms available. Different types of data sets require different treatments to produce correct results, and knowing which algorithm suits a given data set can be daunting.

For data with established categories, classification algorithms such as decision trees or random forests are the most effective. For predicting numerical outcomes, regression models such as simple or multiple linear regression work well; multiple regression is useful when the study explores several variables' impact on a single dependent variable in a complex data set.
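To make the regression idea concrete, here is a minimal simple-linear-regression sketch using the closed-form least-squares solution. The data points are invented for illustration; this is not an example from the book.

```python
def simple_linear_regression(xs, ys):
    """Fit y = a + b*x by ordinary least squares (closed form)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

# Made-up data roughly following y = 2x + 1
a, b = simple_linear_regression([1, 2, 3, 4], [3.1, 4.9, 7.2, 8.8])
print(f"intercept={a:.2f}, slope={b:.2f}")
```

Multiple linear regression generalizes the same idea to several explanatory variables, which is why it suits studies of how many factors jointly affect one dependent variable.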

On the other hand, clustering is ideal for unidentified categories. Clustering enables the analysis of data sets of any size and is useful in improving targeted marketing. However, before the use of any algorithm, raw data must be correctly structured in the data set. Otherwise, later analysis stages might produce inaccurate results.
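Clustering can be illustrated with a bare-bones one-dimensional k-means sketch: assign each point to its nearest centroid, then move each centroid to the mean of its cluster. This is a toy illustration under assumed inputs, not the book's method.

```python
import random

def kmeans_1d(points, k, iterations=20, seed=0):
    """Minimal 1-D k-means: alternate nearest-centroid assignment
    with recomputing each centroid as its cluster's mean."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - centroids[i]))
            clusters[nearest].append(p)
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

# Two obvious groups: values near 1 and values near 10
print(kmeans_1d([1.0, 1.2, 0.8, 10.0, 10.3, 9.7], k=2))
```

Because no labels are given in advance, the algorithm discovers the two groups itself, which is exactly the "unidentified categories" situation the text describes, for example segmenting customers for targeted marketing.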

Reinforcement algorithms learn in a way that resembles human learning. A robot that teaches itself to walk, a common example of reinforcement learning, is more efficient than one programmed with a fixed procedure. Thompson sampling takes this further, letting reinforcement algorithms choose actions probabilistically based on accumulated evidence.

Programs built on crowdsourced data, such as SkinVision, which assesses a user's moles, use machine learning algorithms; the app recommends follow-up actions and future inspections. IBM's Watson AI system, which uses advanced algorithms, can diagnose rare diseases in a matter of minutes.

Therefore, care must be taken in choosing the right algorithm for the specific data set, and in setting clear goals, to achieve meaningful results in the data analysis process.
