Day 18 - Data and Databases, Part 1

Everything Product Managers should know about Data and Databases, explained step by step. Part 1

Dec 18, 2022

One of the core concepts in any modern software product is Data and Databases. I can't emphasize enough the importance of Data and "Insights". The best I can do is to say:

Data is not the new oil, it's even more valuable and important!

In the next couple of episodes, I'll cover more on Data and its related concepts, so you can feel more comfortable talking about Data and Data products, as well as be able to utilize it more and more in your day-to-day work.

Slides can be found here

What is Data?

Just to cover the basics, let's go with our definition of Data:

Data is a collection of facts, statistics, or other pieces of information that can be used to inform decisions or support conclusions.

Data can be numerical, textual, or graphical (data type) and can be collected from a variety of sources, such as databases, surveys, experiments, or observations.

Importance of Data

Data is a valuable asset for organizations and individuals, as it provides insights and information that can be used to make informed decisions and take strategic actions. Here are a few reasons why data is important:

Data helps to drive business growth and innovation: By analyzing data, companies can identify trends and patterns, and use this information to improve their products, services, and processes. Data can also help businesses to identify new opportunities and make better decisions about where to invest resources.
Data can improve efficiency and productivity: By analyzing data, organizations can identify inefficiencies and areas for improvement in their operations. Data can also help to automate and streamline processes, which can increase productivity and reduce costs.
Data helps to inform decision-making: Data provides a factual basis for making decisions, rather than relying on assumptions or gut feelings. By analyzing data, organizations and individuals can make more informed and accurate decisions.
Data is a key component of analytics and machine learning: Data is the fuel that powers analytics and machine learning algorithms and is used to train these systems to make predictions, detect patterns, and take actions.
Data can be used to personalize experiences and create targeted marketing: By analyzing data about customer behavior and preferences, companies can create personalized experiences and targeted marketing campaigns that are more likely to be successful.

Overall, data is a critical asset that can be used to drive growth, improve efficiency, inform decision-making, and create personalized experiences.

Qualities of Valuable Data

Just like oil, which has to be refined before it is useful and which varies in quality, not all Data is equally valuable. To be valuable and useful, Data should have several characteristics:

Accuracy: Data should be accurate and free from errors, in order to be reliable and trustworthy.
Relevance: Data should be relevant to the problem or decision at hand, and should provide useful information for the context in which it is being used.
Completeness: Data should be complete and contain all relevant information, in order to provide a comprehensive and accurate view of the situation.
Timeliness: Data should be current and up-to-date, in order to be relevant and useful.
Accessibility: Data should be easily accessible and available to those who need it, in a format that is easy to use and understand.

By having these characteristics, data can be valuable and useful for a variety of purposes, including decision-making, problem-solving, and analysis. If any of these characteristics are under question or not verifiable, you might need to think twice before basing your "next big decision" on it.
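As a rough illustration of how a couple of these qualities could be checked in practice, here is a minimal Python sketch for completeness and timeliness. The records, the field names, and the 30-day freshness threshold are all made up for this example; a real check would depend on your own data.

```python
from datetime import date

# Hypothetical customer records; None marks a missing value.
records = [
    {"id": 1, "email": "ana@example.com", "updated": date(2022, 12, 10)},
    {"id": 2, "email": None, "updated": date(2022, 12, 15)},
    {"id": 3, "email": "li@example.com", "updated": date(2022, 6, 1)},
]

def completeness(rows, field):
    """Share of rows where the field is filled in (0.0 to 1.0)."""
    filled = sum(1 for row in rows if row[field] is not None)
    return filled / len(rows)

def stale_ids(rows, today, max_age_days=30):
    """IDs of rows last updated more than max_age_days ago."""
    return [row["id"] for row in rows if (today - row["updated"]).days > max_age_days]

today = date(2022, 12, 18)  # fixed date so the result is reproducible
email_completeness = completeness(records, "email")
stale = stale_ids(records, today)

print(f"email completeness: {email_completeness:.0%}")
print(f"stale record ids: {stale}")
```

Numbers like these do not tell you whether the data is accurate or relevant, but they are a cheap first signal before a decision is based on it.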

📊 There is a lot to say about "Data-driven" decisions, which deserve their own series once we have covered the basics of Data and are comfortable with using and analyzing it.

What is a Database and why do we need it?

To go over the breadth quickly, our next stop in the "universe of data" 🪐 is Databases. It's defined as:

A database is a structured collection of data, usually stored and accessed electronically. It is a way to organize and store data that can be easily accessed, updated, and managed by computers.

In other words, a collection of data in an organized manner (easy to store, find, retrieve, edit, ...) is called a Database.

This organization of data helps us to utilize it for the purpose we have. (Note: There should always be a purpose behind data collection and analysis. Sometimes we forget about this simple concept 🙈)

The history of databases dates back to the 1960s when the first database management systems were developed. These early systems were used to store and manage large amounts of data for scientific and government applications. In the following decades, databases became more widely used in business and other industries, and the development of the Internet and the World Wide Web led to the creation of many new database-driven applications.

Databases are used in a wide variety of applications, including online stores, social media platforms, banking systems, and customer relationship management systems. They are an integral part of modern software development and are used to store and manage data for many different types of applications.

☝️Fun fact: the relational model behind today's relational databases was proposed by IBM researcher Dr. Edgar F. Codd in his 1970 paper "A Relational Model of Data for Large Shared Data Banks". The word "database" itself was already in use in the 1960s.

Different types of Databases

Because there are many different types of data and we might use them for different purposes, naturally we need different ways to store and organize them.

There are several types of databases, including relational databases, which store data in tables with rows and columns, and non-relational databases, which store data in a more flexible format.

Some common examples of relational databases include MySQL, Oracle, and Microsoft SQL Server. Non-relational databases include MongoDB, Cassandra, and Redis.

When choosing a database for a particular application, it is important to consider the type of data that will be stored, the number of users who will be accessing the database, and the performance and scalability requirements of the application.

Are these the only database types we have?

Actually "No!", there are many more database (DB) types, depending on how we categorize them (how the data is stored, whether it is central or distributed, what data type is stored, and so on). Let's go over these types in more detail:

Relational Database

SQL (Structured Query Language) is the language used to manage and manipulate data stored in relational databases. Relational databases are organized into tables with rows and columns, and SQL is used to create, modify, and query these tables.
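To make "create, modify, and query" a bit more concrete, here is a small sketch using SQLite through Python's built-in sqlite3 module. The products table and its contents are invented for the example, and other relational databases use slightly different SQL dialects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # temporary database that lives only in memory

# Create: define a table with three columns.
conn.execute("CREATE TABLE products (id INTEGER, name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO products (id, name, price) VALUES (?, ?, ?)",
    [(1, "Notebook", 4.5), (2, "Pen", 1.2), (3, "Backpack", 35.0)],
)

# Modify: change the price of one row.
conn.execute("UPDATE products SET price = 1.5 WHERE name = 'Pen'")

# Query: ask for the products cheaper than 10, cheapest first.
cheap = conn.execute(
    "SELECT name, price FROM products WHERE price < 10 ORDER BY price"
).fetchall()
print(cheap)
```

Each row comes back as a tuple of the selected columns, in the order the query asked for them.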

NoSQL Database

NoSQL (Not Only SQL) is a term used to describe a category of databases that are designed to handle large amounts of unstructured or semi-structured data. These databases are typically non-relational, meaning that they do not use the table-based structure of relational databases. Instead, they use more flexible data models that can store data in a variety of formats, including documents, key-value pairs, and graphs. Some common examples of NoSQL databases include MongoDB, Cassandra, and Redis.

The main types of non-relational databases are:

Column-based database
Key-value database
Graph database
Document or Object-oriented database
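To get a feel for how differently these models shape the same kind of information, here is a sketch using plain Python structures. This is only an illustration of the data shapes, not the API of MongoDB, Redis, or any other product, and all names and values are invented.

```python
# Key-value: an opaque value looked up by a unique key.
key_value_store = {
    "session:42": "user=ana;cart=3",
}

# Document: a self-describing, nested record; fields may differ per document.
document = {
    "_id": "user-1",
    "name": "Ana",
    "orders": [
        {"item": "Notebook", "qty": 2},
        {"item": "Pen", "qty": 5},
    ],
}

# Graph: nodes plus edges that describe relationships between the nodes.
nodes = {"ana", "li", "sam"}
edges = [("ana", "follows", "li"), ("li", "follows", "sam")]

def followed_by(person):
    """Return, in sorted order, the people that `person` follows."""
    return sorted(dst for src, rel, dst in edges if src == person and rel == "follows")

# Reading from each shape.
session = key_value_store["session:42"]
total_items = sum(order["qty"] for order in document["orders"])
ana_follows = followed_by("ana")
print(session, total_items, ana_follows)
```

The column-based layout is easier to appreciate next to a row-based one, so it is left out of this sketch.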

Column-based Database and more

Column-based databases store data column by column rather than row by row. This makes queries that read a few columns across many rows (for example, summing one field over millions of records) much faster, which is why they are popular for data warehousing and analytics applications. Apache Cassandra and Apache HBase are often listed in this category, although strictly speaking they are wide-column stores that group columns into families; analytics-focused columnar systems include ClickHouse and Amazon Redshift.
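The difference between the two layouts can be sketched with ordinary Python lists. This is a toy model with invented sales data, not how any particular database stores bytes on disk, but it shows why an analytics query touches less data in the column layout.

```python
# Row-oriented: each record is kept together.
rows = [
    {"order_id": 1, "country": "DE", "amount": 20.0},
    {"order_id": 2, "country": "FR", "amount": 35.0},
    {"order_id": 3, "country": "DE", "amount": 15.0},
]

# Column-oriented: one list per column, aligned by position.
columns = {
    "order_id": [1, 2, 3],
    "country": ["DE", "FR", "DE"],
    "amount": [20.0, 35.0, 15.0],
}

# Row layout: every full record is visited just to read one field.
total_from_rows = sum(row["amount"] for row in rows)

# Column layout: only the "amount" column is read.
total_from_columns = sum(columns["amount"])

print(total_from_rows, total_from_columns)
```

The flip side is that fetching one complete record from the column layout means picking a value out of every column, which is why row-based storage remains the usual choice for everyday application reads and writes.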

Other types of databases include object-oriented databases, which store data as objects in an object-oriented programming language, and graph databases, which store data as nodes and edges in a graph structure.

It is important to choose the right type of database for a particular application, based on the data being stored, the performance and scalability requirements, and the needs of the application.

Concepts around Data and Databases

Now that we have touched on Data and Databases at a very high level (we'll get our hands dirty with SQL too, but that's for another day, my friend 😉), let's go over some key concepts that you might hear or see when talking about Data:

Data model

A data model is a way of organizing and structuring data in a database. Different types of data models include the relational model, which organizes data into tables with rows and columns, and the object-oriented model, which organizes data as objects in a programming language.

Table

In a relational database, a table is a collection of data organized into rows and columns. Tables are used to store data in a structured format, and each row represents a unique record, while each column represents a specific piece of information about that record.

Primary key

A primary key is a column or set of columns in a table that uniquely identifies each row in the table. Primary keys are used to ensure the integrity and uniqueness of data in a table.
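Here is a minimal sketch of a primary key doing its job, using SQLite via Python's sqlite3 module. The users table is made up for the example; the point is that the database itself refuses a second row with the same key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ana')")

try:
    # Same id as the existing row: this violates the primary key.
    conn.execute("INSERT INTO users (id, name) VALUES (1, 'Li')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

user_count = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(duplicate_rejected, user_count)
```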

Foreign key

A foreign key is a column or set of columns in a table that refers to the primary key of another table. Foreign keys are used to establish relationships between tables in a database.
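A small sketch of a foreign key, again with SQLite through Python's sqlite3 module. The users and orders tables are hypothetical. One SQLite-specific detail: foreign keys are only enforced after the PRAGMA shown below is switched on, which most other relational databases do not require.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled

conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("""
    CREATE TABLE orders (
        id INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES users(id),  -- the foreign key
        item TEXT
    )
""")
conn.execute("INSERT INTO users VALUES (1, 'Ana')")
conn.execute("INSERT INTO orders VALUES (10, 1, 'Notebook')")

try:
    # There is no user with id 99, so this order would point at nothing.
    conn.execute("INSERT INTO orders VALUES (11, 99, 'Pen')")
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True

# The relationship lets us combine both tables in one query.
joined = conn.execute(
    "SELECT users.name, orders.item FROM orders JOIN users ON users.id = orders.user_id"
).fetchall()
print(orphan_rejected, joined)
```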

Index

An index is a data structure that helps to improve the performance of queries by allowing the database to quickly locate specific rows in a table. Indexes can be created on one or more columns in a table.
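One way to see an index at work is to ask the database how it plans to run a query before and after the index exists. The sketch below does that with SQLite's EXPLAIN QUERY PLAN via Python's sqlite3 module. The table, the index name idx_users_email, and the data are invented, and the exact wording of the plan differs between SQLite versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, name TEXT)")
conn.executemany(
    "INSERT INTO users (email, name) VALUES (?, ?)",
    [(f"user{i}@example.com", f"User {i}") for i in range(1000)],
)

query = "SELECT name FROM users WHERE email = ?"
params = ("user500@example.com",)

def plan(sql, args):
    """Return SQLite's description of how it would execute the query."""
    plan_rows = conn.execute("EXPLAIN QUERY PLAN " + sql, args).fetchall()
    return " | ".join(row[-1] for row in plan_rows)  # last column holds the text

plan_before = plan(query, params)  # without an index: a scan over the table
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = plan(query, params)   # with the index: a search that uses it

result = conn.execute(query, params).fetchall()
print(plan_before)
print(plan_after)
```

The query returns the same row either way; only the amount of work needed to find it changes. Indexes also take up space and slow down writes a little, so they are usually added for the columns that are searched most often.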

Query

A query is a request to retrieve data from a database. Queries are written in a specialized language, such as SQL, and are used to search, filter, and manipulate data in a database.
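As an example of the kind of question a PM might ask, the sketch below counts purchases per user with a single SQL query, run against SQLite through Python's sqlite3 module. The events table and its rows are invented for the illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id INTEGER, event TEXT, day TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?)",
    [
        (1, "signup", "2022-12-01"),
        (1, "purchase", "2022-12-03"),
        (2, "signup", "2022-12-02"),
        (3, "signup", "2022-12-05"),
        (3, "purchase", "2022-12-06"),
        (3, "purchase", "2022-12-09"),
    ],
)

# Filter (WHERE), group (GROUP BY), and sort (ORDER BY) in one query.
purchases_per_user = conn.execute("""
    SELECT user_id, COUNT(*) AS purchases
    FROM events
    WHERE event = 'purchase'
    GROUP BY user_id
    ORDER BY purchases DESC
""").fetchall()
print(purchases_per_user)
```

User 2 never purchased, so that user does not appear in the result at all, which is worth remembering when reading numbers produced by a filtered query.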

Normalization

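Normalization is the process of organizing a database so that each fact is stored only once, which reduces redundancy and the risk of inconsistent copies. Normalized databases are generally easier to keep consistent and maintain than non-normalized ones, although reading data back can require joining several tables.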

I often heard engineers object that something "denormalizes the data, and that's not good". There are places where denormalization is helpful (for example, to speed up read-heavy reports by avoiding joins), but overall engineers tend to prefer normalized data because of the reduced redundancy.
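The trade-off can be sketched with plain Python structures. The customer and order data below is invented, and a real database would use tables and keys rather than dictionaries, but the update problem looks the same.

```python
# Denormalized: the customer's email is repeated on every order row.
orders_flat = [
    {"order_id": 1, "customer": "Ana", "email": "ana@old.example", "item": "Notebook"},
    {"order_id": 2, "customer": "Ana", "email": "ana@old.example", "item": "Pen"},
]

# Normalized: customer details live in one place; orders point to them by ID.
customers = {1: {"name": "Ana", "email": "ana@old.example"}}
orders = [
    {"order_id": 1, "customer_id": 1, "item": "Notebook"},
    {"order_id": 2, "customer_id": 1, "item": "Pen"},
]

# Changing the email takes one update in the normalized form...
customers[1]["email"] = "ana@new.example"

# ...but every matching row has to be touched in the flat form.
rows_touched = 0
for row in orders_flat:
    if row["customer"] == "Ana":
        row["email"] = "ana@new.example"
        rows_touched += 1

# The price of normalization: reading an order with its email needs a lookup (a "join").
order_emails = [customers[order["customer_id"]]["email"] for order in orders]
print(rows_touched, order_emails)
```

If one of the flat rows were missed during the update, the data would silently disagree with itself, which is the kind of problem normalization is meant to prevent.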

Transaction

A transaction is a group of database operations that are treated as a single unit. Transactions are used to ensure the consistency and integrity of data in a database, by either committing all the operations in the transaction or rolling back any changes if an error occurs.
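The classic example is a money transfer: either both balances change or neither does. Here is a minimal sketch with SQLite via Python's sqlite3 module, where using the connection in a with block commits on success and rolls back if an error is raised. The accounts table, the names, and the transfer function are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("ana", 100), ("li", 50)])
conn.commit()

def transfer(amount, src, dst):
    """Move money between two accounts; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
            )
        return True
    except sqlite3.IntegrityError:
        return False

ok = transfer(30, "ana", "li")       # both updates are committed
failed = transfer(500, "ana", "li")  # would overdraw ana, so everything is rolled back

balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

In the failed transfer the first update (adding money to li) had already run when the second one broke the balance rule; the rollback undoes it, so no money appears out of nowhere.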

Let's call it a (sun)day and explore more concepts on Data in the following days!

What is your experience with working with Data? How comfortable are you with querying the data you need for your product? Let me know in the comments. 🙏🏻

Check today’s slides over here.

HackerPM
