Transcript#
This transcript was generated automatically and may contain errors.
Hi all. In this workshop, Francois Michonneau will be discussing the topic of working with Larger than Memory Data in R. Francois is a solution engineer at Posit with a PhD in biology from the University of Florida and over 20 years of R experience. He's a long-time contributor to the Carpentries where he served as the curriculum development lead and senior director of technology and an active R package developer and educator. Today he'll walk us through practical strategies for analyzing datasets that exceed available memory using modern tools like DuckDB, Apache Arrow, duckplyr, all while keeping familiar tidyverse workflows.
Yeah, thank you so much Michael for this introduction. I'm going to preface this by saying that I've been very sick last weekend. I'm still fighting with some congestion and issues with my voice. I have plenty of fluid here, but I may have to take some breaks at some point. Thank you all for coming. I'm very happy to talk about this topic. It's something that I think existed or wish I knew existed when I was working on some problems during my PhD and later my postdoc and it really makes data analysis easier and faster.
Before we start, I'd just like to get to know you a little bit because we're going to spend the next about three hours together. If you go to this link, let me see if I can get it without losing. It's not going to work. Okay, let's do this. You can scan this QR code and it will take you to a Poll Everywhere site where you will have a few questions to answer. I'd love to know where you are based, whether you've already used Arrow or DuckDB, the type of format your data is in most often, and one thing you'd like to learn today during this workshop.
It's great with people from lots of different places. I love to see this diversity.
Right now we are at 60-40 in terms of people who've already used DuckDB or Arrow. I'm going to really focus a lot on DuckDB today. I'm going to mention Arrow a few times at the beginning, just because I thought that in three hours it was best to really focus on one topic. But once you understand how things work in DuckDB, then it will be easy to apply in the Arrow world.
A lot of people have files, it looks like in a CSV. That's good to know. But we have also representation of other formats, so that will be good to see.
Workshop overview
I'm just going to give you a quick overview of what we're going to do today. We're going to start with a little lecture. I'm going to talk about what I call the three layers of data analysis. It's really to help build a mental model of how things work inside the computer so you have a better feel for what it's going to be like to work with larger assets, especially with DuckDB. Then we'll switch gears and we'll go into hands-on mode. And I prepared a short tutorial slash demo with some exercises for you to work with DuckDB and using mostly the dplyr syntax to work with this.
Once we have done a little bit of coding in DuckDB, we'll go back a little bit more in lecture mode and we'll talk about another package called duckplyr. Then we'll talk a little bit about when to use which package, when sometimes is it better maybe to go with DuckDB, when sometimes it may be better to do with duckplyr. And then we'll do a wrap-up and I'm hoping to highlight some resources and things that you can do to keep learning about DuckDB and other things. At the end of the session, maybe tonight or maybe for me tomorrow, I'll make sure that the links for the slides are available. All the coding that I will be doing during this session will also be put on the GitHub repo so you'll have a chance to review that later. And I think we are recording as well, so I'm sure the recording will be made available as well.
The three layers of data analysis
So yeah, when we say what does that larger than memory mean, what does it entail? Typically, if you've worked with R before, you may have seen messages like this. You cannot allocate a vector of size with a very large number of gigabytes or you may even crash directly completely your R Studio session, for instance. And to understand what's happening, I'm trying to go back to what I call the three layers. The first layer is the layer with which you interact when you are doing data analysis at the user interface. And that's going to be what you type basically at your terminal to interact with the data. In the middle, there is an engine. The engine is really responsible for converting the instructions you gave the computer into something that it can use to interact with the data that is stored either inside the hard drive or in the RAM, the memory of the computer.
And so to make this a little bit more concrete, imagine that you are working with R and you are trying to read a CSV file. Many of you said that you were working with CSV files. It will look like this, right? You would type my underscore data is getting assigned from the reader, read underscore CSV, mydata.csv. The mydata.csv is a file that lives on your hard drive. And so when you type this, what's happening is that you are going to transfer the content of this file into a representation that is going to live in the memory of your computer. And from there, it will be available to all to be analyzed, filtered, or whatever you want to do with that data.
And then when you are going to work with this data set, everything is going to happen in memory. So you're going, for instance, to use dplyr to write queries or things you want to do with your data using the syntax that comes with dplyr. dplyr, internally, is using some R code and C++ code to interact with this data in memory to give back the summarized representation from your data.
The main issue with this, with working fully in memory, is that sometimes the CSV file is very large, with hundreds of thousands, millions of rows. And when you, when the computer tries to represent the CSV file in memory, then it's too big. It doesn't fit anymore. And you can't work with it. So it's a little bit like this truck that tries to go under this bridge, but it's a little too small. The bridge is too low for the size of the truck.
So, no, there are no limitations in the CSV specification for the size of a file, as far as I know. In this case, the main limitation for the size of your CSV file is going to be the size of your hard drive. Are you going to have a hard drive large enough to store it? But CSV is also quite inefficient at storing larger datasets. And so if you really have very large datasets, CSV is probably not the best format for that.
Databases and the column-oriented approach
But so this issue is not new. It's something that has existed since, maybe since the beginning of computing. But at least in the 70s, that's really when this problem was addressed by the creation of databases. If you're not familiar with this elephant, that's the logo for Postgres SQL, which is a very popular database format, database system. And in this case, the data is going to live on your disk. But when you are going to do a query, the database is designed in a way where you don't need to load the entire data inside the memory of the computer to get your answer. The engine is able to scan the data that is on disk, decide what is helpful for it to read, and it's going to transfer some of this content into the memory of the computer, because that's faster to interact with rather than when it's on disk. And then it will send the result back to the user interface.
The issue is that we are at a conference about all in medicine and not SQL in medicine. And so often people are familiar with all but not SQL. And so they may not know how to best work with this system. And so that's why tools like dbplyr came to be. Instead of working with your data fully in memory, like in the CSV file in the first example, you can use dbplyr to create a layer inside this three-layer system where you are going to be able to write R code and still work with data stored inside a database.
And so in this case, we can use dbplyr to create a connection to a table inside our database, and we can write dbplyr syntax to try to, for instance, do a summarize by group for our data. But under the hood, what's happening is that this syntax in dbplyr gets translated into SQL and then gets passed on to the Postgres or the other database engine to do the data described before where only some of the data that is in storage is going to be put into memory to be able to materialize the answer back to the user.
And so a lot of the things we're going to do today is the same idea, except we are going to use DuckDB. And so in this case, the syntax is going to look very similar to what I just showed with Postgres. We are going to use dbplyr and to some extent duckplyr to write code with which we are going to interact with DuckDB as an engine that is going to look at the inside of the data stored on disk and or in memory. And I'll come back to that in a little bit later. The Arrow package works very similarly in the sense that you can still use the dbplyr syntax, but under the hood, you are using the Apache Arrow engine to work with the data.
So that's really a strength of the way dbplyr was designed is that from the start, it was designed where you have a way to interact with data, but then you can change what is doing the work. So when you use dbplyr to work with DataFrame in R's memory, it's going to use some code that comes with dbplyr that is going to work using R code or C++ code. But there is also a way to extend what dbplyr can do by connecting to other types of engine, including DuckDB, Apache Arrow, or the database system that I showed earlier. And this design allows us, users, to always almost use the same code, even though we might be working with data in different formats stored in different locations.
And this design allows us, users, to always almost use the same code, even though we might be working with data in different formats stored in different locations.
A big difference in working with DuckDB compared to that in other databases like Postgres is that internally, the data is stored on disk in a different way. The internal representation of what the data looks like inside the hard drive or the RAM of your computer is going to be very different in DuckDB compared to other database systems. DuckDB uses a column-oriented storage, which means that the data for one particular column is going to be all together. And then it will be the data for another column in your data set, and so on and so forth.
In database systems like Postgres, the data is going to be stored in a row. So the data is going to be stored as the full row of the data of the table next to each other on the hard drive. That makes a big difference because you can do things differently. It's optimized for different things. When data is stored as a column, you can do a lot of analytics operations much faster because you can only select some of the data without having to scan the entire table.
If you are trying to do a summary statistic on a particular column, when the data is stored as a row, you will need to scan the entire space of your hard drive where your data is stored to be able to access the data for each column. But when it's oriented in a column, then you can quickly scan and select all the columns at once, and so it makes it much, much faster. Additionally, because the data is not stored in the same location, it means that you can split the work and your computer can work on different columns at the same time. So you can have one part of your CPU that is going to, for instance, calculate the mean of column 1, while another part of your CPU is going to calculate the average in another column. It's not something you can really easily do when the data is stored in rows.
And so, DuckDB belongs to this category of database that is really designed for analytics. Postgres is more designed for transactional data. So you don't want to use DuckDB if you are an online shop and you try to record the orders of each people because you want to have this data organized in rows when the orders are being written. So it's fast, you write only the data one row after the other. But when you want to analyze your sales record, then you may want to switch to a different representation of your data using a column-oriented flow to be much faster in computing your summary statistics.
Working with DuckDB: formats and storage
So, to summarize what we've just covered, basically, we have the user interface layer that is really how you're going to interact with my data. You're going to have the computing engine that is going to process the data and do the computation on your data. And then you need to potentially think about where the data is going to be processed. Is it going to be only in memory, potentially only on disk, or a mix of both?
So, when we are working with DuckDB, there are many ways of working with it. And that's why it might make things a little bit confusing at first. But when you start asking, how do I interact with my data when you are working with DuckDB, you have multiple options. You can work with DuckDB writing a typical SQL statement like any other databases. But because it's a database designed for analytics, many other languages have started to use a written interface. So, you can, for instance, in R, use R code using the dplyr verb to work with the data stored inside DuckDB.
If you're using DuckDB, you are going to use a DuckDB engine to process your data. And then the last question is, where is my data? So, I wrote this way. It's called in-memory with disk spillover. What does that mean? Typically, when you're going to work with DuckDB data, you are going to either read, for instance, a Parquet file that is going to be put in memory. And then that's where the engine is going to take it to process it. But DuckDB is designed in such a way where if you're working with a very large Parquet file, a very large CSV file, you don't have to worry about running out of memory because it will work. It's like a streaming engine. It's like a movie, basically. It's only going to look at some section of the data at a time such that it won't completely fill your memory. And it's going to do just enough computation on the data it has in memory to get the answer. And DuckDB was designed in such a way that it's very efficient at doing this. And you can get answers very quickly.
So, when you're working with DuckDB, you can have data in many formats to start with. You can work with DuckDB with data stored as CSV file, Parquet, JSON. You can have even Google Sheets. There are plugins for that. And really many different formats. DuckDB also comes with its own file format that is a single file that is self-contained, that is easy to backup, easy to share with others, and that is very convenient to work with. Again, with DuckDB, you can also have your data in many different places. You can have your file on your hard drive, just like any other file, though that could be again your CSV or Parquet files or even your DuckDB files. But you can also have your CSV, Parquet, JSON files stored in the cloud in a blob storage service like AWS S3, Google Cloud, Azure. Wherever the internet can reach, you can most likely store it. DuckDB will be able to also reach these files.
And DuckDB also supports what is called a Hive partition. So, if you're working with very large datasets, sometimes it makes sense to split the datasets into multiple files following this schema that is called a Hive partition. And so, basically, you have one variable that is just like a folder system, a regular folder system. So, for instance, if you have data for multiple years, you can have one folder for each year. And within each of these folders, one file. And DuckDB will be able to read that. And so, that's very convenient because if you need to, for instance, only get data for a single year, you don't need to load the entire dataset. You can just load or scan the file that is in the year you are interested in.
Parquet files
I just want to talk a little bit more about Parquet files because that's what we're going to use today for the coding exercise. And the Parquet files are similar to what I described before with DuckDB in the sense that the layout of the data is column-oriented on disk. And that means that there's a lot of benefits. For instance, instead of having to store exactly the same data 100 times in a row, you can compress that by saying there is this data and it's going to be repeated 100 times. So, that's a way to be much more efficient in terms of storage on disk.
Another main benefit compared to, for instance, CSV file is that the data is typed. So, it means that each column has a specific data type. And so, it's much harder to introduce mistakes inside your file because if you try to insert, let's say, a numerical value in something that should be text, you could get a warning. And also, if you set your Parquet file to have specific data types, it means that everyone who is going to read this file will get the same data type, which might not be the case with CSV files depending on locals and other things. But the main benefit is that it uses the same type of layout that what, for instance, Arrow or DuckDB uses internally. And so, it's very, very fast to read this Parquet file because there is no gymnastics of trying to change the representation from an orientation in row to an orientation in column. Everything is in column and stays in column. And so, we have this very, very fast serialization from Parquet to things like Arrow or DuckDB.
Q&A before hands-on coding
Okay. Before we jump into coding, are there any other... I'm wondering what if the database is SQL and can switch.
So, yeah. I wanted to take this time before we jump into coding to answer some questions. I just saw that there was a question in the chat about what if the database is SQL and cannot switch or install DuckDB. So, yeah. If your data is already, for instance, in... I don't know what would be a good example. Maybe something like MySQL, for instance, or Microsoft... What is it called? SQL Server. Then you won't be able to use DuckDB directly on this data. But most database systems offer a way to export the data potentially in CSV file. Now, a lot of more modern database systems even offer Parquet. And once you have this file, this data exported in different formats, then you'll be able to use DuckDB to work on it. But then it's more like a data engineering problem. And it's really dependent on your particular architecture, the type of data you have. And so, we'll have to see in more detail what it looks like.
doesn't mean you won't be able to use the dbplyr tools, right? Yeah, that's right.
Is the RDS format similar to Parquet file column-oriented? Not quite. Because the RDS file is really an internal format to R. I don't know all the details of RDS, but I think because it's much more... You can store any kind of object in it. I don't think it's optimized for data settings, especially in the column-oriented. That being said, in R, if you're used to working with R, you're already used to the column-oriented format because that's how data frames are represented, right? It's a list that is shaped, that is rectangular, and that is formatted a bit differently. But each column is an element of a list. And so, it means that in memory, all the elements of a column are next to each other, not each row.
Okay, so snowflake is, there's another question about, yeah, there's another question about, are you saying that if my data is in snowflake, I need to export it first as csv or parquet instead of using a db connection to snowflake? Okay, so snowflake is a different beast altogether, because they are actually also a column-oriented format. And so a lot of the benefits that I'm talking about here around column-oriented dataset and ways to working with them applies also to snowflake. The main difference is that snowflake is really meant for very, very, very large enterprise datasets and not the type of dataset you would use in DuckDB. And also that a snowflake gives you a lot more control in terms of data governance that are not available in DuckDB.
And so that's a kind of a whole different beast. But typically, you would not be able, you wouldn't need to export the data from snowflake into csv or parquet to analyze them, because snowflake is already very efficient at analyzing data. That being said, if you wanted to do it locally, then that would be a good way of doing it.
Setting up the demo environment
So if you want to follow along with me today, I'm going to use Positron for this demo. You are welcome to use Positron as well if you're already using it. If you are using RStudio, that's fine. I'm not doing anything very specific to Positron today, so you'll be able to follow along just fine with RStudio. Make sure that you have the DuckDB, dplyr, and duckplyr packages installed, mostly DuckDB and dplyr. I'm going to put the links back in the chat. And then I provided the material in a Git repository. So if you already use Git and have Git installed, you can start a new project from Git using the URL of the repository. Or if you go to this repository in the readme, you will have access to a link with a zip file that contains that. So let me see if I can get these links quickly and easily for you.
So I'm putting back the links, the other instructions. I'm going to give you about five minutes to get caught up, and I'm going to go drink some more water, and I'll be back in just a minute, and I'll show you exactly where we need to be.
If you are in Positron, again, it would be pretty similar if you are in RStudio, but we're going to do a new from Git here on my user home directory, but you can do wherever you would like. I'm going to copy and paste the URL for the repository. And then you should be able to click OK, and that's going to clone the repository and bring you to something like this.
Okay, so I'm going to go ahead and get started. Again, if you have a question, I don't want to interrupt. If you want to follow along and you're not quite caught up yet, feel free to put a message in the chat. I'm happy to wait a few more seconds.
But so, in this repository, there is just a few things. One file that is called code.R that right now has only two lines, a call for the library.db and one library.dplyr. It also has a data folder and in it there is a Parquet file. I didn't provide you with a huge Parquet file because I want it to be still fast to download, easy to work with. But what I'm showing today is on 500,000 rows.
If you are going to work with a data frame in R with 500,000 rows, most likely it would fit in memory. Most likely it will still be fast, but I think that you'll see that with DuckDB it will be very fast today. Another question, can we import SQL data to DuckDB without the intermediate CSV file?
I think it depends on what kind of SQL data it is. Because I know that there are some connectors, I believe now that you can use the DuckDB engine with data stored as Postgres. But I think that's the only option at this time, otherwise you will need to export the data.
Exploring the dataset
Let's look a little bit at the data we're working with today. Again, I'm in Positron. That's a nice feature in Positron. If you have a Parquet file, you can preview it. I'm going to give it more room because the font is pretty large. But as I mentioned before, this dataset has 481,000 rows and 300 columns.
This is what is called the PUMPS dataset. It's a dataset from the US Census data. Every few years, the US Census sends a form to a fraction of the population and asks them lots of questions. And this dataset is a representation of this data. The actual dataset is quite large, the actual dataset is quite large, definitely doesn't fit in memory, but can completely be used, analyzed in DuckDB or Arrow.
Each of the column names is relatively cryptic, but there is a summary, or a dictionary, of these different column names in the URL I just put in the chat. I'm going to put it in here as well so we can have it. We'll just work with a few of these columns.
But let's look quickly at what Positron lets us do here with this Parquet file. First on the left, we have summary statistics of the different columns. We have a histogram when it's a numeric value. It's a little bit more interesting than when it's other type of data. But let's look at this one for instance. On top of my head, I don't remember what this particular variable is, but Positron does a few things. First, it tells us that we have 61% missing data. It tells us what is the minimum value, maximum value, mean, median, and standard deviation for each of these columns.
One thing that I just want to point out is, if you are working with data like this, Positron is pretty good because you can, for instance, use it to filter data. For instance, here we have data with lots of zeros. If we wanted to look at the data with other zeros, so if we wanted to look at the data with other zeros, what we could do is not equal to zero, apply filter, and almost instantly the data viewer updates the data based on our filter and also updates the summary statistics on the left here. It's a little bit of a tangent, but I feel like it's a very helpful thing to do when we're working even with very large data sets. Positron can handle them, and it's a good way to get familiar with your data and check for any potential data quality issues.
Creating a DuckDB connection
Okay, the first thing we are going to do is create a connection. So, I'm going to create a connection with the variable con, and we're going to say dbconnect. So, this particular function is from the dbi package. It's something that the db package uses to be able to interact with DuckDB data. So, here I want to say we want to connect to database, so dbconnect. Our database is the DuckDB format, and it's stored in memory. Just doing this doesn't do much. It just creates the connection.
The next step is we are going to work with the parquet files, and we're going to use the function tbl to say that it's a table. We're going to use as a first argument the name of our database connection. If I didn't load my packages, that's not going to work. Here we go, and then here we are going to do double quotes, and we're going to use the function read parquet. This is a function actually from DuckDB, so that's why we're using the double quotation, and I'm going to provide the path of where our parquet file is stored. So, here I'm going to put a single quotation mark to put the path. The path is data slash pumps underscore person underscore sample dot parquet, and closes this parenthesis, and then we are going to do just the head function. I'm going to copy this in the chat, so if you have any issue with the path or anything like that, you can just copy and paste.
Okay, and if I do a control enter here, we can see this output. So, let's look at what we have here. The first line is telling us that we are dealing with SQL connections, so that's why it says source SQL. You may have noticed here that we don't actually know the number of rows in this particular table. We have a double question mark instead of the number of rows, and then we still have the right number of columns. It tells us that we are dealing with a DuckDB database, and that it's actually stored in memory, and then it gives us a sample of the data.
So, just to summarize what we just did here is that we are using a DuckDB connection in our database to read this parquet file and get just the first few rows. So far so good, nothing very interesting for now. I just want to show this example to get the beginning of the data.
Okay, actually let me also show you what happens if we don't do head and we just do this. We basically get the same, except we get a few more rows, I think. Here, by default, head gives us six, and just without a head function we get the first 10 rows, but we get the same thing otherwise.
Okay, the next thing we're going to do, and we're going to copy and paste this first line, I'm just going to use another function, just call names, so it's a way to extract the column names for all tables. Again, so far so good.
Another thing I wanted to show you is that if you try to do tail, like you might be familiar with with a data frame, here it's not going to work. It's not going to work because we are dealing still with a database, not a data frame. And because it's a database, we don't know what the end of the file is. Because here we are working with lazy evaluation, and so that means that the entire data is not inside the memory of R, and so it cannot just give us the end of the file because it doesn't know what it looks like. The full table doesn't exist in the memory, so it can't access the end of it. The notion of potentially row numbers and the end of a file doesn't exist in this context because the data has not been loaded fully in memory.
The notion of potentially row numbers and the end of a file doesn't exist in this context because the data has not been loaded fully in memory.
We can, however, get some statistics. For instance, we could get the number of rows by using the count function, and so in this case we will use an aggregate function from DuckDB to compute the number of rows in our dataset. And you saw that it was very quick to calculate that we had the same number that we saw in the Data Explorer earlier, a little over 480,000 rows.
If I wanted to say, for instance, let's look at the ST column. This column stores the name of the state, the US state, for each row of our dataset. This particular function call is going to give us the unique values for these particular columns. Here we have data for seven different states that are being shown here. If I do it again, you may notice that the output is not in the same order. Again, because that's the way the database is going to work. It's going to reach to the content stored on disk, it's going to grab whatever it needs to get the answer, but there is no guarantee that it's going to scan the dataset in exactly the same order each time, and so your output might not always be in the same order.
Let's keep going. Let's do another, just very quick for now, exploratory data analysis on this DuckDB. We can use distinct using two different columns. For instance, if we wanted to see the unique combination of state and year, you would put these two variables in a distinct column, exactly as if you are working with a data frame in our memory, except that here we are working with a DuckDB database by going through a Parquet file.
Okay, so to make sure that we get the same result every time, you can add an arrange call. So, if we do arrange entity here, then in this case, we say that we care about the order in which the values for each of these columns are being returned, and so anytime we are going to call these particular lines of codes, the output will be in the same order.
But another thing you might notice still is that here we have the bold quotation mark, and that's because this particular data frame is not still in our memory. We are seeing it here, but we can't really do anything with it. For instance, we couldn't pass it, for instance, to ggplot. It's not a great example here because I don't know what we would put in ggplot, but it doesn't exist in our memory. To be able to transfer this representation of the data that we have been working with into our memory, we need to add something to it, and that's going to be the collect function. So, let me just to make it clear, I'm going to just copy and paste this and add the collect function. In this case, the collect function is going to ensure that all the data that is being returned by this particular query is going to be converted into a table, and that will be made available in our memories.
Let's see, I have one more example to show you, and that's to illustrate something that we're going to do later. If you've never worked with this data, you wouldn't know, but in this particular data, each row doesn't represent an individual. Instead, there is a column with a weight that represents the fraction of the population that answered similar questions, and so we need to take this value into consideration to see how many people we're actually dealing with. So, let's again go back to this, and we are going to do a summarize call, and we are going to do, this is pwgtp is the name of the column, we are going to say nAIRAMPS equals true, and we are going to do an aggregate by state and year, and let's see where it takes us. And so, these numbers here now are the actual estimate of the population in the state of Wyoming in 2018, because we are aggregating for the full data set that we have over this particular column that has the weight in it, and so that's a way to get the the data. And again, if we wanted to make sure that the output will always be in same order, we would want to add an arrange, and then to make this available in memory, we would want to add a collect. I must have made a mistake, this t is not formed, why is that? Because I forgot to pipe here.
Okay, and in this case it works. Okay, so there are a few questions in the chat, so let me go back to these questions, and then we can see what we're going to do next. Why is a source SQL? I thought we were using DuckDB engine. Now, that's not a silly question. So, this particular thing at the beginning where it says source is to mean that we are using the DBI interface to interact with a database, and so I think that's what it refers to in this case. We are definitely using the DuckDB engine to compute the summary statistics that we are dealing with here today.
And so now, very good point, this is not what the workflow would like. I just wanted to first illustrate quickly, easily, how we could build on top of queries, and I'm repeating this as a way for you to have a record of exactly the comments I used. So, when I save this file and put it in the repo at the end of the workshop, you will have a record of every single comment that I have used.
Switching to a persistent DuckDB connection
Yeah, we can definitely do this, but I'll show you a more, maybe, a standard way of working with this, because up to this, bringing in memories with a collect function. The collect is a function that is going to look into, that is going to work with, it's going to work with your, with the database to make sure that every single result that is matching your query is going to be brought back into AuthMemory.
Yes, absolutely, we could save it with a data frame here. I'm just being in exploratory mode, so I'm not saving anything, but we could definitely create variables with it.
Okay, we're going to switch gears a little bit now and do something a little bit more realistic, because as you've seen, here we have been working with a Parquet file, and I've been, I didn't save anything. Everything that we've done is just very basic exploration. So, I'm going to call, actually, DB disconnect to close the connection that we created earlier, and so now we are not attached to this in-memory DuckDB database we have been working with.
Instead, we are going to create a new connection, and I'm still going to call it, maybe that's not a good idea, maybe it would have been better to call it something else, but I'm going to keep it this name, and we are going to still use the DB connect function just like before. I'm still going to say that I want to work with DuckDB, so I'm going to say I'm going to use a DuckDB function for the driver, and here I'm going to
use the argument dbdir. So, when we used earlier the colon memory colon to say that we wanted to work with an in-memory database, that's the same argument, except that here we are going to give it a path. So, I'm going to call it pumps.small.DuckDB. So, here we are actually going to create a DuckDB file that is going to hold our database on disk. So, we will be able to share it with others, go back to it at a later point, and if there are some particular summary statistics or things we want to save for later, we will be able to do that in this table.
Another important argument to use is read-only, and we're going to set it to false because we are actually going to write to this database. That's something also that if you're working with other database systems like Postgres, they are designed to have multiple users working on it at the same time that can make modifications to the data without having any worry for data loss. DuckDB is really not designed for this. It's really designed for only doing analytical workflows and only analyzing data. And so, if you are, for instance, creating a Shiny app that uses DuckDB and you want to have multiple people querying the data at the same time, you need to put read-only here set as true, and that will allow this process. Otherwise, if you are going to let DuckDB have data written to it, then you can only have one person at a time opening the file.
So, it's a big difference compared to other database systems. And it's just that writing data to DuckDB can be slow depending on how it's done, especially in the traditional transactional way where you are thinking about one record at a time. And so, to avoid any issues with overwriting data, it's only one person at a time that is allowed to write data to DuckDB at a time. But if you are using it in a context where it's just reading the data, like in a Shiny app, then you can have as many people as you want connecting to it.
Creating and writing to a DuckDB file
Okay, so that was my little aside. And here we are going to use another function from DBI called dbexecute because we are going to write some SQL code. It's not too scary. The first argument is going to be the name of our connection. So here it's con. And here it's going to be a string of text that is going to be our SQL statement where we are going to create a table inside our DuckDB file that we'll be able to use later. So in this case, we're going to say create table. I'm using uppercase because it's what people do. That's a standard in SQL. It also works if you write it in lowercase. Here I'm using lowercase because that's the name of our table. So I'm going to use it called pumps person small. And to create this table, we are going to say as select star from read parquet data. Let's copy and paste here so I don't make any mistakes.
And I think that's it. We can put a semicolon at the end again. Good practice. Oh, I didn't evaluate this first. So that doesn't exist. Okay. So first I need to make sure that object exists in my memory that we are creating the connection. And if we look here, you can see that the file was created. Right now it's empty, but the file is created. It's here. That's our db file. Now I can do db execute with a statement. And you can see the progress bar. And so in this case, all our parquet file gets written to our db file. We can check that it works again using a function from the dbi package, db list tables with the name of our connection. And we can see that the table we just created, pumps person small, is available there.
Okay. So let me put this piece of code also in the chat so you can be sure that you have it if you want to follow along. Great question. Yeah, db is definitely a path. I could have put maybe data here. So we could put all of the data in the same spot here. Just keep it at the root of the project. So we have everything in one place. But this is not just a name. It's actually the path of where the file is going to be stored. If you are going to connect to a database that is hosted somewhere online, you would put the URL for your database there.
Yes, that's a good point. Yeah, we just need to have, that's one of the potential challenges of working in this way here, is that this is a SQL statement. And to be able to pass it using the execute, it needs to be a string. So it needs to be in double quotation. And then we still need quotation here because the read parquet function from db to be able to read a parquet file also needs to be quoted. So here I'm using single quotations. So hopefully that explains it.
Adding multiple tables to the database
Okay, so right now we have a database with a single table. And we are going to create a new table also to this in this database. It's something that is very typical when you are working with databases is that your database has multiple tables. And then you can do joins on them, for instance, to be able to reduce the amount of data that is in each table to reduce duplication. So you create tidy tables, and then you can use joins to combine and get the insight you want from your data.
So here we are going to create a table called mean commute time. As its name indicates, we're going to use the data inside our pumps person small table to evaluate the in commute time for people by state and year. So we are going to use the table function again, here we're going to say con. And here we are going to call it directly pumps person small. So in this case, we don't need to refer to the read parquet file anymore, because this table already exists inside our wdb file, we read it, it's now in wdb format, it's inside our database. And so we can call it just using the name of the table.
So that would be a much more realistic way of working with parquet data, for instance, is that you would create a table inside your wdb file that represents the data. Okay, so what from this table, we are going to want to use a summarize function. And here I'm going to do mean commute time as the sum of gwmnp, the name of the variable, times, again, we need to do the weights here, wgtp divided by the sum of the weight, wgtp, that's right, that looks good. And we are going to aggregate by state and year. Let's see what it looks like. It looks like nothing at the moment, because I assigned it, so I'm going to do this first. Okay, we have 14 rows. It looks like there is an error, but that's okay, maybe it's a good idea to avoid warning which an error may go through. And one thing I forgot to do is to call the collect function to make sure that actually it was in memory. Okay, so now we computed our mean commute type for each state and year we have in our dataset.
This particular object now lives in the memory of our session, but we are going to use another function from the dbi package called tbrite table. We give it the name of our connection again. We are going to say that we want to call our table mean commute time, and we want to use a variable mean commute time to do that. And now if we do dblist tables again on our connection object, we have two tables. We have our mean commute time and the full table. We can even make sure that it looks the way we want by using the table function again, or tbl function, meet mean compute time. And so in this case we are not in our memory, because here it says it was a table, but here we are really inside our db database, and we can see that it's a table that we are calling from our db database. And I'm going to do db disconnect. We're going to clear this connection for now, and we are going to remove the object. Perfect.
Let's go back here. Now we still have our file. We cannot display these files in in positron, but our db file is here, and we are going to be able to read from it. Let's check that just to make sure that everything worked. I'm going to create a new section, because we are going to do things a little bit more advanced. We are going to recall our connection. Oops, I'm using the wrong shortcut, because I'm not set up correctly. db connect. Again, we are using tag db, and we want to use our pumps small tag db file. That's what I call it, yes. Okay, and here if I do db list tables, we have the two tables inside our file. So, we'll be able to to work with this.
Let's be a way to copy and put all of this code in in the chat for now. So, you have an option. Perfect, and yeah, we're going to take a little break just before we continue, because we still have time. I'm going to look at the questions, and then we'll break for maybe 10 minutes.
Yep, that could work as well. You just need to be careful if you are using, like, this is your script. This will work fine, but just be careful if you try to do something in, for instance, a Shiny app that can be exposed to the public. You need to use a special function from Glue that escapes SQL character, because you could potentially corrupt your data. Well, maybe not because you've been read-only, but still, it's a good idea to check that.
But that should work. Let me again give it an error. Yep, if you get a table already exists, it means that it was created and everything is good. Perfect. Interesting that it took a long time. Yeah, the semicolon is definitely optional in this particular case with DuckDB, but it's good to put it in. Okay, before we take a little break, are there any questions at this point?
We'll go next in a little bit more advanced data manipulation, and so you can see what it looks like. But I feel like we're already seeing a little bit how you can work with data in DuckDB using very familiar syntax, if you already know how to use dplyr. Here, this type of call is exactly what you would be doing if it was a data frame in memory, except here that you're working with a database. We have to write a little bit of DBI code here because we're getting things set up, but we'll talk a little bit more about why we're using this approach in the second part. Okay, it looks like there are no other questions, so I suggest we take... it's 17, so...
Yeah, let's take... what is this? My brain is not working very well. Eight minutes, so we go back at 25 past the hour. Sounds good, thanks. Okay, thank you everyone, and again, if you have questions that come up during the break, feel free to put them in the chat. See you at 25 past the hour.
After the break: advanced data manipulation
Okay, welcome back everyone. There, any questions or anything that came up during the break? Okay, perfect. So, if there's anything, don't hesitate to put the question in the chat. I try to come back to it regularly and try to answer them, and I think we'll have some time at the end or so for more questions. Okay, let's continue. I want to show you kind of a type of workflow that I have often faced with when working with this type of data.
In most cases, you are going to be able to use the dplyr syntax exactly as you would if you are working with a data frame in memory. But sometimes it might not work quite the way you want, and so you need to learn what your options are. So, to illustrate this, we are going to go back to our table, and so we are going to... Okay, that's the right file, so that's our full table. And up to now, what we did, we created a table that has, for each state and each year, the mean time for commuting to work for the people who are working in our dataset.
Let's say we're interested in figuring out if there are, or which occupations people have, that have higher than average or lower than average commute time. So, if... Let's select some data from... I'm sorry, some columns from our dataset. We're going to select a state, because we're going to need it for our aggregation. We want to select a year. We want to select this column GWMNP, so that's our actual commute time. We want to also get PWGTP, that's our weight, and the column called OCCP for the occupation. So, let's look at a few rows to see what it looks like, and what... Let's also remove some data where there is no...
I could use a new filter out, but I'm going to use this one, GWMNP. Okay, so here, this is what it would look like when we remove the data. We have here the commute time, here we have the weight, and here we have the name of the occupation. Just for the sake of argument for this exercise, let's say that we don't want to have to deal with these very long names. We just want to be able to use the abbreviation. If we were with a dataset in memory, we should be able to use something like mutate category equals... I'm sorry, it's more just an abbreviation, it's actually a category. So, under these first three letters, there are multiple occupations. So, we just want to do it based on the category of the occupation, and not this full label.
Using SQL functions within DuckDB
We could use a stringer package, for instance. I hope I have it installed. We'll see. I'm actually split I, and we say we want to split the OCCP columns. We want to split using the dash, and we want the first element. But if we are in DuckDB land, this doesn't work, because right now, we are dealing with still data inside DuckDB. So, we have several options, and it really depends on your use cases. Here, the data is big, but not that big. So, one option could be that we first call collect here. By calling collect here, then we bring all the data in memory, then it becomes a table, a regular data frame. And so, that means that we can use our mutate call to do our stringer manipulation. That's all good and fine, as long as your data is not that large and still fits in memory.
What if you wanted to do this actually within DuckDB? DuckDB is a very comprehensive database system with many functions, including functions to do string manipulation. So, here, instead of doing it this way, we can say, again, mutate to secret category. But here, I'm going to use the SQL functions to tell it that I actually want to use the SQL statement. And in DuckDB land, that function is called split parts. That is almost the same syntax as the function from the stringer package. And we want to do on the OCCP columns, we want to split according to the dash, and we want the first element. And here, we get the same result, except that we are still in memory. So, here, this particular string manipulation is done by one of the functions available in DuckDB.
I'm going to just going to stop sharing my screen a second so you can see. This one, I believe. Yes, it looks good. Okay, let's go to the DuckDB website. Documentation, guides, it's more like SQL here. Functions. And so, this is an overview of all the functions that are available inside DuckDB, organized by categories. And so, for text functions, for instance, there are, I don't know, 50 maybe a little bit more function that you can use to do all kind of manipulation to your strings. And these are going to be done within DuckDB by the DuckDB engine on your DuckDB data. And so, they will scale to millions of rows without any issues. And they tend to be very, very, very fast.
they will scale to millions of rows without any issues. And they tend to be very, very, very fast.
And so, that particular trick that I showed by using the SQL function allows you to leverage all of these functions that are available to you and that you can use directly into DuckDB. So, in some cases, we have to step outside of what we're familiar with and deploy your function to be able to remain memory efficient and work with very large data. But for the most part, these functions are relatively easy to use and not that different from what you are already used to when you are working with R.
Using pure SQL with dbGetQuery
Another way of doing this is to use pure SQL. That's another way of doing it. And I'm going to give a different example for this. Let me see. So, we are still using the account through the collection. I'm going to give it some room. And we're going to, again, select the column we want. We want the column state again. We want the column year again. We want the column, we're going to do a different one called drive SP. We're going to look at what it looks like in a second. And here, yeah, let's say we wanted to do something. So, we still want the WPWGTP. We want to say from person small. And we're going to say where drive SP is not. No. So, in this particular column called drive SP, it categorizes people who are using their car to commute. And it's stored in an interesting way.
It first starts with the number of, what do you call it? What's a good way to explain it? There's a fraction of people that are in the car. So, basically, you get one if there is a single person in the car. It's a driver alone. 0.5 if there are two people in the car. 0.2, etc. You get a fraction of the space that this person occupies in a car, basically. If we wanted to do some actual, I don't know, averages on it, it would be great if we could just extract this value. And we can do that using a very similar function that we, the same function that we used before, using splits part. Give it the name of the columns we want. Here, the delineator is going to be the space. And we want the first element. Let's see what it looks like. Let's give it a name.
Now, we created a data frame because we use a dbGetQuery. So, dbGetQuery automatically is going to return the data frame. So, it's going to do the collect, the equivalent of what would be a collect at the end of a dplyr pipeline. And it will, yeah, do what we asked it to do using, in this case, a split part function from the db to extract the numeric value that we had. The thing you may notice, though, is that here we are still dealing with a character value and not a number. And so, one way of do, you can walk around it, is to get, to put two columns and then add float at the end. And in this case, the the value in the column will be cast as a numerical value. And so, in your data frame that you get out of this, you would get a double. Sorry, I may have misspoken. It's something that I have internally. It's actually a data frame, not a table. Here, it's showing as a table. But for you, it will look like a data frame. I tried to disable this before teaching, and I forgot to do this today. So, yeah, you should still be getting a data frame, maybe not a table, but at least a data frame. Is that right?
And it's possible that, yeah, it's possible you get disconnected. Yeah, I'm not sure why you would get disconnected. But if you want to connect again, you should just evaluate this line, and you should be able to get back your connection. And, again, so we created this table for, yeah, right now it's this data frame. If we wanted, if you wanted, then you would say as table of the result. I think you could just, I think this would work. Yeah, and then you'll get a table directly out of it. So, here we are. We got a data frame. It's in memory. But if we wanted to, we could also add this table inside our database. And so, we could reuse it later for something else.
So, instead of get query that is used to return a data frame into our session, we're going to use db.execute, which, as its name suggests, executes some SQL code.
Okay, let me go back to this. So, we're going to use exactly this statement. Copy it, because I don't want to make mistakes, and I won't type for nothing. I'm going to add a line here. I'm going to say create table, like we did before. We're going to create carpool as, and we put a parenthesis all the way around our statement. And that should give us this number. So, when we use db.execute, we get a number back. When it's a table creation, we get the number of rows in the table we just created. So, it means that in these rows, we have 176,000 rows. And to make sure that it works, we can go back to the previous syntax we used with the tbl function, give it a name for connection, carpool, and we get, yeah, we get our table now stored inside our db database file. And for instance, if we wanted to get the number of person per call, basically.
So, I'm running out of ideas. I'm going to call it carpool, and with some pwGTP, that's our weight. So, we need to multiply it by the value we got in our carpool column, and we divide it by the sum of the pwGTP.
And we want to aggregate this by state and year. For good measure, we are going to arrange by state and year, and we are going to collect, so we get actual data frame.
And so, yeah, that gives us a carpooling index, and basically, we could even see between the two years for each state, whether people are carpooling more or less. So, that's one way of doing this.
Hands-on exercise
I think at this point, it will be maybe good for you. I've been doing a lot of coding, and maybe you've been able to follow along, but give you a little exercise, so you'll be able to practice this on your own. It's a little bit advanced, but it uses a lot of the context of some concepts that the one we've done up until now. You'll definitely want to use the DuckDB documentation. Let me put this here. Let me put it in the code of block, so you can read the exercise that you have, and I'm going to put a link to the documentation.
I'm going to give you this link to have access to the doc and that should help you to get started to try to solve this problem I'm going to give you maybe 10 minutes To start, if you are done, at least if you think you are done, put a thumbs up in Zoom so I have an idea of where we're at. If you are completely stuck and facing some issues, feel free to ask questions in the chat. I'll stay around, I'll keep an eye out in the chat. I'll give you about 10 minutes to do this.
Okay, I think I saw one thumb up, so it's been about 10 minutes. I'm happy to give you more time if you want to keep playing with this or you can start solving it. It's not the easiest.
Walking through the solution
Okay, I'm going to start slowly. So, let's start with our data. Okay, we said we wanted to look at the JWAP column but we're probably going to want at least state in here. Let's do that. There are lots of NAs, so let's remove the NAs.
Great, so we have almost what we want. I'm going to look at the strptime function first because I think that's what we want. Let me check this so it works. Oh, and it was $p. Yeah, I think I have this in my note and I think that's right. I think we need to first remove the dots that are here. So, we have just $am and $pm. Otherwise, we might get run into trouble. So, I'm going to do a mutate and $timestart equals. We are going to use the SQL function to be able to get to be able to use the DuckDB function and that's going to be the replace to remove that these dots. We want to remove the dots with nothing. So, we're going to do that. I'm sure there are other functions that could be used but now we have $timestart here. It's a new column and we can see that it says 6 to 5 am here which has a dot here. It doesn't have the dots.
Okay, so now we should be able to extract. We want the first time, right? That's what I remember. First time, okay. So, mutate. $timestart, I can use the SQL command. Let's get the first value by saying split part. $timestart and we're going to do it on DuckDB. So, I'm putting spaces that should do it. What did I do wrong? I forgot to put the quotation mark here. Something else I did wrong. What did I do wrong? DuckDB, because I forgot to add one because I want the first element. That's what I missed. Okay, great. So, now we have this. So, now we should be able to, this is still character. If we wanted to convert it to a time, we can use a strip time function and use $i, colon, $i today and $p, except that I still did an error. What did I do? I forgot the colon here.
DuckDB, $p, $time, DuckDB, DuckDB, DuckDB. I forgot the quotation mark here. Here we go. So, now we have this except that it's a diet time. Perfect, thank you. And here the only thing I'm missing is that I want to cast this as a time. Except now that in this particular output, it shows it to me as a number of seconds, which is not great. But I believe that if we do collect, it's still doing it this way. But I think that's new because before it used to work in DRTN. That's duration. Maybe we need another code to re-predate or something else to make this work. But yeah, it was just a little gymnastic exercise, ready to get a chance to practice using the DuckDB functions to be able to extract and work with it and combine it with the regular dplyr.
Okay, thank you. At least someone was able to do this. So, I hope this little exercise was somewhat helpful and that you get a chance to practice it. Really here, the point I wanted to emphasize is that working with data stored as DuckDB doesn't need to be scary. It doesn't change a lot about the way you are currently working. You just need to think a little bit about where the computation happens. Again, here the data is pretty small. So, we could have done the collect sooner and did all of that using traditional functions from the tidyverse of rbase. Something I forgot to mention is that in collect, if you don't provide any arguments by default, you will get all the results back. But you can also say n equals, and for instance, 1,000 and you get only the first 1,000 rows. So, if you want to get a sense for what the data looks like before pulling a huge collect that's going to bring hundreds of thousands or millions of rows in your memory, you can also do it this way.
the point I wanted to emphasize is that working with data stored as DuckDB doesn't need to be scary. It doesn't change a lot about the way you are currently working. You just need to think a little bit about where the computation happens.
Summary: DuckDB R package
Okay, I think for now that's going to conclude the coding part, and we're going to go back to the presentation. I want to talk about a few more things. Let's click here. It's this one. Oops, I said that it's a wrong slide. Here we go. Oops, I went a little too fast. Okay, so I want to summarize a little bit what we've seen working with the DuckDB R package. So, as you may have seen, it requires a little bit of SQL knowledge because you are going to use the DBI package to interact with the DuckDB files.
But it's great because you get full SQL support this way. As I've shown in the coding example, you often have the option to either use the SQL function in R to then call the DuckDB function to do exactly what you want or using the DuckDB engine inside the DuckDB database. But it requires learning a few more things to be able to pull this off. When you are going to work with the DuckDB, you'll be able to use it on the DuckDB database files like I've done. You have a persistent record of your database, and you'll be able to work with it over multiple days with collaborators.
That's very convenient. The fact that DuckDB stores all the database in a single file that you can save, email, put on GitHub is great. Another big advantage of using the DuckDB package directly like we've done is that you don't have to worry about the size of your data. It will work very smoothly, very easily. You won't have any issues up until the point where you want to bring back this data as a data frame into R's memory. Then you need to make sure at that point that it's aggregated enough, summarized enough, selected enough, that you only have the data you really need, and that you're not going to go over the memory you have available for your process.
And because it's using SQL, it's a very stable, very mature interface. DBI has been around for many years now in our ecosystem, and I don't think it's going to go anywhere. So you have something very stable to work with.
The inconvenience of working with this approach is that everything goes through SQL translation. It's also a pro, but it's also a con, because it means that in some cases, the way you are going to write the code in dplyr and the way it's going to be interpreted in SQL to be sent to the DuckDB engine might not be the best. It might be very suboptimal, and you may end up with slower than needed queries. That might be an issue if you are doing some very intensive manipulation of, for instance, geospatial data that DuckDB supports that tends to be more computationally intensive, or if you are working with very, very large datasets. But it's something to keep in mind if really speed is really of the essence. Sometimes, not going through the dplyr interface, but writing your own SQL might save you some time.
And as we've seen, some operations that you'd want to do in R are not directly supported by this SQL translation, and so you have to either do it in R by bringing the data in memory first, if it's small, or as we've seen, using some of the internal DuckDB functions to do these operations.
The code is also very boilerplate. That was one of the questions that was asked. You need to really start every time by connecting to the database with dbconnect. When you are done, it's really good practice to do dbdisconnect so you don't leave any connection open and potentially having issues with others trying to connect to your data. To really get the most out of this, it's very beneficial to learn a little bit of SQL, especially if you want to do more advanced operations. I have to say that if you have access to LLM models, they are very good at writing SQL, and so you can really get a lot of help this way.
It's a little less idiomatic than tidyverse workflows because you may have to intersperse some SQL. As you've seen, we cannot have a single mutate call that references the same column. You're going to have to get multiple mutate calls. So there are some things that you're going to have to change in your workflows. Overall, if you're working with very large datasets, going with DuckDB will really speed things up. But you are still going to have to deal with some SQL because that's just the language that works best for these very large datasets.
duckplyr overview
I think I just wanted to point out that that was a thing I showed earlier as an introduction to say this is how it works. What we've seen in reality is it looks a little bit more like this, where on the user interface, yes, you can use dplyr in many cases, but sometimes it's going to be helpful to use SQL. When you are using this approach, you are going to go through a translation layer where your dplyr code is going to be converted into SQL to be passed to the DuckDB engine and you'll need to collect to get the full dataset back into memory. To go to Rodrigo's point, I want to talk a little bit about duckplyr. I didn't include any code example because it's very easy to use. It's almost exactly like regular dplyr, but there are some differences and I'm going to try to explain why I prefer to show the plain DuckDB rather than the duckplyr way. There is nothing wrong with duckplyr. It's just a slightly different approach to what we've been doing. It's described as a drop-in replacement for dplyr. The main differentiator is that where you have to write some SQL in the way I've shown in this workshop, with DuckDB duckplyr, you automatically get the fallback on dplyr. That can be great in many cases and not so great in some others.
When you are going to use duckplyr, you are going to start with some data. I put it this way. Assuming that you are starting with something that already fits in R's memory, so it's not larger than memory, it still fits in the memory, you're going to convert it into what is called a DuckDB table. It's a special representation of tables using the DuckDB format. Then you'll be able to use your regular dplyr code to make it work. The main difference is that when you're going to write your dplyr code, sometimes you are going to use DuckDB as the engine, and other times you are going to use dplyr. But I don't know if it was clear from what we've done and what I've introduced at the beginning. The DuckDB engine is only going to work if your data is actually in DuckDB format, and dplyr verbs are only going to work if they are going to be in the R format.
And so that means that to be able to use the dplyr syntax and verbs that are not supported by DuckDB, duckplyr is going to automatically convert or going to bring into R's memory your dataset. And that can lead to bad surprises because if your dataset is too large to fit in memory, then you are going to get some issues.
So if we look at the pros of duckplyr, it's a perfect drop-in replacement for dplyr. You almost don't have to change any code to make it work. You don't need to use any SQL code to make it work, just dplyr verbs and base R syntax will work well. duckplyr will work really well with anything that fits already in memory and that you can store there. There is nothing really to set up, it's just installing the package and you're ready to go.
And it's lazy by default, meaning that until you try to get all the data back as much as possible, you are not going to get any computation done in duckplyr until you absolutely need it. And that's a way to really leverage the query optimizer inside DuckDB to really get the fastest results. And contrary to using DuckDB, where you have to go through this SQL translation of the dplyr verb, duckplyr was written in a way where it calls the internal methods of the DuckDB engine directly. And so it's much faster because you have no need to think about, not directly, but the code doesn't have to go through, okay, you want to do a filter, okay, so that means I need to add a where clause in my SQL statement. The filter that you use in duckplyr gets mapped directly to the filter function inside the internals of DuckDB and so you get very fast results. And it means also that DuckDB can think about all the query plan at one time and you'll get slightly better and faster results than with using just a DuckDB package.
Some of the inconveniences though is that not all dplyr verbs are translated and you are going to fall back on dplyr without realizing that you're not using DuckDB anymore. And so it means that you are going to lose the advantage that you thought you were going to have by using this approach. The features that are implemented in DuckDB with duckplyr are great and are very fast, but it's a pretty limited set at the moment still, and so you might still run into performance issues. And it's not really designed to work in a production environment for doing ETL or working with persistent databases. And so you may end up not getting as much performance as you were hoping for. And again, because of these fallback performances where you're going to go back to sometimes using the dplyr engine for some of this computation, you are going to get some performance issues in your results.
When to use DuckDB vs duckplyr
Okay, so when to use DuckDB? And so in this case, I'm not talking really about the general case, not just the DuckDB package, but DuckDB in general. When is it a good thing to say, okay, I'm dealing with this parquet files or this large CSV file, is it worth for me to think about DuckDB? I think these are the questions you need to ask yourself. The first is obviously, is the data going to fit in memory? Because if you are using data sets that are so large that they won't fit in memory, then most likely you are going to want to reach out to something like DuckDB where you don't have to worry about that. If you are really working with tidy data that have multiple tables and you want to combine them, DuckDB is a great tool.
And also the fact that you may want to think about what is the data format you are dealing with. And so if you are dealing with already parquet files or CSV or JSON, things that DuckDB can support natively, then you are probably in good shape to be a good candidate to use DuckDB. And whether you need to share data with other programs, other people, again, depending on the situation, DuckDB can be a very good solution.
So I wanted to kind of finish this also with a little bit of a side-by-side comparison of the difference of when to use the DuckDB package versus duckplyr. I think DuckDB is great when you are working with these persistent files, things that you want to store in your system and you want to go back to them regularly. If you have big data pipelines and you need to combine multiple tables, all of this is really great with DuckDB. But I think the most important is if you want something that is explicit and predictable, I think DuckDB is a better candidate. I would think that if you are a data cleaning pipeline in production, I would go with DuckDB rather than duckplyr. With duckplyr, I think the main advantage is if you have a large dataset, like hundreds of thousands of rows, but it still fits in memory and you want to take less time to do whatever you are doing, then it's most likely a very good
candidate in these situations because you would have to change one or two lines of code in your pipes and you'll be set to go to be able to use it.
Also, if you have a very big dataset, it doesn't quite fit in memory and you want to explore it, start with duckplyr because you can just use the duckplyr verbs and you'll be able to get a sense of the field for your data. And so that would be very helpful. And then once you get a sense of what you want to do and you want to maybe put it in production, then at this point, maybe consider translating some of the things that are available in duckplyr into SQL and use the duckdb package.
And so just in general, duckdb is safer and you may not always get the best of the best performance if you write the code using duckplyr, but it's still very fast and much faster than many alternatives. Before I switch to resources, I wanted to answer this question. Actually, I don't know foolproof methods to know if something is going to fit in memory besides trying and seeing that it fails, but I can look it up and see if I find something and I'll add it to the readme of the Git repo and so you'll be able to look at it.
And so just in general, duckdb is safer and you may not always get the best of the best performance if you write the code using duckplyr, but it's still very fast and much faster than many alternatives.
Resources and further learning
And in terms of resources, I wanted to point out some stuff that exists. I didn't talk too much about arrow today, but if you want to learn more about working with larger assets, I really encourage you to read the book called Scaling Up with R and Arrow. And again, even though the focus in this book is mostly arrow, a lot of the concepts that are described here apply directly to duckdb. So if you get comfortable working with arrow, you'll be able to be good at working with duckdb. For duckdb, I find that their documentation itself is really, really good. And the other things you can read, I skimmed it a few times, but I recommend it is the book called Duckdb in Action. And yeah, you can get it for free if you leave your email address at the motherduck website. Some community resources around awesome duckdb. And if you don't know a lot of SQL and you want to learn more, this website called dbQuacks is a great tutorial. It has fun visuals, and it has an easy tutorial in about 40 different puzzles.
And this one is the Christmas special edition that you may want to look at. Okay. And that's, yeah, that's a fun way to really get your SQL skills up.
Another thing I want to mention is that duckdb is open source. And it's obviously free to use. But there is a commercial offering that came up called MotherDuck that uses duckdb. And MotherDuck is cloud storage for duckdb databases. And so the data is made available in a pretty interface on the web. And you can have multiple people access it, have your apps access the data there. So you don't have to worry about the storage of your duckdb, of your duckdb data. If you are not doing anything enterprise, at the enterprise level, just for fun little personal projects, they have a free plan that is more than enough.
But one thing I wanted to show you that something that came up recently, called dives that I find very interesting, personally. And it's a way to build dashboards and to build dashboards and interactive visualization directly from data hosted on MotherDuck with very little infrastructure. And it's all based on JavaScript that you can embed in any website. So what they have here is, for instance, this animation of earthquake. So all of that is built using DuckDB and this particular product that they call Dives. The idea is that you use your LLM to connect to your data, and you use an MCP server that has access to this library and these skills, you'll be able to generate this visualization.
To me, it seems very interesting to be able to build a dashboard and exploration, and it works on very large datasets. It's not quite R-related, but it might be something you want to explore on your own. And I think that's all I wanted to share with you today.
We are a little bit early, so we still have plenty of time for questions. Feel free to put them in the chat. I'm going to put in chat also this little survey. I'm always curious to hear what people get out of this workshop, if it was helpful, things that I can do to improve. So, if you can take a few minutes to fill it out, that would be super helpful. My email address is here. I'm always happy to get mail. If you have any questions, you need any follow-up.
Q&A
Brent, I have a question. Yes. So, for DuckDB, as far as large datasets go, that's for medium-large, right? Larger than what you have in RAM, but maybe not full enterprise size. I guess, at what point do you start thinking about taking a snapshot or using DuckDB instead of a full dataset that may be terabytes in size, for example? Because I know DuckDB will support terabyte-scale data. Is there a time, though, when you start to think, well, maybe we just grab still a large piece, but a portion of this, for internal analyses or validation?
I don't know. I mean, DuckDB really scales to large datasets without too much problem. They have a pretty interesting marketing approach, at least at MotherDuck. They often show how they can analyze this massive dataset on an iPhone or Android. They put it in a cube of cold ice so it doesn't burn down, and then they still manage to analyze a huge dataset on an iPhone because DuckDB is so efficient. So, I think the issues, yeah, maybe the hardware, what is practical in your workflow and how you get the data, more than really a limitation of DuckDB. That would be my take on that.
To say regression using the DuckDB engine, if you want to use all... Yeah, so that's a great question. And for this, I don't know if... I think it's what it does, orbital or package. I think it's what it does. It's not my area of specialty in terms of regression, but I know that it's something that several of my colleagues at Posit are working on to be able to run the prediction and models directly inside databases. I know that, yeah, if they support SQL Lite, I would think that they also support maybe DuckDB. But I know that there is work being done for doing the computation directly inside the database. So you don't need to bring it inside our memory and you can still do this analysis on your full data set. But it's not something I'm a specialist of, so I don't know if I can provide a much better answer than this.
When you use DuckDB, do you compute each pipe if you have many or does that take up space? It doesn't take space necessarily in memory, it will take space inside your database. Typically it's not an issue, you don't have to worry about the space that you're storing. It's not going to take space inside your memory just on storage. I don't use compute very much, I thought I remember vaguely that I had to use it at some point, but I don't remember why it was needed.
Collect though does take up memory, so if you do a big pipe and you're working on it, if you can get the result you want and the field you want for your data just using the first few rows, don't do collect every time, or do collect on a smaller number of rows just to get a feel for your data. Because if you are doing a pipeline with many steps and every time you do collect, then you can use up your RAM. Sometimes it's needed because you need to reuse this object for another part of your pipeline, but if it's just for debugging it might take up space for nothing.
When I use Arrow versus DuckDB, these two projects really overlapped in many ways. Arrow is a way of representing data on disk, so it's a storage format just like Parquet or DuckDB, but it's not a very good one at that. It's also a way of representing data in transit. For instance, we were talking about Snowflake before, and you want to bring your Snowflake data to your local computer so you can analyze it. To make this data very fast, they use internally the Arrow representation because it's in columnar format and it has all the benefits that I introduced at the beginning. It makes the transfer from your data in Snowflake into the pipes of the internet up to the memory of your R session in exactly the same format, so you never have to change the format and you don't waste any time. It makes everything very fast.
When you're using the Arrow R package, for instance, you are going to use this internal representation of the objects in the memory. This internal representation is very similar to what DuckDB does, except with some modification. DuckDB doesn't have all these nice features that Arrow has of being able to be the same format in multiple spaces. But one great advantage, I think, of Arrow is if you need to access the object in memory from both R, Python, and maybe Rust or something, because the internal representation is the same regardless of the programming language, you never again have to change format.
And so, I think my long answer or shorter version of this answer to your question would be if you are going to have data that you want to analyze with multiple programming languages, or that you need to carry over the internet, then Arrow might be a better fit than DuckDB. But for most people on data analysis, data pipeline situation, DuckDB works very well.
And so, I think my long answer or shorter version of this answer to your question would be if you are going to have data that you want to analyze with multiple programming languages, or that you need to carry over the internet, then Arrow might be a better fit than DuckDB. But for most people on data analysis, data pipeline situation, DuckDB works very well.
All right. So, are there any other questions? If not, Francois has provided contact information. Thanks very much, Francois. You're welcome. I hope it was helpful. And feel free to fill out the feedback form and send me an email.