Found 136 bookmarks
Three experiments in LLM code assist with RStudio and Positron - Tidyverse
We've been experimenting with LLM-powered tools to streamline R data science and package development.
Twice a year, the tidyverse team sets a week aside for “spring cleaning,” bringing all of our R packages up to snuff with the most current tooling and standardizing various bits of our development process. Some of these updates can happen by calling a single function, while others are much more involved. One of the more involved updates is updating erroring code: transitioning away from base R (e.g. stop()), rlang (e.g. rlang::abort()), glue, and homegrown combinations of them to cli. cli’s new syntax is easier to work with as a developer and more visually pleasing as a user.
In some cases, transitioning is almost as simple as Finding + Replacing rlang::abort() to cli::cli_abort():
# before:
rlang::abort("`save_pred` can only be used if the initial results saved predictions.")

# after:
cli::cli_abort("{.arg save_pred} can only be used if the initial results saved predictions.")
In others, there’s a mess of ad-hoc pluralization, paste0()s, glue interpolations, and other assorted nonsense to sort through:
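For illustration (a hypothetical example, not one from the post), a messy message built from paste0() and manual pluralization can often collapse into a single cli call:

# before (hypothetical): manual pluralization, interpolation, and paste0()
bad_cols <- c("height", "width")
n <- length(bad_cols)
stop(paste0(
  "Column", if (n > 1) "s" else "", " ",
  paste(bad_cols, collapse = ", "),
  if (n > 1) " are" else " is",
  " not numeric."
))

# after: cli interpolates, collapses the vector, and handles pluralization
cli::cli_abort("{length(bad_cols)} column{?s} {?is/are} not numeric: {.val {bad_cols}}.")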
Thus was born clipal, a (now-superseded) R package that allows users to select erroring code, press a keyboard shortcut, wait a moment, and watch the updated code be inlined into the selection.
clipal was a huge boost for us in the most recent spring cleaning. Depending on the code being updated, these erroring calls used to take 30 seconds to a few minutes. With clipal, though, the model could usually get the updated code 80% or 90% of the way there in a couple seconds. Up to this point, irritated by autocomplete and frustrated by the friction of copying and pasting code and typing out the same bits of context into chats again and again, I had been relatively skeptical that LLMs could make me more productive. After using clipal for a week, though, I began to understand how seamlessly LLMs could automate the cumbersome and uninteresting parts of my work.
clipal itself is now superseded by pal, a more general solution to the problem that clipal solved. I’ve also written two additional packages, ensure and gander, that solve two other classes of pal-like problems using similar tools. In this post, I’ll write a bit about how I’ve used a pair of tools in three experiments that have made me much more productive as an R developer.
After using clipal during our spring cleaning, I approached another spring cleaning task for the week: updating testing code. testthat 3.0.0 was released in 2020, bringing with it numerous changes that were both huge quality-of-life improvements for package developers and also highly breaking changes. While some of the work of converting legacy unit testing code to testthat 3e is relatively straightforward, other components can be quite tedious. Could I do the same thing for updating to testthat 3e that I did for transitioning to cli? I sloppily threw together a sister package to clipal that would convert tests for errors to snapshot tests, disentangle nested expectations, and transition from deprecated functions like expect_known_*(). (If you’re interested, the current prompt for that functionality is here.) That sister package was also a huge boost for me, but it reused as-is almost every piece of code from clipal other than the prompt. Thus, I realized that the proper solution would provide all of this scaffolding to attach a prompt to a keyboard shortcut, but allow for an arbitrary set of prompts to help automate these wonky, cumbersome tasks.
The next week, pal was born. The pal package ships with three prompts centered on package development: the cli pal and testthat pal mentioned previously, as well as the roxygen pal, which drafts minimal roxygen documentation based on a function definition. Here’s what pal’s interface looks like now:
ensure
While deciding on the initial set of prompts that pal would include, I really wanted some sort of “write unit tests for this function” pal. Really addressing this problem, though, requires violating two of pal’s core assumptions:
All of the context that you need is in the selection and the prompt. In the case of writing unit tests, it’s actually pretty important to have other pieces of context. If a package provides some object type potato, in order to write tests for some function that takes potato as input, it’s likely very important to know how potatoes are created and the kinds of properties they have. pal’s sister package for writing unit tests, ensure, can thus “see” the rest of the file that you’re working on, as well as context from neighboring files like other .R source files, the corresponding test file, and package vignettes, to learn about how to interface with the function arguments being tested.
The LLM’s response can prefix, replace, or suffix the active selection in the same file. In the case of writing unit tests for R, the place that tests actually ought to go is a corresponding test file in tests/testthat/. Via the RStudio API, ensure can open up the corresponding test file and write to it rather than to the source file where it was triggered from.
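A rough sketch of how that might look with the rstudioapi package (the test path, the drafted test, and the use of documentOpen()/insertText() here are illustrative assumptions, not ensure's actual implementation):

# open the corresponding test file and append a drafted test at the end
test_path <- "tests/testthat/test-potato.R"   # hypothetical corresponding test file
doc_id <- rstudioapi::documentOpen(test_path)

rstudioapi::insertText(
  location = Inf,   # Inf indicates the end of the document
  text = paste0(
    "\ntest_that(\"potato() returns a potato\", {\n",
    "  expect_s3_class(potato(), \"potato\")\n",
    "})\n"
  ),
  id = doc_id
)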
gander
·tidyverse.org·
Format SQL Queries
A convenient interface for formatting SQL queries directly within R. It acts as a wrapper around the sql_format Rust crate. The package allows you to format SQL code with customizable options, including indentation, case formatting, and more, ensuring your SQL queries are clean, readable, and consistent.
·dataupsurge.github.io·
Documenting functions
The basics of roxygen2 tags and how to use them for documenting functions.
Examples

@examples provides executable R code showing how to use the function in practice. This is a very important part of the documentation because many people look at the examples before reading anything else. Example code must work without errors, as it is run automatically as part of R CMD check. For the purpose of illustration, it’s often useful to include code that causes an error. You can do this by wrapping the code in try() or using \dontrun{} to exclude it from the executed example code. For finer control, you can use @examplesIf:

#' @examplesIf interactive()
#' browseURL("https://roxygen2.r-lib.org")
Instead of including examples directly in the documentation, you can put them in separate files and use @example path/relative/to/package/root to insert them into the documentation. All functions must have examples for initial CRAN submission.
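A minimal sketch of what this looks like in practice (the add() function and its tags are illustrative, not from the article):

#' Add two numbers
#'
#' @param x,y Numeric vectors.
#' @returns A numeric vector.
#' @examples
#' add(1, 2)
#'
#' # an error shown for illustration, wrapped in try() so R CMD check still passes
#' try(add(1, "a"))
#' @examplesIf interactive()
#' browseURL("https://roxygen2.r-lib.org")
add <- function(x, y) {
  x + y
}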
·roxygen2.r-lib.org·
Making Tables Shiny: DT, formattable, and reactable
Demo of popular packages for generating interactive tables suitable for Shiny apps
formattable

Another nice table-making package is formattable. The cute heatmap-style colour formatting and the easy-to-use formatter functions make formattable very appealing.

color_tile() fills the cells with a colour gradient corresponding to the values
color_bar() adds a colour bar to each cell, where the length is proportional to the value

The true_false_formatter() defined below demonstrates how to define your own formatting function, in this case formatting TRUE, FALSE and NA as green, red and black.
If you want the features of both DT and formattable, you can combine them by converting the formattable() output to as.datatable(), and much of the formattable features will be preserved.
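A minimal sketch of both ideas (the data and this particular true_false_formatter() definition are illustrative; the post's own code isn't included in this excerpt):

library(formattable)
library(DT)

scores <- data.frame(
  name   = c("a", "b", "c"),
  value  = c(0.2, 0.5, 0.9),
  passed = c(TRUE, FALSE, NA)
)

# custom formatter: TRUE in green, FALSE in red, NA in black
true_false_formatter <- formatter(
  "span",
  style = x ~ style(color = ifelse(is.na(x), "black", ifelse(x, "green", "red")))
)

ft <- formattable(scores, list(
  value  = color_tile("white", "orange"),   # heatmap-style fill
  passed = true_false_formatter
))

# keep most of the formattable styling, gain DT's interactivity
as.datatable(ft)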
However, one problem I had was that when using DT::datatable, missing values (NA) are left blank in the display (which I prefer), but in the version converted from formattable(), NAs are printed. Also, color_bar columns seem to be converted to character, which means they can no longer be sorted numerically.
reactable

Next I tried reactable, a package based on the React Table library.

Columns are customised via the columns argument, which takes a named list of column definitions defined using colDef(). These include format definitions created using colFormat().
In the end, I used DT::datatable() in my Shiny app, because I found it the easiest, fastest, and most comprehensive. I’ve been able to achieve most of the features I wanted using just DT.
Heatmap-like fill effect:
Apply the formatStyle() function to the output of datatable() to set the backgroundColor for selected columns:
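A minimal sketch of that pattern, assuming a hypothetical numeric column called score:

library(DT)

df <- data.frame(label = c("a", "b", "c"), score = c(1, 5, 9))

# colour breakpoints and a matching palette (one more colour than breaks)
brks <- quantile(df$score, probs = seq(0.05, 0.95, 0.05), na.rm = TRUE)
clrs <- colorRampPalette(c("white", "#6baed6"))(length(brks) + 1)

datatable(df) |>
  formatStyle("score", backgroundColor = styleInterval(brks, clrs))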
Abbreviate long cells
Sometimes some cells have a large amount of text that would mess up the table layout if I showed it all. In these cases, I like to abbreviate long values and show the full text in a tooltip. To do this, you can use JavaScript to format the column to show a substring followed by “…”, with the full string in a tooltip (<span title="Long string">Substring...</span>), when values are longer than N characters (in this case 10 characters). You can do this using columnDefs and pass JavaScript with the JS() function:
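A minimal sketch of that columnDefs + JS() approach (the data frame is hypothetical; the render function is the standard DataTables ellipsis idiom):

library(DT)

df <- data.frame(
  text = c("short", "a much longer description that would break the table layout")
)

datatable(
  df,
  rownames = FALSE,
  escape = FALSE,
  options = list(
    columnDefs = list(list(
      targets = 0,   # first column (0-based because rownames = FALSE)
      render = JS(
        "function(data, type, row, meta) {",
        "  return type === 'display' && data.length > 10 ?",
        "    '<span title=\"' + data + '\">' + data.substr(0, 10) + '...</span>' : data;",
        "}"
      )
    ))
  )
)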
Really plain table

Sometimes I don’t need any of the faff. Here’s how to get rid of it all:
headerCallbackRemoveHeaderFooter <- c(
  "function(thead, data, start, end, display){",
  "  $('th', thead).css('display', 'none');",
  "}"
)

datatable(
  my_pic_villagers,
  options = list(
    dom = "t",
    ordering = FALSE,
    paging = FALSE,
    searching = FALSE,
    headerCallback = JS(headerCallbackRemoveHeaderFooter)
  ),
  selection = 'none',
  callback = JS(
    "$('table.dataTable.no-footer').css('border-bottom', 'none');"
  ),
  class = 'row-border',
  escape = FALSE,
  rownames = FALSE,
  filter = "none",
  width = 500
)
·clarewest.github.io·
UNCHARTED DATA: Using Crosstalk to Add User-Interactivity
Linking an interactive plot and table together with the crosstalk package.
Using Crosstalk to Add User-Interactivity
The goal is to link the reactable table I created to a plotly chart and provide additional filter options that control both the table and the chart.
An important note: in order to use crosstalk, you must create a shared dataset and call that dataset within both plotly and reactable. Otherwise, the datasets will not communicate and filter with each other. The code to do this is SharedData$new(dataset).
If you expand the code below, you’ll see that the code to build a table in reactable is quite extensive. I will not go into the details in this post, but I do recommend a couple of great tutorials that I used to create the interactive table, such as this tutorial from Greg Lin and this one from Tom Mock, which really helped me understand how to use CSS and Google fonts to enhance the visual appeal of the table (see the “Additional CSS Used for Table” section below for more info).
If you have ever built something in Shiny before, you’ll notice that the crosstalk filters are very similar. You can add a filter to any existing column in the dataset. As you can see in the code below, I used a mixture of filter_checkbox and filter_select depending on how many unique options were available in the column being filtered. My rule of thumb is that if there are more than five options to choose from, it’s probably better to put them into a list with filter_select, like I did with the Division filtering, so as not to take up too much space on the page.
For the layout of the data visualization, I used bscols to place the crosstalk filters side-by-side with the interactive plotly chart. I then placed the reactable table underneath and added a legend to the table using tags from the htmltools package. The final result is shown below. Feel free to click around the filters and you will notice that both the plot and the table filter accordingly. Another option is to click and drag on the plot and you will see the table underneath mimic the teams shown.
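A pared-down sketch of how those pieces fit together (the standings data and column names are hypothetical):

library(crosstalk)
library(plotly)
library(reactable)

standings <- data.frame(
  Team     = c("A", "B", "C", "D"),
  Division = c("East", "West", "East", "West"),
  Wins     = c(10, 7, 12, 9)
)

# the shared dataset that links the filters, the plot, and the table
shared <- SharedData$new(standings)

bscols(
  widths = c(3, 9),
  list(
    filter_checkbox("division", "Division", shared, ~Division),
    filter_select("team", "Team", shared, ~Team)
  ),
  plot_ly(shared, x = ~Team, y = ~Wins, type = "bar")
)

reactable(shared)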
·uncharteddata.netlify.app·
Design Patterns in R
Build robust and maintainable software with object-oriented design patterns in R. Design patterns capture the experience of many software designers and architects over many years of solving similar problems, and present it in neat, well-defined components and interfaces. These are solutions that have withstood the test of time with respect to re-usability, flexibility, and maintainability. R6P provides abstract base classes with examples for a few known design patterns. The patterns were selected by their applicability to analytic projects in R. Using these patterns in R projects has proven effective in dealing with the complexity that data-driven applications possess.
·tidylab.github.io·
Fast JSON, NDJSON and GeoJSON Parser and Generator
A fast JSON parser, generator and validator which converts JSON, NDJSON (Newline Delimited JSON) and GeoJSON (Geographic JSON) data to/from R objects. The standard R data types are supported (e.g. logical, numeric, integer) with configurable handling of NULL and NA values. Data frames, atomic vectors and lists are all supported as data containers translated to/from JSON. GeoJSON data is read in as simple features objects. This implementation wraps the yyjson C library.
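A minimal sketch, assuming this is the yyjsonr package and that the read_json_str()/write_json_str() function names are current (both are assumptions on my part):

library(yyjsonr)

json <- write_json_str(head(iris, 2))   # R object -> JSON string
obj  <- read_json_str(json)             # JSON string -> R object (a data frame here)
str(obj)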
·coolbutuseless.github.io·
Prompt and empower your LLM, the tidy way
The tidyprompt package allows users to prompt and empower their large language models (LLMs) in a tidy way. It provides a framework to construct LLM prompts using tidyverse-inspired piping syntax, with a library of pre-built prompt wrappers and the option to build custom ones. Additionally, it supports structured LLM output extraction and validation, with automatic feedback and retries if necessary. Moreover, it enables specific LLM reasoning modes, autonomous R function calling for LLMs, and compatibility with any LLM provider.
·tjarkvandemerwe.github.io·
REST API in R with plumber
API and R

Nowadays, it’s pretty much expected that software comes with an HTTP API interface. Every programming language out there offers a way to expose APIs or make GET/POST/PUT requests, including R. In this post, I’ll show you how to create an API using the plumber package. Plus, I’ll give you tips on how to make it more production ready - I’ll tackle scalability, statelessness, caching, and load balancing. You’ll even see how to consume your API with other tools like python, curl, and R’s own httr package.
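For orientation, a minimal plumber endpoint looks roughly like this (a generic sketch, not code from the post): an annotated function in a file, served with pr() and pr_run().

# plumber.R
#* Echo back the input
#* @param msg The message to echo
#* @get /echo
function(msg = "") {
  list(msg = paste0("The message is: '", msg, "'"))
}

# in a separate script or R session:
# plumber::pr("plumber.R") %>% plumber::pr_run(port = 8761)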
# When an API is started it might take some time to initialize.
# This function stops the main execution and waits until the
# plumber API is ready to take queries.
wait_for_api <- function(log_path, timeout = 60, check_every = 1) {
  times <- timeout / check_every
  for (i in seq_len(times)) {
    Sys.sleep(check_every)
    if (any(grepl(readLines(log_path), pattern = "Running plumber API"))) {
      return(invisible())
    }
  }
  stop("Waiting for the plumber API timed out!")
}
Oh, in some examples I am using redis. So, before you dive in, make sure to fire up a simple redis server. At the end of the script, I’ll be turning redis off, so you don’t want to be using it for anything else at the same time. I just want to remind you that this code isn’t meant to be run on a production server.
Redis is launched in the background, so you might want to wait a little bit to make sure it’s fully up and running before moving on.
wait_for_redis <- function(timeout = 60, check_every = 1) {
  times <- timeout / check_every
  for (i in seq_len(times)) {
    Sys.sleep(check_every)
    status <- suppressWarnings(
      system2("redis-cli", "PING", stdout = TRUE, stderr = TRUE) == "PONG"
    )
    if (status) {
      return(invisible())
    }
  }
  stop("Waiting for redis timed out!")
}
First off, let’s talk about logging. I try to log as much as possible, especially in critical areas like database accesses, and interactions with other systems. This way, if there’s an issue in the future (and trust me, there will be), I should be able to diagnose the problem just by looking at the logs alone. Logging is like “print debugging” (putting print(“I am here”), print(“I am here 2”) everywhere), but done ahead of time. I always try to think about what information might be needed to make a correct diagnosis, so logging variable values is a must. The logger and glue packages are your best friends in that area.
Next, it might also be useful to add a unique request identifier (I am doing that in the setuuid filter) to be able to track it across the whole pipeline (since a single request might be passed across many functions). You might also want to add some other identifiers, such as MACHINE_ID - your API might be deployed on many machines, so it could be helpful for diagnosing whether the problem is associated with a specific instance or if it’s a global issue.
In general you shouldn’t worry too much about the size of the logs. Even if you generate ~10KB per request, it will take 100,000 requests to generate 1GB. And for a plumber API, 100,000 requests generated in a short time is A LOT - in such a scenario you should look into other languages. And if you have that many requests, you probably have a budget for storing those logs :)
It might also be a good idea to set up some automatic system to monitor those logs (e.g. Amazon CloudWatch if you are on AWS). In my example I would definitely monitor the “Error when reading key from cache” string. That would give me an indication of any ongoing problems with the API cache.
Speaking of cache, you might use it to save a lot of resources. Caching is a very broad topic with many pitfalls (what to cache, stale cache, etc.), so I won’t spend too much time on it, but you might want to read at least a little bit about it. In my example, I am using the redis key-value store, which allows me to save the result for a given request, and if there is another request that asks for the same data, I can read it from redis much faster.
Note that you could use the memoise package to achieve a similar thing using R only. However, redis might be useful when you are using multiple workers. Then, one cached request becomes available for all other R processes. But if you need to deploy just one process, memoise is fine, and it does not introduce another dependency - which is always a plus.
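A rough sketch of both options (the redux client, the key scheme, and the serialization choice here are my assumptions, not necessarily what the post uses):

library(memoise)
library(redux)   # assumed redis client; the post may use a different one

# single-process caching: memoise keeps results inside the R session
slow_fn <- function(x) { Sys.sleep(2); x^2 }
fast_fn <- memoise(slow_fn)
fast_fn(4)   # computed once...
fast_fn(4)   # ...then served from the in-memory cache

# multi-worker caching: store serialized results in redis so every process can reuse them
r <- redux::hiredis()
key <- "myapi:result:4"   # hypothetical key scheme
if (is.null(r$GET(key))) {
  r$SET(key, rawToChar(serialize(slow_fn(4), NULL, ascii = TRUE)))
}
result <- unserialize(charToRaw(r$GET(key)))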
info <- function(req, ...) {
  do.call(
    log_info,
    c(
      list("MachineId: {MACHINE_ID}, ReqId: {req$request_id}"),
      list(...),
      .sep = ", "
    ),
    envir = parent.frame(1)
  )
}
#* Attach a unique identifier to the incoming request
#* https://www.rplumber.io/articles/routing-and-input.html - this is a must read!
#* @filter setuuid
function(req) {
  req$request_id <- UUIDgenerate(n = 1)
  plumber::forward()
}
#* Log some information about the incoming request
#* @filter logger
function(req) {
  if (!grepl(req$PATH_INFO, pattern = "PATH_INFO")) {
    info(
      req,
      "REQUEST_METHOD: {req$REQUEST_METHOD}",
      "PATH_INFO: {req$PATH_INFO}",
      "HTTP_USER_AGENT: {req$HTTP_USER_AGENT}",
      "REMOTE_ADDR: {req$REMOTE_ADDR}"
    )
  }
  plumber::forward()
}
To run the API in the background, one additional file is needed. Here I am creating it using a simple bash script.
library(plumber)
library(optparse)
library(uuid)
library(logger)

MACHINE_ID <- "MAIN_1"
PORT_NUMBER <- 8761

log_level(logger::TRACE)

pr("tmp/api_v1.R") %>%
  pr_run(port = PORT_NUMBER)
·zstat.pl·