libr
Reverse Engineering REST APIs: The Easy Way (2/12)
mirai - Plumber Integration
REST API in R with plumber
API and R Nowadays, it’s pretty much expected that software comes with an HTTP API interface. Every programming language out there offers a way to expose APIs or make GET/POST/PUT requests, including R. In this post, I’ll show you how to create an API using the plumber package. Plus, I’ll give you tips on how to make it more production ready - I’ll tackle scalability, statelessness, caching, and load balancing. You’ll even see how to consume your API with other tools like python, curl, and the R own httr package.
Nowadays, it’s pretty much expected that software comes with an HTTP API interface. Every programming language out there offers a way to expose APIs or make GET/POST/PUT requests, including R. In this post, I’ll show you how to create an API using the plumber package. Plus, I’ll give you tips on how to make it more production ready - I’ll tackle scalability, statelessness, caching, and load balancing. You’ll even see how to consume your API with other tools like python, curl, and the R own httr package
# When an API is started it might take some time to initialize # this function stops the main execution and wait until # plumber API is ready to take queries. wait_for_api <- function(log_path, timeout = 60, check_every = 1) { times <- timeout / check_every for(i in seq_len(times)) { Sys.sleep(check_every) if(any(grepl(readLines(log_path), pattern = "Running plumber API"))) { return(invisible()) } } stop("Waiting timed!") }
Oh, in some examples I am using redis. So, before you dive in, make sure to fire up a simple redis server. At the end of the script, I’ll be turning redis off, so you don’t want to be using it for anything else at the same time. I just want to remind you that this code isn’t meant to be run on a production server.
redis is launched in a background, , so you might want to wait a little bit to make sure it’s fully up and running before moving on.
wait_for_redis <- function(timeout = 60, check_every = 1) { times <- timeout / check_every for(i in seq_len(times)) { Sys.sleep(check_every) status <- suppressWarnings(system2("redis-cli", "PING", stdout = TRUE, stderr = TRUE) == "PONG") if(status) { return(invisible()) } } stop("Redis waiting timed!") }
First off, let’s talk about logging. I try to log as much as possible, especially in critical areas like database accesses, and interactions with other systems. This way, if there’s an issue in the future (and trust me, there will be), I should be able to diagnose the problem just by looking at the logs alone. Logging is like “print debugging” (putting print(“I am here”), print(“I am here 2”) everywhere), but done ahead of time. I always try to think about what information might be needed to make a correct diagnosis, so logging variable values is a must. The logger and glue packages are your best friends in that area.
Next, it might also be useful to add a unique request identifier ((I am doing that in setuuid filter)) to be able to track it across the whole pipeline (since a single request might be passed across many functions). You might also want to add some other identifiers, such as MACHINE_ID - your API might be deployed on many machines, so it could be helpful for diagnosing if the problem is associated with a specific instance or if it’s a global issue.
In general you shouldn’t worry too much about the size of the logs. Even if you generate ~10KB per request, it will take 100000 requests to generate 1GB. And for the plumber API, 100000 requests generated in a short time is A LOT. In such scenario you should look into other languages. And if you have that many requests, you probably have a budget for storing those logs:)
It might also be a good idea to setup some automatic system to monitor those logs (e.g. Amazon CloudWatch if you are on AWS). In my example I would definitely monitor Error when reading key from cache string. That would give me an indication of any ongoing problems with API cache.
Speaking of cache, you might use it to save a lot of resources. Caching is a very broad topic with many pitfalls (what to cache, stale cache, etc) so I won’t spend too much time on it, but you might want to read at least a little bit about it. In my example, I am using redis key-value store, which allows me to save the result for a given request, and if there is another requests that asks for the same data, I can read it from redis much faster.
Note that you could use memoise package to achieve similar thing using R only. However, redis might be useful when you are using multiple workers. Then, one cached request becomes available for all other R processes. But if you need to deploy just one process, memoise is fine, and it does not introduce another dependency - which is always a plus.
info <- function(req, ...) { do.call( log_info, c( list("MachineId: {MACHINE_ID}, ReqId: {req$request_id}"), list(...), .sep = ", " ), envir = parent.frame(1) ) }
#* Log some information about the incoming request #* https://www.rplumber.io/articles/routing-and-input.html - this is a must read! #* @filter setuuid function(req) { req$request_id <- UUIDgenerate(n = 1) plumber::forward() }
#* Log some information about the incoming request #* @filter logger function(req) { if(!grepl(req$PATH_INFO, pattern = "PATH_INFO")) { info( req, "REQUEST_METHOD: {req$REQUEST_METHOD}", "PATH_INFO: {req$PATH_INFO}", "HTTP_USER_AGENT: {req$HTTP_USER_AGENT}", "REMOTE_ADDR: {req$REMOTE_ADDR}" ) } plumber::forward() }
To run the API in background, one additional file is needed. Here I am creating it using a simple bash script.
library(plumber) library(optparse) library(uuid) library(logger) MACHINE_ID <- "MAIN_1" PORT_NUMBER <- 8761 log_level(logger::TRACE) pr("tmp/api_v1.R") %>% pr_run(port = PORT_NUMBER)
How I would do auth
A quick blog on how I would implement auth for my applications.
First, if the application is for devs and I need something very quick, I would just use GitHub OAuth. Done in 10 minutes.
Now to the main part - how would I implement password-based auth? The minimum for me would be password with 2FA using authenticator apps. Passkeys aren’t widespread enough and I just find magic-links annoying.
Always implement rate-limiting, even if it’s something very basic!
Session management
Database sessions 100%. I really, really don’t like JWTs and they shouldn’t be used as sessions majority of times.
Assuming I only have to deal with authenticated sessions, my preferred approach is 30 days expiration but the expiration gets extended every time the session is used. This ensures active users stay authenticated while inactive users are signed out.
Registration
Hot take - I think it’s fine for apps to share whether an email exists in their system or not. If the email is already taken, just tell the user that they already have an account. Significantly better UX for minimal security loss. Don’t use emails for auth if you don’t like that.
Anyway, something more important than preventing user enumeration is checking passwords against previous leaks. The haveibeenpwned.com API is probably the best option for this. This will reduce the effectiveness of credential stuffing attacks, where an attacker targets accounts using leaked passwords from other websites.
Passwords are hashed with either Argon2id or Scrypt - they’re both good enough. Bcrypt is ok but it unfortunately has a 50-70 character limit.
Rate limiting will be set to around 1 attempt per second per IP address. Captchas if I start to get spams.
Email verification
I would also check if the email starts or ends with a space just to make sure the user didn’t mistype it.
I personally prefer OTPs for email verification over links, but both work fine. For OTPs, a basic throttling like 5-10 attempts per hour per account should be good enough. The code will be valid for 10, maybe 15 minutes. For verification links, I’d set the expiration to 2 hours.
Here’s some ways I would generate those OTPs:
bytes := make([]byte, 5)
rand.Read(bytes)
// 8 characters, 40 bits of entropy
// I might use a custom character set to remove 1, I, 0, and O.
otp := base32.StdEncoding.EncodeToString(bytes)
// 8 characters, entropy equivalent to ~26 bits
// This introduces a tiny bias.
// See RFC 4226 for why this is fine.
bytes := make([]byte, 4)
rand.Read(bytes)
num := int(binary.BigEndian.Uint32(bytes) % 100000000)
otp := fmt.Sprintf("%08d", num)
First of all, I wouldn’t bother with those 100 character long regex. Here’s the only email regex you’ll ever need:
^.+@.+\..+$