Evaluating R’s {targets} Versus dbt for Data Engineering Solutions
May 04, 2026
The Importance of Choosing the Right Tools for Data Engineering
As someone who has been immersed in data engineering lately, I've encountered plenty of challenges that seem tailor-made for R, a language I've worked with for years. Rather than forcing every problem into my familiar toolkit, though, I've decided it's time to explore the field's own options more deeply. Modern data workflows demand that we not only understand our preferred tools but also appreciate the specialised instruments built for specific tasks. That's why I'm diving into these alternatives and comparing their workflows with the naïve R approaches I've often relied on.

It's also worth mentioning that I'm currently looking for work. If there's a role aligned with my background, don't hesitate to reach out at contact at jcarroll.com.au; collaborating with like-minded professionals in this space is always an exciting opportunity.

Hands-on experience is vital for genuine learning. I could easily let something like Claude Code draft entire workflows for me (and I might, in future projects), but I prefer to roll up my sleeves and make mistakes; that's how I find out where my assumptions meet real-world operational constraints. Claude did offer guidance on this project, which meant my SQL arrived reasonably polished and organised and I could focus on learning the intricacies of what I was building.

To put this into practice, I decided to tackle a personal financial ingestion project. I previously used QuickBooks to manage my banking and freelance expenses, but since I couldn't find an API for my bank, I'm building a self-made "slowbooks" workflow instead.
Setting Up the Stage: Building on Makefiles

Both tools examined here build on the concept of a Makefile: a structure that connects commands through their dependencies and executes only the steps that are actually needed. If the inputs to a step haven't changed, that step doesn't need to be re-run. In my experience, Makefiles (or their modern cousin, Just) make a decent foundation on their own, while the two approaches that follow add structure on top of that idea in ways I find particularly beneficial.

Since this article is longer than my usual posts, I'll include quick links to help navigate the sections; you can jump straight to my notes on dbt and {targets}, or to the comparison of workflows. This exercise isn't just about learning from existing resources; it's about genuinely understanding these tools and processes so I can become more proficient at data engineering. There's plenty of uncertain ground ahead and plenty of room to grow, and I'm all in on that journey.
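To make that re-running behaviour concrete, here is a minimal sketch of the same principle expressed with {targets}. The file path and the cleaning step are placeholders invented for illustration, not part of the actual project.

```r
# _targets.R -- minimal sketch of Makefile-style dependency tracking in {targets}.
# "data/transactions.csv" and the cleaning step are illustrative placeholders.
library(targets)

tar_option_set(packages = c("readr", "dplyr"))

list(
  # a file target: invalidated only when the file's contents change
  tar_target(raw_file, "data/transactions.csv", format = "file"),
  tar_target(raw_transactions, readr::read_csv(raw_file)),
  # depends on raw_transactions, so it re-runs only when that upstream target changes
  tar_target(clean_transactions, dplyr::filter(raw_transactions, !is.na(amount_aud)))
)
```

Running `tar_make()` builds whatever is outdated and skips the rest, and `tar_outdated()` reports what would be rebuilt, much like a dry run of `make`.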
Transaction Categorization and Monthly Aggregation

Categorising transactions is a critical step in the workflow. Here the transaction data is joined to merchant information, with each transaction matched to a merchant category based on predefined keywords. The SQL in `models/intermediate/int_transactions_categorised.sql` performs this matching: it pulls transactions from the staging layer, left joins them to the merchant seed, and uses `row_number()` so that when multiple keywords match a transaction, only the longest (and therefore most specific) keyword is kept:

```sql
with transactions as (
    select * from {{ ref('stg_transactions') }}
),

merchants as (
    select * from {{ ref('seed_merchants') }}
),

matched as (
    select
        t.*,
        m.keyword,
        m.merchant_name,
        m.merchant_category
    from transactions t
    left join merchants m
        on t.description ilike '%' || m.keyword || '%'
    qualify row_number() over (
        partition by t.transaction_id
        order by length(m.keyword) desc
    ) = 1
)

select
    transaction_id,
    date,
    description,
    amount_aud,
    coalesce(merchant_name, 'Unknown') as merchant_name,
    coalesce(merchant_category, 'Uncategorised') as merchant_category
from matched
```

Because this is a left join, transactions with no matching keyword aren't dropped; they simply come through as 'Unknown'/'Uncategorised'. The categorised data is then rolled up by month, which is where most of the analysis starts. `date_trunc()` gives a clean grouping by calendar month so spending can be summed across transactions:

```sql
select
    date_trunc('month', date)::date as month,
    sum(amount_aud) as total_spend_aud,
    count(*) as transaction_count
from {{ ref('int_transactions_categorised') }}
group by 1
```

Aggregating by month makes it much easier to see spending trends over time. Less obvious, but just as useful, is that this categorisation becomes the foundation for deeper analyses such as spending anomaly detection.
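The dbt project includes a `mart_category_outliers.sql` model for exactly that purpose (it appears in the file tree later on). As a rough R illustration of the idea, and not the post's actual logic, one might flag months where a category's spend sits well outside its own history; the column names follow the earlier models, and the threshold is arbitrary:

```r
# Illustrative only: flag months where a category's spend is far from its own
# historical average. The two-standard-deviation threshold is an arbitrary choice.
library(dplyr)

flag_category_outliers <- function(monthly_category_spend) {
  monthly_category_spend |>
    group_by(merchant_category) |>
    mutate(
      mean_spend = mean(total_spend_aud),
      sd_spend   = sd(total_spend_aud),
      is_outlier = coalesce(
        abs(total_spend_aud - mean_spend) > 2 * sd_spend,
        FALSE  # categories with a single month have no spread to compare against
      )
    ) |>
    ungroup()
}
```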
Using R for Fuzzy Matching

R users have another option for transaction categorisation in the `{fuzzyjoin}` package. It supports the same kind of pattern matching as the SQL above, but with R's syntax, which can make for cleaner and more readable code. The functions below assume `{dplyr}`, `{stringr}`, and `{lubridate}` are attached. Here's what the matching function might look like:

```r
categorise_transactions <- function(transactions, merchants) {
  matched <- fuzzyjoin::fuzzy_left_join(
    transactions,
    merchants,
    by = c("description" = "keyword"),
    match_fun = \(x, y) {
      str_detect(str_to_lower(x), str_to_lower(y), negate = FALSE)
    }
  ) |>
    group_by(transaction_id) |>
    arrange(desc(str_length(keyword)), .by_group = TRUE) |>
    slice(1) |>
    ungroup() |>
    mutate(
      merchant_name = coalesce(merchant_name, "Unknown"),
      merchant_category = coalesce(merchant_category, "Uncategorised")
    ) |>
    select(
      transaction_id,
      date,
      description,
      amount_aud,
      merchant_name,
      merchant_category
    )

  matched
}
```

As with the SQL version, when more than one keyword matches a description the longest keyword wins, and unmatched transactions fall back to "Unknown"/"Uncategorised". The monthly aggregation is then a straightforward `{dplyr}` summary of the categorised transactions:

```r
monthly_balances <- function(transactions_categorised) {
  transactions_categorised |>
    mutate(month = floor_date(date, "month")) |>
    group_by(month) |>
    summarise(
      total_spend_aud = sum(amount_aud),
      transaction_count = n(),
      .groups = "drop"
    )
}
```

Together, the matching and aggregation give a view of spending patterns comparable to the SQL models, though the two approaches may differ in the edge cases they catch.
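To make the expected input shapes concrete, here is a small worked example of calling those two functions; the sample transactions and merchant keywords are invented purely for illustration.

```r
library(dplyr)
library(stringr)
library(lubridate)

# Invented sample data, matching the column names the functions expect
transactions <- tibble::tibble(
  transaction_id = 1:3,
  date           = as.Date(c("2026-01-03", "2026-01-15", "2026-02-02")),
  description    = c("WOOLWORTHS 1234 SYDNEY", "NETFLIX.COM", "ATM CASH WITHDRAWAL"),
  amount_aud     = c(84.50, 16.99, 200.00)
)

merchants <- tibble::tibble(
  keyword           = c("woolworths", "netflix"),
  merchant_name     = c("Woolworths", "Netflix"),
  merchant_category = c("Groceries", "Entertainment")
)

transactions |>
  categorise_transactions(merchants) |>
  monthly_balances()
# one row per calendar month, with total_spend_aud and transaction_count;
# the ATM row has no keyword match and comes through as "Unknown"/"Uncategorised"
```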
The Pipeline Workflow

Bringing these pieces together into a coherent workflow is where the tooling earns its keep. In the dbt project, a clearly defined file structure traces the data's journey from raw transactions through staging to categorised outputs. Here's the layout:

```
models
├── intermediate
│   ├── int_monthly_balances.sql
│   └── int_transactions_categorised.sql
├── marts
│   ├── mart_category_outliers.sql
│   ├── mart_category_trends.sql
│   ├── mart_transactions.sql
│   └── schema.yml
└── staging
    ├── sources.yml
    ├── stg_bank.sql
    ├── stg_cc.sql
    └── stg_transactions.sql
```

Each SQL model corresponds directly to a table created in DuckDB. Running `uv run dbt build` executes the models in dependency order, along with their seeds and tests, so each component fits together to produce a coherent result. It's an elegant setup that stands up to the complexity of the task while remaining adaptable for future analyses.
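For comparison, here is a rough sketch of how the same flow might be declared with {targets}, reusing the two R functions from earlier. The `R/` helper scripts, the CSV paths, and the assumption that both exports share a schema are mine, not details from the post.

```r
# _targets.R -- sketch mirroring the dbt layout (staging -> intermediate -> marts).
# File paths and the shared-schema assumption for the two exports are illustrative.
library(targets)

tar_option_set(packages = c("dplyr", "stringr", "lubridate", "fuzzyjoin", "readr"))
tar_source()  # load categorise_transactions(), monthly_balances(), ... from R/

list(
  # "staging": raw exports tracked as file targets
  tar_target(bank_file, "data/bank.csv", format = "file"),
  tar_target(cc_file, "data/cc.csv", format = "file"),
  tar_target(transactions,
             bind_rows(read_csv(bank_file), read_csv(cc_file))),
  # "seed": the merchant keyword lookup
  tar_target(merchants_file, "data/merchants.csv", format = "file"),
  tar_target(merchants, read_csv(merchants_file)),
  # "intermediate" and "mart" layers reuse the functions defined earlier
  tar_target(transactions_categorised,
             categorise_transactions(transactions, merchants)),
  tar_target(monthly, monthly_balances(transactions_categorised))
)
```

Here `tar_make()` plays the role of `dbt build`, rebuilding only the targets whose inputs have changed.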
Looking Ahead: Balancing Automation with Vigilance

Closing out this exploration, it's clear that each tool has strengths and weaknesses, especially for workflows where data integrity and operational efficiency both matter. What stands out from the recent run of models and tests is the blend of automation and error monitoring built into both dbt and {targets}. The run was mostly successful, with 12 of 13 operations passing, but that single warning deserves attention: 34 transactions went uncategorised, which could point to a lurking data-cleanliness issue. That's not a minor footnote; it underscores how important vigilant testing becomes as workflows grow more automated.

Writing directly to the database at key points streamlines the process, but it also means errors can propagate unnoticed. Sophisticated workflows still need active monitoring. If you're early in this space it's tempting to trust the pipeline blindly, yet every layer of automation adds complexity, and with complexity come pitfalls.

The choice between dbt's full refresh and the incremental updates of {targets} will ultimately come down to your requirements. In high-stakes environments where data integrity is paramount, the finer granularity with which {targets} detects changes, even when the code itself hasn't been touched, is worth considering.

Optimisation and oversight should walk hand in hand. As these technologies evolve, practitioners need to prioritise not just the efficiency of data processing but the integrity of the insights drawn from it. The tools are powerful, but human oversight remains essential for robust decision-making. Embrace the automation, but don't let it overshadow the critical importance of vigilance in data quality.
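As one small, concrete example of that vigilance (not something from the post itself), an R pipeline could carry a lightweight check that surfaces uncategorised rows the same way the dbt test warning did; the wording and behaviour here are my own choices:

```r
# Illustrative check: warn (rather than fail) when transactions are left
# uncategorised, mirroring the warning the dbt tests raised.
check_categorisation <- function(transactions_categorised) {
  n_uncat <- sum(transactions_categorised$merchant_category == "Uncategorised")
  if (n_uncat > 0) {
    warning(n_uncat, " transactions remain uncategorised; consider adding merchant keywords")
  }
  invisible(transactions_categorised)
}
```

Dropped into a {targets} pipeline as its own target, the warning is recorded alongside the build rather than silently lost.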
Source: Jonathan Carroll, https://www.r-bloggers.com/2026/05/comparing-rs-targets-and-dbt-for-data-engineering/