• Benchmarking Power BI import speed for local data sources

    TL;DR – The fastest local format for importing data into Power BI is Parquet and then… MS Access?

    The chart above shows the number of seconds it took to load X million rows of data from a given data source, according to a profiler trace and Phil Seamark’s Refresh visualizer. Parquet is the clear winner by a wide margin, with MS Access surprisingly coming in second. Sadly, the 2 GB file size limit stops Access from becoming the big data format of the future.

    Part of the reason I wanted to run these tests is that people on Reddit often complain that their refresh is slow and their CPU is maxed out. This is almost always a sign that they are importing oodles and oodles of CSV files. I’ve been recommending Parquet instead of CSV, but it’s nice to have concrete proof that it’s the better file source.
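    If you want to try the swap on your own data, converting a folder of CSVs to Parquet only takes a few lines of Python. Here’s a minimal sketch, assuming pandas and pyarrow are installed; the folder paths are placeholders, not my actual setup:

    ```python
    # Minimal sketch: convert a folder of CSV files to Parquet.
    # Assumes pandas and pyarrow are installed; paths are placeholders.
    from pathlib import Path
    import pandas as pd

    source_folder = Path("C:/data/csv")      # placeholder input folder
    target_folder = Path("C:/data/parquet")  # placeholder output folder
    target_folder.mkdir(parents=True, exist_ok=True)

    for csv_file in source_folder.glob("*.csv"):
        df = pd.read_csv(csv_file)
        # snappy compression is the pyarrow default and Power BI reads it fine
        df.to_parquet(target_folder / (csv_file.stem + ".parquet"), engine="pyarrow")
    ```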

    For clarification, SQL_CCI means I used a clustered columnstore index on the transaction table and “JSON – no types” means all of the data was stored as text strings, even the numbers.
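    To make the “no types” idea concrete, it amounts to casting every column to text before serializing, along the lines of this sketch (pandas assumed; the file names are placeholders rather than my actual conversion script):

    ```python
    # Sketch of the "JSON – no types" variant: every value, numbers included, ends up as a string.
    # Assumes pandas; file names are placeholders.
    import pandas as pd

    df = pd.read_csv("sales.csv")  # placeholder source file

    # Typed JSON keeps numeric columns as numbers...
    df.to_json("sales_typed.json", orient="records", lines=True)

    # ...while casting everything to str first stores them all as text strings.
    df.astype(str).to_json("sales_no_types.json", orient="records", lines=True)
    ```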

    Finally, if you like this kind of content, let me know! This took about 2 days of configuration, prep, and testing to do. It also involved learning things like the fact that the Contoso generated dataset has Nan as a given name, which my python code interpreted as NaN, causing Power BI to throw an error. I’m considering doing something similar for Fabric data sources when Fabric DBs show up in my tenant.
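    If you ever hit the same kind of issue, pandas can be told not to treat strings like that as missing values. This is just an illustrative sketch, assuming pandas is doing the CSV reading; the file and column names are made up:

    ```python
    # Sketch: keep a literal "Nan" in a name column instead of turning it into a missing value.
    # Assumes pandas reads the source CSV; file and column names are hypothetical.
    import pandas as pd

    customers = pd.read_csv(
        "customer.csv",
        dtype={"GivenName": "string"},  # hypothetical column name
        keep_default_na=False,          # don't auto-convert "NaN"-style strings to missing
        na_values=[""],                 # still treat empty cells as missing
    )
    ```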

    Methodology

    All of these tests were run on my GIGABYTE – G6 KF 16″ 165Hz Gaming Work Laptop (don’t tell my accountants). It has an Intel i7-13620H 2.40 GHz processor, 32 GB of RAM, and a Gigabyte ag450e1024-si secondary SSD. The only time a resource seemed to be maxed out was my RAM during the 100 million row SQL test (but not the columnstore version). For the SQL tests, I was running SQL Server 2022.

    The data I used was the Contoso generated dataset from the folks at SQLBI.com. This is a great resource if you want to do any sort of performance testing around Star Schema data. I had to manually convert it to JSON, XML, Excel and MS Access. For Excel, I had to use 3 files for the transaction table.
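    Splitting a big table across multiple Excel files is simple enough if you ever need to do the same. Here’s a rough sketch, assuming pandas with openpyxl installed; the file names and chunk size are placeholders, with the chunk size chosen to stay under Excel’s 1,048,576-row sheet limit:

    ```python
    # Sketch: split a large transaction table into several .xlsx files.
    # Assumes pandas with openpyxl installed; file names and chunk size are placeholders.
    import pandas as pd

    sales = pd.read_csv("sales.csv")  # placeholder source
    chunk_size = 1_000_000            # stays under Excel's 1,048,576-row sheet limit

    for i, start in enumerate(range(0, len(sales), chunk_size), start=1):
        sales.iloc[start:start + chunk_size].to_excel(f"sales_part{i}.xlsx", index=False)
    ```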

    Initially, I was planning on testing in 10x increments from 10k rows to 100m. However, MS Access imported in under a second for both 10k and 100k, which made those sizes useless as benchmarks. Converting more than 1m rows of data to XML, JSON, and Excel seemed like more work than it was worth. That said, if someone really wants to see those numbers, I can figure it out.

    For recording the times, I did an initial run to warm any caches involved. Then I ran and recorded it 3 times and reported the median time in seconds. For 100m rows, it took so long that I just reported the initial run, since I didn’t want to spend half an hour importing data 4 times over.

    Want to try it yourself? Here’s a bunch of the files and some samples at the 10k level:

    Perf Data – local import blog.zip

    Want to learn more?

    If you want to learn more about performance tuning Power BI, consider checking out my training course. You can use code ACCESS24 to get it for $20 until Dec 6th.

  • Power BI performance tuning – what’s in the course?

    My course on performance tuning is live and you can use code LAUNCHBLOG for 50% off until Sunday February 11th. Module 1 is free on YouTube and Teachable, no signups.

    Performance tuning playlist – Module 1

    The goal of this course is to orient you to the various pieces of Power BI, identify the sources of problems, and give some general tips for solving them. If you are stuck and need answers now, this should help.

    Note! This is an early launch. Modules 1 and 2 are available now, and the remaining ones will be coming out weekly.

    • Module 1: A Guide to Performance Tuning. This module focuses on defining a performance tuning strategy, and all of the places where Power BI can be slow.
    • Module 2: Improving Refresh – Optimizing Power Query. Optimize Power Query by understanding its data-pulling logic, reducing the data being loaded, and leveraging query folding for faster refreshes.
    • Module 3: Improving Refresh – Measuring Refresh Performance. Master measuring refresh performance using diagnostics and the refresh visualizer to identify which parts are slow.
    • Module 4: Improving Rendering – Modeling. Better modeling means faster rendering. Understand the internals of models, using columnar storage, star schema, and tools like DAX Studio for optimization.
    • Module 5: Improving Rendering – DAX Code. Optimize DAX code to run faster, focusing on minimizing formula engine workload and effective data pre-calculation.
    • Module 6: Improving Rendering – Visuals. Streamline visuals for better performance by minimizing objects, avoiding complex visuals, and using just-in-time context with report tooltips and drill-through pages.
    • Module 7: Improving DirectQuery. Optimize DirectQuery with strategies to limit querying, improve SQL performance, and employ advanced features like user-defined aggregations, composite models, and hybrid tables.

    Each module after the first covers how to solve performance problems in a specific area. Each module also provides demos of the various tools you can use (of which there are many, see below).