😺
wiki
  • Welcome, Internet Strangers!
  • sql
    • etl
      • Basics to Remember
      • Context to Remember
      • Naming Practices
      • ETL Steps
    • performance
      • Please No
      • Initial Explorations
      • Stored Procedures
    • select-tricks
      • Over Partition
      • Stored Procedures*
      • Creating Parameters
  • python
    • Working with Files
    • Classes in Python
    • Dictionaries
    • Working with Strings
    • Using Lambda
    • Seaborn
    • machine-learning
      • Learning Pandas
      • MatPlotLib! The Dreaded Line Graph...
      • matlab-qualgraphs-notes
      • Linear Regression Example
      • kNN Analysis in ScikitLearn
    • Neat Snippets
  • bash
    • helpful_cmd
  • math
    • Basic Definitions
    • Linear Regressions
    • Meaningful Sampling
Powered by GitBook
On this page
  • Excessive JOINs
  • Lack of Testing
  • Using Large Batches
  • Nesting Tables
  • DISTINCT

Was this helpful?

  1. sql
  2. performance

Please No

A list of ways to dramatically slow down queries

A lot of performant SQL writing is avoiding issues if at all posssible. Here are some of those issues.

Excessive JOINs

While potentially necessary for complex queries, reducing the number of JOINs in a process is helpful. Potentially, pulling down two datasets then merging them in the data analysis environment will take less time.

If multiple JOINs become necessary for common tasks, creating a custom VIEW in the database can yield great performance boosts.

Lack of Testing

Nearly any system will have an ability to see how long a function takes to complete. If a function is going to be run on a significant basis, testing different ways to achieve the same results is a good use of time.

Some data products even have profilers which will optimize queries or ETL steps. Searching for and using these tools can save analyst time and processing time. Databricks/Spark in particular has a rich toolset to measure and minimize processing time.

Using Large Batches

When trying to pull a large amount of data, multiple small batches is better than a few large batches. Joining together a large clustser of small batches seems like more hassle, but gives these advantages:

  1. Smaller batches are easier to rerun if something goes wrong

  2. Small batches can be run across multiple processors

  3. Small batches tend to have less memory failures

  4. Gaps are easier to spot in small batches (e.g., if using daily batches, a missed day can be spotted faster than if using monthly batches)

Nesting Tables

As an exaggeration of what can be done but probably should never happen:

SELECT * FROM (
    SELECT *, COUNT(col1) 
    FROM (
        SELECT col1, col2, col3
        FROM table1 )
    WHERE col1 == col2
    )
WHERE col3 IS NOT NULL

DISTINCT

Distinct is helpful, but also can massively slow down a query. Try not to use this when possible.

PreviousperformanceNextInitial Explorations

Last updated 3 years ago

Was this helpful?