How to Accelerate SQL Server Data Fetching Using Apache Arrow in mssql-python

Introduction

Fetching a million rows from SQL Server into a Polars DataFrame used to mean creating a million Python objects, triggering countless garbage-collector allocations, and then discarding it all to build the DataFrame. That overhead is now gone. The mssql-python driver, thanks to a community contribution by Felix Graßl (@ffelixg), can fetch SQL Server data directly as Apache Arrow structures. This guide walks you through using Arrow support in mssql-python to make your data pipelines faster and more memory-efficient, whether you work with Polars, Pandas, DuckDB, or any other Arrow-native library.

How to Accelerate SQL Server Data Fetching Using Apache Arrow in mssql-python — Source: devblogs.microsoft.com

What You Need

Python 3.8+ installed on your system.
mssql-python driver (version 1.2.0 or later) with Arrow support. Install via pip install mssql-python[arrow].
A target library that can consume Arrow data: Polars, Pandas (with ArrowDtype), DuckDB, or similar.
SQL Server database with appropriate credentials and network access.

Step-by-Step Guide

Step 1: Install Required Packages

First, ensure you have the latest mssql-python package with Arrow support. Open your terminal and run:

pip install mssql-python[arrow] polars

This installs both the driver and Polars as an example consumer. For Pandas or DuckDB, adjust accordingly.

Step 2: Import Libraries

Create a Python script and import the necessary modules:

import mssql
import polars as pl
from mssql import connect

The mssql module provides the driver; polars will be used to verify the zero-copy integration.

Step 3: Establish a Connection

Use your SQL Server credentials to create a connection. Replace placeholders with your actual server, database, username, and password:

connection = connect(
    server="your_server.database.windows.net",
    database="your_database",
    username="your_username",
    password="your_password"
)
cursor = connection.cursor()

Step 4: Execute a Query and Fetch as Arrow

Execute a SELECT query. The key new method is fetcharrow(), which returns an Arrow Table directly—no Python objects per row are created:

cursor.execute("SELECT TOP 1000000 * FROM large_table")
arrow_table = cursor.fetcharrow()

This call runs entirely in C++, writing values into contiguous, typed Arrow buffers. Nulls are tracked with a compact bitmap—no None objects per cell.

Step 5: Convert to Your DataFrame Library

Because Arrow uses a stable ABI (the Arrow C Data Interface), conversion is instantaneous—zero-copy:

To Polars: df = pl.from_arrow(arrow_table)
To Pandas: df = arrow_table.to_pandas(types_mapper=pd.ArrowDtype)
To DuckDB: duckdb.sql("SELECT * FROM arrow_table")

No serialization, no copies, no re-parsing—just a pointer exchange.

Step 6: Perform Operations Without Python Overhead

Once data is in an Arrow-native DataFrame, operations like filters, joins, and aggregations also work in-place on the same shared memory buffers. For example, in Polars:

result = df.filter(pl.col("datetime_col") > "2023-01-01").group_by("category").agg(pl.col("value").sum())

No intermediate Python objects are materialized at any stage—this is the foundation for high-throughput pipelines.

Tips and Best Practices

Speed gains are most noticeable with temporal types (like DATETIME and DATETIMEOFFSET) because per-value Python-side conversions are eliminated entirely.
Memory usage drops dramatically: A column of one million integers becomes a single contiguous C array, not a million individual Python objects. Great for large datasets.
Seamless interoperability with other Arrow-native tools—Polars, Pandas (with ArrowDtype), DuckDB, and even tools like Hugging Face datasets—means you can mix and match libraries without conversion costs.
Ensure you're using a recent version of mssql-python (1.2.0+) to have the fetcharrow() method available.
For best results, avoid fetching rows individually—always use bulk fetch methods like fetcharrow() when working with Arrow.
Monitor your garbage collector; with Arrow, you'll see far less GC pressure, making your overall application more predictable.

Tags: