Supercharging SQL Server Data Pipelines: Arrow Support in mssql-python

Faster, Leaner Data Fetching with Apache Arrow

Retrieving a million rows from SQL Server into a Polars DataFrame used to be a sluggish process: each row required creating a Python object, allocating memory in the garbage collector, and then discarding those objects to build the DataFrame. That era is over. The mssql-python driver now supports fetching SQL Server data directly as Apache Arrow structures, offering a faster and more memory-efficient path for anyone working with Polars, Pandas, DuckDB, or other Arrow-native libraries. This feature was contributed by community developer Felix Graßl (@ffelixg), and we are excited to make it available.

Supercharging SQL Server Data Pipelines: Arrow Support in mssql-python — Source: devblogs.microsoft.com

Understanding the Core Concepts

Before diving into the benefits, it helps to clarify a few terms that underpin this new capability:

API (Application Programming Interface): A source-code contract that defines how to call a function or library.
ABI (Application Binary Interface): A binary-level contract specifying how compiled code is laid out in memory. Two programs built in different languages can share an ABI and exchange data directly — no serialization is needed.
Arrow C Data Interface: Apache Arrow’s ABI specification — the standard that makes zero-copy data exchange between languages possible.

What Is Apache Arrow?

The key insight behind Apache Arrow is zero-copy language interoperability. Arrow defines a stable shared-memory layout — the Arrow C Data Interface, a cross-language ABI — that any language can produce or consume by exchanging a pointer. This eliminates serialization, copies, and re-parsing. A C++ database driver and a Python DataFrame library can operate on the exact same memory without either one knowing about the other.

Built on this foundation, Arrow uses a columnar in-memory format: instead of representing a table as a list of rows (each row a collection of Python objects), Arrow stores all values for a column contiguously in a typed buffer. Nulls are tracked in a compact bitmap rather than per-cell None objects.

For a database driver, this means the entire fetch loop can run in C++ and write values directly into Arrow buffers — no Python object creation per row, no garbage-collector pressure. The DataFrame library receives a pointer to that memory and can begin operating on it immediately. Crucially, subsequent operations — filters, joins, aggregations — also work in-place on those same buffers. A Polars pipeline reading from mssql-python never needs to materialize intermediate Python objects at any stage, making Arrow the right foundation for high-throughput data processing pipelines.

Four Concrete Benefits for Users

1. Speed

The columnar fetch path avoids Python object creation per row, making fetching noticeably faster for many SQL Server types — especially temporal types like DATETIME and DATETIMEOFFSET, where Python-side per-value conversions are eliminated entirely.

2. Lower Memory Usage

A column of one million integers is a single contiguous C array, not a million individual Python objects. This drastically reduces memory consumption and GC overhead.

3. Seamless Interoperability

Arrow-native libraries like Polars, Pandas (via ArrowDtype), DuckDB, Hugging Face datasets, and others can consume the data directly without conversion. This streamlines multi-tool workflows.

4. Simplified Data Pipelines

By eliminating intermediate serialization steps, you can build cleaner, more efficient ETL processes. The same Arrow buffers can be passed between systems without re-formatting.

Getting Started with Arrow in mssql-python

To enable Arrow fetching, use the fetch_arrow=True parameter when executing a query. For example:

import mssql

conn = mssql.connect(server='localhost', database='test')
cursor = conn.cursor()
cursor.execute('SELECT * FROM large_table')
arrow_table = cursor.fetch_arrow()  # Returns a PyArrow Table

You can then pass arrow_table directly to Polars, Pandas, or other Arrow-compatible tools.

Polars Integration Example

import polars as pl
df = pl.from_arrow(arrow_table)  # zero-copy conversion

This avoids the overhead of creating a Python list of tuples and then constructing a DataFrame. For large datasets, the performance improvement is dramatic.

What’s Next?

The mssql-python team continues to enhance Arrow support, with plans to cover more SQL Server data types and optimize further. Community contributions are welcome — Felix Graßl’s work is a great example of how open-source collaboration accelerates innovation.

Explore the mssql-python repository for full documentation and examples.

Tags: