Supercharging SQL Server Data Pipelines: Arrow Support in mssql-python
Faster, Leaner Data Fetching with Apache Arrow
Retrieving a million rows from SQL Server into a Polars DataFrame used to be a sluggish process: each row required creating a Python object, allocating memory in the garbage collector, and then discarding those objects to build the DataFrame. That era is over. The mssql-python driver now supports fetching SQL Server data directly as Apache Arrow structures, offering a faster and more memory-efficient path for anyone working with Polars, Pandas, DuckDB, or other Arrow-native libraries. This feature was contributed by community developer Felix Graßl (@ffelixg), and we are excited to make it available.

Understanding the Core Concepts
Before diving into the benefits, it helps to clarify a few terms that underpin this new capability:
- API (Application Programming Interface): A source-code contract that defines how to call a function or library.
- ABI (Application Binary Interface): A binary-level contract specifying how compiled code is laid out in memory. Two programs built in different languages can share an ABI and exchange data directly — no serialization is needed.
- Arrow C Data Interface: Apache Arrow’s ABI specification — the standard that makes zero-copy data exchange between languages possible.
What Is Apache Arrow?
The key insight behind Apache Arrow is zero-copy language interoperability. Arrow defines a stable shared-memory layout — the Arrow C Data Interface, a cross-language ABI — that any language can produce or consume by exchanging a pointer. This eliminates serialization, copies, and re-parsing. A C++ database driver and a Python DataFrame library can operate on the exact same memory without either one knowing about the other.
Built on this foundation, Arrow uses a columnar in-memory format: instead of representing a table as a list of rows (each row a collection of Python objects), Arrow stores all values for a column contiguously in a typed buffer. Nulls are tracked in a compact bitmap rather than per-cell None objects.
For a database driver, this means the entire fetch loop can run in C++ and write values directly into Arrow buffers — no Python object creation per row, no garbage-collector pressure. The DataFrame library receives a pointer to that memory and can begin operating on it immediately. Crucially, subsequent operations — filters, joins, aggregations — also work in-place on those same buffers. A Polars pipeline reading from mssql-python never needs to materialize intermediate Python objects at any stage, making Arrow the right foundation for high-throughput data processing pipelines.
Four Concrete Benefits for Users
1. Speed
The columnar fetch path avoids Python object creation per row, making fetching noticeably faster for many SQL Server types — especially temporal types like DATETIME and DATETIMEOFFSET, where Python-side per-value conversions are eliminated entirely.

2. Lower Memory Usage
A column of one million integers is a single contiguous C array, not a million individual Python objects. This drastically reduces memory consumption and GC overhead.
3. Seamless Interoperability
Arrow-native libraries like Polars, Pandas (via ArrowDtype), DuckDB, Hugging Face datasets, and others can consume the data directly without conversion. This streamlines multi-tool workflows.
4. Simplified Data Pipelines
By eliminating intermediate serialization steps, you can build cleaner, more efficient ETL processes. The same Arrow buffers can be passed between systems without re-formatting.
Getting Started with Arrow in mssql-python
To enable Arrow fetching, use the fetch_arrow=True parameter when executing a query. For example:
import mssql
conn = mssql.connect(server='localhost', database='test')
cursor = conn.cursor()
cursor.execute('SELECT * FROM large_table')
arrow_table = cursor.fetch_arrow() # Returns a PyArrow Table
You can then pass arrow_table directly to Polars, Pandas, or other Arrow-compatible tools.
Polars Integration Example
import polars as pl
df = pl.from_arrow(arrow_table) # zero-copy conversion
This avoids the overhead of creating a Python list of tuples and then constructing a DataFrame. For large datasets, the performance improvement is dramatic.
What’s Next?
The mssql-python team continues to enhance Arrow support, with plans to cover more SQL Server data types and optimize further. Community contributions are welcome — Felix Graßl’s work is a great example of how open-source collaboration accelerates innovation.
Explore the mssql-python repository for full documentation and examples.
Related Articles
- Choosing the Right Regularizer: A Data-Driven Framework from 134,400 Simulations
- From 61 Seconds to 0.2: How Polars Revolutionized a Real Data Workflow
- AI Knowledge Base Construction Must Be Iterative, Not One-Time, Experts Warn
- 10 Essential Steps to Craft a High-Performance Knowledge Base for AI Models
- Introduction to Time Series Analysis with Python
- How to Adapt Your AI Development Plans After Apple’s Mac Mini Price Surge
- mssql-python Now Supports Apache Arrow: A Q&A Guide
- 7 Key Facts About Apache Arrow Support in mssql-python