mssql-python Now Supports Apache Arrow: Zero-Copy Data Fetching for Polars, Pandas, DuckDB
April 2025 – In a major performance upgrade for data engineers and scientists, the mssql-python driver now supports fetching SQL Server query results directly as Apache Arrow structures. The change eliminates the traditional overhead of creating millions of Python objects and garbage-collection cycles, enabling near-zero-copy data exchange between SQL Server and Arrow-native libraries like Polars, Pandas, DuckDB, and Hugging Face datasets.

“This is a game-changer for anyone moving large datasets from SQL Server into Python analytics frameworks,” said Sumit Sarabhai, a reviewer of the feature. “By leveraging the Arrow C Data Interface, we skip the per-row Python object creation entirely. The entire fetch runs in C++ and writes directly into Arrow buffers – users see immediate speed gains and dramatically lower memory usage.”
The feature was contributed by community developer Felix Graßl (@ffelixg) and has been merged into the main mssql-python project. It is available starting in version [insert version if known].
Background: Why Apache Arrow Matters for Database Drivers
Apache Arrow is an open-source columnar in-memory format that defines a stable shared-memory layout called the Arrow C Data Interface. This cross-language ABI (Application Binary Interface) allows any two programs – even ones written in different languages – to exchange data via a pointer with zero serialization, zero copying, and zero re-parsing.
Previously, fetching one million rows from SQL Server meant creating one million Python objects in memory, each with its own allocation and eventual garbage collection. The DataFrame library then had to convert those objects into its internal columnar format, causing further overhead. With Arrow, the database driver allocates typed buffers for each column and writes values directly into them – no Python objects, no GC pressure.
“Arrow’s zero-copy design means that a C++ driver and a Python DataFrame library can operate on the exact same memory without either one knowing about the other,” explained Graßl. “This isn’t just about speed – it’s about enabling truly seamless interoperability across the data stack.”
Key Terms
- API (Application Programming Interface): A source-code contract that defines how to call a function or library.
- ABI (Application Binary Interface): A binary-level contract that specifies how compiled code is laid out in memory. Two programs built in different languages can share an ABI and exchange data directly – no serialization needed.
- Arrow C Data Interface: Apache Arrow’s ABI specification – the standard that makes zero-copy data exchange between languages possible.
What This Means for Users
For anyone using mssql-python with Polars, Pandas (via ArrowDtype), DuckDB, or other Arrow-native tools, this update delivers four concrete benefits:
- Speed: The columnar fetch path avoids per-row Python object creation, which should make fetching noticeably faster for many SQL Server types – especially temporal types like DATETIME and DATETIMEOFFSET, where Python-side per-value conversions are eliminated entirely.
- Lower memory usage: A column of one million integers becomes a single contiguous C array, not a million individual Python objects. This reduces memory footprint and GC pressure significantly.
- Seamless interoperability: Polars, Pandas, DuckDB, and Hugging Face datasets can consume Arrow data directly. A Polars pipeline reading from mssql-python never needs to materialize intermediate Python objects at any stage.
- Future-proofing: As more tools adopt Arrow as a universal interchange format, mssql-python users will naturally integrate with the broader data ecosystem without custom shims.
“The performance gains are most dramatic for large result sets with many rows and complex types,” Sarabhai noted. “We expect this to become the default fetch method for high-throughput data pipelines connecting SQL Server to Python analytics.”

To enable Arrow support, users simply need to update their mssql-python installation and use the appropriate cursor or connection parameters. Detailed documentation is available in the official mssql-python repository.
Impact on the Data Engineering Landscape
This update positions mssql-python as a first-class citizen in the Arrow ecosystem, alongside drivers for PostgreSQL, Snowflake, and others that already support Arrow-based fetches. It lowers the friction for organizations that rely on SQL Server as their primary database but want to leverage modern Python-native analytics tools.
“We’re seeing a clear trend: database drivers that adopt Arrow are becoming the go-to choice for data scientists and engineers,” said Graßl. “mssql-python’s Arrow support closes a critical gap and makes SQL Server a viable backend for Arrow-native workflows.”
The community is encouraged to test the feature and report any issues via GitHub. Future development may include support for additional Arrow data types and optional zero-copy optimizations.