
Data processing: Apache Spark 3.1 wants to become a Zen master for Python

The first minor release of the Apache Spark 3.x line arrives around eight months after the last major version. The notable innovations in the analytics engine include improved integration with the Python programming language and better ANSI SQL compliance. In addition, structured streaming applications can now be connected to the history server.

The release also brings numerous changes under the hood aimed at better performance in data processing. Spark on Kubernetes is now generally available for containerized applications. Furthermore, worker nodes can be decommissioned in an orderly manner, both under Kubernetes and in standalone mode.
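
Graceful decommissioning is controlled through configuration. The following minimal sketch, assuming Spark 3.1 and the spark.decommission.enabled setting from the release notes, shows how a PySpark application might opt in:

    from pyspark.sql import SparkSession

    # Build a session that opts in to graceful node decommissioning.
    # spark.decommission.enabled tells executors to migrate work away
    # before a node is removed, e.g. when Kubernetes scales a pod down.
    spark = (
        SparkSession.builder
        .appName("decommission-demo")
        .config("spark.decommission.enabled", "true")
        # Optionally also migrate cached blocks off the departing node.
        .config("spark.storage.decommission.enabled", "true")
        .getOrCreate()
    )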

Better Python connection with Project Zen

Project Zen aims to make working with Python in Apache Spark more user-friendly and thus expands the efforts around the programming language started in version 3.0. Apache Spark is written in Scala and offers numerous APIs tailored to the JVM language, but more and more developers rely on the PySpark module to write Spark applications in Python. The Spark team has recognized that PySpark is not sufficiently "Pythonic", i.e. not really tailored to working with Python.

Among other things, the API documentation has been inadequate. In addition, PySpark has tended to surface numerous JVM error messages that are not very helpful in a Python context. Project Zen aims, among other things, to output clearer error messages and warnings, to improve the documentation, and to provide handy examples in the API documentation. PySpark should also work better with popular Python libraries such as NumPy and pandas, as the sketch below illustrates.
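
The interplay with pandas can already be seen in vectorized user-defined functions. A minimal sketch using the existing pandas_udf API (the column and function names are made up for illustration):

    import pandas as pd
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import pandas_udf

    spark = SparkSession.builder.appName("pandas-udf-demo").getOrCreate()
    df = spark.range(5).withColumnRenamed("id", "x")

    # A vectorized UDF: receives a pandas Series per batch and
    # returns a pandas Series of the same length.
    @pandas_udf("double")
    def squared(x: pd.Series) -> pd.Series:
        return (x * x).astype("float64")

    df.select(squared("x")).show()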

For the current release, the developers have revised the PySpark documentation and modeled it on the style of the NumPy documentation. The module now also ships type hints for source code editors and development environments, and it improves the exception messages of user-defined functions (UDFs).
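
With the bundled type hints, editors and IDEs can check PySpark code statically and offer autocompletion. A small sketch of what that enables (the function itself is a made-up example):

    from pyspark.sql import DataFrame
    from pyspark.sql.functions import col

    # Thanks to PySpark's type hints, an IDE can verify that this
    # function receives and returns a DataFrame and can autocomplete
    # methods such as .filter() and functions such as col().
    def adults_only(people: DataFrame) -> DataFrame:
        return people.filter(col("age") >= 18)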

More ANSI for SQL

Another focus of Apache Spark 3.1 is ANSI SQL compliance. Among other things, the release now offers the CHAR/VARCHAR data types and the SQL command SET TIME ZONE. In addition, in ANSI mode the engine now delivers runtime errors for invalid input instead of returning null results.
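
The behavior can be toggled via the existing spark.sql.ansi.enabled setting; a minimal sketch:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ansi-demo").getOrCreate()

    # Default (non-ANSI) mode: an invalid cast silently yields null.
    spark.conf.set("spark.sql.ansi.enabled", "false")
    spark.sql("SELECT CAST('abc' AS INT)").show()   # -> null

    # ANSI mode: the same query now fails with a runtime error.
    spark.conf.set("spark.sql.ansi.enabled", "true")
    try:
        spark.sql("SELECT CAST('abc' AS INT)").show()
    except Exception as e:
        print("runtime error:", e)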

The release also brings a unified SQL syntax for CREATE TABLE as well as new rules for explicit type conversions. For the latter, a table in the Spark documentation gives an overview of the source and target types allowed for CAST.
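
A short sketch of both features, continuing with the session from the previous example; the table name and schema are made up for illustration:

    # Unified CREATE TABLE syntax: the USING clause picks the data source.
    spark.sql("""
        CREATE TABLE IF NOT EXISTS events (id BIGINT, label VARCHAR(10))
        USING parquet
    """)

    # Explicit conversion: CAST from STRING to DATE is on the allowed list.
    spark.sql("SELECT CAST('2021-03-02' AS DATE)").show()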

The story of the stream

Applications that use structured streaming can now be tracked in Spark's history server, which records the operations and metrics of an application and helps with diagnosing and debugging it.
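
For the history server to see a streaming application, its event log has to be enabled. A minimal sketch of a streaming job with event logging turned on; the log directory is a made-up example path:

    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("streaming-history-demo")
        # Write event logs so the history server can replay this app.
        .config("spark.eventLog.enabled", "true")
        .config("spark.eventLog.dir", "file:///tmp/spark-events")  # example path
        .getOrCreate()
    )

    # A tiny structured streaming query: a rate source feeding the console.
    query = (
        spark.readStream.format("rate").option("rowsPerSecond", 1).load()
        .writeStream.format("console").start()
    )
    query.awaitTermination(10)  # run briefly for demonstration
    query.stop()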

The official announcement of Spark 3.1.1 lists the other new features of the current release, and a detailed overview can be found on the associated Jira page. In addition to the current version, the download page also offers Spark 3.0.2 and 2.4.7. The 3.x releases are available prebuilt for Hadoop 2.7 and 3.2+ as well as in source form.

(rme)