Amazon SageMaker offers several ways to run distributed data processing jobs with Apache Spark, a popular distributed computing framework for big data processing.
You can run Spark applications interactively from Amazon SageMaker Studio by connecting SageMaker Studio notebooks and AWS Glue Interactive Sessions to run Spark jobs with a serverless cluster. With interactive sessions, you can choose Apache Spark or Ray to easily process large datasets, without worrying about cluster management.
Alternately, if you need more control over the environment, you can use a pre-built SageMaker Spark container to run Spark applications as batch jobs on a fully managed distributed cluster with Amazon SageMaker Processing. This option lets you select among several types of instances (compute optimized, memory optimized, and more), the number of nodes in the cluster, and the cluster configuration, thereby enabling greater flexibility for data processing and model training.
Finally, you can run Spark applications by connecting Studio notebooks with Amazon EMR clusters, or by running your Spark cluster on Amazon Elastic Compute Cloud (Amazon EC2).
All these options allow you to generate and store Spark event logs and analyze them through the web-based user interface commonly named the Spark UI, which runs a Spark History Server to monitor the progress of Spark applications, track resource usage, and debug errors.
In this post, we share a solution for installing and running Spark History Server on SageMaker Studio and accessing the Spark UI directly from the SageMaker Studio IDE, for analyzing Spark logs produced by different AWS services (AWS Glue Interactive Sessions, SageMaker Processing jobs, and Amazon EMR) and stored in an Amazon Simple Storage Service (Amazon S3) bucket.
Solution overview
The solution integrates Spark History Server into the Jupyter Server app in SageMaker Studio. This allows users to access Spark logs directly from the SageMaker Studio IDE. The integrated Spark History Server supports the following:
- Accessing logs generated by SageMaker Processing Spark jobs
- Accessing logs generated by AWS Glue Spark applications
- Accessing logs generated by self-managed Spark clusters and Amazon EMR
A utility command line interface (CLI) called sm-spark-cli is also provided for interacting with the Spark UI from the SageMaker Studio system terminal. The sm-spark-cli lets you manage Spark History Server without leaving SageMaker Studio.
The solution consists of shell scripts that perform the following actions:
- Install Spark on the Jupyter Server for SageMaker Studio user profiles or for a SageMaker Studio shared space
- Install the sm-spark-cli for a user profile or shared space
Install the Spark UI manually in a SageMaker Studio domain
To host Spark UI on SageMaker Studio, complete the following steps:
- Choose System terminal from the SageMaker Studio launcher.
- Run the following commands in the system terminal:
The commands will take a few seconds to complete.
- When the installation is complete, you can start the Spark UI by using the provided sm-spark-cli and access it from a web browser by running the following code:
sm-spark-cli start s3://DOC-EXAMPLE-BUCKET/<SPARK_EVENT_LOGS_LOCATION>
The S3 location where the event logs produced by SageMaker Processing, AWS Glue, or Amazon EMR are stored can be configured when running Spark applications.
For SageMaker Studio notebooks and AWS Glue Interactive Sessions, you can set up the Spark event log location directly from the notebook by using the sparkmagic kernel.
The sparkmagic kernel contains a set of tools for interacting with remote Spark clusters through notebooks. It offers magic commands (%spark, %sql) to run Spark code, perform SQL queries, and configure Spark settings like executor memory and cores.
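As a rough illustration of the notebook-side configuration, the following sketch builds the body of a `%%configure` cell that turns on Spark event logging. It assumes a sparkmagic kernel that accepts Livy-style session settings under a `conf` key. The helper names are hypothetical, while `spark.eventLog.enabled` and `spark.eventLog.dir` are the standard Spark properties for event logging. AWS Glue Interactive Sessions may expect different configuration keys, so check the documentation for the kernel you use.

```python
import json


def build_event_log_conf(event_logs_uri: str) -> dict:
    """Return Livy-style session settings that enable Spark event logging.

    Hypothetical helper: the "conf" wrapper is an assumption about the kernel.
    """
    if not event_logs_uri.startswith(("s3://", "s3a://")):
        raise ValueError(f"expected an S3 URI, got: {event_logs_uri!r}")
    return {
        "conf": {
            # Standard Spark properties that control event logging.
            "spark.eventLog.enabled": "true",
            "spark.eventLog.dir": event_logs_uri,
        }
    }


def render_configure_cell(event_logs_uri: str) -> str:
    """Render text that can be pasted into a notebook cell."""
    body = json.dumps(build_event_log_conf(event_logs_uri), indent=2)
    return "%%configure -f\n" + body


if __name__ == "__main__":
    print(render_configure_cell("s3://DOC-EXAMPLE-BUCKET/<SPARK_EVENT_LOGS_LOCATION>"))
```

The event log location passed here should be the same S3 location you later hand to sm-spark-cli, so that the Spark History Server reads the logs your session wrote.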
For the SageMaker Processing job, you can configure the Spark event log location directly from the SageMaker Python SDK.
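A minimal sketch of this, assuming the SageMaker Python SDK is installed and that the job is launched through `PySparkProcessor.run` with its `spark_event_logs_s3_uri` argument. The role ARN, script name, job name, instance settings, and framework version are placeholders chosen for illustration, not values from this post.

```python
def build_run_arguments(submit_app: str, event_logs_uri: str) -> dict:
    """Collect keyword arguments for PySparkProcessor.run() (hypothetical helper)."""
    if not event_logs_uri.startswith("s3://"):
        raise ValueError("spark_event_logs_s3_uri must be an s3:// URI")
    return {
        "submit_app": submit_app,
        # S3 prefix where the Spark event logs of the job are published.
        "spark_event_logs_s3_uri": event_logs_uri,
    }


def run_spark_job(role_arn: str, submit_app: str, event_logs_uri: str) -> None:
    """Launch a SageMaker Processing Spark job that writes event logs to S3."""
    # Imported lazily so build_run_arguments works without the SDK installed.
    from sagemaker.spark.processing import PySparkProcessor

    processor = PySparkProcessor(
        base_job_name="spark-event-logs-demo",  # placeholder job name
        framework_version="3.1",                # assumed Spark container version
        role=role_arn,
        instance_count=2,                       # placeholder cluster size
        instance_type="ml.m5.xlarge",           # placeholder instance type
    )
    processor.run(**build_run_arguments(submit_app, event_logs_uri))


if __name__ == "__main__":
    print(build_run_arguments(
        "preprocess.py",
        "s3://DOC-EXAMPLE-BUCKET/<SPARK_EVENT_LOGS_LOCATION>",
    ))
```

Calling `run_spark_job` requires AWS credentials and an execution role with access to the bucket; the argument builder can be checked locally without either.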
Refer to the AWS documentation for additional information:
You can choose the generated URL to access the Spark UI.
The following screenshot shows an example of the Spark UI.
You can check the status of the Spark History Server by using the sm-spark-cli status command in the Studio System terminal.
You can also stop the Spark History Server when needed.
Automate the Spark UI installation for users in a SageMaker Studio domain
As an IT admin, you can automate the installation for SageMaker Studio users by using a lifecycle configuration. This can be done for all user profiles under a SageMaker Studio domain or for specific ones. See Customize Amazon SageMaker Studio using Lifecycle Configurations for more details.
You can create a lifecycle configuration from the install-history-server.sh script and attach it to an existing SageMaker Studio domain. The installation is run for all the user profiles in the domain.
From a terminal configured with the AWS Command Line Interface (AWS CLI) and appropriate permissions, run the following commands:
After Jupyter Server restarts, the Spark UI and the sm-spark-cli will be available in your SageMaker Studio environment.
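The lifecycle-configuration flow described in this section can also be sketched with boto3 instead of the AWS CLI. This is an illustration under assumptions, not the exact commands from this post: the configuration name, domain ID, and local script path are placeholders, the helper names are hypothetical, and the sketch relies on the SageMaker `create_studio_lifecycle_config` and `update_domain` API calls.

```python
import base64


def encode_script(script_text: str) -> str:
    """Base64-encode the script, as the lifecycle configuration API expects."""
    return base64.b64encode(script_text.encode("utf-8")).decode("ascii")


def build_default_user_settings(lifecycle_config_arn: str) -> dict:
    """Build domain settings that run the script for every user profile."""
    return {
        "JupyterServerAppSettings": {
            "DefaultResourceSpec": {
                "InstanceType": "system",
                # Setting the ARN here makes the configuration run by default.
                "LifecycleConfigArn": lifecycle_config_arn,
            },
            "LifecycleConfigArns": [lifecycle_config_arn],
        }
    }


def attach_to_domain(domain_id: str, script_path: str, config_name: str) -> str:
    """Create the lifecycle configuration and attach it to a Studio domain."""
    # Imported lazily so the pure helpers above work without boto3 installed.
    import boto3

    client = boto3.client("sagemaker")
    with open(script_path, encoding="utf-8") as handle:
        content = encode_script(handle.read())
    response = client.create_studio_lifecycle_config(
        StudioLifecycleConfigName=config_name,
        StudioLifecycleConfigContent=content,
        StudioLifecycleConfigAppType="JupyterServer",
    )
    arn = response["StudioLifecycleConfigArn"]
    client.update_domain(
        DomainId=domain_id,
        DefaultUserSettings=build_default_user_settings(arn),
    )
    return arn
```

A call such as `attach_to_domain("<DOMAIN_ID>", "install-history-server.sh", "<CONFIG_NAME>")` needs AWS credentials with permission to modify the domain. To target specific user profiles rather than the whole domain, the same settings structure could be passed to the user profile update API instead.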
Clean up
In this section, we show you how to clean up the Spark UI in a SageMaker Studio domain, either manually or automatically.
Manually uninstall the Spark UI
To manually uninstall the Spark UI in SageMaker Studio, complete the following steps:
- Choose System terminal in the SageMaker Studio launcher.
- Run the following commands in the system terminal:
Uninstall the Spark UI automatically for all SageMaker Studio user profiles
To automatically uninstall the Spark UI in SageMaker Studio for all user profiles, complete the following steps:
- On the SageMaker console, choose Domains in the navigation pane, then choose the SageMaker Studio domain.
- On the domain details page, navigate to the Environment tab.
- Select the lifecycle configuration for the Spark UI on SageMaker Studio.
- Choose Detach.
- Delete and restart the Jupyter Server apps for the SageMaker Studio person profiles.
Conclusion
In this post, we shared a solution you can use to quickly install the Spark UI on SageMaker Studio. With the Spark UI hosted on SageMaker, machine learning (ML) and data engineering teams can use scalable cloud compute to access and analyze Spark logs from anywhere and speed up their project delivery. IT admins can standardize and expedite the provisioning of the solution in the cloud and avoid proliferation of custom development environments for ML projects.
All the code shown as part of this post is available in the GitHub repository.
About the Authors
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for Amazon Web Services. With several years of software engineering experience and an ML background, he works with customers of any size to understand their business and technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. He has worked on projects in different domains, including MLOps, computer vision, and NLP, involving a broad set of AWS services. In his free time, Giuseppe enjoys playing soccer.
Bruno Pistone is an AI/ML Specialist Solutions Architect for AWS based in Milan. He works with customers of any size, helping them understand their technical needs and design AI and ML solutions that make the best use of the AWS Cloud and the Amazon Machine Learning stack. His field of expertise includes end-to-end machine learning, machine learning industrialization, and generative AI. He enjoys spending time with his friends and exploring new places, as well as traveling to new destinations.