Usage of Glue Data Catalog with sagemaker_pyspark #109
Comments
Can you post the error message you got? Also, the currently supported Spark version is 2.2.
Hi,
Sorry for the slow reply here. It looks like the code you're referencing is more about PySpark and Glue rather than this sagemaker-pyspark library, so apologies if some of my questions/suggestions seem too basic. What kind of log messages are showing you that it's not using your configuration? I did some Googling and found https://forums.aws.amazon.com/thread.jspa?threadID=263860. When I compare your code to the last reply in that thread, I notice that your code doesn't have parentheses with builder. Perhaps you need to invoke it with [...]
Hi @laurenyu, I'm having the same issue as @mattiamatrix above, where instructing Spark to use the Glue catalog as a metastore doesn't throw any errors but also does not appear to have any effect at all, with Spark defaulting to using the local catalog. I looked at the reference you suggested from the AWS forums, but I believe that example is in Scala (or maybe Java?), and adding the parentheses to [...]
Happy to provide any additional information if that's helpful.
Hi @mattiamatrix and @krishanunandy, thanks for the reply. I'm not exactly sure of your setup, but I noticed from the original post that you were attempting to follow the cited guide. As noted there, "this is do-able via EMR" by enabling "Use AWS Glue Data Catalog for table metadata" on cluster launch, which ensures the necessary jar is available on the cluster instances and on the classpath. However, when using a notebook launched from the AWS SageMaker console, the necessary jar is not part of the classpath. Launching a notebook instance with, say, [...]
Can you provide more details on your setup?
Hi @metrizable! Thanks for following up! I ran the code snippet you posted on my SageMaker instance that's running the [...]. At the top of my code I create a [...]
Do you know where I can find the jar file? I'm optimistically presuming that once I have the jar, something like this [...] and adding [...]
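The code spans in the comment above did not survive in this thread. As a hedged stand-in, this is the kind of classpath configuration one might try once a Glue metastore-client jar is in hand; the jar path is hypothetical, while the property names are standard Spark configuration keys.

```python
# Hypothetical sketch: the Glue client jar has to reach both the driver
# and executor classpaths. The path below is an assumption for illustration.
GLUE_JAR = "/home/ec2-user/jars/aws-glue-datacatalog-hive-client.jar"

classpath_conf = {
    "spark.jars": GLUE_JAR,                     # ship the jar with the app
    "spark.driver.extraClassPath": GLUE_JAR,    # driver-side classpath
    "spark.executor.extraClassPath": GLUE_JAR,  # executor-side classpath
}

# Applying it when building the session requires pyspark, so that part is
# shown as a comment to keep this fragment runnable anywhere:
#
# from pyspark.sql import SparkSession
# builder = SparkSession.builder.appName("glue-test")
# for key, value in classpath_conf.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
```

Note that `extraClassPath` settings take effect only if they are set before the JVM for the session starts, which is why they belong in the builder (or `spark-defaults.conf`) rather than being changed on a live session.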
Sorry for the delayed response. I talked to @metrizable, and it looks like https://github.com/awslabs/aws-glue-data-catalog-client-for-apache-hive-metastore probably contains the right class.
I found https://github.com/tinyclues/spark-glue-data-catalog, which looks to be an unofficial build that contains [...]
Does that help?
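Since builds of this client vary, it may be worth verifying that a candidate jar (official or the unofficial build mentioned above) actually bundles a Glue Hive metastore client class before wiring it in. A jar is just a zip archive, so a small check is possible; the example path and class-name fragment below are assumptions.

```python
import zipfile


def jar_contains_class(jar_path, class_fragment):
    """Return True if any entry in the jar's file listing matches class_fragment."""
    with zipfile.ZipFile(jar_path) as jar:
        return any(class_fragment in entry for entry in jar.namelist())


# Hypothetical usage (path and class name are assumptions to verify):
# jar_contains_class(
#     "/home/ec2-user/jars/aws-glue-datacatalog-hive-client.jar",
#     "AWSGlueDataCatalogHiveClientFactory",
# )
```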
We ended up using an EMR backend for running Spark on SageMaker as a workaround, but I'll try your solution and report back. Appreciate the follow-up!
Hello, since this issue is still open, [...]. Thanks
I am also interested in seeing a solution for using the Glue Catalog from SageMaker without using EMR.
Is there any way we can bump the priority on this? It would be really nice to use the Glue Data Catalog from SM notebooks.
Is this available as a feature now?
For visibility: you can now run Glue interactive sessions directly from a SageMaker Studio notebook.
@joaopcm1996 Can we run Glue interactive sessions from SM notebooks without using SM Studio? Or, as per the original request, is there a way to read Glue catalog data from a SM notebook? I see that there was a missing-jar problem above. Was anyone able to get this to work?
Hi, can we configure a SageMaker PySparkProcessor to use the Glue Data Catalog as the metastore for Hive, or can we use Glue interactive sessions with this processor?
Did anybody manage to make a SageMaker instance work with PySpark and the Glue Data Catalog? Send help.
System Information
Describe the problem
I'm following the instructions proposed here to connect a local Spark session running in a notebook in SageMaker to the Glue Data Catalog of my account.
I know this is doable via EMR, but I'd like to do the same using a SageMaker notebook (or any other kind of separate Spark installation).
Minimal repo / logs
Below is the current code that runs in the notebook, but it doesn't actually work: [...]
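The code block itself was not captured in this thread. As a rough stand-in, the cited guide's approach amounts to pointing Hive at Glue's metastore client factory when building a Hive-enabled session; the factory class name below is an assumption to verify against whichever jar ends up on the classpath (see the discussion of missing jars above).

```python
# Assumed fully-qualified name of the Glue Hive metastore client factory;
# verify it against the jar you actually use.
FACTORY = (
    "com.amazonaws.glue.catalog.metastore."
    "AWSGlueDataCatalogHiveClientFactory"
)

GLUE_SESSION_CONF = {
    # Tell Hive to create its metastore client via the Glue factory.
    "hive.metastore.client.factory.class": FACTORY,
}

# Wiring it into a session requires pyspark (and the Glue client jar on the
# classpath), so that part is shown as a comment:
#
# from pyspark.sql import SparkSession
#
# builder = SparkSession.builder.appName("glue-catalog").enableHiveSupport()
# for key, value in GLUE_SESSION_CONF.items():
#     builder = builder.config(key, value)
# spark = builder.getOrCreate()
# spark.sql("SHOW DATABASES").show()  # Glue databases should appear if wired up
```

A quick sanity check is that `SHOW DATABASES` should list databases defined in the Glue Data Catalog rather than only the local `default` database; if only `default` appears, Spark has silently fallen back to the local catalog, which matches the behavior reported in this thread.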