Mirror of https://github.com/fsouza/fake-gcs-server.git (synced 2026-04-26 22:25:50 +03:00)
[GH-ISSUE #811] Connecting to emulator through PySpark #132
Originally created by @TxS-7 on GitHub (Jun 3, 2022).
Original GitHub issue: https://github.com/fsouza/fake-gcs-server/issues/811
Hello,
I've been trying to use the Fake GCS server from inside PySpark using something like this:
```python
csv_file = 'gs://bucket/file.csv'
df = spark_session.read.format('csv').load(csv_file)
```

I've also added the JAR gcs-connector-hadoop3-latest.jar because the `gs` file scheme could not be recognized without it.
When running the above code, the program hangs. Do you have any idea as to what the problem could be?
I'm starting the server using the following command:
```shell
/fake-gcs-server -host 127.0.0.1 -scheme http
```
From inside Python I'm using the following to connect to the fake GCS server:
```python
os.environ['STORAGE_EMULATOR_HOST'] = 'http://localhost:4443/'
```

Thanks!
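For completeness, a minimal sketch of the environment-variable setup described in this report (the host and port match the server command above; the google-cloud-storage usage is shown only in comments, since it requires that package to be installed):

```python
import os

# Point Google Cloud client libraries at the local fake-gcs-server.
# This must be set before any storage client is created.
os.environ["STORAGE_EMULATOR_HOST"] = "http://localhost:4443/"

# With the variable set, the Python client (if installed) targets the
# emulator instead of the real GCS API, e.g.:
#   from google.cloud import storage
#   client = storage.Client()  # requests go to http://localhost:4443/

print(os.environ["STORAGE_EMULATOR_HOST"])  # http://localhost:4443/
```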
@fsouza commented on GitHub (Jun 3, 2022):
I'm not familiar with PySpark. Is STORAGE_EMULATOR_HOST a documented environment variable for that tool, or something you just tried? Do you have a link to the documentation on communicating with GCS? We can check whether there are any settings we could set.
@TxS-7 commented on GitHub (Jun 3, 2022):
Thank you for your quick reply.
STORAGE_EMULATOR_HOST is an environment variable used by Python's Google Cloud client library. You can use it to point the client at an emulated GCS server.
If I try to connect to the fake GCS server from Python's Google Cloud Storage API it works fine, but when I try to read a file through PySpark, it stops responding.
I couldn't find any documentation regarding connecting PySpark to an emulated GCS server unfortunately.
@fsouza commented on GitHub (Jun 4, 2022):
@TxS-7 do you have any docs for the PySpark integration with GCS? Worst case we can dig into the code I guess heh
@TxS-7 commented on GitHub (Jun 6, 2022):
Apparently Spark was connecting to Google's servers, and I had to set a configuration parameter on the Spark session so that it connects to the fake GCS server instead:

```python
spark_session._jsc.hadoopConfiguration().set("fs.gs.storage.root.url", "http://localhost:4443/")
```

Thank you for your help!
@dht7 commented on GitHub (Jan 23, 2023):
@TxS-7 Can you please share an example where you implemented a Spark application with fake-gcs-server? I'm facing a similar issue with my setup, wherein Spark is connecting to Google's servers. I have set the root URL to my localhost as you mentioned, but the client is still hitting Google's servers for an access token (even when fs.gs.auth.type is set to UNAUTHENTICATED). Not sure if I am missing any configs, so it would help if you could share an example.
@TxS-7 commented on GitHub (Jan 30, 2023):
Hello @dht7, it's been some time since I used it, so I don't remember it that well. I ended up using version 2.2.6 of the Hadoop 3 GCS connector, and I also set the following parameters on the Spark context:

```python
spark.sparkContext._jsc.hadoopConfiguration().set('fs.gs.storage.root.url', 'http://localhost:4443/')
spark.sparkContext._jsc.hadoopConfiguration().set('fs.gs.auth.null.enable', 'true')
spark.sparkContext._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'false')
```

I also set the environment variable STORAGE_EMULATOR_HOST. Hope this helps!
@dht7 commented on GitHub (Jan 31, 2023):
Thank you @TxS-7! The 'fs.gs.auth.null.enable' configuration was missing in my app, hence the issue.
Documenting below all the configs required to set up Spark with fake-gcs-server:
```
spark.hadoop.fs.gs.storage.root.url: http://localhost:4443/
spark.hadoop.fs.gs.auth.service.account.enable: false
spark.hadoop.fs.gs.auth.null.enable: true
spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
spark.hadoop.fs.gs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
```
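The same five properties can also be applied at session-build time. A minimal sketch, assuming pyspark and the GCS connector are on the classpath and fake-gcs-server is listening on localhost:4443; the `with_fake_gcs` helper is illustrative, not part of any library (the key/value pairs themselves are taken verbatim from the thread):

```python
# Build-time Spark configuration for fake-gcs-server, as documented above.
FAKE_GCS_SPARK_CONF = {
    "spark.hadoop.fs.gs.storage.root.url": "http://localhost:4443/",
    "spark.hadoop.fs.gs.auth.service.account.enable": "false",
    "spark.hadoop.fs.gs.auth.null.enable": "true",
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.hadoop.fs.gs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
}

def with_fake_gcs(builder, conf=FAKE_GCS_SPARK_CONF):
    """Apply each property via the builder's .config(key, value) method."""
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder

# Real usage (requires pyspark):
#   from pyspark.sql import SparkSession
#   spark = with_fake_gcs(SparkSession.builder.appName("fake-gcs")).getOrCreate()
```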