[GH-ISSUE #811] Connecting to emulator through PySpark #132

Closed
opened 2026-03-03 12:08:35 +03:00 by kerem · 7 comments
Owner

Originally created by @TxS-7 on GitHub (Jun 3, 2022).
Original GitHub issue: https://github.com/fsouza/fake-gcs-server/issues/811

Hello,

I've been trying to use the Fake GCS server from inside PySpark using something like this:
csv_file = 'gs://bucket/file.csv'
df = spark_session.read.format('csv').load(csv_file)

I've also added the JAR gcs-connector-hadoop3-latest.jar, because otherwise the gs:// scheme could not be recognized.

When running the above code, the program hangs. Do you have any idea as to what the problem could be?

I'm starting the server using the following command:
/fake-gcs-server -host 127.0.0.1 -scheme http

From inside Python I'm using the following to connect to the fake GCS server:
os.environ['STORAGE_EMULATOR_HOST'] = 'http://localhost:4443/'

Thanks

kerem 2026-03-03 12:08:35 +03:00
  • closed this issue
  • added the question label
Author
Owner

@fsouza commented on GitHub (Jun 3, 2022):

I'm not familiar with PySpark. Is STORAGE_EMULATOR_HOST a documented environment variable for that tool, or something you just tried? Do you have a link to the documentation on communicating with GCS? We can check if there are any settings we could set.

Author
Owner

@TxS-7 commented on GitHub (Jun 3, 2022):

Thank you for your quick reply.

STORAGE_EMULATOR_HOST is an environment variable used by Python's Google Cloud Storage client library. You can use it to connect to an emulated GCS server.

If I try to connect to the fake GCS server from Python's Google Cloud Storage API it works fine, but when I try to read a file through PySpark, it stops responding.

I couldn't find any documentation regarding connecting PySpark to an emulated GCS server unfortunately.

Author
Owner

@fsouza commented on GitHub (Jun 4, 2022):

@TxS-7 do you have any docs to the PySpark integration with GCS? Worst case we can dig into the code I guess heh

Author
Owner

@TxS-7 commented on GitHub (Jun 6, 2022):

Apparently Spark was connecting to Google's real servers, and I had to set a configuration parameter on the Spark session so that it connects to the fake GCS server instead:
spark_session._jsc.hadoopConfiguration().set("fs.gs.storage.root.url", "http://localhost:4443/")

Thank you for your help!
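[Editor's note] Before pointing Spark at the emulator, it can help to confirm the emulator is actually reachable at that root URL. A minimal stdlib-only sketch, assuming fake-gcs-server's default port 4443 and its JSON API living under /storage/v1 (the helper names here are illustrative, not part of any library):

```python
import urllib.request
import urllib.error

EMULATOR = "http://localhost:4443"

def list_buckets_url(emulator_host: str) -> str:
    # fake-gcs-server serves the GCS JSON API under /storage/v1,
    # so listing buckets goes to <host>/storage/v1/b.
    return f"{emulator_host.rstrip('/')}/storage/v1/b"

def emulator_reachable(emulator_host: str, timeout: float = 2.0) -> bool:
    """Return True if the emulator answers the bucket-list endpoint."""
    try:
        with urllib.request.urlopen(list_buckets_url(emulator_host), timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False

print(list_buckets_url(EMULATOR))  # http://localhost:4443/storage/v1/b
```

If emulator_reachable returns False, the hang is more likely a networking or startup problem than a Spark configuration one.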

Author
Owner

@dht7 commented on GitHub (Jan 23, 2023):

@TxS-7 Can you please share an example where you implemented a Spark application with fake-gcs-server? I'm facing a similar issue with my setup, wherein Spark is connecting to Google's servers. I have set the root URL to my localhost as you mentioned, but the client is still hitting Google's servers for an access token (even when fs.gs.auth.type is set to UNAUTHENTICATED). Not sure if I am missing any configs, so it would help if you could share an example.

Author
Owner

@TxS-7 commented on GitHub (Jan 30, 2023):

Hello @dht7, it's been some time since I used it so I don't remember that well. I ended up using Hadoop 3 GCS connector version 2.2.6 and I also set the following parameters in the Spark context:
spark.sparkContext._jsc.hadoopConfiguration().set('fs.gs.storage.root.url', 'http://localhost:4443/')
spark.sparkContext._jsc.hadoopConfiguration().set('fs.gs.auth.null.enable', 'true')
spark.sparkContext._jsc.hadoopConfiguration().set('fs.gs.auth.service.account.enable', 'false')

I also set the environment variable STORAGE_EMULATOR_HOST.

Hope this helps!

Author
Owner

@dht7 commented on GitHub (Jan 31, 2023):

Thank you @TxS-7! The 'fs.gs.auth.null.enable' configuration was missing in my app, hence the issue.
Documenting below all the configs required to set up Spark with fake-gcs-server:

  • spark.hadoop.fs.gs.storage.root.url: http://localhost:4443/
  • spark.hadoop.fs.gs.auth.service.account.enable: false
  • spark.hadoop.fs.gs.auth.null.enable: true
  • spark.hadoop.fs.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem
  • spark.hadoop.fs.gs.AbstractFileSystem.gs.impl: com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS
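[Editor's note] The checklist above maps directly onto spark-submit --conf flags. A small stdlib-only sketch that renders them, using the property keys and values exactly as documented in this thread (the mapping name and helper function are illustrative):

```python
# The five properties from the checklist above, as one mapping.
FAKE_GCS_SPARK_CONF = {
    "spark.hadoop.fs.gs.storage.root.url": "http://localhost:4443/",
    "spark.hadoop.fs.gs.auth.service.account.enable": "false",
    "spark.hadoop.fs.gs.auth.null.enable": "true",
    "spark.hadoop.fs.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem",
    "spark.hadoop.fs.gs.AbstractFileSystem.gs.impl": "com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS",
}

def as_submit_flags(conf: dict) -> list:
    """Render the mapping as spark-submit --conf arguments."""
    return [f"--conf {key}={value}" for key, value in conf.items()]

for flag in as_submit_flags(FAKE_GCS_SPARK_CONF):
    print(flag)
```

The same mapping can equally be applied in code via SparkSession.builder.config(key, value) for each pair before calling getOrCreate().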