[GH-ISSUE #419] String format of text files larger than 1400 bytes get mangled #82

Open
opened 2026-03-03 12:08:06 +03:00 by kerem · 2 comments

Originally created by @ChrisWhealy on GitHub (Feb 11, 2021).
Original GitHub issue: https://github.com/fsouza/fake-gcs-server/issues/419

In Kotlin, if I stream a text file larger than 1400 bytes, the string data becomes mangled and is no longer in US-ASCII format. If the file is up to 1400 bytes long, everything works. If the file is 1401 bytes or more, the data is mangled.
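A minimal sketch for probing the boundary described above: it builds ASCII CSV text of an exact byte length, so that one object just under and one just over the threshold can be compared. The helper name `csvOfSize` and the row content are hypothetical and not part of the original report.

```kotlin
// Hypothetical helper: builds ASCII CSV text that is exactly `sizeInBytes` long.
fun csvOfSize(sizeInBytes: Int): String {
    require(sizeInBytes >= 0) { "sizeInBytes must not be negative" }
    val row = "id,name,value\n"
    val fullRows = row.repeat(sizeInBytes / row.length)
    // Pad the remainder with a partial row so the total length is exact
    return fullRows + row.take(sizeInBytes % row.length)
}

fun main() {
    for (size in listOf(1400, 1401)) {
        val payload = csvOfSize(size)
        // For pure ASCII text, the UTF-8 byte count equals the character count
        println("$size -> ${payload.toByteArray(Charsets.UTF_8).size} bytes")
    }
}
```

Each payload could then be uploaded with the `write` method of the `ObjectStore` class from the report and read back through `streamReader` to see which size triggers the mangling.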

Using version `latest` with the test code shown below, the variable `bufferAsString` receives:

���������R]o�@|����^���<"�4�R��UUe-�B�:��>��﫳��@P!R%�^��nf���%�10�uRqM��ϯ�~�s���A] $�2�=9R�F�V� �h1���4�rI܊-}�j���ЭI ��%��V����VZ����s�j�ា$c��G0�ąi�d�8�I�T-�a=�0�E�z�5Sm�ZQRm4����iN������N�_ �faL��6�)J����;44����ŴD'm_�g��

If I switch back to version `1.21.2`, it works fine.

Create a test storage container:

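import com.google.cloud.ReadChannel
import com.google.cloud.storage.BlobId
import com.google.cloud.storage.Storage
import com.google.cloud.storage.StorageOptions
import java.time.Instant
import org.testcontainers.containers.BindMode
import org.testcontainers.containers.GenericContainer

//private val fakeContainerVersion = "1.21.2"
private val fakeContainerVersion = "latest"

// Note: fakeContainerName (used below) is not defined in this snippet
private const val testContainerResourcePath = "/data"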

object StorageTestContainer {
  val storage: Storage by lazy {
    println("Creating test storage container using '$fakeContainerName'")

    val fakeStorageContainer = TestGenericContainer("fsouza/fake-gcs-server:$fakeContainerVersion")
      .apply {
        withExposedPorts(4443)
        withClasspathResourceMapping("data", testContainerResourcePath, BindMode.READ_WRITE)
        withCreateContainerCmdModifier { cmd ->
          cmd.withEntrypoint("/bin/$fakeContainerName", "-data", testContainerResourcePath, "-scheme", "http")
        }
        start()
      }

    StorageOptions
      .newBuilder()
      .setHost("http://${fakeStorageContainer.host}:${fakeStorageContainer.firstMappedPort}")
      .build()
      .service
  }
}

class TestGenericContainer(imageName: String) : GenericContainer<TestGenericContainer>(imageName)

class ObjectStore(
  private val storage: Storage,
  private val bucketName: String
) {
  constructor(projectID: String, bucketName: String) : this(
    StorageOptions.newBuilder().setProjectId(projectID).build().service,
    bucketName
  )

  fun lastUpdateTimestamp(objectName: String): Instant {
    val blob = storage.get(bucketName, objectName, Storage.BlobGetOption.fields(Storage.BlobField.UPDATED))
      ?: throw IllegalStateException("Object '$objectName' does not exist in bucket '$bucketName'")

    return Instant.ofEpochMilli(blob.updateTime)
  }

  fun load(objectName: String): String {
    val blob = storage.get(BlobId.of(bucketName, objectName))
      ?: throw IllegalStateException("Object '$objectName' does not exist in bucket '$bucketName'")

    return String(blob.getContent())
  }

  fun write(objectName: String, objectContent: String) =
    storage.get(bucketName).create(objectName, objectContent.toByteArray(Charsets.UTF_8))

  fun streamReader(objectName: String): ReadChannel = storage.reader(bucketName, objectName)
}

The test function is then:

  private lateinit var underTest: ObjectStore
  private val BUCKET_NAME = "raw_csv"

  @BeforeEach
  fun beforeEach() {
    underTest = ObjectStore(StorageTestContainer.storage, BUCKET_NAME)
  }

  @Test
  fun `should read CSV file via stream`() {
    val byteBuffer        = ByteBuffer.allocate(256)
    val streamReadChannel = underTest.streamReader("test.csv")

    var bytesInBuffer = streamReadChannel.read(byteBuffer)
    var bufferAsString = ""
    var nextCrlf = 0

    // Process the buffer
    while (bytesInBuffer > 0) {
      // Append the new buffer to anything that might be left over from processing the previous buffer
      bufferAsString = bufferAsString + String(byteBuffer.array(), 0, bytesInBuffer)
      nextCrlf = bufferAsString.indexOf(lineSeparator())

      // Process each line in the buffer
      while (nextCrlf > -1) {
        val csvLine = bufferAsString.take(nextCrlf)

        // Do stuff...

        // Shrink buffer, skipping the whole separator (two characters on Windows)
        bufferAsString = bufferAsString.drop(nextCrlf + lineSeparator().length)
        nextCrlf = bufferAsString.indexOf(lineSeparator())
      }

      // Reset position and limit so the next read can fill the whole buffer;
      // flip() would cap the next read at the number of bytes just read
      byteBuffer.clear()
      bytesInBuffer = streamReadChannel.read(byteBuffer)
    }

    // Flush last line from buffer (needed if the input file does not end with a CRLF)
    outputLines.add(lineWriterFn(bufferAsString))

    // Ensure there's a carriage return on the last line
    outputLines.add("")
    outputFile.writeText(outputLines.joinToString(lineSeparator()))
  }
}
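The manual buffer handling in the test above could also be delegated to the standard library. The sketch below assumes only that the channel is a `java.nio.channels.ReadableByteChannel` (which `ReadChannel` is); the helper name `readLines` and the in-memory stand-in channel are hypothetical. It would not by itself fix the mangling if the server returns bytes that are not plain text.

```kotlin
import java.io.ByteArrayInputStream
import java.nio.channels.Channels
import java.nio.channels.ReadableByteChannel

// Hypothetical helper: reads every line from a channel, decoding the bytes as UTF-8.
// The reader (and with it the channel) is closed when the block finishes.
fun readLines(channel: ReadableByteChannel): List<String> =
    Channels.newInputStream(channel)
        .bufferedReader(Charsets.UTF_8)
        .use { reader -> reader.readLines() }

fun main() {
    // Stand-in for storage.reader(bucketName, objectName): an in-memory channel
    val data = "a,b,c\n1,2,3\n4,5,6".toByteArray(Charsets.UTF_8)
    val channel = Channels.newChannel(ByteArrayInputStream(data))
    readLines(channel).forEach(::println)
}
```

Decoding through a single reader also avoids splitting a multi-byte character across two buffer reads, which per-chunk `String(...)` conversion can do for non-ASCII input.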

@fsouza commented on GitHub (Feb 11, 2021):

Oh interesting, thanks for reporting. I assume it has to do with gzip. Will investigate.
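One way to test the gzip theory from the client side is to check whether the received bytes start with the gzip magic number and, if so, decompress them. This is only a diagnostic sketch; the helper names `looksGzipped` and `gunzipIfNeeded` are hypothetical, and the compressed payload in `main` merely simulates a response body.

```kotlin
import java.io.ByteArrayInputStream
import java.io.ByteArrayOutputStream
import java.util.zip.GZIPInputStream
import java.util.zip.GZIPOutputStream

// A gzip stream starts with the two magic bytes 0x1f 0x8b
fun looksGzipped(bytes: ByteArray): Boolean =
    bytes.size >= 2 && bytes[0] == 0x1f.toByte() && bytes[1] == 0x8b.toByte()

// Decompresses gzip data; returns the input unchanged if it does not look gzipped
fun gunzipIfNeeded(bytes: ByteArray): ByteArray =
    if (looksGzipped(bytes)) GZIPInputStream(ByteArrayInputStream(bytes)).use { it.readBytes() }
    else bytes

fun main() {
    val original = "id,name\n1,example\n"
    // Simulate a compressed response body
    val compressed = ByteArrayOutputStream().also { out ->
        GZIPOutputStream(out).use { it.write(original.toByteArray(Charsets.UTF_8)) }
    }.toByteArray()

    println(looksGzipped(compressed))
    println(String(gunzipIfNeeded(compressed), Charsets.UTF_8) == original)
}
```

If the bytes collected from the read channel pass the `looksGzipped` check, that would support the assumption that the response is being delivered compressed.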


@fsouza commented on GitHub (Feb 15, 2021):

Hi @ChrisWhealy, I don't know much about Kotlin, but I assume using the Java sample provided by @dnatic09 in #142 should be equivalent here? (just need to write more data and try to read it)

Either way, I think this is related to gzipped responses, which is something we don't really have to support. Can you try your code with #426? You should be able to check out that PR locally, build it with `docker build -t fsouza/fake-gcs-server:some-tag .`, and then use `some-tag` in your `TestGenericContainer`.
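In terms of the reporter's test setup, trying the locally built image would presumably come down to changing the version constant. The fragment below is a sketch that reuses the `fakeContainerVersion` name from the report and assumes the image was tagged `some-tag` as suggested.

```kotlin
// Assumes the PR branch was built locally with:
//   docker build -t fsouza/fake-gcs-server:some-tag .
//private val fakeContainerVersion = "latest"
private val fakeContainerVersion = "some-tag"
```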
