How to avoid aggregate(shuffle) in processing the tfrecord file? #201

mathetian · 2022-12-24T06:29:34Z

I have a very large directory of TFRecord files and need to filter it on a column to generate new TFRecord files.

The code looks like this:

When I run it on a Spark cluster, I find that it runs in two steps.

I checked the code at https://github.com/tensorflow/ecosystem/blob/master/spark/spark-tensorflow-connector/src/main/scala/org/tensorflow/spark/datasources/tfrecords/TensorFlowInferSchema.scala#L39, and it has an aggregate step!

Can I avoid it?
