Problem
This is a combination of two problems we have:
We use managed AWS OpenSearch, which enforces a maximum HTTP request payload size per instance type: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/limits.html#network-limits . For example, an m6g.large.search instance is sufficient for our workload, but it has a 10 MiB limit. If a batch exceeds this limit, indexing fails, and we either have to buy a bigger instance, which significantly increases cloud cost, or decrease the batch size drastically to fit under the limit.
Our products vary widely in indexed data size, from 5 KiB to 0.5 MiB, so a batch of 100 products can easily exceed the limit. Because of this uneven spread, a batch of 100 documents can range from about 1 MiB to 50 MiB. Decreasing the batch size hurts indexing performance, so keeping small batches doesn't make sense for our dataset (which is large: more than 1M products).
Solution
Add the ability to limit the batch data size, not only the batch row count. We are currently achieving this by applying the following patch:
--- a/src/module-elasticsuite-core/Indexer/GenericIndexerHandler.php
+++ b/src/module-elasticsuite-core/Indexer/GenericIndexerHandler.php
@@ -101,6 +101,7 @@
      */
     public function saveIndex($dimensions, \Traversable $documents)
     {
+        $maxBatchDataSize = $this->indexSettings->getBatchIndexingDataSize();
         foreach ($dimensions as $dimension) {
             $storeId = $dimension->getValue();
@@ -120,8 +121,15 @@
             }
             if (!empty($batchDocuments)) {
-                $bulk = $this->indexOperation->createBulk()->addDocuments($index, $batchDocuments);
-                $this->indexOperation->executeBulk($bulk);
+                if ($maxBatchDataSize !== null) {
+                    foreach (self::splitBatchByDataSize($batchDocuments, $maxBatchDataSize) as $subBatch) {
+                        $bulk = $this->indexOperation->createBulk()->addDocuments($index, $subBatch);
+                        $this->indexOperation->executeBulk($bulk);
+                    }
+                } else {
+                    $bulk = $this->indexOperation->createBulk()->addDocuments($index, $batchDocuments);
+                    $this->indexOperation->executeBulk($bulk);
+                }
             }
         }
@@ -132,6 +140,48 @@
         return $this;
     }
+
+    private static function splitBatchByDataSize(&$batch, int $maxBatchDataSize): array
+    {
+        // Measure the size of every batch in JSON format and split until every batch is less than or equal to the max batch data size.
+        $batches = [$batch];
+        $loopCount = 0;
+        for ($i = 0; $i < count($batches);) {
+            $subBatch = $batches[$i];
+            $jsonSize = strlen(json_encode($subBatch));
+            if ($jsonSize > $maxBatchDataSize) {
+                // If the batch is too big, split it into two, replace the current one, append the second,
+                // and run the loop again on the same index to split further if needed.
+                $twoBatches = self::splitBatch($subBatch);
+                $batches[$i] = $twoBatches[0];
+                $batches[] = $twoBatches[1];
+                $loopCount++;
+                if ($loopCount > 100) {
+                    throw new \RuntimeException('Batch split loop limit reached');
+                }
+            } else {
+                $i++;
+                $loopCount = 0;
+            }
+        }
+
+        return $batches;
+    }
+
+    private static function splitBatch(array &$batch): array
+    {
+        if (count($batch) == 1) {
+            throw new \RuntimeException('Batch split failed. Batch size is 1');
+        }
+        $result = array_chunk($batch, (int) floor(count($batch) / 2));
+        if (count($result) > 2) {
+            $result[1] = array_merge($result[1], $result[2]);
+            unset($result[2]);
+        }
+
+        return $result;
+    }
+
     /**
      * {@inheritDoc}
--- a/src/module-elasticsuite-core/Helper/IndexSettings.php
+++ b/src/module-elasticsuite-core/Helper/IndexSettings.php
@@ -193,6 +193,15 @@
     }
     /**
+     * Get the max batch indexing data size from the configuration.
+     */
+    public function getBatchIndexingDataSize(): ?int
+    {
+        $value = $this->getIndicesSettingsConfigParam('batch_indexing_data_size');
+        return $value ? (int) $value : null;
+    }
+
+    /**
      * Get the indices pattern from the configuration.
      *
      * @return string
--- a/src/module-elasticsuite-core/Index/IndexSettings.php
+++ b/src/module-elasticsuite-core/Index/IndexSettings.php
@@ -188,6 +188,11 @@
         return $this->helper->getBatchIndexingSize();
     }
+    public function getBatchIndexingDataSize(): ?int
+    {
+        return $this->helper->getBatchIndexingDataSize();
+    }
+
     /**
      * {@inheritDoc}
      */
--- a/src/module-elasticsuite-core/Api/Index/IndexSettingsInterface.php
+++ b/src/module-elasticsuite-core/Api/Index/IndexSettingsInterface.php
@@ -90,6 +90,11 @@
     public function getBatchIndexingSize();
     /**
+     * Get the maximum batch data size for indexing.
+     */
+    public function getBatchIndexingDataSize(): ?int;
+
+    /**
      * Get dynamic index settings per store (language).
      *
      * @param integer|string|\Magento\Store\Api\Data\StoreInterface $store Store.
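As a quick standalone illustration of how the splitting helper introduced by the patch behaves on unevenly sized documents, here is a sketch that copies the two methods as plain functions. The sample documents and the 10 KiB limit are invented for the example; they are not part of the patch.

<?php
// Standalone illustration of the splitting logic introduced by the patch above.
// The sample documents and the 10 KiB limit are invented for this example.

function splitBatch(array $batch): array
{
    if (count($batch) === 1) {
        throw new \RuntimeException('Batch split failed. Batch size is 1');
    }
    $result = array_chunk($batch, (int) floor(count($batch) / 2));
    if (count($result) > 2) {
        // An odd row count yields three chunks: merge the last two.
        $result[1] = array_merge($result[1], $result[2]);
        unset($result[2]);
    }

    return $result;
}

function splitBatchByDataSize(array $batch, int $maxBatchDataSize): array
{
    $batches = [$batch];
    $loopCount = 0;
    for ($i = 0; $i < count($batches);) {
        if (strlen(json_encode($batches[$i])) > $maxBatchDataSize) {
            // Too big: split in two, keep the first half at the current index and retry it.
            $twoBatches = splitBatch($batches[$i]);
            $batches[$i] = $twoBatches[0];
            $batches[] = $twoBatches[1];
            if (++$loopCount > 100) {
                throw new \RuntimeException('Batch split loop limit reached');
            }
        } else {
            $i++;
            $loopCount = 0;
        }
    }

    return $batches;
}

// 100 documents whose payloads range from roughly 0.1 KiB to 10 KiB.
$documents = [];
for ($i = 0; $i < 100; $i++) {
    $documents[] = ['sku' => 'SKU-' . $i, 'description' => str_repeat('x', 100 + $i * 100)];
}

foreach (splitBatchByDataSize($documents, 10 * 1024) as $n => $subBatch) {
    printf("sub-batch %d: %d documents, %d bytes\n", $n, count($subBatch), strlen(json_encode($subBatch)));
}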
The patch measures the data size of the batch to be indexed and recursively splits it until every sub-batch is no larger than the limit. The algorithm is probably not ideal, but it was developed in limited time. Since the data size can only be determined by encoding the batch as JSON, I tried to minimize the number of json_encode calls so as not to hurt performance. I also haven't run enough performance tests to determine whether it would be feasible to measure every row individually and pack batches so that the request size limit is used more efficiently; a rough sketch of that idea follows below. I think it would be nice to implement such functionality in the package.
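For reference, the "measure every row" approach mentioned above could look roughly like the following. This is only a hedged sketch, not part of the patch: the function name and variables are invented, it encodes each document once, and it packs documents greedily until the next one would push the batch over the limit.

<?php
/**
 * Hypothetical alternative to the recursive split: encode each document once and
 * pack documents greedily so every batch stays under $maxBatchDataSize bytes.
 *
 * @param array $documents        Documents to index.
 * @param int   $maxBatchDataSize Maximum payload size per batch, in bytes.
 *
 * @return array[] Batches whose encoded size stays under the limit (a single
 *                 oversized document still gets its own batch).
 */
function packByDataSize(array $documents, int $maxBatchDataSize): array
{
    $batches = [];
    $currentBatch = [];
    $currentSize = 0;

    foreach ($documents as $id => $document) {
        // One json_encode per row; a rough proxy for the bulk request payload size.
        $rowSize = strlen(json_encode($document));

        if (!empty($currentBatch) && ($currentSize + $rowSize) > $maxBatchDataSize) {
            $batches[] = $currentBatch;
            $currentBatch = [];
            $currentSize = 0;
        }

        $currentBatch[$id] = $document;
        $currentSize += $rowSize;
    }

    if (!empty($currentBatch)) {
        $batches[] = $currentBatch;
    }

    return $batches;
}

Compared to the recursive split, this calls json_encode once per document instead of once per sub-batch attempt, so whether it is actually faster would need the performance tests mentioned above. It also ignores the per-row bulk action metadata, so in practice a safety margin below the HTTP limit would be needed.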