Add version information to support external indexes #2547

Open
CrazyBeeline opened this issue May 31, 2024 · 6 comments
Labels
feature New feature

Comments

@CrazyBeeline

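Feature Description

HugeGraph's indexing is currently fairly limited, so we want to use an external database (for example Elasticsearch) to run complex queries. Typically we write the data to the graph first; the HugeGraph API is very friendly and returns the latest graph data, which we then write to ES. The problem is that when multiple threads write to HugeGraph concurrently, the data in HugeGraph and the data in ES can become inconsistent: HugeGraph guarantees consistency across threads, but the writes to ES have no such guarantee. If HugeGraph exposed version information for its data, this problem could be solved. The version information has two parts: a creation version and an update version. The creation version is set when the element is created and does not change on update; the update version changes on every update. With this version information, the ES inconsistency problem can be handled cleanly.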
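
To make the request concrete, here is a minimal sketch of how an update version could guard writes to the external index. It assumes HugeGraph would return a monotonically increasing update version with each write, which is exactly the feature being requested and does not exist today. `VersionedIndex` is a hypothetical in-memory stand-in for ES, not a HugeGraph or Elasticsearch API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for an external index such as ES.
final class VersionedIndex {

    // A document plus the update version reported by the graph (assumed field).
    private static final class Doc {
        final long updateVersion;
        final String body;

        Doc(long updateVersion, String body) {
            this.updateVersion = updateVersion;
            this.body = body;
        }
    }

    private final Map<String, Doc> docs = new ConcurrentHashMap<>();

    // Apply the write only if it is the first one for this id or carries a
    // strictly newer update version. Returns false when the write was stale.
    boolean upsert(String id, long updateVersion, String body) {
        Doc incoming = new Doc(updateVersion, body);
        // merge() is atomic per key, so concurrent writers cannot interleave here.
        Doc stored = docs.merge(id, incoming, (current, next) ->
                next.updateVersion > current.updateVersion ? next : current);
        return stored == incoming;
    }

    String get(String id) {
        Doc doc = docs.get(id);
        return doc == null ? null : doc.body;
    }
}
```

With this guard, a thread that arrives late with update version 1 cannot overwrite data already indexed at version 2. Elasticsearch offers a comparable check through external versioning on index requests (`version_type=external`), so the version returned by the graph could be passed along instead of being compared on the client.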

@CrazyBeeline CrazyBeeline added the feature New feature label May 31, 2024
@CrazyBeeline
Author

@imbajin

@imbajin
Member

imbajin commented Jun 27, 2024

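Personally, I think you could follow the common industry pattern of asynchronous index writes plus a soft (transactional) compensation mechanism:

  1. Write the primary table (the graph) first, then write to ES asynchronously. Once the primary write has succeeded and the asynchronous request has been sent, the write is treated as successful, so write throughput is not affected.
  2. Consistency is ensured mainly through compensation: if the asynchronous ES write returns success, carry on; if it fails, a separate daemon/checker background thread retries the submission without blocking other writes.
  3. For the full data set (records that are already inconsistent but, for whatever reason, were never repaired), or for an overall consistency check, periodically take the DB files in bulkload fashion and run a full scan to align the primary and secondary tables (via spark/flink-connector), ensuring eventual consistency.

That is the general idea. @javeme @zyxxoo @simon824 @JackyYangPassion @VGalaxies @liuxiaocs7 @dosu, feel free to add to it.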
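
Steps 1 and 2 above could be sketched roughly as follows. This is only an illustration under assumptions: `AsyncIndexSync`, `onGraphCommitted`, and `retryPending` are hypothetical names, and the ES client is reduced to a `BiPredicate` that returns true on success. None of this is HugeGraph or Elasticsearch API.

```java
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.function.BiPredicate;

// Hypothetical sketch of asynchronous index writes with retry-based compensation.
final class AsyncIndexSync implements AutoCloseable {

    private static final class Pending {
        final String id;
        final String body;

        Pending(String id, String body) {
            this.id = id;
            this.body = body;
        }
    }

    private final BiPredicate<String, String> indexWriter; // true means the write succeeded
    private final Queue<Pending> retryQueue = new ConcurrentLinkedQueue<>();
    // Daemon thread, so a forgotten close() does not keep the JVM alive.
    private final ExecutorService executor = Executors.newSingleThreadExecutor(r -> {
        Thread t = new Thread(r, "index-sync");
        t.setDaemon(true);
        return t;
    });

    AsyncIndexSync(BiPredicate<String, String> indexWriter) {
        this.indexWriter = indexWriter;
    }

    // Step 1: call after the graph write has committed. The caller does not wait
    // for the index; a failed write is parked in the retry queue instead.
    CompletableFuture<Void> onGraphCommitted(String id, String body) {
        return CompletableFuture.runAsync(() -> {
            if (!safeWrite(id, body)) {
                retryQueue.add(new Pending(id, body));
            }
        }, executor);
    }

    // Step 2: run periodically from a background checker. Each parked write is
    // retried once per call; returns how many are still pending afterwards.
    int retryPending() {
        int attempts = retryQueue.size();
        for (int i = 0; i < attempts; i++) {
            Pending p = retryQueue.poll();
            if (p == null) {
                break;
            }
            if (!safeWrite(p.id, p.body)) {
                retryQueue.add(p);
            }
        }
        return retryQueue.size();
    }

    private boolean safeWrite(String id, String body) {
        try {
            return indexWriter.test(id, body);
        } catch (RuntimeException e) {
            return false; // treat an exception like a failed write
        }
    }

    @Override
    public void close() {
        executor.shutdown();
    }
}
```

Note that a retried write can land after a newer write for the same id, so retries alone do not settle ordering. That is where the version information requested in this issue would still be useful.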

@JackyYangPassion
Contributor

JackyYangPassion commented Jul 1, 2024

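I have recently been working on cache consistency across multi-node Servers. Based on the problem described in this issue, I suggest abstracting ES as a special kind of cache that extends and implements the AbstractCache abstract class.

That way, the data consistency mentioned in the issue is achieved by reusing HugeServer's own write transaction.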

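import java.util.Iterator;
import java.util.function.Consumer;

// AbstractCache, CacheNode and Id are HugeGraph types; their imports are omitted here.
public class ESCache extends AbstractCache<Id, Object> {
    // TODO: one option for improving x-hop query performance
    // The cache granularity still needs to be defined
    // Cache consistency concern: centralized cache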

    @Override
    protected Object access(Id id) {
        return null;
    }

    @Override
    protected boolean write(Id id, Object value, long timeOffset) {
        return false;
    }

    @Override
    protected void remove(Id id) {

    }

    @Override
    protected Iterator<CacheNode<Id, Object>> nodes() {
        return null;
    }

    @Override
    public boolean containsKey(Id id) {
        return false;
    }

    @Override
    public void traverse(Consumer<Object> consumer) {

    }

    @Override
    public void clear() {

    }

    @Override
    public long size() {
        return 0;
    }
}

@CrazyBeeline
Author

@JackyYangPassion We plan to try the cache-based approach for syncing data to ES with 1.5.0. When will it be released?

@imbajin
Member

imbajin commented Oct 9, 2024

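@VGalaxies is handling this release. It is expected around this week (we are wrapping up a few PR / LICENSE issues).

Also, @dosu, do you have any suggestions or analysis on how to implement this feature?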

dosubot bot commented Oct 9, 2024

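@imbajin Hello, I'm Dosu! I'm a bot that can help you solve problems, answer questions, and become a contributor. While our human maintainers are busy with other issues, I'll be here to help.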

To implement the feature of synchronizing data to Elasticsearch using a caching mechanism, you can consider the following suggestions:

  1. Use AbstractCache for Synchronization: Implementing Elasticsearch as a specialized cache by extending the AbstractCache class is a good approach. This allows you to leverage HugeGraph's existing transaction mechanisms to ensure consistency.

  2. Versioning for Consistency: Incorporate versioning in your data model. Use creation and update versions to track changes. This can help in resolving inconsistencies between HugeGraph and Elasticsearch by ensuring that only the latest data is synchronized.

  3. Asynchronous Writes with Compensation: Follow the pattern of writing to the main graph first, then asynchronously updating Elasticsearch. Implement a compensation mechanism to handle failures in Elasticsearch updates. This can involve retrying failed updates or using a background process to ensure eventual consistency.

  4. Bulk Loading for Full Consistency: Periodically perform a full consistency check using bulk loading. This can be done with tools like Spark or Flink to align the data between HugeGraph and Elasticsearch.

  5. Release Timeline: The 1.5.0 release is expected soon, which will include the caching mechanism for data synchronization. Keep an eye on the finalization of PRs and licensing issues for the exact release date.

These steps should help in achieving a robust synchronization between HugeGraph and Elasticsearch.
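
As a rough illustration of the full consistency check in point 4, the sketch below aligns a secondary index with the primary data in one pass. It is an in-memory stand-in only: both stores are plain maps, and `Reconciler` is a hypothetical name. A real job would scan the stores with Spark or Flink rather than hold them in memory.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical full-scan reconciliation between a primary store and an index.
final class Reconciler {

    // Makes `index` match `graph` and returns the number of repaired entries.
    static int reconcile(Map<String, String> graph, Map<String, String> index) {
        int repaired = 0;

        // Add or overwrite index entries that are missing or differ from the graph.
        for (Map.Entry<String, String> entry : graph.entrySet()) {
            if (!Objects.equals(entry.getValue(), index.get(entry.getKey()))) {
                index.put(entry.getKey(), entry.getValue());
                repaired++;
            }
        }

        // Drop index entries whose graph element no longer exists.
        Set<String> stale = new HashSet<>(index.keySet());
        stale.removeAll(graph.keySet());
        for (String id : stale) {
            index.remove(id);
            repaired++;
        }
        return repaired;
    }
}
```

Running the same pass a second time should report zero repairs, which makes it a cheap way to confirm that the two sides are aligned.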

To continue talking to Dosu, mention @dosu.
