[Feature Request]: Support Checkpoint/Resume mechanism for long-running tasks (Knowledge Graph & RAPTOR) #11640

@chg387387

Description

Self Checks

  • I have searched for existing issues, including closed ones.
  • I confirm that I am using English to submit this report (Language Policy).
  • Non-English title submissions will be closed directly (Language Policy).
  • Please do not modify this template :) and fill in all the required fields.

Is your feature request related to a problem?

Yes.

I am trying to use RAGFlow to generate a Knowledge Graph for a relatively large Knowledge Base. In this scenario, the current implementation behaves like a single, very long-running task without any way to pause, resume, or use checkpoints.

The consequence is that once I start Knowledge Graph generation, nothing on the machine is allowed to go wrong for a very long time. For my KB size, a full run can easily last from about one week up to 20 days and consume hundreds of millions of tokens. During this whole period I cannot safely reboot or shut down the computer, and I have to hope there are zero network glitches or API timeouts. If there is any power loss, crash, or temporary network error in the middle, the entire job is lost and I have to restart from the beginning. For large KBs and paid LLM APIs this means wasting a huge amount of time and money, and in practice it makes the current Knowledge Graph feature too fragile to use for serious, large-scale knowledge bases.

Describe the feature you'd like

1. Allow the job to be paused and resumed from the UI. I should be able to start Knowledge Graph generation, pause it when I need the machine for something else, and then continue from the same point later without losing progress.

2. Save progress after each document, or after a small chunk/batch. Once a document (or chunk) is processed and its graph has been generated, that result should be persisted. If the process crashes, the power goes off, or the network/API fails, RAGFlow should continue from the last completed document or chunk instead of starting from the very beginning again.

3. When using a remote LLM API, a single failed call should not kill the whole job. Ideally, only the failed document or chunk is retried or marked as error, while the rest of the job can continue or be resumed.

Internally this could be implemented by treating each document (or chunk) as an independent sub-task with a simple state (pending / running / done / error) and storing the generated graph incrementally. On restart, RAGFlow would just pick up the remaining pending or error tasks.
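As a rough illustration of what such a sub-task model could look like, here is a minimal Python sketch. All names here (SubTask, SubTaskState, GraphCheckpointStore) are invented for this example and are not existing RAGFlow code; the point is only that one row per document survives a crash or reboot.

```python
# Hypothetical sketch of a per-document checkpoint store (not RAGFlow's actual schema).
import enum
import json
import sqlite3
from dataclasses import dataclass


class SubTaskState(str, enum.Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    ERROR = "error"


@dataclass
class SubTask:
    doc_id: str
    state: SubTaskState = SubTaskState.PENDING
    error: str | None = None


class GraphCheckpointStore:
    """Persists one row per document so progress survives crashes and reboots."""

    def __init__(self, path: str = "kg_checkpoint.db"):
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS subtask ("
            " doc_id TEXT PRIMARY KEY, state TEXT, graph_json TEXT, error TEXT)"
        )
        self.conn.commit()

    def seed(self, doc_ids: list[str]) -> None:
        # Register documents once; rows that already exist (e.g. DONE) are left untouched.
        self.conn.executemany(
            "INSERT OR IGNORE INTO subtask (doc_id, state) VALUES (?, 'pending')",
            [(d,) for d in doc_ids],
        )
        self.conn.commit()

    def upsert(self, task: SubTask, graph: dict | None = None) -> None:
        self.conn.execute(
            "INSERT INTO subtask (doc_id, state, graph_json, error) VALUES (?, ?, ?, ?) "
            "ON CONFLICT(doc_id) DO UPDATE SET state=excluded.state, "
            "graph_json=excluded.graph_json, error=excluded.error",
            (task.doc_id, task.state.value,
             json.dumps(graph) if graph is not None else None, task.error),
        )
        self.conn.commit()  # commit after every document so nothing is lost

    def remaining(self) -> list[str]:
        # On restart, only documents still pending or previously failed are returned.
        rows = self.conn.execute(
            "SELECT doc_id FROM subtask WHERE state IN ('pending', 'error')"
        ).fetchall()
        return [r[0] for r in rows]
```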

This request is closely related to the RAPTOR feature request I opened earlier (about making RAPTOR indexing resumable and more robust): https://github.com/infiniflow/ragflow/issues/11483

I think the same task/checkpoint mechanism could be reused for both RAPTOR and Knowledge Graph generation.

Describe implementation you've considered

Right now my only option is to keep the computer running 24/7 for weeks and hope the internet or API doesn't fail, which is really risky. I think the system needs to save the task status to the database after every single document is processed. If the backend sees a "paused" or "interrupted" state, it should just check the database and resume from the next document instead of restarting from scratch.
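To make that concrete, here is a sketch of the kind of loop I have in mind, reusing the hypothetical GraphCheckpointStore / SubTask classes from the sketch above. extract_graph() and pause_requested() are placeholders for the real per-document LLM extraction call and the UI pause flag; neither is an existing RAGFlow API.

```python
# Hypothetical resume loop: skip DONE documents, checkpoint after each one,
# and let a single failure be recorded instead of killing the whole job.
def run_or_resume(store, all_doc_ids, extract_graph, pause_requested):
    store.seed(all_doc_ids)             # no-op for documents already tracked
    for doc_id in store.remaining():    # documents already DONE are skipped
        if pause_requested():
            return "paused"             # progress is already persisted; safe to stop
        task = SubTask(doc_id=doc_id, state=SubTaskState.RUNNING)
        store.upsert(task)
        try:
            graph = extract_graph(doc_id)        # one LLM-heavy unit of work
            task.state = SubTaskState.DONE
            store.upsert(task, graph=graph)      # checkpoint after every document
        except Exception as exc:
            task.state = SubTaskState.ERROR      # a single failure is recorded,
            task.error = str(exc)                # not fatal for the whole job
            store.upsert(task)
    return "done" if not store.remaining() else "needs_retry"
```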

Documentation, adoption, use case

This is crucial for anyone running RAGFlow locally without a dedicated server. For example, I want to run the graph generation only at night or on weekends when I'm not using the computer. When I need to work during the day, I should be able to pause it to free up resources. This feature would make it safe to process large datasets (like my 200-document case) without worrying that a single crash on day 19 will waste millions of tokens.

Additional information

I also opened another issue regarding RAPTOR suggestions (issue #11483). I think these two are related because both RAPTOR and Knowledge Graph generation are very long-running processes that consume a lot of tokens. Solving the checkpoint/resume problem would fix the main pain point for both features.

Here is the link to that issue: #11483
