Repo change detection
Determine what has changed in a GitHub repo relative to the existing db entries.
This works based on a couple of assumptions that could be improved in the future (2024-12-17):

- node_ids are unique to the contents of a file (i.e., a change in node_id means that contents must have changed)
    - Note: node_id can be shared by many files (if they have the same content)
- (node_id, path) is a unique identifier of a file state
    - I.e., node_id alone is not enough since multiple files with the same content can share a node_id
    - path alone is not enough since that doesn't say whether contents have changed.
- Anything that doesn't show up in the tree but is in db entries must have been deleted from repo
    - I.e. assuming that the tree is the full root tree
- The tree is fully loaded
    - I.e. the tree represents the full structure of the repository
    - This could also be optimized in the future, but for now the assumption is that it is relatively cheap to get very basic info about the full structure of the repository (i.e., no contents of blobs).
    - Could later be more clever about saving some info about the subtrees that have previously been parsed to be able to completely skip re-parsing whole subtrees that haven't changed.
- Assuming only blobs stored in db. (i.e. no Tree info stored)
    - So only necessary to walk through all blobs to determine relevant changes.
- Not trying to determine what part of files have changed, if there is any change to the file, all parts of it should be parsed again.
    - I.e., room for later optimization to determine more specifically what parts have actually changed.
Note 2024-12-17: Mostly ignoring Tree entries for now, but could massively improve performance by skipping recursing into any tree entry that is unchanged. Something to think about in the future...
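The assumptions above imply a simple decision rule per blob. The following is a minimal sketch of that rule, not the module's actual implementation: `Blob`, `classify`, and `find_deleted` are hypothetical names introduced only for illustration, and the category labels are assumed to mirror the fields of `Changes`.

```python
from typing import NamedTuple


class Blob(NamedTuple):
    """Hypothetical stand-in for a blob entry: its content id and its path."""
    node_id: str
    path: str


def classify(blob, stored):
    """Classify one tree blob against the set of stored (node_id, path) pairs."""
    stored_paths = {s.path for s in stored}
    stored_ids = {s.node_id for s in stored}
    if blob in stored:
        # Same (node_id, path): the file state is identical.
        return "unchanged"
    if blob.path in stored_paths:
        # Same path, different node_id: contents must have changed.
        return "changed"
    if blob.node_id in stored_ids:
        # Known contents under a new path: renamed (or duplicated).
        return "renamed"
    return "new"


def find_deleted(tree_blobs, stored):
    """Stored entries whose path is absent from the full tree were deleted."""
    tree_paths = {b.path for b in tree_blobs}
    return [s for s in stored if s.path not in tree_paths]
```

Note that `find_deleted` is only valid under the stated assumption that the tree is the fully loaded root tree; with a partial tree, every stored entry outside the loaded part would be reported as deleted.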
NODE_ID = str
module-attribute
PATH = str
module-attribute
ChangedEntryInfo
Bases: BaseModel
new_entry
instance-attribute
previous
instance-attribute
Changes
Bases: BaseModel
changed_blobs = Field(default_factory=list)
class-attribute
instance-attribute
deleted_blobs = Field(default_factory=list)
class-attribute
instance-attribute
new_blobs = Field(default_factory=list)
class-attribute
instance-attribute
renamed_blobs = Field(default_factory=list)
class-attribute
instance-attribute
unchanged = Field(default_factory=list)
class-attribute
instance-attribute
__add__(other)
__str__()
clear()
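The behavior of `__add__`, `__str__`, and `clear()` is not documented here. As a rough sketch of one plausible reading (merge two results, summarize counts, empty all lists), the example below uses a standard-library dataclass instead of the pydantic `BaseModel` the real class derives from. `ChangesSketch` is a hypothetical name, and the semantics shown are assumptions.

```python
from dataclasses import dataclass, field, fields


@dataclass
class ChangesSketch:
    """Hypothetical stand-in for Changes; plain lists instead of entry models."""
    new_blobs: list = field(default_factory=list)
    changed_blobs: list = field(default_factory=list)
    renamed_blobs: list = field(default_factory=list)
    deleted_blobs: list = field(default_factory=list)
    unchanged: list = field(default_factory=list)

    def __add__(self, other):
        # Assumed semantics: concatenate each category into a new object.
        if not isinstance(other, ChangesSketch):
            return NotImplemented
        merged = {
            f.name: getattr(self, f.name) + getattr(other, f.name)
            for f in fields(self)
        }
        return ChangesSketch(**merged)

    def __str__(self):
        # Assumed semantics: a one-line summary with a count per category.
        return ", ".join(
            f"{f.name}={len(getattr(self, f.name))}" for f in fields(self)
        )

    def clear(self):
        # Assumed semantics: empty every category in place.
        for f in fields(self):
            getattr(self, f.name).clear()
```

Using `default_factory=list` (as the real fields do) gives each instance its own lists, so clearing or merging one result never touches another.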
NodeIdentifier
Bases: BaseModel
Minimal info to identify what entries are stored in the db currently.
I.e., Which blob entries have an existing record in the db.
metadata
instance-attribute
node_id = Field(repr=False)
class-attribute
instance-attribute
path
instance-attribute
from_metadatas(metadatas)
classmethod
RenamedEntryInfo
Bases: BaseModel
Info about a renamed entry.
Note: Not technically limited to renamed, could just have exactly the same contents as an existing entry. I.e., the "same_as" entry should not necessarily be deleted from the db.
Thought is that if the contents are identical to something that exists, can just copy those contents.
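To illustrate the idea of copying contents instead of re-parsing, here is a small sketch. It assumes a hypothetical in-memory store, `parsed_by_path`, that maps a path to its previously parsed output; the real storage layer and the function name `reuse_parsed_contents` are not part of this module.

```python
def reuse_parsed_contents(parsed_by_path, previous_path, new_path):
    """Copy parsed output for identical contents to a new path.

    The previous entry is left in place: it may still exist in the repo
    (a duplicate rather than a rename), so removing it is a separate
    decision driven by whether its path still appears in the tree.
    """
    parsed_by_path[new_path] = list(parsed_by_path[previous_path])
    return parsed_by_path
```

The copy is made with `list(...)` so that later edits to one entry do not silently affect the other.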
new_entry
instance-attribute
previous
instance-attribute
RepoChangeDetector
Builder class for Changes object.
Effectively acts like a single determine_changes function.
The class exists only to help pass state around between the parts of that computation.
changes
instance-attribute
node_id_only_mapping
instance-attribute
node_path_mapping
instance-attribute
path_only_mapping
instance-attribute
seen
instance-attribute
determine_changes(tree, stored_nodes)
Determine what has changed in the repo.
| PARAMETER | DESCRIPTION |
|---|---|
| tree | The tree from GitHub to search for changes in, relative to the db. |
| stored_nodes | Minimal info about the existing files stored in the db. |
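As a hedged sketch of how a builder with these attributes might work, the example below builds the three lookup mappings from the stored nodes, walks only the blob entries of the tree, and uses `seen` to derive deletions. Everything here is an assumption made for illustration: `DetectorSketch` and `StoredNode` are hypothetical names, tree entries are modeled as plain dicts with `type`, `node_id`, and `path` keys, and results are collected in a dict rather than a `Changes` object.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredNode:
    """Hypothetical stand-in for NodeIdentifier (node_id and path only)."""
    node_id: str
    path: str


def _empty_changes():
    return {"new": [], "changed": [], "renamed": [], "deleted": [], "unchanged": []}


@dataclass
class DetectorSketch:
    node_path_mapping: dict = field(default_factory=dict)
    node_id_only_mapping: dict = field(default_factory=dict)
    path_only_mapping: dict = field(default_factory=dict)
    seen: set = field(default_factory=set)
    changes: dict = field(default_factory=_empty_changes)

    def determine_changes(self, tree, stored_nodes):
        # Index stored nodes three ways so each lookup is a dict access.
        for node in stored_nodes:
            self.node_path_mapping[(node.node_id, node.path)] = node
            self.node_id_only_mapping[node.node_id] = node
            self.path_only_mapping[node.path] = node
        for entry in tree:
            if entry["type"] != "blob":
                continue  # Tree entries are ignored; only blobs are stored.
            self._process_blob(entry["node_id"], entry["path"])
        # Any stored path never seen in the full tree has been deleted.
        for path in self.path_only_mapping:
            if path not in self.seen:
                self.changes["deleted"].append(path)
        return self.changes

    def _process_blob(self, node_id, path):
        self.seen.add(path)
        if (node_id, path) in self.node_path_mapping:
            self.changes["unchanged"].append(path)
        elif path in self.path_only_mapping:
            self.changes["changed"].append(path)
        elif node_id in self.node_id_only_mapping:
            previous = self.node_id_only_mapping[node_id]
            self.changes["renamed"].append((previous.path, path))
        else:
            self.changes["new"].append(path)
```

A possible call, under the same assumptions, would be `DetectorSketch().determine_changes(tree, stored_nodes)`. A fresh instance is used per run because the mappings and `seen` carry state from one call into the next.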