Repo change detection

Determine what has changed in a GitHub repo relative to the existing db entries.

This works based on a couple of assumptions that could be improved in the future (2024-12-17):

- node_ids are unique to the contents of a file (i.e., a change in node_id means that contents must have changed)
  - Note: node_id can be shared by many files (if they have the same content)
- (node_id, path) is a unique identifier of a file state
  - I.e., node_id alone is not enough since multiple files with the same content can share a node_id
  - path alone is not enough since that doesn't say whether contents have changed.
- Anything that doesn't show up in the tree but is in db entries must have been deleted from repo
  - I.e. assuming that the tree is the full root tree
- The tree is fully loaded
  - I.e. the tree represents the full structure of the repository
  - This could also be optimized in the future, but for now the assumption is that it is relatively cheap to get very basic info about the full structure of the repository (i.e., no contents of blobs).
  - Could later be more clever about saving some info about the subtrees that have previously been parsed to be able to completely skip re-parsing whole subtrees that haven't changed.
- Assuming only blobs stored in db. (i.e. no Tree info stored)
  - So only necessary to walk through all blobs to determine relevant changes.
- Not trying to determine what part of files have changed, if there is any change to the file, all parts of it should be parsed again.
  - I.e., room for later optimization to determine more specifically what parts have actually changed.
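The assumptions above can be illustrated with a minimal sketch. This is not the module's actual implementation: `Blob`, `Classified`, and `classify` are hypothetical stand-ins (plain dataclasses rather than the pydantic models documented here), and the order in which the lookups are tried is an assumption.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Blob:
    """Hypothetical stand-in for a blob entry: just node_id and path."""
    node_id: str
    path: str


@dataclass
class Classified:
    """Hypothetical result container, loosely mirroring the Changes fields."""
    new: list = field(default_factory=list)
    changed: list = field(default_factory=list)   # (previous, new_entry) pairs
    renamed: list = field(default_factory=list)   # (previous, new_entry) pairs
    deleted: list = field(default_factory=list)
    unchanged: list = field(default_factory=list)


def classify(tree_blobs, stored):
    """Classify blobs from a fully loaded tree against stored db entries."""
    # (node_id, path) identifies a file state exactly.
    by_pair = {(s.node_id, s.path) for s in stored}
    # path alone tells us a file existed, not whether its contents changed.
    by_path = {s.path: s for s in stored}
    # node_id alone tells us the contents exist somewhere in the db.
    by_node_id = {}
    for s in stored:
        by_node_id.setdefault(s.node_id, s)

    result = Classified()
    seen_paths = set()
    for blob in tree_blobs:
        seen_paths.add(blob.path)
        if (blob.node_id, blob.path) in by_pair:
            result.unchanged.append(blob)
        elif blob.path in by_path:
            # Same path, different node_id: contents must have changed.
            result.changed.append((by_path[blob.path], blob))
        elif blob.node_id in by_node_id:
            # New path, known contents: renamed (or duplicated) file.
            result.renamed.append((by_node_id[blob.node_id], blob))
        else:
            result.new.append(blob)

    # Stored paths missing from the (assumed complete) tree were deleted.
    result.deleted = [s for s in stored if s.path not in seen_paths]
    return result
```

Note that in this sketch a moved file appears both as a rename at its new path and as a deletion at its old path, since the old path is no longer present in the tree.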

Note 2024-12-17: Mostly ignoring Tree entries for now, but could massively improve performance by skipping recursing into any tree entry that is unchanged. Something to think about in the future...
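A rough sketch of that possible optimization follows. It is not part of the current module: `TreeNode` and `collect_blobs` are hypothetical names, and the sketch assumes that a tree entry's node_id changes whenever anything beneath it changes, and that the (node_id, path) pairs of previously parsed trees have been saved somewhere.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TreeNode:
    """Hypothetical tree entry holding sub-trees and direct blob children."""
    node_id: str
    path: str
    children: list[TreeNode] = field(default_factory=list)
    blobs: list[tuple[str, str]] = field(default_factory=list)  # (node_id, path)


def collect_blobs(tree, known_trees):
    """Return blobs under `tree`, skipping sub-trees already seen unchanged.

    `known_trees` is a set of (node_id, path) pairs for tree entries that
    were fully parsed on a previous run.
    """
    found = list(tree.blobs)
    for child in tree.children:
        if (child.node_id, child.path) in known_trees:
            # Unchanged sub-tree: nothing below it needs re-checking.
            continue
        found.extend(collect_blobs(child, known_trees))
    return found
```

With this approach, stored blobs under a skipped sub-tree would have to be treated as unchanged rather than deleted, since they would no longer show up in the list of visited blobs.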

NODE_ID = str module-attribute

PATH = str module-attribute

ChangedEntryInfo

Bases: BaseModel

new_entry instance-attribute

previous instance-attribute

Changes

Bases: BaseModel

changed_blobs = Field(default_factory=list) class-attribute instance-attribute

deleted_blobs = Field(default_factory=list) class-attribute instance-attribute

new_blobs = Field(default_factory=list) class-attribute instance-attribute

renamed_blobs = Field(default_factory=list) class-attribute instance-attribute

unchanged = Field(default_factory=list) class-attribute instance-attribute

__add__(other)

__str__()

clear()

NodeIdentifier

Bases: BaseModel

Minimal info to identify what entries are stored in the db currently.

I.e., which blob entries have an existing record in the db.

metadata instance-attribute

node_id = Field(repr=False) class-attribute instance-attribute

path instance-attribute

from_metadatas(metadatas) classmethod

RenamedEntryInfo

Bases: BaseModel

Info about a renamed entry.

Note: This is not strictly limited to renames; the new entry could simply have exactly the same contents as an existing entry. I.e., the "same_as" entry should not necessarily be deleted from the db.

The idea is that if the contents are identical to something that already exists, those contents can simply be copied.
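As an illustration of the copy idea, here is a minimal sketch. The in-memory `store` dict (path to parsed contents) and the `apply_renames` helper are hypothetical; the real db layer is not described in this documentation.

```python
import copy


def apply_renames(store, renames):
    """Copy already-parsed contents to new paths instead of re-parsing.

    store:   dict mapping path -> parsed contents (hypothetical db stand-in)
    renames: iterable of (previous_path, new_path) pairs
    """
    for previous_path, new_path in renames:
        # Deep copy so later edits to one entry do not affect the other.
        store[new_path] = copy.deepcopy(store[previous_path])
        # The previous entry is deliberately left in place: the new entry
        # may be a duplicate rather than a true rename.
    return store
```

Removing the previous entry, if it really is gone from the repo, would then be handled separately through the deleted entries.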

new_entry instance-attribute

previous instance-attribute

RepoChangeDetector

Builder class for Changes object.

Basically acts like a determine_changes function.

Class used just to help pass state around between parts.

changes instance-attribute

node_id_only_mapping instance-attribute

node_path_mapping instance-attribute

path_only_mapping instance-attribute

seen instance-attribute

determine_changes(tree, stored_nodes)

Determine what has changed in the repo.

Parameters:

- tree (TYPE: Tree): The tree from GitHub to search for changes in relative to db
- stored_nodes (TYPE: list[NodeIdentifier]): Minimal info about the existing files stored in the db