__init__

GITHUB_SEARCH_COLLECTION_NAME = 'github_search' module-attribute

LiteralQueryType = Literal['text', 'code'] module-attribute

__all__ = ['GITHUB_SEARCH_COLLECTION_NAME', 'CodeChunkType', 'CodeFileQdrantMetadata', 'GithubCollectionInitializer', 'GithubFilterArgs', 'GithubScrollResult', 'GithubSearchResult', 'GithubSearchResults', 'GithubSearchUOW', 'LiteralQueryType', 'TextFileQdrantMetadata', 'add_parsed_github_files', 'delete_repository_points', 'get_num_points', 'make_github_filter', 'scroll_all_points', 'scroll_points', 'search', 'search_file_results', 'update_repository_points', 'vector_stats_for_repo'] module-attribute

CodeChunkType

Bases: StrEnum

CLASS = 'class' class-attribute instance-attribute

DOCSTRING = 'docstring' class-attribute instance-attribute

FUNCTION = 'function' class-attribute instance-attribute

IMPORT = 'import' class-attribute instance-attribute

OTHER = 'other' class-attribute instance-attribute

CodeFileQdrantMetadata

Bases: AbstractQdrantMetadata

code_chunk_type instance-attribute

full_chunk_info = Field(default=None, description='Additional information about the chunk optionally loaded from the database.') class-attribute instance-attribute

github_metadata instance-attribute

language instance-attribute

name instance-attribute

parent_classes = Field(default_factory=list) class-attribute instance-attribute

pk_id = Field(description='The primary key id of the related chunk in the database.') class-attribute instance-attribute

user_clerk_id instance-attribute

as_artifact()

as_content()

from_chunk_and_gh_metadata(chunk, gh_metadata, user_clerk_id, chunk_type, language, parent_classes=None) classmethod

Create the metadata object that will be stored in qdrant.

Note: The full_chunk_info is not stored in qdrant, but can be loaded from the database when needed.

GithubCollectionInitializer

Bases: CollectionInitializer

collection_name = COLLECTION_NAME class-attribute instance-attribute

default_indexes property

GithubFilterArgs

Bases: BaseModel

chunk_type = Field(default=None, description='The code chunk type to filter by.') class-attribute instance-attribute

file_path_end = Field(default=None, description="The end of the file path to filter by. (e.g. '.py', 'some_file.py', 'sub_dir/file.py')") class-attribute instance-attribute

file_path_start = Field(default=None, description="The start of the file path to filter by. (e.g. 'src/', 'src/main/')") class-attribute instance-attribute

language = Field(default=None, description='The language to filter by.') class-attribute instance-attribute

repo_id = Field(description='The repository to search in.') class-attribute instance-attribute

user_id = Field(description='The user to search as.') class-attribute instance-attribute

GithubScrollResult dataclass

last_point_id instance-attribute

results instance-attribute

__init__(results, last_point_id)

as_artifact()

as_content()

GithubSearchResult

Bases: BaseModel

content instance-attribute

metadata instance-attribute

qdrant_id instance-attribute

score instance-attribute

as_artifact()

as_content()

GithubSearchResults dataclass

results instance-attribute

__init__(results)

__post_init__()

as_artifact()

as_content()

GithubSearchUOW

collection_name = COLLECTION_NAME class-attribute

qdrant = qdrant instance-attribute

read_only_uow property

__init__(qdrant, read_only_uow=None)

TextFileQdrantMetadata

Bases: AbstractQdrantMetadata

chunk_index instance-attribute

github_metadata instance-attribute

language instance-attribute

pk_id instance-attribute

user_clerk_id instance-attribute

as_artifact()

as_content()

add_parsed_github_files(github_search_uow, parsed_github_files, user_id, wait=False) async

Add parsed github files to the qdrant database.

delete_repository_points(github_search_uow, repo_id, user_id, wait=False) async

Delete all points for a specific repository id.

Note: Passing "user_id" to delete a public repo is not necessary, but it is allowed.

TODO 2024-12-18: Should I allow this, or make it require explicit "public"?

get_num_points(uow, filter_args) async

Get the number of points that match the given filter.

make_github_filter(filter_args)

Create a qdrant filter for searching for code chunks (embedded as text or code).

Note: The file_path filter is a basic qdrant text filter and can return unexpected results. Further filtering of results should be done on the returned data. Examples of search terms and unexpected results:
- "some/inner/folder" -> "some/folder/inner" (word order is apparently not preserved)
- "file.txt" -> "some_file.txt", "another_file.txt" (substring matching; as of 2025-01-02, this may be fixed by WORD indexing)

PARAMETER DESCRIPTION
filter_args

The filter arguments to use for creating the filter.

TYPE: GithubFilterArgs

RETURNS DESCRIPTION
Filter

A qdrant filter object that can be used for searching, scrolling, deleting, etc.
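
The note above recommends filtering the returned data further, since the file_path text filter can match more than intended. A minimal sketch of such a post-filter follows. The helper name path_matches and the sample paths are hypothetical, and the boundary rule for suffixes is an assumption about what a caller would want, not this module's behavior.

```python
from typing import Optional


def path_matches(path: str, start: Optional[str] = None, end: Optional[str] = None) -> bool:
    """Strictly check a file path against prefix/suffix constraints."""
    if start is not None and not path.startswith(start):
        return False
    if end is not None:
        if not path.endswith(end):
            return False
        # A suffix such as ".py" is treated as an extension and may follow
        # any characters; any other suffix must begin at a path boundary,
        # so "file.txt" does not match "some_file.txt".
        if not end.startswith("."):
            head = path[: len(path) - len(end)]
            if head and not head.endswith("/"):
                return False
    return True


# Hypothetical file paths, as they might come back from a loose text match.
candidate_paths = [
    "docs/file.txt",
    "docs/some_file.txt",
    "some/folder/inner/notes.txt",
    "some/inner/folder/notes.txt",
]

exact_name = [p for p in candidate_paths if path_matches(p, end="file.txt")]
in_folder = [p for p in candidate_paths if path_matches(p, start="some/inner/folder")]
print(exact_name)
print(in_folder)
```

Applying this check to the file path of each returned result removes both kinds of false positive listed in the note: reordered path segments and substring matches on file names.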

scroll_all_points(uow, filter_args=None, num_per_iteration=100, load_additional_chunk_info=False) async

Scroll through all points that match the given filter.

PARAMETER DESCRIPTION
uow

The unit of work to use for the search.

TYPE: GithubSearchUOW

filter_args

The filter to use for the scroll.

TYPE: GithubFilterArgs | None DEFAULT: None

num_per_iteration

The number of results to return in each iteration.

TYPE: int DEFAULT: 100

load_additional_chunk_info

Whether to load additional metadata from the db for the search results.

TYPE: bool DEFAULT: False

scroll_points(uow, filter_args=None, limit=100, from_point_id=None, load_additional_chunk_info=False) async

Scroll through a single page of points that match the given filter, returning at most limit results after from_point_id.

PARAMETER DESCRIPTION
uow

The unit of work to use for the search.

TYPE: GithubSearchUOW

filter_args

The filter to use for the scroll.

TYPE: GithubFilterArgs | None DEFAULT: None

limit

The maximum number of results to return.

TYPE: int DEFAULT: 100

from_point_id

The id of the last point from the previous scroll. Results start after this point; the point itself is not included.

TYPE: str | None DEFAULT: None

load_additional_chunk_info

Whether to load additional metadata from the db for the search results.

TYPE: bool DEFAULT: False
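
The from_point_id parameter and the last_point_id field of GithubScrollResult suggest a cursor-style pagination loop. The sketch below illustrates that pattern against an in-memory stand-in. FakeScrollResult, fake_scroll_points, and collect_all are hypothetical names, and stopping on an empty page is an assumption; the real function may signal the end differently.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional

# In-memory stand-in for stored points: (point_id, payload) pairs.
_POINTS = [(f"id-{i:03d}", f"chunk {i}") for i in range(7)]


@dataclass
class FakeScrollResult:
    """Stand-in with the two documented GithubScrollResult fields."""

    results: List[str]
    last_point_id: Optional[str]


async def fake_scroll_points(limit: int = 100, from_point_id: Optional[str] = None) -> FakeScrollResult:
    """Hypothetical stand-in for scroll_points over the in-memory list."""
    ids = [pid for pid, _ in _POINTS]
    begin = ids.index(from_point_id) + 1 if from_point_id is not None else 0
    page = _POINTS[begin : begin + limit]
    last_id = page[-1][0] if page else None
    return FakeScrollResult([payload for _, payload in page], last_id)


async def collect_all(page_size: int) -> List[str]:
    """Fetch page after page, feeding each last_point_id back as the cursor."""
    collected: List[str] = []
    cursor: Optional[str] = None
    while True:
        page = await fake_scroll_points(limit=page_size, from_point_id=cursor)
        if not page.results:
            break  # an empty page means nothing is left after the cursor
        collected.extend(page.results)
        cursor = page.last_point_id  # excluded from the next page
    return collected


print(asyncio.run(collect_all(page_size=3)))
```

For the common case of visiting every matching point, scroll_all_points presumably wraps a loop of this kind, with num_per_iteration playing the role of page_size.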

search_file_results(uow, query_text, query_type, filter_args, max_results=None, load_full_chunk_info=False) async

Return the whole files that contain any of the search results.

Note: This does not return additional metadata that is stored in db only for the search results (only the basic metadata stored in qdrant).

PARAMETER DESCRIPTION
uow

The unit of work to use for the search.

TYPE: GithubSearchUOW

query_text

The text to search for.

TYPE: str

query_type

The type of query to perform (e.g. "text" or "code"). Determines which vectors to search against.

TYPE: LiteralQueryType

filter_args

The filter arguments to use for the search.

TYPE: GithubFilterArgs

max_results

The maximum number of results to return.

TYPE: int | None DEFAULT: None

load_full_chunk_info

Whether to load additional metadata from the db for the search results.

TYPE: bool DEFAULT: False

update_repository_points(uow, repo_id, user_id, update_info, wait=False) async

Update the repository in the qdrant database.

Note: user_id must be explicitly provided as "public" if it's a public repository.

Note: This handles changed files by deleting them and adding the new version back in.
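
The delete-then-add handling of changed files can be pictured with a small in-memory model. In the sketch below, the index dictionary, the apply_update helper, and the changed/removed arguments are all hypothetical. The actual shape of update_info is not documented here.

```python
from typing import Dict, List

# Hypothetical in-memory index: file path -> chunks stored for that file.
index: Dict[str, List[str]] = {
    "src/app.py": ["def main(): ...", "import os"],
    "README.md": ["# Project"],
}


def apply_update(
    store: Dict[str, List[str]],
    changed: Dict[str, List[str]],
    removed: List[str],
) -> None:
    """Replace changed files wholesale and drop removed ones."""
    for path in removed:
        store.pop(path, None)
    for path, new_chunks in changed.items():
        # Delete every old chunk for the file first, then add the new
        # version, so no stale chunk from the previous version survives.
        store.pop(path, None)
        store[path] = list(new_chunks)


apply_update(
    index,
    changed={"src/app.py": ["def main(): return 0"]},
    removed=["README.md"],
)
print(index)
```

Replacing a file wholesale avoids having to diff individual chunks, at the cost of re-embedding content that did not change within the file.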

vector_stats_for_repo(uow, repo_id, user_id, iter_size=100) async

Get some basic stats about what is vectorized for the given repository.

Note: This streams results back as they are found, updating a single result object in place.

That is, it can show progress as updates arrive, but collecting the yielded results is not useful because they are all the same object.

PARAMETER DESCRIPTION
uow

The unit of work to use for the search.

TYPE: GithubSearchUOW

repo_id

The repository to get the stats for.

TYPE: RepoID

user_id

The user (for permissions to access repo info).

TYPE: UserID | Literal['public']

iter_size

The number of results to scroll through at a time.

TYPE: int DEFAULT: 100
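
The in-place streaming behavior noted above has a practical consequence: a caller that wants intermediate values must copy each yielded object. The sketch below demonstrates this with a hypothetical stand-in generator. The fake_stats_stream name and the points_seen field are assumptions, not the real stats shape.

```python
import asyncio
import copy
from typing import AsyncIterator, Dict, List, Tuple


async def fake_stats_stream() -> AsyncIterator[Dict[str, int]]:
    """Hypothetical stand-in that yields one mutable stats object repeatedly."""
    stats = {"points_seen": 0}
    for batch_size in (100, 100, 42):
        stats["points_seen"] += batch_size  # mutated in place
        yield stats  # the same dict object every time


async def demo() -> Tuple[List[Dict[str, int]], List[Dict[str, int]]]:
    collected: List[Dict[str, int]] = []
    snapshots: List[Dict[str, int]] = []
    async for stats in fake_stats_stream():
        collected.append(stats)  # stores a reference to the shared object
        snapshots.append(copy.deepcopy(stats))  # stores the state at this moment
    return collected, snapshots


collected, snapshots = asyncio.run(demo())
print([s["points_seen"] for s in collected])
print([s["points_seen"] for s in snapshots])
```

Every entry in collected reports the final total because all entries refer to one object, while snapshots preserves the running totals.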