init
GITHUB_SEARCH_COLLECTION_NAME = 'github_search'
module-attribute
LiteralQueryType = Literal['text', 'code']
module-attribute
__all__ = ['GITHUB_SEARCH_COLLECTION_NAME', 'CodeChunkType', 'CodeFileQdrantMetadata', 'GithubCollectionInitializer', 'GithubFilterArgs', 'GithubScrollResult', 'GithubSearchResult', 'GithubSearchResults', 'GithubSearchUOW', 'LiteralQueryType', 'TextFileQdrantMetadata', 'add_parsed_github_files', 'delete_repository_points', 'get_num_points', 'make_github_filter', 'scroll_all_points', 'scroll_points', 'search', 'search_file_results', 'update_repository_points', 'vector_stats_for_repo']
module-attribute
CodeChunkType
Bases: StrEnum
CLASS = 'class'
class-attribute
instance-attribute
DOCSTRING = 'docstring'
class-attribute
instance-attribute
FUNCTION = 'function'
class-attribute
instance-attribute
IMPORT = 'import'
class-attribute
instance-attribute
OTHER = 'other'
class-attribute
instance-attribute
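A minimal sketch of how the string-valued members above can be used when reading a chunk type back from a stored payload. The class is re-declared locally so the example runs on its own. The documented class derives from StrEnum, while this sketch uses the `str` + `Enum` mixin so it also runs on Python versions before 3.11. The `parse_chunk_type` helper and its fallback to `OTHER` are assumptions for illustration, not part of this module.

```python
from enum import Enum


class CodeChunkType(str, Enum):
    """Local re-declaration of the documented members (illustration only)."""

    CLASS = "class"
    DOCSTRING = "docstring"
    FUNCTION = "function"
    IMPORT = "import"
    OTHER = "other"


def parse_chunk_type(raw: str) -> CodeChunkType:
    """Hypothetical helper: map a raw string to a member, falling back to OTHER."""
    try:
        return CodeChunkType(raw)
    except ValueError:
        # Unknown values are treated as OTHER (an assumption of this sketch).
        return CodeChunkType.OTHER


known = parse_chunk_type("function")
unknown = parse_chunk_type("lambda")
```

Because the members are strings, they compare equal to their raw values, which is convenient when building payload filters.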
CodeFileQdrantMetadata
Bases: AbstractQdrantMetadata
code_chunk_type
instance-attribute
full_chunk_info = Field(default=None, description='Additional information about the chunk optionally loaded from the database.')
class-attribute
instance-attribute
github_metadata
instance-attribute
language
instance-attribute
name
instance-attribute
parent_classes = Field(default_factory=list)
class-attribute
instance-attribute
pk_id = Field(description='The primary key id of the related chunk in the database.')
class-attribute
instance-attribute
user_clerk_id
instance-attribute
as_artifact()
as_content()
from_chunk_and_gh_metadata(chunk, gh_metadata, user_clerk_id, chunk_type, language, parent_classes=None)
classmethod
Create the metadata object that will be stored in qdrant.
Note: The full_chunk_info is not stored in qdrant, but can be loaded from the database when needed.
GithubCollectionInitializer
Bases: CollectionInitializer
collection_name = COLLECTION_NAME
class-attribute
instance-attribute
default_indexes
property
GithubFilterArgs
Bases: BaseModel
chunk_type = Field(default=None, description='The code chunk type to filter by.')
class-attribute
instance-attribute
file_path_end = Field(default=None, description="The end of the file path to filter by. (e.g. '.py', 'some_file.py', 'sub_dir/file.py')")
class-attribute
instance-attribute
file_path_start = Field(default=None, description="The start of the file path to filter by. (e.g. 'src/', 'src/main/')")
class-attribute
instance-attribute
language = Field(default=None, description='The language to filter by.')
class-attribute
instance-attribute
repo_id = Field(description='The repository to search in.')
class-attribute
instance-attribute
user_id = Field(description='The user to search as.')
class-attribute
instance-attribute
GithubScrollResult
dataclass
last_point_id
instance-attribute
results
instance-attribute
__init__(results, last_point_id)
as_artifact()
as_content()
GithubSearchResult
Bases: BaseModel
content
instance-attribute
metadata
instance-attribute
qdrant_id
instance-attribute
score
instance-attribute
as_artifact()
as_content()
GithubSearchResults
dataclass
results
instance-attribute
__init__(results)
__post_init__()
as_artifact()
as_content()
GithubSearchUOW
collection_name = COLLECTION_NAME
class-attribute
qdrant = qdrant
instance-attribute
read_only_uow
property
__init__(qdrant, read_only_uow=None)
TextFileQdrantMetadata
Bases: AbstractQdrantMetadata
chunk_index
instance-attribute
github_metadata
instance-attribute
language
instance-attribute
pk_id
instance-attribute
user_clerk_id
instance-attribute
as_artifact()
as_content()
add_parsed_github_files(github_search_uow, parsed_github_files, user_id, wait=False)
async
Add parsed github files to the qdrant database.
delete_repository_points(github_search_uow, repo_id, user_id, wait=False)
async
Delete all points for a specific repository id.
Note: Passing "user_id" to delete a public repo is not necessary, but it is allowed.
TODO 2024-12-18: Should I allow this, or make it require explicit "public"?
get_num_points(uow, filter_args)
async
Get the number of points that match the given filter.
make_github_filter(filter_args)
Create a qdrant filter for searching for code chunks (embedded as text or code).
Note: The file_path filter is a basic qdrant text filter and can return unexpected results.
Further filtering of results should be done on the returned data.
Examples of search terms and unexpected results:
- "some/inner/folder" -> "some/folder/inner" (word order is apparently not preserved)
- "file.txt" -> "some_file.txt", "another_file.txt" (substring matching; as of 2025-01-02 this may be fixed by WORD indexing)
| PARAMETER | DESCRIPTION |
|---|---|
| filter_args | The filter arguments to use for creating the filter. |

| RETURNS | DESCRIPTION |
|---|---|
| Filter | A qdrant filter object that can be used for searching, scrolling, deleting, etc. |
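Since the file path filter can over-match, a stricter check can be applied to the returned data. The following is a minimal sketch, assuming each result's file path is available as a plain string. The helpers `path_matches` and `strict_filter`, the sample paths, and the rule for treating a leading "." as an extension are all hypothetical and not part of this module.

```python
from typing import Iterable, List, Optional


def path_matches(file_path: str, start: Optional[str] = None, end: Optional[str] = None) -> bool:
    """Strictly check a path against optional start/end constraints."""
    if start is not None and not file_path.startswith(start):
        return False
    if end is not None:
        if end.startswith("."):
            # Treat values such as ".py" as a file extension (assumption).
            return file_path.endswith(end)
        # Otherwise require a match on a path-component boundary, so that
        # "file.txt" does not accept "some_file.txt".
        return file_path == end or file_path.endswith("/" + end)
    return True


def strict_filter(paths: Iterable[str], start: Optional[str] = None, end: Optional[str] = None) -> List[str]:
    """Keep only the paths that pass the strict check."""
    return [p for p in paths if path_matches(p, start, end)]


# Paths a loose text filter might return for "some/inner/folder" + "file.txt".
candidates = [
    "some/inner/folder/file.txt",
    "some/folder/inner/file.txt",
    "some/inner/folder/some_file.txt",
]
kept = strict_filter(candidates, start="some/inner/folder/", end="file.txt")
```

Here only the first candidate survives: the second fails the prefix check and the third fails the component-boundary check.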
scroll_all_points(uow, filter_args=None, num_per_iteration=100, load_additional_chunk_info=False)
async
Scroll through all points that match the given filter.
| PARAMETER | DESCRIPTION |
|---|---|
| uow | The unit of work to use for the search. |
| filter_args | The filter to use for the scroll. |
| num_per_iteration | The number of results to return in each iteration. |
| load_additional_chunk_info | Whether to load additional metadata from the db for the search results. |
scroll_points(uow, filter_args=None, limit=100, from_point_id=None, load_additional_chunk_info=False)
async
Scroll through a single batch of points that match the given filter.
| PARAMETER | DESCRIPTION |
|---|---|
| uow | The unit of work to use for the search. |
| filter_args | The filter to use for the scroll. |
| limit | The maximum number of results to return. |
| from_point_id | The id of the last point from the previous scroll to start from (won't be included). |
| load_additional_chunk_info | Whether to load additional metadata from the db for the search results. |
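The `limit` and `from_point_id` parameters suggest a cursor-style pagination loop. The sketch below illustrates that pattern with stand-ins only: `Page` mimics the `results` and `last_point_id` fields of GithubScrollResult, and `fake_scroll_points` replaces the real call, which needs a unit of work and a running Qdrant instance. Stopping on a short page is an assumption of this sketch, not documented behavior.

```python
import asyncio
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Page:
    """Hypothetical stand-in with the same field names as GithubScrollResult."""

    results: List[int]
    last_point_id: Optional[int]


POINT_IDS = list(range(1, 8))  # seven fake point ids


async def fake_scroll_points(limit: int = 100, from_point_id: Optional[int] = None) -> Page:
    """Stand-in for scroll_points: returns points strictly after from_point_id."""
    remaining = [p for p in POINT_IDS if from_point_id is None or p > from_point_id]
    batch = remaining[:limit]
    return Page(results=batch, last_point_id=batch[-1] if batch else None)


async def collect_all(limit: int) -> List[int]:
    """Page through every point by feeding last_point_id back in as the cursor."""
    collected: List[int] = []
    cursor: Optional[int] = None
    while True:
        page = await fake_scroll_points(limit=limit, from_point_id=cursor)
        collected.extend(page.results)
        if len(page.results) < limit:
            # A short (or empty) page means there is nothing left to fetch.
            break
        cursor = page.last_point_id
    return collected


all_ids = asyncio.run(collect_all(limit=3))
```

When this loop is all that is needed, `scroll_all_points` appears to cover the same ground through its `num_per_iteration` parameter.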
search_file_results(uow, query_text, query_type, filter_args, max_results=None, load_full_chunk_info=False)
async
Return whole files of any search results.
Note: This does not return additional metadata that is stored in db only for the search results (only the basic metadata stored in qdrant).
| PARAMETER | DESCRIPTION |
|---|---|
| uow | The unit of work to use for the search. |
| query_text | The text to search for. |
| query_type | The type of query to perform (e.g. "text" or "code"). Determines which vectors to search against. |
| filter_args | The filter arguments to use for the search. |
| max_results | The maximum number of results to return. |
| load_full_chunk_info | Whether to load additional metadata from the db for the search results. |
update_repository_points(uow, repo_id, user_id, update_info, wait=False)
async
Update the repository in the qdrant database.
Note: user_id must be explicitly provided as "public" if it's a public repository.
Note: This handles changed files by deleting them and adding the new version back in.
vector_stats_for_repo(uow, repo_id, user_id, iter_size=100)
async
Get some basic stats about what is vectorized for the given repository.
Note: Results are streamed back as they are found, and the same result object is updated in place on each iteration.
This makes it possible to show progress as it happens, but collecting the yielded results is not useful, because every collected item is the same object.
| PARAMETER | DESCRIPTION |
|---|---|
| uow | The unit of work to use for the search. |
| repo_id | The repository to get the stats for. |
| user_id | The user (for permissions to access repo info). |
| iter_size | The number of results to scroll through at a time. |
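The in-place streaming behavior described in the note can be illustrated with a small stand-in. This is a sketch under assumptions: `fake_stats_stream` is a hypothetical async generator that replaces `vector_stats_for_repo`, and the `num_points` key is invented for the example. It shows why a copy is needed to keep intermediate values.

```python
import asyncio
import copy
from typing import AsyncIterator, Dict, List, Tuple


async def fake_stats_stream() -> AsyncIterator[Dict[str, int]]:
    """Stand-in stream: yields the SAME dict each time, updated in place."""
    stats = {"num_points": 0}
    for batch_size in (100, 100, 40):
        stats["num_points"] += batch_size
        yield stats


async def demo() -> Tuple[List[Dict[str, int]], List[Dict[str, int]]]:
    naive: List[Dict[str, int]] = []
    snapshots: List[Dict[str, int]] = []
    async for stats in fake_stats_stream():
        naive.append(stats)  # stores a reference to the one shared object
        snapshots.append(copy.deepcopy(stats))  # stores the value at this moment
    return naive, snapshots


naive, snapshots = asyncio.run(demo())
```

After the loop, every entry in `naive` is the same object holding the final totals, while `snapshots` preserves the progression.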