Curator
synapseclient.extensions.curator
Synapse Curator Extensions
This module provides library functions for metadata curation tasks in Synapse.
Functions
create_file_based_metadata_task
create_file_based_metadata_task(folder_id: str, curation_task_name: str, instructions: str, attach_wiki: bool = False, entity_view_name: str = 'JSON Schema view', schema_uri: Optional[str] = None, enable_derived_annotations: bool = False, *, synapse_client: Optional[Synapse] = None) -> Tuple[str, str]
Create a file view for a schema-bound folder using schematic.
Creating a file-based metadata curation task with schema binding
In this example, we create an EntityView and CurationTask for file-based metadata curation. If a schema_uri is provided, it will be bound to the folder.
import synapseclient
from synapseclient.extensions.curator import create_file_based_metadata_task
syn = synapseclient.Synapse()
syn.login()
entity_view_id, task_id = create_file_based_metadata_task(
synapse_client=syn,
folder_id="syn12345678",
curation_task_name="BiospecimenMetadataTemplate",
instructions="Please curate this metadata according to the schema requirements",
attach_wiki=False,
entity_view_name="Biospecimen Metadata View",
schema_uri="sage.schemas.v2571-amp.Biospecimen.schema-0.0.1"
)
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| folder_id | str | The Synapse Folder ID to create the file view for. |
| curation_task_name | str | Name for the CurationTask (used as data_type field). Must be unique within the project; if it matches an existing CurationTask, that task is updated with the new data. |
| instructions | str | Instructions for the curation task. |
| attach_wiki | bool | Whether or not to attach a Synapse Wiki (default: False). |
| entity_view_name | str | Name for the created entity view (default: "JSON Schema view"). |
| schema_uri | Optional[str] | Optional JSON schema URI to bind to the folder. If provided, the schema will be bound to the folder before creating the entity view (e.g., 'sage.schemas.v2571-amp.Biospecimen.schema-0.0.1'). |
| enable_derived_annotations | bool | If true, enable derived annotations. Defaults to False. |
| synapse_client | Optional[Synapse] | If not passed in and caching was not disabled by |

| RETURNS | DESCRIPTION |
|---|---|
| Tuple[str, str] | A tuple containing the Synapse ID of the created entity view and the task ID of the created curation task. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If required parameters are missing. |
| SynapseError | If there are issues with Synapse operations. |

Source code in synapseclient/extensions/curator/file_based_metadata_task.py
create_record_based_metadata_task
create_record_based_metadata_task(project_id: str, folder_id: str, record_set_name: str, record_set_description: str, curation_task_name: str, upsert_keys: List[str], instructions: str, schema_uri: str, bind_schema_to_record_set: bool = True, enable_derived_annotations: bool = False, *, synapse_client: Optional[Synapse] = None) -> Tuple[RecordSet, CurationTask, Grid]
Generate and upload CSV templates as a RecordSet for record-based metadata, create a CurationTask, and also create a Grid to bootstrap the ValidationStatistics.
A number of schema URIs that are already registered to Synapse can be found at:
If you have yet to create and register your JSON schema in Synapse, please refer to the tutorial at https://python-docs.synapse.org/en/stable/tutorials/python/json_schema/.
Creating a record-based metadata curation task with a schema URI
In this example, we create a RecordSet and CurationTask for biospecimen metadata
curation using a schema URI. By default this will also bind the schema to the
RecordSet, however the bind_schema_to_record_set parameter can be set to
False to skip that step.
import synapseclient
from synapseclient.extensions.curator import create_record_based_metadata_task
syn = synapseclient.Synapse()
syn.login()
record_set, task, grid = create_record_based_metadata_task(
synapse_client=syn,
project_id="syn12345678",
folder_id="syn87654321",
record_set_name="BiospecimenMetadata_RecordSet",
record_set_description="RecordSet for biospecimen metadata curation",
curation_task_name="BiospecimenMetadataTemplate",
upsert_keys=["specimenID"],
instructions="Please curate this metadata according to the schema requirements",
schema_uri="schema-org-schema.name.schema-v1.0.0"
)
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| project_id | str | The Synapse ID of the project where the folder exists. |
| folder_id | str | The Synapse ID of the folder to upload the RecordSet to. |
| record_set_name | str | Name for the RecordSet. |
| record_set_description | str | Description for the RecordSet. |
| curation_task_name | str | Name for the CurationTask (used as data_type field). Must be unique within the project; if it matches an existing CurationTask, that task is updated with the new data. |
| upsert_keys | List[str] | List of column names to use as upsert keys. |
| instructions | str | Instructions for the curation task. |
| schema_uri | str | JSON schema URI for the RecordSet schema (e.g., 'sage.schemas.v2571-amp.Biospecimen.schema-0.0.1', 'sage.schemas.v2571-ad.Analysis.schema-0.0.0'). |
| bind_schema_to_record_set | bool | Whether to bind the given schema to the RecordSet (default: True). |
| enable_derived_annotations | bool | If true, enable derived annotations. Defaults to False. |
| synapse_client | Optional[Synapse] | If not passed in and caching was not disabled by |

| RETURNS | DESCRIPTION |
|---|---|
| Tuple[RecordSet, CurationTask, Grid] | Tuple containing the created RecordSet, CurationTask, and Grid objects. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If required parameters are missing or if schema_uri is not provided. |
| SynapseError | If there are issues with Synapse operations. |
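The upsert_keys parameter names the columns that identify a record. As a rough illustration of the general upsert idea (update a row when its key already exists, insert it otherwise), here is a minimal in-memory sketch. It is not how Synapse or the Grid implements upserts; the `upsert_rows` helper and the `tissue` column are hypothetical and exist only for this example.

```python
from typing import Dict, List, Tuple

Row = Dict[str, str]


def upsert_rows(existing: List[Row], incoming: List[Row], upsert_keys: List[str]) -> List[Row]:
    """Merge incoming rows into existing rows, matching on the upsert key columns."""
    # Index existing rows by the tuple of their key-column values (hypothetical helper).
    merged: Dict[Tuple[str, ...], Row] = {
        tuple(row[key] for key in upsert_keys): dict(row) for row in existing
    }
    for row in incoming:
        key = tuple(row[k] for k in upsert_keys)
        if key in merged:
            merged[key].update(row)  # same key: update the existing row
        else:
            merged[key] = dict(row)  # new key: insert as a new row
    return list(merged.values())


existing = [{"specimenID": "S1", "tissue": "blood"}]
incoming = [
    {"specimenID": "S1", "tissue": "plasma"},  # updates S1
    {"specimenID": "S2", "tissue": "brain"},   # inserted as a new record
]
result = upsert_rows(existing, incoming, upsert_keys=["specimenID"])
```

With `upsert_keys=["specimenID"]`, the row for S1 is overwritten rather than duplicated, which is why the key columns should uniquely identify a record.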
Source code in synapseclient/extensions/curator/record_based_metadata_task.py
generate_jsonld
generate_jsonld(schema: Any, data_model_labels: DisplayLabelType, output_jsonld: Optional[str], *, synapse_client: Optional[Synapse] = None) -> dict
Convert a CSV data model specification to JSON-LD format with validation and error checking.
This function parses your CSV data model (containing attributes, validation rules,
dependencies, and valid values), converts it to a graph-based JSON-LD representation,
validates the structure for common errors, and saves the result. The generated JSON-LD
file serves as input for generate_jsonschema() and other data model operations.
Data Model Requirements:
Your CSV should include columns defining:
- Attribute names: Property/attribute identifiers
- Display names: Human-readable labels (optional but recommended)
- Descriptions: Documentation for each attribute
- Valid values: Allowed enum values for attributes (comma-separated)
- Validation rules: Rules like list, regex, inRange, required, etc.
- Dependencies: Relationships between attributes using dependsOn
- Required status: Whether attributes are mandatory
Validation Checks Performed:
- Ensures all required fields (like displayName) are present
- Detects cycles in attribute dependencies (which would create invalid schemas)
- Checks for blacklisted characters in display names that Synapse doesn't allow
- Validates that attribute names don't conflict with reserved system names
- Verifies the graph structure is a valid directed acyclic graph (DAG)
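To make the cycle check concrete, the following is a minimal sketch of depth-first cycle detection over a dependency mapping. It is an illustration of the general technique, not the validator used by generate_jsonld; the `find_dependency_cycle` function and the attribute names are assumptions made for this example.

```python
from typing import Dict, List, Optional


def find_dependency_cycle(depends_on: Dict[str, List[str]]) -> Optional[List[str]]:
    """Return one cycle as a list of attribute names, or None if the graph is acyclic."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    state: Dict[str, int] = {}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        state[node] = GRAY
        path.append(node)
        for dep in depends_on.get(node, []):
            if state.get(dep, WHITE) == GRAY:
                # Back edge: dep is already on the current path, so we found a cycle.
                return path[path.index(dep):] + [dep]
            if state.get(dep, WHITE) == WHITE:
                cycle = visit(dep)
                if cycle:
                    return cycle
        path.pop()
        state[node] = BLACK
        return None

    for attribute in depends_on:
        if state.get(attribute, WHITE) == WHITE:
            cycle = visit(attribute)
            if cycle:
                return cycle
    return None


# Hypothetical dependency graphs: attribute -> attributes it depends on.
acyclic = {"Patient": ["Diagnosis"], "Diagnosis": ["Treatment"], "Treatment": []}
cyclic = {"Patient": ["Diagnosis"], "Diagnosis": ["Treatment"], "Treatment": ["Patient"]}
```

A graph with no cycle is a DAG, so a `None` result from this sketch corresponds to the last check in the list above.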
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| schema | Any | Path or URL to your data model CSV file. Can be a local file path or a URL (e.g., from GitHub). This file should contain your complete data model specification with all attributes, validation rules, and relationships. |
| data_model_labels | DisplayLabelType | Label format for the JSON-LD output: |
| output_jsonld | Optional[str] | Path where the JSON-LD file will be saved. If None, saves alongside the input CSV with a |
| synapse_client | Optional[Synapse] | Optional Synapse client instance for logging. If None, creates a new client instance. Use |
Output:
The function logs validation errors and warnings to help you fix data model issues before generating JSON schemas. Errors indicate critical problems that must be fixed, while warnings suggest improvements but won't block schema generation.
| RETURNS | DESCRIPTION |
|---|---|
| dict | The generated data model as a dictionary in JSON-LD format. The same data is also saved to the file path specified in |
Using this function to generate JSONLD Schema files:
Basic usage with default output path:
from synapseclient import Synapse
from synapseclient.extensions.curator import generate_jsonld
syn = Synapse()
syn.login()
jsonld_model = generate_jsonld(
schema="path/to/my_data_model.csv",
data_model_labels="class_label",
output_jsonld=None, # Saves to my_data_model.jsonld
synapse_client=syn
)
Specify custom output path:
jsonld_model = generate_jsonld(
schema="models/patient_model.csv",
data_model_labels="class_label",
output_jsonld="~/output/patient_model_v1.jsonld",
synapse_client=syn
)
Use display labels:
jsonld_model = generate_jsonld(
schema="my_model.csv",
data_model_labels="display_label",
output_jsonld="my_model.jsonld",
synapse_client=syn
)
Load from URL:
jsonld_model = generate_jsonld(
schema="https://raw.githubusercontent.com/org/repo/main/model.csv",
data_model_labels="class_label",
output_jsonld="downloaded_model.jsonld",
synapse_client=syn
)
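When output_jsonld is None, the first example indicates the file lands next to the input CSV as my_data_model.jsonld. A minimal sketch of deriving such a path with pathlib follows; the `default_jsonld_path` helper is hypothetical, and the assumption that only the file extension changes is inferred from that example rather than confirmed by the library.

```python
from pathlib import Path


def default_jsonld_path(csv_path: str) -> Path:
    """Swap the CSV extension for .jsonld, keeping the same directory and stem.

    Hypothetical helper: assumes the default output differs from the input
    only by its extension, as the basic usage example suggests.
    """
    return Path(csv_path).with_suffix(".jsonld")


derived = default_jsonld_path("path/to/my_data_model.csv")
```

Passing an explicit output_jsonld avoids relying on this default altogether.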
Source code in synapseclient/extensions/curator/schema_generation.py
generate_jsonschema
generate_jsonschema(data_model_source: str, synapse_client: Synapse, data_types: Optional[list[str]] = None, output: Optional[str] = None, data_model_labels: DisplayLabelType = 'class_label') -> tuple[list[dict[str, Any]], list[str]]
Generate JSON Schema files from a data model.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| data_model_source | str | Path or URL to the data model file (CSV or JSONLD). Can accept: - A local CSV file with your data model specification (will be parsed automatically) - A local JSONLD file generated from |
| synapse_client | Synapse | Synapse client instance for logging. Use |
| data_types | Optional[list[str]] | List of specific data types to generate schemas for. If None, generates schemas for all data types in the data model. |
| output | Optional[str] | One of: None, a directory path, or a file path. - If None, schemas will be written to the current working directory, with filenames formatted as |
| data_model_labels | DisplayLabelType | Label format for properties in the generated schema: |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[list[dict[str, Any]], list[str]] | A tuple containing: - A list of JSON schema dictionaries, each corresponding to a data type - A list of file paths where the schemas were written |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If a single output file is specified but multiple data types are requested. |
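The interaction between output and data_types can be sketched as follows. This is an illustrative approximation of the documented behavior, not the library's implementation: the `plan_schema_paths` helper is hypothetical, detecting a file path by its `.json` suffix is an assumption, and the `{data_type}.schema.json` filename pattern is a placeholder chosen for this example.

```python
from pathlib import Path
from typing import List, Optional


def plan_schema_paths(output: Optional[str], data_types: List[str]) -> List[Path]:
    """Decide where each schema would be written for a given `output` value."""
    if output is not None and output.endswith(".json"):
        # Treated as a single file path (assumption: detected via the .json suffix).
        if len(data_types) > 1:
            raise ValueError(
                "A single output file cannot hold schemas for multiple data types."
            )
        return [Path(output)]
    # None means the current working directory; anything else is treated as a directory.
    directory = Path.cwd() if output is None else Path(output)
    # Placeholder filename pattern, used here only for illustration.
    return [directory / f"{data_type}.schema.json" for data_type in data_types]


single = plan_schema_paths("output.json", ["Patient"])
several = plan_schema_paths("./schemas", ["Patient", "Biospecimen"])
```

The practical takeaway is to pass a directory whenever more than one data type is requested, and reserve a file path for the single-schema case.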
Using this function to generate JSON Schema files:
Generate schema for one datatype:
from synapseclient import Synapse
from synapseclient.extensions.curator import generate_jsonschema
syn = Synapse()
syn.login()
schemas, file_paths = generate_jsonschema(
data_model_source="path/to/model.csv",
output="output.json",
data_types=["Patient"],
synapse_client=syn
)
Generate schema for specific data types:
schemas, file_paths = generate_jsonschema(
data_model_source="path/to/model.csv",
output="./schemas",
data_types=["Patient", "Biospecimen"],
synapse_client=syn
)
Generate schemas for all data types:
schemas, file_paths = generate_jsonschema(
data_model_source="path/to/model.csv",
output="./schemas",
synapse_client=syn
)
Generate schema from CSV URL:
schemas, file_paths = generate_jsonschema(
data_model_source="https://raw.githubusercontent.com/org/repo/main/model.csv",
output="./schemas",
data_types=None,
data_model_labels="class_label",
synapse_client=syn
)
Source code in synapseclient/extensions/curator/schema_generation.py
bind_jsonschema
bind_jsonschema(entity_id: str, json_schema_uri: str, enable_derived_annotations: bool = False, synapse_client: Optional[Synapse] = None) -> JSONSchemaBinding
Bind a JSON schema to a Synapse entity.
This function binds a JSON schema to a Synapse entity using the Entity OOP model's bind_schema method.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| entity_id | str | The Synapse ID of the entity to bind the schema to (e.g., syn12345678). |
| json_schema_uri | str | The URI of the JSON Schema to bind (e.g., 'my.org-schema.name-1.0.0'). |
| enable_derived_annotations | bool | If true, enable derived annotations. Defaults to False. |
| synapse_client | Optional[Synapse] | If not passed in and caching was not disabled by |

| RETURNS | DESCRIPTION |
|---|---|
| JSONSchemaBinding | The JSONSchemaBinding object containing the binding details. |
Bind a JSON schema to an entity
from synapseclient import Synapse
from synapseclient.extensions.curator import bind_jsonschema
syn = Synapse()
syn.login()
result = bind_jsonschema(
entity_id="syn12345678",
json_schema_uri="my.org-my.schema-0.0.1",
enable_derived_annotations=True,
synapse_client=syn
)
print(f"Successfully bound schema: {result}")
Source code in synapseclient/extensions/curator/schema_management.py
register_jsonschema
register_jsonschema(schema_path: str, organization_name: str, schema_name: str, schema_version: Optional[str] = None, synapse_client: Optional[Synapse] = None) -> JSONSchema
Register a JSON schema to a Synapse organization.
This function loads a JSON schema from a file and registers it with a specified organization in Synapse using the JSONSchema OOP model.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| schema_path | str | Path to the JSON schema file to register. |
| organization_name | str | Name of the organization to register the schema under. |
| schema_name | str | The name of the JSON schema. |
| schema_version | Optional[str] | Optional version of the schema (e.g., '0.0.1'). If not specified, a version will be auto-generated. |
| synapse_client | Optional[Synapse] | If not passed in and caching was not disabled by |

| RETURNS | DESCRIPTION |
|---|---|
| JSONSchema | The registered JSONSchema object. |
Register a JSON schema
from synapseclient import Synapse
from synapseclient.extensions.curator import register_jsonschema
syn = Synapse()
syn.login()
json_schema = register_jsonschema(
schema_path="/path/to/schema.json",
organization_name="my.org",
schema_name="my.schema",
schema_version="0.0.1",
synapse_client=syn
)
print(f"Registered schema URI: {json_schema.uri}")
print(f"Schema version: {json_schema.version}")
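The example URIs on this page (for instance "my.org-my.schema-0.0.1") appear to follow an organization-name-version pattern. Assuming that pattern holds, the sketch below shows how the three register_jsonschema arguments relate to the URI later passed to bind_jsonschema. Both helpers are hypothetical and are not part of synapseclient.

```python
from typing import Tuple


def compose_schema_uri(organization_name: str, schema_name: str, schema_version: str) -> str:
    """Join the parts with hyphens, assuming an organization-name-version URI layout."""
    return f"{organization_name}-{schema_name}-{schema_version}"


def split_schema_uri(uri: str) -> Tuple[str, str, str]:
    """Split a URI into (organization, name, version).

    Assumes the organization contains no hyphen and the version is the
    final hyphen-separated segment.
    """
    organization, rest = uri.split("-", 1)
    name, version = rest.rsplit("-", 1)
    return organization, name, version


uri = compose_schema_uri("my.org", "my.schema", "0.0.1")
parts = split_schema_uri("sage.schemas.v2571-amp.Biospecimen.schema-0.0.1")
```

In practice, reading `json_schema.uri` from the returned object is safer than composing the string by hand.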
Source code in synapseclient/extensions/curator/schema_management.py
query_schema_registry
query_schema_registry(synapse_client: Optional[Synapse] = None, schema_registry_table_id: Optional[str] = None, column_config: Optional[SchemaRegistryColumnConfig] = None, return_latest_only: bool = True, **filters) -> Union[str, List[str], None]
Query the schema registry table to find schemas matching the provided filters.
This function searches the Synapse schema registry table for schemas that match the provided filter parameters. Results are sorted by version in descending order (newest first). The function supports any number of filter parameters as long as they are configured in the column_config.
| PARAMETER | TYPE | DESCRIPTION |
|---|---|---|
| synapse_client | Optional[Synapse] | Optional authenticated Synapse client instance. |
| schema_registry_table_id | Optional[str] | Optional Synapse ID of the schema registry table. If None, uses the default table ID. |
| column_config | Optional[SchemaRegistryColumnConfig] | Optional configuration for custom column names. If None, uses default configuration ('version' and 'uri' columns). |
| return_latest_only | bool | If True (default), returns only the latest URI as a string. If False, returns all matching URIs as a list of strings. |
| **filters | | Filter parameters to search for matching schemas. These work as follows: |

| RETURNS | DESCRIPTION |
|---|---|
| Union[str, List[str], None] | If return_latest_only is True: Single URI string of the latest version, or None if not found. |
| Union[str, List[str], None] | If return_latest_only is False: List of URI strings sorted by version (highest version first). |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If no filter parameters are provided. |
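Sorting by version "highest first" only works as expected when versions are compared numerically, since a plain string sort would place "9.0.0" above "12.0.0". The sketch below shows one way to get that ordering client-side. It is an illustration under the assumption that versions are dot-separated integers; it does not describe how query_schema_registry sorts internally, and both helpers are hypothetical.

```python
from typing import List, Tuple


def version_key(version: str) -> Tuple[int, ...]:
    """Turn '12.0.0' into (12, 0, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def sort_newest_first(rows: List[Tuple[str, str]]) -> List[str]:
    """rows are (version, uri) pairs; return the URIs with the highest version first."""
    ordered = sorted(rows, key=lambda row: version_key(row[0]), reverse=True)
    return [uri for _, uri in ordered]


rows = [
    ("9.0.0", "sage.schemas.v2571-mc2.Biospecimen.schema-9.0.0"),
    ("12.0.0", "MultiConsortiaCoordinatingCenter-Biospecimen-12.0.0"),
]
uris = sort_newest_first(rows)
latest = uris[0] if uris else None  # mirrors return_latest_only=True
```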
Expected Table Structure
The schema registry table should contain columns for:
- Schema version for sorting (default: 'version')
- JSON schema URI (default: 'uri')
- Any filterable columns as configured in column_config
Additional columns may be present and will be included in results.
Comprehensive filter usage demonstrations
This includes several examples of how to use the filtering system.
Basic Filtering (using default filters):
from synapseclient import Synapse
from synapseclient.extensions.curator import query_schema_registry
syn = Synapse()
syn.login()
# 1. Get latest schema URI for a specific DCC and datatype
latest_uri = query_schema_registry(
synapse_client=syn,
dcc="ad", # Exact match for Alzheimer's Disease DCC
datatype="Analysis" # Exact datatype match
)
# Returns: "sage.schemas.v2571-ad.Analysis.schema-0.0.0"
# 2. Get all versions of matching schemas (not just latest)
all_versions = query_schema_registry(
synapse_client=syn,
dcc="mc2",
datatype="Biospecimen",
return_latest_only=False
)
# Returns: ["MultiConsortiaCoordinatingCenter-Biospecimen-12.0.0",
# "sage.schemas.v2571-mc2.Biospecimen.schema-9.0.0"]
# 3. Exact match on a single column
# Find all "Biospecimen" schemas across all DCCs
biospecimen_schemas = query_schema_registry(
synapse_client=syn,
datatype="Biospecimen", # Exact match for Biospecimen
return_latest_only=False
)
# Returns: ["MultiConsortiaCoordinatingCenter-Biospecimen-12.0.0",
# "sage.schemas.v2571-mc2.Biospecimen.schema-9.0.0",
# "sage.schemas.v2571-veo.Biospecimen.schema-0.3.0",
# "sage.schemas.v2571-amp.Biospecimen.schema-0.0.1"]
# 4. Pattern matching for DCC variations
mc2_schemas = query_schema_registry(
synapse_client=syn,
dcc="%C2", # Matches 'mc2' and 'MC2'
return_latest_only=False
)
# Returns schemas from both 'mc2' and 'MC2' DCCs
# 5. Using additional columns for filtering (if they exist in your table)
specific_schemas = query_schema_registry(
synapse_client=syn,
dcc="amp", # Must be AMP DCC
org="sage.schemas.v2571", # Must match organization
return_latest_only=False
)
# Returns schemas that match BOTH conditions
Direct Column Filtering (simplified approach):
# Any column in the schema registry table can be used for filtering
# Just use the column name directly as a keyword argument
# Basic filters using standard columns
query_schema_registry(dcc="ad", datatype="Analysis")
query_schema_registry(version="0.0.0")
query_schema_registry(uri="sage.schemas.v2571-ad.Analysis.schema-0.0.0")
# Additional columns (if they exist in your table)
query_schema_registry(org="sage.schemas.v2571")
query_schema_registry(name="ad.Analysis.schema")
# Multiple column filters (all must match)
query_schema_registry(
dcc="mc2",
datatype="Biospecimen",
org="MultiConsortiaCoordinatingCenter"
)
Filter Value Examples with Real Data:
# Exact matching
query_schema_registry(dcc="ad") # Returns schemas with dcc="ad"
query_schema_registry(datatype="Biospecimen") # Returns schemas with datatype="Biospecimen"
query_schema_registry(dcc="MC2") # Returns schemas with dcc="MC2" (case sensitive)
# Pattern matching with wildcards
query_schema_registry(dcc="%C2") # Matches "mc2", "MC2"
query_schema_registry(datatype="%spec%") # Matches "Biospecimen"
# Examples with expected results:
query_schema_registry(dcc="ad", datatype="Analysis")
# Returns: "sage.schemas.v2571-ad.Analysis.schema-0.0.0"
query_schema_registry(datatype="Biospecimen", return_latest_only=False)
# Returns: ["MultiConsortiaCoordinatingCenter-Biospecimen-12.0.0",
# "sage.schemas.v2571-mc2.Biospecimen.schema-9.0.0", ...]
# Multiple conditions (all must be true)
query_schema_registry(
dcc="amp", # AND
datatype="Biospecimen", # AND
org="sage.schemas.v2571" # AND (if org column exists)
)
# Returns: "sage.schemas.v2571-amp.Biospecimen.schema-0.0.1" (a single string, since return_latest_only defaults to True)
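The filter rules shown in these examples (exact match by default, pattern match when the value contains `%`, all conditions combined with AND, and an error when no filter is given) can be summarized in a small sketch. This is only an approximation of the documented behavior: the `build_where_clause` helper is hypothetical and the actual query constructed by query_schema_registry may differ.

```python
from typing import Dict, List


def build_where_clause(filters: Dict[str, str]) -> str:
    """Translate keyword filters into a SQL-style WHERE clause (illustrative only)."""
    if not filters:
        raise ValueError("At least one filter parameter is required.")
    conditions: List[str] = []
    for column, value in filters.items():
        # Values containing '%' are treated as patterns; everything else is an exact match.
        operator = "LIKE" if "%" in value else "="
        escaped = value.replace("'", "''")  # escape single quotes inside the literal
        conditions.append(f"{column} {operator} '{escaped}'")
    # Every condition must hold, so they are joined with AND.
    return " AND ".join(conditions)


clause = build_where_clause({"dcc": "amp", "datatype": "%spec%"})
```

Column names are interpolated without validation in this sketch, so it should not be used with untrusted input.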
Source code in synapseclient/extensions/curator/schema_registry.py