Environment Variables

The following is a summary of a few important environment variables that expose levers for controlling how DataHub works.

Feature Flags

| Variable | Default | Unit/Type | Components | Description |
| --- | --- | --- | --- | --- |
| UI_INGESTION_ENABLED | true | boolean | [GMS, MCE Consumer] | Enable UI based ingestion. |
| DATAHUB_ANALYTICS_ENABLED | true | boolean | [Frontend, GMS] | Enable analytics within DataHub. |
| BOOTSTRAP_SYSTEM_UPDATE_WAIT_FOR_SYSTEM_UPDATE | true | boolean | [GMS, MCE Consumer, MAE Consumer] | Wait for the system-update to complete before starting. This should typically only be disabled during development. |

Ingestion

| Variable | Default | Unit/Type | Components | Description |
| --- | --- | --- | --- | --- |
| ASYNC_INGESTION_DEFAULT | false | boolean | [GMS] | Asynchronously process ingestProposals by writing the ingestion MCP to Kafka. Typically enabled with standalone consumers. |
| MCP_CONSUMER_ENABLED | true | boolean | [GMS, MCE Consumer] | When running in standalone mode, disabled on GMS and enabled on the separate MCE Consumer. |
| MCL_CONSUMER_ENABLED | true | boolean | [GMS, MAE Consumer] | When running in standalone mode, disabled on GMS and enabled on the separate MAE Consumer. |
| PE_CONSUMER_ENABLED | true | boolean | [GMS, MAE Consumer] | When running in standalone mode, disabled on GMS and enabled on the separate MAE Consumer. |
| ES_BULK_REQUESTS_LIMIT | 1000 | docs | [GMS, MAE Consumer] | Number of documents to index per bulk request. Applies to the MAE Consumer when running in standalone mode. |
| ES_BULK_FLUSH_PERIOD | 1 | seconds | [GMS, MAE Consumer] | How frequently indexed documents are made available for query. |
| ALWAYS_EMIT_CHANGE_LOG | false | boolean | [GMS] | Always emit an MCL, even when no changes are detected. Used for Time Based Lineage when no changes occur. |
| GRAPH_SERVICE_DIFF_MODE_ENABLED | true | boolean | [GMS] | Enables diff mode for graph writes. Uses a different code path that computes a diff between the previous and next state to write relationships, instead of deleting all edges and re-adding them. |
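
The consumer flags above imply a different set of values per component depending on whether standalone consumers are deployed. The following is a minimal sketch of that mapping, not an official deployment recipe. The function name `consumer_env` is hypothetical, and turning on `ASYNC_INGESTION_DEFAULT` in standalone mode is an assumption based on the "typically enabled" note in the table.

```python
from typing import Dict


def consumer_env(standalone: bool) -> Dict[str, Dict[str, str]]:
    """Return illustrative consumer-related env vars, keyed by component."""
    if not standalone:
        # Table defaults: GMS runs the MCP, MCL, and PE consumers itself.
        return {
            "GMS": {
                "ASYNC_INGESTION_DEFAULT": "false",
                "MCP_CONSUMER_ENABLED": "true",
                "MCL_CONSUMER_ENABLED": "true",
                "PE_CONSUMER_ENABLED": "true",
            }
        }
    return {
        # In standalone mode the consumers are disabled on GMS.
        "GMS": {
            # Assumption: async ingestion is "typically" enabled here.
            "ASYNC_INGESTION_DEFAULT": "true",
            "MCP_CONSUMER_ENABLED": "false",
            "MCL_CONSUMER_ENABLED": "false",
            "PE_CONSUMER_ENABLED": "false",
        },
        # The separate consumers take over the corresponding work.
        "MCE Consumer": {"MCP_CONSUMER_ENABLED": "true"},
        "MAE Consumer": {
            "MCL_CONSUMER_ENABLED": "true",
            "PE_CONSUMER_ENABLED": "true",
        },
    }
```

Calling `consumer_env(standalone=True)["GMS"]` yields the values to set on the GMS container, and the other two keys cover the separate consumers.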

Caching

| Variable | Default | Unit/Type | Components | Description |
| --- | --- | --- | --- | --- |
| SEARCH_SERVICE_ENABLE_CACHE | false | boolean | [GMS] | Enable caching of search results. |
| SEARCH_SERVICE_CACHE_IMPLEMENTATION | caffeine | string | [GMS] | Set to hazelcast if the number of GMS replicas > 1 for enabling distributed cache. |
| CACHE_TTL_SECONDS | 600 | seconds | [GMS] | Default cache time to live. |
| CACHE_MAX_SIZE | 10000 | objects | [GMS] | Maximum number of items to cache. |
| LINEAGE_SEARCH_CACHE_ENABLED | true | boolean | [GMS] | Enables in-memory cache for searchAcrossLineage query. |
| CACHE_ENTITY_COUNTS_TTL_SECONDS | 600 | seconds | [GMS] | Homepage entity count time to live. |
| CACHE_SEARCH_LINEAGE_TTL_SECONDS | 86400 | seconds | [GMS] | Search lineage cache time to live. |
| CACHE_SEARCH_LINEAGE_LIGHTNING_THRESHOLD | 300 | objects | [GMS] | Lineage graphs exceeding this limit will use a local cache. |
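
The choice of `SEARCH_SERVICE_CACHE_IMPLEMENTATION` depends on the number of GMS replicas. As a rough sketch of that rule, the hypothetical helper below (`search_cache_env` is not part of DataHub) picks the implementation from a replica count. It only encodes the guidance in the table and says nothing about how either cache is configured internally.

```python
from typing import Dict


def search_cache_env(gms_replicas: int, enable_cache: bool = True) -> Dict[str, str]:
    """Pick search cache env vars from the number of GMS replicas."""
    if gms_replicas < 1:
        raise ValueError("gms_replicas must be at least 1")
    # Per the table: use hazelcast (distributed) when there is more than
    # one GMS replica, otherwise keep the caffeine default.
    implementation = "hazelcast" if gms_replicas > 1 else "caffeine"
    return {
        "SEARCH_SERVICE_ENABLE_CACHE": str(enable_cache).lower(),
        "SEARCH_SERVICE_CACHE_IMPLEMENTATION": implementation,
    }
```

For example, `search_cache_env(3)` selects `hazelcast`, while `search_cache_env(1)` keeps `caffeine`.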

Search

| Variable | Default | Unit/Type | Components | Description |
| --- | --- | --- | --- | --- |
| INDEX_PREFIX | (empty) | string | [GMS, MAE Consumer, Elasticsearch Setup, System Update] | Prefix Elasticsearch indices with the given string. |
| ELASTICSEARCH_NUM_SHARDS_PER_INDEX | 1 | integer | [System Update] | Default number of shards per Elasticsearch index. |
| ELASTICSEARCH_NUM_REPLICAS_PER_INDEX | 1 | integer | [System Update] | Default number of replicas per Elasticsearch index. |
| ELASTICSEARCH_BUILD_INDICES_RETENTION_VALUE | 60 | integer | [System Update] | Number of units for the retention of Elasticsearch clone/backup indices. |
| ELASTICSEARCH_BUILD_INDICES_RETENTION_UNIT | DAYS | string | [System Update] | Unit for the retention of Elasticsearch clone/backup indices. |
| ELASTICSEARCH_QUERY_EXACT_MATCH_EXCLUSIVE | false | boolean | [GMS] | Only return exact matches when using quotes. |
| ELASTICSEARCH_QUERY_EXACT_MATCH_WITH_PREFIX | true | boolean | [GMS] | Include prefix match in exact match results. |
| ELASTICSEARCH_QUERY_EXACT_MATCH_FACTOR | 10.0 | float | [GMS] | Multiply by this number on a true exact match. |
| ELASTICSEARCH_QUERY_EXACT_MATCH_PREFIX_FACTOR | 1.6 | float | [GMS] | Multiply by this number on a prefix match. |
| ELASTICSEARCH_QUERY_EXACT_MATCH_CASE_FACTOR | 0.7 | float | [GMS] | Multiply by this number on a case-insensitive match. |
| ELASTICSEARCH_QUERY_EXACT_MATCH_ENABLE_STRUCTURED | true | boolean | [GMS] | When using structured query, also include exact matches. |
| ELASTICSEARCH_QUERY_PARTIAL_URN_FACTOR | 0.5 | float | [GMS] | Multiply by this number on a partial token match on a URN. |
| ELASTICSEARCH_QUERY_PARTIAL_FACTOR | 0.4 | float | [GMS] | Multiply by this number on a partial token match on a non-URN field. |
| ELASTICSEARCH_QUERY_CUSTOM_CONFIG_ENABLED | false | boolean | [GMS] | Enable search query and ranking customization configuration. |
| ELASTICSEARCH_QUERY_CUSTOM_CONFIG_FILE | search_config.yml | string | [GMS] | The location of the search customization configuration. |
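
The two `ELASTICSEARCH_BUILD_INDICES_RETENTION_*` variables are read as a pair: a number and the unit it is counted in. A minimal sketch of combining them into a single duration is shown below. The helper name `retention_period` and the set of accepted units other than `DAYS` are assumptions made for illustration; how DataHub itself applies the retention is not described here.

```python
from datetime import timedelta
from typing import Mapping

# Assumption: only DAYS appears in the table; the other units are
# illustrative guesses at what a deployment might use.
_UNIT_TO_TIMEDELTA_ARG = {"HOURS": "hours", "DAYS": "days", "WEEKS": "weeks"}


def retention_period(env: Mapping[str, str]) -> timedelta:
    """Combine the retention value and unit into one timedelta."""
    value = int(env.get("ELASTICSEARCH_BUILD_INDICES_RETENTION_VALUE", "60"))
    unit = env.get("ELASTICSEARCH_BUILD_INDICES_RETENTION_UNIT", "DAYS").upper()
    if unit not in _UNIT_TO_TIMEDELTA_ARG:
        raise ValueError(f"unsupported retention unit in this sketch: {unit}")
    return timedelta(**{_UNIT_TO_TIMEDELTA_ARG[unit]: value})
```

With no overrides, `retention_period({})` returns the 60-day default from the table; passing `os.environ` would read the values actually set for the process.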

Kafka

Beyond the variables listed in this section, many more Kafka configuration environment variables exist for both the producer and the consumers; they are defined in the official Spring Kafka documentation. These environment variables follow the standard Spring representation of properties as environment variables: replace each dot (.) with an underscore (_) and convert the name to uppercase.
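
As an illustration of that naming rule, the sketch below converts a Spring property name into its environment variable form. The function name `spring_property_to_env` is hypothetical. Removing dashes is not mentioned above; it is included because Spring Boot's documented binding rules for environment variables also drop them.

```python
def spring_property_to_env(prop: str) -> str:
    """Convert a Spring property name to its environment variable name."""
    # Dots become underscores and the result is uppercased, as described
    # above. Dashes are removed, following Spring Boot's binding rules.
    return prop.replace(".", "_").replace("-", "").upper()


print(spring_property_to_env("spring.kafka.producer.properties.max.request.size"))
# SPRING_KAFKA_PRODUCER_PROPERTIES_MAX_REQUEST_SIZE
```

The printed name matches the `SPRING_KAFKA_PRODUCER_PROPERTIES_MAX_REQUEST_SIZE` variable in the Kafka table.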

| Variable | Default | Unit/Type | Components | Description |
| --- | --- | --- | --- | --- |
| KAFKA_LISTENER_CONCURRENCY | 1 | integer | [GMS, MCE Consumer, MAE Consumer] | Number of Kafka consumer threads. Optimize throughput by matching to topic partitions. |
| SPRING_KAFKA_PRODUCER_PROPERTIES_MAX_REQUEST_SIZE | 1048576 | bytes | [GMS, MCE Consumer, MAE Consumer] | Max produced message size. Note that the topic configuration is not controlled by this variable. |
| SCHEMA_REGISTRY_TYPE | INTERNAL | string | [GMS, MCE Consumer, MAE Consumer] | Schema registry implementation. One of INTERNAL, KAFKA, or AWS_GLUE. |
| KAFKA_SCHEMAREGISTRY_URL | http://localhost:8080/schema-registry/api/ | string | [GMS, MCE Consumer, MAE Consumer] | Schema registry URL. Used for INTERNAL and KAFKA. The default value is for the GMS component. For the MCE Consumer and MAE Consumer, use the GMS hostname and port. |
| AWS_GLUE_SCHEMA_REGISTRY_REGION | us-east-1 | string | [GMS, MCE Consumer, MAE Consumer] | Region of the schema registry. Used when SCHEMA_REGISTRY_TYPE is AWS_GLUE. |
| AWS_GLUE_SCHEMA_REGISTRY_NAME | (empty) | string | [GMS, MCE Consumer, MAE Consumer] | Name of the schema registry. Used when SCHEMA_REGISTRY_TYPE is AWS_GLUE. |
| USE_CONFLUENT_SCHEMA_REGISTRY | true | boolean | [kafka-setup] | Enable Confluent schema registry configuration. |
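
Which schema registry variables matter depends on `SCHEMA_REGISTRY_TYPE`. The following sketch checks a set of environment values against the relationships stated in the table. The function `check_schema_registry_env` is hypothetical, and treating an empty `AWS_GLUE_SCHEMA_REGISTRY_NAME` or a non-HTTP URL as a problem is an assumption made for illustration, not documented DataHub behavior.

```python
from typing import List, Mapping

VALID_TYPES = {"INTERNAL", "KAFKA", "AWS_GLUE"}
DEFAULT_URL = "http://localhost:8080/schema-registry/api/"


def check_schema_registry_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found in schema registry settings."""
    problems: List[str] = []
    registry_type = env.get("SCHEMA_REGISTRY_TYPE", "INTERNAL")
    if registry_type not in VALID_TYPES:
        problems.append(f"SCHEMA_REGISTRY_TYPE must be one of {sorted(VALID_TYPES)}")
    elif registry_type == "AWS_GLUE":
        # Assumption: a Glue registry needs a non-empty name.
        if not env.get("AWS_GLUE_SCHEMA_REGISTRY_NAME", ""):
            problems.append("AWS_GLUE_SCHEMA_REGISTRY_NAME is empty")
    else:
        # INTERNAL and KAFKA both use KAFKA_SCHEMAREGISTRY_URL.
        url = env.get("KAFKA_SCHEMAREGISTRY_URL", DEFAULT_URL)
        # Assumption: the URL is expected to be an HTTP(S) endpoint.
        if not url.startswith(("http://", "https://")):
            problems.append("KAFKA_SCHEMAREGISTRY_URL is not an http(s) URL")
    return problems
```

An empty result means the sketch found nothing to flag; for instance, the table defaults pass, while `AWS_GLUE` without a registry name is reported.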