Commit 30a77c0

fix(ingestion/classifier): temporary measure to avoid deadlocks for classifier (#12261)
1 parent ba8bf53 commit 30a77c0

2 files changed, +4 -5 lines changed

metadata-ingestion/docs/dev_guides/classification.md

+2-2
Original file line numberDiff line numberDiff line change
@@ -7,10 +7,10 @@ The classification feature enables sources to be configured to automatically pre
77
Note that a `.` is used to denote nested fields in the YAML recipe.
88

99
| Field | Required | Type | Description | Default |
10-
| ------------------------- | -------- | --------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ---------------------------------------------------------- |
10+
| ------------------------- | -------- | --------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |------------------------------------------------------------|
1111
| enabled | | boolean | Whether classification should be used to auto-detect glossary terms | False |
1212
| sample_size | | int | Number of sample values used for classification. | 100 |
13-
| max_workers | | int | Number of worker processes to use for classification. Set to 1 to disable. | Number of cpu cores or 4 |
13+
| max_workers | | int | Number of worker processes to use for classification. Note that any number above 1 might lead to a deadlock. Set to 1 to disable. | 1 |
1414
| info_type_to_term | | Dict[str,string] | Optional mapping to provide glossary term identifier for info type. | By default, info type is used as glossary term identifier. |
1515
| classifiers | | Array of object | Classifiers to use to auto-detect glossary terms. If more than one classifier, infotype predictions from the classifier defined later in sequence take precedance. | [{'type': 'datahub', 'config': None}] |
1616
| table_pattern | | AllowDenyPattern (see below for fields) | Regex patterns to filter tables for classification. This is used in combination with other patterns in parent config. Specify regex to match the entire table name in `database.schema.table` format. e.g. to match all tables starting with customer in Customer database and public schema, use the regex 'Customer.public.customer.*' | {'allow': ['.*'], 'deny': [], 'ignoreCase': True} |
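
For orientation, the defaults listed in the table after this change could be written out as a plain Python dict. This is only an illustrative sketch: the names `classification_defaults` and `with_overrides` are hypothetical and not part of DataHub, and the values are copied from the Default column above.

```python
# Hypothetical stand-in for a classification config block, using the
# defaults shown in the table after this commit. Not DataHub's own code.
classification_defaults = {
    "enabled": False,
    "sample_size": 100,
    "max_workers": 1,  # new default; values above 1 might lead to a deadlock
    "classifiers": [{"type": "datahub", "config": None}],
    "table_pattern": {"allow": [".*"], "deny": [], "ignoreCase": True},
}


def with_overrides(overrides):
    """Return a shallow copy of the defaults with top-level keys replaced."""
    merged = dict(classification_defaults)
    merged.update(overrides)
    return merged
```

Enabling classification without touching `max_workers` would then keep the single-worker default, for example `with_overrides({"enabled": True})`.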

metadata-ingestion/src/datahub/ingestion/glossary/classifier.py

+2 -3
@@ -1,4 +1,3 @@
-import os
 from abc import ABCMeta, abstractmethod
 from dataclasses import dataclass
 from typing import Any, Dict, List, Optional
@@ -38,8 +37,8 @@ class ClassificationConfig(ConfigModel):
     )
 
     max_workers: int = Field(
-        default=(os.cpu_count() or 4),
-        description="Number of worker processes to use for classification. Set to 1 to disable.",
+        default=1,
+        description="Number of worker processes to use for classification. Note that any number above 1 might lead to a deadlock. Set to 1 to disable.",
     )
 
     table_pattern: AllowDenyPattern = Field(
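
The field description says that setting `max_workers` to 1 disables the worker processes. A minimal sketch of how such a switch is commonly written is shown below. It is an assumption for illustration only, not the code DataHub uses: `run_classification` is a hypothetical helper, and the standard-library `ProcessPoolExecutor` stands in for whatever pool the classifier actually relies on.

```python
from concurrent.futures import ProcessPoolExecutor
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
R = TypeVar("R")


def run_classification(
    fn: Callable[[T], R], items: Iterable[T], max_workers: int = 1
) -> List[R]:
    """Apply fn to each item, using worker processes only if max_workers > 1."""
    if max_workers < 1:
        raise ValueError("max_workers must be at least 1")
    if max_workers == 1:
        # Serial path: no worker processes are created, so there is no
        # pool that could deadlock.
        return [fn(item) for item in items]
    # Parallel path: fn and the items must be picklable.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fn, items))
```

With the default of 1 the call stays in the current process, which is the behavior this commit makes the default as a temporary measure.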
