Integrating TFDV into a Machine Learning Pipeline

백악기작은펭귄 2022. 1. 3. 17:39

TFX provides a pipeline component called StatisticsGen. It takes the output of the upstream ExampleGen component as its input and computes statistics over that data.

from tfx.components import StatisticsGen
from tfx.orchestration.experimental.interactive.interactive_context import InteractiveContext

context = InteractiveContext()

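# example_gen is the ExampleGen component run in an earlier step.
# Its output channel key is 'examples' (plural).
statistics_gen = StatisticsGen(examples=example_gen.outputs['examples'])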
context.run(statistics_gen)

With InteractiveContext, the output can be visualized as follows.

context.show(statistics_gen.outputs['statistics'])

Figure: Statistics generated by the StatisticsGen component (source: official TensorFlow guide)
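
To make concrete what kind of summary this visualization shows, here is a minimal pure-Python sketch that computes a few per-feature statistics over a list of records. It is a conceptual illustration only: `summarize_feature` and the sample records are hypothetical, and this is not how TFDV or StatisticsGen computes statistics internally.

```python
from typing import Any, Dict, List


def summarize_feature(records: List[Dict[str, Any]], name: str) -> Dict[str, Any]:
    """Compute simple statistics for one numeric feature (hypothetical helper)."""
    # Keep only the records where the feature is present and not None.
    values = [r[name] for r in records if r.get(name) is not None]
    missing = len(records) - len(values)
    if not values:
        return {"count": 0, "missing": missing, "mean": None, "min": None, "max": None}
    return {
        "count": len(values),
        "missing": missing,
        "mean": sum(values) / len(values),
        "min": min(values),
        "max": max(values),
    }


# Made-up sample data, for illustration only.
records = [{"age": 30}, {"age": 40}, {"age": None}, {"age": 50}]
stats = summarize_feature(records, "age")
print(stats)
```

The real component produces far richer output (histograms, quantiles, per-split statistics), but the idea is the same: one compact summary per feature that later steps can compare against a schema.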

You can also generate a schema with the SchemaGen component. This component generates a schema only when one does not already exist.

from tfx.components import SchemaGen

schema_gen = SchemaGen(statistics=statistics_gen.outputs['statistics'],
                       infer_feature_shape=True)
context.run(schema_gen)
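
As a rough picture of what schema inference means, the sketch below derives a toy schema (value type and whether the feature is always present) from a handful of records. It assumes nothing about SchemaGen's internals: `infer_schema`, the schema layout, and the sample data are all hypothetical and far simpler than the schema TFDV produces.

```python
from typing import Any, Dict, List


def infer_schema(records: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Infer a toy schema: one entry per feature with its type and presence."""
    schema: Dict[str, Dict[str, Any]] = {}
    feature_names = {name for record in records for name in record}
    for name in sorted(feature_names):
        values = [r[name] for r in records if r.get(name) is not None]
        type_names = {type(v).__name__ for v in values}
        if len(type_names) == 1:
            inferred_type = next(iter(type_names))
        elif not type_names:
            inferred_type = "unknown"  # feature never had a value
        else:
            inferred_type = "mixed"    # conflicting types were observed
        schema[name] = {
            "type": inferred_type,
            # A feature is "required" if every record has a non-None value.
            "required": len(values) == len(records),
        }
    return schema


# Made-up sample data, for illustration only.
sample_records = [
    {"age": 30, "city": "Seoul"},
    {"age": 41, "city": None},
]
toy_schema = infer_schema(sample_records)
print(toy_schema)
```

Because an inferred schema only reflects the data it was derived from, it is usually worth reviewing the result by hand before relying on it for validation.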

The statistics and schema generated this way can then be used to validate a new dataset.

from tfx.components import ExampleValidator

example_validator = ExampleValidator(statistics=statistics_gen.outputs['statistics'],
                                     schema=schema_gen.outputs['schema'])
context.run(example_validator)

NOTE
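ExampleValidator automatically detects anomalies against the schema, using the skew and drift comparators described earlier. It may not cover every potential anomaly, however, so if you need to detect a specific kind of anomaly, writing a custom component is the better choice.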
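
For categorical features, TFDV's drift comparator is configured with a threshold on the L-infinity distance between two distributions. The sketch below shows that distance in plain Python as a starting point for a custom check. The helper names, the sample data, and the threshold value are hypothetical, and the code is not TFDV's implementation.

```python
from collections import Counter
from typing import Dict, Iterable


def normalized_counts(values: Iterable[str]) -> Dict[str, float]:
    """Turn raw categorical values into a relative-frequency distribution."""
    counts = Counter(values)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


def linf_distance(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Largest absolute difference in frequency over all categories."""
    categories = set(p) | set(q)
    return max((abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in categories), default=0.0)


# Made-up data: the share of "card" payments grows from 50% to 80%.
previous = normalized_counts(["card"] * 5 + ["cash"] * 5)
current = normalized_counts(["card"] * 8 + ["cash"] * 2)

DRIFT_THRESHOLD = 0.1  # hypothetical value, chosen only for this example
distance = linf_distance(previous, current)
drift_detected = distance > DRIFT_THRESHOLD
print(f"L-infinity distance: {distance:.2f}, drift detected: {drift_detected}")
```

A custom component could wrap a check like this around whichever features and thresholds matter for the use case at hand.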

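If the ExampleValidator component detects a mismatch in the dataset statistics or schema between the new dataset and the previous one, it sets the status to failed in the metadata store and the pipeline stops. If no anomalies are detected, the pipeline moves on to the next step, data preprocessing.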
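
The stop-or-continue behavior described above can be sketched as a simple gate. This is a conceptual illustration under stated assumptions: `DataValidationError`, `check_anomalies`, `run_pipeline`, and the dictionary of anomalies are hypothetical stand-ins, not TFX APIs or the ExampleValidator output format.

```python
from typing import Dict, List


class DataValidationError(Exception):
    """Raised when validation finds anomalies (hypothetical error type)."""


def check_anomalies(anomalies: Dict[str, str]) -> None:
    """Stop execution if any feature has a reported anomaly."""
    if anomalies:
        details = "; ".join(
            f"{feature}: {reason}" for feature, reason in sorted(anomalies.items())
        )
        raise DataValidationError(f"anomalies detected: {details}")


def run_pipeline(anomalies: Dict[str, str]) -> List[str]:
    """Run a toy two-step pipeline and return the names of the steps that ran."""
    executed = ["validate"]
    check_anomalies(anomalies)      # raises here, so later steps never run
    executed.append("preprocess")   # reached only when the data is clean
    return executed


print(run_pipeline({}))
try:
    run_pipeline({"age": "value out of expected range"})
except DataValidationError as error:
    print(f"pipeline stopped: {error}")
```

Failing early like this keeps bad data from reaching preprocessing and training, where the resulting problems are much harder to trace back to their source.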