embulk_code

Embulk

  • For more details on Embulk, see the following link:

    • https://github.com/mjs1995/muse-data-engineer/blob/main/doc/Data%20Ingestion/embulk.md

  • Install Embulk.

  • # Install a JRE
    sudo apt install default-jre
    
    # Download the latest Embulk release
    curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar"
    
    # Make the embulk file executable
    chmod +x ~/.embulk/bin/embulk
    
    # Add embulk to PATH
    echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
    
    # Apply the updated PATH to the current shell
    source ~/.bashrc
    
    # Verify the installation
    embulk -version

    • The following error occurred: [WARN] Unrecognized Java version: openjdk full version "17.0.6+10-Debian-1deb11u1", Unrecognized VM option 'AggressiveOpts', Error: Could not create the Java Virtual Machine.

    • Embulk does not recognize this Java version, and its launcher passes the AggressiveOpts option to the JVM. AggressiveOpts was deprecated in JDK 11 and removed in later releases, so running Embulk on a newer JVM, such as the Java 17 shown in the error, fails at startup.

      • # Install the default JRE
        sudo apt install default-jre
        
        # Select a Java version that Embulk recognizes
        sudo update-alternatives --config java
        
        # Set the JAVA_OPTS environment variable to disable AggressiveOpts
        # (the JVM flag is named AggressiveOpts; it is only valid on JVMs that still know it, such as Java 8)
        echo 'export JAVA_OPTS="-XX:-AggressiveOpts"' >> ~/.bashrc
        source ~/.bashrc

      • After this, Embulk runs and reports that it is installed correctly.
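Because the failure depends on which JVM is active, it can help to check the Java major version before launching Embulk. The following is a minimal sketch, assuming that the Embulk build used here needs Java 8, as the fix above suggests. The function names java_major and check_java_for_embulk are hypothetical helpers, not part of Embulk.

```shell
#!/usr/bin/env bash
# Extract the major Java version from the first line of `java -version` output.
# Handles the legacy "1.8.0_362" scheme (major 8) and the modern "17.0.6" scheme.
java_major() {  # $1 = first line printed by `java -version`
  local version
  # The version string is the first double-quoted token on the line.
  version=$(printf '%s\n' "$1" | sed -n 's/^[^"]*"\([^"]*\)".*$/\1/p' | head -n 1)
  case "$version" in
    1.*) version=${version#1.} ;;   # legacy scheme: drop the leading "1."
  esac
  # Keep everything before the first '.', '_', '+' or '-'.
  printf '%s\n' "${version%%[._+-]*}"
}

# Hypothetical guard: warn when the active JVM is not Java 8 (an assumption about this Embulk build).
check_java_for_embulk() {
  local major
  major=$(java_major "$(java -version 2>&1 | head -n 1)")
  if [ "$major" != "8" ]; then
    echo "Java $major detected; this Embulk setup is assumed to need Java 8" >&2
    return 1
  fi
}
```

Calling check_java_for_embulk before embulk run makes the version mismatch visible as a short message instead of a JVM startup error.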

Installing Embulk plug-ins

  • https://plugins.embulk.org/

  • Installing plugins on top of the base Embulk lets you read data from, or load data into, many different stores such as postgres, bigquery, mysql, hdfs, oracle, redshift, s3, dynamodb, and elasticsearch.

  • embulk gem list # list installed plugins
    embulk gem install embulk-input-mysql # MySQL
    embulk gem install embulk-input-postgresql # PostgreSQL
    embulk gem install embulk-output-bigquery # Bigquery
    • Running embulk gem install embulk-output-gcs failed with ERROR: Error installing embulk-output-bigquery: jwt requires Ruby version >= 2.5. The following steps resolved it.

      • # Install Java 8
        sudo apt-get update
        sudo apt-get install openjdk-8-jre-headless -y
        
        # Install Embulk
        curl --create-dirs -o ~/.embulk/bin/embulk -L "https://dl.embulk.org/embulk-latest.jar"
        chmod +x ~/.embulk/bin/embulk
        echo 'export PATH="$HOME/.embulk/bin:$PATH"' >> ~/.bashrc
        source ~/.bashrc
        
        # Install the required plugins
        embulk gem install embulk-input-gcs
        embulk gem install embulk-output-gcs
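When the install commands are rerun, already installed plugins can be skipped by checking the output of embulk gem list first. The following is a minimal sketch, assuming the list uses the usual RubyGems "name (version)" line format. plugin_installed and install_missing_plugins are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Succeed when the given `gem list` output contains the named plugin.
plugin_installed() {  # $1 = output of `embulk gem list`, $2 = plugin name
  printf '%s\n' "$1" | grep -q "^$2 ("
}

# Install every plugin passed as an argument unless it is already listed.
install_missing_plugins() {
  local installed plugin
  installed=$(embulk gem list 2>/dev/null)
  for plugin in "$@"; do
    if plugin_installed "$installed" "$plugin"; then
      echo "skip $plugin (already installed)"
    else
      embulk gem install "$plugin"
    fi
  done
}
```

For the plugins used here, the call would be install_missing_plugins embulk-input-gcs embulk-output-gcs embulk-output-bigquery.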

Embulk Example

  • embulk example ./test # generate sample data
    embulk guess ./test/seed.yml -o config.yml # guess the input plugin, output plugin, and field mapping settings and write them to config.yml
    embulk preview config.yml # preview the result of processing config.yml
    embulk run config.yml # run the load
    • Check the YAML file produced by embulk guess.

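For reference, the seed file that embulk guess starts from is small. The following is a sketch of writing one by hand, assuming the layout that embulk example produces: a file input with a path_prefix and a stdout output. write_seed is a hypothetical helper, and the exact contents of a generated seed.yml may differ.

```shell
#!/usr/bin/env bash
# Write a minimal seed config (assumed shape, not copied from a generated file).
write_seed() {  # $1 = directory containing csv/sample_*, $2 = seed file to write
  cat > "$2" <<EOF
in:
  type: file
  path_prefix: '$1/csv/sample_'
out:
  type: stdout
EOF
}

seed_file=$(mktemp)
write_seed "$HOME/test" "$seed_file"
cat "$seed_file"
# Next step, as in the commands above: embulk guess "$seed_file" -o config.yml
```

The guess step then fills in the parser section (delimiter, columns, timestamp formats) by sampling the files under path_prefix.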

BigQuery setup

  • BigQuery Admin: the role used here so that Embulk can create tables that do not exist yet. With a lesser role, existing tables could be modified, read, and written, but new tables could not be created.

  • The Embulk config must include service_account_email.

  • Provide the service account key file.

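Since the config needs both service_account_email and the key file, a quick check that the two belong to the same account can catch mismatches early. The following is a minimal sketch, assuming the key is a standard service-account JSON file with a client_email field. key_client_email and email_matches_key are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Print the client_email field of a service-account JSON key file.
key_client_email() {  # $1 = path to the JSON key
  sed -n 's/.*"client_email"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
}

# Succeed only when the key's client_email equals the expected address.
email_matches_key() {  # $1 = key file, $2 = expected service_account_email
  [ "$(key_client_email "$1")" = "$2" ]
}
```

A call such as email_matches_key followed by the key path and the address from the config returns a non-zero status when the two do not match.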

  • Create the YAML file for the test, then run embulk run with it. The config used here loads local CSV files into BigQuery.


  • in:
      type: file
      path_prefix: /home/{user}/./test/csv/sample_
      decoders:
      - {type: gzip}
      parser:
        charset: UTF-8
        newline: LF
        type: csv
        delimiter: ','
        quote: '"'
        escape: '"'
        null_string: 'NULL'
        trim_if_not_quoted: false
        skip_header_lines: 1
        allow_extra_columns: false
        allow_optional_columns: false
        columns:
        - {name: id, type: long}
        - {name: account, type: long}
        - {name: time, type: timestamp, format: '%Y-%m-%d %H:%M:%S'}
        - {name: purchase, type: timestamp, format: '%Y%m%d'}
        - {name: comment, type: string}
    out:
      type: bigquery
      mode: append
      auth_method: json_key
      json_keyfile: /home/{user}/test/{project key}.json
      service_account_email: zoom-de-service-acct@{project}.iam.gserviceaccount.com
      dataset: test_dataset
      auto_create_table: true
      table: embulk_table_%Y_%m
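The table value in the config above contains strftime-style placeholders. Assuming the BigQuery output plugin expands them against the current time, as its README describes for dynamic table names, the resulting table name can be previewed with date. This is a sketch that relies on GNU date, and expand_table is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Expand a strftime-style table template for a given date (GNU date assumed).
expand_table() {  # $1 = template, $2 = date understood by `date -d` (default: now)
  date -d "${2:-now}" "+$1"
}

expand_table 'embulk_table_%Y_%m' '2023-03-15'   # prints embulk_table_2023_03
expand_table 'embulk_table_%Y_%m'                # name for the current month
```

With mode: append and auto_create_table: true, this suggests that runs within one month append to the same table and the first run of a new month creates a new one.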
