hive_format

Storage formats

  • Hive manages table storage along two dimensions: the row format and the file format.

  • The row format dictates how rows, and the fields in a particular row, are stored. It is defined by a SerDe, a Hive term that is a portmanteau of Serializer-Deserializer.

  • When it acts as a deserializer, as it does when a table is queried, a SerDe deserializes a row of data from the bytes stored in the file into the objects Hive uses internally to operate on that row.
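
The two directions of a SerDe can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Hive's Java SerDe interface: the class `DelimitedSerDe`, its method names, and the sample columns are hypothetical. The Ctrl-A (`\x01`) delimiter mirrors the default field delimiter of Hive text tables.

```python
from typing import List, Tuple


class DelimitedSerDe:
    """Hypothetical SerDe-like helper (not Hive's actual Java interface)."""

    def __init__(self, columns: List[Tuple[str, type]], delimiter: str = "\x01"):
        self.columns = columns      # (column name, Python type) pairs
        self.delimiter = delimiter  # field separator inside one row

    def deserialize(self, raw: bytes) -> dict:
        # Bytes stored in the file -> in-memory row object (query path).
        fields = raw.decode("utf-8").split(self.delimiter)
        return {name: cast(value) for (name, cast), value in zip(self.columns, fields)}

    def serialize(self, row: dict) -> bytes:
        # In-memory row object -> bytes written to the file (insert path).
        text = self.delimiter.join(str(row[name]) for name, _ in self.columns)
        return text.encode("utf-8")


serde = DelimitedSerDe([("id", int), ("name", str), ("score", float)])
row = serde.deserialize(b"7\x01alice\x0193.5")
print(row)
print(serde.serialize(row))
```

The file format decides where one row ends and the next begins; the SerDe only sees one row's bytes at a time.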

  • File formats

    • Text File

      • The text file format is convenient for sharing data with Pig and with Unix text tools such as grep, sed, and awk. The file contents are also easy to view and edit directly.

      • Compared with binary formats, the text format does not use storage space efficiently. Compression can be applied, but a binary format generally gives better disk I/O and more efficient use of disk space than text.

    • SequenceFile

      • A SequenceFile is a flat file made up of binary key-value pairs (a flat file has no hierarchical structure; it is simply a collection of records of the same form). When Hive converts a query into MapReduce jobs, it decides on the appropriate key-value pairs to use for each record.

      • SequenceFiles can be compressed at the record level or the block level, which optimizes disk space usage and I/O, and they can still be split on block boundaries for parallel processing.

      • image

      • image

        • https://towardsdatascience.com/new-in-hadoop-you-should-know-the-various-file-format-in-hadoop-4fcdfa25d42b

      • The SequenceFile format supported by Hadoop divides a file into blocks and can optionally compress those blocks. To use it, add a STORED AS SEQUENCEFILE clause to the CREATE TABLE statement.

      • No compression stores the data as-is. Because nothing is compressed, this mode uses the most disk space, although it avoids the CPU cost of compressing and decompressing. Reading and writing large files can be slower because more bytes pass through I/O.

      • Record compression compresses each record (key-value pair) individually; in Hadoop's SequenceFile, only the value of each pair is compressed. This reduces the size of the data and saves disk space. Record compression supports codecs such as deflate, gzip, and bzip2.

      • Block compression compresses the data a block at a time. Unlike record compression, it batches many records together, so I/O is performed per block and processing is faster; it also usually achieves a better compression ratio. Block compression supports codecs such as deflate, gzip, bzip2, LZO, and Snappy.
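
The difference between the three modes can be approximated with Python's standard `zlib` module. This is a rough sketch, not the real SequenceFile layout: it ignores headers and sync markers, the sample records and the `records_per_block` value are made up, and it assumes that record compression covers values only while block compression batches keys and values separately.

```python
import zlib

# Hypothetical sample data: 1,000 small, repetitive key-value records.
records = [(f"k{i:04d}".encode(), b"status=OK;region=ap-northeast-2") for i in range(1000)]


def size_uncompressed(recs):
    # Every key and value is stored as-is.
    return sum(len(k) + len(v) for k, v in recs)


def size_record_compressed(recs):
    # Each value is compressed on its own; keys stay uncompressed.
    return sum(len(k) + len(zlib.compress(v)) for k, v in recs)


def size_block_compressed(recs, records_per_block=100):
    # Keys and values of many records are batched, then each batch is compressed.
    total = 0
    for start in range(0, len(recs), records_per_block):
        block = recs[start:start + records_per_block]
        keys = b"".join(k for k, _ in block)
        values = b"".join(v for _, v in block)
        total += len(zlib.compress(keys)) + len(zlib.compress(values))
    return total


raw = size_uncompressed(records)
per_record = size_record_compressed(records)
per_block = size_block_compressed(records)
print(raw, per_record, per_block)
```

With values this small, compressing each one separately may not save anything because every compressed value carries its own overhead, whereas batching lets the compressor exploit repetition across records.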

    • RCFile

      • image

        • https://towardsdatascience.com/new-in-hadoop-you-should-know-the-various-file-format-in-hadoop-4fcdfa25d42b

      • Hive's Record Columnar File (RCFile) first partitions the data horizontally into row groups and then stores the data inside each row group column by column. It is a data placement structure designed for MapReduce-based data warehouse systems.

      • RCFile combines the advantages of row stores and column stores to meet four requirements: fast data loading, fast query processing, efficient use of storage space, and adaptability to highly dynamic workload patterns.

      • Most Hadoop and Hive storage is row-oriented, which is efficient in most cases. Block-level compression of files handles repeated data efficiently, and many text-processing and debugging tools (more, head, awk) work well with row-oriented data.

      • If a table has hundreds of columns and most queries use only a few of them, scanning entire rows to fetch the data is wasteful. If the data is stored by column instead, only the columns that are needed have to be read, which improves performance.

      • One of Hive's strengths is how easily it converts data between different formats; the storage information is kept in the table's metadata.

        • Row-oriented

          • image

            • https://towardsdatascience.com/new-in-hadoop-you-should-know-the-various-file-format-in-hadoop-4fcdfa25d42b

        • Column-oriented

          • image

            • https://towardsdatascience.com/new-in-hadoop-you-should-know-the-various-file-format-in-hadoop-4fcdfa25d42b
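
The row-group idea described for RCFile can be sketched as follows. This is a toy in-memory model, assuming plain Python lists and dicts instead of the real on-disk format; the function names, the sample table, and the group size are hypothetical.

```python
from typing import Dict, List

# Hypothetical table: each dict is one row.
rows = [
    {"id": 1, "name": "a", "price": 10},
    {"id": 2, "name": "b", "price": 20},
    {"id": 3, "name": "c", "price": 30},
    {"id": 4, "name": "d", "price": 40},
]


def to_row_groups(table: List[dict], group_size: int) -> List[Dict[str, list]]:
    """Split rows horizontally into groups, then store each group column by column."""
    groups = []
    for start in range(0, len(table), group_size):
        chunk = table[start:start + group_size]
        groups.append({col: [r[col] for r in chunk] for col in chunk[0]})
    return groups


def read_column(groups: List[Dict[str, list]], column: str) -> list:
    """Read one column, touching only that column's values in every group."""
    values = []
    for group in groups:
        values.extend(group[column])
    return values


groups = to_row_groups(rows, group_size=2)
print(groups[0])
print(read_column(groups, "price"))
```

All columns of a given row stay inside one group, so a row can be reassembled locally, while a query that needs only `price` never has to look at `id` or `name`.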

    • Avro Files

      • image

        • https://www.oreilly.com/library/view/operationalizing-the-data/9781492049517/ch04.html

      • Avro is a remote procedure call and data serialization framework.

      • It was developed within Apache's Hadoop project. It uses JSON to define data types and protocols, and it serializes data in a compact binary format.
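
As an illustration of the JSON-defined data types, the sketch below parses a small Avro-style record schema with Python's standard `json` module. The `User` record, its namespace, and its fields are invented for this example, and no Avro library is used, so nothing is actually serialized to Avro's binary form.

```python
import json

# Hypothetical Avro record schema, written as JSON.
schema_json = """
{
  "type": "record",
  "name": "User",
  "namespace": "example.avro",
  "fields": [
    {"name": "id", "type": "long"},
    {"name": "name", "type": "string"},
    {"name": "email", "type": ["null", "string"], "default": null}
  ]
}
"""

schema = json.loads(schema_json)

# Map each field name to its declared type.
field_types = {f["name"]: f["type"] for f in schema["fields"]}
print(field_types)
```

The `["null", "string"]` union with a default is the usual way to mark a field optional, which is one reason adding fields later tends to be straightforward with Avro.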

    • ORC Files

      • image
        • https://www.oreilly.com/library/view/operationalizing-the-data/9781492049517/ch04.html

      • ORC stores collections of rows in a single file, and within each collection the row data is stored in a columnar format.

      • The row collections can be processed in parallel across the cluster. Each file with a columnar layout is optimized for compression.

      • Skipping data and columns reduces both the read load and the decompression load.

    • Parquet

      • image
        • https://www.oreilly.com/library/view/operationalizing-the-data/9781492049517/ch04.html

      • Parquet is an open-source, column-oriented storage format for Hadoop.

      • Parquet is optimized for working with complex data in bulk and includes methods for efficient data compression and encoding types.

    • Qualities of ORC, Parquet, and Avro

      • Property                        | ORC                 | Parquet             | Avro
        --------------------------------+---------------------+---------------------+------------
        Row or column                   | Column              | Column              | Row
        Compression                     | Great               | Great               | Good
        Speedup (compared to text file) | 10–100x             | 10–100x             | 10x
        Schema evolution                | Good                | Better              | Best
        Platforms                       | Hive, Spark, Presto | Hive, Spark, Presto | Hive, Spark
        Splittability                   | Best                | Best                | Better
        File statistics                 | Yes                 | Yes                 | No
        Indexes                         | Yes                 | Yes                 | No
        Bloom filters                   | Yes                 | No                  | No

    • Custom INPUTFORMAT and OUTPUTFORMAT

      • Hive uses Hadoop's InputFormat API to read data from a variety of sources, such as text files, SequenceFiles, and custom file formats. Likewise, the OutputFormat API lets it write data in a variety of formats.

      • The input format determines how an input stream, usually a file, is split into records. The SerDe then parses each record into columns. A custom input format can be declared with the INPUTFORMAT clause when the table is created. The input format for the default STORED AS TEXTFILE specification is implemented by the Java class org.apache.hadoop.mapreduce.lib.input.TextInputFormat.

      • The output format determines how records are written to an output stream (usually a file). The SerDe serializes each record into the appropriate byte stream. A custom output format can be declared with the OUTPUTFORMAT clause when the table is created.
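
The division of labor between the input format, the SerDe, and the output format can be sketched as a small pipeline. These are hypothetical Python stand-ins, not the Hadoop or Hive Java APIs; the function names, the newline record boundary, and the Ctrl-A field delimiter are assumptions chosen to resemble a plain text table.

```python
import io
from typing import Iterator, List


def read_records(stream: io.BytesIO) -> Iterator[bytes]:
    """Input-format role: split a byte stream into records (here, one per line)."""
    for line in stream:
        yield line.rstrip(b"\n")


def parse_columns(record: bytes, delimiter: bytes = b"\x01") -> List[str]:
    """SerDe role (deserialize): split one record into column values."""
    return [field.decode("utf-8") for field in record.split(delimiter)]


def to_record(columns: List[str], delimiter: bytes = b"\x01") -> bytes:
    """SerDe role (serialize): turn column values back into one record."""
    return delimiter.join(c.encode("utf-8") for c in columns)


def write_records(records: List[bytes], stream: io.BytesIO) -> None:
    """Output-format role: decide how records are laid out in the output stream."""
    for record in records:
        stream.write(record + b"\n")


source = io.BytesIO(b"1\x01seoul\n2\x01busan\n")
table = [parse_columns(r) for r in read_records(source)]
print(table)

sink = io.BytesIO()
write_records([to_record(cols) for cols in table], sink)
print(sink.getvalue() == source.getvalue())
```

Because record splitting and column parsing are separate steps, either one can be swapped out independently, which is what the INPUTFORMAT, OUTPUTFORMAT, and SerDe declarations on a table allow.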

  • Hive supports three kinds of user-defined functions: regular user-defined functions (UDFs), user-defined aggregate functions (UDAFs), and user-defined table-generating functions (UDTFs). They differ in the number of rows they take as input and the number of rows they produce as output.

    • A regular UDF operates on a single row and produces a single row as output. Most functions, such as mathematical and string functions, are of this type.

    • A UDAF works on multiple input rows and produces a single output row. Aggregate functions such as COUNT and MAX are of this type.

    • A UDTF operates on a single row and produces multiple rows (a table) as output.
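
The three row-count shapes can be illustrated with plain Python functions. These are not Hive UDF classes (real ones are written against Hive's Java API); the function names are hypothetical and are modeled loosely on the built-ins `upper`, `max`, and `explode`.

```python
from typing import Iterable, Iterator


def upper_udf(value: str) -> str:
    """Regular UDF shape: one input row -> one output row."""
    return value.upper()


def max_udaf(values: Iterable[int]) -> int:
    """UDAF shape: many input rows -> one output row."""
    return max(values)


def explode_udtf(values: list) -> Iterator:
    """UDTF shape: one input row -> zero or more output rows."""
    for v in values:
        yield v


names = ["kim", "lee", "park"]
print([upper_udf(n) for n in names])     # 3 rows in, 3 rows out
print(max_udaf([3, 9, 4]))               # 3 rows in, 1 row out
print(list(explode_udtf([10, 20, 30])))  # 1 row (an array) in, 3 rows out
```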

Reference

  • https://www.amazon.com/Programming-Hive-Warehouse-Language-Hadoop/dp/1449319335

  • https://www.amazon.com/Hadoop-Definitive-Guide-Tom-White/dp/1449311520

  • https://www.oreilly.com/library/view/operationalizing-the-data/9781492049517/ch04.html

  • https://cwiki.apache.org/confluence/display/Hive/FileFormats
