📄 Deep Residual Learning for Image Recognition

Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun Affiliation: Microsoft Research, Xi'an Jiaotong University Conference: CVPR 2016 DOI: 10.48550/arXiv.1512.03385

  • Keywords: Residual

Research Background

"Is learning better networks as easy as stacking more layers?"

  • Xavier/He initialization, BatchNorm, and similar techniques largely resolved vanishing/exploding gradients
    • Yet the degradation problem (performance getting worse as networks grow deeper) remained unsolved
  • In principle, a deeper network is guaranteed to be able to realize a shallower network's optimal solution (copy the shallow layers and make the extra layers identities),
    • but in practice training error increases with depth, the Degradation Problem → this contradiction is ResNet's starting point.

Key Ideas

Residual learning resolves the optimization difficulty; identity shortcuts fix the gradient-flow problem; the Bottleneck Architecture reduces FLOPs, enabling greater depth → better performance.

Understanding Residual Learning

Assumed target for newly added layers: the identity function

Conventional CNN

  • Problem setting
    • The added layers must directly implement the identity function.
    • The network is a nonlinear stack: Conv → BN → ReLU → Conv …
    • Implementing the identity exactly with such a nonlinear composition is a very hard optimization problem.
  • Why is it hard?
    • Even a small change in the parameters easily distorts the input x
    • The deeper the layers, the more nearly impossible it is to preserve the identity
    • Backpropagation converges to the identity solution only with great difficulty → asking a deep nonlinear network to "just do nothing" is an extremely hard demand.

ResNet

  • ResNet reformulates the block output as H(x) = F(x) + x,
    • so the layers learn only the residual F(x) = H(x) - x
  • If the target function is H(x) = x, then learning F(x) = 0 suffices.

Implication

  • ๋ ˆ์ด์–ด๋Š” ํ•ญ๋“ฑํ•จ์ˆ˜ ์ „์ฒด๋ฅผ ๋งŒ๋“œ๋Š” ๋Œ€์‹  ๋‹จ์ˆœํžˆ ์ถœ๋ ฅ์„ 0์œผ๋กœ ๋งŒ๋“œ๋Š” ๊ฒƒ๋งŒ ํ•™์Šตํ•˜๋ฉด ๋จ.

  • Modern initialization schemes put every weight at a small value near 0.

    • With Xavier/He initialization, most weights already start near 0
    • So a residual block's initial state is naturally F(x) ≈ 0 → i.e., it begins already very close to the identity
  • Keeping the parameters near 0 is far easier → the optimization difficulty drops dramatically.

Conventional approach

  • Demand: "implement the identity function with a stack of nonlinear layers."
  • Result: an extremely hard optimization problem

Residual approach

  • Demand: "drive the residual F(x) to 0."
  • Result: just keep the parameters near 0 → very easy

Summary

  • Making a stack of nonlinear layers produce the identity function is hard.
  • Making it drive the residual F(x) to 0 is overwhelmingly easy.
    → That is why ResNet keeps training well even at great depth.

Identity Shortcut

  • A shortcut connection is a path that passes the input x unchanged to the next block.
  • It is a parameter-free identity mapping, so it adds no computation.
  • The residual block's output is computed as y = F(x) + x.

Why is it needed?

  1. It makes the identity function "structurally" easy to realize, so optimization gets easier
    • A conventional CNN has to build H(x) = x out of nonlinear compositions, which is hard
    • The identity shortcut passes x through unchanged,
    • guaranteeing the identity function at the level of network structure
    • → the residual block only has to learn "pass x as-is + a small change"
  2. It delivers the gradient flow directly, so training does not break down as depth grows
    • During backpropagation the gradient flows through the shortcut as ∂L/∂x = ∂L/∂y · (I + ∂F/∂x), with no BN or ReLU derivatives on the shortcut path
    • At initialization F(x) ≈ 0, so ∂F/∂x ≈ 0 and
      ∂L/∂x ≈ ∂L/∂y → the gradient does not vanish, and even deep networks train stably
    • As training proceeds, F(x) learns whatever residual is needed (see the sketch below)
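A tiny autograd check of this behavior (a sketch under one assumption: the residual branch's final conv is zero-initialized, so F(x) = 0 exactly at the start):

```python
import torch
import torch.nn as nn

# Assumed setup: zero-init the last conv so F(x) = 0 exactly at initialization.
conv1 = nn.Conv2d(64, 64, 3, padding=1)
conv2 = nn.Conv2d(64, 64, 3, padding=1)
nn.init.zeros_(conv2.weight)
nn.init.zeros_(conv2.bias)

x = torch.randn(2, 64, 8, 8, requires_grad=True)
y = conv2(torch.relu(conv1(x))) + x        # y = F(x) + x

g = torch.randn_like(y)                    # an arbitrary upstream gradient dL/dy
y.backward(g)

# With dF/dx = 0 here, the shortcut delivers dL/dx = dL/dy unchanged.
print(torch.allclose(x.grad, g))           # True
```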

Shortcut Options (handling dimension mismatch)

  • Option A: zero-padding identity shortcut
    • When the channel count grows, fill the missing channels with zeros
      • (the zero-filled channels do no residual learning)
    • A completely parameter-free identity mapping
    • The lightest and simplest option
  • Option B: projection shortcut (1×1 convolution)
    • When input/output channels differ, a 1×1 Conv projection matches the shapes
    • The added projection layers slightly increase computation
  • Option C: full projection shortcut
    • Replaces every shortcut with a 1×1 Conv projection
    • Many more projection layers → both FLOPs and memory grow substantially

All three options perform far better than the plain network:

  • A is the lightest and the baseline,
  • B is slightly more accurate,
  • C is marginally more accurate still, but inefficient.

ResNet therefore builds on A/B: the "Residual + Identity Shortcut" structure (see the sketch below).
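As a concrete sketch of options A and B (my own illustration; the 64 → 128 channel counts and stride 2 are assumed example numbers, not taken from the paper's tables):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Option A: parameter-free identity shortcut with zero-padded channels.
def zero_pad_shortcut(x: torch.Tensor, out_channels: int, stride: int = 2):
    x = x[:, :, ::stride, ::stride]        # spatial subsampling, no parameters
    extra = out_channels - x.shape[1]
    # F.pad on NCHW pads (W_left, W_right, H_top, H_bottom, C_front, C_back)
    return F.pad(x, (0, 0, 0, 0, 0, extra))

# Option B: projection shortcut, a strided 1x1 conv that learns the mapping.
projection = nn.Conv2d(64, 128, kernel_size=1, stride=2, bias=False)

x = torch.randn(1, 64, 56, 56)
print(zero_pad_shortcut(x, 128).shape)     # torch.Size([1, 128, 28, 28])
print(projection(x).shape)                 # torch.Size([1, 128, 28, 28])
```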

Summary

  • ResNet์˜ ํ•ต์‹ฌ์€ โ€œResidual + Identity Shortcutโ€ ์กฐํ•ฉ
    • Residual ๊ฐœ๋…๊ณผ Shortcut์œผ๋กœ ๊ตฌํ˜„
  • Projection์€ ์ฐจ์› ๋งž์ถ”๊ธฐ๊ฐ€ ํ•„์š”ํ•œ ๊ฒฝ์šฐ์—๋งŒ ๋ณด์กฐ์ ์œผ๋กœ ์‚ฌ์šฉ

Bottleneck Architecture

Core idea

The 3×3 Conv carries the largest share of the computation, so:
reduce the channels, compute there, then restore them

  • A convolution's FLOPs are proportional to H · W · K² · C_in · C_out
  • So when input and output widths scale together, FLOPs grow in proportion to C², making a large channel count (C) expensive

Structure

  1. 1×1 Conv (reduce dimensions)
    • 256 → 64
    • almost no computation
  2. 3×3 Conv (extract features)
    • operates on the reduced channels (64ch)
  3. 1×1 Conv (restore dimensions)
    • 64 → 256
    • back to the original channel count
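A quick back-of-the-envelope check of the savings for the 256 → 64 → 256 block above, counting multiply-adds per output position at stride 1 and ignoring bias/BN terms:

```python
# Multiply-adds per output position: kernel_area * C_in * C_out.
plain_3x3  = 3 * 3 * 256 * 256                 # one 3x3 conv at full width: 589,824
bottleneck = (1 * 1 * 256 * 64                 # 1x1 reduce:   16,384
            + 3 * 3 * 64 * 64                  # 3x3 features: 36,864
            + 1 * 1 * 64 * 256)                # 1x1 restore:  16,384  (total 69,632)
print(f"{plain_3x3 / bottleneck:.1f}x")        # ~8.5x cheaper than a single 3x3 conv
```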

Methodology

  • ๋ชจ๋ธ ๊ตฌ์กฐ
  1. Stem (์ž…๋ ฅ ์ฒ˜๋ฆฌ ๊ตฌ๊ฐ„) ๊ทธ๋ฆผ์˜ ZERO PAD โ†’ CONV โ†’ BN โ†’ ReLU โ†’ MAX POOL ์ด ๋ถ€๋ถ„์€ Stage๋กœ ๋ณด๊ธฐ ์ด์ „์˜ ์ดˆ๊ธฐ ์ฒ˜๋ฆฌ ๋‹จ๊ณ„
    • 7ร—7 Conv (stride=2) โ€” ํŠน์ง• ์ถ”์ถœ
    • BatchNorm, ReLU โ€” ์ •๊ทœํ™” + ๋น„์„ ํ˜• ๋ณ€ํ™˜
    • MaxPool โ€” ๊ณต๊ฐ„ ํฌ๊ธฐ ์ ˆ๋ฐ˜์œผ๋กœ ์ถ•์†Œ ์ดํ›„ Stage์—์„œ bottleneck block์ด ์ฒ˜๋ฆฌํ•˜๊ธฐ ์ข‹์€ ํฌ๊ธฐ๋กœ ์ดˆ๊ธฐ Feature Extractor ์ถœ๋ ฅ: 56ร—56, 64ch
  2. Stage 2 (conv2_x): Residual + Bottleneck
    • CONV BLOCK (ํŒŒ๋ž€์ƒ‰) = Bottleneck block + Projection shortcut
      • Stage๊ฐ€ ๋ฐ”๋€Œ๋Š” ์ˆœ๊ฐ„(ํ•ด์ƒ๋„/์ฑ„๋„ ์ฆ๊ฐ€) โ†’ dimension mismatch ๋ฐœ์ƒ
      • shortcut์ด x๋ฅผ ๊ทธ๋Œ€๋กœ ์ „๋‹ฌํ•  ์ˆ˜ ์—†์Œ
      • ๋”ฐ๋ผ์„œ projection shortcut(1ร—1 conv)์ด ํ•„์š”ํ•œ block์ด CONV BLOCK
    • ID BLOCK ร—2 (๋นจ๊ฐ„์ƒ‰) = Bottleneck block + Identity shortcut
      • ์ž…๋ ฅ/์ถœ๋ ฅ dimension์ด ๊ฐ™๊ธฐ ๋•Œ๋ฌธ์— ์ •ํ™•ํžˆ Identity shortcut๋งŒ ์‚ฌ์šฉ โ†’ ID BLOCK
  3. Stage 3,4,5
    • Stage 3: CONV BLOCK + ID BLOCK ร—3
    • Stage 4: CONV BLOCK + ID BLOCK ร—5
    • Stage 5: CONV BLOCK + ID BLOCK ร—2
  4. Head (์ถœ๋ ฅ ์ฒ˜๋ฆฌ ๊ตฌ๊ฐ„)
  • output - Final prediction

A Stage is "a bundle of Blocks over which the spatial size and channel count stay fixed"

  • First block: CONV BLOCK (projection shortcut)
    → resolves the dimension mismatch created by the Stage transition
  • Remaining blocks: ID BLOCK (identity shortcut)
    → within a block, channels/resolution do not change
    → so the shortcut can pass x through as-is and the dimensions still match

Block internals

  • Inside, CONV BLOCK and ID BLOCK share the bottleneck structure: 1×1 Conv (reduce) → 3×3 Conv → 1×1 Conv (restore)
  • "Input and output channels are always identical," which is why the identity shortcut needs no adjustment
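Putting the pieces together, here is a sketch of one bottleneck block that covers both roles (my own PyTorch illustration using the channel sizes above, not the paper's code; projection=True corresponds to the CONV BLOCK, projection=False to the ID BLOCK):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Bottleneck residual block: 1x1 reduce -> 3x3 -> 1x1 restore."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1, projection=False):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),              # 1x1 reduce
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1,
                      bias=False),                                 # 3x3 features
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),              # 1x1 restore
            nn.BatchNorm2d(out_ch),
        )
        # CONV BLOCK: 1x1-conv projection shortcut; ID BLOCK: pure identity.
        self.shortcut = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
            nn.BatchNorm2d(out_ch),
        ) if projection else nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + self.shortcut(x))

# Stage 2 as described above: one CONV BLOCK, then two ID BLOCKs.
stage2 = nn.Sequential(
    Bottleneck(64, 64, 256, projection=True),   # dims change: 64ch -> 256ch
    Bottleneck(256, 64, 256),                   # dims fixed: identity shortcut
    Bottleneck(256, 64, 256),
)
print(stage2(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 256, 56, 56])
```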

Summary

  1. A Stage is a bundle of several residual blocks
  2. The first block needs a projection shortcut, so it is a CONV BLOCK
  3. The rest use identity shortcuts, so they are ID BLOCKs
  4. Every block's interior is the bottleneck structure
  5. Dimension mismatches arise when the Stage changes
  • Network components: Batch Normalization, He initialization, Global Average Pooling, zero-padding identity shortcut, fully convolutional inference

  • Training setup: data augmentation methods, optimizer and hyperparameters, learning-rate schedule, no Dropout, multi-scale testing


Experimental Results and Analysis

  • Plain vs. residual comparison (18/34-layer)
  • Visualizing the degradation problem (Figure 4)
  • Shortcut option (A/B/C) experiments
  • Scaling depth with bottlenecks (50/101/152-layer) and its performance
  • Comparison with the SOTA (ILSVRC 2015)


Conclusions and Implications

  • The impact of residual learning
  • The essential significance of the shortcut
  • A new baseline for deep-network design
  • The line of follow-up work it opened (pre-activation ResNet, etc.)


๊ฐœ์ธ ์ฝ”๋ฉ˜ํŠธ

  • Parts I found hard to understand

  • Concepts to look up further (terms in the paper, references, etc.)

    • Pre-activation (later variant)
  • Papers for further reference

(1) Normalized-initialization line
  • Xavier initialization: Glorot & Bengio, "Understanding the Difficulty of Training Deep Feedforward Neural Networks" (2010)
  • He initialization (tuned for ReLU): He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" (2015)

(2) Intermediate-normalization-layers line
  • Batch Normalization: Ioffe & Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (2015)


Memo

✔ x = the block's input (feature map)

  • the value handed over from the previous layer
  • the "original information"

✔ F(x) = what the residual function has to learn

  • the transformation produced by the block's internal layers (Conv-BN-ReLU-Conv…)
  • i.e., the learnable part
  • because of initialization, F(x) ≈ 0 at first

✔ H(x) = the block's overall output (the target function)

  • the function this block must ultimately represent
Term | Meaning | Determined by
x | input | the feature output by the previous layer
F(x) | the "delta" to be learned | the network (training)
H(x) | the target function the block must ultimately express | the data / training objective
  1. In the early days of deep learning, vanishing/exploding gradients made deep stacks hard to train. → Thanks to Xavier/He initialization and BatchNorm, 20-30 layers became trainable.

  2. Stacking deeper should then improve performance, but in practice the deeper model shows higher training error than the shallower one: the "Degradation Problem."

  3. A deep network finds it very hard to implement the identity H(x) = x directly. → Passing x through unchanged via nonlinear compositions is complicated. → Initialization makes it easy to move toward F(x) = 0, but H(x) = x itself stays hard.

  4. Introduce the residual: rewriting the block as H(x) = F(x) + x makes F(x) = 0 the solution whenever the target is identity, so optimization is very easy even in deep networks.

  5. The shortcut delivers the gradient without loss.

  6. The Bottleneck Architecture reduces the computation.