📄 Deep Residual Learning for Image Recognition

Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun Affiliation: Microsoft Research, Xi'an Jiaotong University Conference: CVPR 2016 (arXiv preprint, Dec 2015) DOI: 10.48550/arXiv.1512.03385

  • Keywords: [Residual, Identity Shortcut, Degradation]

Research Background

"Is learning better networks as easy as stacking more layers?"

  • Xavier/He initialization, BatchNorm, and similar techniques largely solved vanishing/exploding gradients
    • However, the degradation problem (accuracy getting worse as networks get deeper) remained unsolved
  • A deeper network is guaranteed to be able to realize any solution a shallower network finds (copy the shallow layers, make the extra layers identities),
    • yet in practice training error increases with depth: the Degradation Problem → this contradiction is ResNet's starting point.

Key Ideas

Residual learning resolves the optimization difficulty; identity shortcuts fix the gradient-flow problem; the bottleneck architecture reduces FLOPs, enabling greater depth → better performance.

Understanding Residual Learning

Assumed target for a newly added layer: the identity function

Conventional CNN

  • Problem setting
    • The network must realize the identity function directly.
    • The network is a nonlinear stack: Conv → BN → ReLU → Conv …
    • Realizing the identity exactly with such a nonlinear composition is a very hard optimization problem.
  • Why is it hard?
    • Even a small parameter change easily distorts the input x
    • The deeper the stack, the harder it becomes to keep the mapping an identity
    • Backpropagation struggles to converge to an identity solution → asking a deep nonlinear network to "just do nothing" is extremely difficult.

ResNet

  • ResNet instead has each block learn the residual F(x) = H(x) - x rather than the full mapping H(x).

  • If the target function is H(x), the block outputs H(x) = F(x) + x.

Meaning

  • ๋ ˆ์ด์–ด๋Š” ํ•ญ๋“ฑํ•จ์ˆ˜ ์ „์ฒด๋ฅผ ๋งŒ๋“œ๋Š” ๋Œ€์‹  ๋‹จ์ˆœํžˆ ์ถœ๋ ฅ์„ 0์œผ๋กœ ๋งŒ๋“œ๋Š” ๊ฒƒ๋งŒ ํ•™์Šตํ•˜๋ฉด ๋จ.

  • Modern initialization places all weights at small values near 0.

    • With Xavier/He initialization, most weights already start near 0
    • So a residual block naturally begins at F(x) ≈ 0 → i.e., it already starts very close to the identity
  • Keeping the parameters near 0 is far easier → optimization difficulty drops sharply.

Conventional approach

  • Demand: "realize the complicated identity function H(x) = x with a stack of nonlinear layers."
  • Result: an extremely hard optimization problem

Residual approach

  • Demand: "drive the residual F(x) to 0."
  • Result: just keep the parameters near 0 → very easy

Summary

  • Getting a stack of nonlinear layers to produce the identity function is hard.
  • Getting it to produce a residual of 0 is overwhelmingly easy.
    → That is why even very deep ResNets train well.
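The contrast above can be sketched numerically. Below, a tiny two-layer branch stands in for a Conv → BN → ReLU → Conv stack (the vector sizes and the 0.1 weight scale are illustrative assumptions): with its output weights at zero, the residual block is exactly the identity, while a plain block with the same weights collapses to the zero function.

```python
import numpy as np

rng = np.random.default_rng(0)

def plain_block(x, W1, W2):
    # A plain nonlinear stack must produce the target mapping all by itself.
    return W2 @ np.maximum(W1 @ x, 0.0)  # two layers with a ReLU in between

def residual_block(x, W1, W2):
    # y = F(x) + x: the branch only has to produce the residual F(x) = H(x) - x.
    return plain_block(x, W1, W2) + x

d = 8
x = rng.standard_normal(d)
W1 = rng.standard_normal((d, d)) * 0.1  # small weights, as after typical init

# Identity is trivial for the residual block: drive the branch output to zero.
W2_zero = np.zeros((d, d))
assert np.allclose(residual_block(x, W1, W2_zero), x)       # exact identity
# The plain block with the same zero weights is the zero function instead.
assert np.allclose(plain_block(x, W1, W2_zero), np.zeros(d))
```

With a small random W2 instead of exact zeros, the residual block's output stays close to x, which is the sense in which it "starts near the identity."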

Identity Shortcut

  • A shortcut connection is a path that passes the input x unchanged to the next block.
  • It is a parameter-free identity mapping, so it adds no computation.
  • The residual block's output is computed as y = F(x) + x.

Why is it needed?

  1. It makes the identity "structurally" easy, which eases optimization
    • A conventional CNN has to build H(x) = x out of nonlinear combinations, which is hard
    • The identity shortcut passes x through unchanged,
    • guaranteeing the identity at the level of the network architecture
    • → A residual block only has to learn "pass x through + a small change F(x)"
  2. It delivers gradients directly, so training does not break down with depth
    • During backpropagation the gradient flows through the shortcut as ∂L/∂x = ∂L/∂y · (1 + ∂F/∂x) (not only through the derivatives of BN and ReLU)
    • At initialization F(x) ≈ 0, so
      gradient ≈ ∂L/∂y → the gradient is not lost, and even deep networks train stably
    • As training proceeds, F learns whatever residual values are needed
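A toy scalar model makes this concrete (the per-block branch derivative of ±0.1 is an assumed value near initialization): multiplying the plain network's factors f′ shrinks the gradient geometrically, while the shortcut's (1 + f′) factors keep it near 1.

```python
# Backprop through L stacked blocks, scalar caricature:
#   plain net:  dL/dx = prod(f_i')       -> vanishes when |f_i'| < 1
#   residual:   dL/dx = prod(1 + f_i')   -> stays near 1 when branches are small
L = 50
plain, resid = 1.0, 1.0
for i in range(L):
    f_prime = 0.1 if i % 2 == 0 else -0.1  # assumed small branch derivatives
    plain *= f_prime
    resid *= 1.0 + f_prime

assert abs(plain) < 1e-40   # gradient effectively gone after 50 plain blocks
assert 0.5 < resid < 1.0    # the shortcut keeps a usable gradient
```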

Shortcut options (handling dimension mismatch)

  • Option A: Zero-padding identity shortcut
    • When the channel count grows, fill the missing channels with zeros
      • (the zero-filled channels take part in no residual learning)
    • A fully parameter-free identity mapping
    • The lightest and simplest option
  • Option B: Projection shortcut (1×1 convolution)
    • When input/output channels differ, a 1×1 conv projection matches the shapes
    • The extra projection layers slightly increase computation
  • Option C: Full projection shortcut
    • Replaces every shortcut with a 1×1 conv projection
    • Many more projection layers, so both FLOPs and memory grow substantially
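Options A and B can be sketched in NumPy on a (channels, height, width) feature map (the 64→128 channel change and the weight scale are illustrative): zero-padding needs no parameters, while the projection is a learned 1×1 conv, i.e. a per-pixel matrix multiply over channels. The stride-2 spatial downsampling that accompanies a real stage transition is omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Input feature map with 64 channels; the next stage expects 128 channels.
x = rng.standard_normal((64, 8, 8))  # (C, H, W)

# Option A: identity shortcut with zero-padded channels (parameter-free).
pad = np.zeros((128 - 64, 8, 8))
shortcut_a = np.concatenate([x, pad], axis=0)
assert shortcut_a.shape == (128, 8, 8)
assert np.allclose(shortcut_a[:64], x)  # original channels pass through unchanged

# Option B: projection shortcut, a 1x1 conv = per-pixel channel-mixing matmul.
W = rng.standard_normal((128, 64)) * 0.1  # learned weights (illustrative scale)
shortcut_b = np.einsum('oc,chw->ohw', W, x)
assert shortcut_b.shape == (128, 8, 8)
```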

All three options clearly outperform the plain network:

  • A is the lightest and the default,
  • B is slightly more accurate,
  • C is marginally more accurate still, but inefficient.

ResNet therefore uses the A/B-based "Residual + Identity Shortcut" design.

Summary

  • The core of ResNet is the "Residual + Identity Shortcut" combination
    • The residual concept is implemented through the shortcut
  • Projections are used only as a helper where dimension matching is required

Bottleneck Architecture

Core idea

The 3×3 Conv carries most of the computation, so
reduce the channels, run the 3×3 conv, then restore the channels.

  • A conv layer's FLOPs are proportional to H·W·k²·C_in·C_out → the larger the channel count C, the more FLOPs grow, roughly as C²

Structure

  1. 1×1 Conv (channel reduction)
    • 256 → 64
    • almost no computation
  2. 3×3 Conv (feature extraction)
    • operates on the reduced channels (64ch)
  3. 1×1 Conv (channel restoration)
    • 64 → 256
    • returns to the original channel count
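The saving can be checked from the per-pixel multiply counts (the spatial size is a common factor and cancels): the 256-channel bottleneck above costs roughly 17× less than two 3×3 convs run directly at 256 channels.

```python
# Multiplies per output pixel for one conv layer: k * k * C_in * C_out.
def conv_cost(k, c_in, c_out):
    return k * k * c_in * c_out

# Bottleneck: 1x1 (256->64) -> 3x3 (64->64) -> 1x1 (64->256)
bottleneck = conv_cost(1, 256, 64) + conv_cost(3, 64, 64) + conv_cost(1, 64, 256)

# The same width handled by a basic block: two 3x3 convs at 256 channels.
basic = 2 * conv_cost(3, 256, 256)

assert bottleneck == 69_632
assert basic == 1_179_648
assert basic / bottleneck > 16  # roughly a 17x reduction
```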

Methodology

  • Model architecture
  1. Stem (input-processing section): the ZERO PAD → CONV → BN → ReLU → MAX POOL part of the figure, the initial processing before the stages begin
    • 7×7 Conv (stride=2): feature extraction
    • BatchNorm, ReLU: normalization + nonlinear transform
    • MaxPool: halves the spatial size, leaving a feature map the bottleneck blocks in later stages can handle comfortably; an initial feature extractor with output 56×56, 64ch
  2. Stage 2 (conv2_x): Residual + Bottleneck
    • CONV BLOCK (blue) = bottleneck block + projection shortcut
      • At a stage transition (resolution changes / channels grow) → a dimension mismatch occurs
      • the shortcut can no longer pass x through unchanged
      • so the block that needs a projection shortcut (1×1 conv) is the CONV BLOCK
    • ID BLOCK ×2 (red) = bottleneck block + identity shortcut
      • input and output dimensions match, so exactly an identity shortcut is used → ID BLOCK
  3. Stages 3, 4, 5
    • Stage 3: CONV BLOCK + ID BLOCK ×3
    • Stage 4: CONV BLOCK + ID BLOCK ×5
    • Stage 5: CONV BLOCK + ID BLOCK ×2
  4. Head (output-processing section)
  • output: final prediction

A Stage is a "group of blocks over which spatial size and channel count stay fixed."

  • First block: CONV BLOCK (projection shortcut)
    → resolves the dimension mismatch caused by the stage transition
  • Remaining blocks: ID BLOCK (identity shortcut)
    → channels/resolution do not change inside the block
    → the shortcut can pass x through as-is; the dimensions match

Block internals

  • Both CONV BLOCK and ID BLOCK use the bottleneck structure: 1×1 Conv (reduce) → 3×3 Conv → 1×1 Conv (restore)
  • "Within a stage, block input and output channels are always identical"

Summary

  1. A Stage is a group of residual blocks
  2. The first block needs a projection shortcut, so it is a CONV BLOCK
  3. The rest use identity shortcuts, so they are ID BLOCKs
  4. Every block's interior is the bottleneck structure
  5. Dimension mismatches appear at stage transitions
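The stage layout above is consistent with the network's name: counting the stem conv, three convs per bottleneck block, and the final fully connected layer gives 50 weighted layers (projection shortcuts are conventionally not counted toward the depth).

```python
# Blocks per stage as listed above (conv2_x .. conv5_x of ResNet-50).
blocks = {"conv2_x": 3, "conv3_x": 4, "conv4_x": 6, "conv5_x": 3}

convs = 1 + 3 * sum(blocks.values())  # stem 7x7 conv + 3 convs per bottleneck
total = convs + 1                     # + final fully connected layer
assert total == 50                    # hence "ResNet-50"
```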
Module composition

A. Stem module

  • Zero Padding: pads the input tensor's borders with zeros to preserve its spatial size, so the conv filters cover the same extent at the boundaries. In ResNet it is used to keep the output resolution of the stem's 7×7 conv and the 3×3 convs inside residual blocks.
  • 7×7 Conv: processes a wide receptive field of the input image at once to quickly extract low-resolution features; stride=2 halves the spatial size and builds the initial feature map.
  • BatchNorm: placed after the convolution; mitigates internal covariate shift and stabilizes training.
  • ReLU: supplies nonlinearity; clips the negative range to 0 while keeping the gradient flowing.
  • MaxPool: large-stride pooling shrinks the resolution quickly, producing a feature map sized appropriately for the residual blocks that follow.

B. Residual Block module

  • Convolution + BatchNorm + ReLU: each residual branch is composed as conv → BN → ReLU.
  • Basic Block / Bottleneck Block
    • Basic Block: two 3×3 convs; used in shallow networks (ResNet-18/34).
    • Bottleneck Block: 1×1 → 3×3 → 1×1 convs; used in deep networks (ResNet-50/101/152).
  • Shortcut types
    • Identity shortcut: adds the input directly when input and output channel counts match
    • Projection shortcut: a 1×1 conv linear transform when the dimensions differ
    • Zero-padding shortcut: fills the missing channels with zeros when the output has more channels
  • Element-wise addition: the transformed residual F(x) and the input x are added element-wise to form the final output.
C. Head module

D. Training-stabilization module

  • He initialization

  • BatchNorm's scaling / shifting

  • Fully convolutional inference

  • Training setup: data augmentation method, optimizer and hyperparameters, learning-rate schedule, no dropout, multi-scale testing

๋ฐ์ดํ„ฐ์…‹ ๋ฐ ์ž…๋ ฅ ์ „์ฒ˜๋ฆฌ

ImageNet (ILSVRC 2012)

  • Training images: ~1.28M
  • Validation images: 50K
  • Classes: 1000

Input size

  • Training: 224 × 224 RGB crops
  • Testing: shorter side = 256 → center crop to 224 × 224
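The test-time crop can be sketched as follows (resizing the shorter side to 256 needs an image library, so only the center-crop step is shown; the array shapes are illustrative):

```python
import numpy as np

def center_crop(img, size=224):
    # Crop an (H, W, C) array to size x size around its center.
    h, w = img.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return img[top:top + size, left:left + size]

# E.g. an image whose shorter side was already resized to 256:
img = np.zeros((256, 341, 3), dtype=np.float32)
crop = center_crop(img)
assert crop.shape == (224, 224, 3)
```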

๋ฐ์ดํ„ฐ ์ฆ๊ฐ•

๋…ผ๋ฌธ์—์„œ ์‚ฌ์šฉํ•œ ์ฆ๊ฐ•์€ ํ‘œ์ค€์ ์ธ ImageNet ์„ค์ •์— ๊ตญํ•œ๋จ.

  • Random resized crop (224 ร— 224)
  • Random horizontal flip
  • Color jitter, Cutout, Mixup ๋“ฑ ์‚ฌ์šฉํ•˜์ง€ ์•Š์Œ

์ค‘์š”ํ•œ ์ 
โ†’ ResNet์˜ ์„ฑ๋Šฅ ํ–ฅ์ƒ์€ ํŠน์ˆ˜ํ•œ augmentation ๋•Œ๋ฌธ์ด ์•„๋‹˜
โ†’ ๊ตฌ์กฐ์  ๊ฐœ์„ ์˜ ํšจ๊ณผ๋ฅผ ๋ถ„๋ฆฌํ•ด์„œ ๋ณด์—ฌ์ฃผ๊ธฐ ์œ„ํ•จ


Experimental Results and Analysis

  • Plain vs. residual comparison (18/34-layer)
  • Visualization of the degradation problem (Figure 4)
  • Shortcut option (A/B/C) experiments
  • Deeper networks via bottlenecks (50/101/152-layer)
  • Comparison with the state of the art (ILSVRC 2015)


Conclusions and Implications

  • The impact of residual learning
  • The essential significance of the shortcut
  • A new baseline for deep-network design
  • A thread leading into follow-up work (pre-activation ResNet, etc.)


๊ฐœ์ธ ์ฝ”๋ฉ˜ํŠธ

  • ์ดํ•ด๊ฐ€ ์–ด๋ ค์› ๋˜ ๋ถ€๋ถ„

  • ์ถ”๊ฐ€๋กœ ์ฐพ์•„๋ณผ ๊ฐœ๋… (๋…ผ๋ฌธ ๋‚ด ์šฉ์–ดยท์ฐธ๊ณ  ๋ฌธํ—Œ ๋“ฑ)

    • ResNet v2: Pre-activation (later variant)
  • ์ถ”๊ฐ€ ์ฐธ๊ณ ํ•  ๋…ผ๋ฌธ

(1) Normalized-initialization line

  • Xavier initialization: Glorot & Bengio, "Understanding the Difficulty of Training Deep Feedforward Neural Networks" (2010)
  • He initialization (tuned for ReLU): He et al., "Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification" (2015)

(2) Intermediate-normalization-layers line

  • Batch Normalization: Ioffe & Szegedy, "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift" (2015)