# Explain a fully convolutional network

**I was surveying some literature related to Fully Convolutional Networks and came across the following phrase,**

> A fully convolutional network is achieved by replacing the parameter-rich fully connected layers in standard CNN architectures by convolutional layers with 1×1 kernels.

**I have two questions.**

What is meant by parameter-rich? Is it called parameter-rich because the fully connected layers pass on parameters without any kind of "spatial" reduction? Also, how do 1×1 kernels work? Doesn't a 1×1 kernel simply mean that one is sliding a single pixel over the image? I am confused about this.

**Fully convolutional networks**

A fully convolutional network (FCN) is a neural network that only performs convolution (and subsampling or upsampling) operations. Equivalently, an FCN is a CNN without fully connected layers.

**Convolutional neural networks**

The typical convolutional neural network (CNN) is not fully convolutional because it often also contains fully connected layers (which do not perform the convolution operation). These layers are parameter-rich in the sense that they have many parameters (compared to their equivalent convolutional layers). However, a fully connected layer can also be viewed as a convolution with a kernel that covers the entire input region, which is the main idea behind converting a CNN into an FCN. See this video by Andrew Ng that explains how to convert a fully connected layer into a convolutional layer.
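The FC-to-convolution equivalence mentioned above can be checked numerically. Here is a minimal NumPy sketch (with hypothetical shapes): a fully connected layer applied to a flattened h×w×d feature volume computes exactly the same numbers as k convolution kernels that each cover the whole input, producing a 1×1×k output.

```python
import numpy as np

rng = np.random.default_rng(0)
h, w, d, k = 7, 7, 16, 10               # hypothetical feature-map size and output units

x = rng.standard_normal((h, w, d))      # input volume
W = rng.standard_normal((k, h, w, d))   # k kernels, each covering the entire input

# Fully connected view: flatten the input and the weights (same C-order flattening).
fc_out = W.reshape(k, -1) @ x.reshape(-1)   # shape (k,)

# Convolutional view: each kernel fits the input exactly once ("valid" convolution),
# so each output "feature map" is a single number.
conv_out = np.array([(x * W[i]).sum() for i in range(k)])  # shape (k,)

print(np.allclose(fc_out, conv_out))  # True: the two views give identical outputs
```

The only difference between the two views is how the weights are arranged; the computation is the same dot product.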

**An example of an FCN**

An example of a fully convolutional network is the U-net (called this way because of its U shape, which you can see in the illustration below), a famous network used for semantic segmentation, i.e. classifying the pixels of an image so that pixels that belong to the same class (e.g. a person) are given the same label (i.e. person), aka pixel-wise (or dense) classification.

**Semantic segmentation**

In semantic segmentation, you want to associate a label with each pixel (or small patch of pixels) of the input image. Here's a more suggestive illustration of a neural network that performs semantic segmentation.

**Instance segmentation**

There's also instance segmentation, where you additionally want to differentiate different instances of the same class (e.g. you want to distinguish two people in the same image by labelling them differently). An example of a neural network used for instance segmentation is Mask R-CNN. The blog post Segmentation: U-Net, Mask R-CNN, and Medical Applications (2020) by Rachel Draelos describes these two problems and networks very well.

Here's an example of an image where instances of the same class (i.e. person) have been labelled differently **(orange and blue)**.

Both semantic and instance segmentation are dense classification tasks (specifically, they fall into the category of image segmentation): that is, you want to classify each pixel, or many small patches of pixels, of an image.

**1×1 convolutions**

In the U-net diagram above, you can see that there are only convolutions, copy and crop, max-pooling, and upsampling operations. There are no fully connected layers.

So, how do we associate a label to each pixel (or a small patch of pixels) of the input? How do we perform the classification of each pixel (or patch) without a final fully connected layer?

That's where the 1×1 convolution and upsampling operations are useful!

In the case of the U-net diagram above (specifically, the top-right part of the diagram, which is illustrated below for clarity), two 1×1×64 kernels are applied to the input volume (not the images!) to produce two feature maps of size 388×388. They used two 1×1 kernels because there were two classes in their experiments (cell and not-cell). The mentioned blog post also gives you the intuition behind this, so you should read it.

If you have tried to analyse the U-net diagram carefully, you will notice that the output maps have different spatial (height and width) dimensions than the input images, which have dimensions 572×572×1.

That's fine because our general goal is to perform dense classification (i.e. classify patches of the image, where a patch can contain as little as one pixel). I said earlier that we would perform pixel-wise classification, so you may have expected the outputs to have exactly the same spatial dimensions as the inputs. Note, however, that in practice you could also make the output maps have the same spatial dimensions as the inputs: you would just need to perform a different upsampling (deconvolution) operation.
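To see how an upsampling step changes spatial dimensions, here is a minimal NumPy sketch using nearest-neighbour upsampling on a hypothetical 2×2 feature map. (The U-net itself uses learned up-convolutions; this only illustrates the shape change, not the learned operation.)

```python
import numpy as np

# Nearest-neighbour upsampling: each value is replicated along both spatial axes,
# doubling the height and the width.
x = np.arange(4.0).reshape(2, 2)            # a tiny 2x2 feature map
up = x.repeat(2, axis=0).repeat(2, axis=1)  # 2x2 -> 4x4

print(up.shape)  # (4, 4)
```

A transposed (learned) convolution achieves the same kind of spatial growth, but with trainable weights instead of plain replication.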

**How do 1×1 convolutions work?**

A 1×1 convolution is just the typical 2d convolution but with a 1×1 kernel.

As you probably already know (and if you didn't, now you do), if you have a g×g kernel that is applied to an input of size h×w×d, where d is the depth of the input volume (which is 1 for grayscale images, for example), the kernel actually has shape g×g×d, i.e. the third dimension of the kernel is equal to the third dimension of the input it is applied to. This is always the case, except for 3d convolutions, but we are now talking about typical 2d convolutions! See this answer for more info.
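Here is a minimal NumPy sketch (with hypothetical shapes) of this point: a "2d" convolution on a multi-channel input uses a kernel whose depth matches the input depth d, and each kernel produces a single 2d feature map.

```python
import numpy as np

h, w, d, g = 5, 5, 3, 3                 # hypothetical input size and kernel size
rng = np.random.default_rng(0)
x = rng.standard_normal((h, w, d))
kernel = rng.standard_normal((g, g, d))  # note the depth d, not just g x g

# "Valid" 2d convolution: the kernel slides spatially, but spans the full depth,
# so the output of one kernel is a single 2d map.
out = np.zeros((h - g + 1, w - g + 1))
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        out[i, j] = (x[i:i + g, j:j + g, :] * kernel).sum()

print(out.shape)  # (3, 3)
```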

So, if we want to apply a 1×1 convolution to an input of shape 388×388×64, where 64 is the depth of the input, then the actual 1×1 kernels we need have shape 1×1×64 (as I said above for the U-net). The depth of the output is determined by the number of 1×1 kernels you use: this is exactly the same as for any 2d convolution operation with other kernel sizes (e.g. 3×3).
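A 1×1 convolution is therefore just a per-pixel linear map across channels. Here is a NumPy sketch of the U-net case above: applying two 1×1×64 kernels (one per class) to a 388×388×64 volume yields a 388×388×2 output.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((388, 388, 64))  # input volume (h, w, d), as in the U-net
kernels = rng.standard_normal((2, 64))   # two 1x1x64 kernels, one per class

# A 1x1 convolution is a dot product over the channel axis at every spatial
# position; no spatial sliding window is involved beyond a single pixel.
out = np.einsum('hwd,kd->hwk', x, kernels)

print(out.shape)  # (388, 388, 2)
```

The spatial dimensions are untouched; only the depth changes (here from 64 down to 2, matching the number of classes).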

In the case of the U-net, the spatial dimensions of the input are reduced in the same way that the spatial dimensions of any input to a CNN are reduced (i.e. 2d convolution followed by downsampling operations). The main difference (apart from not using fully connected layers) between the U-net and other CNNs is that the U-net performs upsampling operations, so it can be viewed as an encoder (left part) followed by a decoder (right part).