> TL;DR: We theoretically analyze the differential architecture search (DARTS) for

understanding the role and impact of skip connections, which inspires a new

method for Neural Architecture Search (NAS) using group-structured sparse gates

and path-depth-wise regularization to overcome the limitation of existing NAS

methods for AutoML.

In our work [1], we investigate the problem of Network architecture search

(NAS), an important family of techniques for Automated Machine Learning

(AutoML). We theoretically analyze the performance of one popular NAS method,

namely differential architecture search (DARTS) [2], which inspires us to design

a more effective NAS method. For the first time, our work theoretically proves

that skip (identity) connections can help networks converge faster and have

superior competitive advantages over other types of operations. This well

explains why DARTS often selects network architectures with dominated skip

connections which often leads to unsatisfactory performance. Based on our

theory, we propose a new path-regularized DARTS with two novel components,

group-structured sparse gate and path-depth-wise regularization, which both

alleviate the unfair competition among operations and thus help finding better

network architectures. Our work advances NAS in both theory and practical

algorithm design to advance the frontiers of automated machine learning.

1. What is network architecture search (NAS)?

Network architecture search (NAS) is an important branch of machine learning

techniques in AutoML. As shown in Fig. 1, given a network graph in the left

figure, NAS aims to address the problem of automatically searching the best

network, that is, how to automatically select proper operations from a

predefined operation set (e.g., 3×3/5×5 convolution/pooling, skip connection,

zero operation, etc) between any two data nodes in the graph. After the NAS

search, one can expect to obtain the well-defined network in the right figure of

Fig. 1 without human experts.

2. Why network architecture search (NAS) is important?

Network architecture search (NAS) [3] is an effective approach for automating

network architecture design, with many successful applications witnessed to

image recognition and language modelling. Unlike expert-designed architectures

which require substantial efforts from experts by trial and error, NAS can

automatically design the network architectures and thus greatly alleviates the

design efforts of experts. Moreover, NAS can also alleviate the design bias

brought by experts which could prohibit achieving better performance. This is

demonstrated by the fact that the networks designed by NAS outperform these

designed by human experts and achieves state-of-the-art performance in many

applications, e.g. image classification.

3. What is the issue in the differential architecture search (DARTS)?

DARTS [2] is a recently developed leading NAS approach. Previous reinforcement

learning (RL) and evolutionary algorithm (EA) based NAS methods reformulate the

network search problem as a discrete optimization problem, since one needs to

select proper operations from a predefined operation sets to connect any two

data nodes in the network. However, such discrete optimization has extremely

search space (total 1e+16 operation choices for the network of 7 nodes), and

suffers from very high computational cost (thousands of GPU days). To solve this

issue, DARTS relaxes the discrete network architecture search process into

finding continuous architecture parameters. That is, DARTS defines weights for

each operations between any two nodes, and optimizes these operation weights to

decide the final network architecture. In this way, it can optimize the

architecture parameters via gradient descent and thus greatly reduces the high

search cost in RL and EA searching approaches. However, this differential NAS

family, including DARTS and its variants, typically selects many skip

connections (identity mapping, one kinds of operations for connecting two data

nodes) which dominate over other types of operations [4,5,6]. Consequently, the

searched networks are observed to have unsatisfactory performance.

4. What is the intrinsic theoretical reason for the dominated skip connection in

DARTS?

Though the dominance of skip connections in DARTS is observed in many works

[4,5,6], theoretical understandings on this issue remain absent, hindering the

development of more advanced methods in a principled way. In this work, we solve

this problem by theoretically analyzing the effects of various types of

operations, e.g. convolution, skip connection and zero operation, to the network

optimization. By imposing proper conditions on the activation functions,

network parameter initializations, and input data, we prove that the

architectures with more skip connections can converge faster than the other

candidates, and thus are selected by DARTS. This result, for the first time,

theoretically and explicitly reveals the impact of skip connections to fast

network optimization and its competitive advantage over other types of

operations in DARTS.

5. What method can resolve the issue of the dominated skip connection in DARTS?

Inspired by our theory, we propose a path-regularized DARTS that consists of two

key modules: (i) a differential group-structured sparse binary gate introduced

for each operation to avoid unfair competition among operations, and (ii) a

path-depth-wise regularization used to incite search exploration for deep

architectures that often converge slower than shallow ones and are not well

explored during search.

(i) Differential group-structured sparse binary gate

By analysis, we find that the reason for the dominated skip connection is that

skip connection and other types of operations share one softmax distribution

(their sum is one). When the weights of skip connections become larger, the

weights of other operations become smaller. To resolve this issue, we introduce

independent binary gate implemented by Bernoulli distribution for each operation

between two nodes to avoid the direct competition between skip connection and

other operations. If the binary gate is one for the operation, then the

operation will be used to connect two data nodes. Otherwise, the operation will

not be used. However, we prove that independent distribution for each operation

gives dense architectures which will lead to information loss when pruning the

network to a small network after the search.

To solve this issue, we propose a group-structured sparse binary gate for each

operation. Specifically, we first define one binary Bernoulli gate variable for

each operation. Then we pass the gate variable into a hard thresholding function

such that when the gate variable is smaller than a threshold, it becomes zero.

In this way, the gate variables in the network become sparse. Moreover, we can

compute the probability of activated gates of the whole network. So we collect

all skip connections in the cell as a skip-connection group and take the

remaining operations into non-skip-connection group. Then we compute the average

activation probability of these two groups and respectively penalize them to

obtain sparse selected network. As a result, as shown in Fig. 1 (a) and (b),

penalizing skip connections heavier than other types of operations can rectify

the competitive advantage of skip connections over other operations and avoids

skip-connection-dominated cell.

Moreover, the above group-structured sparse binary gates can gradually and

automatically prune redundancy and unnecessary connections, which reduces the

information loss of pruning at the end of search. Finally, sparse binary gates

defined on the whole cell can encourage global competition and cooperation of

all operations in the cell, which differs from DARTS that only introduces local

competition among the operations between two nodes.

(ii) Path-depth-wise Regularizer on Operation Gates

Except for the above advantages, the above independent sparse gates also

introduce one issue: they prohibit the method to select deep networks. Without

dominated skip connections in the network, other types of operations, e.g. zero

operation, become freer and are widely used. Accordingly, the the search

algorithm can easily transform a deep network to a shallow network. For example,

it can use skip connections to connect the intermediate nodes to input nodes

and cut off the connection between intermediate neighbouring nodes. Meanwhile,

we prove that gradient descent algorithm prefers shallow networks than deep

ones, as shallow networks often have more smooth function landscape and can be

faster optimized. So these two factors together lead to a bias of search

algorithm to shallow cells.

To resolve this network-selection bias, we propose a path-depth-wise

regularization to rectify the unfair competition between shallow and deep

networks. We first compute the probability that any two neighbouring nodes are

connected with each other by parameterized operations, namely various types of

convolutions. Then to rectify the competitive advantage of shallow networks over

deep ones, we impose path-depth-wised regularization on the stochastic gates to

encourage more exploration to deep networks and then decide the depth of

networks instead of greedily choosing shallow cell at the beginning of search.

As a result, as shown in Fig. 1 (a) and (c), our path-depth-wise regularization

can help find deeper networks which usually have better data fitting ability

than shallow networks. Finally, by combining group-structured sparse gates and

the path-depth-wise regularization, we can obtain our final model which is shown

to be superior than DARTS in Fig. 1 (d) and (e).

6. What results the developed method has achieved?

To evaluate the performance of our method, following conventional setting we

train our model on the search the CIFAR10 dataset to find network architecture,

and then test the selected network on the CIFAR10, CIFAR 100 and ImageNet

datasets. The classification results in Table 1 show that our method achieves

the state-of-the-art results on both CIFAR10 and CIFAR100.

The results on ImageNet in Table 2 also demonstrate the superiority of our

method. All these results show the superiority, robustness and transferability

of the selected network architecture by our method.

7. Takeaway Remarks and What is Next?

This work first provides theoretical understanding of existing DARTS methods,

which is rarely investigated though heavily desired. Then we develop an

effective network architecture search approach which well resolves the issue in

DARTS. Our work pushes the frontiers of both theoretical performance analysis

and practical algorithm design of NAS, which not only helps us better

theoretically understand current NAS approaches and also brings new practical

design insights for NAS. We will release our code and models for anyone who is

interested in NAS and AutoML research and applications.

If you are interested in learning more, please check out our paper

[https://arxiv.org/abs/2006.16537] and feel free to contact us.

Resources:

Paper: Theory-Inspired Path-Regularized Differential Network Architecture Search

[https://arxiv.org/abs/2006.16537]

Code: We will release the codes and models soon.

When referencing this work, please cite:

@inproceedings{zhou2020nas,

title={Theory-Inspired Path-Regularized Differential Network Architecture

Search},

author={Pan Zhou, Caiming Xiong, Richard Socher, and Steven Hoi},

booktitle={Neural Information Processing Systems},

year={2020}

}

References

[1] P. Zhou, C. Xiong, R. Socher, S. Hoi, Theory-Inspired Path-Regularized

Differential Network Architecture Search, Neural Information Processing Systems

(NeurIPS), 2020 (oral). [https://arxiv.org/pdf/2006.16537.pdf]

[2] H. Liu, K. Simonyan, and Y. Yang. DARTS: Differentiable architecture

search.

In Int’l Conf. Learning Representations, 2018.

[https://arxiv.org/pdf/1806.09055.pdf]

[3] B. Zoph and Q. Le. Neural architecture search with reinforcement learning.

In Int’l Conf. Learning Representations, 2017.

[https://arxiv.org/abs/1611.01578]

[4] X. Chen, L. Xie, J. Wu, and Q. Tian. Progressive differentiable

architecture

search: Bridging the depth gap between search and evaluation. In IEEE

International Conference on Computer Vision, pages 1294–1303, 2019.

[https://arxiv.org/abs/1904.12760]

[5] X. Chu, T. Zhou, B. Zhang, and J. Li. Fair DARTS: Eliminating unfair

advantages in differentiable architecture search. arXiv preprint

arXiv:1911.12126, 2019. [https://arxiv.org/abs/1911.12126]

[6] T. Arber Zela, T. Saikia, Y. Marrakchi, T. Brox, and F. Hutter.

Understanding and robustifying differentiable architecture search. In Int’l

Conf. Learning Representations, 2020. [https://arxiv.org/abs/1909.09656]