1 Introduction

Neurons are the cells in the nervous system that carry information to other cells and communicate with each other in distinctive ways. Neurons [10] are the elementary functional units of the brain. Nerve cells, or neurons, communicate [47] with each other through dedicated connections called synapses. Neurons are categorized into three types based on their functionality: sensory neurons, motor neurons, and interneurons. Sensory neurons [1] send signals to the brain or the spinal cord. They are responsible for responding to different stimuli such as sound, light, or touch detected by the sensory organs. Motor neurons receive signals from the brain and spinal cord [2] and drive outputs ranging from muscle contractions to glandular secretion. Interneurons connect multiple neurons within a brain region or the spinal cord. The connections among these neurons form circuits called neural circuits. A neuron comprises a cell body called the soma, dendrites, and an axon [21]. The soma is typically compact, and the dendrites and axon are filaments that extend from it. Dendrites can extend freely from the soma, typically up to about a hundred micrometers. The axon leaves the soma at a swelling called the axon hillock and can extend for about 1 m in human beings, or farther in other species. The axon terminals pass [57] signals to synapses and to other cells in the body. Undifferentiated cells may lack axons or dendrites. Typically, a neuron has a cell body, dendrites, and an axon. The cell body comprises the cytoplasm and the nucleus. The axon extends from the cell body and usually gives rise to minor offshoots or branches before terminating at nerve endings. Dendrites surround the neuron cell body and receive signals from other neurons. The main contact points are the synapses, which are responsible for communication among neurons and may connect one dendrite to another dendrite or one axon to another axon. The dendrites [50] are covered with synapses formed by the ends of axons from other neurons. In general, neurons are electrically excitable and maintain voltage gradients across their membranes; the signaling mechanism is therefore partly electrical and partly chemical.

General-purpose hardware is based on arithmetic blocks for simple in-memory calculations. Serial processing does not provide fast enough performance for deep learning applications, whereas ANN architectures rely on parallel computation and operations. Ordinary chips cannot support a large number of highly parallel, simultaneous operations for neuron processing, so AI-based hardware includes dedicated chips that enable parallel processing. The main motivation for using ANN and AI-based hardware accelerators is to obtain higher-bandwidth memory and faster computation in comparison to general-purpose hardware.

Digital tools and simulators are well suited and widely applied for exploring the measurable behavior of neural networks. Silicon neuron systems [7] mix analog and digital signals, may be used to analyze behavior using VLSI integrated circuits, and can simulate the electrophysiological behavior of actual neuron processing at various levels of abstraction. The most recent FPGAs provide a huge amount of physical memory and logic gates [22], allowing large-scale neural networks to be implemented in hardware at a reasonable cost. With the current level of simulation and synthesis technology, research laboratories can easily afford FPGAs. The hardware synthesis method allows researchers to work on parallel brain cell structures. Digital models can be used for cell-based controls, and digital stem coding techniques can be used to facilitate communication across the medium over vast distances. It is well known that such neurons can be used to model ANNs of the earlier generations by equating the mean firing rates of processing neurons with hardware activity, yielding proficient, scalable, and low-power implementations [6] of single-layer feed-forward networks.

Human brain activity can be observed in both local and global domains. Activities such as vision and hearing are linked to specific brain regions. When a brain injury or accident occurs, the behavior of the brain neurons changes. The brain is a miniature network environment in which each portion has its own set of neural connections that are segregated from, yet connected to, one another. The local response is merged into a global understanding, so the activity of the entire brain becomes disturbed. Machine learning and ANN-based intelligent methods have been proposed in the medical and health care industry to enhance security and to train models that improve patient treatment, diagnostics, rights, prevention, autonomy, and equality [55]. Research based on deep learning with MobileNetV2 and long short-term memory (LSTM) was offered to automate the identification and classification [27] of skin diseases. Oversampling techniques [32] can be used to detect cervical cancer based on feature extraction and spatial clustering. Synthetic minority over-sampling has been used for hypertension identification [31] and prediction based on the random forest machine learning method. The wrapper filter [41] was used for disease classification and feature selection. Neural networks have been applied to CT images of the human liver for accurate diagnosis [56] of liver-related diseases.

ANNs have several advantages that make them ideal for specific scenarios and problems. ANN systems can learn and model non-linear functions and construct complicated associations between non-linear, complex inputs and outputs, which is critical for real-world solutions. Sensed inputs and outputs cause the neural network to adapt, or learn. ANN is a term that refers to several deep learning technologies that fall under the umbrella of artificial intelligence [18]. These technologies are mostly used in commercial applications to handle pattern recognition and sophisticated signal processing problems. To realize nonlinear excitation functions, the development and realization of a single neural network require computing logic such as adders, multipliers, and a complex function evaluator [40]. The precision of the computational blocks is the most significant quality in the digital implementation of a single neural network [45]; it is set by choosing their word length, which allows a higher resolution to be selected. Fulfilling the function requires an appropriate mathematical trade-off, as a better resolution may result in higher system cost. As a result, implementing a single neural network in hardware necessitates multiplier, adder, and excitation function realization blocks [49]. Testing advanced neural networks and machine learning algorithms requires an advanced level of FPGA and simulation tools. The FPGA provides a platform on which high performance can be achieved using data processing blocks. The most powerful and mature neuro-chips are digital neural ASICs. High computational precision, great dependability, and high programmability are all advantages of digital technology. Furthermore, advanced design tools for digital full- and semi-custom design are accessible. The weights of synaptic connections can be stored on or off the chip; the trade-off between speed and size determines this decision.

The organization of the article is as follows: section 2 presents the related work, section 3 presents the structure of the single-layer neural network, and section 4 presents the design of the logarithmic multi-neuron system. The results and discussions are presented in section 5, followed by conclusions in section 6.

2 Related work

Artificial neural networks (ANNs) have been used widely for developments in a broad spectrum of perception, classification, association, control, and biomedical applications. ANN hardware implementations have been done on FPGAs in the digital domain owing to the miniaturization of component manufacturing technology. A high-speed ANN architecture was implemented on a Xilinx FPGA chip for random number generation and further used for data encryption over the network. The perceptron model of the multi-bit [4] input neuron was implemented in 130 nm technology; the model was proposed for low power consumption based on 4 neurons per layer. Multilayer perceptron architectures are used for complex decision regions [33], and activation functions play an important role. In the three-layer model, the first layer is the input layer, the second is the perceptron or hidden layer, and the third is the output layer. A feedforward network was used for the selection of concrete beams [30]. A metaheuristics approach [44] was used to realize the feed-forward ANN. An ANN engine was implemented on FPGA based on parallel processing [14] of the blocks, and the hardware parameters were analyzed. A backpropagation multilayer perceptron (MLP) was designed based on very large scale integration (VLSI) parameters and FPGA [13] using the very high speed integrated circuit (VHSIC) hardware description language (VHDL) to analyze the chip performance. A spiking neural network (SNN) [23] targeting 64 K neurons was designed on FPGA as a hardware accelerator. The performance of the neural network was enhanced using the concept of parallelization [15] applied in both the time and space domains. The design had 3/2, 7/3, 15/4, and 31/5 inputs/outputs and was implemented on an Altera EP3C16F484 Cyclone III FPGA in Quartus II software using VHDL.

In general, hardware systems for deep neural network (DNN) inference [52] suffer from a lack of on-chip memory, compelling access to additional memory outside the processor. It was recommended to employ nonvolatile scalable memory that can scale up to a 64-chip Illusion system. Hardware neural network models have been used to study dataflow [60] and weight access patterns of neurons, in which recurrent neural networks (RNNs) and probabilistic graphical models are mapped to compute-in-memory (CIM) designs that can be implemented using CMOS technology. A neural network design was implemented on FPGA [58], and its performance may be measured using hardware metrics like memory, chip area, and size; such hardware can be utilized to create embedded chips and internet of things (IoT) applications. Deep learning approaches [11] have been successfully employed to handle a variety of artificial intelligence challenges. The FPGA has been utilized to optimize various reconfigurable computer hardware and software for AI designs, where the topological and hardware designs are based on multiple neuron processing and scalable computation. A neural network architecture can be implemented using a processing engine layout [34] within a hardware performance analysis framework for recognizing bottlenecks in the initial stages of a convolutional neural network (CNN). This methodology is useful for evaluating various architectures for embedded chips and associated applications like hardware accelerators. ANNs have been modeled for various logic functions and logic gates [26]. One of the gates utilized for several applications and quick modeling is the XOR gate; a 3-input XOR gate was modeled using an ANN to anticipate intelligent learning, with numerical methods used to improve forecast accuracy. A novel way was presented for accelerating fully connected feed-forward neural networks [48] using an FPGA-based accelerator. The program was created to make diverse implementation activities easier by dividing the architecture into elementary layers, estimating the available computational hardware resources, and generating high-level C++ descriptions using high-level synthesis (HLS) tools. Decision tree classifiers and neural networks [38] have been used for hardware-in-the-loop testing of the power window; the machine learning models achieved an estimated 93% accuracy in automotive power window hardware. Neural networks have been used for the diagnosis of different diseases, and their realization in hardware is helpful for the implementation of optimal hardware. The diagnosis of the neurological disorder epilepsy [54] has been done by analyzing electroencephalography (EEG) signals, embedding a feed-forward multi-layer neural network architecture (MLP ANN) on FPGA using VHDL in the time-frequency domain. Multiple-input neural networks [29] have been used for forecasting death cases in China due to COVID-19; the models were applicable to the study and estimation of COVID-19 cases across the globe. Neural network inference [61] is limited by the conflict between high computation and storage complexity and the resource-restricted hardware available in different applications. Current research trends are developing in the direction of accelerating complex neural networks on FPGA-based platforms.
Neural network architectures can be designed and synthesized on FPGA to estimate hardware chip applications and to obtain optimal solutions for computing parallelism, data reuse, computing complexity, pruning, and quantization.

Reconfigurable computing architectures [59] play a very important role in real-time applications, and a neural network was implemented on FPGA based on reconfigurable computing. FPGA implementation has many challenges such as limited hardware, memory utilization, minimum delay and timing parameters, and low power consumption. VHDL programming was used to design the hardware chip on the Xilinx XCV50hq240 FPGA, and a Zynq FPGA [46] was used to test the behavior of the neural network chip and optimize throughput. Single-input single-output [51] and multiple-input multiple-output neural networks were used for forecasting the total number of tourist arrivals in Spain. An ANN acquires many inputs from a unique data set or from the outputs of previously correlated neurons; each input arrives through a connection, called a synapse, which has a weight [16]. A scalable ANN chip can be designed that provides fast response, low price, and low power consumption, and that can operate with embedded chips and be integrated on FPGA.

ANNs are used in a variety of applications including brain activity modeling and artificial intelligence. When employing an HDL and FPGA-based system realization [3], the number of neurons in an ANN design is limited. The ANN obtains a large number of inputs from a single data set or from the outputs of previously connected neurons; the inputs advance through links called synapses, each of which has a weight attached to it. The system may be realized using multilevel communication networks, convolutional neural functions, single-layer architecture, and other neural networks. The scalable ANN chip can give fast response, cheap cost, and low power consumption, as well as the ability to work with embedded circuits [9] and FPGA integration. The neural systems and switching operations follow cluster-based models, in which a large number of units are deployed throughout a specific network to provide original and supplemental services, which can improve communication. The specialized processors use standardized software, response behavior, essential data control, and a service module for coordination [28]. ANN modeling can be done for logarithmic numbers of inputs in powers of two, such as 2-input, 4-input, 8-input, 16-input, 32-input, 64-input, and so on, for large-scale network structures in which multiple users support the network's functionality. Chip design and FPGA-based system integration and implementation will offer scalable computing hardware [39] and a platform on which the users and computational hardware can be extended as the communication system requires. There is a research gap in the design and development of a chip that supports the multi-neuron clustering environment in which multiple users communicate in intra-exchange and inter-exchange environments [35].

The motivation for ANN hardware chip design is the current need for neuromorphic chips in real-world applications that require optimal hardware and memory. The deep learning-based ANN architecture is designed to provide optimal performance [17]. The hierarchical astrocyte network (HANA) design [12] is based on a hierarchical networks-on-chip (NoC) structure, providing a unique arrangement of neurons and astrocyte cells that supports information exchange between astrocyte cells and addresses the connectivity difficulty; the design was based on scalable computing. By establishing a modular array of clusters of neurons employing a hierarchical structure of low- and high-level routers in 65 nm CMOS technology, the unique hierarchical NoC architecture was employed to overcome the scalability issue [63]. An embedded system-based chip [62] was designed using a cross-paradigm neuromorphic chip to simplify the structure of different neural networks in spiking or non-spiking forms. Neuro-inspired computing chips [43] are a promising approach to the development of intelligent computing because they mimic the structure and operating principles of the biological brain; they are predicted to provide benefits over traditional systems in terms of hardware memory, energy efficiency, and computational power. The objective of the research work is to design and model a single-layer neural network chip for multiple scalable neuron inputs and to estimate the hardware chip performance in terms of memory, delay, and FPGA resources.

3 Structure of single-layer neural network

In a general way, the model of the neural network [5] is depicted in Fig. 1. The model accepts 'n' neuron inputs. Let us consider that the inputs are X1, X2, X3, ..., Xn. These inputs are processed with their corresponding weights W1, W2, W3, ..., Wn, and 'b' is the bias input. The nonlinear excitation function is f(x). The neuron processing is expressed with the help of Eqs. (1) and (2).

Fig. 1
figure 1

ANN Structure [25]

$$ y=f(x) $$
(1)
$$ x=\sum \limits_{i=1}^n{x}_i{w}_i+ Bias\ (b) $$
(2)

Wi denotes the weight of the ith connection and b is the bias input. The function f(x) is a nonlinear excitation function. The most popular excitation functions are expressed as follows.

For linear function,

$$ f(x)=x $$
(3)

For log sigmoid function,

$$ f(x)=\frac{1}{1+{e}^{-x}} $$
(4)

For tan sigmoid function,

$$ f(x)=\frac{e^x-{e}^{-x}}{e^x+{e}^{-x}} $$
(5)

Figure 2 presents the ANN structure with 8 neuron inputs and their weight coefficients; for 8 inputs, the weighted sum of Eq. (2) is expressed as

Fig. 2
figure 2

ANN with 8 inputs and weights

$$ y=\sum \limits_{i=1}^8{X}_i{W}_i+(b) $$
(6)

The output is expressed as

$$ y={X}_1{W}_1+{X}_2{W}_2+{X}_3{W}_3+{X}_4{W}_4+{X}_5{W}_5+{X}_6{W}_6+{X}_7{W}_7+{X}_8{W}_8+(b) $$
(7)

The hardware realization of the network needs 8 multipliers and 8 adders, as shown in Table 1. The multipliers are denoted M1, M2, ..., M8, so

$$ y={M}_1+{M}_2+{M}_3+{M}_4+{M}_5+{M}_6+{M}_7+{M}_8+(b) $$
(8)
Table 1 Multipliers and adders
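
As an illustration, a minimal VHDL sketch of the weighted-sum block of Eq. (8) is given below. The port names follow the RTL description in section 5 (X1 to X8 and W1 to W8 as 8-bit inputs, B_i as a 16-bit bias, Y as a 16-bit output), but the entity name ann8_weighted_sum, the unsigned arithmetic, the purely combinational style, and the wrap-around of any overflow beyond 16 bits are simplifying assumptions rather than the exact implementation of the chip.

-- Hedged sketch of the 8-input weighted-sum block of Eq. (8).
-- Assumptions: unsigned operands, combinational logic, results wrap modulo 2**16.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ann8_weighted_sum is
  port (
    X1, X2, X3, X4, X5, X6, X7, X8 : in  std_logic_vector(7 downto 0);  -- neuron inputs
    W1, W2, W3, W4, W5, W6, W7, W8 : in  std_logic_vector(7 downto 0);  -- weights
    B_i                            : in  std_logic_vector(15 downto 0); -- bias input
    Y                              : out std_logic_vector(15 downto 0)  -- weighted sum plus bias
  );
end entity ann8_weighted_sum;

architecture rtl of ann8_weighted_sum is
  signal m1, m2, m3, m4, m5, m6, m7, m8 : unsigned(15 downto 0); -- products M1..M8 of Table 1
begin
  -- eight multipliers (M1..M8)
  m1 <= unsigned(X1) * unsigned(W1);
  m2 <= unsigned(X2) * unsigned(W2);
  m3 <= unsigned(X3) * unsigned(W3);
  m4 <= unsigned(X4) * unsigned(W4);
  m5 <= unsigned(X5) * unsigned(W5);
  m6 <= unsigned(X6) * unsigned(W6);
  m7 <= unsigned(X7) * unsigned(W7);
  m8 <= unsigned(X8) * unsigned(W8);

  -- eight adders accumulate the products and the bias, as in Eq. (8)
  Y <= std_logic_vector(m1 + m2 + m3 + m4 + m5 + m6 + m7 + m8 + unsigned(B_i));
end architecture rtl;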

4 Design of Logarithmic Multi Neuron System

The scalable design of the logarithmic single-layer multi-neuron system [8, 53] is shown in Fig. 3. The design has 64 neuron inputs X1, X2, X3, X4, ..., X64 with corresponding input weights W1, W2, W3, W4, ..., W64. The functionality of the 64-input ANN can be understood as the parallel working of 8 blocks of 8-point ANN, where each block accepts 8 neuron points with their weights. The parallel execution of all the modules provides faster operation. The suggested operation is expanded logarithmically in powers of 2, giving a scalable architecture that can be grown in powers of 2. The operation of the 64-input ANN can be understood with the help of Table 2. The weighted sum is obtained by processing the 64-point ANN with a bias to provide the final outputs. In the scalable architecture, the module addresses "000", "001", "010", "011", "100", "101", "110", and "111" are assigned to the sequential processing of the 8-point ANN blocks. The design is scalable, can be extended to a larger extent, and can solve ANN problems at a large scale.

Fig. 3
figure 3

Multiple input ANN (64-point) architecture

Table 2 Realization of 64-point ANN
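
To illustrate the modular composition of Fig. 3, a hedged structural VHDL sketch is given below. It instantiates eight copies of the ann8_weighted_sum block sketched in section 3. The array port types (byte_array, word_array) and the entity name ann64_parallel are introduced here purely for brevity and are assumptions; the actual chip exposes the 64 individual X and W ports listed in section 5.

-- Hypothetical structural sketch: 64-point ANN built from eight parallel 8-point blocks.
library ieee;
use ieee.std_logic_1164.all;

package ann_types is
  type byte_array is array (natural range <>) of std_logic_vector(7 downto 0);
  type word_array is array (natural range <>) of std_logic_vector(15 downto 0);
end package ann_types;

library ieee;
use ieee.std_logic_1164.all;
use work.ann_types.all;

entity ann64_parallel is
  port (
    X : in  byte_array(1 to 64);   -- neuron inputs X1..X64
    W : in  byte_array(1 to 64);   -- weights W1..W64
    B : in  word_array(1 to 8);    -- bias input per 8-point block
    Y : out word_array(1 to 8)     -- outputs Y1..Y8 of the eight blocks
  );
end entity ann64_parallel;

architecture structural of ann64_parallel is
begin
  -- eight 8-point ANN blocks working in parallel, one per module address "000".."111"
  gen_blocks : for k in 0 to 7 generate
    blk : entity work.ann8_weighted_sum
      port map (
        X1 => X(8*k+1), X2 => X(8*k+2), X3 => X(8*k+3), X4 => X(8*k+4),
        X5 => X(8*k+5), X6 => X(8*k+6), X7 => X(8*k+7), X8 => X(8*k+8),
        W1 => W(8*k+1), W2 => W(8*k+2), W3 => W(8*k+3), W4 => W(8*k+4),
        W5 => W(8*k+5), W6 => W(8*k+6), W7 => W(8*k+7), W8 => W(8*k+8),
        B_i => B(k+1),
        Y   => Y(k+1)
      );
  end generate gen_blocks;
end architecture structural;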

The finite state machine (FSM) concept is used to create the ANN architecture. The state memory is used to save the current state of the machine, which requires 'N' flip-flops. A single clock signal is used to synchronize all of the flip-flops. The state vector holds the state memory in the state machine, as depicted in Fig. 4. The state machine processes state-0, state-1, state-2, state-3, state-4, state-5, state-6, and state-7 using the address inputs "000", "001", "010", "011", "100", "101", "110", and "111". In the one-hot encoding approach, exactly one state is active at a time depending on its selection input, and one output is derived at a time. The neurons X1 to X8 are multiplied with weights W1 to W8 and added with bias input-1 to produce the neuron output Y1 in state-0 (000). The neurons X9 to X16 are multiplied with weights W9 to W16 and added with bias input-2 to produce the neuron output Y2 in state-1 (001). The neurons X17 to X24 are multiplied with weights W17 to W24 and added with bias input-3 in state-2 (010) to produce the neuron output Y3. The neurons X25 to X32 are multiplied with weights W25 to W32 and added with bias input-4 in state-3 (011) to produce the neuron output Y4. The neurons X33 to X40 are multiplied with weights W33 to W40 and added with bias input-5 in state-4 (100), yielding the neuron output Y5. The neurons X41 to X48 are multiplied with weights W41 to W48 and added with bias input-6 to produce the neuron output Y6 in state-5 (101). The neurons X49 to X56 are multiplied with weights W49 to W56 and added with bias input-7 to produce the neuron output Y7 in state-6 (110). The neurons X57 to X64 are multiplied with weights W57 to W64 and added with bias input-8 to produce the neuron output Y8 in state-7 (111).

Fig. 4
figure 4

FSM for 64 input ANN Processing
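
A hedged behavioral VHDL sketch of the FSM of Fig. 4 is shown below, again using the illustrative array types (ann_types) from the previous sketch. It assumes that one 8-point block result is computed and registered per clock cycle as the 3-bit state/address advances from "000" to "111"; the exact timing, encoding, and arithmetic of the fabricated design may differ.

-- Hypothetical FSM sketch: one 8-point weighted sum registered per state.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.ann_types.all;

entity ann64_fsm is
  port (
    clk, rst : in  std_logic;
    X        : in  byte_array(1 to 64);  -- neuron inputs X1..X64
    W        : in  byte_array(1 to 64);  -- weights W1..W64
    B        : in  word_array(1 to 8);   -- bias input per state
    Y        : out word_array(1 to 8)    -- outputs Y1..Y8
  );
end entity ann64_fsm;

architecture rtl of ann64_fsm is
  signal state : unsigned(2 downto 0) := "000";  -- state memory / module address
begin
  process (clk)
    variable acc : unsigned(15 downto 0);
    variable k   : natural range 0 to 7;
  begin
    if rising_edge(clk) then
      if rst = '1' then
        state <= "000";
      else
        k   := to_integer(state);            -- block index selected by the address
        acc := unsigned(B(k + 1));            -- start from the bias of this block
        for i in 1 to 8 loop                   -- weighted sum of 8 neurons (wraps mod 2**16)
          acc := acc + unsigned(X(8*k + i)) * unsigned(W(8*k + i));
        end loop;
        Y(k + 1) <= std_logic_vector(acc);     -- Y1..Y8 produced in states "000".."111"
        state <= state + 1;                    -- advance to the next state
      end if;
    end if;
  end process;
end architecture rtl;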

5 Results and discussions

The hardware chip of the 8-point ANN and 64-point ANN is designed using VHDL coding in Xilinx ISE 14.7. Figure 5 presents the register transfer level (RTL) block diagram for the 8-point to 64-point ANN chip. The RTL depicts all inputs and outputs of the designed chip.

Fig. 5
figure 5

RTL of ANN

X1<7:0> to X64<7:0> are the inputs (8-bit) of the 64-neuron-input ANN architecture with std_logic_vector data type. W1<7:0> to W64<7:0> are the weight inputs (8-bit) corresponding to neuron inputs X1 to X64, also of std_logic_vector data type. B_i<15:0> is the 16-bit bias input, treated as the perceptron bias of the ANN architecture, with std_logic_vector data type. X_A<15:0> is the activation function output of the ANN architecture, 16 bits wide with std_logic_vector data type. Y<15:0> is the actual output, the weighted sum plus bias processed with the activation function, 16 bits wide with std_logic_vector data type.

Modelsim simulation of the 8-input ANN in binary and integer formats is shown in Figs. 6 and 7, respectively. Table 3 lists the test cases used for the functional simulation of the designed ANN chip. Modelsim simulation of the 64-input ANN in binary and integer formats is shown in Figs. 8 and 9. Table 4 lists the test cases (test case-1 to test case-8) used for the functional simulation of the designed ANN-64.

Fig. 6
figure 6

Modelsim simulation of 8 input ANN in binary

Fig. 7
figure 7

Modelsim simulation of 8 input ANN in integer

Table 3 Test cases for the simulation waveform
Fig. 8
figure 8

Modelsim simulation of 64 input ANN in binary and integer (inputs)

Fig. 9
figure 9

Modelsim simulation of 64 input ANN in binary (weights and outputs)

Table 4 Test cases for the simulation waveform ANN-64 point
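
For completeness, a minimal Modelsim-style testbench sketch for the 8-point block is given below. The stimulus values are purely illustrative and are not taken from Table 3 or Table 4; the testbench assumes that the ann8_weighted_sum sketch from section 3 has been compiled into the work library.

-- Illustrative testbench sketch; stimulus values are hypothetical, not the test cases of Tables 3 and 4.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity tb_ann8 is
end entity tb_ann8;

architecture sim of tb_ann8 is
  signal X1, X2, X3, X4, X5, X6, X7, X8 : std_logic_vector(7 downto 0) := (others => '0');
  signal W1, W2, W3, W4, W5, W6, W7, W8 : std_logic_vector(7 downto 0) := (others => '0');
  signal B_i : std_logic_vector(15 downto 0) := (others => '0');
  signal Y   : std_logic_vector(15 downto 0);
begin
  -- device under test: the 8-input weighted-sum sketch
  dut : entity work.ann8_weighted_sum
    port map (X1 => X1, X2 => X2, X3 => X3, X4 => X4,
              X5 => X5, X6 => X6, X7 => X7, X8 => X8,
              W1 => W1, W2 => W2, W3 => W3, W4 => W4,
              W5 => W5, W6 => W6, W7 => W7, W8 => W8,
              B_i => B_i, Y => Y);

  stim : process
  begin
    -- illustrative stimulus: every input 2, every weight 3, bias 10
    X1 <= x"02"; X2 <= x"02"; X3 <= x"02"; X4 <= x"02";
    X5 <= x"02"; X6 <= x"02"; X7 <= x"02"; X8 <= x"02";
    W1 <= x"03"; W2 <= x"03"; W3 <= x"03"; W4 <= x"03";
    W5 <= x"03"; W6 <= x"03"; W7 <= x"03"; W8 <= x"03";
    B_i <= x"000A";
    wait for 10 ns;
    -- expected weighted sum: 8 * (2 * 3) + 10 = 58
    assert to_integer(unsigned(Y)) = 58
      report "Unexpected Y = " & integer'image(to_integer(unsigned(Y)))
      severity error;
    wait;
  end process stim;
end architecture sim;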

The percentage of hardware used on the device is given by the device utilization report [37] for the implementation of the chip. The report is taken directly from the Xilinx software. It presents the number of adders, multipliers, slices, 4-input lookup tables (LUTs) [36], and input/output blocks (IOBs), the total memory usage (kB), and the combinational delay (ns), which includes path delay and routing delay. The Xilinx device summary for ANN-8, ANN-16, ANN-24, ANN-32, ANN-40, ANN-48, ANN-56, and ANN-64 is given in Table 5. The target device is a Virtex-5 FPGA (xc5vlx20t-2-ff323) used for simulation and synthesis [24]. Figure 10 presents the hardware utilization curve for the ANN-8 to ANN-64 hardware chips.

Table 5 Xilinx software parameters for ANN-8 point to ANN-64 point
Fig. 10
figure 10

Hardware utilization for ANN-8 to ANN-64 hardware chip

In the simulation of ANN-64, the hardware and memory usage depends on the utilization of multipliers and adders. The details of these units are reported directly by the software and change with the number of neuron and weight inputs. The hardware utilization increases with the number of cluster inputs of the ANN chip. The simulation results show that the numbers of multipliers, adders, slices, LUTs, and memory increase as the number of neurons increases in the multi-input ANN design. The reason is that the adder and multiplier blocks increase the number of gates and concurrent logic modules, which take up more memory and resources on the FPGA.

The report shows that the number of multipliers and 16-bit adders increases with the number of neuron inputs. The mean squared error (MSE), mean absolute percentage error (MAPE), and root mean squared error (RMSE) are predicted for the FPGA hardware resources [25, 42] based on training and validation sample neurons with different cluster inputs of the ANN design. Neurons X1 to X40 are considered for training and X41 to X64 for validation. The values are determined using the following equations [19, 20].

$$ MSE=\frac{1}{n}\sum \limits_{i=1}^n{\left|{y}_i-\hat{y_i}\right|}^2 $$
(9)
$$ RMSE=\sqrt{\frac{1}{n}\sum \limits_{i=1}^n{\left|{y}_i-\hat{y_i}\right|}^2} $$
(10)
$$ MAPE=\frac{1}{n}\sum \limits_{i=1}^n\frac{\left|{y}_i-\hat{y_i}\right|}{y_i}.100\% $$
(11)

yi is the actual value and \( \hat{y_i} \) is the predicted value for 'n' predictions. Based on a linear regression model and 200 estimations for 64 neurons, the values are MSE = 0.00500, RMSE = 0.07071, and MAPE = −0.003906%; note that RMSE is the square root of MSE (√0.00500 ≈ 0.07071), consistent with the reported values.

The efficiency of the hardware simulation depends on resource utilization such as logic gates, input/output blocks, combinational logic, memory, and delay. For complex nonlinear applications, a multilayer perceptron architecture is beneficial in comparison to a single-layer multiple-input ANN; on the other hand, a single-layer perceptron is simple to build and train. The neural network model can be explicitly linked to statistical models, allowing it to share the Gaussian density function with its covariance. The realization of the MLP will introduce more delay in comparison to the single-layer multiple-input ANN. Figure 11 shows the hardware efficiency with the targeted Virtex-5 FPGA for simulation and synthesis of the binary data. Efficiency variations are noticed across the different test cases, in which 8 neurons are processed at a time and a parallel-processing, modular design-based approach is followed to realize the 64-input ANN. The single-layer ANN hardware is used to solve simple problems, and parallel processing provides fast computation time. In terms of hardware efficiency, the single-layer network will provide faster response and computation time in comparison to the MLP; the MLP requires more delay to compute the logic because the data are processed by several hidden layers. The output of the ANN hardware chip determines the throughput, which depends on the length of the binary inputs, weights, bias input, and the hardware and timing parameters. The output layer receives the inputs from the layers before it, executes the calculations using its neurons, and then computes the output.

Fig. 11
figure 11

Hardware efficiency with targeted FPGA

The hardware delay consists of two components of propagation delay: logic delay and routing delay. The logic delay is a function of the number and kind of logic gates the signal passes through. Because the FPGA compiler tries to cluster the components of a combinatorial path as tightly as possible on the FPGA, the routing delay, which is a function of the length of the wire path the signal travels, is often modest. In the simulation, the total path delay is 37.091 ns, of which 89.00% comes from logic and 11.00% from routing, which helps to maintain the FPGA efficiency above 90.00% in most cases.

6 Conclusions

ANNs are known for their high degree of connectedness and massive data volumes. For the realization of single-layer networks, neuron-level parallelism is more effective. ANNs are intrinsically distributed in both memory and computational logic, suggesting that the implementation should be done directly in hardware, which allows significant benefits as network sizes grow. The scalable chip design of the 8-input ANN and the 64-input ANN is performed successfully in Xilinx ISE 14.7. The Modelsim simulation is verified under different test cases, and the hardware parameters are extracted from the targeted Virtex-5 FPGA device. The numbers of multipliers/adders for ANN-8, ANN-16, ANN-24, ANN-32, ANN-40, ANN-48, ANN-56, and ANN-64 are 8, 16, 24, 32, 40, 48, 56, and 64 respectively. The numbers of slices for ANN-8, ANN-16, ANN-24, ANN-32, ANN-40, ANN-48, ANN-56, and ANN-64 are 379, 726, 1073, 1420, 1766, 2113, 2460, and 2807 respectively. The numbers of LUTs for ANN-8, ANN-16, ANN-24, ANN-32, ANN-40, ANN-48, ANN-56, and ANN-64 are 648, 1280, 1928, 2560, 3208, 3840, 4488, and 5120 respectively. In the same way, the reported numbers of IOBs are 147, 275, 403, 531, 657, 787, 915, and 1043 for ANN-8 to ANN-64 respectively. The combinational path delay is 37.091 ns, common to all scalable modules. The hardware efficiency of the design is greater than 90.00%, with MSE = 0.00500 for ANN-64. The hardware usage summary shows that the ANN chip hardware utilization increases with the ANN cluster size; the memory also increases from 116,736 kB to 165,892 kB. The chip hardware requirements will certainly increase with the number of neuron inputs. The biggest challenge is to develop an embedded chip that is compatible with the specific target hardware. The limitation of the work is that the chip design supports 64-neuron ANN processing hardware and the chip functionality is verified only on a Virtex-5 FPGA; therefore, the device resource utilization and timing parameters will change on other FPGA series. The design can be extended further for large-scale ANNs using pipelined and parallel processing that exploits the maximum hardware resource count and combinational blocks on the targeted FPGA. In this research work, we have followed the concepts of scalable computing and modular design, which can be used to support the design and development of large-scale neuromorphic embedded chips. In the future, the research can be focused on the hardware chip design and synthesis of multilayer neural network architectures.