Software Architecture Course Homepage

E0 257 - Software Architecture

Department of Computer Science and Automation
Indian Institute of Science
Bangalore - 560 012
INDIA

Source Code Search Engine, M. Bharat Kumar

High level usecases:

There could be two kinds of usecases.
If you consider only the user as an external entity to the system, there's just one usecase.

a) Find source code.

If you also consider that the crawler is external to the search engine system, then there is one more usecase:

b) Feed source code.

Why is this a good decision? Is there a good intuition in making the crawler an external party?
The crawler, most importantly, abstracts out the aspect of external source code being fed into the system. If that is not captured in a use case, the system may never 'get' any source code for the user to search. Hence the crawler appears to be a good choice as an actor. Please note that the crawler may still get refined into another subsystem soon.

Let us now look at each usecase in detail.

a) Find source code:

What does this entail?
This usecase means, "The user wants to look up source code from the web that has the features he is looking for". An immediate question is:

What does a user look for in source code? How is this specified?

An answer will be, "It may be some 'search-specification'"

This search specification is understood by your system and it has to return results which have to be presented to the user.

The obvious flow is, "Search-Specification" --> "Understand search-specification" --> "Perform searching" --> "Present results".

All these form the sequence of actions that this usecase requires. Please note that we don't have any notion of classes yet. We hope to get them by asking more questions:

1) So, the user wants to specify the search specification. Who collects the specification? Let us introduce an entity called "Search-specification-interface" or just "SearchInterface". The function of this entity is to provide an external contact point and let the user specify his search requirements. This search interface may contact other entities, collect the results, and pass them back to the user.

2) Whom does this SearchInterface contact? It has to send the search-specification to some person who can understand it. A search specification is a query, so let's call this person a QueryProcessor. The QueryProcessor looks at the query and possibly splits it into smaller queries. He employs a simpler person who can be asked to give answers for discrete, simple searches.

3) This person is the "SearchEngine".

4) The search engine does various things, but essentially works on a SourceCodeBase to search.

This seems intuitive. The SearchInterface surely is a Boundary class, the QueryProcessor and SearchEngine are Control classes, and the SourceCodeBase is the Entity class. But what is passed around?

Do we have a case to also look at the SearchSpecification itself as a class? Perhaps the results also?

We can have classes Query and Result as well, and put all the functionality associated with them (marshalling, unmarshalling, representation etc.) into them. This is like saying: let the Query and Result handle their issues by themselves.

Thus, we have the following model:

Query, Result are entity classes that are passed around, SearchInterface is the boundary class, QueryProcessor and SearchEngine are control classes, and SourceCodeBase is again an Entity class.
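
A minimal sketch of this model in Python may make the flow concrete. Only the class names come from the discussion above; the method names, the in-memory dict standing in for the SourceCodeBase, and the rule that every term of the query must occur in a file are assumptions made purely for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Query:
    """Entity: the search specification that is passed around."""
    text: str


@dataclass
class Result:
    """Entity: the matches found for one Query."""
    query: Query
    matches: list = field(default_factory=list)


class SourceCodeBase:
    """Entity: the stored source code (a hypothetical in-memory dict here)."""

    def __init__(self, files):
        self.files = files  # maps file name -> source text

    def files_containing(self, term):
        return sorted(name for name, text in self.files.items() if term in text)


class SearchEngine:
    """Control: answers one discrete, simple search on the code base."""

    def __init__(self, code_base):
        self.code_base = code_base

    def search(self, term):
        return self.code_base.files_containing(term)


class QueryProcessor:
    """Control: understands the Query and splits it into simple searches."""

    def __init__(self, engine):
        self.engine = engine

    def process(self, query):
        # Assumption: every whitespace-separated term must occur in a file.
        hits = [set(self.engine.search(term)) for term in query.text.split()]
        common = set.intersection(*hits) if hits else set()
        return Result(query, sorted(common))


class SearchInterface:
    """Boundary: the external contact point for the user."""

    def __init__(self, processor):
        self.processor = processor

    def find(self, text):
        return self.processor.process(Query(text))


code_base = SourceCodeBase({
    "a.cpp": "virtual void f() = 0;",
    "b.java": "static int n;",
})
interface = SearchInterface(QueryProcessor(SearchEngine(code_base)))
print(interface.find("virtual void").matches)  # prints ['a.cpp']
```

Note that each class only knows its immediate neighbour, and only Query and Result objects cross the boundaries between them.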

There are some advantages in using Query and Result as classes. Each of the control classes can operate on them and pass them to one another, so they also work as an excellent protocol. The Query is a preliminary thing at the first stage; the QueryProcessor perhaps analyses the query and builds a tree, and then the SearchEngine runs an iterator over the tree and does the searching. Each of these control classes simply accepts a Query object and adds more to it during its processing. The same applies to the Result class too. This abstraction also implies that if we have a QueryCache, then all we need to do is build an equivalence routine between Queries, something like an .isEquals() method, and we can find out whether queries that differ syntactically are essentially the same. A Query can also be a composite of queries, so we can reuse search results.
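
The equivalence idea can be sketched as follows. This is only an illustration: the rule used for "essentially the same" (terms joined by AND, ignoring case, order and extra spaces) is an assumption, and Python's __eq__ and __hash__ play the role of the .isEquals() method mentioned above.

```python
class Query:
    """A query over AND-combined terms, compared by meaning, not spelling."""

    def __init__(self, text):
        self.text = text

    def canonical(self):
        # Assumption: terms are joined by AND; case and order do not matter.
        terms = (term.strip() for term in self.text.lower().split(" and "))
        return frozenset(term for term in terms if term)

    def __eq__(self, other):
        return isinstance(other, Query) and self.canonical() == other.canonical()

    def __hash__(self):
        # Equal queries must hash alike so they land on the same cache entry.
        return hash(self.canonical())


class QueryCache:
    """Maps queries to results; it relies only on Query equality."""

    def __init__(self):
        self._results = {}

    def put(self, query, result):
        self._results[query] = result

    def get(self, query):
        return self._results.get(query)


cache = QueryCache()
cache.put(Query("static AND virtual"), ["a.cpp"])
print(cache.get(Query("Virtual and  static")))  # prints ['a.cpp']
```

The cache never inspects the query text itself, so a smarter equivalence rule can be dropped into Query later without touching QueryCache.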

Now we can also represent this as a Model View Controller.

The Query, the Result, and the SourceCodeBase form the model, and the control classes are part of the controller. SearchInterface forms the view, and we can also have one more class, ResultVisualizer, to display the results. Does that sound good?

At this point the differences from the design presented in the session are:

1) Introduction of classes Query and Result. The advantages of this were just mentioned.

2) Classification of classes in terms of the MVC.

The relationships between these domain classes are also important. SearchInterface is associated with QueryProcessor, QueryProcessor with SearchEngine, and SearchEngine with the SourceCodeBase. SearchInterface, QueryProcessor, and SearchEngine are associated with Query and Result. In fact, Query and Result are part of a ternary relationship between the classes SearchInterface, QueryProcessor, and SearchEngine.

At this stage, we have to come up with a high level architecture of the system. The analysis classes seem to reflect some architectural components. However the classes Query and Result are more like message classes.

A first cut architecture is as follows:

SearchInterface <--> QueryProcessor <--> SearchEngine <--> Database

A <--> implies a connector. This is important since the connector is the place where the network will be introduced. It is intended that the message classes Query and Result pass through the connector.

A connector has roles at either end. The connector used here can be a procedure call connector, which accepts a Query object and returns a Result object. This can be a remote procedure call connector as well! Also, please realize that thousands of searches may be coming in at any time. So another connector that suits well is a message passing connector, which does not wait synchronously for a Result but is instead notified when a Result arrives. This can potentially use the Reactor pattern. Do you see the advantage of using a connector? The component classes do not have to bother about any network or efficiency related issue. The connector can even encapsulate a LoadBalancer and connect to an array of QueryProcessors, SearchEngines etc.
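
Here is a rough sketch of the two connector styles, under simplifying assumptions. The connector class names, the round-robin policy, and the explicit dispatch() call are all hypothetical; dispatch() is only a stand-in for an event loop and is not an implementation of the Reactor pattern, and no real network is involved.

```python
from collections import deque
from itertools import cycle


class QueryProcessor:
    """Hypothetical component; 'name' only identifies the replica."""

    def __init__(self, name):
        self.name = name

    def process(self, query):
        return f"{self.name} handled {query!r}"


class ProcedureCallConnector:
    """Synchronous connector: accepts a query and returns a result.

    It hides a round-robin load balancer over several replicas, so the
    caller cannot tell whether it talks to one processor or to many.
    """

    def __init__(self, replicas):
        self._next_replica = cycle(replicas)

    def call(self, query):
        return next(self._next_replica).process(query)


class MessagePassingConnector:
    """Asynchronous connector: the caller is notified when a result arrives."""

    def __init__(self, inner):
        self._inner = inner
        self._pending = deque()

    def send(self, query, on_result):
        # Returns at once; the query is only queued here.
        self._pending.append((query, on_result))

    def dispatch(self):
        # Stand-in for an event loop: drain the queue and notify callers.
        while self._pending:
            query, on_result = self._pending.popleft()
            on_result(self._inner.call(query))


sync = ProcedureCallConnector([QueryProcessor("qp1"), QueryProcessor("qp2")])
print(sync.call("first"))   # served by qp1
print(sync.call("second"))  # served by qp2

mailbox = MessagePassingConnector(sync)
mailbox.send("third", print)  # nothing is printed yet
mailbox.dispatch()            # now the callback fires
```

The caller's code is the same whether there is one QueryProcessor or an array of them, which is exactly the point of pushing such concerns into the connector.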

Next, take up each component. They can be refined. Each component can again be visualized as an MVC. For instance, the SearchInterface's primary objective is to allow the construction of a Query object, and to help visualize the Result object (when it gets it). Query and Result form the model. The view will be the HTML page, a SOAP interface etc. The controller is the set of classes that help in the construction of the model, or do some processing on it. For instance, the servlet that accepts the HTML query page and constructs the Query object will be part of the controller.

At this level it is very important to invest in sequence diagrams, since the interactions and relationships between modules have become quite important. You also need to draw a class diagram just for this component and show the relationships between the classes: how are the classes related to each other inside this package or component? Also, provide certain wrapper classes that are explicitly meant for interaction with other components or connectors. These are referred to as ports.
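
As an illustration, the inside of the SearchInterface component could be sketched like this. The class names QueryController, HtmlResultView and SearchPort, the form field names, and the use of a plain callable as the connector are all assumptions; a real system would use a servlet or similar where this sketch uses a dict of form fields.

```python
from dataclasses import dataclass, field
from html import escape


@dataclass(frozen=True)
class Query:  # model
    language: str
    expression: str


@dataclass
class Result:  # model
    query: Query
    files: list = field(default_factory=list)


class QueryController:
    """Controller: turns submitted form fields into a Query (the servlet's job)."""

    def build_query(self, form):
        expression = form.get("expression", "").strip()
        if not expression:
            raise ValueError("a search expression is required")
        return Query(form.get("language", "any").lower(), expression)


class HtmlResultView:
    """View: renders a Result as an HTML fragment."""

    def render(self, result):
        items = "".join(f"<li>{escape(name)}</li>" for name in result.files)
        return f"<h1>{escape(result.query.expression)}</h1><ul>{items}</ul>"


class SearchPort:
    """Port: the only class here that talks to the outgoing connector."""

    def __init__(self, connector):
        self._connector = connector  # any callable taking a Query, returning a Result

    def submit(self, query):
        return self._connector(query)


# A fake connector stands in for the rest of the system.
port = SearchPort(lambda query: Result(query, ["Shape.h"]))
form = {"language": "C++", "expression": "class HAS-A (method IS-A PureVirtual)"}
query = QueryController().build_query(form)
print(HtmlResultView().render(port.submit(query)))
```

Because only SearchPort touches the connector, the view and the controller can be tested with a fake connector, as done in the last lines.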

The other components can also have class diagrams of their own. The connectors too will have class diagrams of their own! The class diagrams of connectors can even be disjoint, since they need to transfer messages over the network.

Summarizing what happened at this stage:

1) The high level analysis provided the large classes which appeared to become components in my architecture.

2) Some of the model classes also served as excellent Message classes.

3) The use of connectors has helped us to abstract out several architectural issues, such as:

a) distribution (the components can be on different machines)

b) load balancing (the connector can take up the job of performing load balancing and routing requests to components that are most free)

c) efficiency aspects in remote transfer (you could use plain RPC, or a message-oriented-middleware or even an Object RPC).

4) Each component and connector can be taken up on their own, and we can have class diagrams for them. The MVC can be applied recursively!

5) Once you have divided the problem into so many sub problems, you are at so much more ease to just solve simple problems.

You should have realized by now that nothing much appears to be object-oriented yet! For one, it shows that elegant designs (if I can call this one elegant :) ) need not be object-oriented at all.

Object-oriented is different from object-based. You need to add inheritance and polymorphism to turn an object-based design into an object-oriented one. Clearly inheritance and polymorphism are issues that come up at the detailed level, since they are mostly about code elegance and reuse. To see the case for object-orientation, read the following paragraphs.

This search engine is supposed to handle searches for Java, C++, Visual Basic, Perl, C# etc. At all stages, i.e. at the Query specification stage, the QueryProcessing stage, and the searching stage, language specific issues will come up. The search "I want source code with a class that has a pure virtual function" is irrelevant in all languages other than C++, since they don't offer any such construct. Assuming that what I just said in natural language can be specified formally in a machine readable form, like "class HAS-A (method IS-A PureVirtual)", the way the Query tree is constructed at the QueryProcessor depends on language specific issues. However, there are advantages in having an abstract form for Query trees. So we could have a class structure where a query expression is parsed by different query parsers (depending on the language), but they all build a query tree that is generic. Besides, some aspects of query parsing may be the same across languages, e.g. the expression syntax that uses AND, OR. So, construct a base class called QueryParser, and specialize it into CPPQueryParser, JavaQueryParser and so on for the language specific functions. Inheritance allows for very good reuse here. A CPPQueryParser can parse AND, OR expressions as well, since it inherits from QueryParser. Similarly, the CPPQueryParser parses the Query and returns a QueryTree which is generic, so that we can handle the Queries (and transport them) in a generic fashion.
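
A small sketch of such a parser hierarchy follows. QueryParser, CPPQueryParser, JavaQueryParser and QueryTree are the names used above; everything else is an assumption made for the example: the sets of known constructs, the rule that the last word of a predicate names the construct, and the simplification that AND and OR never appear inside parentheses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryTree:
    """Generic tree shared by all languages; op is 'AND', 'OR' or 'LEAF'."""
    op: str
    children: tuple = ()
    predicate: str = ""


class QueryParser:
    """Base class: owns the AND/OR syntax that every language shares."""

    language = "generic"
    known_kinds = frozenset()  # constructs this language understands

    def parse(self, text):
        # OR is split first so that it binds more loosely than AND.
        # Assumption: AND and OR never occur inside parentheses.
        for op in ("OR", "AND"):
            parts = text.split(f" {op} ")
            if len(parts) > 1:
                return QueryTree(op, tuple(self.parse(part) for part in parts))
        return self.parse_predicate(text.strip())

    def parse_predicate(self, text):
        # Assumption: the last word of a predicate names the construct.
        words = text.rstrip(")").split()
        if not words:
            raise ValueError("empty predicate")
        if words[-1] not in self.known_kinds:
            raise ValueError(f"{words[-1]!r} is not a {self.language} construct")
        return QueryTree("LEAF", predicate=text)


class CPPQueryParser(QueryParser):
    language = "C++"
    known_kinds = frozenset({"PureVirtual", "static", "template"})


class JavaQueryParser(QueryParser):
    language = "Java"
    known_kinds = frozenset({"abstract", "static", "interface"})


text = "class HAS-A (method IS-A PureVirtual) AND class HAS-A (member IS-A static)"
tree = CPPQueryParser().parse(text)
print(tree.op, len(tree.children))  # prints AND 2
```

Both subclasses inherit the AND/OR handling untouched and differ only in what they accept as a construct, while the tree they return carries no language specific type. Asking the JavaQueryParser to parse the same text raises a ValueError for PureVirtual.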

At the next stage, the QueryProcessor would divide the Query object into many sub queries, so as to reuse results. Thus, the Query "class HAS-A (method IS-A PureVirtual) and class HAS-A (member IS-A static)" could be split into two Queries. So if there is a cache of classes with pure virtual functions and another cache of classes with static members, then they can be used straight away. The notion of a Query being a composite, with each Query taking care of finding itself (either from a cache or from the database), helps us to achieve this.
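
The composite idea can be sketched as below. The class names AndQuery and CountingDatabase, the find() method, and the use of a plain dict as the cache are hypothetical; the database is a stand-in that merely counts how often it is asked.

```python
class Query:
    """Leaf query: one predicate that finds itself in the cache or the database."""

    def __init__(self, predicate):
        self.predicate = predicate

    def find(self, cache, database):
        if self.predicate not in cache:
            cache[self.predicate] = database.lookup(self.predicate)
        return cache[self.predicate]


class AndQuery:
    """Composite query: the intersection of the results of its sub-queries."""

    def __init__(self, *parts):
        self.parts = parts  # assumption: at least one sub-query

    def find(self, cache, database):
        results = [part.find(cache, database) for part in self.parts]
        return frozenset.intersection(*results)


class CountingDatabase:
    """Hypothetical stand-in for the SourceCodeBase that counts lookups."""

    def __init__(self, index):
        self.index = index  # maps predicate -> names of matching classes
        self.lookups = 0

    def lookup(self, predicate):
        self.lookups += 1
        return frozenset(self.index.get(predicate, ()))


database = CountingDatabase({
    "method IS-A PureVirtual": {"Shape", "Stream"},
    "member IS-A static": {"Shape", "Config"},
})
cache = {}
both = AndQuery(Query("method IS-A PureVirtual"), Query("member IS-A static"))
print(sorted(both.find(cache, database)))  # prints ['Shape']
print(database.lookups)                    # prints 2

# A later query for only the static members is served from the cache.
Query("member IS-A static").find(cache, database)
print(database.lookups)                    # still prints 2
```

The composite and the leaf expose the same find() operation, so callers need not know whether a Query is simple or composed of sub queries.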

Next, when we do build the QueryCache, we want to use a Hashtable which stores a map of the queries to the results. This hash map does not care what the query is. It is just a map. So there is no need to introduce language specific Query issues into the QueryCache. So, the abstraction of Query did help.

If you observe what is happening, wherever there is an interplay between a need to handle things in a generic fashion and a need to handle things specifically, object-orientation (and in particular inheritance) helps! These are also the contexts where design patterns, especially the Gamma et al. design patterns, are most applicable.

At this stage I will handwave and say that you can apply the same kind of analysis at each component and connector and you will end with a family of class diagrams and sequence diagrams.

The same analysis can be applied to the other use case as well, i.e. crawling.

Comments, discussion welcome (if you are still awake)

Modified by: Sandhya G & Shijesta Victor on 03.11.2003