Version 2 (modified by jcnelson, 3 years ago)

Architecture

There are four principal components to Syndicate: the client, the metadata server, the content server, and the gateway server.

(TODO: insert diagram here)

Clients

The client exists as a FUSE filesystem that a user mounts locally. It subscribes to one or more metadata servers, from which it retrieves the metadata it uses to construct the filesystem hierarchy seen by the user. Each file in the client is a stub: when an application opens and reads the file, the FUSE module pulls the requested data into CoBlitz and streams it back to the application via the read() call. When an application writes to a file, the written data is stored locally in the underlying filesystem, and subsequent I/O to the file is directed to the local data. Periodically, the client polls the metadata server for metadata updates, which it then merges into the directory hierarchy. Files newly discovered by the metadata server become visible to the client, and files that can no longer be accessed disappear (unless they have local changes).
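The merge step at the end of each poll can be sketched as follows. This is only an illustration of the behavior described above, not Syndicate's actual code: the function name, the dictionary-based view, and the set of locally modified paths are all assumptions made for the example.

```python
def merge_metadata(local_view, remote_entries, locally_modified):
    """Return the directory view after merging one metadata poll.

    local_view       -- dict mapping path -> URL currently shown to the user
    remote_entries   -- dict mapping path -> URL reported by the metadata server
    locally_modified -- set of paths that have local writes
    (All three structures are hypothetical, chosen for illustration.)
    """
    merged = {}

    # Files with local changes stay visible, whatever the server reports.
    for path in locally_modified:
        if path in local_view:
            merged[path] = local_view[path]

    # Everything else mirrors the server: newly discovered files appear,
    # and files the server no longer lists drop out of the view.
    for path, url in remote_entries.items():
        if path not in merged:
            merged[path] = url

    return merged
```

In this sketch a file that the server has dropped survives only if its path is in locally_modified, which mirrors the "unless they have local changes" rule.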

Metadata Servers

(TODO: insert diagram here)

A metadata server exists in three parts:

  • a daemon which crawls content servers for files and assembles their URLs into a directory hierarchy
  • an HTTP server with a specially-crafted CGI program that handles HTTP GET requests for metadata
  • command-line tools for metadata server users to manipulate the metadata

The daemon maintains a local directory tree called the master copy, which has the same directory structure that the client will see when it polls the server for metadata. Rather than file contents, the file stubs in the master copy (a.k.a. master copy entries) store the metadata needed by a client to correctly represent a file. The daemon periodically walks its master copy to validate each master copy entry, i.e. to make sure that the content represented by the entry is still available and has not changed since it was indexed. The latter check is necessary because a URL to data in CoBlitz must refer to at most one version of a file. Master copy entries that are no longer valid are removed.
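A minimal sketch of the validation walk is shown below. It assumes, purely for illustration, that each master copy entry records the URL and the modification time seen at indexing, and that a probe callable returns the content's current modification time (or None if it is unavailable). None of these names come from Syndicate itself.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass(frozen=True)
class MasterCopyEntry:
    """Hypothetical master copy entry: what was recorded at indexing time."""
    url: str
    last_modified: float


def validate_master_copy(
    entries: Dict[str, MasterCopyEntry],
    probe: Callable[[str], Optional[float]],
) -> Dict[str, MasterCopyEntry]:
    """Return only the entries whose content is still available and unchanged.

    probe(url) is an assumed helper that returns the content's current
    modification time, or None when the content cannot be reached.
    """
    valid = {}
    for path, entry in entries.items():
        current = probe(entry.url)
        if current is None:
            continue  # content is no longer available
        if current != entry.last_modified:
            continue  # content changed since it was indexed
        valid[path] = entry
    return valid
```

Passing the probe in as a parameter keeps the walk independent of how availability is actually checked, for example with an HTTP HEAD request.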

Additionally, the daemon may read zero or more sitemaps from content servers, as well as publicly available files on the local host. It adds a master copy entry for each file from these sources that is not yet represented. If it detects that multiple URLs map to the same path in the master copy, it selects the URL referring to the content that has changed most recently.
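The conflict rule can be illustrated with a short sketch. It assumes each candidate is a (path, URL, modification time) tuple. That layout and the function name are invented for the example and are not taken from Syndicate.

```python
def select_urls(candidates):
    """Pick one URL per master copy path, preferring the newest content.

    candidates -- iterable of (path, url, last_modified) tuples, a
                  hypothetical shape for what the daemon has discovered.
    """
    chosen = {}
    for path, url, last_modified in candidates:
        # Keep the candidate whose content changed most recently.
        if path not in chosen or last_modified > chosen[path][1]:
            chosen[path] = (url, last_modified)
    return {path: url for path, (url, _) in chosen.items()}
```

When two candidates have the same modification time, this sketch keeps the one seen first. The source does not say how such ties are resolved.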

Content Servers

Content servers are normal HTTP servers. Files and forms publicly accessible on them may be crawled by metadata servers and downloaded by clients.

Because CoBlitz expects that each URL refers to at most one version of a file, Syndicate ships with content publishing tools that allow content server users to generate unique URLs for each version of each file they publish, as well as generate sitemaps of their content.
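One way to obtain a distinct URL for each version is to embed a digest of the file's content in the URL. The sketch below takes that approach as an assumption. The source does not describe how the Syndicate publishing tools derive their URLs, and the function name and URL layout here are hypothetical.

```python
import hashlib


def versioned_url(base_url: str, path: str, content: bytes) -> str:
    """Build a URL that changes whenever the file's content changes.

    The digest suffix is an illustrative scheme, not Syndicate's own.
    """
    # A truncated SHA-256 digest identifies this version of the content.
    digest = hashlib.sha256(content).hexdigest()[:16]
    return f"{base_url.rstrip('/')}/{path.lstrip('/')}.{digest}"
```

Because the digest depends only on the content, republishing an unchanged file yields the same URL, while any edit yields a new one, so a URL never refers to more than one version.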

Gateway Servers

Gateway servers are a special type of content server that allows Syndicate metadata servers and clients to index and download content from non-HTTP data sources. In implementation, a gateway server is an HTTP server that handles HTTP GET and HTTP PUT requests with a specially-crafted CGI program, which translates the HTTP requests into a form the non-HTTP data source understands. Syndicate will ship with an Amazon S3 gateway and an iRODS gateway.
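The translation a gateway performs can be sketched as a small dispatch function. The in-memory class below merely stands in for a non-HTTP data source such as S3 or iRODS. Its fetch/store interface, the function name, and the status codes chosen are assumptions for illustration, not the interface of the real gateways.

```python
class InMemorySource:
    """Hypothetical stand-in for a non-HTTP data source."""

    def __init__(self):
        self._objects = {}

    def fetch(self, key):
        # Return the stored bytes, or None if the key is unknown.
        return self._objects.get(key)

    def store(self, key, data):
        self._objects[key] = data


def handle_request(method, path, body, source):
    """Translate one HTTP request into an operation on the data source.

    Returns a (status_code, response_body) pair.
    """
    key = path.lstrip("/")  # map the URL path to a key in the data source
    if method == "GET":
        data = source.fetch(key)
        if data is None:
            return 404, b""
        return 200, data
    if method == "PUT":
        source.store(key, body)
        return 201, b""
    return 405, b""  # only GET and PUT are handled
```

A real gateway would replace InMemorySource with a client for the actual data source, keeping the HTTP-facing logic unchanged.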
