Version 3 (modified by jcnelson, 6 years ago)
Syndicate is a scalable distributed filesystem built around the CoBlitz CDN. It lets users quickly and efficiently publish and read files across the Internet through a filesystem interface. A mounted filesystem subscribes to one or more metadata servers, which in turn crawl content servers across the Internet and map the URLs of the files they find into directory hierarchies to be mounted. Read operations on a file are translated into HTTP GET requests to CoBlitz, which causes CoBlitz to cache and replicate the file data for other mounted filesystems to access.
There are four principal components to Syndicate: the client, the metadata server, the content server, and the gateway server.
(TODO: insert diagram here)
The client is a FUSE filesystem that a user mounts locally. It subscribes to one or more metadata servers, from which it retrieves the metadata it uses to construct the filesystem hierarchy seen by the user. Each file in the client is a stub: when an application opens and reads the file, the FUSE module pulls the requested data into CoBlitz and streams it back to the application via the read() call. When an application writes to a file, the written data is stored locally on the underlying filesystem, and subsequent I/O to the file is directed to the local data. Periodically, the client polls the metadata server for metadata updates, which it then merges into the directory hierarchy. New files discovered by the metadata server become visible to the client, and files that can no longer be accessed disappear (unless they have local changes).
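The merge step described above can be sketched roughly as follows. This is a minimal illustration, not Syndicate's actual code: the `Entry` record, its fields, and the `merge_update` function are hypothetical names, and the sketch assumes that a locally modified file always takes precedence over the server's view of the same path.

```python
from __future__ import annotations
from dataclasses import dataclass

@dataclass
class Entry:
    """Hypothetical client-side record for one file stub."""
    url: str                   # where the file's data can be fetched
    mtime: int                 # modification time reported by the metadata server
    local_dirty: bool = False  # True if the user has written to the file locally

def merge_update(local: dict[str, Entry], remote: dict[str, Entry]) -> dict[str, Entry]:
    """Merge a polled metadata snapshot into the client's directory hierarchy."""
    merged: dict[str, Entry] = {}
    for path, entry in remote.items():
        old = local.get(path)
        # Assumption: local changes shadow whatever the server reports.
        merged[path] = old if (old is not None and old.local_dirty) else entry
    for path, old in local.items():
        # Files missing from the snapshot disappear unless changed locally.
        if path not in remote and old.local_dirty:
            merged[path] = old
    return merged
```

Here a path that vanished from the server survives only if it has local writes, and a new path in the snapshot becomes visible immediately.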
(TODO: insert diagram here)
A metadata server exists in three parts:
- a daemon which crawls content servers for files and assembles their URLs into a directory hierarchy
- an HTTP server with a specially-crafted CGI program that handles HTTP GET requests for metadata
- command-line tools for metadata server users to manipulate the metadata
The daemon maintains a local directory tree called the master copy, which has the same directory structure that the client will see when it polls the server for metadata. However, the file stubs in the master copy (a.k.a. master copy entries) store the metadata needed by a client to correctly represent a file. The daemon periodically walks its master copy to validate each master copy entry, i.e. to make sure that the content represented by the entry is still available and has not changed since it was indexed. The second check is necessary because a URL to data in CoBlitz must refer to at most one version of a file. Master copy entries that are no longer valid are removed.
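As a rough sketch of the validation walk, the function below keeps only the entries whose content is still reachable and unchanged. The `Stat` tuple, the `probe` callback, and the idea of comparing modification time and size are assumptions made for illustration; the source does not say which attributes the daemon actually compares.

```python
from __future__ import annotations
from typing import Callable, NamedTuple, Optional

class Stat(NamedTuple):
    """Hypothetical fingerprint recorded when a file was indexed."""
    mtime: int
    size: int

def validate_master_copy(
    master: dict[str, tuple[str, Stat]],
    probe: Callable[[str], Optional[Stat]],
) -> dict[str, tuple[str, Stat]]:
    """Return the master copy with invalid entries removed.

    `probe(url)` is assumed to return the content's current Stat,
    or None if the content can no longer be reached.
    """
    return {
        path: (url, indexed)
        for path, (url, indexed) in master.items()
        # None (unreachable) or a different Stat (changed) both fail this test.
        if probe(url) == indexed
    }
```

In a real daemon `probe` might issue an HTTP HEAD request to the content server; passing it in as a callback keeps the sketch independent of any particular transport.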
Additionally, the daemon may read zero or more sitemaps from content servers, as well as publicly available files on the local host. It adds a master copy entry for each file from these sources that is not yet represented. If it detects that multiple URLs map to the same path in the master copy, it selects the URL referring to the content that has changed most recently.
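The conflict rule above amounts to "most recently changed wins" per path. A minimal sketch, assuming each crawled candidate arrives as a hypothetical `(path, url, mtime)` tuple and that the master copy can be modeled as a dictionary from path to `(url, mtime)`:

```python
from __future__ import annotations
from typing import Iterable

def add_candidates(
    master: dict[str, tuple[str, int]],
    candidates: Iterable[tuple[str, str, int]],
) -> dict[str, tuple[str, int]]:
    """Add crawled files to the master copy, resolving path collisions."""
    for path, url, mtime in candidates:
        current = master.get(path)
        # Add unseen paths; on a collision keep the most recently changed content.
        if current is None or mtime > current[1]:
            master[path] = (url, mtime)
    return master
```

Ties are resolved in favor of the entry already present, which is an arbitrary choice made for this sketch.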
Content servers are normal HTTP servers. Files and forms publicly accessible on them may be crawled by metadata servers and downloaded by clients.
Because CoBlitz expects that each URL refers to at most one version of a file, Syndicate ships with content publishing tools that allow content server users to generate unique URLs for each version of each file they publish, as well as generate sitemaps of their content.
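One way to obtain a distinct URL per file version is to embed a digest of the content in the URL. The sketch below illustrates that idea only; it is not the scheme used by Syndicate's publishing tools, and `versioned_url` and its parameters are hypothetical names.

```python
import hashlib
from urllib.parse import quote

def versioned_url(base_url: str, path: str, data: bytes) -> str:
    """Build a URL that changes whenever the file's content changes."""
    # A truncated SHA-256 digest of the content serves as the version tag.
    digest = hashlib.sha256(data).hexdigest()[:16]
    # quote() percent-encodes unsafe characters but leaves '/' intact.
    return f"{base_url.rstrip('/')}/{quote(path.lstrip('/'))}.{digest}"
```

Because the tag is derived from the content, republishing an unchanged file yields the same URL, while any edit yields a new one, so a cached URL never refers to more than one version.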
Gateway servers are a special type of content server that allows Syndicate metadata servers and clients to index and download content from non-HTTP data sources. A gateway server is implemented as an HTTP server that handles HTTP GET and HTTP PUT requests with a specially-crafted CGI program, which translates each HTTP request into a form the non-HTTP data source understands. Syndicate will ship with an Amazon S3 gateway and an iRODS gateway.
"Web browser" Profile
In this profile, there are many content servers, one metadata server, and many clients. Clients are distinct from content servers, and they all use the same metadata server. This configuration provides filesystem semantics similar to those of the client/server architecture of web browsers and web servers: web browsers download data from web servers and may allow the user to change the data locally, but they cannot republish the same data without explicit support from the web server. Similarly, clients in this configuration can read globally and write locally, but can write globally only out-of-band.
"Distributed Dropbox" Profile