I need to be able to generate random graphs (both general and bipartite). I require that the generated graph have a specified number of vertices and edges (in the bipartite case, the number of vertices on each side is specified). My particular problem is that I also require that the graph be connected. I have found no research on this problem. Generating all graphs satisfying the specifications and then selecting from this set at random is not feasible for space reasons (I also don't know how to do this).
My plan is to generate a graph at random, keeping track of the number of components, until I only have d edges left to add and the generated graph so far has (d + 1) components. Then I add the remaining edges, ensuring that each edge added joins two unconnected components of the graph. The difficulty is that now, each time I generate a new edge at random, I must discard it if both of its endpoints are in the same component of the graph. I have no idea how many edges I can expect to discard by this approach, and I have not found anything in the literature on my approach or any other approach.
Is there a result that states how many edges I need to generate before the graph becomes connected?
Is there another approach that guarantees termination in a reasonable time (O(n log n) or better)?
I rarely visit here so emailing me at rpboland@gmail.com might be useful.

Is there a result that states how many edges I need to generate before the graph becomes connected?

Well... one hint to this might be found in the definition of a spanning tree. The only question remaining is: how many edges does a tree with N vertices have? (This is a pretty elementary question, but if you don't know the answer, just grab a pen and paper, draw some trees, and count the vertices vs. edges in them; you should quickly be able to figure out the pattern.)

Is there another approach that guarantees termination in a reasonable time (O(n log n) or better)?

If you have no particular restrictions (e.g., about which vertex to connect to which), then you can do this in linear time, with no extra storage. Here is a hint: what if all the vertices formed one long single branch... how would you create those edges?

Allow me to clarify:
At the point in question in my algorithm, the graph has d components and (d - 1) edges are to be added so that only one component remains. The graph is being generated by selecting edges to add at random, and as a result many of the edges generated do not connect two components and thus must be discarded.

How many edges are expected to be generated, including discarded edges, before only one component remains?

Also, is there a method of generating random graphs (or bipartite graphs) that have a specified number of vertices and edges and are connected?

The method I described is not truly random, but I consider it good enough for my purposes.

The graph is being generated by selecting edges to add at random, and as a result many of the edges generated do not connect two components and thus must be discarded.

I guess the problem lies in your conception of "randomly". This is an extremely vague term. I assume you mean that you would generate an edge by picking a source vertex and a target vertex, each through uniform sampling of all vertices of the graph, and then reject it if it doesn't connect two components. This is pretty much the worst way of solving this problem, and I don't understand why you would do that. If that's not what you mean, then you must clarify further.

How many edges are expected to be generated, including discarded edges, before only one component remains?

I don't really want to answer this question, because it smells like a homework question. When you want to figure out the complexity or expected run-time cost of the worst possible strategy, it's usually a homework question, because who else would care about knowing this?

Also, there are a number of additional things that are needed to really be able to answer this question, especially the distribution of vertices between components (which I guess you could assume to be N/d on average, but that's kind of meaningless).

And by the way, the worst case is infinity, because it's a probabilistic algorithm, in case you didn't know.

Also, is there a method of generating random graphs (or bipartite graphs) that have a specified number of vertices and edges and are connected?

Here's a simple algorithm for N vertices and M edges: generate N unconnected vertices (making N components); generate N-1 edges to connect them all; and generate M-N+1 more edges over any vertices. This can be done in O(N+M) time and O(1) auxiliary memory.

How is that algorithm not sufficient?


The graph is being generated by selecting edges to add at random, and as a result many of the edges generated do not connect two components and thus must be discarded.

I guess the problem lies in your conception of "randomly". This is an extremely vague term. I assume you mean that you would generate an edge by picking a source vertex and a target vertex, each through uniform sampling of all vertices of the graph, and then reject it if it doesn't connect two components. This is pretty much the worst way of solving this problem, and I don't understand why you would do that. If that's not what you mean, then you must clarify further.

It is exactly what I mean. I don't see a better way to solve this problem, despite your assurances that it is a bad way to solve it.

How many edges are expected to be generated, including discarded edges, before only one component remains?

I don't really want to answer this question, because it smells like a homework question. When you want to figure out the complexity or expected run-time cost of the worst possible strategy, it's usually a homework question, because who else would care about knowing this?

It's not homework. I will be implementing it because I need randomly generated connected graphs for a project I am working on.
My project is in Smalltalk but I may also release it in another language such as C or D. The project is open source so you may pick up a copy when it's done if you like.

Also, there are a number of additional things that are needed to really be able to answer this question, especially the distribution of vertices between components (which I guess you could assume to be N/d on average, but that's kind of meaningless).

I mean that I randomly select an edge to add to the graph, ignoring whether it connects two components, then test whether it connects two components and reject the edge if it does not.

And by the way, the worst case is infinity, because it's a probabilistic algorithm, in case you didn't know.

I do know. That's why I don't like my algorithm and would prefer one with a specified and reasonable complexity. I am stuck with my algorithm because I know of no other.

Also, is there a method of generating random graphs (or bipartite graphs) that have a specified number of vertices and edges and are connected?

Here's a simple algorithm for N vertices and M edges: generate N unconnected vertices (making N components); generate N-1 edges to connect them all; and, ...

I assume what you mean here is:

repeat
    1) Select a vertex v randomly and determine which component it is in.
    2) Select a vertex w randomly from the vertices of the remaining components.
    3) Add edge vw and merge the components containing v and w.
until only one component remains


This looks promising.

... generate M-N+1 more edges over any vertices.
This can be done in O(N+M) time and O(1) auxiliary memory.

The best I can do for merging components is the union-find algorithm, which is O((N+M) alpha(N)) where alpha is the inverse Ackermann function. The auxiliary memory is O(N).
I don't see how to select w in less than O(N) time. That is very expensive, so I would be inclined to randomize by selecting w at random from the complete set of vertices and rejecting it if it is in the same component as v (union-find can also be used to determine the component containing vertex w).
This approach has all the same problems as my algorithm, though to a lesser degree, so it is arguably an improvement to select vertices (w) at random rather than edges.
In terms of generating random connected graphs, I think there is a larger degradation of randomness than with my algorithm (randomly selecting edges), but the difference here is hard to quantify.
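
For what it's worth, here is a minimal union-find sketch of what I have in mind (union by size plus path compression, which is where the alpha(N) factor comes from). The names UnionFind, find_root, and unite are illustrative, not from any library:

#include <cstddef>
#include <utility>
#include <vector>

// Minimal union-find: path compression (halving) plus union by size.
struct UnionFind {
  std::vector<std::size_t> parent, size;

  explicit UnionFind(std::size_t n) : parent(n), size(n, 1) {
    for (std::size_t i = 0; i < n; ++i) parent[i] = i;
  }

  std::size_t find_root(std::size_t v) {
    while (parent[v] != v) {
      parent[v] = parent[parent[v]];  // path halving
      v = parent[v];
    }
    return v;
  }

  // Returns false (and changes nothing) if v and w are already connected.
  bool unite(std::size_t v, std::size_t w) {
    std::size_t rv = find_root(v), rw = find_root(w);
    if (rv == rw) return false;           // same component: reject the edge
    if (size[rv] < size[rw]) std::swap(rv, rw);
    parent[rw] = rv;                      // merge smaller tree into larger
    size[rv] += size[rw];
    return true;
  }
};

Rejecting a randomly generated edge vw then amounts to unite(v, w) returning false.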

How is that algorithm not sufficient?

I don't see your algorithm (as I described it) as particularly better or worse than mine, or all that different for that matter.
One difference is that yours ensures connectedness at the start and mine does so at the end. Actually, in a sense, mine also does so at the start. My algorithm first ensures minimum degree constraints for each vertex. Then it ensures connectedness. Finally, it adds any remaining edges required by the number-of-edges specification for the randomly generated graph. I initially left out some of these details as I didn't consider them relevant.

I think I will try both approaches (connecting the graph by generating random edges and by generating random vertices) and decide by comparing the code.

There is also another way to generate the initial connected graph that your algorithm constructs (a tree). That is to generate a tree of the specified size at random, and I already have code to do this. However, I don't like this approach because of the degradation of the quality of the random connected graphs generated.
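
For reference, here is a minimal sketch of one common linear-time way to generate a random tree (uniform attachment; not necessarily what my existing code does). Note that it samples a random recursive tree, not the uniform distribution over all labeled trees, which is exactly the kind of bias I fear:

#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Generate a random tree on n vertices in O(n) time by attaching each
// vertex i > 0 to a uniformly random earlier vertex. This samples a
// "random recursive tree", not the uniform distribution over all
// labeled trees (Pruefer sequences would give the latter).
std::vector<std::pair<std::size_t, std::size_t>>
random_attachment_tree(std::size_t n, std::mt19937& rng) {
  std::vector<std::pair<std::size_t, std::size_t>> edges;
  edges.reserve(n > 0 ? n - 1 : 0);
  for (std::size_t i = 1; i < n; ++i) {
    std::uniform_int_distribution<std::size_t> pick(0, i - 1);
    edges.emplace_back(pick(rng), i);  // attach i to a random earlier vertex
  }
  return edges;
}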

Ralph

You didn't understand my algorithm at all. I'm working under the assumption that you are generating the vertices at random. If you are starting with a set of vertices that you cannot choose, then the problem is different and my algorithm does not apply directly (but you can fix that by creating a random-shuffle index map to create a randomly shuffled "view" of the vertices you have, which adds O(N) storage of integers, which is not too bad).

Here is the algorithm I was describing (in the style of the C++ Boost.Graph library, which is the only serious graph-theory library in any language, IMHO):

#include <cassert>
#include <cstddef>

#include <boost/graph/graph_traits.hpp>
#include <boost/tuple/tuple.hpp>  // for boost::tie

template <typename Graph, 
          typename RandEngine,
          typename GenVProp, 
          typename GenEProp>
void generate_connected_graph(Graph& g,
                              RandEngine& rng,
                              GenVProp gen_vprop, 
                              GenEProp gen_eprop, 
                              std::size_t N,
                              std::size_t M) {
  typedef typename boost::graph_traits<Graph>::vertex_descriptor VDesc;
  typedef typename boost::graph_traits<Graph>::vertex_iterator VIter;

  // A connected graph on N vertices needs at least N-1 edges:
  assert(M + 1 >= N);

  // Generate N random unconnected vertices:
  for(std::size_t i = 0; i < N; ++i)
    add_vertex(gen_vprop(), g);

  // Generate N-1 edges to connect all vertices (one long chain):
  VIter vi, vi_end;
  boost::tie(vi, vi_end) = vertices(g);
  while(vi + 1 < vi_end) {
    add_edge(*vi, *(vi+1), gen_eprop(), g);
    ++vi;
  }

  // Generate M-N+1 random edges over any vertices
  // (assumes integer vertex descriptors, e.g. vecS storage):
  for(std::size_t i = 0; i < M-N+1; ++i) {
    VDesc u = rng() % N;
    VDesc v = rng() % N;
    while( u == v )  // avoid self-loops
      v = rng() % N;
    add_edge(u, v, gen_eprop(), g);
  }
}
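
For completeness, a minimal usage sketch, assuming an adjacency_list with vecS vertex storage (so that vertex descriptors are integers, as the loops above require) and no vertex or edge properties:

#include <random>

#include <boost/graph/adjacency_list.hpp>

int main() {
  typedef boost::adjacency_list<boost::vecS, boost::vecS,
                                boost::undirectedS> Graph;
  Graph g;
  std::mt19937 rng(12345);
  // This graph type has no vertex/edge properties, so the property
  // generators just return empty boost::no_property objects.
  generate_connected_graph(g, rng,
                           []() { return boost::no_property(); },
                           []() { return boost::no_property(); },
                           100,    // N vertices
                           400);   // M edges
  return 0;
}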

This is O(N+M). And if you have objections about the fact that the first edge-generating loop is not "random" (whatever that means to you, because it's not clear to me that you have a clear idea of what that means), then you have to understand that the random generation of the vertices, assuming a well-distributed random generator, makes those edges uniformly distributed.

In terms of topology, there is a bit of a problem because if you generate only those N-1 edges, you always get the same topology. But as long as M >> N, the additional random edges should create sufficiently diverse topologies (by basically "drowning" the initial single-branch spanning-tree topology).

Also note that if you want to create more initial diversity in the topology, you can do that by a number of simple modifications to the above algorithm. Essentially, what you need to do is turn that part of the algorithm into a random spanning-tree algorithm (similar to Boost.Graph's random_spanning_tree function which uses Wilson's method).

The fundamental property that you need to preserve in this particular loop that generates the first N-1 edges is that, as the iterations progress, every vertex before the current vertex iterator is connected to all the other vertices before that iterator. This is the invariant you maintain, and this is how you guarantee full connectivity at the end of the iteration.

For example, one simple modification to the above loop is this:

//..
  // Generate between N-1 and 2N edges to connect all vertices:
  VIter vi, vi_end;
  boost::tie(vi, vi_end) = vertices(g);
  while(vi < vi_end) {
    // If the current vertex is not yet connected
    // (assumes integer vertex descriptors, e.g. vecS storage):
    if( ( in_degree(*vi, g) == 0 ) &&
        ( *vi != 0 ) ) {
      // Connect it to some prior vertex:
      VDesc u = rng() % (*vi);
      add_edge(u, *vi, gen_eprop(), g);
    }
    // Add a random child among the later vertices
    // (note the modulus: there are vi_end - (vi+1) vertices after vi):
    if( vi + 1 < vi_end ) {
      VDesc v = (rng() % (vi_end - (vi + 1))) + *(vi + 1);
      add_edge(*vi, v, gen_eprop(), g);
    }
    ++vi;
  }
//..

This would generate at most 2N edges, which might be too many, but I hope you get the idea. These are the kinds of modifications you can make that keep the algorithm very cheap overall (still O(N+M)) while diversifying the topology beyond the random shuffling or generation of the vertices (which I think is sufficient when M >> N).

I mean that I randomly select an edge to add to the graph, ignoring whether it connects two components, then test whether it connects two components and reject the edge if it does not.

To make it clear, I must explain why this is bad. The big problem with this is not so much that it is probabilistic (and thus, not strictly guaranteed to finish, but almost certain to finish after some time), but the problem is that it's biased towards forming large connected components, to the exclusion of smaller ones. That's the real issue.

The thing is, every random selection of vertices is more likely to fall in the larger connected components ("larger": those that contain more vertices). This means that early iterations of this algorithm will tend to coalesce large connected components, making the smaller ones (those not picked / "unlucky" in the early iterations) far less likely to be picked later on. Now, imagine this worst case: you have N vertices and two connected components, one with N-1 vertices and the other with only 1 vertex. The chance of randomly (by a uniform random number generator) picking that vertex from the small component is 1/N, meaning that at this point in the algorithm you will have to run on the order of N iterations, on average, to complete the connection (and if you want some typical probabilistic bound, like a 99% chance of having completed by some time, that time would be pretty large).
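
To spell out that last step: each endpoint of a random edge hits the isolated vertex with probability 1/N, so a random edge (two endpoints) connects the two components with probability roughly

P(\text{edge hits the singleton}) \approx \frac{2}{N}, \qquad \mathbb{E}[\text{tries}] \approx \frac{N}{2} = \Theta(N).

Either way, you pay on the order of N rejected edges just to make that final connection.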

At the very least, if you are going to adopt a strategy like this one, you need to fight that bias. One solution might be to prioritize smaller connected components. This requires some auxiliary data structure to keep track of the connected components and their vertices, but you'd end up with an O(N log N) time-complexity and O(N) memory-complexity, which is better than a probabilistic O(N^2) time-complexity (with a really bad probabilistic completeness bound).
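
A minimal sketch of that idea, assuming components are tracked as explicit (non-empty) vertex lists (the names are illustrative): repeatedly take a random vertex of the smallest component and a random vertex of some other component, add that edge, and merge small into large. Merging small-into-large bounds the total copying at O(N log N); replacing the linear scan for the smallest component with a heap keyed on size would make the whole loop O(N log N) as well:

#include <algorithm>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Join d components into one, always connecting the smallest component
// to a random vertex of another component.
std::vector<std::pair<std::size_t, std::size_t>>
connect_components(std::vector<std::vector<std::size_t>> comps,
                   std::mt19937& rng) {
  std::vector<std::pair<std::size_t, std::size_t>> new_edges;
  while (comps.size() > 1) {
    // Move the smallest component to the back (a size-keyed heap
    // would avoid this linear scan).
    std::iter_swap(
        std::min_element(comps.begin(), comps.end(),
                         [](const std::vector<std::size_t>& a,
                            const std::vector<std::size_t>& b) {
                           return a.size() < b.size();
                         }),
        comps.end() - 1);
    std::vector<std::size_t> small = std::move(comps.back());
    comps.pop_back();

    // Random vertex of the small component...
    std::uniform_int_distribution<std::size_t> pick_v(0, small.size() - 1);
    std::size_t v = small[pick_v(rng)];

    // ...and a random vertex of a random remaining component.
    std::uniform_int_distribution<std::size_t> pick_c(0, comps.size() - 1);
    std::vector<std::size_t>& other = comps[pick_c(rng)];
    std::uniform_int_distribution<std::size_t> pick_w(0, other.size() - 1);
    std::size_t w = other[pick_w(rng)];

    new_edges.emplace_back(v, w);  // this edge joins two components
    other.insert(other.end(), small.begin(), small.end());  // merge
  }
  return new_edges;
}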

You didn't understand my algorithm at all. I'm working under the assumption that you are generating the vertices at random. If you are starting with a set of vertices that you cannot choose, then the problem is different and my algorithm does not apply directly (but you can fix that by creating a random-shuffle index map to create a randomly shuffled "view" of the vertices you have, which adds O(N) storage of integers, which is not too bad).

I must confess I don't understand the talk of "vertices at random" or the point of "random-shuffling" them.

Here is the algorithm I was describing (in the style of the C++ Boost.Graph library, which is the only serious graph-theory library in any language, IMHO): ...

The random connected graphs your algorithm generates may be adequate for some applications, but not all, including mine. Any graph generated by your algorithm will always admit a Hamiltonian path, which is a disaster for me because the graph will then always admit either a perfect matching (graphs with 2n vertices) or a matching with only one unmatched vertex (graphs with 2n+1 vertices).
Better is to simply generate a tree with n vertices at random, which can be done in linear time. I am wary of the random tree approach though, because I do not know how random the final product will be, and I fear a significant bias.

To make it clear, I must explain why this is bad. The big problem with this is not so much that it is probabilistic (and thus, not strictly guaranteed to finish, but almost certain to finish after some time), but the problem is that it's biased towards forming large connected components, to the exclusion of smaller ones. That's the real issue.

But this is what I want to happen. The reason is that (in my strong opinion) if you sample the set of graphs with a given number of vertices and edges, where the number of edges is at least as large as the number of vertices, then you will almost always get a small number of large components (likely one) and some number of very small components (vertices of degree 0 or 1, mostly). So in regard to this issue my algorithm does exactly what I want.
I will write some code and verify that this is what happens for randomly generated graphs.
There remains the issue of efficiency. I don't like the fact that termination is not guaranteed, because of my algorithm's probabilistic nature, as you pointed out. I also don't know its expected performance (I suspect something like O(n log n), which is a bit slow but not really bad).

Ralph

I must confess I don't understand the talk of "vertices at random" or the point of "random-shuffling" them.

Say you have the following vertices:

1 2 3 4 5

Then, let's say you generate a "random edge" (by picking two vertices at random) and it turns out to be 5 -> 3.

My point with random shuffling is that if you do a random shuffle (such as using the std::random_shuffle algorithm in C++, and I'm sure any other respectable language has an equivalent in its standard library), you might get something like this:

5 3 1 4 2

And in this case, my method of just connecting consecutive vertices (in the shuffled order) gives you, as a first edge, the edge 5 -> 3. Is this less random? No. Is this more efficient? Yes, because you do the shuffle once, and then you can get N-1 random edges that are just as random as those you would generate with the first method. The point is that all the necessary "randomness" is created by the shuffle, and the rest can be done deterministically (and thus with guaranteed completion (and efficiency)).
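
A minimal sketch of the shuffle-then-chain idea (using std::shuffle, the modern replacement for std::random_shuffle):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

// Shuffle the vertex indices once, then connect consecutive vertices in
// the shuffled order: exactly n-1 edges, guaranteed connected, O(n) time.
std::vector<std::pair<std::size_t, std::size_t>>
random_chain_edges(std::size_t n, std::mt19937& rng) {
  std::vector<std::size_t> order(n);
  std::iota(order.begin(), order.end(), 0);  // 0, 1, ..., n-1
  std::shuffle(order.begin(), order.end(), rng);

  std::vector<std::pair<std::size_t, std::size_t>> edges;
  for (std::size_t i = 0; i + 1 < n; ++i)
    edges.emplace_back(order[i], order[i + 1]);
  return edges;
}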

Of course, like I said before, that particular way of creating edges will create one long chain of vertices (a single-branch tree), which I understand is not what you want. But the main point here is that this idea of doing the random shuffling (or similar) as the initial randomizing step is what allows you to generate your edges in a systematic and deterministic manner, which gives an implicit guarantee that you end up with a fully connected graph at completion of the algorithm, and is more efficient.

As far as I know, this is the only way that you will avoid the probabilistic nature of your algorithm while retaining the "randomness".

The point is that the shuffle gives you the random distribution of edges over the vertices, and all you have left to worry about is creating a random topology (e.g., guaranteeing a uniform distribution over the set of unique graph isomorphisms, or whatever other precise criterion you have (as you seem to have one, except you don't seem to want to say what it is)).

I also don't know its expected performance (I suspect something like O(n log n), which is a bit slow but not really bad).

Where would the "log(n)" come from? There are no dynamic / divide-and-conquer parts to your algorithm at all. Your algorithm definitely has an average complexity of O(N^2), and its probabilistic bound is probably something like "1-g chance of success after ln(1/g)*N^2 iterations", or something like that, if memory serves me right. For example, that would be something like a 99% chance of completing within 5*N^2 iterations.

I will write some code and verify that for randomly generated graphs this is what happens.

I would recommend that you set up a test for the number of collisions you get between generated graphs (by collisions, I mean getting isomorphic graphs). This would be an expensive test suite, so try it with small graphs, but it would definitely be a good way to formalize what you want in terms of "randomness" and to test the impact that certain efficiency improvements might have on the results.

You might also want to check out this page, it seems to have a pretty comprehensive description of many different graph generation algorithms depending on the desired probability distributions.

My point with random shuffling is ...
Got it.

As far as I know, this is the only way that you will avoid the probabilistic nature of your algorithm while retaining the "randomness".

Why is this any better than generating an n-vertex random tree, which is O(n)? And I already have code to do this.

(e.g., guaranteeing a uniform distribution over the set of unique graph isomorphisms, or whatever other precise criteria you have (as you seem to have one, except you don't seem to want to say what it is)).

I have some new algorithms for finding maximum matchings and I want to compare them to the standard algorithms. The standard algorithms have a stated complexity of O(E sqrt(V)), but this would be better stated as O(E sqrt(V')) where V' is the number of vertices of the largest component, and you can do even better if you first partition the graph into its components and apply the matching algorithm to each component separately. This applies to problems other than matching as well, e.g., maximum flow.

Thus, in order to make a fair comparison between algorithms, it is better to start with a connected graph. This is why I need to generate connected graphs. I have other requirements, but they are not related to this discussion. Basically, I want to choose from a uniform distribution over the set of connected graphs of a given size and other constraints. I see no way to do this, so I just generate random graphs, somehow ensure that they are connected, and try to stay close to the uniform distribution over connected graphs.

Where would the "log(n)" come from?

The expected number of edges to generate before finding the edge that connects the last two components is O(n). When there are three components, the expected number of edges to generate is O(n/2) (pardon the lax notation). For k components it is O(n/(k-1)).
So the total expected cost is O(n * (1 + 1/2 + 1/3 + ... + 1/(n-1))) = O(n log n).
This analysis is not necessarily tight, but I don't know how to do a tighter analysis.
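
Written out, the claim is (a sketch, assuming the biased regime where a uniformly random edge joins two of the k remaining components with probability on the order of (k-1)/n):

\mathbb{E}[\text{edges generated}] \;=\; \sum_{k=2}^{n} O\!\left(\frac{n}{k-1}\right) \;=\; O\!\left(n \sum_{j=1}^{n-1} \frac{1}{j}\right) \;=\; O(n H_{n-1}) \;=\; O(n \log n),

where H_{n-1} is the (n-1)-th harmonic number.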

You might also want to check out this page, it seems to have a pretty comprehensive description of many different graph generation algorithms depending on the desired probability distributions.

Will do as soon as I post this.

Ralph
